Transcriptome Assembly
Modules included in this section
trinity
- Authors
Menachem Sklarz
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A class that defines a module for RNA_seq assembly using the Trinity assembler.
Attention
This module was tested on release 2.5.x. It should also work with 2.4.x.
For old versions of Trinity, you might need to use the trinity_old module.
The main difference between the modules is that trinity creates an output directory with the word trinity in its name, as required by newer releases of Trinity.
In order to run on the cluster, you need to install HpcGridRunner.
Requires
fastq files in at least one of the following slots:
sample_data[<sample>]["fastq.F"]
sample_data[<sample>]["fastq.R"]
sample_data[<sample>]["fastq.S"]
bam file for Genome Guided assembly in:
sample_data["bam"]
sample_data[<sample>]["bam"]
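The slot requirements above can be pictured as lookups in a nested dictionary. The following is a minimal sketch, assuming sample_data behaves like a plain Python dict keyed by sample name. The helper names and the example paths are hypothetical and are not part of the module.

```python
# Slot names taken from the "Requires" list above.
FASTQ_SLOTS = ("fastq.F", "fastq.R", "fastq.S")


def fastq_slots_present(sample_data, sample):
    """Return the fastq slots that are filled for one sample, in a fixed order."""
    slots = sample_data.get(sample, {})
    return [slot for slot in FASTQ_SLOTS if slot in slots]


def check_trinity_input(sample_data, sample):
    """Raise ValueError if the sample has none of the required fastq slots."""
    found = fastq_slots_present(sample_data, sample)
    if not found:
        raise ValueError(
            f"Sample {sample!r} has no fastq file in any of {FASTQ_SLOTS}"
        )
    return found


# Hypothetical sample_data content, for illustration only.
example = {
    "Sample1": {
        "fastq.F": "/hypothetical/Sample1_R1.fq",
        "fastq.R": "/hypothetical/Sample1_R2.fq",
    },
    "Sample2": {},
}
```

Calling check_trinity_input(example, "Sample1") returns the two filled slots, while the same call for "Sample2" raises an error because no fastq slot is set.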
Output:
Puts fasta output files in the following slots:
for sample-wise assembly:
sample_data[<sample>]["fasta.nucl"]
sample_data[<sample>]["Trinity.contigs"]
for project-wise assembly:
sample_data["fasta.nucl"]
sample_data["Trinity.contigs"]
Parameters that can be set
Parameter | Values | Comments |
---|---|---|
scope | sample or project | Set if project-wide fasta slot should be used |
skip_gene_to_trans_map | | Set to skip construction of the transcript map. You can use a dedicated module, Trinity_gene_to_trans_map, instead |
get_Trinity_gene_to_trans_map | | Path to get_Trinity_gene_to_trans_map.pl. If not passed, the module will try to guess it from the Trinity path |
TrinityStats | | Block with ‘path:’ set to the TrinityStats.pl executable |
genome_guided | | Use if you have a project-level, coordinate-sorted BAM file with reads mapped to a reference genome |
Group_by | Name of the column in the grouping file to use for grouping | Only works in project scope: will create a sample file for Trinity |
Lines for parameter file
trinity1:
    module: trinity
    base: trin_tags1
    script_path: {Vars.paths.Trinity}
    qsub_params:
        node: sge213
        -pe: shared 20
    redirects:
        --grid_exec: "{Vars.paths.hpc_cmds_GridRunner} --grid_conf {Vars.paths.SGE_Trinity_conf} -c"
        --grid_node_CPU: 40
        --grid_node_max_memory: 80G
        --max_memory: 80G
        --seqType: fq
        --min_kmer_cov: 2
        --full_cleanup:
    TrinityStats:
        path: {Vars.paths.TrinityStats}
References
Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q. and Chen, Z., 2011. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature biotechnology, 29(7), p.644.
Trinity_gene_to_trans_map
- Authors
Menachem Sklarz
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A class that defines a module for creating a gene vs. transcript map for a Trinity based assembly.
Requires
fasta files in at least one of the following slots:
sample_data[<sample>]["fasta.nucl"] (if scope = sample)
sample_data["project_data"]["fasta.nucl"] (if scope = project)
Output:
Puts the gene-to-transcript map in:
sample_data[<sample>]["gene_trans_map"] (if scope = sample)
sample_data["project_data"]["gene_trans_map"] (if scope = project)
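To illustrate what a gene-to-transcript map contains, here is a minimal Python sketch that derives one from FASTA headers. It assumes the Trinity 2.x naming scheme, in which a transcript such as TRINITY_DN1000_c115_g5_i1 belongs to the gene TRINITY_DN1000_c115_g5. It is not the code of get_Trinity_gene_to_trans_map.pl, and the function name is hypothetical.

```python
import re

# Assumption: transcript IDs end in _g<N>_i<M>; the part before _i<M> is the gene ID.
_ISOFORM_ID = re.compile(r"^(?P<gene>.+_g\d+)_i\d+$")


def gene_trans_pairs(fasta_lines):
    """Return (gene_id, transcript_id) pairs for every FASTA header line."""
    pairs = []
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        fields = line[1:].split()
        if not fields:
            continue  # empty header, nothing to map
        transcript = fields[0]
        match = _ISOFORM_ID.match(transcript)
        # Fall back to the transcript ID when the name does not fit the pattern.
        gene = match.group("gene") if match else transcript
        pairs.append((gene, transcript))
    return pairs


def format_map(pairs):
    """Render the pairs as tab-separated 'gene<TAB>transcript' lines."""
    return "".join(f"{gene}\t{transcript}\n" for gene, transcript in pairs)
```

The resulting two-column text is the kind of file that is stored in the gene_trans_map slot.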
Parameters that can be set
Parameter | Values | Comments |
---|---|---|
scope | sample or project | Use sample or project scope assembly. |
Lines for parameter file
Gene_Trans_Map:
    module: Trinity_gene_to_trans_map
    base: trinity
    script_path: {Vars.paths.get_Trinity_gene_to_trans_map.pl}
References
Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q. and Chen, Z., 2011. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature biotechnology, 29(7), p.644.
trinity_mapping
- Authors
Menachem Sklarz
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A class that defines a module for running align_and_estimate_abundance.pl
on a Trinity assembly and the raw reads.
Tested on versions 2.4.0 and 2.5.0 of Trinity.
See the align_and_estimate_abundance.pl script documentation.
Requires
fastq files in at least one of the following slots:
sample_data[<sample>]["fastq.F"]
sample_data[<sample>]["fastq.R"]
sample_data[<sample>]["fastq.S"]
A Trinity assembly in one of the following slots (depending on scope):
sample_data[<sample>]["fasta.nucl"]
sample_data["fasta.nucl"]
Output:
Puts output files in the following slots:
sample_data[<sample>]["bam"]
sample_data[<sample>]["unsorted_bam"] (if --coordsort_bam is passed in redirects)
sample_data[<sample>]["isoforms.results"]
sample_data[<sample>]["genes.results"]
Parameters that can be set
Parameter | Values | Comments |
---|---|---|
scope | sample or project | Set if project-wide fasta slot should be used |
redirects: --gene_trans_map | path or empty | If empty, use internal gene_trans_map. If path, use path as gene_trans_map for all samples. If not passed, performs analysis on isoform level only |
redirects: --trinity_mode | | If set, will create a gene_trans_map for each sample and store it as sample gene_trans_map |
Lines for parameter file
trin_map1:
    module: trinity_mapping
    base: trinity1
    script_path: {Vars.paths.align_and_estimate_abundance}
    redirects:
        --est_method: RSEM
        --aln_method: bowtie
        --trinity_mode:
        --seqType: fq
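As a rough illustration of how a parameter block like the one above could translate into a command line, the sketch below assembles an argument list for align_and_estimate_abundance.pl from a sample's fastq slots and a redirects dictionary. It is written under stated assumptions and is not the module's actual implementation: the function name, the paths and the rule of preferring paired reads over single reads are hypothetical, and a redirect with no value (an empty YAML entry) is represented as None.

```python
def build_mapping_command(script_path, transcripts, sample_slots, redirects, output_dir):
    """Return the command as a list of arguments for one sample."""
    cmd = [script_path, "--transcripts", transcripts, "--output_dir", output_dir]

    # Assumption: paired-end reads take precedence over single-end reads.
    if "fastq.F" in sample_slots and "fastq.R" in sample_slots:
        cmd += ["--left", sample_slots["fastq.F"], "--right", sample_slots["fastq.R"]]
    elif "fastq.S" in sample_slots:
        cmd += ["--single", sample_slots["fastq.S"]]
    else:
        raise ValueError("no usable fastq slot found for this sample")

    # Redirects are passed through as-is; None marks a flag without a value.
    for flag, value in redirects.items():
        cmd.append(flag)
        if value is not None:
            cmd.append(str(value))
    return cmd


# Values mirroring the redirects section of the parameter block above.
redirects = {
    "--est_method": "RSEM",
    "--aln_method": "bowtie",
    "--trinity_mode": None,
    "--seqType": "fq",
}
```

Building the command as a list rather than a single string avoids quoting problems when paths contain spaces.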
References
Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q. and Chen, Z., 2011. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature biotechnology, 29(7), p.644.
trinity_statistics
- Authors
Menachem Sklarz
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A class that defines a module for running abundance_estimates_to_matrix.pl on gene or isoform count tables produced by align_and_estimate_abundance.pl.
See the abundance_estimates_to_matrix.pl script documentation.
This conversion makes sense at the project level - combining all sample matrices into a single, normalized, comparison table. However, for completeness, we included a sample scope option for running the script in each sample separately.
Note
scope is not defined for this module. It only makes sense to run abundance_estimates_to_matrix when comparing many samples against a single assembly.
Requires
Either genes.results or isoforms.results files in the following slots:
sample_data[<sample>]["genes.results"]
sample_data[<sample>]["isoforms.results"]
Output:
Creates the following files in the following slots:
<project>.counts.matrix in self.sample_data["project_data"]["counts.matrix"]
<project>.not_cross_norm.fpkm.tmp in self.sample_data["project_data"]["not_cross_norm.fpkm.tmp"]
<project>.not_cross_norm.fpkm.tmp.TMM_info.txt in self.sample_data["project_data"]["not_cross_norm.fpkm.tmp.TMM_info.txt"]
<project>.TMM.fpkm.matrix in self.sample_data["project_data"]["TMM.fpkm.matrix"]
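The idea behind the counts matrix can be sketched in a few lines of Python: one row per transcript, one column per sample. This is only an illustration of the data layout, assuming RSEM-style tab-separated results with transcript_id and expected_count columns. It does not reproduce abundance_estimates_to_matrix.pl, performs no TMM normalization, and the function name is hypothetical.

```python
import csv


def counts_matrix(results_by_sample, id_column="transcript_id", value_column="expected_count"):
    """Combine per-sample result tables into (header, rows).

    results_by_sample maps a sample name to an iterable of text lines
    (for example an open file) holding a tab-separated table with a header row.
    """
    samples = sorted(results_by_sample)
    matrix = {}
    for sample in samples:
        reader = csv.DictReader(results_by_sample[sample], delimiter="\t")
        for row in reader:
            matrix.setdefault(row[id_column], {})[sample] = float(row[value_column])

    header = [""] + samples
    # Features missing from a sample are reported as 0.0 in that column.
    rows = [
        [feature] + [matrix[feature].get(sample, 0.0) for sample in samples]
        for feature in sorted(matrix)
    ]
    return header, rows
```

For gene-level tables, the same function can be called with id_column="gene_id" on the genes.results files.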
Parameters that can be set
Parameter | Values | Comments |
---|---|---|
use_genes | | Use ‘genes.results’ matrix. If not passed, use ‘isoforms.results’ |
redirects: --gene_trans_map | path or ‘none’ | If path, use path as gene_trans_map for all samples. If ‘none’, does not produce gene level estimates. In order to use an internal gene_trans_map, do not pass this parameter! |
Lines for parameter file
trin_map_stats:
    module: trinity_statistics
    base: trin_map1
    script_path: /path/to/abundance_estimates_to_matrix.pl
    use_genes:
    redirects:
        --est_method: RSEM
References
Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q. and Chen, Z., 2011. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature biotechnology, 29(7), p.644.
RSEM
- Authors
Liron Levin
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
Short Description
A module for running RSEM
Requires
- fastq file in
self.sample_data[sample]["fastq.F"]
self.sample_data[sample]["fastq.R"]
self.sample_data[sample]["fastq.S"]
- or bam file in
self.sample_data[sample]["bam"]
Output
- puts output bam files (if the input is fastq) in:
self.sample_data[sample]["bam"]
- puts the location of RSEM results in:
self.sample_data[sample]["RSEM"]
self.sample_data[sample]["genes.results"]
self.sample_data[sample]["isoforms.results"]
Parameters that can be set
Parameter | Values | Comments |
---|---|---|
mode | transcriptome/genome | Is the reference a genome or a transcriptome? |
gff3 | None | Use if the mode is genome and the annotation file is in gff3 format |
Lines for parameter file
Step_Name:                                    # Name of this step
    module: RSEM                              # Name of the module used
    base:                                     # Name of the step [or list of names] to run after [must be after a bam file generator step or merge with fastq files]
    script_path:                              # Command for running the RSEM script
    qsub_params:
        -pe:                                  # Number of CPUs to reserve for this analysis
    mode:                                     # transcriptome or genome
    export_transcriptome:                     # In genome mode, set the extracted transcriptome as the new project level fasta.nucl and extract the transcript-to-gene-map file as project level gene_trans_map
    annotation:                               # For genome mode: the location of the GTF file [the default]; for GFF3 use the gff3 flag. For transcriptome mode: transcript-to-gene-map file.
                                              # If annotation is set to Trinity, the transcript-to-gene-map file will be generated using the from_Trinity_to_gene_map script
                                              # If not set, will use only the reference file as unrelated transcripts
    from_Trinity_to_gene_map_script_path:     # If the mode is transcriptome and the reference was assembled using Trinity, it is possible to generate the transcript-to-gene-map file automatically using this script
                                              # If annotation is set to Trinity and this line is empty or missing, it will try using the module's associated script
    gff3:                                     # Use if the mode is genome and the annotation file is in gff3 format
    mapper:                                   # bowtie/bowtie2/star
    mapper_path:                              # Location of mapper script
    rsem_prepare_reference_script_path:       # Location of preparing reference script
    plot_stat:                                # Generate statistical plots
    plot_stat_script_path:                    # Location of statistical plot generating script
    reference:                                # The reference genome/transcriptome location [FASTA file]
    rsem_generate_data_matrix_script_path:    # Location of the final matrix generating script
                                              # If this line is empty or missing, it will try using the module's associated script
    redirects:
        --append-names:                       # RSEM will append gene_name/transcript_name to the result files
        --estimate-rspd:                      # Enables RSEM to learn from the data how the reads are distributed across a transcript
        -p:                                   # Number of CPUs to use in this analysis
        --bam:                                # Will use bam files and not fastq
        --no-bam-output:
        --output-genome-bam:                  # Alignments in genomic coordinates (only if mode is genome)
References
Li, Bo, and Colin N. Dewey. “RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome.” BMC bioinformatics 12.1 (2011): 323.
quast
- Authors
Menachem Sklarz
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A module for running quast on fasta assemblies:
QUAST is executed on the fasta file along the following lines:
If ‘scope’ is specified, the appropriate fasta will be used. An error will occur if the fasta does not exist.
If ‘scope’ is not specified, if a project-wide fasta exists, it will be used. Otherwise, sample-wise fasta files will be used. If none exist, an error will occur.
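The selection rules above can be sketched as a small Python function. This is a minimal illustration, assuming sample_data behaves like a dict with the project fasta in sample_data["fasta.nucl"] and sample fastas in sample_data[<sample>]["fasta.nucl"]. The function name, the return format and the choice of LookupError are hypothetical, not the module's own behaviour.

```python
def select_fasta(sample_data, scope=None, samples=()):
    """Return a dict of label -> fasta path, following the rules described above."""
    project_fasta = sample_data.get("fasta.nucl")
    sample_fastas = {
        sample: sample_data[sample]["fasta.nucl"]
        for sample in samples
        if "fasta.nucl" in sample_data.get(sample, {})
    }

    if scope == "project":
        if project_fasta is None:
            raise LookupError("scope is project but no project fasta exists")
        return {"project": project_fasta}

    if scope == "sample":
        missing = [sample for sample in samples if sample not in sample_fastas]
        if missing or not sample_fastas:
            raise LookupError(f"scope is sample but fasta is missing for: {missing}")
        return sample_fastas

    if scope is None:
        # No scope given: prefer the project fasta, then fall back to sample fastas.
        if project_fasta is not None:
            return {"project": project_fasta}
        if sample_fastas:
            return sample_fastas
        raise LookupError("no project or sample fasta found")

    raise ValueError(f"unknown scope: {scope!r}")
```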
Note
With compare_mode, you tell the module to run quast on multiple assemblies. This is done in one of three ways:
If scope is sample and a single base step is defined, will compare between the samples.
If scope is sample and there is more than one base step defined, will compare between the assemblies found in the base steps for each sample separately.
If scope is project, will compare between the assemblies found in the base steps at the project level.
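The three cases in the note can be summarised as a small decision function. The sketch below only restates the note in Python. The function name and the returned labels are hypothetical and do not correspond to anything in the module.

```python
def comparison_mode(scope, base_steps):
    """Name which of the three comparisons applies when compare_mode is set."""
    if not base_steps:
        raise ValueError("at least one base step is required")
    if scope == "project":
        return "between base-step assemblies, project level"
    if scope == "sample":
        if len(base_steps) == 1:
            return "between samples"
        return "between base-step assemblies, per sample"
    raise ValueError(f"unknown scope: {scope!r}")
```

With the step names used in the examples of this section, comparison_mode("sample", ["spades1"]) selects the comparison between samples, and comparison_mode("project", ["spades1", "megahit1"]) selects the project-level comparison between assemblies.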
Requires
fasta files in one of the following slots:
sample_data["fasta.nucl"]
sample_data[<sample>]["fasta.nucl"]
Output
- Puts output directory in one of:
self.sample_data["project_data"]["quast"]
self.sample_data[<sample>]["quast"]
Parameters that can be set
Parameter | Values | Comments |
---|---|---|
scope | project or sample | Indicates whether to use a project or sample contigs file. |
compare_mode | | If ‘scope’ is ‘sample’, specifies whether to analyse each sample separately or to create a single comparison report for all samples. |
Lines for parameter file
A quast report for each sample separately:
quast1:
    module: quast
    base: spades1
    script_path: /path/to/quast.py
    scope: sample
    redirects:
        --fast:
A quast report comparing the sample assemblies:
quast1:
    module: quast
    base: spades1
    script_path: /path/to/quast.py
    compare_mode:
    scope: sample
    redirects:
        --fast:
A quast report comparing the project assemblies from different stages of the analysis:
quast1:
    module: quast
    base:
        - spades1
        - megahit1
    script_path: /path/to/quast.py
    compare_mode:
    scope: project
    redirects:
        --fast:
References
Gurevich, A., Saveliev, V., Vyahhi, N. and Tesler, G., 2013. QUAST: quality assessment tool for genome assemblies. Bioinformatics, 29(8), pp.1072-1075.
htseq_count
- Authors
Menachem Sklarz
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
A module for running htseq-count:
See htseq-count documentation.
Requires
Alignment files (bam or sam) in one of the following slots:
sample_data[<sample>]["bam"]
sample_data[<sample>]["sam"]
Output
- Puts the output file in:
self.sample_data[<sample>]["HTSeq.counts"]
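For readers who want to post-process the counts file, here is a minimal Python sketch of a parser. It assumes the usual htseq-count text output: tab-separated lines with the feature ID first and the count last, followed by summary lines whose names start with a double underscore (such as __no_feature and __ambiguous). The function name is hypothetical.

```python
def parse_htseq_counts(lines):
    """Split htseq-count output into (feature_counts, summary_counts)."""
    counts, summary = {}, {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        fields = line.split("\t")
        # The count is taken from the last column, the feature ID from the first.
        feature, value = fields[0], int(fields[-1])
        target = summary if feature.startswith("__") else counts
        target[feature] = value
    return counts, summary
```

Keeping the summary lines separate makes it easy to report how many reads were not assigned to any feature.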
Parameters that can be set
Parameter | Values | Comments |
---|---|---|
gff | path to bowtie1 index | If not given, will look for a project bowtie1 index and then for a sample bowtie1 index |
-f or --format | sam or bam | In redirects. Tells htseq-count which file to use. If not specified, will use whichever file exists. |
Lines for parameter file
For external index:
htseq_c1:
    module: htseq_count
    base: samtools_STAR1
    script_path: /storage16/app/bioinfo/python_packages/bin/htseq-count
    gtf: /fastspace/bioinfo_databases/STAR_GRCh38_Gencode21/gencode.v21.annotation.gtf
    redirects:
        --format: bam
        -s: 'no'
        -m: intersection-nonempty
References
Anders, S., Pyl, P.T. and Huber, W., 2015. HTSeq—a Python framework to work with high-throughput sequencing data. Bioinformatics, 31(2), pp.166-169.