Transcriptome Assembly

Modules included in this section

trinity ^*
add_trinity_tags ^*
Trinity_gene_to_trans_map
trinity_mapping
trinity_statistics
RSEM
quast ^*
htseq_count

`trinity` ^*

Authors: Menachem Sklarz
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for RNA_seq assembly using the Trinity assembler.

Attention

This module was tested on release 2.5.x. It should also work with 2.4.x

For old versions of Trinity, you might need to use trinity_old module.

The main difference between the modules is that trinity creates an output directory with the word trinity in it as required by the newer release of Trinity.

In order to run on the cluster, you need to install HpcGridRunner.

Requires

fastq files in at least one of the following slots:

sample_data[<sample>]["fastq.F"]

sample_data[<sample>]["fastq.R"]

sample_data[<sample>]["fastq.S"]

bam file for Genome Guided assembly in:
- sample_data["bam"]
- sample_data[<sample>]["bam"]

Output:

puts fasta output files in the following slots:

for sample-wise assembly:

sample_data[<sample>]["fasta.nucl"]

sample_data[<sample>]["Trinity.contigs"]

for project-wise assembly:

sample_data["fasta.nucl"]

sample_data["Trinity.contigs"]

Parameters that can be set

Parameter	Values	Comments
scope	sample\|project	Set if project-wide fasta slot should be used
skip_gene_to_trans_map		Set to skip construction of the transcript map. You can use a dedicated module, `Trinity_gene_to_trans_map`. Both put the map in the same slot (gene_trans_map)
get_Trinity_gene_to_trans_map		Path to get_Trinity_gene_to_trans_map.pl. If not passed, will try guessing from Trinity path
TrinityStats		block with ‘path:’ set to TrinityStats.pl executable
genome_guided		Use if you have a project level BAM file with reads mapped to a reference genome and it is coordinate sorted
Group_by	Name of the Column in the grouping file to use for grouping	Only works in project scope: Will create a sample file for Trinity

Lines for parameter file

trinity1:
    module:                 trinity
    base:                   trin_tags1
    script_path:            {Vars.paths.Trinity}
    qsub_params:
        node:               sge213
        -pe:                shared 20
    redirects:
        --grid_exec:        "{Vars.paths.hpc_cmds_GridRunner} --grid_conf {Vars.paths.SGE_Trinity_conf} -c" 
        --grid_node_CPU:    40 
        --grid_node_max_memory: 80G 
        --max_memory:        80G 
        --seqType:          fq
        --min_kmer_cov:     2
        --full_cleanup:
    TrinityStats:
        path:           {Vars.paths.TrinityStats}

References

Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q. and Chen, Z., 2011. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature biotechnology, 29(7), p.644.

`add_trinity_tags` ^*

Authors: Menachem Sklarz
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for adding the tags required by Trinity to the ends of the read names. See the Strand specific assembly section of the Trinity manual.

The module uses awk, so you don’t need to pass a script_path. Since you must pass a script_path, just leave it blank.

Attention

The awk command is set to remove all text in title following any whitespace. Make sure the said information is not important. If it is, you can do the mapping step using the base of add_trinity_tags.

Requires

fastq files in at least one of the following slots:

sample_data[<sample>]["fastq.F"]

sample_data[<sample>]["fastq.R"]

sample_data[<sample>]["fastq.S"]

Output:

puts fastq output files (with added tags) in the following slots:

sample_data[<sample>]["fastq.F"]

sample_data[<sample>]["fastq.R"]

sample_data[<sample>]["fastq.S"]

Lines for parameter file

trintags:
    module:      add_trinity_tags
    base:        trim1
    script_path: NOT_USED

`Trinity_gene_to_trans_map`

Authors: Menachem Sklarz
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for creating a gene vs. transcript map for a Trinity based assembly.

Requires

fasta files in at least one of the following slots:

sample_data[<sample>]["fasta.nucl"] (if scope = sample)

sample_data["project_data"]["fasta.nucl"] (if scope = project)

Output:

puts gene to trans map in:

sample_data[<sample>]["gene_trans_map"] (if scope = sample)

sample_data["project_data"]["gene_trans_map"] (if scope = project)

Parameters that can be set

Parameter	Values	Comments
scope	sample\|project	Use sample or project scope assembly.

Lines for parameter file

Gene_Trans_Map:
    module:     Trinity_gene_to_trans_map
    base:       trinity
    script_path: {Vars.paths.get_Trinity_gene_to_trans_map.pl}

References

Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q. and Chen, Z., 2011. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature biotechnology, 29(7), p.644.

`trinity_mapping`

Authors: Menachem Sklarz
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for running align_and_estimate_abundance.pl on a Trinity assembly and the raw reads.

Tested on versions 2.4.0 and 2.5.0 of Trinity.

See the align_and_estimate_abundance.pl script documentation.

Requires

fastq files in at least one of the following slots:

sample_data[<sample>]["fastq.F"]

sample_data[<sample>]["fastq.R"]

sample_data[<sample>]["fastq.S"]

A Trinity assembly in one of (depending on scope)

sample_data[<sample>]["fasta.nucl"]

sample_data["fasta.nucl"]

Output:

Puts output files in the following slots:
- sample_data[<sample>]["bam"]
- sample_data[<sample>]["unsorted_bam"] (If --coordsort_bam is passed in redirects)
- sample_data[<sample>]["isoforms.results"]
- sample_data[<sample>]["genes.results"]

Parameters that can be set

Parameter	Values	Comments
scope	sample\|project	Set if project-wide fasta slot should be used
redirects: –gene_trans_map	path or empty	If empty, use internal gene_trans_map. If path, use path as gene_trans_map for all samples. If not passed, performs analysis on isoform level only
redirects: –trinity_mode		If set, will create a gene_trans_map for each sample and store it as sample gene_trans_map

Lines for parameter file

trin_map1:
    module:               trinity_mapping
    base:                 trinity1
    script_path:          {Vars.paths.align_and_estimate_abundance}
    redirects:
        --est_method:     RSEM
        --aln_method:     bowtie
        --trinity_mode:
        --seqType:        fq

References

Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q. and Chen, Z., 2011. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature biotechnology, 29(7), p.644.

`trinity_statistics`

Authors: Menachem Sklarz
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for running abundance_estimates_to_matrix.pl on genes or isoforms counts tables produced by align_and_estimate_abundance.pl

See the script documentation here.

This conversion makes sense at the project level - combining all sample matrices into a single, normalized, comparison table. However, for completeness, we included a sample scope option for running the script in each sample separately.

Note

scope is not defined for this module. It only makes sense to run abundance_estimates_to_matrix when comparing many samples against a single assembly

Requires

Either genes.results or isoforms.results files in the following slots:
- sample_data[<sample>]["genes.results"]
- sample_data[<sample>]["isoforms.results"]

Output:

Creates the following files in the following slots:
- <project>.counts.matrix in self.sample_data["project_data"]["counts.matrix"]
- <project>.not_cross_norm.fpkm.tmp in self.sample_data["project_data"]["not_cross_norm.fpkm.tmp"]
- <project>.not_cross_norm.fpkm.tmp.TMM_info.txt in self.sample_data["project_data"]["not_cross_norm.fpkm.tmp.TMM_info.txt"]
- <project>.TMM.fpkm.matrix in self.sample_data["project_data"]["TMM.fpkm.matrix"]

Parameters that can be set

Parameter	Values	Comments
use_genes		Use ‘genes.results’ matrix. If not passed, use ‘isoforms.results’
redirects: –gene_trans_map	path or ‘none’	If path, use path as gene_trans_map for all samples. If ‘none’, does not produce gene level estimates. In order to use an internal gene_trans_map, do not pass this parameter!

Lines for parameter file

trin_map_stats:
    module:             trinity_statistics
    base:               trin_map1
    script_path:        /path/to/abundance_estimates_to_matrix.pl
    use_genes:       
    redirects:
        --est_method:   RSEM

References

Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q. and Chen, Z., 2011. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature biotechnology, 29(7), p.644.

`RSEM`

Authors: Liron Levin
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

Short Description

A module for running RSEM

Requires

fastq file in
self.sample_data[sample]["fastq.F"] self.sample_data[sample]["fastq.R"] self.sample_data[sample]["fastq.S"]

or bam file in
self.sample_data[sample]["bam"]

Output

puts output bam files (if the input is fastq) in:
self.sample_data[sample]["bam"]

puts the location of RSEM results in:
self.sample_data[sample]["RSEM"] self.sample_data[sample]["genes.results"] self.sample_data[sample]["isoforms.results"]

Parameters that can be set

Parameter	Values	Comments
mode	transcriptome/genome	Is the reference is a genome or a transcriptome?
gff3	None	Use if the mode is genome and the annotation file is in gff3 format

Comments

This module was tested on:
RSEM v1.2.25 bowtie2 v2.2.6

Lines for parameter file

Step_Name:                                                   # Name of this step
    module: RSEM                                             # Name of the module used
    base:                                                    # Name of the step [or list of names] to run after [must be after a bam file generator step or merge with fastq files]
    script_path:                                             # Command for running the RSEM script 
    qsub_params:
        -pe:                                                 # Number of CPUs to reserve for this analysis
    mode:                                                    # transcriptome or genome
    export_transcriptome:                                    # In genome mode set the extracted transcriptome as the new project level fasta.nucl and extract the ranscript-to-gene-map file as project level gene_trans_map
    annotation:                                              # For Genome mode: the location of GTF file [the default] , for GFF3 use the gff3 flag. For Transcriptome mode: transcript-to-gene-map file.
                                                             # If annotation is set to Trinity the transcript-to-gene-map file will be generated using the from_Trinity_to_gene_map script
                                                             # If not set will use only the reference file as unrelated transcripts
    from_Trinity_to_gene_map_script_path:                    # If the mode is transcriptome and the reference was assembled using Trinity it is possible to generate the transcript-to-gene-map file automatically using this script
                                                             # If annotation is set to Trinity and this line is empty or missing it will try using the module's associated script
    gff3:                                                    # Use if the mode is genome and the annotation file is in gff3 format
    mapper:                                                  # bowtie/bowtie2/star 
    mapper_path:                                             # Location of mapper script
    rsem_prepare_reference_script_path:                      # Location of preparing reference script
    plot_stat:                                               # Generate statistical plots
    plot_stat_script_path:                                   # Location of statistical plot generating script
    reference:                                               # The reference genome/transcriptome location [FASTA file]
    rsem_generate_data_matrix_script_path:                   # Location of the final matrix generating script
                                                             # If this line is empty or missing it will try using the module's associated script
    redirects:
        --append-names:                                      # RSEM will append gene_name/transcript_name to the result files
        --estimate-rspd:                                     # Enables RSEM to learn from the data how the reads are distributed across a transcript
        -p:                                                  # Number of CPUs to use in this analysis
        --bam:                                               # Will use bam files and not fastq
        --no-bam-output:
        --output-genome-bam:                                 # Alignments in genomic coordinates (only if mode is genome)

References

Li, Bo, and Colin N. Dewey. “RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome.” BMC bioinformatics 12.1 (2011): 323.‏

`quast` ^*

Authors: Menachem Sklarz
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running quast on fasta assemblies:

QUAST is executed on the fasta file along the following lines:

If ‘scope’ is specified, the appropriate fasta will be used. An error will occur if the fasta does not exist.
If ‘scope’ is not specified, if a project-wide fasta exists, it will be used. Otherwise, sample-wise fasta files will be used. If none exist, an error will occur.

Note

With compare_mode, you tell the module to run quast on multiple assemblies. This is done in one of three ways:

If scope is sample and a single base step defined, will compare between the samples.
If scope is sample and there is more than one base step defined, will compare between the assemblies found in the base steps for each sample separately.
If scope is project, will compare between the assemblies found in the base steps at the project level.

Requires

fasta files in one of the following slots:
- sample_data["fasta.nucl"]
- sample_data[<sample>]["fasta.nucl"]

Output

Puts output directory in one of:
- self.sample_data["project_data"]["quast"]
- self.sample_data[<sample>]["quast"]

Parameters that can be set

Parameter	Values	Comments
scope	project \| sample	Indicates whether to use a project or sample contigs file.
compare_mode		If ‘scope’ is ‘sample’, specifies whether to analyse each sample separately or to create a single comparison report for all samples.

Lines for parameter file

A quast report for each sample separately:

quast1:
    module: quast
    base: spades1
    script_path: /path/to/quast.py
    scope: sample
    redirects:
        --fast: 

A quast report comparing the sample assemblies:

quast1:
    module: quast
    base: spades1
    script_path: /path/to/quast.py
    compare_mode: 
    scope: sample
    redirects:
        --fast: 

A quast report comparing the project assemblies from different stages of the analysis:

quast1:
    module: quast
    base: 
        - spades1
        - megahit1
    script_path: /path/to/quast.py
    compare_mode: 
    scope: project
    redirects:
        --fast: 

References

Gurevich, A., Saveliev, V., Vyahhi, N. and Tesler, G., 2013. QUAST: quality assessment tool for genome assemblies. Bioinformatics, 29(8), pp.1072-1075.

`htseq_count`

Authors: Menachem Sklarz
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running htseq-count:

See htseq-count documentation.

Requires

fastq files in one of the following slots:
- sample_data[<sample>]["bam"]
- sample_data[<sample>]["sam"]

Output

Puts the output file in:
self.sample_data[<sample>]["HTSeq.counts"]

Parameters that can be set

Parameter	Values	Comments
gff	path to bowtie1 index	If not given, will look for a project bowtie1 index and then for a sample bowtie1 index
-f\|–format	sam \| bam	In redirects. Tells htseq-count which file to use. If not specified, will use whichever file exists.

Lines for parameter file

For external index:

htseq_c1:
    module:         htseq_count
    base:           samtools_STAR1
    script_path:    /storage16/app/bioinfo/python_packages/bin/htseq-count
    gtf:            /fastspace/bioinfo_databases/STAR_GRCh38_Gencode21/gencode.v21.annotation.gtf
    redirects:
        --format:   bam
        -s:         'no'
        -m:         intersection-nonempty

References

Anders, S., Pyl, P.T. and Huber, W., 2015. HTSeq—a Python framework to work with high-throughput sequencing data. Bioinformatics, 31(2), pp.166-169.