Transcriptome Assembly

trinity *

Authors:Menachem Sklarz
Affiliation:Bioinformatics core facility
Organization:National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for RNA_seq assembly using the Trinity assembler.

Attention

This module was tested on release 2.5.x. It should also work with 2.4.x

For old versions of Trinity, you might need to use trinity_old module.

The main difference between the modules is that trinity creates an output directory with the word trinity in it as required by the newer release of Trinity.

In order to run on the cluster, you need to install HpcGridRunner.

Requires

  • fastq files in at least one of the following slots:

    • sample_data[<sample>]["fastq.F"]
    • sample_data[<sample>]["fastq.R"]
    • sample_data[<sample>]["fastq.S"]

Output:

  • puts fasta output files in the following slots:

    • for sample-wise assembly:

      • sample_data[<sample>]["fasta.nucl"]
      • sample_data[<sample>]["Trinity.contigs"]
    • for project-wise assembly:

      • sample_data["fasta.nucl"]
      • sample_data["Trinity.contigs"]

Parameters that can be set

Parameter Values Comments
scope sample|project Set if project-wide fasta slot should be used
skip_gene_to_trans_map   Set to skip construction of the transcript map. You can use a dedicated module, Trinity_gene_to_trans_map. Both put the map in the same slot (gene_trans_map)
get_Trinity_gene_to_trans_map   Path to get_Trinity_gene_to_trans_map.pl. If not passed, will try guessing from Trinity path
TrinityStats   block with ‘path:’ set to TrinityStats.pl executable

Lines for parameter file

trinity1:
    module:                 trinity
    base:                   trin_tags1
    script_path:            {Vars.paths.Trinity}
    qsub_params:
        node:               sge213
        -pe:                shared 20
    redirects:
        --grid_exec:        "{Vars.paths.hpc_cmds_GridRunner} --grid_conf {Vars.paths.SGE_Trinity_conf} -c" 
        --grid_node_CPU:    40 
        --grid_node_max_memory: 80G 
        --max_memory:        80G 
        --seqType:          fq
        --min_kmer_cov:     2
        --full_cleanup:
    TrinityStats:
        path:           {Vars.paths.TrinityStats}

References

Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q. and Chen, Z., 2011. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature biotechnology, 29(7), p.644.

add_trinity_tags *

Authors:Menachem Sklarz
Affiliation:Bioinformatics core facility
Organization:National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for adding the tags required by Trinity to the ends of the read names. See the Strand specific assembly section of the Trinity manual.

The module uses awk, so you don’t need to pass a script_path. Since you must pass a script_path, just leave it blank.

Attention

The awk command is set to remove all text in title following any whitespace. Make sure the said information is not important. If it is, you can do the mapping step using the base of add_trinity_tags.

Requires

  • fastq files in at least one of the following slots:

    • sample_data[<sample>]["fastq.F"]
    • sample_data[<sample>]["fastq.R"]
    • sample_data[<sample>]["fastq.S"]

Output:

  • puts fastq output files (with added tags) in the following slots:

    • sample_data[<sample>]["fastq.F"]
    • sample_data[<sample>]["fastq.R"]
    • sample_data[<sample>]["fastq.S"]

Lines for parameter file

trintags:
    module:      add_trinity_tags
    base:        trim1
    script_path: NOT_USED

Trinity_gene_to_trans_map

Authors:Menachem Sklarz
Affiliation:Bioinformatics core facility
Organization:National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for creating a gene vs. transcript map for a Trinity based assembly.

Requires

  • fasta files in at least one of the following slots:

    • sample_data[<sample>]["fasta.nucl"] (if scope = sample)
    • sample_data["project_data"]["fasta.nucl"] (if scope = project)

Output:

  • puts gene to trans map in:

    • sample_data[<sample>]["gene_trans_map"] (if scope = sample)
    • sample_data["project_data"]["gene_trans_map"] (if scope = project)

Parameters that can be set

Parameter Values Comments
scope sample|project Use sample or project scope assembly.

Lines for parameter file

Gene_Trans_Map:
    module:     Trinity_gene_to_trans_map
    base:       trinity
    script_path: {Vars.paths.get_Trinity_gene_to_trans_map.pl}

References

Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q. and Chen, Z., 2011. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature biotechnology, 29(7), p.644.

trinity_mapping

Authors:Menachem Sklarz
Affiliation:Bioinformatics core facility
Organization:National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for running align_and_estimate_abundance.pl on a Trinity assembly and the raw reads.

Tested on versions 2.4.0 and 2.5.0 of Trinity.

See the align_and_estimate_abundance.pl script documentation.

Requires

  • fastq files in at least one of the following slots:

    • sample_data[<sample>]["fastq.F"]
    • sample_data[<sample>]["fastq.R"]
    • sample_data[<sample>]["fastq.S"]
  • A Trinity assembly in one of (depending on scope)

    • sample_data[<sample>]["fasta.nucl"]
    • sample_data["fasta.nucl"]

Output:

  • Puts output files in the following slots:

    • sample_data[<sample>]["bam"]
    • sample_data[<sample>]["unsorted_bam"] (If --coordsort_bam is passed in redirects)
    • sample_data[<sample>]["isoforms.results"]
    • sample_data[<sample>]["genes.results"]

Parameters that can be set

Parameter Values Comments
scope sample|project Set if project-wide fasta slot should be used
redirects: –gene_trans_map path or empty If empty, use internal gene_trans_map. If path, use path as gene_trans_map for all samples. If not passed, performs analysis on isoform level only
redirects: –trinity_mode   If set, will create a gene_trans_map for each sample and store it as sample gene_trans_map

Lines for parameter file

trin_map1:
    module:               trinity_mapping
    base:                 trinity1
    script_path:          {Vars.paths.align_and_estimate_abundance}
    redirects:
        --est_method:     RSEM
        --aln_method:     bowtie
        --trinity_mode:
        --seqType:        fq

References

Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q. and Chen, Z., 2011. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature biotechnology, 29(7), p.644.

trinity_statistics

Authors:Menachem Sklarz
Affiliation:Bioinformatics core facility
Organization:National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for running abundance_estimates_to_matrix.pl on genes or isoforms counts tables produced by align_and_estimate_abundance.pl

See the script documentation here.

This conversion makes sense at the project level - combining all sample matrices into a single, normalized, comparison table. However, for completeness, we included a sample scope option for running the script in each sample separately.

Note

scope is not defined for this module. It only makes sense to run abundance_estimates_to_matrix when comparing many samples against a single assembly

Requires

  • Either genes.results or isoforms.results files in the following slots:

    • sample_data[<sample>]["genes.results"]
    • sample_data[<sample>]["isoforms.results"]

Output:

  • Creates the following files in the following slots:

    • <project>.counts.matrix in self.sample_data["project_data"]["counts.matrix"]
    • <project>.not_cross_norm.fpkm.tmp in self.sample_data["project_data"]["not_cross_norm.fpkm.tmp"]
    • <project>.not_cross_norm.fpkm.tmp.TMM_info.txt in self.sample_data["project_data"]["not_cross_norm.fpkm.tmp.TMM_info.txt"]
    • <project>.TMM.fpkm.matrix in self.sample_data["project_data"]["TMM.fpkm.matrix"]

Parameters that can be set

Parameter Values Comments
use_genes   Use ‘genes.results’ matrix. If not passed, use ‘isoforms.results’
redirects: –gene_trans_map path or ‘none’ If path, use path as gene_trans_map for all samples. If ‘none’, does not produce gene level estimates. In order to use an internal gene_trans_map, do not pass this parameter!

Lines for parameter file

trin_map_stats:
    module:             trinity_statistics
    base:               trin_map1
    script_path:        /path/to/abundance_estimates_to_matrix.pl
    use_genes:       
    redirects:
        --est_method:   RSEM

References

Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q. and Chen, Z., 2011. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature biotechnology, 29(7), p.644.

RSEM

Authors:Liron Levin
Affiliation:Bioinformatics core facility
Organization:National Institute of Biotechnology in the Negev, Ben Gurion University.

Short Description

A module for running RSEM

Requires

  • fastq file in
    self.sample_data[sample]["fastq.F"] self.sample_data[sample]["fastq.R"] self.sample_data[sample]["fastq.S"]
  • or bam file in
    self.sample_data[sample]["bam"]

Output

  • puts output bam files (if the input is fastq) in:
    self.sample_data[sample]["bam"]
  • puts the location of RSEM results in:
    self.sample_data[sample]["RSEM"] self.sample_data[sample]["genes.results"] self.sample_data[sample]["isoforms.results"]

Parameters that can be set

Parameter Values Comments
mode transcriptome/genome Is the reference is a genome or a transcriptome?
gff3 None Use if the mode is genome and the annotation file is in gff3 format

Comments

  • This module was tested on:
    RSEM v1.2.25 bowtie2 v2.2.6

Lines for parameter file

Step_Name:                                                   # Name of this step
    module: RSEM                                             # Name of the module used
    base:                                                    # Name of the step [or list of names] to run after [must be after a bam file generator step or merge with fastq files]
    script_path:                                             # Command for running the RSEM script 
    qsub_params:
        -pe:                                                 # Number of CPUs to reserve for this analysis
    mode:                                                    # transcriptome or genome
    annotation:                                              # For Genome mode: the location of GTF file [the default] , for GFF3 use the gff3 flag. For Transcriptome mode: transcript-to-gene-map file.
                                                             # If annotation is set to Trinity the transcript-to-gene-map file will be generated using the from_Trinity_to_gene_map script
    from_Trinity_to_gene_map_script_path:                    # If the mode is transcriptome and the reference was assembled using Trinity it is possible to generate the transcript-to-gene-map file automatically using this script
                                                             # If annotation is set to Trinity and this line is empty or missing it will try using the module's associated script
    gff3:                                                    # Use if the mode is genome and the annotation file is in gff3 format
    mapper:                                                  # bowtie/bowtie2/star 
    mapper_path:                                             # Location of mapper script
    rsem_prepare_reference_script_path:                      # Location of preparing reference script
    plot_stat:                                               # Generate statistical plots
    plot_stat_script_path:                                   # Location of statistical plot generating script
    reference:                                               # The reference genome/transcriptome location [FASTA file]
    rsem_generate_data_matrix_script_path:                   # Location of the final matrix generating script
                                                             # If this line is empty or missing it will try using the module's associated script
    redirects:
        --append-names:                                      # RSEM will append gene_name/transcript_name to the result files
        --estimate-rspd:                                     # Enables RSEM to learn from the data how the reads are distributed across a transcript
        -p:                                                  # Number of CPUs to use in this analysis
        --bam:                                               # Will use bam files and not fastq
        --no-bam-output:
        --output-genome-bam:                                 # Alignments in genomic coordinates (only if mode is genome)

References

Li, Bo, and Colin N. Dewey. “RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome.” BMC bioinformatics 12.1 (2011): 323.‏

quast *

Authors:Menachem Sklarz
Affiliation:Bioinformatics core facility
Organization:National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running quast on fasta assemblies:

QUAST is executed on the fasta file along the following lines:

  • If ‘scope’ is specified, the appropriate fasta will be used. An error will occur if the fasta does not exist.
  • If ‘scope’ is not specified, if a project-wide fasta exists, it will be used. Otherwise, sample-wise fasta files will be used. If none exist, an error will occur.

Note

With compare_mode, you tell the module to run quast on multiple assemblies. This is done in one of three ways:

  • If scope is sample and a single base step defined, will compare between the samples.
  • If scope is sample and there is more than one base step defined, will compare between the assemblies found in the base steps for each sample separately.
  • If scope is project, will compare between the assemblies found in the base steps at the project level.

Requires

  • fasta files in one of the following slots:

    • sample_data["fasta.nucl"]
    • sample_data[<sample>]["fasta.nucl"]

Output

  • Puts output directory in one of:
    • self.sample_data["project_data"]["quast"]
    • self.sample_data[<sample>]["quast"]

Parameters that can be set

Parameter Values Comments
scope project | sample Indicates whether to use a project or sample contigs file.
compare_mode   If ‘scope’ is ‘sample’, specifies whether to analyse each sample separately or to create a single comparison report for all samples.

Lines for parameter file

  1. A quast report for each sample separately:
quast1:
    module: quast
    base: spades1
    script_path: /path/to/quast.py
    scope: sample
    redirects:
        --fast: 
  1. A quast report comparing the sample assemblies:
quast1:
    module: quast
    base: spades1
    script_path: /path/to/quast.py
    compare_mode: 
    scope: sample
    redirects:
        --fast: 
  1. A quast report comparing the project assemblies from different stages of the analysis:
quast1:
    module: quast
    base: 
        - spades1
        - megahit1
    script_path: /path/to/quast.py
    compare_mode: 
    scope: project
    redirects:
        --fast: 

References

Gurevich, A., Saveliev, V., Vyahhi, N. and Tesler, G., 2013. QUAST: quality assessment tool for genome assemblies. Bioinformatics, 29(8), pp.1072-1075.

htseq_count

Authors:Menachem Sklarz
Affiliation:Bioinformatics core facility
Organization:National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running htseq-count:

See htseq-count documentation.

Requires

  • fastq files in one of the following slots:

    • sample_data[<sample>]["bam"]
    • sample_data[<sample>]["sam"]

Output

  • Puts the output file in:
    self.sample_data[<sample>]["HTSeq.counts"]

Parameters that can be set

Parameter Values Comments
gff path to bowtie1 index If not given, will look for a project bowtie1 index and then for a sample bowtie1 index
-f|–format sam | bam In redirects. Tells htseq-count which file to use. If not specified, will use whichever file exists.

Lines for parameter file

For external index:

htseq_c1:
    module:         htseq_count
    base:           samtools_STAR1
    script_path:    /storage16/app/bioinfo/python_packages/bin/htseq-count
    gtf:            /fastspace/bioinfo_databases/STAR_GRCh38_Gencode21/gencode.v21.annotation.gtf
    redirects:
        --format:   bam
        -s:         'no'
        -m:         intersection-nonempty

References

Anders, S., Pyl, P.T. and Huber, W., 2015. HTSeq—a Python framework to work with high-throughput sequencing data. Bioinformatics, 31(2), pp.166-169.