Transcriptome Assembly

trinity *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for RNA_seq assembly using the Trinity assembler.

Attention

This module was tested on release 2.5.x. It should also work with 2.4.x

For old versions of Trinity, you might need to use trinity_old module.

The main difference between the modules is that trinity creates an output directory with the word trinity in it as required by the newer release of Trinity.

In order to run on the cluster, you need to install HpcGridRunner.

Requires

  • fastq files in at least one of the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

  • bam file for Genome Guided assembly in:

    • sample_data["bam"]

    • sample_data[<sample>]["bam"]

Output:

  • puts fasta output files in the following slots:

    • for sample-wise assembly:

      • sample_data[<sample>]["fasta.nucl"]

      • sample_data[<sample>]["Trinity.contigs"]

    • for project-wise assembly:

      • sample_data["fasta.nucl"]

      • sample_data["Trinity.contigs"]

Parameters that can be set

Parameter

Values

Comments

scope

sample|project

Set if project-wide fasta slot should be used

skip_gene_to_trans_map

Set to skip construction of the transcript map. You can use a dedicated module, Trinity_gene_to_trans_map. Both put the map in the same slot (gene_trans_map)

get_Trinity_gene_to_trans_map

Path to get_Trinity_gene_to_trans_map.pl. If not passed, will try guessing from Trinity path

TrinityStats

block with ‘path:’ set to TrinityStats.pl executable

genome_guided

Use if you have a project level BAM file with reads mapped to a reference genome and it is coordinate sorted

Group_by

Name of the Column in the grouping file to use for grouping

Only works in project scope: Will create a sample file for Trinity

Lines for parameter file

trinity1:
    module:                 trinity
    base:                   trin_tags1
    script_path:            {Vars.paths.Trinity}
    qsub_params:
        node:               sge213
        -pe:                shared 20
    redirects:
        --grid_exec:        "{Vars.paths.hpc_cmds_GridRunner} --grid_conf {Vars.paths.SGE_Trinity_conf} -c" 
        --grid_node_CPU:    40 
        --grid_node_max_memory: 80G 
        --max_memory:        80G 
        --seqType:          fq
        --min_kmer_cov:     2
        --full_cleanup:
    TrinityStats:
        path:           {Vars.paths.TrinityStats}

References

Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q. and Chen, Z., 2011. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature biotechnology, 29(7), p.644.

add_trinity_tags *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for adding the tags required by Trinity to the ends of the read names. See the Strand specific assembly section of the Trinity manual.

The module uses awk, so you don’t need to pass a script_path. Since you must pass a script_path, just leave it blank.

Attention

The awk command is set to remove all text in title following any whitespace. Make sure the said information is not important. If it is, you can do the mapping step using the base of add_trinity_tags.

Requires

  • fastq files in at least one of the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

Output:

  • puts fastq output files (with added tags) in the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

Lines for parameter file

trintags:
    module:      add_trinity_tags
    base:        trim1
    script_path: NOT_USED

Trinity_gene_to_trans_map

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for creating a gene vs. transcript map for a Trinity based assembly.

Requires

  • fasta files in at least one of the following slots:

    • sample_data[<sample>]["fasta.nucl"] (if scope = sample)

    • sample_data["project_data"]["fasta.nucl"] (if scope = project)

Output:

  • puts gene to trans map in:

    • sample_data[<sample>]["gene_trans_map"] (if scope = sample)

    • sample_data["project_data"]["gene_trans_map"] (if scope = project)

Parameters that can be set

Parameter

Values

Comments

scope

sample|project

Use sample or project scope assembly.

Lines for parameter file

Gene_Trans_Map:
    module:     Trinity_gene_to_trans_map
    base:       trinity
    script_path: {Vars.paths.get_Trinity_gene_to_trans_map.pl}

References

Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q. and Chen, Z., 2011. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature biotechnology, 29(7), p.644.

trinity_mapping

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for running align_and_estimate_abundance.pl on a Trinity assembly and the raw reads.

Tested on versions 2.4.0 and 2.5.0 of Trinity.

See the align_and_estimate_abundance.pl script documentation.

Requires

  • fastq files in at least one of the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

  • A Trinity assembly in one of (depending on scope)

    • sample_data[<sample>]["fasta.nucl"]

    • sample_data["fasta.nucl"]

Output:

  • Puts output files in the following slots:

    • sample_data[<sample>]["bam"]

    • sample_data[<sample>]["unsorted_bam"] (If --coordsort_bam is passed in redirects)

    • sample_data[<sample>]["isoforms.results"]

    • sample_data[<sample>]["genes.results"]

Parameters that can be set

Parameter

Values

Comments

scope

sample|project

Set if project-wide fasta slot should be used

redirects: –gene_trans_map

path or empty

If empty, use internal gene_trans_map. If path, use path as gene_trans_map for all samples. If not passed, performs analysis on isoform level only

redirects: –trinity_mode

If set, will create a gene_trans_map for each sample and store it as sample gene_trans_map

Lines for parameter file

trin_map1:
    module:               trinity_mapping
    base:                 trinity1
    script_path:          {Vars.paths.align_and_estimate_abundance}
    redirects:
        --est_method:     RSEM
        --aln_method:     bowtie
        --trinity_mode:
        --seqType:        fq

References

Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q. and Chen, Z., 2011. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature biotechnology, 29(7), p.644.

trinity_statistics

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for running abundance_estimates_to_matrix.pl on genes or isoforms counts tables produced by align_and_estimate_abundance.pl

See the script documentation here.

This conversion makes sense at the project level - combining all sample matrices into a single, normalized, comparison table. However, for completeness, we included a sample scope option for running the script in each sample separately.

Note

scope is not defined for this module. It only makes sense to run abundance_estimates_to_matrix when comparing many samples against a single assembly

Requires

  • Either genes.results or isoforms.results files in the following slots:

    • sample_data[<sample>]["genes.results"]

    • sample_data[<sample>]["isoforms.results"]

Output:

  • Creates the following files in the following slots:

    • <project>.counts.matrix in self.sample_data["project_data"]["counts.matrix"]

    • <project>.not_cross_norm.fpkm.tmp in self.sample_data["project_data"]["not_cross_norm.fpkm.tmp"]

    • <project>.not_cross_norm.fpkm.tmp.TMM_info.txt in self.sample_data["project_data"]["not_cross_norm.fpkm.tmp.TMM_info.txt"]

    • <project>.TMM.fpkm.matrix in self.sample_data["project_data"]["TMM.fpkm.matrix"]

Parameters that can be set

Parameter

Values

Comments

use_genes

Use ‘genes.results’ matrix. If not passed, use ‘isoforms.results’

redirects: –gene_trans_map

path or ‘none’

If path, use path as gene_trans_map for all samples. If ‘none’, does not produce gene level estimates. In order to use an internal gene_trans_map, do not pass this parameter!

Lines for parameter file

trin_map_stats:
    module:             trinity_statistics
    base:               trin_map1
    script_path:        /path/to/abundance_estimates_to_matrix.pl
    use_genes:       
    redirects:
        --est_method:   RSEM

References

Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q. and Chen, Z., 2011. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature biotechnology, 29(7), p.644.

RSEM

Authors

Liron Levin

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Short Description

A module for running RSEM

Requires

  • fastq file in

    self.sample_data[sample]["fastq.F"] self.sample_data[sample]["fastq.R"] self.sample_data[sample]["fastq.S"]

  • or bam file in

    self.sample_data[sample]["bam"]

Output

  • puts output bam files (if the input is fastq) in:

    self.sample_data[sample]["bam"]

  • puts the location of RSEM results in:

    self.sample_data[sample]["RSEM"] self.sample_data[sample]["genes.results"] self.sample_data[sample]["isoforms.results"]

Parameters that can be set

Parameter

Values

Comments

mode

transcriptome/genome

Is the reference is a genome or a transcriptome?

gff3

None

Use if the mode is genome and the annotation file is in gff3 format

Comments

  • This module was tested on:

    RSEM v1.2.25 bowtie2 v2.2.6

Lines for parameter file

Step_Name:                                                   # Name of this step
    module: RSEM                                             # Name of the module used
    base:                                                    # Name of the step [or list of names] to run after [must be after a bam file generator step or merge with fastq files]
    script_path:                                             # Command for running the RSEM script 
    qsub_params:
        -pe:                                                 # Number of CPUs to reserve for this analysis
    mode:                                                    # transcriptome or genome
    export_transcriptome:                                    # In genome mode set the extracted transcriptome as the new project level fasta.nucl and extract the ranscript-to-gene-map file as project level gene_trans_map
    annotation:                                              # For Genome mode: the location of GTF file [the default] , for GFF3 use the gff3 flag. For Transcriptome mode: transcript-to-gene-map file.
                                                             # If annotation is set to Trinity the transcript-to-gene-map file will be generated using the from_Trinity_to_gene_map script
                                                             # If not set will use only the reference file as unrelated transcripts
    from_Trinity_to_gene_map_script_path:                    # If the mode is transcriptome and the reference was assembled using Trinity it is possible to generate the transcript-to-gene-map file automatically using this script
                                                             # If annotation is set to Trinity and this line is empty or missing it will try using the module's associated script
    gff3:                                                    # Use if the mode is genome and the annotation file is in gff3 format
    mapper:                                                  # bowtie/bowtie2/star 
    mapper_path:                                             # Location of mapper script
    rsem_prepare_reference_script_path:                      # Location of preparing reference script
    plot_stat:                                               # Generate statistical plots
    plot_stat_script_path:                                   # Location of statistical plot generating script
    reference:                                               # The reference genome/transcriptome location [FASTA file]
    rsem_generate_data_matrix_script_path:                   # Location of the final matrix generating script
                                                             # If this line is empty or missing it will try using the module's associated script
    redirects:
        --append-names:                                      # RSEM will append gene_name/transcript_name to the result files
        --estimate-rspd:                                     # Enables RSEM to learn from the data how the reads are distributed across a transcript
        -p:                                                  # Number of CPUs to use in this analysis
        --bam:                                               # Will use bam files and not fastq
        --no-bam-output:
        --output-genome-bam:                                 # Alignments in genomic coordinates (only if mode is genome)

References

Li, Bo, and Colin N. Dewey. “RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome.” BMC bioinformatics 12.1 (2011): 323.‏

quast *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running quast on fasta assemblies:

QUAST is executed on the fasta file along the following lines:

  • If ‘scope’ is specified, the appropriate fasta will be used. An error will occur if the fasta does not exist.

  • If ‘scope’ is not specified, if a project-wide fasta exists, it will be used. Otherwise, sample-wise fasta files will be used. If none exist, an error will occur.

Note

With compare_mode, you tell the module to run quast on multiple assemblies. This is done in one of three ways:

  • If scope is sample and a single base step defined, will compare between the samples.

  • If scope is sample and there is more than one base step defined, will compare between the assemblies found in the base steps for each sample separately.

  • If scope is project, will compare between the assemblies found in the base steps at the project level.

Requires

  • fasta files in one of the following slots:

    • sample_data["fasta.nucl"]

    • sample_data[<sample>]["fasta.nucl"]

Output

  • Puts output directory in one of:
    • self.sample_data["project_data"]["quast"]

    • self.sample_data[<sample>]["quast"]

Parameters that can be set

Parameter

Values

Comments

scope

project | sample

Indicates whether to use a project or sample contigs file.

compare_mode

If ‘scope’ is ‘sample’, specifies whether to analyse each sample separately or to create a single comparison report for all samples.

Lines for parameter file

  1. A quast report for each sample separately:

quast1:
    module: quast
    base: spades1
    script_path: /path/to/quast.py
    scope: sample
    redirects:
        --fast: 
  1. A quast report comparing the sample assemblies:

quast1:
    module: quast
    base: spades1
    script_path: /path/to/quast.py
    compare_mode: 
    scope: sample
    redirects:
        --fast: 
  1. A quast report comparing the project assemblies from different stages of the analysis:

quast1:
    module: quast
    base: 
        - spades1
        - megahit1
    script_path: /path/to/quast.py
    compare_mode: 
    scope: project
    redirects:
        --fast: 

References

Gurevich, A., Saveliev, V., Vyahhi, N. and Tesler, G., 2013. QUAST: quality assessment tool for genome assemblies. Bioinformatics, 29(8), pp.1072-1075.

htseq_count

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running htseq-count:

See htseq-count documentation.

Requires

  • fastq files in one of the following slots:

    • sample_data[<sample>]["bam"]

    • sample_data[<sample>]["sam"]

Output

  • Puts the output file in:

    self.sample_data[<sample>]["HTSeq.counts"]

Parameters that can be set

Parameter

Values

Comments

gff

path to bowtie1 index

If not given, will look for a project bowtie1 index and then for a sample bowtie1 index

-f|–format

sam | bam

In redirects. Tells htseq-count which file to use. If not specified, will use whichever file exists.

Lines for parameter file

For external index:

htseq_c1:
    module:         htseq_count
    base:           samtools_STAR1
    script_path:    /storage16/app/bioinfo/python_packages/bin/htseq-count
    gtf:            /fastspace/bioinfo_databases/STAR_GRCh38_Gencode21/gencode.v21.annotation.gtf
    redirects:
        --format:   bam
        -s:         'no'
        -m:         intersection-nonempty

References

Anders, S., Pyl, P.T. and Huber, W., 2015. HTSeq—a Python framework to work with high-throughput sequencing data. Bioinformatics, 31(2), pp.166-169.