NeatSeq-Flow modules

Preparation and QC

Import *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for importing and merging files from the sample file into NeatSeq-Flow.

Files can be imported in three ways:

  1. If there is a single file in the type (per sample), it can be imported, i.e. the existing file path will be used as the source for the workflow.

  2. If there are multiple files in the type, or you would like to make a local copy of the raw files, the file(s) can be copied and concatenated to the workflow directory.

  3. If the raw files are compressed, importing can include decompression as well as concatenation.

Tip

If you have plenty of disk space, the 2nd and 3rd options are the recommended approaches. They ensure the original files remain untouched, and when the workflow is complete you can discard the copies produced by NeatSeq-Flow.

The Import module can be used in two modes:

The Basic mode

NeatSeq-Flow will attempt to guess all the parameters it requires. Multiple files will be concatenated and stored in the file type index according to the table below. File types not included in the table will be stored in the file type index by the type specified in the sample file.

You have to make sure that all files of each file type have the same extension for NeatSeq-Flow to guess the script_path and pipe parameters.

The Advanced mode

This mode is used when more control over data importing and concatenation is required. It enables full control over which file types are imported, how they are copied and in which slots they are placed in the file type index. It also enables importing file types not recognized by NeatSeq-Flow (see list below).

In this mode, you have to define the following lists: src, trg, script_path, scope and ext. For each file type in the sample file, you should have an entry in the src list. The other lists apply to the equivalent entry in src: trg is the target file type (in the file type index) for the imported files, script_path is the shell command used to concatenate the source type files, scope is the scope for which the source type is defined, and ext is the suffix to append to the final filenames. Strings are expanded to the length of the src list, so if script_path is the same for all source types, it is enough to specify it once.

When using the Advanced mode, i.e. when passing the src list, you must also define the other lists: trg, ext, scope and script_path. However, NeatSeq-Flow will try to guess missing values based on the lists of recognized file types and extensions.

If some of the file types in src are recognized and some are not, you can pass the lists mentioned above with values for the unrecognized types, leaving null in the positions of the recognized types. These null values will be guessed by NeatSeq-Flow.

The advanced mode is experimental, and documentation will hopefully improve as we gain experience with it.

Note

Definition of script_path in the import module

script_path should be a shell program that receives a list of files and produces one single output file on the standard output. Examples of such programs are cat for text files and gzip -cd for gzipped files. Other types of compressed files should have an equivalent command as well.

Tip

NeatSeq-Flow attempts to guess the script_path and pipe values based on the input file extensions. For this to work, leave the script_path and pipe lists empty and make sure all files from the same source have the same extensions (e.g. all gzipped files should have .gz as file extension).

If you want NeatSeq-Flow to guess only some of the script_path values, set them to null or to ..guess.., e.g. if src is [Single,TYP1] and script_path is [null,cat], then the script_path for Single will be guessed and the script_path for TYP1 will be set to cat.

Two more options are available for script_path: ..skip.. will skip the type entirely, while ..import.. will import the values from the sample file into the relevant slots without actually producing any scripts (this is useful for including entities which are not files in the sample file, e.g. in the qiime2 pipeline you might want to include a semantic type in the sample file).
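
For illustration, a minimal sketch using both options (the TYP1 and SemType type names are hypothetical):

Import1:
    module:         Import
    src:            [Forward,   TYP1,       SemType]
    script_path:    [null,      ..skip..,   ..import..]
    # null: guessed; ..skip..: TYP1 is skipped entirely;
    # ..import..: SemType values are imported without producing scripts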

The following extensions are recognized:

File extensions recognized by NeatSeq-Flow

Extension   script_path   pipe
.fasta      cat
.faa        cat
.fna        cat
.txt        cat
.tsv        cat
.csv        cat
.fastq      cat
.fa         cat
.fq         cat
.gz         gzip -cd
.zip        echo          'xargs -d " " -I % sh -c "unzip -p %"'
.bz2        bzip2 -cd
.dsrc2      echo          'xargs -d " " -I % sh -c "dsrc2 d -s %"'
.dsrc       echo          'xargs -d " " -I % sh -c "dsrc d -s %"'

Requires

  • For the basic mode:
    • A list of files of the following types, either in [<sample>] or in [project_data]:

File types recognized by NeatSeq-Flow

Source        Target
Forward       fastq.F
Reverse       fastq.R
Single        fastq.S
Nucleotide    fasta.nucl
Protein       fasta.prot
SAM           sam
BAM           bam
REFERENCE     reference
VCF           vcf
G.VCF         g.vcf
GTF           gtf
GFF           gff
GFF3          gff3
manifest      qiime2.manifest
barcodes      barcodes

  • For the Advanced mode:
    • Lists of files in any file type, either in [<sample>] or in [project_data].

Output

  • Imported files of the types in the table above are placed in slots according to the types in the 2nd column of the table.

Attention

If you want to do something more complex with the combined files, you can use the pipe parameter to send extra commands to be piped on the files after the main command. This is an experimental feature and should be used with care.

e.g.: You can get files from a remote location by setting script_path to curl and pipe to gzip -cd. This will download the files with curl, decompress them and concatenate them into the target file. In the sample file, specify remote URLs instead of local paths. This will work only for one file per sample.

As of version 1.3.0, pipe can be a list of the same length as src, and it will be treated like the other lists described above.
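
For example, a minimal sketch in which only the second source type needs a pipe (the TABLE type name and the cut command are illustrative):

Import1:
    module:         Import
    src:            [Forward,   TABLE]
    script_path:    [gzip -cd,  cat]
    trg:            [null,      my.table]
    ext:            [null,      tsv]
    scope:          [sample,    sample]
    pipe:           [null,      cut -f1]   # piped only on the TABLE files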

Parameters that can be set

  • script_path: The shell command to use for merging the source files.

  • src: A list of source file types as they appear in the sample file.

  • trg: A list of target file types for the imported files.

  • scope (sample | project): The scope at which each of the sources can be found.

  • ext: The suffix to append to the imported filename.

  • pipe: Additional commands to be piped on the files before writing to file.

Lines for parameter file

Basic mode, gzipped files:

import1:
    module: Import
    script_path: gzip -cd

Basic mode, remote files:

Import1:
    module: Import
    script_path: curl
    pipe:  gzip -cd

Advanced mode, mixture of types and scopes:

Import1:
    module:         Import
    src:            [UR1,       UR2]
    script_path:    [gzip -cd,  cat]
    scope:          [sample,    project]
    trg:            [unrecog1,  unrecog2]
    ext:            [ur1,       ur2]

Advanced mode, both recognized and unrecognized file types:

Import1:
    module:         Import
    src:            [UR1,       Forward,    Reverse]
    script_path:    [gzip -cd,  null,       null]
    scope:          # Guess!
    trg:            [unrecog1,  null,       null]
    ext:            [ur1,       null,       null]

Advanced mode, same types in samples and project:

Import1:
    module:         Import
    src:            [Nucleotide,    Nucleotide]
    script_path:    [cat,           cat]
    scope:          [sample,        project]
    trg:            
    ext:            

fastqc_html *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running fastqc.

Creates scripts that run fastqc on all available fastq files.

Requires

  • fastq files in one of the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

Output

  • puts fastqc output files in the following slots:

    • sample_data[<sample>]["fastqc_fastq.F_html"]

    • sample_data[<sample>]["fastqc_fastq.R_html"]

    • sample_data[<sample>]["fastqc_fastq.S_html"]

  • puts fastqc zip files in the following slots:

    • sample_data[<sample>]["fastqc_fastq.F_zip"]

    • sample_data[<sample>]["fastqc_fastq.R_zip"]

    • sample_data[<sample>]["fastqc_fastq.S_zip"]

Lines for parameter file

fqc_merge1:
    module: fastqc_html
    base: merge1
    script_path: /path/to/FastQC/fastqc
    qsub_params:
        -pe: shared 15
    redirects:
        --threads: 15

References

Andrews, S., 2010. FastQC: a quality control tool for high throughput sequence data.

trimmo *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running trimmomatic on fastq files

Requires

  • fastq files in at least one of the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

Output

  • puts fastq output files in the following slots:

    • sample_data[<sample>]["fastq.F"|"fastq.R"|"fastq.S"]

Parameters that can be set

  • spec_dir (path): If trimmomatic must be executed within a particular directory, specify that directory here.

  • todo (e.g. LEADING:20 TRAILING:20): The trimmomatic arguments.

Lines for parameter file

trim1:
    module: trimmo
    base: merge1
    script_path: java -jar trimmomatic-0.32.jar
    qsub_params:
        -pe: shared 20
        node: node1
    spec_dir: /path/to/Trimmomatic_dir/
    todo: LEADING:20 TRAILING:20
    redirects:
        -threads: 20

References

Bolger, A.M., Lohse, M. and Usadel, B., 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30(15), pp.2114-2120.

Multiqc *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for preparing a MultiQC report for all samples.

Tip

By default, the module will search for parsable reports in the directories of all the modules in the branch leading to this instance. To search only in the directories of the explicit base steps, specify the bases_only parameter.

Requires

  • No real requirements. Will give a report with information if one of the base steps produces reports that MultiQC can read, e.g. fastqc, bowtie2, samtools etc.

Output

  • puts report dir in the following slot:

    • self.sample_data[<sample>]["Multiqc_report"]

Parameters that can be set

  • bases_only: Search directories of explicit base steps only.

Lines for parameter file

firstMultQC:
    module: Multiqc
    base:
        - sam_bwt2_1
        - fqc_trim1
    bases_only:
    script_path: /path/to/multiqc

References

Ewels, P., Magnusson, M., Lundin, S. and Käller, M., 2016. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19), pp.3047-3048.

Cutadapt

Authors

Liron Levin

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Short Description

A module for running cutadapt on fastq files

Requires

  • fastq files in at least one of the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

Output

  • puts fastq output files in the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

Parameters that can be set

  • paired: Analyse Forward and Reverse reads together.

  • Demultiplexing: Use to demultiplex the adaptors; must be in the format name=adaptor_seq.

Comments

  • This module was tested on:

    Cutadapt v1.12.1

Lines for parameter file

Step_Name:                       # Name of this step
    module: Cutadapt             # Name of the module used
    base:                        # Name of the step [or list of names] to run after [must be after a merge step]
    script_path:                 # Command for running the Cutadapt script
    paired:                      # Analyse Forward and Reverse reads together.
    Demultiplexing:              # Use to Demultiplex the adaptors, needs to be in the format of name=adaptor_seq
    qsub_params:
        -pe:                     # Number of CPUs to reserve for this analysis
    redirects:
        --too-short-output:      # will replace @ with the location of the sample dir  [e.g. @too_short.fq] 
        -a:                      # Use to trim poly A in SE reads [e.g. "A{100} -A T{100}"]

References

Martin, M., 2011. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17(1), pp.10-12.

Trim_Galore

Authors

Liron Levin

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Short Description

A module for running Trim Galore on fastq files

Requires

  • fastq files in at least one of the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

Output

  • puts fastq output files in the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

  • puts unpaired fastq output files in the following slots:

    • sample_data[<sample>]["fastq.F.unpaired"]

    • sample_data[<sample>]["fastq.R.unpaired"]

Parameters that can be set

  • cutadapt_path: Location of the cutadapt executable.

Comments

  • This module was tested on:

    Trim Galore v0.4.2, Cutadapt v1.12.1

Lines for parameter file

Step_Name:                       # Name of this step
    module: Trim_Galore          # Name of the module used
    base:                        # Name of the step [or list of names] to run after [must be after a merge step]
    script_path:                 # Command for running the Trim Galore script
    qsub_params:
        -pe:                     # Number of CPUs to reserve for this analysis
    cutadapt_path:               # Location of cutadapt executable 
    redirects:
        --length:                # Parameters for running Trim Galore
        -q:                      # Parameters for running Trim Galore

References

fastq_screen

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for executing fastq_screen on sequence files.

Input files are specified with the type parameter or taken from the fastq slots, one script per fastq file.

In regular mode, no output files are produced. However, if --tag is included, the tagged file will be stored in the equivalent fastq.X slot. If --filter is included, the filtered file will be stored in the equivalent fastq.X slot.

The parameters can be passed through a configuration file specified in the redirected parameters with the --conf parameter.

Alternatively, if you do not specify the configuration file, one will be produced for you. For this, you must include:

  1. A genomes section specifying genome indices to screen against (see examples below) and

  2. an aligner section specifying the alignment program to use and its path.

Additionally, if a --threads parameter is included in the redirects, it will be incorporated into the configuration file.

Attention

If a --bisulfite redirected parameter is included, it should contain the path to Bismark, which will be included in the configuration file.
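
A minimal sketch of such an instance (the Bismark path is a placeholder):

fq_screen_bis:
    module:         fastq_screen
    base:           merge1
    script_path:    {Vars.paths.fastq_screen}
    aligner:
        bowtie2:    {Vars.paths.bowtie2}
    genomes:
        Human:      {Vars.databases.human}
    redirects:
        --bisulfite:    /path/to/Bismark   # written into the generated configuration file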

Requires

  • fastq files in at least one of the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

Output

  • If --tag and/or --filter or --nohits are included, puts output fastq files in:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

Parameters that can be set

  • genomes (name: index pairs, see examples): If --conf is not provided, the genomes to screen against.

  • aligner (a single name: path pair): If --conf is not provided, the aligner to use and its path.

Lines for parameter file

No configuration file:

fastq_screen:
    module:         fastq_screen
    base:           merge1
    script_path:    {Vars.paths.fastq_screen}
    qsub_params:
        -pe:        shared 60
    aligner:
        bowtie2:    {Vars.paths.bowtie2}
    genomes:
        Human:      {Vars.databases.human}
        Mouse:      {Vars.databases.mouse}
        PhiX:       {Vars.databases.phix}
    redirects:
        --filter:   200
        --tag:
        # --nohits:
        --force: 
        --threads:  60 

With configuration file:

fastq_screen:
    module:         fastq_screen
    base:           merge1
    script_path:    {Vars.paths.fastq_screen}
    qsub_params:
        -pe:        shared 60
    redirects:
        --conf:     {Vars.paths.fastq_screen_conf_file}
        --filter:   200
        --tag:
        # --nohits:
        --force: 

References

Wingett, S.W. and Andrews, S., 2018. FastQ Screen: A tool for multi-genome mapping and quality control. F1000Research, 7.

Mapping

bowtie2_builder *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running bowtie2 index builder:

Builds a bowtie2 index for a fasta file stored at the project or sample level.

Determine which one will be used by specifying scope as either project or sample.

Requires

  • fasta files in one of the following slots:

    • sample_data[<sample>]["fasta.nucl"]

    • sample_data["fasta.nucl"]

Output

  • Puts output index files in one of the following slots:
    • self.sample_data[<sample>]["bowtie2.index"]

    • self.sample_data["project_data"]["bowtie2.index"]

  • Puts the fasta file in the following slot:
    • self.sample_data[<sample>]["reference"]

Parameters that can be set

  • scope (project | sample): Indicates whether to use a project fasta or a sample fasta.

Lines for parameter file

bwt2_build:
    module: bowtie2_builder
    base: trinity1
    script_path: /path/to/bowtie2-build
    scope: project

References

Langmead, B. and Salzberg, S.L., 2012. Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4), pp.357-359.

bowtie2_mapper *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running bowtie2 mapper:

The reads stored in each sample are aligned to one of the following bowtie2 indices:

  • An external index passed with the -x parameter.

  • A bowtie2 index of a project fasta file, such as an assembly from all samples. Specify by setting scope: project.

  • A sample bowtie2 index of a sample-specific fasta file, such as from a sample-wise assembly or from the sample file. Specify by setting scope: sample.

The latter two options must come after a bowtie2_builder instance.

Tip

See the documentation for the bowtie2_builder module.

Note

fastq files are never defined project-wide

The scope parameter controls the origin of the index files, i.e. whether the fasta file to map to is an assembly of the sample reads (scope: sample) or an assembly of all reads in the project (scope: project). The reads to be mapped are always sample reads, as a fastq slot is not defined at the project level.

Requires

  • fastq files in one of the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

Output

  • Puts output sam files in the following slots:
    • self.sample_data[<sample>]["sam"]

  • Puts the name of the mapper in:
    • self.sample_data[<sample>]["mapper"]

  • puts fasta of reference genome (if one is given in param file) in:
    • self.sample_data[<sample>]["reference"]

Parameters that can be set

  • -x (path to bowtie2 index): If not given, will look for a project bowtie2 index and then for a sample bowtie2 index.

  • ref_genome (path to genome fasta): If -x is NOT given, will use the equivalent internal fasta. If -x is passed and ref_genome is NOT passed, will leave the reference slot empty.

  • get_map_log: Store the log produced by bowtie2 (the alignment summary bowtie2 prints to its standard error).

  • scope (project | sample): Indicates whether to use a project or sample bowtie2 index.

Lines for parameter file

For external index:

bwt2_1:
    module: bowtie2_mapper
    base: trim1
    script_path: /path/to/bowtie2
    qsub_params:
        -pe: shared 20
    get_map_log:
    ref_genome: /path/to/ref_genome.fna
    redirects:
        -p: 20
        -q: null
        -x: /path/to/bowtie2.index/ref_genome

Using a bowtie2 index constructed from a project fasta:

bwt2_1:
    module: bowtie2_mapper
    base: bwt2_bld1
    script_path: /path/to/bowtie2
    qsub_params:
        -pe: shared 20
    get_map_log:
    scope: project
    redirects:
        -p: 20
        -q: null
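
Using bowtie2 indices constructed from sample fastas (a minimal sketch; bwt2_bld_smp stands for a bowtie2_builder instance run with scope: sample):

bwt2_smp:
    module: bowtie2_mapper
    base: bwt2_bld_smp
    script_path: /path/to/bowtie2
    get_map_log:
    scope: sample
    redirects:
        -p: 20
        -q: null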

References

Langmead, B. and Salzberg, S.L., 2012. Fast gapped-read alignment with Bowtie 2. Nature methods, 9(4), pp.357-359.

bowtie1_builder *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running bowtie1 index builder:

Requires

  • fasta files in one of the following slots:

    • sample_data["fasta.nucl"]

    • sample_data[<sample>]["fasta.nucl"]

Output

Puts output index files in one of the following slots:
  • self.sample_data[<sample>]["bowtie1.index"]

  • self.sample_data["project_data"]["bowtie1.index"]

Parameters that can be set

  • scope (project | sample): Indicates whether to use a project fasta or a sample fasta.

Lines for parameter file

bwt1_bld_ind:
    module: bowtie1_builder
    base: trinity1
    script_path: /path/to/bowtie
    scope: project

References

Langmead, B., Trapnell, C., Pop, M. and Salzberg, S.L., 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology, 10(3), p.R25.

bowtie1_mapper *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running bowtie1 mapper:

The reads stored in each sample are aligned to one of the following bowtie indices:

  • An external index passed with the ebwt parameter.

  • A bowtie index of a project fasta file, such as an assembly from all samples. Specify by setting scope: project.

  • A sample bowtie1 index of a sample-specific fasta file, such as from a sample-wise assembly or from the sample file. Specify by setting scope: sample.

The latter two options must come after a bowtie1_builder instance.

Requires

  • fastq files in one of the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

Output

  • Puts output sam files in the following slots:

    self.sample_data[<sample>]["sam"]

  • Puts the name of the mapper in:

    self.sample_data[<sample>]["mapper"]

  • Puts fasta of reference genome (if one is given in param file) in:

    self.sample_data[<sample>]["reference"]

Parameters that can be set

  • ebwt (path to bowtie1 index): If not given, will look for a project bowtie1 index and then for a sample bowtie1 index.

  • ref_genome (path to genome fasta): If ebwt is NOT given, will use the equivalent internal fasta. If ebwt IS given and ref_genome is NOT passed, will leave the reference slot empty.

  • scope (project | sample): Indicates whether to use a project or sample bowtie1 index.

Lines for parameter file

For external index:

bwt1:
    module: bowtie1_mapper
    base: trim1
    script_path: /path/to/bowtie
    qsub_params:
        -pe: shared 20
    ebwt: /path/to/bowtie1.index/ref_genome
    ref_genome: /path/to/ref_genome.fna
    redirects:
        -p: 20

For project bowtie index:

bwt1_1:
    module: bowtie1_mapper
    base: bwt1_bld_ind
    script_path: /path/to/bowtie
    scope: project
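
For a sample bowtie index, a minimal sketch (assuming bwt1_bld_ind was run with scope: sample):

bwt1_smp:
    module: bowtie1_mapper
    base: bwt1_bld_ind
    script_path: /path/to/bowtie
    scope: sample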

References

Langmead, B., Trapnell, C., Pop, M. and Salzberg, S.L., 2009. Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology, 10(3), p.R25.

bwa_builder *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running bwa index builder:

Builds a bwa index for a fasta file stored at the project or sample level.

Determine which one will be used by specifying scope as either project or sample.

Requires

  • fasta files in one of the following slots:

    • sample_data[<sample>]["fasta.nucl"]

    • sample_data["fasta.nucl"]

Output

  • Puts output index files in one of the following slots:
    • self.sample_data[<sample>]["bwa_index"]

    • self.sample_data["project_data"]["bwa_index"]

  • Puts the fasta file in the following slot:
    • self.sample_data[<sample>]["reference"]

Parameters that can be set

  • scope (project | sample): Indicates whether to use a project fasta or a sample fasta.

Lines for parameter file

bwa_bld_ind:
    module: bwa_builder
    base: spades1
    script_path: /path/to/bwa index
    scope: project

References

Li, H. and Durbin, R., 2009. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14), pp.1754-1760.

bwa_mapper *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running bwa mapper:

The reads stored in each sample are aligned to one of the following bwa indices:

  • An external index passed with the ref_index parameter.

  • A bwa index of a project fasta file, such as an assembly from all samples. Specify by setting scope: project.

  • A sample bwa index of a sample-specific fasta file, such as from a sample-wise assembly or from the sample fasta file. Specify by setting scope: sample.

The latter two options must come after a bwa_builder instance.

Requires

  • fastq files in one of the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

  • If mod is one of samse, sampe, the sai files are required as well (created by a bwa aln step):
    • self.sample_data[<sample>]["saiF|saiR|saiS"]

Output

  • Puts output sam files in the following slots:
    • If mod is one of mem, samse, sampe, bwasw:
      • self.sample_data[<sample>]["sam"]

    • If mod is aln:
      • self.sample_data[<sample>]["saiF|saiR|saiS"]

  • Puts the name of the mapper in:
    • self.sample_data[<sample>]["mapper"]

  • puts fasta of reference genome (if one is given in param file) in:
    • self.sample_data[<sample>]["reference"]

Parameters that can be set

  • ref_index (path to bwa index): If not given, will look for a project bwa index and then for a sample bwa index.

  • ref_genome (path to genome fasta): If ref_index is NOT given, will use the equivalent internal fasta. If ref_index is passed and ref_genome is NOT passed, will leave the reference slot empty.

  • scope (project | sample): Indicates whether to use a project or sample bwa index.

Lines for parameter file

For external index:

  1. Using mem:

bwa_mem_1:
    module: bwa_mapper
    base: trim1
    script_path: /path/to/bwa
    mod: mem
    qsub_params:
        -pe: shared 20
    ref_genome: /path/to/ref_genome.fna
    ref_index: /path/to/bwa_index/ref_genome
    redirects:
        -t: 20

  2. Using aln - samse/sampe:

bwa_aln_1:
    module: bwa_mapper
    base: trim1
    script_path:    /path/to/bwa
    mod: aln
    qsub_params:
        -pe: shared 20
    ref_genome: /path/to/ref_genome.fna
    ref_index: /path/to/bwa_index/ref_genome
    redirects:
        -t: 20
bwa_samse_1:
    module: bwa_mapper
    base:           bwa_aln_1
    script_path: /path/to/bwa
    mod: samse
    ref_genome: /path/to/ref_genome.fna
    ref_index: /path/to/bwa_index/ref_genome

For project bwa index:

bwa_1:
    module: bwa_mapper
    base: bwa_bld_ind
    script_path: /path/to/bwa
    mod: mem
    scope: project

References

Li, H. and Durbin, R., 2009. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics, 25(14), pp.1754-1760.

STAR_mapper

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running STAR mapper:

Requires

  • fastq files in one of the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

  • If scope is set (must come after a STAR_builder module, which populates the required slots):

    • STAR index directories in:

      • sample_data[<sample>]["STAR.index"] if scope = "sample"

      • sample_data["STAR.index"] if scope = "project"

    • Reference fasta files in:

      • sample_data[<sample>]["STAR.fasta"] if scope = "sample"

      • sample_data["STAR.fasta"] if scope = "project"

Output

  • Puts output sam files in the following slots:

    • self.sample_data[<sample>]["sam"]

  • Alternatively, if --outSAMtype is set to BAM, puts output BAM files in the following slots:

    • self.sample_data[<sample>]["bam"]

    • self.sample_data[<sample>]["bam_unsorted"]

  • High confidence collapsed splice junctions (SJ.out.tab file) will be stored in:

    • self.sample_data[<sample>]["SJ.out.tab"]

  • If --quantMode contains TranscriptomeSAM, alignments BAM translated into transcript coordinates will be stored in:

    • self.sample_data[<sample>]["TranscriptomeSAM"]

  • If --quantMode contains GeneCounts, the ReadsPerGene.out.tab file will be stored:

    • self.sample_data[<sample>]["GeneCounts"]

  • If --outWigType is set, will store outputs in (see the sketch after this list):

    • if --outWigType is wiggle

      • self.sample_data[<sample>]["wig2_UniqueMultiple"]

      • self.sample_data[<sample>]["wig2_Unique"]

      • self.sample_data[<sample>]["wig1_UniqueMultiple"]

      • self.sample_data[<sample>]["wig1_Unique"]

      • self.sample_data[<sample>]["wig"]

    • if --outWigType is bedGraph

      • self.sample_data[<sample>]["bdg2_UniqueMultiple"]

      • self.sample_data[<sample>]["bdg2_Unique"]

      • self.sample_data[<sample>]["bdg1_UniqueMultiple"]

      • self.sample_data[<sample>]["bdg1_Unique"]

      • self.sample_data[<sample>]["bdg"]

  • Puts the name of the mapper in:

    self.sample_data[<sample>]["mapper"]

  • Puts fasta of reference genome (if one is given in param file) in:

    self.sample_data[<sample>]["reference"]
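
As an illustration of the signal outputs listed above, a minimal sketch producing bedGraph files (STAR signal output requires a coordinate-sorted BAM, hence the --outSAMtype setting; paths are placeholders):

STAR_map_sig:
    module:             STAR_mapper
    base:               STAR_bld_ind
    script_path:        /path/to/STAR
    scope:              project
    redirects:
        --outSAMtype:   BAM SortedByCoordinate
        --outWigType:   bedGraph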

Parameters that can be set

  • ref_genome (path to genome fasta)

  • scope (project | sample): The scope from which to take the genome directory.

Note

You can set the RG attribute of the resulting SAM/BAM files with the redirected parameter --outSAMattrRGline. This will set the equivalent STAR parameter.

By default, the parameter will be set to include ID and SM tags, both set to the sample name. You can set the SM tag, but any ID tags will be removed and replaced with the sample name. A sketch follows the parameter-file examples below.

Lines for parameter file

For external index:

STAR_map:
    module:             STAR_mapper
    base:               STAR_bld_ind
    script_path:        /path/to/STAR
    redirects:
        --readMapNumber:    1000
        --genomeDir:        /path/to/genome/STAR_index/

For project STAR index:

STAR_map:
    module:             STAR_mapper
    base:               STAR_bld_ind
    script_path:        /path/to/STAR
    scope:              project
    redirects:
        --readMapNumber:    1000
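
To set the read-group SM tag described in the note above, a minimal sketch (my_project is a placeholder; the ID tag will still be replaced with the sample name):

STAR_map_RG:
    module:             STAR_mapper
    base:               STAR_bld_ind
    script_path:        /path/to/STAR
    scope:              project
    redirects:
        --outSAMattrRGline:     SM:my_project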

References

Dobin, A., Davis, C.A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M. and Gingeras, T.R., 2013. STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), pp.15-21.

STAR_builder

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running STAR genome index construction:

Requires

  • fasta files in one of the following slots:

    • sample_data["fasta.nucl"]

    • sample_data[<sample>]["fasta.nucl"]

  • If --sjdbGTFfile is set in redirects, but left empty, will expect to find a GTF file here (see the sketch below):

    • sample_data["gtf"] if scope = "project"

    • sample_data[<sample>]["gtf"] if scope = "sample"

  • If --sjdbFileChrStartEnd is set in redirects, but left empty, will expect to find an SJ file here:

    • sample_data["SJ.out.tab"] if scope = "project"

    • sample_data[<sample>]["SJ.out.tab"] if scope = "sample"
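
A minimal sketch of the empty-redirect behaviour for --sjdbGTFfile (import1 stands for any base step that stored a project-scope gtf; --sjdbOverhang is an ordinary STAR parameter shown for context):

STAR_bld_gtf:
    module:             STAR_builder
    base:               import1
    script_path:        /path/to/STAR
    scope:              project
    redirects:
        --sjdbGTFfile:              # left empty: taken from sample_data["gtf"]
        --sjdbOverhang:     100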

Output

Puts output index files in one of the following slots:

  • self.sample_data[<sample>]["STAR.index"]

  • self.sample_data["project_data"]["STAR.index"]

Puts the reference fasta file in one of the following slots:

  • self.sample_data[<sample>]["STAR.fasta"]

  • self.sample_data["project_data"]["STAR.fasta"]

Parameters that can be set

  • scope (project | sample): Not used.

Lines for parameter file

STAR_bld_ind:
    module:             STAR_builder
    base:               trinity1
    script_path:        /path/to/STAR
    scope:              project
    qsub_params:
        queue:          star.q
    redirects:
        --genomeSAindexNbases:  12
        --genomeChrBinNbits:    10

References

Dobin, A., Davis, C.A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M. and Gingeras, T.R., 2013. STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), pp.15-21.

STAR_LoadRemoveGenome

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for loading a STAR genome into RAM for use by subsequent STAR mapping jobs.

Note

This module saves memory and time. Set the parameter --genomeLoad in the STAR mapping instance to LoadAndKeep. This will load the genome once into memory and use it repeatedly for all instances executed on the same node. When all mapping jobs are completed, the scripts produced by this instance will remove the genome from RAM on all the nodes used.

Tip

Make sure you set the node parameter in qsub_params to all the nodes in use by the base STAR_mapper instance.

Attention

Currently defined for project-scope or external genomes only. Not used for sample-scope genomes.

Note

Loading a genome is not really required. It will be loaded by the first instance of STAR.

Requires

  • A STAR genome in:

    • sample_data["STAR.index"]

Alternatively, a STAR genome index can be passed with the --genomeDir parameter.

Output

No output is created

Parameters that can be set

  • genome (load | remove): Load or remove the genome from RAM.

  • qsub_params: node: Nodes on which to load/unload the genome.

  • scope (project | sample): The scope from which to take the genome directory. Currently not in use.

Lines for parameter file

For external index:

STAR_remove_genome:
    module:             STAR_LoadRemoveGenome
    base:               STAR_map
    script_path:        '{Vars.paths.STAR}STAR'
    genome:             remove
    qsub_params:
        queue:          queue.q
        node:           {Vars.nodes}
    redirects:
        --genomeDir:    /path/to/STAR/genome_directory

For project STAR index:

STAR_remove_genome:
    module:             STAR_LoadRemoveGenome
    base:               STAR_map
    script_path:        '{Vars.paths.STAR}STAR'
    genome:             remove
    qsub_params:
        queue:          queue.q
        node:           {Vars.nodes}
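
The complementary load instance is sketched below; base it on the index-building step and run it before the mappers, which should set --genomeLoad to LoadAndKeep:

STAR_load_genome:
    module:             STAR_LoadRemoveGenome
    base:               STAR_bld_ind
    script_path:        '{Vars.paths.STAR}STAR'
    genome:             load
    qsub_params:
        queue:          queue.q
        node:           {Vars.nodes}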

References

Dobin, A., Davis, C.A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M. and Gingeras, T.R., 2013. STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), pp.15-21.

samtools *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for executing samtools on a SAM or BAM file.

Warning

This module is in beta stage. Please report issues and we'll try to solve them.

Attention

The module was tested on samtools 1.9

Currently, the samtools programs included in the module are the following:

  • view

  • sort

  • index

  • flagstat

  • stats

  • idxstats

  • depth

  • fastq/a

  • merge

  • mpileup

Note

Order of samtools subprogram execution

  • The samtools programs are executed in the order given in the parameter file

  • File types are passed from one program to the next

  • In order to execute one program more than once, append digits to the program name, e.g. sort2, index3 etc.

Arguments can be passed to the tools following the program name in the parameter file, e.g.:

sort: -n -@ 10

Alternatively, they can be passed in a redirects block:

sort:
    redirects: -n -@ 10

Please do NOT pass input and output arguments - they are set by the module.

Some of the tools are defined only when the scope is sample:

  • merge merges the sample-wise BAM files into a project BAM file.

  • mpileup creates a project VCF/BCF/mpileup file from the sample BAM files.

Attention

Treatment of regions

If you want to limit the program to a specific region, pass the program name a block with a 'region' section. If you want to set the region and pass some redirects, add a 'redirects' section as well. For example:

mpileup:
    redirects:      --max-depth INT -v
    region:         chr2:212121-32323232

Attention

Treatment of BED files

In samtools view, bedcov, depth and mpileup, you can pass a BED file by adding a bed field in the tool block, with one of the following values:

  • sample - use a sample-scope BED file

  • project - use a project-scope BED file

  • A full path to a BED file.

Example:

view:
     redirects:      -uh  -q 30 -@ 20 -F 4
     bed:            /path/to/external/bed

Requires

  • A SAM file in the following location:

    • sample_data[<sample>]["sam"] (for scope=sample)

    • sample_data["project_data"]["sam"] (for scope=project)

  • Or a BAM file in:

    • sample_data[<sample>]["bam"] (for scope=sample)

    • sample_data["project_data"]["bam"] (for scope=project)

Note

If both BAM and SAM files exist, select the one to use with type2use (see section Parameters that can be set).

Output

Depending on the parameters, will put files in different types (e.g. bam, cram, sam, bai, crai, vcf, bcf, mpileup, fasta.{F,R,S}, fastq.{F,R,S}). Please use stop_and_show to see the types produced by your instance of this module.

Note

If scope is set to project, the above mentioned output files will be created in the project scope.

Note

merge and mpileup are only defined when scope is sample. See above

By default, all files are saved. To keep only the output from specific programs, add a keep_output section containing a list of programs for which the output should be saved. All other files will be discarded.

Parameters that can be set

  • scope (sample | project): Scope of the SAM/BAM to operate on. Defaults to sample.

  • view (e.g. -buh -q 30): samtools view parameters.

  • sort (e.g. -@ 20): samtools sort parameters.

  • index: samtools index parameters.

  • flagstat: Leave empty. flagstat takes no parameters.

  • stats: samtools stats parameters.

  • idxstats: samtools idxstats parameters.

  • fastq/a: samtools fastq/a parameters.

  • merge: samtools merge parameters.

  • region: A region to limit the region-limitable programs, such as view, merge and mpileup.

  • type2use (sam | bam): Type of file to use. Must exist in the scope.

  • keep_output (e.g. [sort, view, sort2]): A list of programs for which to store the output files. By default, all files are saved.

Lines for parameter file

sam_bwt1:
    module:             samtools
    base:               bwt1
    script_path:        {Vars.paths.samtools}
    qsub_params:
        -pe:            shared 20
    region:             chr2:212121-32323232
    scope:              sample
    # First 'view'. Use FLAG to filter alignments:
    view:               -uh  -q 30 -@ 20 -F 4 -O bam
    # First 'sort'. Sort by coordinates:
    sort:               -@ 20
    # Second 'view'. Use region to filter alignments:
    view2:
        redirects:      -buh  -q 30 -@ 20
        region:         chr2:212121-32323232
    index:
    flagstat:
    stats:              --remove-dups
    idxstats:
    # Second 'sort'. Sort by name:
    sort2:               -n -@ 20
    # Get sequences from name-sorted BAM file:
    fastq:
    # Merge the name-sorted BAM files
    merge:
        region:         chr2:212121-32323232
    # Create a VCF from the merged, name-sorted BAM files
    mpileup:
        redirects:      --max-depth INT -v
        region:         chr2:212121-32323232
    keep_output:        [sort, view, index, flagstat, stats, fastq, mpileup, merge]
    # stop_and_show:

References

Li, H., Handsaker, B., Wysoker, A., Fennell, T., Ruan, J., Homer, N., Marth, G., Abecasis, G. and Durbin, R., 2009. The sequence alignment/map format and SAMtools. Bioinformatics, 25(16), pp.2078-2079.

RSEM

Authors

Liron Levin

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Short Description

A module for running RSEM

Requires

  • fastq files in:

    • self.sample_data[sample]["fastq.F"]

    • self.sample_data[sample]["fastq.R"]

    • self.sample_data[sample]["fastq.S"]

  • or bam file in

    self.sample_data[sample]["bam"]

Output

  • puts output bam files (if the input is fastq) in:

    self.sample_data[sample]["bam"]

  • puts the location of RSEM results in:

    • self.sample_data[sample]["RSEM"]

    • self.sample_data[sample]["genes.results"]

    • self.sample_data[sample]["isoforms.results"]

Parameters that can be set

  • mode (transcriptome | genome): Whether the reference is a genome or a transcriptome.

  • gff3: Use if the mode is genome and the annotation file is in gff3 format.

Comments

  • This module was tested on:

    RSEM v1.2.25, bowtie2 v2.2.6

Lines for parameter file

Step_Name:                                                   # Name of this step
    module: RSEM                                             # Name of the module used
    base:                                                    # Name of the step [or list of names] to run after [must be after a bam file generator step or merge with fastq files]
    script_path:                                             # Command for running the RSEM script 
    qsub_params:
        -pe:                                                 # Number of CPUs to reserve for this analysis
    mode:                                                    # transcriptome or genome
    export_transcriptome:                                    # In genome mode, set the extracted transcriptome as the new project-level fasta.nucl and extract the transcript-to-gene-map file as project-level gene_trans_map
    annotation:                                              # For Genome mode: the location of GTF file [the default] , for GFF3 use the gff3 flag. For Transcriptome mode: transcript-to-gene-map file.
                                                             # If annotation is set to Trinity the transcript-to-gene-map file will be generated using the from_Trinity_to_gene_map script
                                                             # If not set, only the reference file will be used, with transcripts treated as unrelated
    from_Trinity_to_gene_map_script_path:                    # If the mode is transcriptome and the reference was assembled using Trinity it is possible to generate the transcript-to-gene-map file automatically using this script
                                                             # If annotation is set to Trinity and this line is empty or missing it will try using the module's associated script
    gff3:                                                    # Use if the mode is genome and the annotation file is in gff3 format
    mapper:                                                  # bowtie/bowtie2/star 
    mapper_path:                                             # Location of mapper script
    rsem_prepare_reference_script_path:                      # Location of preparing reference script
    plot_stat:                                               # Generate statistical plots
    plot_stat_script_path:                                   # Location of statistical plot generating script
    reference:                                               # The reference genome/transcriptome location [FASTA file]
    rsem_generate_data_matrix_script_path:                   # Location of the final matrix generating script
                                                             # If this line is empty or missing it will try using the module's associated script
    redirects:
        --append-names:                                      # RSEM will append gene_name/transcript_name to the result files
        --estimate-rspd:                                     # Enables RSEM to learn from the data how the reads are distributed across a transcript
        -p:                                                  # Number of CPUs to use in this analysis
        --bam:                                               # Will use bam files and not fastq
        --no-bam-output:
        --output-genome-bam:                                 # Alignments in genomic coordinates (only if mode is genome)

References

Li, B. and Dewey, C.N., 2011. RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome. BMC Bioinformatics, 12(1), p.323.

htseq_count

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running htseq-count:

See htseq-count documentation.

Requires

  • SAM or BAM files in one of the following slots:

    • sample_data[<sample>]["bam"]

    • sample_data[<sample>]["sam"]

Output

  • Puts the output file in:

    self.sample_data[<sample>]["HTSeq.counts"]

Parameters that can be set

  • gtf (path to GTF annotation file): The annotation file to count reads against.

  • -f | --format (sam | bam): In redirects. Tells htseq-count which file to use. If not specified, will use whichever file exists.

Lines for parameter file

htseq_c1:
    module:         htseq_count
    base:           samtools_STAR1
    script_path:    /storage16/app/bioinfo/python_packages/bin/htseq-count
    gtf:            /fastspace/bioinfo_databases/STAR_GRCh38_Gencode21/gencode.v21.annotation.gtf
    redirects:
        --format:   bam
        -s:         'no'
        -m:         intersection-nonempty

References

Anders, S., Pyl, P.T. and Huber, W., 2015. HTSeq—a Python framework to work with high-throughput sequencing data. Bioinformatics, 31(2), pp.166-169.

RSEM_prep

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running rsem-prepare-reference:

Requires

  • fasta files in one of the following slots:

    • sample_data["fasta.nucl"] (scope = project)

    • sample_data[<sample>]["fasta.nucl"] (scope = sample)

  • If neither exists, please supply the reference parameter.

Attention

If a gene_trans_map type exists, its value will be used for --transcript-to-gene-map, unless --transcript-to-gene-map is explicitly passed in redirects (see the second parameter-file example below).

Output

Puts output index files in one of the following slots:

  • self.sample_data[<sample>]["RSEM.index"]

  • self.sample_data["project_data"]["RSEM.index"]

Parameters that can be set

  • scope (project | sample): Where to take the reference from.

  • reference (path to reference): Use this fasta file. See the definition of reference_fasta_file(s) in the ARGUMENTS section of the rsem-prepare-reference help.

Lines for parameter file

RSEM_prep_ind:
    module:             RSEM_prep
    base:               merge1
    script_path:        /path/to/RSEM
    reference:              /path/to/fasta
    redirects:
        --gtf:          /path/to/gtf
        --transcript-to-gene-map: /path/to/map_file
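
Relying on an existing gene_trans_map type (a minimal sketch; trinity1 stands for an upstream step that stored the map, which RSEM_prep will then pass to --transcript-to-gene-map automatically):

RSEM_prep_trin:
    module:             RSEM_prep
    base:               trinity1
    script_path:        /path/to/rsem-prepare-reference
    scope:              project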

References

RSEM_mapper

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running rsem-calculate-expression:

Requires

  • fasta files in one of the following slots:

    • sample_data["project_data"]["fasta.nucl"] (scope = project)

    • sample_data[<sample>]["fasta.nucl"] (scope = sample)

  • If neither exists, please supply the reference parameter.

Output

Puts the output count files in one of the following slots:

  • self.sample_data[<sample>]["genes.counts"]

  • self.sample_data[<sample>]["isoforms.counts"]

And the following BAMs, depending on redirected params:

  • self.sample_data[<sample>]["genome.unsorted.bam"]

  • self.sample_data[<sample>]["genome.bam"]

  • self.sample_data[<sample>]["transcript.unsorted.bam"]

  • self.sample_data[<sample>]["transcript.bam"]

Parameters that can be set

  • scope (project | sample): The scope of the RSEM index. Must match the scope in the RSEM_prep instance.

  • result2use (genes | isoforms): Summarize counts at the gene or isoform level.

Lines for parameter file

Mapping fastq files:

RSEM_map:
    module:             RSEM_mapper
    base:               merge1
    script_path:        {Vars.paths.RSEM.rsem-calculate-expression}
    reference:              /path/to/fasta
    redirects:
        --gtf:          /path/to/gtf
        --transcript-to-gene-map: /path/to/map_file

Parsing an existing BAM alignment file:

RSEM_parse_bam:
    module:         RSEM_mapper
    base:           mv_transcript_bam_to_bam
    script_path:    {Vars.paths.RSEM.rsem-calculate-expression}
    scope:          project
    qsub_params:
        -pe:        shared 20
    redirects:
        --num-threads:  20

References

BAM Conversion to Other Formats

Modules included in this section

genomeCoverageBed *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running bedtools genomecov:

The module builds a bedgraph (bdg) file based on an existing BAM file.

Requires

  • BAM file in the following slot:

    • sample_data[<sample>]["bam"]

Output

  • Puts output BedGraph files in the following slots:
    • sample_data[<sample>]["bdg"]

Parameters that can be set

  • -g (path to chrom.sizes): You must redirect the -g parameter. Create the chrom.sizes file for the reference genome with samtools faidx followed by cut -f1,2.

Lines for parameter file

genCovBed_bwt1:
    module: genomeCoverageBed
    base: sam_bwt1
    script_path: /path/to/bedtools/bin/genomeCoverageBed
    redirects:
        -bg: 
        -g: /path/to/ref_genome/ref_genome.chrom.sizes

References

UCSC_BW_wig

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for creating wig and bigwig files using UCSC tools:

The module creates bigwig and wig files from the current active BedGraph file.

Requires

  • BedGraph file in the following slot:

    • sample_data[<sample>]["bdg"]

Output

  • Puts the output bigwig and wig files in the following slots:

    • self.sample_data[<sample>]["bw"]

    • self.sample_data[<sample>]["wig"]

Parameters that can be set

  • bedGraphToBigWig_params (e.g. -blockSize=10 -itemsPerSlot=20): Parameters to pass to bedGraphToBigWig.

  • bigWigToWig_params (e.g. -chrom X1 -start X2 -end X3): Parameters to pass to bigWigToWig.

  • script_path: Path to the directory where the UCSC tools are located.

  • scope (sample | project): Where the 'bdg' is located.

Note

Set script_path to the directory of the UCSC tools, not to a specific tool! If the tools are in the PATH, as when installing with Conda, leave script_path empty. Both bedGraphToBigWig and bigWigToWig will be executed. To set specific params, use bedGraphToBigWig_params and bigWigToWig_params, respectively.

Lines for parameter file

UCSCmap_bams:
    module:         UCSC_BW_wig
    base:           genCovBed_sam
    script_path:    /path/to/ucscTools/kentUtils/bin/
    genome:        /path/to/ref_genome.chrom.sizes
    bedGraphToBigWig_params:     -blockSize=10 -itemsPerSlot=20
    bigWigToWig_params:          -chrom X1 -start X2 -end X3

References

Kent, W.J., Zweig, A.S., Barber, G., Hinrichs, A.S. and Karolchik, D., 2010. BigWig and BigBed: enabling browsing of large distributed datasets. Bioinformatics, 26(17), pp.2204-2207.

IGV_count *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running IGVtools count:

Requires

  • Either SAM or BAM files in the following slots:

    • sample_data[<sample>]["bam"]

    • sample_data[<sample>]["sam"]

Output

  • Puts output tdf or wig files in one of the following slots:

    • self.sample_data[<sample>]["wig"]

    • self.sample_data[<sample>]["tdf"]

Parameters that can be set

  • format (wig | tdf): Determines whether to create a 'wig' or a 'tdf' file.

  • genome: Path to the chrom.sizes file for the reference genome.

Lines for parameter file

IGVcount1:
    module: IGV_count
    base: samtools1
    script_path: java -Xmx1500m -jar /path/to/igvtools.jar count
    format: tdf   # Options: 'tdf' or 'wig'
    genome: /path/to/genome.chrom.sizes

References

Thorvaldsdóttir, H., Robinson, J.T. and Mesirov, J.P., 2013. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Briefings in bioinformatics, 14(2), pp.178-192.

IGV_toTDF *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running IGVtools toTDF:

Requires

  • WIG file in the following slot:

    • sample_data[<sample>]["wig"]

Output

  • Puts the output tdf file in the following slot:

    • self.sample_data[<sample>]["tdf"]

Parameters that can be set

  • genome: Path to the chrom.sizes file for the reference genome.

Lines for parameter file

IGV2TDF:
    module: IGV_toTDF
    base: samtools1
    script_path: /path/to/bin/java -Xmx1500m -jar /path/to/igvtools.jar toTDF 
    genome: /path/to/genome.chrom.sizes

References

Thorvaldsdóttir, H., Robinson, J.T. and Mesirov, J.P., 2013. Integrative Genomics Viewer (IGV): high-performance genomics data visualization and exploration. Briefings in bioinformatics, 14(2), pp.178-192.

ChIP-seq

Modules included in this section

macs2_callpeak *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running macs2 callpeak:

Requires

  • bam files in the following slots:

    • sample_data[<sample>]["bam"]

  • If using control (input) samples, make sure you include a sample-control table in your sample file (see the sketch below).
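An illustrative sketch of such a table in the sample file. The section name and sample:control pairing follow NeatSeq-Flow sample-file conventions, but the sample names here are hypothetical; consult the NeatSeq-Flow sample-file documentation for the exact syntax:

Sample_Control
ChIP1:Input1
ChIP2:Input1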

Output

  • Puts macs2 output files in the following slots:

    • self.sample_data[<sample>]["prefix"]

    • self.sample_data[<sample>]["peak_bed"]

    • self.sample_data[<sample>]["peak_xls"]

    • self.sample_data[<sample>]["summit_bed"]

  • If --bdg (or -B) was specified, puts output bdg files in the following slots:

    • self.sample_data[<sample>]["control_lambda"] - Control BedGraph

    • self.sample_data[<sample>]["treat_pileup"] - Treatment BedGraph

    • self.sample_data[<sample>]["bdg"] - Treatment BedGraph

    • self.sample_data[<control>]["bdg"] - Control BedGraph

  • If bedToBigBed_path was specified, puts output bigbed files in the following slots:

    • self.sample_data[<sample>]["bb"]

  • If getfasta was specified, puts output fasta files in the following slots:

    • self.sample_data[<sample>]["peak_fasta"]

    • self.sample_data[<sample>]["fasta.nucl"]

Parameters that can be set

  • bedToBigBed_path (path to bedToBigBed): Runs bedToBigBed to convert the peak bed files into bigbed for uploading to UCSC.

  • chrom.sizes (path to chrom.sizes for the reference genome): If running bedToBigBed, you must supply the genome chrom.sizes file.

  • getfasta: If set, a fasta file containing the peak sequences will be produced.

Lines for parameter file

macs1_CP:
    module: macs2_callpeak
    base: samtools1
    script_path: /path/to/bin/macs2 callpeak
    bedToBigBed_path: /path/to/kentUtils/bin/bedToBigBed
    chrom.sizes: /path/to/genome.chrom.sizes
    getfasta: /path/to/bedtools getfasta -name -s
    redirects:
        --SPMR: 
        --bdg: 
        -g:     mm
        --bw:   400

References

Feng, J., Liu, T., Qin, B., Zhang, Y. and Liu, X.S., 2012. Identifying ChIP-seq enrichment using MACS. Nature protocols, 7(9), pp.1728-1740.

macs2_bdgcmp

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running macs2 bdgcmp:

Requires

  • Files in the following slots:

    • self.sample_data[<sample>]["control_lambda"] - Control BedGraph

    • self.sample_data[<sample>]["treat_pileup"] - Treatment BedGraph

Output

  • Puts macs2 output files in the following slots:

    • self.sample_data[<sample>]["bdg"] - the comparison BedGraph

    • self.sample_data[<sample>]["bigwig"] - if slop_path and ucscTools_path were passed

    • self.sample_data[<sample>]["wig"] - if slop_path and ucscTools_path were passed

    • self.sample_data[<sample>]["tdf"] - in TDF format (if slop_path, ucscTools_path and toTDF_path were passed)

Parameters that can be set

  • slop_path (path to bedtools slop): Part of the process for converting bdg files into bigwig and wig.

  • ucscTools_path (path to ucscTools): The UCSC tools bedClip, bedGraphToBigWig and bigWigToWig are part of the process for converting bdg files into bigwig and wig.

  • toTDF_path (path to toTDF): Converts the wig file into a TDF file.

  • genome (path to chrom.sizes for the reference genome): Required for the bdg conversions.

Lines for parameter file

bdgcmp:
    module: macs2_bdgcmp
    base: macs1
    script_path: /path/to/macs2 bdgcmp
    genome: /path/to/chrom.sizes.txt
    slop_path: /path/to/bin/bedtools slop
    ucscTools_path: /path/to/ucscTools/bin
    toTDF_path: /path/to/bin/java -Xmx1500m -jar /path/to/igvtools.jar toTDF
    redirects:
        --method: FE

References

Feng, J., Liu, T., Qin, B., Zhang, Y. and Liu, X.S., 2012. Identifying ChIP-seq enrichment using MACS. Nature protocols, 7(9), pp.1728-1740.

CEAS

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running CEAS:

Requires

  • Files in the following slots:

    • self.sample_data[<sample>]["peak_bed"] - Sample peak_bed file

    • self.sample_data[<sample>]["wig"] - An appropriate wig file

Output

  • Puts CEAS output files in the following slots:

    • sample_data[sample]["CEAS.xls"]

    • sample_data[sample]["CEAS.R"]

    • sample_data[sample]["CEAS.plots"]

Parameters that can be set

Lines for parameter file

CEAS1:
    module: CEAS
    base: UCSC_BW_to_wig
    script_path: /path/to/bin/ceas
    redirects:
        -g: /path/to/hg19.refGene

References

Shin, H., Liu, T., Manrai, A.K. and Liu, X.S., 2009. CEAS: cis-regulatory element annotation system. Bioinformatics, 25(19), pp.2605-2606.

Genome Assembly

Modules included in this section

clc_assembl

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for assembling reads using the CLC assembler.

Requires

  • fastq files in at least one of the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

Output:

  • puts fasta output files in the following slots:

    • if scope set to sample:

      • sample_data[<sample>]["fasta.nucl"]

      • sample_data[<sample>]["clc_assembl.contigs"]

      • Also, sets sample_data[<sample>]["assembler"] = "clc_assembl"

    • if scope set to project:

      • sample_data["fasta.nucl"]

      • sample_data["clc_assembl.contigs"]

      • Also, sets sample_data[<sample>]["assembler"] = "clc_assembl"

Parameters that can be set

  • scope (sample|project): Set to project to assemble all project reads into one assembly.

  • p (e.g. 'fb ss 180 250'): Sets the -p parameter passed to CLC for paired-end reads. Required only if the project includes paired-end reads.

Lines for parameter file

clc1:
    module: clc_assembl
    base: trim1
    script_path: /path/to/clc_assembler
    qsub_params:
        -pe:    shared 30
        node:   sge37
    scope:      sample
    p:          fb ss 180 250 
    redirects:
        --cpus: 30

megahit_assembl

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for assembling reads using MEGAHIT assembler.

Requires

  • fastq files in at least one of the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

Output:

  • puts fasta output files in the following slots:

    • if scope set to sample:

      • sample_data[<sample>]["fasta.nucl"]

      • sample_data[<sample>]["megahit_assembl.contigs"]

      • Also, sets sample_data[<sample>]["assembler"] = "megahit_assembl"

    • if scope set to project:

      • sample_data["fasta.nucl"]

      • sample_data["megahit_assembl.contigs"]

      • Also, sets sample_data[<sample>]["assembler"] = "megahit_assembl"

Parameters that can be set

  • scope (sample|project): Set to project to assemble all project reads into one assembly.

Lines for parameter file

megahit1:
    module: megahit_assembl
    base: trim1
    script_path: /path/to/megahit
    qsub_params:
        -pe: shared 30
        node: sge37
    scope: project
    redirects:
        --continue: 
        --num-cpu-threads: 30

References

Li, D., Liu, C.M., Luo, R., Sadakane, K. and Lam, T.W., 2015. MEGAHIT: an ultra-fast single-node solution for large and complex metagenomics assembly via succinct de Bruijn graph. Bioinformatics, 31(10), pp.1674-1676.

spades_assembl *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for assembling reads using the SPAdes assembler.

Requires

  • fastq files in at least one of the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

Output:

  • puts fasta output files in the following slots:

    • for sample-wise assembly:

      • sample_data[<sample>]["fasta.nucl"]

      • sample_data[<sample>]["spades_assembl.contigs"]

      • sample_data[<sample>]["spades_assembl.scaffolds"]

    • for mega assembly (not defined yet):

      • sample_data["fasta.nucl"]

      • sample_data["spades_assembl.contigs"]

      • sample_data["spades_assembl.scaffolds"]

Parameters that can be set

  • scope (sample|project): Set if the project-wide fasta slot should be used.

  • truncate_names: Truncates contig names, e.g. '>NODE_82_length_18610_cov_38.4999_ID_165' will be changed to '>NODE_82_length_18610'.

  • use_corrected: Use the read files produced by read correction for downstream steps.

Lines for parameter file

spades1:
    module: spades_assembl
    base: trim1
    script_path: /path/to/bin/spades.py
    truncate_names: 
    redirects:
        --careful: 

References

Bankevich, A., Nurk, S., Antipov, D., Gurevich, A.A., Dvorkin, M., Kulikov, A.S., Lesin, V.M., Nikolenko, S.I., Pham, S., Prjibelski, A.D. and Pyshkin, A.V., 2012. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. Journal of computational biology, 19(5), pp.455-477.

quast *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running quast on fasta assemblies:

QUAST is executed on the fasta file as follows:

  • If ‘scope’ is specified, the appropriate fasta will be used. An error will occur if the fasta does not exist.

  • If ‘scope’ is not specified, if a project-wide fasta exists, it will be used. Otherwise, sample-wise fasta files will be used. If none exist, an error will occur.

Note

With compare_mode, you tell the module to run quast on multiple assemblies. This is done in one of three ways:

  • If scope is sample and a single base step is defined, it will compare between the samples.

  • If scope is sample and more than one base step is defined, it will compare between the assemblies found in the base steps, for each sample separately.

  • If scope is project, it will compare between the assemblies found in the base steps at the project level.

Requires

  • fasta files in one of the following slots:

    • sample_data["fasta.nucl"]

    • sample_data[<sample>]["fasta.nucl"]

Output

  • Puts output directory in one of:
    • self.sample_data["project_data"]["quast"]

    • self.sample_data[<sample>]["quast"]

Parameters that can be set

  • scope (project|sample): Indicates whether to use the project or sample contigs file.

  • compare_mode: If 'scope' is 'sample', specifies whether to analyse each sample separately or to create a single comparison report for all samples.

Lines for parameter file

  1. A quast report for each sample separately:

quast1:
    module: quast
    base: spades1
    script_path: /path/to/quast.py
    scope: sample
    redirects:
        --fast: 
  2. A quast report comparing the sample assemblies:

quast1:
    module: quast
    base: spades1
    script_path: /path/to/quast.py
    compare_mode: 
    scope: sample
    redirects:
        --fast: 
  3. A quast report comparing the project assemblies from different stages of the analysis:

quast1:
    module: quast
    base: 
        - spades1
        - megahit1
    script_path: /path/to/quast.py
    compare_mode: 
    scope: project
    redirects:
        --fast: 

References

Gurevich, A., Saveliev, V., Vyahhi, N. and Tesler, G., 2013. QUAST: quality assessment tool for genome assemblies. Bioinformatics, 29(8), pp.1072-1075.

Transcriptome Assembly

trinity *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for RNA_seq assembly using the Trinity assembler.

Attention

This module was tested on release 2.5.x. It should also work with 2.4.x.

For old versions of Trinity, you might need to use trinity_old module.

The main difference between the modules is that trinity creates an output directory containing the word 'trinity', as required by the newer releases of Trinity.

In order to run on the cluster, you need to install HpcGridRunner.

Requires

  • fastq files in at least one of the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

  • bam file for Genome Guided assembly in:

    • sample_data["bam"]

    • sample_data[<sample>]["bam"]

Output:

  • puts fasta output files in the following slots:

    • for sample-wise assembly:

      • sample_data[<sample>]["fasta.nucl"]

      • sample_data[<sample>]["Trinity.contigs"]

    • for project-wise assembly:

      • sample_data["fasta.nucl"]

      • sample_data["Trinity.contigs"]

Parameters that can be set

  • scope (sample|project): Set if the project-wide fasta slot should be used.

  • skip_gene_to_trans_map: Set to skip construction of the transcript map. You can use the dedicated module, Trinity_gene_to_trans_map, instead (see the sketch following the example below). Both put the map in the same slot (gene_trans_map).

  • get_Trinity_gene_to_trans_map: Path to get_Trinity_gene_to_trans_map.pl. If not passed, will try guessing it from the Trinity path.

  • TrinityStats: Block with 'path:' set to the TrinityStats.pl executable.

  • genome_guided: Use if you have a coordinate-sorted, project-level BAM file with reads mapped to a reference genome.

  • Group_by (name of the column in the grouping file to use for grouping): Only works in project scope. Will create a sample file for Trinity.

Lines for parameter file

trinity1:
    module:                 trinity
    base:                   trin_tags1
    script_path:            {Vars.paths.Trinity}
    qsub_params:
        node:               sge213
        -pe:                shared 20
    redirects:
        --grid_exec:        "{Vars.paths.hpc_cmds_GridRunner} --grid_conf {Vars.paths.SGE_Trinity_conf} -c" 
        --grid_node_CPU:    40 
        --grid_node_max_memory: 80G 
        --max_memory:        80G 
        --seqType:          fq
        --min_kmer_cov:     2
        --full_cleanup:
    TrinityStats:
        path:           {Vars.paths.TrinityStats}
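
If you prefer to skip map construction within trinity and build the map with the dedicated Trinity_gene_to_trans_map module instead, here is a sketch of the two-step chain (step names are hypothetical; the Trinity_gene_to_trans_map section further below shows the same module in isolation):

trinity1:
    module:                 trinity
    base:                   trin_tags1
    script_path:            {Vars.paths.Trinity}
    skip_gene_to_trans_map:
    redirects:
        --seqType:          fq

Gene_Trans_Map:
    module:         Trinity_gene_to_trans_map
    base:           trinity1
    script_path:    {Vars.paths.get_Trinity_gene_to_trans_map}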

References

Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q. and Chen, Z., 2011. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature biotechnology, 29(7), p.644.

add_trinity_tags *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for adding the tags required by Trinity to the ends of the read names. See the Strand specific assembly section of the Trinity manual.

The module uses awk, so script_path is not actually used. Since NeatSeq-Flow nevertheless requires a script_path, pass a placeholder value or leave it blank (see the example below).

Attention

The awk command is set to remove all text in the read title following any whitespace. Make sure this information is not needed downstream; if it is, you can perform the mapping step on the base step of add_trinity_tags instead. A sketch of the kind of command involved follows.
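For illustration only, a minimal sketch of the kind of awk command involved; this is not the module's exact command and the file names are hypothetical. For a forward-reads file, it strips the title after the first whitespace and appends the /1 tag (reverse reads would get /2):

awk '{if (NR % 4 == 1) {sub(/[ \t].*/, ""); print $0 "/1"} else {print}}' \
    reads_F.fastq > reads_F.tagged.fastq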

Requires

  • fastq files in at least one of the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

Output:

  • puts fastq output files (with added tags) in the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

Lines for parameter file

trintags:
    module:      add_trinity_tags
    base:        trim1
    script_path: NOT_USED

Trinity_gene_to_trans_map

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for creating a gene vs. transcript map for a Trinity based assembly.

Requires

  • fasta files in at least one of the following slots:

    • sample_data[<sample>]["fasta.nucl"] (if scope = sample)

    • sample_data["project_data"]["fasta.nucl"] (if scope = project)

Output:

  • puts gene to trans map in:

    • sample_data[<sample>]["gene_trans_map"] (if scope = sample)

    • sample_data["project_data"]["gene_trans_map"] (if scope = project)

Parameters that can be set

  • scope (sample|project): Use the sample or project scope assembly.

Lines for parameter file

Gene_Trans_Map:
    module:     Trinity_gene_to_trans_map
    base:       trinity
    script_path: {Vars.paths.get_Trinity_gene_to_trans_map.pl}

References

Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q. and Chen, Z., 2011. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature biotechnology, 29(7), p.644.

trinity_mapping

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for running align_and_estimate_abundance.pl on a Trinity assembly and the raw reads.

Tested on versions 2.4.0 and 2.5.0 of Trinity.

See the align_and_estimate_abundance.pl script documentation.

Requires

  • fastq files in at least one of the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

  • A Trinity assembly in one of (depending on scope)

    • sample_data[<sample>]["fasta.nucl"]

    • sample_data["fasta.nucl"]

Output:

  • Puts output files in the following slots:

    • sample_data[<sample>]["bam"]

    • sample_data[<sample>]["unsorted_bam"] (If --coordsort_bam is passed in redirects)

    • sample_data[<sample>]["isoforms.results"]

    • sample_data[<sample>]["genes.results"]

Parameters that can be set

  • scope (sample|project): Set if the project-wide fasta slot should be used.

  • redirects: --gene_trans_map (path or empty): If empty, use the internal gene_trans_map. If a path, use it as the gene_trans_map for all samples. If not passed, the analysis is performed on the isoform level only.

  • redirects: --trinity_mode: If set, will create a gene_trans_map for each sample and store it as the sample gene_trans_map.

Lines for parameter file

trin_map1:
    module:               trinity_mapping
    base:                 trinity1
    script_path:          {Vars.paths.align_and_estimate_abundance}
    redirects:
        --est_method:     RSEM
        --aln_method:     bowtie
        --trinity_mode:
        --seqType:        fq

References

Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q. and Chen, Z., 2011. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature biotechnology, 29(7), p.644.

trinity_statistics

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for running abundance_estimates_to_matrix.pl on genes or isoforms counts tables produced by align_and_estimate_abundance.pl

See the abundance_estimates_to_matrix.pl script documentation.

This conversion makes sense at the project level - combining all sample matrices into a single, normalized, comparison table.

Note

scope is not defined for this module. It only makes sense to run abundance_estimates_to_matrix when comparing many samples against a single assembly.

Requires

  • Either genes.results or isoforms.results files in the following slots:

    • sample_data[<sample>]["genes.results"]

    • sample_data[<sample>]["isoforms.results"]

Output:

  • Creates the following files in the following slots:

    • <project>.counts.matrix in self.sample_data["project_data"]["counts.matrix"]

    • <project>.not_cross_norm.fpkm.tmp in self.sample_data["project_data"]["not_cross_norm.fpkm.tmp"]

    • <project>.not_cross_norm.fpkm.tmp.TMM_info.txt in self.sample_data["project_data"]["not_cross_norm.fpkm.tmp.TMM_info.txt"]

    • <project>.TMM.fpkm.matrix in self.sample_data["project_data"]["TMM.fpkm.matrix"]

Parameters that can be set

  • use_genes: Use the 'genes.results' matrix. If not passed, uses 'isoforms.results'.

  • redirects: --gene_trans_map (path or 'none'): If a path, use it as the gene_trans_map for all samples. If 'none', gene-level estimates are not produced. In order to use an internal gene_trans_map, do not pass this parameter.

Lines for parameter file

trin_map_stats:
    module:             trinity_statistics
    base:               trin_map1
    script_path:        /path/to/abundance_estimates_to_matrix.pl
    use_genes:       
    redirects:
        --est_method:   RSEM

References

Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q. and Chen, Z., 2011. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature biotechnology, 29(7), p.644.

RSEM

Authors

Liron Levin

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Short Description

A module for running RSEM

Requires

  • fastq files in:

    self.sample_data[sample]["fastq.F"]
    self.sample_data[sample]["fastq.R"]
    self.sample_data[sample]["fastq.S"]

  • or a bam file in:

    self.sample_data[sample]["bam"]

Output

  • puts output bam files (if the input is fastq) in:

    self.sample_data[sample]["bam"]

  • puts the location of the RSEM results in:

    self.sample_data[sample]["RSEM"]
    self.sample_data[sample]["genes.results"]
    self.sample_data[sample]["isoforms.results"]

Parameters that can be set

  • mode (transcriptome|genome): Is the reference a genome or a transcriptome?

  • gff3: Use if the mode is genome and the annotation file is in GFF3 format.

Comments

  • This module was tested on:

    RSEM v1.2.25
    bowtie2 v2.2.6

Lines for parameter file

Step_Name:                                                   # Name of this step
    module: RSEM                                             # Name of the module used
    base:                                                    # Name of the step [or list of names] to run after [must be after a bam file generator step or merge with fastq files]
    script_path:                                             # Command for running the RSEM script 
    qsub_params:
        -pe:                                                 # Number of CPUs to reserve for this analysis
    mode:                                                    # transcriptome or genome
    export_transcriptome:                                    # In genome mode, set the extracted transcriptome as the new project-level fasta.nucl and extract the transcript-to-gene-map file as the project-level gene_trans_map
    annotation:                                              # For Genome mode: the location of the GTF file [the default]; for GFF3 use the gff3 flag. For Transcriptome mode: the transcript-to-gene-map file.
                                                             # If annotation is set to Trinity, the transcript-to-gene-map file will be generated using the from_Trinity_to_gene_map script
                                                             # If not set, only the reference file will be used, as unrelated transcripts
    from_Trinity_to_gene_map_script_path:                    # If the mode is transcriptome and the reference was assembled using Trinity, it is possible to generate the transcript-to-gene-map file automatically using this script
                                                             # If annotation is set to Trinity and this line is empty or missing, it will try using the module's associated script
    gff3:                                                    # Use if the mode is genome and the annotation file is in gff3 format
    mapper:                                                  # bowtie/bowtie2/star 
    mapper_path:                                             # Location of mapper script
    rsem_prepare_reference_script_path:                      # Location of preparing reference script
    plot_stat:                                               # Generate statistical plots
    plot_stat_script_path:                                   # Location of statistical plot generating script
    reference:                                               # The reference genome/transcriptome location [FASTA file]
    rsem_generate_data_matrix_script_path:                   # Location of the final matrix generating script
                                                             # If this line is empty or missing it will try using the module's associated script
    redirects:
        --append-names:                                      # RSEM will append gene_name/transcript_name to the result files
        --estimate-rspd:                                     # Enables RSEM to learn from the data how the reads are distributed across a transcript
        -p:                                                  # Number of CPUs to use in this analysis
        --bam:                                               # Will use bam files and not fastq
        --no-bam-output:
        --output-genome-bam:                                 # Alignments in genomic coordinates (only if mode is genome)
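
For orientation, a minimal filled-in sketch of the template above, running RSEM in transcriptome mode against a Trinity assembly (the step name, base step and paths are hypothetical):

RSEM1:
    module:         RSEM
    base:           trinity1
    script_path:    /path/to/rsem-calculate-expression
    mode:           transcriptome
    annotation:     Trinity
    reference:      /path/to/Trinity.fasta
    mapper:         bowtie2
    mapper_path:    /path/to/bowtie2
    rsem_prepare_reference_script_path: /path/to/rsem-prepare-reference
    redirects:
        -p:                 20
        --estimate-rspd: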

References

Li, Bo, and Colin N. Dewey. “RSEM: accurate transcript quantification from RNA-Seq data with or without a reference genome.” BMC bioinformatics 12.1 (2011): 323.‏


htseq_count

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running htseq-count:

See htseq-count documentation.

Requires

  • Either BAM or SAM files in one of the following slots:

    • sample_data[<sample>]["bam"]

    • sample_data[<sample>]["sam"]

Output

  • Puts the output file in:

    self.sample_data[<sample>]["HTSeq.counts"]

Parameters that can be set

  • gff: Path to the GFF/GTF annotation file.

  • -f | --format (sam|bam): In redirects. Tells htseq-count which file to use. If not specified, will use whichever file exists.

Lines for parameter file

Example, using an external GTF file:

htseq_c1:
    module:         htseq_count
    base:           samtools_STAR1
    script_path:    /storage16/app/bioinfo/python_packages/bin/htseq-count
    gtf:            /fastspace/bioinfo_databases/STAR_GRCh38_Gencode21/gencode.v21.annotation.gtf
    redirects:
        --format:   bam
        -s:         'no'
        -m:         intersection-nonempty

References

Anders, S., Pyl, P.T. and Huber, W., 2015. HTSeq—a Python framework to work with high-throughput sequencing data. Bioinformatics, 31(2), pp.166-169.

Transcriptome Annotation

Modules included in this section

Trinotate

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for RNA_seq assembly annotation using Trinotate.

Note

This module will be updated in the future to support uploading of other sources of information such as RNAMMER output. See Trinotate documentation.

Requires

  • A transcripts file in:

    • self.sample_data["project_data"]["transcripts.fasta.nucl"]

  • A gene-to-transcript mapping file (produced by the Trinity_gene_to_trans_map module) in:

    • self.sample_data["project_data"]["gene_trans_map"]

  • A protein fasta file (produced by TransDecoder) in:

    • self.sample_data["project_data"]["fasta.prot"]

  • Results of blastp of the protein file against the swissprot database in:

    • self.sample_data["project_data"]["blast.prot"]

  • Results of blastx of the transcripts file against the swissprot database in:

    • self.sample_data["project_data"]["blast.nucl"]

  • Results of hmmscan of the protein file against the pfam database in:

    • self.sample_data["project_data"]["hmmscan.prot"]

  • Results of signalp on the protein file in: [optional]

    • self.sample_data["project_data"]["signalp"]

  • Results of rnammer/infernal on the transcripts file in: [optional; use Infernal with Trinotate-V4]

    • self.sample_data["project_data"]["rnammer"]

  • Results of tmhmm on the protein file in: [optional]

    • self.sample_data["project_data"]["tmhmm"]

  • Results of EggnogMapper on the protein file in: [optional; Trinotate-V4 only]

    • self.sample_data["project_data"]["eggnog"]

Attention

If scope is set to sample, all of the above files should be in the sample scope!

Output:

  • puts Trinotate report file in:

    • sample_data[<sample>]["trino.rep"] (scope = sample)

    • sample_data["trino.rep"] (scope = project)

Parameters that can be set

  • scope (sample|project)

  • sqlitedb: Path to the Trinotate sqlitedb.

  • cp_sqlitedb: Create a local copy of the sqlitedb before loading the data (recommended).

  • ver4: Indicate that you are using Trinotate V4.

Lines for parameter file

trino_Trinotate:
    module:             Trinotate
    base:               
                        - trino_blastp_sprot
                        - trino_blastx_sprot
                        - trino_hmmscan1
    script_path:        {Vars.paths.Trinotate}
    scope:              project
    sqlitedb:           {Vars.databases.trinotate.sqlitedb}
    cp_sqlitedb:    
    ver4:

References

Grabherr, M.G., Haas, B.J., Yassour, M., Levin, J.Z., Thompson, D.A., Amit, I., Adiconis, X., Fan, L., Raychowdhury, R., Zeng, Q. and Chen, Z., 2011. Trinity: reconstructing a full-length transcriptome without a genome from RNA-Seq data. Nature biotechnology, 29(7), p.644.

TransDecoder

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running TransDecoder on a transcripts file.

Note

Tested on TransDecoder version 5.5.0. The main difference from earlier versions is that an output directory can be specified on the command line.

Requires

fasta files in at least one of the following slots:

  • sample_data[<sample>]["fasta.nucl"] (if scope = sample)

  • sample_data["fasta.nucl"] (if scope = project)

Output:

  • If scope = project:

    • Protein fasta in self.sample_data["project_data"]["fasta.prot"]

    • Gene fasta in self.sample_data["project_data"]["fasta.nucl"]

    • Original transcripts in self.sample_data["project_data"]["transcripts.fasta.nucl"]

    • GFF file in self.sample_data["project_data"]["gff3"]

  • If scope = sample:

    • Protein fasta in self.sample_data[<sample>]["fasta.prot"]

    • Gene fasta in self.sample_data[<sample>]["fasta.nucl"]

    • Original transcripts in self.sample_data[<sample>]["transcripts.fasta.nucl"]

    • GFF file in self.sample_data[<sample>]["gff3"]

Parameters that can be set

  • scope (sample|project): Determines whether to use the sample or project transcripts file.

Lines for parameter file

trino_Transdecode_highExpr:
    module:             TransDecoder
    base:               Split_Fasta
    script_path:        {Vars.paths.TransDecoder}
    scope:              sample


RNASeq

Modules included in this section

DeSeq2

Authors

Liron Levin

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Short Description

A module to perform:

  • Gene-level differential expression using DeSeq2

  • Gene annotation

  • PCA plot

  • Clustering of significant genes

  • Heatmaps of significant genes by clusters

  • Expression-pattern plots by clusters

  • Enrichment analysis (KEGG/GO)

Requires

  • Searches for count data in:

    self.sample_data[<sample>]["RSEM"]
    self.sample_data[<sample>]["genes.counts"]
    self.sample_data[<sample>]["HTSeq.counts"]
    self.sample_data["project_data"]["results"]

Parameters that can be set

  • use_click: Will use the CLICK clustering program (Shamir et al. 2000).

Note

If you're using the use_click option, cite:

Expander: Ulitsky, I., Maron-Katz, A., Shavit, S., Sagir, D., Linhart, C., Elkon, R., Tanay, A., Sharan, R., Shiloh, Y. and Shamir, R. Expander: from expression microarrays to networks and functions. Nature Protocols, 5, pp.303-322, 2010.

Click: Shamir, R. and Sharan, R. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings ISMB 2000, pp.307-316 (2000).

Comments

  • The following R packages are required:

    DESeq2, ggplot2, pheatmap, mclust, factoextra, cowplot, gridExtra, biomaRt, clusterProfiler, KEGGREST, scater, sva, rmarkdown, plotly, dt, xml2, dplyr, rcolorbrewer, colorspace, stringr

Note

It is possible to use CONDA to install all dependencies:

wget https://raw.githubusercontent.com/bioinfo-core-BGU/neatseq-flow-modules/master/neatseq_flow_modules/Liron/DeSeq2_module/DeSeq2_env_install.yaml
conda env create -f DeSeq2_env_install.yaml

Follow this tutorial for more information.

Lines for parameter file

Step_Name:                              # Name of this step
    module: DeSeq2                      # Name of the used module
    base:                               # Name of the step [or list of names] to run after with count results.
    script_path:                        # Command for running the DeSeq2 script
                                        # If this line is empty or missing it will try using the module's associated script
    use_click:                          # Will use the CLICK clustering program (Shamir et al. 2000). 
    redirects:
        --SAMPLE_DATA_FILE:             # Path to Samples Information File
        --GENE_ID_TYPE:                 # The gene ID type, i.e. 'ENSEMBL' [for Bioconductor] OR 'ensembl_gene_id'/'ensembl_transcript_id' [for ENSEMBL]
        --Annotation_db:                # Bioconductor Annotation Data Base Name from https://bioconductor.org/packages/release/BiocViews.html#___OrgDb  
        --Species:                      # Species Name to Retrieve Annotation Data from ENSEMBL
        --KEGG_Species:                 # Species Name to Retrieve Annotation Data from KEGG
        --KEGG_KAAS:                    # Gene to KO file from KEGG KAAS [first column gene id, second column KO number]
        --Trinotate:                    # Path to a Trinotate annotation file in which the first column is the genes names
        --FILTER_SAMPLES:               # Filter Samples with Low Number of expressed genes OR with Small Library size using 'scater' package 
        --FILTER_GENES:                 # Filter Low-Abundance Genes using 'scater' package
        --NORMALIZATION_TYPE:           # The DeSeq2 Normalization Type To Use [VSD , RLOG] The Default is VSD
        --BLIND_NORM:                   # Perform Blind Normalization
        --DESIGN:                       # The Main DeSeq2 Design [ ~ Group ]
        --removeBatchEffect             # Will remove batch effects from the normalized counts data:
                                        # up to 2 batch-effect fields [from the sample data], separated by ',',
                                        # using the limma package, or only one field using the sva package
        --removeBatchEffect_method      # The method to Remove Batch Effect from the Normalized counts data using the limma or sva packages [sva is the default]
        --LRT:                          # The LRT DeSeq2 Design
        --ALPHA:                        # Significant Level Cutoff, The Default is 0.05
        --Post_statistical_ALPHA        # Post Statistical P-value Filtering
        --FoldChange:                   # Fold change Cutoff [testing for fold changes greater in absolute value], The Default is 1
        --Post_statistical_FoldChange   # Post Statistical Fold change Filtering
        --CONTRAST:                     # The DeSeq contrast design ["Group,Treatment,Control"] [not for LRT].
                                        # It is possible to define more than one contrast design ["Group,Treatment1,Control1|Group,Treatment2,Control2|..."]
        --SPLIT_BY_CONTRAST             # Only use Samples found in the relevant contrast for Clustering and Enrichment Analysis
        --modelMatrixType:              # How the DeSeq model matrix of the GLM formula is formed [standard or expanded] ,The Default is standard
        --GENES_PLOT:                   # Genes Id To Plot count Data [separated by ','] 
        --X_AXIS:                       # The field in the sample data to use as the X axis
        --GROUP:                        # The field in the sample data to group by [can be two fields separated by ',']
        --SPLIT_BY:                     # The field in the sample data to split the analysis by
        --FUNcluster:                   # A clustering function including [kmeans,pam,clara,fanny,hclust,agnes,diana,click]. The default is hclust
                                        # If the 'use_click' option is used the '--FUNcluster' option is set to 'click' 
        --hc_metric:                    # Hierarchical clustering metric to be used for calculating dissimilarities between observations. The default is pearson
        --hc_method:                    # Hierarchical clustering agglomeration method to be used. The default is ward.D2
        --k.max:                        # The maximum number of clusters to consider, must be at least two. The default is 20
        --nboot:                        # Number of Monte Carlo (bootstrap) samples for determining the number of clusters [Not For Mclust]. The default is 10 
        --stand:                        # The Data will be Standardized Before Clustering.
        --Mclust:                       # Use Mclust for determining the number of clusters.
        --CLICK_HOMOGENEITY:            # The HOMOGENEITY [0-1] of clusters using CLICK program (Shamir et al. 2000). The default is 0.5 
        --PCA_COLOR:                    # The field in the sample data that determines color in the PCA plot
        --PCA_SHAPE:                    # The field in the sample data that determines shape in the PCA plot
        --PCA_SIZE:                     # The field in the sample data that determines size in the PCA plot. The default is library size
        --Enriched_terms_overlap:       # Test for genes overlap in enriched terms
        --USE_INPUT_GENES_AS_BACKGROUND # Use The input Genes as the Background for Enrichment Analysis
        --only_clustering               # Skip the differential analysis and only perform clustering
        --significant_genes             # Use these genes as the set of significant genes [a comma separated list]
        --collapseReplicates            # Will collapse technical replicates using a Sample Data field indicating which samples are technical replicates
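
For orientation, a minimal filled-in sketch of the template above. The step name, base step, file path and the 'Group'/'Treatment'/'Control' values are hypothetical and must match fields in your sample-data file; script_path is left empty so the module's associated script is used:

DeSeq2_1:
    module:         DeSeq2
    base:           RSEM1
    script_path:
    redirects:
        --SAMPLE_DATA_FILE: /path/to/sample_data.tsv
        --GENE_ID_TYPE:     ENSEMBL
        --DESIGN:           ~ Group
        --CONTRAST:         "Group,Treatment,Control"
        --ALPHA:            0.05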

Sequence Annotation

Modules included in this section

Prokka

Authors

Liron Levin

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

Short Description

Runs Prokka on all samples

Requires

  • For each Sample, a fasta.nucl file type [e.g. an assembly result] in:

    sample_data[sample]["fasta.nucl"]

Output

  • For each Sample, puts the location of the Sample’s GFF file in:

    sample_data[sample]["GFF"]

  • For each Sample, puts the location of the Sample’s identified genes file in:

    sample_data[sample]["fasta.nucl"]

  • For each Sample, puts the location of the Sample’s identified genes [translated] file in:

    sample_data[sample]["fasta.prot"]

  • if the generate_GFF_dir option exists, puts the directory location of all Samples' GFFs in:

    sample_data["GFF_dir"]

Parameters that can be set

  • generate_GFF_dir: Create a GFF directory.


Lines for parameter file

Step_Name:                                  # Name of this step
    module: Prokka                          # Name of the module to use
    base:                                   # Name of the step [or list of names] to run after [must be after a fasta file generator step like an assembly program or start the analysis with fasta files]
    script_path:                            # Command for running Prokka 
    env:                                    # env parameters that need to be in the PATH for running this module
    qsub_params:
        -pe:                                # Number of CPUs to reserve for this analysis
    generate_GFF_dir:                       # Create GFF directory
    redirects:
        --cpus:                             # parameters for running Prokka
        --force:                            # parameters for running Prokka
        --genus:                            # parameters for running Prokka
        --kingdom:                          # parameters for running Prokka
        --proteins:                         # Use the location of a protein DB [FASTA] for extra annotation or use "VFDB" to use the module VFDB built-in virulence/resistance DB  
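
For orientation, a minimal filled-in sketch of the template above (the step name, base step and paths are hypothetical):

Prokka1:
    module:             Prokka
    base:               spades1
    script_path:        /path/to/prokka
    qsub_params:
        -pe:            shared 20
    generate_GFF_dir:
    redirects:
        --cpus:         20
        --kingdom:      Bacteria
        --force: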

References

Seemann, Torsten. “Prokka: rapid prokaryotic genome annotation.” Bioinformatics 30.14 (2014): 2068-2069.‏

prokka_old *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running prokka:

Prokka is executed on the contigs stored in sample_data.

Requires

  • A nucleotide fasta file in one of the following slots:

    • sample_data[<sample>]["fasta.nucl"]

    • sample_data["fasta.nucl"]

Output

  • If scope is set to sample:

    • Puts output predicted protein sequences (faa file) in:

      sample_data[<sample>]["fasta.prot"]

    • Puts output predicted protein genomic sequences (fna file) in:

      sample_data[<sample>]["fasta.nucl"]

    • Puts the annotation file (gff) in:

      sample_data[<sample>]["gff"]

      Stores the prokka dir in:

      sample_data[<sample>]["prokka.dir"]

  • If scope is set to project:

    • Puts output predicted protein sequences (faa file) in:

      sample_data["fasta.prot"]

    • Puts output predicted protein genomic sequences (fna file) in:

      sample_data["fasta.nucl"]

    • Puts the annotation file (gff) in:

      sample_data["gff"]

      Stores the prokka dir in:

      sample_data["prokka.dir"]

Parameters that can be set

  • generate_GFF_dir (empty): Create a dir with links to the gff files for use by downstream modules. Only relevant when scope == 'sample'.

Comments

If you set values to --locustag, --genus, --species and --strain, these will hold for all the samples, and will be passed as-is to the scripts.

If you pass the parameters without setting their values, the values will be set to the sample names (or to the project name, when scope == 'project').

Lines for parameter file

prokka1:
    module: prokka_old
    base: spades1
    script_path: /path/to/prokka
    qsub_params:
        -pe: shared 20
    generate_GFF_dir: 
    scope: sample
    redirects:
        --cpus: 20
        --fast: 
        --force:
        --genus: Staphylococcus
        --metagenome: 
        --strain: 

References

Seemann, T., 2014. Prokka: rapid prokaryotic genome annotation. Bioinformatics, 30(14), pp.2068-2069.

Metagenomics

Modules included in this section

HUMAnN2

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

A module for running HUMAnN2:

Requires

  • fastq files, either forward or single:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.S"]

Output

  • Puts the HUMAnN2 output files in:

    • self.sample_data[sample]["HUMAnN2.genefamilies"] (Also in HUMAnN2.genefamilies.RPK)

    • self.sample_data[sample]["HUMAnN2.pathabundance"] (Also in HUMAnN2.pathabundance.RPK)

    • self.sample_data[sample]["HUMAnN2.pathcoverage"]

  • If humann2_renorm_table block is set in params, puts the normalized tables in:

    • self.sample_data[sample]["HUMAnN2.genefamilies"] (Also in HUMAnN2.genefamilies.<units>, where <units> is the value passed to --units)

    • self.sample_data[sample]["HUMAnN2.pathabundance"] (Also in HUMAnN2.pathabundance.<units>, where <units> is the value passed to --units)

  • If humann2_join_tables block is set in params, puts the joined tables in:

    • self.sample_data["project_data"]["HUMAnN2.genefamilies"]

    • self.sample_data["project_data"]["HUMAnN2.pathabundance"]

    • self.sample_data["project_data"]["HUMAnN2.pathcoverage"]

Note

If both humann2_renorm_table and humann2_join_tables blocks exist in params, humann2_join_tables will work on the normalized tables produced by humann2_renorm_table! To join the non-normalized tables, simply do not include a humann2_renorm_table block.
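
For example, a sketch of a step that joins the raw, non-normalized tables; the step name is hypothetical, and the base step and paths are taken from the full example below and may differ in your setup:

HUMAnN2_raw:
    module: HUMAnN2
    base: Trim_Galore
    script_path: '{Vars.Programs_path.humann2}'
    protein-database: uniref50
    redirects:
        --input-format: fastq
        --nucleotide-database: '{Vars.databases.humann2.chocophlan}'
        --protein-database: '{Vars.databases.humann2.uniref50}'
    humann2_join_tables:
        path: humann2_join_tables
    # No humann2_renorm_table block, so the joined tables remain non-normalized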

Parameters that can be set

  • humann2_join_tables: Block containing the path to humann2_join_tables, and a redirects block if necessary.

  • humann2_renorm_table: Block containing the path to humann2_renorm_table, and a redirects block if necessary.

  • protein-database (uniref50|uniref90): Protein database used for the analysis.

Warning

The protein-database parameter records the protein database being used: uniref50 or uniref90. It is not used by this module but is required by the downstream module, HUMAnN2_further_processing. If you do not include it, you will not be able to add a HUMAnN2_further_processing instance for downstream analysis.

Lines for parameter file

HUMAnN2_uniref50_hardtrimmed_reads:
    module: HUMAnN2
    base: Trim_Galore
    script_path: '{Vars.Programs_path.humann2}'
    setenv: PERL5LIB="" mpa_dir=$CONDA_PREFIX/bin
    qsub_params:
        -pe: shared 30
    protein-database:   uniref50
    redirects:
        --gap-fill: 'on'
        --input-format: fastq
        --minpath: 'on'
        --nucleotide-database: '{Vars.databases.humann2.chocophlan}'
        --protein-database: '{Vars.databases.humann2.uniref50}'
        --threads: '30'
    humann2_join_tables:
        path: humann2_join_tables
    humann2_renorm_table:
        path: humann2_renorm_table
        redirects:
            --units: cpm

References

HUMAnN2 home page

kraken

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

A module for running kraken:

Note that the kraken executable must be in a folder together with kraken-translate and kraken-report. This is the default for a kraken installation.

Pass the full path to the kraken executable in script_path.

Merging of sample kraken reports is done with krona. See the section on Parameters that can be set.

You can follow this module with the kraken-biom module to create a biom table from the reports.

Requires

  • fastq files, either paired end or single:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

Output

  • Puts the kraken output files in:

    • self.sample_data[<sample>]["raw_classification"]

    • self.sample_data[<sample>]["classification"]

    • self.sample_data[<sample>]["kraken.report"]

  • If the ktImportTaxonomy_path parameter was passed, puts the krona reports in:

    • self.sample_data["project_data"]["krona"]

Parameters that can be set

  • ktImportTaxonomy_path: Path to ktImportTaxonomy. You can add additional ktImportTaxonomy parameters at the end of the path. If not passed, the krona report will not be built.

Lines for parameter file

kraken1:
    module: kraken
    base: trim1
    script_path: {Vars.paths.kraken}
    qsub_params:
        -pe: shared 20
    ktImportTaxonomy_path: /path/to/ktImportTaxonomy  -u  http://krona.sourceforge.net
    redirects:
        --db: /path/to/kraken_std_db
        --preload: 
        --quick: 
        --threads: 20

References

Wood, D.E. and Salzberg, S.L., 2014. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome biology, 15(3), p.R46.

kraken_biom

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

A module for running kraken-biom (https://github.com/smdabdoub/kraken-biom)

Requires

  • Kraken reports:

    • sample_data[<sample>]["kraken.report"]

Output

  • Puts the resulting biom output files in:

    • self.sample_data["project_data"]["kraken.biom"]

    • self.sample_data["project_data"]["biom_table"]

    • self.sample_data["project_data"]["biom_table_tsv"] (if skip_tsv is not set)

Parameters that can be set

  • skip_tsv: Set if you do not want to convert the report into tsv format.

  • skip_summary: Set if you do not want to create a summary of the report.

  • biom_path (/path/to/biom): The path to biom. Required for the conversion to tsv and for producing the summary.

Lines for parameter file

kraken_biom1:
    module:             kraken_biom
    base:               kraken1
    script_path:        '{Vars.paths.kraken_biom}'
    # skip_tsv:
    biom_path:          '{Vars.paths.biom}'
    redirects:
        --max:          D 
        --min:          S 
        --gzip:

References

https://github.com/smdabdoub/kraken-biom

metaphlan2

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

A module for running metaphlan2:

Requires

  • fastq files, either paired end or single:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

Output

  • Puts the metaphlan2 output files in:

    • self.sample_data[<sample>]["raw_classification"]

  • If the ktImportText_path parameter was passed, puts the krona reports in:

    • self.sample_data["project_data"]["krona"]

  • If merge_metaphlan_tables was passed, puts the merged reports in

    • self.sample_data["project_data"]["merged_metaphlan2"]

  • If '--biom' is set in redirects, the biom table is put in:

    • self.sample_data[<sample>]["biom_table"]

  • If '--bowtie2out' is set in redirects, the SAM file is put in:

    • self.sample_data[<sample>]["sam"]

  • If 'metaphlan2krona_path' is set:

    • self.sample_data[<sample>]["classification"]

Parameters that can be set

  • ktImportText_path: Path to ktImportText.

  • merge_metaphlan_tables: Path to merge_metaphlan_tables.py. If not specified, will derive it from the location of metaphlan2.

  • metaphlan2krona_path: Path to metaphlan2krona.py.

Lines for parameter file

metph1:
    module: metaphlan2
    base: trim1
    script_path: {Vars.paths.metaphlan2}
    ktImportText_path: /path/to/ktImportText
    merge_metaphlan_tables: 
    metaphlan2krona_path:   /path/to/metaphlan2krona.py
    redirects:
        --biom: 
        --bowtie2_exe: /path/to/bowtie2
        --bowtie2db: /path/to/database
        --bowtie2out:
        --input_type: fastq
        --mdelim: ';'
        --mpa_pkl: /path/to/mpa_v20_m200.pkl

References

Truong, D.T., Franzosa, E.A., Tickle, T.L., Scholz, M., Weingart, G., Pasolli, E., Tett, A., Huttenhower, C. and Segata, N., 2015. MetaPhlAn2 for enhanced metagenomic taxonomic profiling. Nature methods, 12(10), pp.902-903.

centrifuge

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

A module for running centrifuge:

Pass the full path to the centrifuge executable in script_path.

Merging of sample centrifuge reports is done with krona. See the section on Parameters that can be set.

Requires

  • fastq files, either paired end or single:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

Output

  • Puts the centrifuge output files in:

    • self.sample_data[<sample>]["raw_classification"]

    • self.sample_data[<sample>]["classification"]

    • self.sample_data[<sample>]["classification_report"]

  • If ktImportTaxonomy_path parameter was passed, puts the krona reports in

    • self.sample_data["project_data"]["krona"]

Parameters that can be set

  • ktImportTaxonomy_path: Path to ktImportTaxonomy. You can add additional ktImportTaxonomy parameters at the end of the path. If not passed, the krona report will not be built.

Lines for parameter file

Centrifuge:
    module:         centrifuge
    base:           trim1
    script_path:    {Vars.paths.centrifuge}
    qsub_params:
        -pe:        shared 20
    ktImportTaxonomy_path: /path/to/ktImportTaxonomy  -u  http://krona.sourceforge.net
    redirects:
        --db:       /path/to/centrifuge_db
        --preload: 
        --quick: 
        --threads:  20

References

Kim, D., Song, L., Breitwieser, F. P., & Salzberg, S. L. (2016). Centrifuge: rapid and sensitive classification of metagenomic sequences. Genome research, 26(12), 1721-1729.

Microbiology

Modules included in this section

CARD_RGI

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

A module for running CARD RGI:

RGI is executed on the contigs stored in a nucleotide fasta file.

Requires

  • A nucleotide fasta file in one of the following slots:

    • sample_data[<sample>]["fasta.nucl"]

    • sample_data["fasta.nucl"]

Output

  • If scope is set to sample:

    • Puts output files in:

      sample_data[<sample>]["CARD_RGI.json"] sample_data[<sample>]["CARD_RGI.tsv"]

    • Puts index of output files in:

      self.sample_data["project_data"]["CARD_RGI.files_index"]

    • If merge_script_path is specified in parameters, puts the merged file in

      self.sample_data["project_data"]["CARD_RGI.merged_reports"]

  • If scope is set to project:

    • Puts output files in:

      sample_data["CARD_RGI.json"] sample_data["CARD_RGI.tsv"]

Parameters that can be set

Parameter

Values

Comments

JSON2tsv_script

path

The path to the CARD script for converting the JSON output to tsv

(find ‘convertJsonToTSV.py’ in your RGI installation)

merge_script_path

path

Path to a script that takes an index of RGI output files (‘--ind’) and a place to put the output (‘--output’). This script will be executed in the wrapping-up stage. (Note: the script can take more parameters. These should be passed with the path in the parameter file, e.g. ‘python /path/to/script --param1 val1 --param2 val2’.) If the parameter is not passed, no action will be taken on the output files.

Lines for parameter file

rgi_inst:
    module: CARD_RGI
    base: spades1
    script_path: python /path/to/rgi.py
    qsub_params:
        -pe: shared 15
    JSON2tsv_script: python /path/to/convertJsonToTSV.py
    merge_script_path: Rscript /path/to/merge_reports.R --variable bit_score
    orf_to_use: -x
    scope: sample
    redirects:
        -n: 20
        -x: 1

References

McArthur, A.G., Waglechner, N., Nizam, F., Yan, A., Azad, M.A., Baylay, A.J., Bhullar, K., Canova, M.J., De Pascale, G., Ejim, L. and Kalan, L., 2013. The comprehensive antibiotic resistance database. Antimicrobial agents and chemotherapy, 57(7), pp.3348-3357.

cgMLST_and_MLST_typing

Authors

Liron Levin

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad. The MLST typing R script was created by Menachem Sklarz & Michal Gordon

Short Description

A module for MLST and cgMLST typing

Requires

  • Blast results after parsing in:

    self.sample_data[<sample>]["blast.parsed"]

Output

  • Typing results in:

    self.sample_data[<sample>]["Typing"]

  • Merge of typing results in:

    self.sample_data["project_data"]["Typing"]

  • Files for phyloviz in:

    self.sample_data["project_data"]["phyloviz_MetaData"] self.sample_data["project_data"]["phyloviz_Alleles"]

  • Tree file (if the Tree flag is set) in newick format in:

    self.sample_data["project_data"]["newick"]

Parameters that can be set

Parameter

Values

Comments

cut_samples_not_in_metadata

In the final merge file consider only samples found in the Meta-Data file

sample_cutoff

[0-1]

In the final merge file consider only samples that have at least this fraction of identified alleles

Comments

  • The following python packages are required:
    • pandas

  • The following R packages are required:
    • magrittr

    • plyr

    • optparse

    • tools

Note

If using a conda environment with R installed, the R packages will be automatically installed inside the environment.

Lines for parameter file

Step_Name:                                   # Name of this step
    module: cgMLST_and_MLST_typing           # Name of the module to use
    base:                                    # Name of the step [or list of names] to run after [must be after steps that generate blast.parsed File_Types]
    script_path:                             # Leave blank
    metadata:                                # Path to Meta-Data file
    metadata_samples_ID_field:               # Column name in the Meta-Data file of the samples ID
    cut_samples_not_in_metadata:             # In the final merge file consider only samples found in the Meta-Data file
    sample_cutoff:                           # In the final merge file consider only samples that have at least this fraction of identified alleles
    Tree:                                    # Generate newick Tree using hierarchical-clustering [Hamming distance]
    Tree_method:                             # The hierarchical-clustering linkage method [default=complete]
    redirects:
        --scheme:                            # Path to the Typing scheme file [Tab delimited]
        --Type_col_name:                     # Column/s name/s in the scheme file that are not locus names
        --ignore_unidentified_alleles        # Remove columns with unidentified alleles [default=False]
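
A minimal concrete sketch of the skeleton above (the step names, paths and column names are hypothetical; adjust to your own scheme and metadata):

mlst_typing:
    module: cgMLST_and_MLST_typing
    base: parse_blast1
    script_path:
    metadata: /path/to/metadata.tsv
    metadata_samples_ID_field: Sample_ID
    sample_cutoff: 0.7
    Tree:
    redirects:
        --scheme: /path/to/MLST_scheme.tab
        --Type_col_name: ST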

Roary

Authors

Liron Levin

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad. The Bi_clustering R script was created by Eliad Levi

Short Description

A module for running Roary on GFF files

Requires

  • For each Sample, GFF file location in:
    • sample_data[<sample>]["GFF"]

  • If there is a GFF directory in the following slot, no new GFF directory will be created and ONLY the GFF files in this directory will be analysed.
    • sample_data["GFF_dir"]

  • If the search_GFF flag is on, GFF files will be searched for in the last base's directory

Output

  • puts output GFF directory location in the following slot:
    • sample_data["GFF"]

  • puts output pan_genome results directory location in the following slot:
    • sample_data["pan_genome_results_dir"]

  • puts output pan_genome presence_absence_matrix file location in the following slot:
    • sample_data["presence_absence_matrix"]

  • puts output pan_genome clustered_proteins file location in the following slot:
    • sample_data["clustered_proteins"]

  • puts output GWAS directory location in the following slot:
    • sample_data["GWAS_results_dir"]

  • puts output Biclustering directory location in the following slot:
    • sample_data["Bicluster_results_dir"]

  • puts output Biclustering cluster file location in the following slot:
    • sample_data["Bicluster_clusters"]

  • puts output Gecko directory location in the following slot:
    • sample_data["Gecko_results_dir"]

  • puts Accessory genes or virulence/resistance hierarchical-clustering tree file in the following slot:
    • self.sample_data["project_data"]["newick"]

Parameters that can be set

Parameter

Values

Comments

Comments

  • This module was tested on:
    • Roary v3.10.2

    • Roary v1.006924

    • Scoary v1.6.11

    • Scoary v1.6.9

    • Gecko3

  • For the Bi_cluster analysis the following R packages are required:
    • optparse

    • eisa

    • ExpressionView

    • openxlsx

    • clusterProfiler

    • org.Hs.eg.db

  • To plot the pan-genome matrix the following python packages are required:
    • pandas

    • patsy

    • seaborn

    • matplotlib

    • numpy

    • scipy

  • For the scoary analysis the following python packages are required:
    • pandas

  • For the Gecko analysis the following python packages are required:
    • pandas

Note

If using conda environment with R installed, the R packages will be automatically installed inside the environment.

Lines for parameter file

Step_Name:                                   # Name of this step
    module: Roary                            # Name of the module used
    base:                                    # Name of the step [or list of names] to run after [must be after a GFF file generator step like Prokka]
    script_path:                             # Command for running the Roary script 
    env:                                     # env parameters that needs to be in the PATH for running this module
    qsub_params:                             
        -pe:                                 # Number of CPUs to reserve for this analysis
    virulence_resistance_tag:                # Use the name of the db used in prokka or use "VFDB" if you used the VFDB built-in Prokka module DB 
    search_GFF:                              # Search for GFF files?
    Bi_cluster:                              # Do Bi_cluster analysis using the Roary results, if empty or this line does not exist will not do Bi_cluster analysis
        --Annotation:                        # location of virulence annotation file to use to annotate the clusters or use "VFDB" if you used the VFDB built-in Prokka module DB
        --ID_field:                          # The column name in the MetaData file of the samples IDs
        --cols_to_use:                       # list of the MetaData columns to use to annotate the clusters  example: '"ST","CC","source","host","geographic.location","Date"'
        --metadata:                          # location of MetaData file to use to annotate the clusters
    plot:                                    # plot gene presence/absence matrix
        format:                              # The gene presence/absence matrix plot output format. example: pdf
        Clustering_method:                   # The gene presence/absence matrix plot hierarchical-clustering method. example: ward
        Tree:                                # Save a tree in newick format of the 'Accessory' genes or the 'virulence_resistance_tag' genes hierarchical-clustering
                                             # example: Tree: Accessory 
    scoary:
        script_path:                         # Command for running the scoary script, if empty or this line does not exist will not run scoary
        BH_cutoff:                           # Scoary BH correction for multiple testing cut-off
        Bonferroni_cutoff:                   # Scoary Bonferroni correction for multiple testing cut-off
        metadata_file:                       # location of MetaData file to use to create the scoary traits file
        metadata_samples_ID_field:           # The column name in the MetaData file of the sample's IDs
        traits_file:                         # Path to a traits file
        traits_to_pars:                      # If a traits file is not provided, use a list of conditions to create the scoary traits file from the MetaData file. example: "source/=='blood'"  "source/=='wound'"
                                             # Pairs of field and operator + value to convert to boolean traits: field_name1/op_value1 .. field_nameN/op_valueN Example: "field_1/>=val_1<val_2"    "field_2/=='str_val'"
                                             # A Filter can be used by FILTER_field_name1/FILTER_op_value1&field_name1/op_value1
                                             # Note that Gecko can't run if the Bi_clustering was not run
    Gecko:
        script_path:                         # Command for running the Gecko script, if empty or this line does not exist will not run Gecko
        -d:                                  # Parameters for running Gecko
        -s:                                  # Parameters for running Gecko
        -q:                                  # Parameters for running Gecko
    redirects:
        -k:                                  # Parameters for running Roary
        -p:                                  # Parameters for running Roary
        -qc:                                 # Parameters for running Roary
        -s:                                  # Parameters for running Roary
        -v:                                  # Parameters for running Roary
        -y:                                  # Parameters for running Roary
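
The skeleton above lists every option; a minimal concrete sketch that runs only Roary itself (the step names and paths are hypothetical):

roary1:
    module: Roary
    base: prokka1
    script_path: /path/to/roary
    qsub_params:
        -pe: shared 20
    redirects:
        -p: 20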

References

  • Roary program: Page, Andrew J., et al. “Roary: rapid large-scale prokaryote pan genome analysis.” Bioinformatics 31.22 (2015): 3691-3693.‏

  • Scoary program: Brynildsrud, Ola, et al. “Rapid scoring of genes in microbial pan-genome-wide association studies with Scoary.” Genome biology 17.1 (2016): 238.‏

  • Gecko program: Winter, Sascha, et al. “Finding approximate gene clusters with Gecko 3.” Nucleic acids research 44.20 (2016): 9600-9610.‏

Snippy

Authors

Liron Levin

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

Short Description

A module for running Snippy on fastq files

Requires

  • fastq files in at least one of the following slots:

    self.sample_data[<sample>]["fastq.F"] self.sample_data[<sample>]["fastq.R"] self.sample_data[<sample>]["fastq.S"]

Output

  • puts the Results directory location in:

    self.sample_data[<sample>]["Snippy"]

  • puts each sample's vcf file location in:

    self.sample_data[<sample>]["vcf"]

if snippy_core is set to run:
  • puts the core Multi-FASTA alignment location in:

    self.sample_data["project_data"]["fasta.nucl"]

  • puts core vcf file location of all analyzed samples in the following slot:

    self.sample_data["project_data"]["vcf"]

if Gubbins is set to run:
  • puts the resulting tree file location of all analyzed samples in:

    self.sample_data["project_data"]["newick"]

  • updates the core Multi-FASTA alignment in:

    self.sample_data["project_data"]["fasta.nucl"]

  • updates the core vcf file in the slot:

    self.sample_data["project_data"]["vcf"]

if phyloviz is set to run, puts phyloviz-ready files in:
  • Alleles:

    self.sample_data["project_data"]["phyloviz_Alleles"]

  • MetaData:

    self.sample_data["project_data"]["phyloviz_MetaData"]

Parameters that can be set

Parameter

Values

Comments

Comments

  • This module was tested on:

    Snippy v3.2

    gubbins v2.2.0

  • For the phyloviz analysis the following python packages are required:

    pandas

Lines for parameter file

Step_Name:                                  # Name of this step
    module: Snippy                          # Name of the module used
    base:                                   # Name of the step [or list of names] to run after [must be after a merge step]
    script_path:                            # Command for running the Snippy script
    env:                                    # env parameters that needs to be in the PATH for running this module
    qsub_params:
        -pe:                                # Number of CPUs to reserve for this analysis
    gubbins:
        script_path:                        # Command for running the gubbins script, if empty or this line dose not exist will not run gubbins
        --STR:                              # More redirects arguments for running gubbins
    phyloviz:                                   # Generate phyloviz ready to use files
        -M:                                 # Location of a MetaData file 
        --Cut:                              # Use only Samples found in the metadata file
        --S_MetaData:                       # The name of the samples ID column
        -C:                                 # Use only Samples that have at least this fraction of identified alleles
    snippy_core:
        script_path:                        # Command for running the snippy-core script, if empty or this line does not exist will not run snippy-core
        --noref:                            # Exclude reference 
    redirects:
        --cpus:                             # Parameters for running Snippy
        --force:                            # Force overwrite of existing output folder (default OFF)
        --mapqual:                          # Minimum mapping quality to allow
        --mincov:                           # Minimum coverage of variant site
        --minfrac:                          # Minimum proportion for variant evidence
        --reference:                        # Reference Genome location
        --cleanup                           # Remove all non-SNP files: BAMs, indices etc (default OFF)            
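
A minimal concrete sketch of the skeleton above (the step names and paths are hypothetical), running Snippy per sample and snippy-core on the combined results:

snippy1:
    module: Snippy
    base: merge1
    script_path: /path/to/snippy
    snippy_core:
        script_path: /path/to/snippy-core
    redirects:
        --cpus: 10
        --reference: /path/to/reference.fasta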

References

Snippy:

https://github.com/tseemann/snippy

gubbins:

Croucher N. J., Page A. J., Connor T. R., Delaney A. J., Keane J. A., Bentley S. D., Parkhill J., Harris S.R. “Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins”. doi:10.1093/nar/gku1196, Nucleic Acids Research, 2014

Gubbins

Authors

Liron Levin

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

Short Description

A module for running Gubbins on a project level nucleotide Multi-FASTA alignment file.

Requires

  • Project level nucleotide Multi-FASTA alignment file in the following slot:

    sample_data["fasta.nucl"]

Output

  • puts the resulting tree file location of all analyzed samples in the slot:

    self.sample_data["project_data"]["newick"]

  • updates the Multi-FASTA alignment in the slot:

    self.sample_data["project_data"]["fasta.nucl"]

  • puts the filtered vcf file in the slot:

    self.sample_data["project_data"]["vcf"]

if phyloviz is set to run, puts phyloviz-ready files in the slots:
  • Alleles:

    self.sample_data["project_data"]["phyloviz_Alleles"]

  • MetaData:

    self.sample_data["project_data"]["phyloviz_MetaData"]

Parameters that can be set

Parameter

Values

Comments

Comments

  • This module was tested on:

    gubbins v2.2.0

  • For the phyloviz analysis the following python packages are required:

    pandas

Lines for parameter file

Step_Name:                                  # Name of this step
    module: Gubbins                         # Name of the module used
    base:                                   # Name of the step [or list of names] to run after [must be after a step that generates a Project level nucleotide Multi-FASTA alignment]
    script_path:                            # Command for running the gubbins script, if empty or this line does not exist will not run gubbins
    env:                                    # env parameters that needs to be in the PATH for running this module
    qsub_params:
        -pe:                                # Number of CPUs to reserve for this analysis
    phyloviz:                                   # Generate phyloviz ready to use files
        -M:                                 # Location of a MetaData file 
        --Cut:                              # Use only Samples found in the metadata file
        --S_MetaData:                       # The name of the samples ID column
        -C:                                 # Use only Samples that have at least this fraction of identified alleles
    redirects:
        --threads:                          # Parameters for running Gubbins
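
A minimal concrete sketch (the step names and paths are hypothetical), assuming an upstream step such as a snippy-core run has left a project-level alignment in fasta.nucl:

gubbins1:
    module: Gubbins
    base: snippy1
    script_path: /path/to/run_gubbins.py
    redirects:
        --threads: 20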

References

gubbins:

Croucher N. J., Page A. J., Connor T. R., Delaney A. J., Keane J. A., Bentley S. D., Parkhill J., Harris S.R. “Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins”. doi:10.1093/nar/gku1196, Nucleic Acids Research, 2014

Tree_plot

Authors

Liron Levin

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

Short Description

A module for plotting a tree file in newick format together with MetaData information and, optionally, additional matrix information.

Requires

  • A tree file in newick format in:

    self.sample_data["project_data"]["newick"]

  • Tab-delimited file with sample names in one of the columns, from:

    self.sample_data["project_data"]["MetaData"]
    self.sample_data["project_data"]["results"]

    or from an external file.

Output

  • Generates a pdf file of the tree with the MetaData information.

Parameters that can be set

Parameter

Values

Comments

Comments

  • The following R packages are required:

    • optparse

    • ape

    • ggtree

    • openxlsx

Lines for parameter file

Step_Name:                            # Name of this step
    module: Tree_plot                 # Name of the used module
    base:                             # Name of the step [or list of names] to run after and generate a Tree plot [must be after a tree-making step]
                                      # If more than one base is specified, the first overwrites the overlapping slots of the other bases
    script_path:                      # Command for running the Tree plot script
                                      # If this line is empty or missing it will try using the module's associated script
    iterate_on_bases:                 # If set will iterate over the step's bases and generate a plot for each base. 
    tree_by_heatmap:                  # Generate additional tree using Hierarchical Clustering of the heatmap
    redirects:
        --layout:                     # Tree layout [fan or rectangular (default)]
        --Meta_Data:                  # Path to tab-delimited Meta Data file with header line. 
                                      # If this line is empty or missing it will try searching for results data.
        --M_Excel:                    # If the Meta_Data input is an Excel file indicate the sheet name to use
        --ID_field:                   # Column name in the Meta Data file for IDs found in the tips of the tree
        --cols_to_use:                # Columns in the Meta Data file to use and the order from the center up  
        --open.angle:                 # Tree open angle.
        --branch.length:              # Don't use branch length [cladogram]
        --conect.tip:                 # Connect the tip to its label
        --pre_spacer:                 # Space before the label text [default=0.05]
        --post_spacer:                # Space after the label text [default=0.01]
        --OTU:                        # Column name in the Meta Data file to use as OTU annotation
        --labels:                     # Use branch length labels
        --Tip_labels:                 # Show tip labels
        --heatmap:                    # Path to Data file to generate a heatmap 
                                      # If this line is empty it will try searching for results data.
        --H_Excel:                    # If the heatmap input is an Excel file indicate the sheet name to use
        --heatmap_cell_border:        # Color of heatmap cell border [default='white']
        --heatmap_lowest_value:       # Color of heatmap lowest value [default='white']
        --heatmap_highest_value:      # Color of heatmap highest value [default='red']
        --cols_to_use_heatmap:        # Columns in the heatmap Data file to use and the order from the center up
        --ID_heatmap_field:           # Column name for IDs found in the tips of the tree in the heatmap Data file
        --heatmap_variable:           # Use only variable columns in the heatmap
        --heatmap_count_by_sep:       # Count the sep in each cell to generate the values for the heatmap
        --heatmap_HC_dist:            # The heatmap Hierarchical Clustering dist method
        --heatmap_HC_agg:             # The heatmap Hierarchical Clustering agglomeration method
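
A minimal concrete sketch of the skeleton above (the paths and column names are hypothetical), plotting the newick tree from the previous step in a fan layout annotated with two metadata columns:

tree_plot1:
    module: Tree_plot
    base: gubbins1
    script_path:
    redirects:
        --layout: fan
        --Meta_Data: /path/to/metadata.tsv
        --ID_field: Sample_ID
        --cols_to_use: '"ST","source"'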

QIIME (version 1.9)

qiime_prep

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

A module for preparing fastq reads for analysis with QIIME (1.9):

The reads stored in each sample are optionally joined and then placed in a directory in such a way that the downstream QIIME demult step can concatenate the sequences while keeping track of the sample of origin.

The directory will contain symbolic links to the files to be used by demult in the following step.

Requires

  • fastq files in one of the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

Output

  • Puts directory of links to files to use with QIIME:

    • self.sample_data["project_data"]["qiime.prep_links_dir"]

  • If join is performed:

    • puts the new joined reads in:

      • self.sample_data[<sample>]["fastq.J"]

    • puts the unjoined forward reads in:

      • self.sample_data[<sample>]["fastq.F"]

    • puts the unjoined reverse reads in:

      • self.sample_data[<sample>]["fastq.R"]

Parameters that can be set

Parameter

Values

Comments

join

none, join (or join_cat - not implemented)

Whether to join paired reads.

unjoined

forward, reverse, both or none

What to do with unjoined sequences? Use only forward, only reverse, both or none. If join is none, use this parameter to indicate which reads to take for analysis.

join_algo

Algorithm to use for joining paired-end reads.

parameters

Path to QIIME parameter file to be used downstream

Lines for parameter file

q_prep_1:
    module: qiime_prep
    base: merge1
    script_path: /path/to/join_paired_ends.py
    join: join
    unjoined: forward
    parameters: /path/to/qiime_params.txt
    redirects:
        --pe_join_method: fastq-join

References

Caporaso, J.G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F.D., Costello, E.K., Fierer, N., Peña, A.G., Goodrich, J.K., Gordon, J.I. and Huttley, G.A., 2010. “QIIME allows analysis of high-throughput community sequencing data”. Nature methods, 7(5), pp.335-336.

qiime_demult

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

A module for running QIIME’s multiple_split_libraries_fastq.py:

The reads from step qiime_prep are combined into one seqs.fna file.

Note

The module has not been tested on other types of data, such as undemultiplexed reads. It should work but there will probably be unexpected problems.

Requires

  • A directory of read files with sample names coded in the file names, such as the directory produced by qiime_prep:

    • sample_data["qiime.prep_links_dir"]

Output

  • Puts the resulting seqs.fna file in the following slots:

    • self.sample_data["project_data"]["qiime.demult_seqs"]

    • self.sample_data["project_data"]["qiime.fasta"]

    • self.sample_data["project_data"]["fasta.nucl"]

Lines for parameter file

q_demult_1:
    module: qiime_demult
    base: q_prep_1
    script_path: '/path/to/multiple_split_libraries_fastq.py'
    redirects:
        --demultiplexing_method: sampleid_by_file
        --include_input_dir_path: null
        --parameter_fp: /path/to/qiime_params
        --remove_filepath_in_name: null

References

Caporaso, J.G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F.D., Costello, E.K., Fierer, N., Peña, A.G., Goodrich, J.K., Gordon, J.I. and Huttley, G.A., 2010. “QIIME allows analysis of high-throughput community sequencing data”. Nature methods, 7(5), pp.335-336.

qiime_chimera

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

A module for running QIIME’s identify_chimeric_seqs.py:

The module can operate on the raw seqs.fna or on an aligned version. The latter is used for ChimeraSlayer and the former for usearch61

Requires

  • A fasta file in:

    • sample_data["qiime.fasta"]

  • Alternatively, an aligned fasta file in:

    • sample_data["fasta.aligned"]

Output

  • Puts the resulting list of chimeras in

    • self.sample_data["project_data"]["chimeras"]

  • Puts the filtered fasta file in:

    • self.sample_data["project_data"]["fasta.chimera_removed"]

    • self.sample_data["project_data"]["fasta.nucl"]

Note

When using parallel_identify_chimeric_seqs.py, the module tries to build the scripts appropriately. It is wise to check the parallel scripts before running them…

Parameters that can be set

Parameter

Values

Comments

method

usearch61 or ChimeraSlayer

Method to use for the analysis (passed to the --chimera_detection_method parameter of identify_chimeric_seqs.py)

Lines for parameter file

q_chimera_usrch:
    module: qiime_chimera
    base: q_demult_1
    # script_path: '{Vars.qiime_path}/parallel_identify_chimeric_seqs.py'
    script_path: '{Vars.qiime_path}/identify_chimeric_seqs.py'
    method:         usearch61 # Or ChimeraSlayer. Will guess depending on existing files.
    redirects:
        # --jobs_to_start:              20
        --aligned_reference_seqs_fp:  /path/to/reference_files.otus_aligned
        --reference_seqs_fp:  /path/to/reference_files.otus

References

Caporaso, J.G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F.D., Costello, E.K., Fierer, N., Peña, A.G., Goodrich, J.K., Gordon, J.I. and Huttley, G.A., 2010. “QIIME allows analysis of high-throughput community sequencing data”. Nature methods, 7(5), pp.335-336.

qiime_pick_otus

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

A module for running QIIME’s pick_otus.py

Requires

  • A fasta file in:

    • sample_data["fasta.nucl"]

Output

  • Puts the resulting OTU table in:

    • self.sample_data["project_data"]["otu_table"]

Lines for parameter file

q_pick_otu_1:
    module: qiime_pick_otus
    base: q_chimera_usrch
    script_path: '{Vars.qiime_path}/pick_otus.py'
    setenv: {Vars.qiime_env}

References

Caporaso, J.G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F.D., Costello, E.K., Fierer, N., Peña, A.G., Goodrich, J.K., Gordon, J.I. and Huttley, G.A., 2010. “QIIME allows analysis of high-throughput community sequencing data”. Nature methods, 7(5), pp.335-336.

qiime_pick_rep_set

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

A module for running QIIME’s pick_rep_set.py

Requires

  • A fasta file in:

    • sample_data["fasta.nucl"]

  • An OTU table in:

    • sample_data["otu_table"]

Output

  • Puts the resulting fasta file in:

    • self.sample_data["project_data"]["fasta.nucl"]

  • Saves the original fasta file in:

    • self.sample_data["project_data"]["qiime.full_fasta"]

Lines for parameter file

q_rep_set_1:
    module: qiime_pick_rep_set
    base: q_pick_otu_1
    script_path: '{Vars.qiime_path}/pick_rep_set.py'
    setenv: {Vars.qiime_env}

References

Caporaso, J.G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F.D., Costello, E.K., Fierer, N., Peña, A.G., Goodrich, J.K., Gordon, J.I. and Huttley, G.A., 2010. “QIIME allows analysis of high-throughput community sequencing data”. Nature methods, 7(5), pp.335-336.

qiime_align_seqs

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

A module for running QIIME's align_seqs.py:

Can also be used for the parallel version thereof: parallel_align_seqs_pynast.py

Requires

  • A fasta file in:

    • sample_data["fasta.nucl"]

Output

  • Puts the resulting aligned fasta file in:

    • self.sample_data["project_data"]["fasta.nucl"]

    • self.sample_data["project_data"]["fasta.aligned"]

  • Stores the old, unaligned version in:

    • self.sample_data["project_data"]["fasta.unaligned"]

Note

When using parallel_align_seqs_pynast.py, the module tries to build the scripts appropriately. It is wise to check the parallel scripts before running them…

Lines for parameter file

q_align_para:
    module: qiime_align_seqs
    base: q_rep_set_1
    script_path: '{Vars.qiime_path}/parallel_align_seqs_pynast.py'
    setenv: {Vars.qiime_env}
    redirects:
        --jobs_to_start: 5
        --retain_temp_files: 

References

Caporaso, J.G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F.D., Costello, E.K., Fierer, N., Peña, A.G., Goodrich, J.K., Gordon, J.I. and Huttley, G.A., 2010. “QIIME allows analysis of high-throughput community sequencing data”. Nature methods, 7(5), pp.335-336.

qiime_filter_alignment

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

A module for running QIIME’s filter_alignment.py

Requires

  • A fasta file in:

    • sample_data["fasta.nucl"]

Output

  • Puts the resulting aligned fasta file in:

    • self.sample_data["project_data"]["fasta.nucl"]

  • Saves the original unaligned fasta file in:

    • self.sample_data["project_data"]["fasta.aligned_unfiltered"]

Lines for parameter file

q_filt_align_1:
    module: qiime_filter_alignment
    base: q_align_1
    script_path: '{Vars.qiime_path}/filter_alignment.py'
    setenv: {Vars.qiime_env}

References

Caporaso, J.G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F.D., Costello, E.K., Fierer, N., Peña, A.G., Goodrich, J.K., Gordon, J.I. and Huttley, G.A., 2010. “QIIME allows analysis of high-throughput community sequencing data”. Nature methods, 7(5), pp.335-336.

qiime_assign_taxonomy

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

A module for running QIIME’s assign_taxonomy.py

Can also be used to run the parallel versions of the program:

  • parallel_assign_taxonomy_blast.py

  • parallel_assign_taxonomy_rdp.py

  • parallel_assign_taxonomy_uclust.py

Requires

  • A fasta file in:

    • sample_data["fasta.nucl"]

Output

  • Puts the resulting taxonomy assignment in

    • self.sample_data["project_data"]["taxonomy"]

Note

When using the parallel version, the module tries to build the scripts appropriately. It is wise to check the parallel scripts before running them…

Lines for parameter file

q_tax_asn_1:
    module: qiime_assign_taxonomy
    base: q_rep_set_1
    script_path: '{Vars.qiime_path}/parallel_assign_taxonomy_rdp.py'
    setenv: {Vars.qiime_env}
    redirects:
        --confidence: 0.5
        --id_to_taxonomy_fp: {Vars.reference_files.id_to_taxonomy}
        --jobs_to_start: 20
        --rdp_max_memory: 50000
        --reference_seqs_fp: {Vars.reference_files.otus}

References

Caporaso, J.G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F.D., Costello, E.K., Fierer, N., Peña, A.G., Goodrich, J.K., Gordon, J.I. and Huttley, G.A., 2010. “QIIME allows analysis of high-throughput community sequencing data”. Nature methods, 7(5), pp.335-336.

qiime_make_phylogeny

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

A module for running QIIME’s make_phylogeny.py

Requires

  • A fasta file in:

    • sample_data["fasta.nucl"]

Output

  • Puts the resulting phylogenetic tree in:

    • self.sample_data["project_data"]["phylotree"]

Lines for parameter file

q_phylo_1:
    module: qiime_make_phylogeny
    base: q_filt_align_1
    script_path: '{Vars.qiime_path}/make_phylogeny.py'
    setenv: {Vars.qiime_env}

References

Caporaso, J.G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F.D., Costello, E.K., Fierer, N., Peña, A.G., Goodrich, J.K., Gordon, J.I. and Huttley, G.A., 2010. “QIIME allows analysis of high-throughput community sequencing data”. Nature methods, 7(5), pp.335-336.

qiime_make_otu_table

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

A module for running QIIME’s make_otu_table.py:

The module creates a BIOM table based on the OTU table and a taxonomy assignment, if available (it will be available if qiime_assign_taxonomy is in the branch).

If chimera checking has been performed, the suspected chimeric sequences will be removed from the BIOM table.

The module also adds code for creating a summary of the BIOM table and a tab-delimited version thereof.

Requires

  • An OTU table:

    • sample_data["otu_table"]

Optional

  • A taxonomy assignment of the sequences:

    • sample_data["taxonomy"]

Output

  • Puts the BIOM table in

    • self.sample_data["project_data"]["biom_table"]

  • Puts the BIOM table summary in:

    • self.sample_data["project_data"]["biom_table_summary"]

  • Puts the BIOM table in tab-delimited format in:

    • self.sample_data["project_data"]["biom_table_tsv"]

  • If a fasta.chimera_removed file exists, will put the unfiltered BIOM table in:

    • self.sample_data["project_data"]["unfiltered_biom_table"]

Parameters that can be set

Parameter

Values

Comments

skip_summary

If passed, will not create the BIOM table summary.

skip_tsv

If passed, will not create the tsv version of the BIOM table.

Lines for parameter file

q_mk_otu_1:
    module: qiime_make_otu_table
    base: q_phylo_1
    script_path: '{Vars.qiime_path}/make_otu_table.py'
    setenv: {Vars.qiime_env}
    # skip_summary:
    # skip_tsv:
    redirects:
        --mapping_fp: /path/to/qiime1_mapping.txt

References

Caporaso, J.G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F.D., Costello, E.K., Fierer, N., Peña, A.G., Goodrich, J.K., Gordon, J.I. and Huttley, G.A., 2010. “QIIME allows analysis of high-throughput community sequencing data”. Nature methods, 7(5), pp.335-336.

qiime_filter_samples_from_otu_table

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

A module for running QIIME’s filter_samples_from_otu_table.py

Requires

  • A BIOM table in:

    • sample_data["biom_table"]

Output

  • Puts the resulting BIOM table in:

    • self.sample_data["project_data"]["biom_table"]

  • Puts the BIOM table summary in:

    • self.sample_data["project_data"]["biom_table_summary"]

  • Puts the BIOM table in tab-delimited format in:

    • self.sample_data["project_data"]["biom_table_tsv"]

  • Puts the unfiltered BIOM table in:

    • self.sample_data["project_data"]["prefilter_biom_table"]

Parameters that can be set

Parameter

Values

Comments

skip_summary

If passed, will not create the BIOM table summary.

skip_tsv

If passed, will not create the tsv version of the BIOM table.

Lines for parameter file

filt_samp_1:
    module: qiime_filter_samples_from_otu_table
    base: q_mk_otu_1
    script_path: '{Vars.qiime_path}/filter_samples_from_otu_table.py'
    setenv: {Vars.qiime_env}
    redirects:
        --mapping_fp: /path/to/mapping.txt
        --min_count: 100000

References

Caporaso, J.G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F.D., Costello, E.K., Fierer, N., Peña, A.G., Goodrich, J.K., Gordon, J.I. and Huttley, G.A., 2010. “QIIME allows analysis of high-throughput community sequencing data”. Nature methods, 7(5), pp.335-336.

qiime_filter_otus

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

A module for running QIIME’s filter_otus_from_otu_table.py

Requires

  • A BIOM table in:

    • sample_data["biom_table"]

Output

  • Puts the resulting BIOM table in:

    • self.sample_data["project_data"]["biom_table"]

  • Puts the BIOM table summary in:

    • self.sample_data["project_data"]["biom_table_summary"]

  • Puts the BIOM table in tab-delimited format in:

    • self.sample_data["project_data"]["biom_table_tsv"]

  • Puts the unfiltered BIOM table in:

    • self.sample_data["project_data"]["prefilter_biom_table"]

Parameters that can be set

Parameter

Values

Comments

skip_summary

If passed, will not create the BIOM table summary.

skip_tsv

If passed, will not create the tsv version of the BIOM table.

Lines for parameter file

q_filt_otus_1:
    module: qiime_filter_otus
    base: filt_samp_1
    script_path: '{Vars.qiime_path}/filter_otus_from_otu_table.py'
    setenv: {Vars.qiime_env}
    redirects:
        --min_count_fraction: 0.00005
        --min_samples: 10

References

Caporaso, J.G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F.D., Costello, E.K., Fierer, N., Peña, A.G., Goodrich, J.K., Gordon, J.I. and Huttley, G.A., 2010. “QIIME allows analysis of high-throughput community sequencing data”. Nature methods, 7(5), pp.335-336.

qiime_sort_otu_table

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

A module for running QIIME’s sort_otu_table.py

Requires

  • A BIOM table in:

    • sample_data["biom_table"]

Output

  • Puts the resulting BIOM table in:

    • self.sample_data["project_data"]["biom_table"]

  • Puts the BIOM table summary in:

    • self.sample_data["project_data"]["biom_table_summary"]

  • Puts the BIOM table in tab-delimited format in:

    • self.sample_data["project_data"]["biom_table_tsv"]

Parameters that can be set

Parameter

Values

Comments

skip_summary

If passed, will not create the BIOM table summary.

skip_tsv

If passed, will not create the tsv version of the BIOM table.

Lines for parameter file

q_sort_otus_1:
    module: qiime_sort_otu_table
    base: filt_samp_1
    script_path: '{Vars.qiime_path}/sort_otu_table.py'
    setenv: {Vars.qiime_env}
    redirects:
        --sort_field:   XXX

References

Caporaso, J.G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F.D., Costello, E.K., Fierer, N., Peña, A.G., Goodrich, J.K., Gordon, J.I. and Huttley, G.A., 2010. “QIIME allows analysis of high-throughput community sequencing data”. Nature methods, 7(5), pp.335-336.

qiime_divers

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

A module for running QIIME’s core_diversity_analyses.py:

The diversity analyses are performed on the BIOM table, using the phylogenetic tree if available (it will be available if qiime_make_phylogeny is in the branch).

Requires

  • A BIOM table:

    • sample_data["biom_table"]

Optional

  • A phylogenetic tree:

    • sample_data["phylotree"]

Output

  • Puts the core diversity directory name in

    • self.sample_data["project_data"]["diversity"]

Parameters that can be set

Parameter

Values

Comments

--mapping_fp

A path to the qiime mapping file (if not set, will use the mapping file passed in qiime_prep).

--parameter_fp

A path to a qiime parameter file.

Lines for parameter file

q_divers_1:
    module: qiime_divers
    base: q_filt_otus_1
    script_path: /path/to/QIIME/bin/core_diversity_analyses.py
    qsub_params:
        -pe: shared 20
    sampling_depth: 109897
    redirects:
        --categories: Disease,sex
        --parameter_fp: /path/to/parameter_file

References

Caporaso, J.G., Kuczynski, J., Stombaugh, J., Bittinger, K., Bushman, F.D., Costello, E.K., Fierer, N., Peña, A.G., Goodrich, J.K., Gordon, J.I. and Huttley, G.A., 2010. “QIIME allows analysis of high-throughput community sequencing data”. Nature methods, 7(5), pp.335-336.

QIIME (version 2)

Modules included in this section

Note

The modules were tested on qiime version 2018.11

qiime2_import

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running qiime tools import on various importable types

Note

Tested on qiime2 version 2018.11

Requires

  • If importing reads:

    • sample_data[<sample>]["fastq.F|R|S"]

  • If importing other types, requires that type to be defined in the sample file

    The file can be defined in the sample file either just as a path, or as a path, format pair, as follows:

    Only path:

    EMPSingleEndSequences       /path/to/emp-single-end-sequences
    

    Path, format pair:

    EMPSingleEndSequences       /path/to/emp-single-end-sequences
    EMPSingleEndSequences       EMPPairedEndDirFmt
    

Output:

  • If importing reads, will create the imported artifact in one of:

    • sample_data["project_data"]["SampleData[SequencesWithQuality]"]

    • sample_data["project_data"]["SampleData[PairedEndSequencesWithQuality]"]

  • If importing other types:

    • sample_data["project_data"]["<type imported>"]

Parameters that can be set

Lines for parameter file

Importing paired end reads:

import_reads:
    module:                     qiime2_import
    base:                       trim1
    script_path:                qiime tools import
    redirects:
        --type:                 SampleData[PairedEndSequencesWithQuality]
        --input-format:         PairedEndFastqManifestPhred33

Importing internal types:

merge_data:
    module:         Import
    src:            EMPSingleEndSequences
    trg:            EMPSingleEndSequences
    script_path:    ..import..
    scope:          project

import:
    module:         qiime2_import
    base:           merge_data
    script_path:    qiime tools import

References

Bolyen, E., Rideout, J.R., Dillon, M.R., Bokulich, N.A., Abnet, C., Al-Ghalith, G.A., Alexander, H., Alm, E.J., Arumugam, M. and Asnicar, F., 2018. QIIME 2: Reproducible, interactive, scalable, and extensible microbiome data science. PeerJ Preprints 6:e27295v1 https://doi.org/10.7287/peerj.preprints.27295v1

qiime2_general

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Attention

This module is in Beta version. It is not issue-free and will be improved periodically.

A module to include any QIIME2 plugin method, pipeline or visualization.

The required plugin and method are specified in the script_path line, as they would appear in the command line, e.g.:

script_path:   qiime dada2 denoise-paired

The module will identify the required inputs for the method and extract them from the appropriate slots. If they are not found, an exception will be thrown.

If more than one type is legitimate for a method, and both exist in the project, NeatSeq-Flow will complain. You can either remove the extra type with the manage_types module or specify the type to use with the type parameter, as shown in the sketch below.
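
For example, a minimal sketch (the step name, base and type value are hypothetical) of resolving such an ambiguity by naming the input type explicitly:

dada2_single:
    module:                     qiime2_general
    base:                       import_reads
    script_path:                qiime dada2 denoise-single
    type:                       SampleData[SequencesWithQuality]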

Plugins which require metadata files, passed as argument --m-metadata-file, will look for a file in slot metadata. In order to specify a metadata file in the parameter file, pass the --m-metadata-file in the redirects section.
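
For example, a minimal sketch (the step name, base and path are hypothetical) passing an external metadata file through redirects:

alpha_signif:
    module:                     qiime2_general
    base:                       diversity
    script_path:                qiime diversity alpha-group-significance
    redirects:
        --m-metadata-file:      /path/to/sample_metadata.tsv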

All redirects argument values are searched for in the “project_data”. Thus, you can specify slots to use for redirected arguments.

Requires

Plugin- and method-specific.

Output:

Plugin- and method-specific.

Parameters that can be set

Parameter

Values

Comments

store_output

list of output parameters

These parameters will be stored as file types for use by downstream modules

export_o_params

empty or list of output parameters

If empty, all outputs will be exported, i.e. unzipped with qiime tools export. If list of parameters, only those types will be exported.

Lines for parameter file

DADA2 plugin, with export of stats output:

dada2:                      # Name of this step
    module:                     qiime2_general
    base:                       import
    script_path:                qiime dada2 denoise-single #paired
    export_o_params:
        - --o-denoising-stats
    redirects:
        --p-trim-left:         10
        --p-trunc-len:         100

Classical visualization. Only base and script_path:

dada2_vis_summary:                      # Name of this step
    module:                     qiime2_general
    base:                       dada2
    script_path:                qiime feature-table summarize

Store only particular outputs in type index:

diversity:                      # Name of this step
    module:                     qiime2_general
    base:                       phylogeny
    script_path:                qiime diversity core-metrics-phylogenetic
    export_o_params:                     --o-rarefied-table
    store_output:
        - --o-rarefied-table
        - --o-faith-pd-vector
        - --o-weighted-unifrac-distance-matrix
        - --o-weighted-unifrac-pcoa-results
        - --o-weighted-unifrac-emperor
    redirects:
        --p-sampling-depth:     50000

taxonomy_tabulate:                      # Name of this step
    module:                     qiime2_general
    base:                       taxonomy
    script_path:                qiime metadata tabulate
    redirects:
        --m-input-file:         "{{FeatureData[Taxonomy]}}"

References

Bolyen, E., Rideout, J.R., Dillon, M.R., Bokulich, N.A., Abnet, C., Al-Ghalith, G.A., Alexander, H., Alm, E.J., Arumugam, M. and Asnicar, F., 2018. QIIME 2: Reproducible, interactive, scalable, and extensible microbiome data science. PeerJ Preprints 6:e27295v1 https://doi.org/10.7287/peerj.preprints.27295v1

GATK

GATK_CatVariants

Authors

Michal Gordon

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module to concatenate per-chromosome VCF files into one VCF file for each sample.

Attention

The module generates a script for each sample/chromosome.

The programs included in the module are the following:

  • CatVariants (GATK)

Requires

  • self.sample_data[sample][chr]["GATK_vcf"]

Output

  • self.sample_data[sample]["vcf"]

Parameters that can be set

Parameter

Values

Comments

genome_reference

chrom_list

Comma-separated list of chromosome names as mentioned in the BAM file

Lines for parameter file

GATK_CatVariants1:
    module: GATK_CatVariants
    base: GATK_SelectVariants_VEPfiltered
    script_path:     /path/to/java -cp /path/to/GenomeAnalysisTK.jar org.broadinstitute.gatk.tools.CatVariants
    genome_reference:   /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"

References

Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.‏

GATK_gvcf

Authors

Michal Gordon

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for generating a gVCF file from a BAM file.

Attention

The module generates a script for each sample/chromosome.

The programs included in the module are the following:

  • HaplotypeCaller (GATK)

Requires

  • self.sample_data[sample]["bam"]

Output

  • self.sample_data[sample][chr]["GATK_g.vcf"]

Parameters that can be set

Parameter

Values

Comments

genome_reference

chrom_list

Comma-separated list of chromosome names as mentioned in the BAM file

Lines for parameter file

GATK_gvcf:  # check about -nct for parallelization and deal with memory problem
    module: GATK_gvcf
    base: GATK_pre_processing
    script_path: /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    genome_reference:    /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT" 
    qsub_params:
        -pe:      shared 15
    redirects:
        -nct: 15

References

Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.‏

GATK_hard_filters

Authors

Michal Gordon

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for applying hard filters to a variant callset that is too small for VQSR or for which truth/training sets are not available.

Attention

The module generates a script for each chromosome.

The programs included in the module are the following:

  • SelectVariants and VariantFiltration (GATK)

Requires

  • self.sample_data[chr]["vcf"]

Output

  • self.sample_data[chr]["vcf"]

Parameters that can be set

Parameter

Values

Comments

genome_reference

chrom_list

Comma-separated list of chromosome names as mentioned in the BAM file

filterExpression_SNP

Filter expression for SNPs

filterExpression_INDEL

Filter expression for INDELs

Lines for parameter file

GATK_hard_filters1:
    module: GATK_hard_filters 
    base: GenotypeGVCFs1
    script_path:     /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    genome_reference:   /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT" 
    filterExpression_SNP: '"QD < 2.0 || MQ < 40.0 || FS > 60.0 || SOR > 3.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0"'
    filterExpression_INDEL: '"QD < 2.0 || ReadPosRankSum < -20.0 || FS > 200.0 || SOR > 10.0 || InbreedingCoeff < -0.8"'

References

Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.‏

GATK_merge_gvcf

Authors

Michal Gordon

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for combining g.vcf files into cohorts.

Attention

The module generates a script for each sample/chromosome.

The programs included in the module are the following:

  • CombineGVCFs (GATK)

Requires

  • self.sample_data[sample][chr]["GATK_g.vcf"]

Output

  • self.sample_data["cohorts"]

Parameters that can be set

Parameter

Values

Comments

genome_reference

chrom_list

Comma-separated list of chromosome names as mentioned in the BAM file

cohort_size

Number of g.vcf files to include in each cohort (e.g. with 25 samples and cohort_size: 10, three cohorts will be created).

Lines for parameter file

gatk_merge_gvcf:
    module: GATK_merge_gvcf
    base: GATK_gvcf
    script_path:     /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    genome_reference:    /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    cohort_size: 10
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT" 

References

Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.‏

GATK_pre_processing

Authors

Michal Gordon

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for generating ready-to-GATK-use BAM files from fastq files.

Attention

The module lacks the “base recalibration process (BQSR)” step

The programs included in the module are the following:

  • FastqToSam Picard tool to generate uBAM

  • MarkIlluminaAdapters Picard tool to Mark Illumina Adapters

  • SamToFastq Picard tool uBAM to fastq

  • MergeBamAlignment Picard tool to merge BAM and uBAM

  • MarkDuplicates Picard tool to remove PCR duplicates

  • BWA MEM mapping with BWA MEM

Requires

  • A fastq file in the following locations:

    • self.sample_data[sample]["fastq.F"]

    • self.sample_data[sample]["fastq.R"]

Output

  • self.sample_data[sample]["bam"]

Parameters that can be set

  • picard_path: Full path to the PICARD .jar file

  • bwa_mem_path: Path to bwa mem

  • genome_reference: Path to the reference genome

Lines for parameter file

GATK_pre_processing:
    module: GATK_pre_processing
    base: fQC_trim
    script_path: /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    picard_path:     /path/to/picard.jar
    bwa_mem_path:    /path/to/bwa mem
    genome_reference:    /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    threads: 20
    qsub_params:
        -pe: shared 20

References

http://broadinstitute.github.io/picard/

GATK_SelectVariants

Authors

Michal Gordon

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for splitting the per-chromosome multi-sample VCF into one VCF per sample per chromosome

Attention

The module generates a script for each sample/chromosome.

The programs included in the module are the following:

  • SelectVariants (GATK)

Requires

  • self.sample_data[chr]["vcf"]

Output

  • self.sample_data[sample][chr]["GATK_vcf"]

Parameters that can be set

  • genome_reference: Path to the reference genome

  • chrom_list: Comma-separated list of chromosome names as mentioned in the BAM file

Lines for parameter file

GATK_SelectVariants_VEPfiltered:
    module: GATK_SelectVariants
    base: VEP1
    script_path: /path/to/GenomeAnalysisTK.jar        
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT" 
    genome_reference:   /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta
    redirects:
        --setFilteredGtToNocall: null

References

Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.‏

GATK_VQSR

Authors

Michal Gordon

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for applying VQSR filters

Attention

The module generates a script for each chromosome.

The programs included in the module are the following:

  • VariantRecalibrator and ApplyRecalibration (GATK)

Requires

  • self.sample_data[chr]["vcf"]

Output

  • self.sample_data[chr]["vcf"]

Parameters that can be set

  • genome_reference: Path to the reference genome

  • chrom_list: Comma-separated list of chromosome names as mentioned in the BAM file

  • ts_filter_level_SNP: Truth sensitivity filter level for SNPs

  • ts_filter_level_INDEL: Truth sensitivity filter level for INDELs

  • resource_SNP: Resource datasets for SNP recalibration

  • resource_INDEL: Resource datasets for INDEL recalibration

Lines for parameter file

GATK_VQSR1:
    module: GATK_VQSR 
    base: GenotypeGVCFs1
    script_path:     /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    genome_reference:   /path/to/bundle/b37/human_g1k_v37_decoy.fasta
    resource_SNP: 
        - hapmap,known=false,training=true,truth=true,prior=15.0 /path/to/bundle/b37/hapmap_3.3.b37.vcf
        - omni,known=false,training=true,truth=true,prior=12.0 /path/to/bundle/b37/1000G_omni2.5.b37.vcf
        - 1000G,known=false,training=true,truth=false,prior=10.0 /path/to/bundle/b37/1000G_phase1.snps.high_confidence.b37.vcf
        - dbsnp,known=true,training=false,truth=false,prior=2.0 /path/to/bundle/b37/dbsnp_138.b37.vcf
    resource_INDEL: 
        - mills,known=false,training=true,truth=true,prior=12.0 /path/to/bundle/b37/Mills_and_1000G_gold_standard.indels.b37.sites.vcf
        - dbsnp,known=true,training=false,truth=false,prior=2.0 /path/to/bundle/b37/dbsnp_138.b37.vcf 
    ts_filter_level_SNP: 99.0
    ts_filter_level_INDEL: 99.0
    maxGaussians: 4
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"

References

Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.‏

GenotypeGVCFs

Authors

Michal Gordon

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for performing joint genotyping on gVCF files produced by HaplotypeCaller.

Attention

The module generates a script for each cohort-chromosome combination.

The programs included in the module are the following:

  • GenotypeGVCFs (GATK)

Requires

  • self.sample_data["cohorts"]

Output

  • self.sample_data[chr]["vcf"]

Parameters that can be set

  • genome_reference: Path to the reference genome

  • chrom_list: Comma-separated list of chromosome names as mentioned in the BAM file

Lines for parameter file

GenotypeGVCFs1:
    module: GenotypeGVCFs
    base: gatk_merge_gvcf
    script_path:     /path/to/java -jar /path/to/GenomeAnalysisTK.jar
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT" 
    genome_reference:   /path/to/gatk/bundle/b37/human_g1k_v37_decoy.fasta

References

Van der Auwera, Geraldine A., et al. “From FastQ data to high‐confidence variant calls: the genome analysis toolkit best practices pipeline.” Current protocols in bioinformatics 43.1 (2013): 11-10.‏

Picard_CollectAlignmentSummaryMatrics

Authors

Michal Gordon

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for collecting statistical information about the mapping, using CollectAlignmentSummaryMetrics from Picard tools.

The programs included in the module are the following:

  • CollectAlignmentSummaryMetrics from Picard tools.

Requires

  • A BAM file in the following location:

    • self.sample_data[sample]["bam"]

Output

Parameters that can be set

  • genome_reference: Path to the reference genome

Lines for parameter file

Picard_CollectAlignmentSummaryMatrics1:
    module: Picard_CollectAlignmentSummaryMatrics
    base: GATK_pre_processing
    script_path: /path/to/java -jar /path/to/picard-1.139/dist/picard.jar
    genome_reference:    /path/to/bundle/b37/human_g1k_v37_decoy.fasta

References

http://broadinstitute.github.io/picard/

Picard_CollectVariantCalling

Authors

Michal Gordon

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for generating SNP and indel statistics

The programs included in the module are the following:

  • CollectVariantCallingMetrics (Picard) to generate a collection of metrics relating to SNPs and indels within a variant-calling file (VCF)

Requires

  • A VCF file in the following location:

    • self.sample_data[chr]["vcf"]

Output

Lines for parameter file

Picard_CollectVariantCalling1:
    module: Picard_CollectVariantCalling 
    base: GATK_hard_filters1
    script_path: /path/to/java -jar /path/to/picard.jar
    DBSNP: /path/to/bundle/b37/dbsnp_138.b37.vcf
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT"

References

http://broadinstitute.github.io/picard/

VEP

Authors

Michal Gordon

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for annotating the multi-sample VCF file

Attention

The module generates a script for each chromosome.

The programs included in the module are the following:

  • VEP (Variant Effect Predictor)

Requires

  • self.sample_data[chr]["vcf"]

Output

  • self.sample_data[chr]["vcf"] - annotated multi-VCF per chromosome

Parameters that can be set

  • chrom_list: Comma-separated list of chromosome names as mentioned in the BAM file

Note

VEP parameters can be passed via redirects

Lines for parameter file

VEP1:
    module: VEP 
    base: GATK_hard_filters1
    script_path: /path/to/vep
    chrom_list: "1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, X, Y, MT" 
    redirects:
        --format: vcf
        --offline: null
        --species: homo_sapiens
        --fork: 10
        --assembly: GRCh37
        --max_af: null
        --pick: null
        --dir: /path/to/VEP/ensembl-vep-release-88.10/cache
        --check_existing: null
        --symbol: null
        --force_overwrite: null
        --vcf: null

References

McLaren, William, et al. “The ensembl variant effect predictor.” Genome biology 17.1 (2016): 122.‏

Sequence Clustering

Modules included in this section

cd_hit

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for clustering with cd-hit/cd-hit-est:

This module runs both cd-hit and cd-hit-est. The type of sequence (nucl or prot) will be determined by the program supplied in script_path.

You must make sure that the required file exists: if clustering prot sequences with cd-hit, make sure there is a fasta.prot file, etc. (see the sketch below).

CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences, Weizhong Li & Adam Godzik. Bioinformatics, (2006) 22:1658-1659

CD-HIT: accelerated for clustering the next generation sequencing data, Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu & Weizhong Li. Bioinformatics, (2012) 28:3150-3152
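
A minimal sketch of clustering protein sequences: supplying cd-hit (rather than cd-hit-est) as script_path makes the module use the protein fasta. The step and base names are hypothetical:

clust_prot:
    module: cd_hit
    base: predict_proteins    # hypothetical step producing a fasta.prot file
    script_path: '/path/to/cd-hit'
    scope: sample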

Requires

  • fasta files in the following slot (scope = sample):

    • sample_data[<sample>]["fasta.nucl"|"fasta.prot"]

  • fasta files in the following slot (scope = project):

    • sample_data["fasta.nucl"|"fasta.prot"]

Output

  • Puts the output fasta file in the fasta slot:

    self.sample_data[<sample>]["fasta.nucl"|"fasta.prot"]

  • Or

    self.sample_data["project_data"]["fasta.nucl"|"fasta.prot"]

Parameters that can be set

  • scope (project | sample): Indicates whether to use a project or sample fasta.

Lines for parameter file

clust_proj:
    module: cd_hit
    base: derepel_proj
    script_path: 'path/to/cd-hit-est'
    qsub_params:
        -pe: shared 40
    scope: project
    redirects:
        -T: 40

References

Fu, L., Niu, B., Zhu, Z., Wu, S. and Li, W., 2012. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28(23), pp.3150-3152.

vsearch_cluster

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running vsearch clustering:

The reads stored in fasta files are clustered with one of the 3 methods available: cluster_fast, cluster_size or cluster_smallmem.

Note

At the moment this works on the nucl fasta only. See https://github.com/torognes/vsearch/issues/42

Output types are defined with the outputs parameter which can be a comma separated list of the following:

biomout,mothur_shared_out,otutabout,profile,uc

Fasta output files are defined with the fasta_outputs parameter which can be a comma separated list of the following:

centroids,consout,msaout

By default, the centroids file is stored in the fasta slot. Change this by setting store_fasta to one of the types listed above, i.e. centroids,consout or msaout

Requires

  • fasta files in the following slot (scope = sample):

    • sample_data[<sample>]["fasta.nucl"]

  • fasta files in the following slot (scope = project):

    • sample_data["fasta.nucl"]

Output

  • Puts required output in similarly named slots, e.g.:

    self.sample_data[<sample>]["vsearch.centroids"] or self.sample_data["project_data"]["vsearch.centroids"]

  • Puts the required fasta in the fasta slot:

    self.sample_data[<sample>]["fasta.nucl"] or self.sample_data["project_data"]["fasta.nucl"]

Parameters that can be set

  • outputs (biomout,mothur_shared_out,otutabout,profile,uc): List of outputs other than fasta-type outputs (see fasta_outputs)

  • fasta_outputs (centroids,consout,msaout): A list of fasta types to produce.

  • store_fasta (centroids|consout|msaout): The fasta type to store in the fasta slot

  • scope (project | sample): Indicates whether to use a project or sample nucl fasta.

Lines for parameter file

clust_proj:
    module: vsearch_cluster
    base: derepel_proj
    script_path: '{Vars.vsearch_path}/vsearch'
    qsub_params:
        -pe: shared 40
    fasta_outputs: centroids,consout
    outputs: uc
    store_fasta: centroids
    scope: project
    type: cluster_fast
    redirects:
        --id: 0.85  # From ipyrad defaults
        --qmask: dust
        --strand: both
        --threads: 40
        --sizein:
        --sizeout:

References

Rognes, T., Flouri, T., Nichols, B., Quince, C. and Mahé, F., 2016. VSEARCH: a versatile open source tool for metagenomics. PeerJ, 4, p.e2584.

vsearch_derepel

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running vsearch read dereplication:

Performs dereplication on fastq and fasta files.

Note

Dereplication with vsearch is not defined on paired end reads.

At the moment, this module is defined only for fasta.nucl or for fastq.S.

Requires

  • fastq files in the following slots:

    • sample_data[<sample>]["fastq.S"]

  • or fasta files the following slot:

    • sample_data[<sample>]["fasta.nucl"]

Output

  • Puts output fasta file in the following slots:

    • self.sample_data[<sample>]["fasta.nucl"]

    • self.sample_data[<sample>]["vsearch_derepl"]

Parameters that can be set

  • scope (sample | project): Which file to use for dereplication: sample-wise or project-wise files

  • uc: Save UCLUST-like dereplication output? (see --uc in the manual)

  • type (derep_fulllength | derep_prefix): Type of dereplication strategy. See the manual.

Lines for parameter file

derepel_proj:
    module: vsearch_derepel
    base: merge_proj
    script_path: '{Vars.vsearch_path}/vsearch'
    scope: project
    type: derep_fulllength
    uc: 
    redirects:
        --sizein:
        --sizeout:

References

Rognes, T., Flouri, T., Nichols, B., Quince, C. and Mahé, F., 2016. VSEARCH: a versatile open source tool for metagenomics. PeerJ, 4, p.e2584.

Various Reporting Programs

Modules included in this section

NGSplot

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running NGSplot:

Runs NGSplot on existing sorted BAM files.

Please make sure the BAM is sorted, e.g. by running the samtools module beforehand.

If this is a ChIP-seq experiment and you have controls defined, it will also run NGSplot for the sample:control comparison.

At the moment, the module works only at the sample scope. (BAM files in the project scope are rare!)

Requires

  • BAM files in the following slots:

    • sample_data[<sample>]["bam"]

Output

  • Puts output NGS reports in the following slots:

    • self.sample_data[<sample>]["NGSplot"]

  • For ChIP-seq data, puts comparison reports in

    • self.sample_data[<sample>]["NGSplot_vs_control"]

Parameters that can be set

  • setenv (NGSPLOT=/path/to/ngsplot): Running NGSplot requires setting this environment variable.

Lines for parameter file

NGSplot_genebody:
    module:             NGSplot
    base:               sam_base
    script_path:        Rscript /path/to/ngsplot-2.61/bin/ngs.plot.r
    setenv:             NGSPLOT=/path/to/ngsplot-2.61
    redirects:
        -G:             mm10
        -R:             genebody
        -P:             20
        -GO:            hc
    qsub_params:
        -pe:            shared 20

References

Shen, L., Shao, N., Liu, X. and Nestler, E., 2014. ngs.plot: Quick mining and visualization of next-generation sequencing data by integrating genomic databases. BMC genomics, 15(1), p.284.

Multiqc *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for preparing a MultiQC report for all samples.

Tip

By default, the module will search for parsable reports in the directories of all the modules in the branch leading to this instance. To search only in the directories of the explicit base steps, specify the bases_only parameter.

Requires

  • No real requirements. The report will contain information only if at least one of the base steps produces reports that MultiQC can read, e.g. fastqc, bowtie2, samtools, etc.

Output

  • puts report dir in the following slot:

    • self.sample_data[<sample>]["Multiqc_report"]

Parameters that can be set

  • bases_only: Search directories of explicit base steps only.

Lines for parameter file

firstMultQC:
    module: Multiqc
    base:
        - sam_bwt2_1
        - fqc_trim1
    bases_only:
    script_path: /path/to/multiqc

References

Ewels, P., Magnusson, M., Lundin, S. and Käller, M., 2016. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19), pp.3047-3048.

Collect_results

Authors

Liron Levin

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

Short Description

A module to collect and merge/append results from all base step directories. For each base step, the module searches the step's result directories for all results files matching a common name pattern [regular expression]. The sample name can be inferred for each result file from the parent directory name and added to the merged file [as a new column named "Samples"]. All the results files are appended [by default] or merged by a common column name. Each merged file can then be converted individually to a pivot table file.

Requires

  • Tab-delimited files with a common name pattern, found within the base step data directories; for example, files ending with .out

Output

  • Generates merged tab-delimited files: one file for each base step, ending with .merg

  • Can also generate an Excel file with a sheet for each base step

  • Puts the results file in:

    self.sample_data["project_data"]["results"]

Parameters that can be set

See the annotated lines for the parameter file below.

Comments

  • The following Python packages are required: pandas, openpyxl

Lines for parameter file

Step_Name:                            # Name of this step
    module: Collect_results           # Name of the used module
    base:                             # Name of the step [or list of names] to run after and collect results from [must be after a merge step]
    script_path:                      # Command for running the merging script
                                      # If this line is empty or missing it will try using the module's associated script
    redirects:
        -R:                           # Regular expression to find result files
        --Merge_by:                   # Merge files by common column
        --header:                     # Don't use a header row, use integers instead [0,1,2,3...], easy to use with --pivot option
        --Excel:                      # Collect all results to excel file split by sheets
        --add_samples_names:          # Infer and add samples names from file parent directory to "Samples" column
        --pivot:                      # Convert to pivot table by [index columns values]
                                      # When used with the options --add_samples_names and --header it is possible to use: 'Samples' '5' '0'
        --MetaData:                   # Use external MetaData file as the base for merging
        --split_by:                   # Split the data in the columns [index <columns> values] before pivot
        --sep:                        # Columns separator for input file
        -T:                           # Write Transpose output
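
For instance, a minimal sketch of an instance that collects samtools flagstat reports into a single table and an Excel file. The step name, the base name and the .flagstat name pattern are hypothetical and should be adapted to your workflow:

Collect_flagstat:
    module: Collect_results
    base: samtools_flagstat           # hypothetical base step producing the reports
    script_path:                      # left empty to use the module's associated script
    redirects:
        -R: '.*\.flagstat$'           # regular expression matching the result files (hypothetical pattern)
        --add_samples_names:          # add a "Samples" column inferred from the parent directories
        --Excel:                      # also write an Excel file with a sheet per base step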

Tree_plot

Authors

Liron Levin

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

Short Description

A module for plotting a tree file in Newick format together with metadata and, optionally, additional matrix information.

Requires

  • A tree file in newick format in:

    self.sample_data["project_data"]["newick"]

  • A tab-delimited file with sample names in one of the columns, from:

    self.sample_data["project_data"]["MetaData"], self.sample_data["project_data"]["results"], or from an external file.

Output

  • Generates a PDF file of the tree with the metadata information

Parameters that can be set

See the annotated lines for the parameter file below.

Comments

  • The following R packages are required: optparse, ape, ggtree, openxlsx

Lines for parameter file

Step_Name:                            # Name of this step
    module: Tree_plot                 # Name of the used module
    base:                             # Name of the step [or list of names] to run after and generate a Tree plot [must be after a tree making step]
                                      # If more than one base is specified, the first overwrites overlapping slots from the other bases
    script_path:                      # Command for running the Tree plot script
                                      # If this line is empty or missing it will try using the module's associated script
    iterate_on_bases:                 # If set will iterate over the step's bases and generate a plot for each base. 
    tree_by_heatmap:                  # Generate additional tree using Hierarchical Clustering of the heatmap
    redirects:
        --layout:                     # Tree layout [fan or rectangular (default)]
        --Meta_Data:                  # Path to tab-delimited Meta Data file with header line. 
                                      # If this line is empty or missing it will try searching for results data.
        --M_Excel:                    # If the Meta_Data input is an Excel file indicate the sheet name to use
        --ID_field:                   # Column name in the Meta Data file for IDs found in the tips of the tree
        --cols_to_use:                # Columns in the Meta Data file to use and the order from the center up  
        --open.angle:                 # Tree open angle.
        --branch.length:              # Don't use branch length [cladogram]
        --conect.tip:                 # Connect the tip to its label
        --pre_spacer:                 # Space before the label text [default=0.05]
        --post_spacer:                # Space after the label text [default=0.01]
        --OTU:                        # Column name in the Meta Data file to use as OTU annotation
        --labels:                     # Use branch length labels
        --Tip_labels:                 # Show tip labels
        --heatmap:                    # Path to Data file to generate a heatmap 
                                      # If this line is empty it will try searching for results data.
        --H_Excel:                    # If the heatmap input is an Excel file indicate the sheet name to use
        --heatmap_cell_border:        # Color of heatmap cell border [default='white']
        --heatmap_lowest_value:       # Color of heatmap lowest value [default='white']
        --heatmap_highest_value:      # Color of heatmap highest value [default='red']
        --cols_to_use_heatmap:        # Columns in the heatmap Data file to use and the order from the center up
        --ID_heatmap_field:           # Column name for IDs found in the tips of the tree in the heatmap Data file
        --heatmap_variable:           # Use only variable columns in the heatmap
        --heatmap_count_by_sep:       # Count the sep in each cell to generate the values for the heatmap
        --heatmap_HC_dist:            # The heatmap Hierarchical Clustering dist method
        --heatmap_HC_agg:             # The heatmap Hierarchical Clustering agglomeration method
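
For instance, a minimal sketch of an instance that plots the tree with a rectangular layout and tip labels, taking the metadata from the results slot. The step name, the base name and the ID column are hypothetical:

Tree_plot1:
    module: Tree_plot
    base: tree_step                   # hypothetical tree-making base step
    script_path:                      # left empty to use the module's associated script
    redirects:
        --layout: rectangular
        --ID_field: Samples           # hypothetical column holding the IDs found in the tree tips
        --Tip_labels: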

BUSCO

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for running BUSCO.

BUSCO searches for predefined sequences in an assembly. See the BUSCO website.

This module creates scripts for running BUSCO on a fasta file against a BUSCO lineage database.

The lineage can be specified in two ways:

  1. Specify the path to the lineage file with the --lineage redirected argument.

  2. Specify the URL of the database (e.g. http://busco.ezlab.org/v2/datasets/eukaryota_odb9.tar.gz). The file will be downloaded and unzipped.

Requires

  • fasta files in one of the following slots for sample-wise BUSCO:

    • sample_data[<sample>]["fasta.nucl"]

    • sample_data[<sample>]["fasta.prot"]

  • or fasta files in one of the following slots for project-wise BUSCO:

    • sample_data["fasta.nucl"]

    • sample_data["fasta.prot"]

Output

  • Stores output directory in:

    • self.sample_data[<sample>]["BUSCO"] (scope = sample)

    • self.sample_data["project_data"]["BUSCO"] (scope = project)

Parameters that can be set

  • scope (sample | project): Use a sample or project scope fasta file.

  • get_lineage: Path to one of the lineages to download from https://busco.ezlab.org/frame_wget.html. It will be downloaded, unzipped and used if no --lineage is passed.

Lines for parameter file

Run BUSCO on project-scope fasta file, using a pre-downloaded BUSCO database:

BUSCO1:
    module:             BUSCO
    base:               Trinity_assembl
    script_path:        {Vars.paths.BUSCO} 
    scope:              project
    redirects:
        --mode:         transcriptome
        --lineage:      {Vars.databases.BUSCO}
        --cpu:          65
        --force:
        --restart:

Run BUSCO on project-scope fasta file, including downloading the BUSCO database:

BUSCO1:
    module:             BUSCO
    base:               Trinity_assembl
    script_path:        {Vars.paths.BUSCO}
    scope:              project
    get_lineage:        http://busco.ezlab.org/v2/datasets/eukaryota_odb9.tar.gz
    redirects:
        --mode:         transcriptome
        --cpu:          65
        --force:
        --restart:

Miscellaneous Modules

manage_types *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for managing file types without script creation.

Supports adding, deleting, copying and moving file types.

Requires

Output

Parameters that can be set

  • operation (add|del|mv|cp): The operation to perform on the file type dictionary

  • scope (project|sample): The scope on which to perform the operation. For ‘mv’ and ‘cp’ this is the source scope

  • type: The type on which to perform the operation. For ‘mv’ and ‘cp’ this is the source type

  • scope_trg (project|sample): The destination scope for ‘mv’ and ‘cp’ operations

  • type_trg: The destination type for ‘mv’ and ‘cp’ operations

  • path: For the ‘add’ operation, the value to insert in the file type.

Attention

The operations do NOT operate on the actual files! They only modify internal file types index.

Tip

You can combine several operations in one module instance by passing lists to the parameters in the table above. All lists should be of the same length, or of length 1 (i.e. plain strings); plain strings are applied to all operations. For example, to delete one file type and add another, both at the project scope, pass [del,add] to the ‘operation’ parameter and ‘project’ to the ‘scope’ parameter. The ‘path’ can also be a plain string; it will be applied to ‘del’ as well, but ignored by it. See the example lines below.

Lines for parameter file

manage_types1:
    module:             manage_types
    base:               STAR_bld_ind
    script_path:        
    scope:              project
    operation:          mv
    type:               trinity.contigs
    type_trg:           trinity.contigs
    scope_trg:          sample

manage_types2:
    module:             manage_types
    base:               trinity1
    script_path:
    scope:
        - project
        - sample
        - sample
        - project
    operation:
        - mv
        - del
        - cp
        - add
    type:
        - fasta.nucl
        - fasta.nucl
        - fastq.F
        - bam
    type_trg:           [transcripts.nucl, None, fastq.main, None]
    scope_trg:          sample
    path:               /path/to/mapping.bam

merge_table

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for merging sample tables into a single project-wide table, or into group tables by category.

The table can be with or without a header line.

Can be used for merging fasta and fastq files as well.

Important

When merging by category, the sample names will be set to the category level names for all subsequent steps.

Tip

You can merge several types at once by passing them as a list to type. If the type files have different numbers of header lines, pass a list of header line numbers with header. The header list must be of length 1 or identical to the length of type.

The extension of the resulting file will be the same as that of the files being merged, if they are all the same; otherwise, no extension is added. To change this default behaviour, set an ext parameter with the extension to use, e.g. fna. If several types are being merged and ext is a string, that string will be used for all types; for a different extension per file type, use a list of strings in the same order as the type parameter, as in the sketch below.
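
A minimal sketch of merging two types at once, with a per-type header count and extension. The step name and the type names are illustrative; list values are written in YAML flow style, as in the manage_types examples:

merge_two_types:
    module:         merge_table
    base:           merge1
    script_path:
    scope:          project
    type:           [blast.prot, fasta.nucl]
    header:         [1, 0]
    ext:            [tsv, fna]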

Attention

If you split sample-scope fasta files with the fasta_splitter or split_fasta modules, the new subsamples are stored with a source category containing the name of the sample from which each subsample was produced. When merging back into the sample scope, use scope: group and category: source, as in the sketch below.
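
A minimal sketch of such a merge, assuming a previous split_fasta1 step produced the subsamples:

merge_by_source:
    module:         merge_table
    base:           split_fasta1
    script_path:
    scope:          group
    category:       source
    type:           fasta.nucl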

Requires

  • A table file in any slot:

    • sample_data[<sample>][<file.type>]

Output

  • Puts output files in the following slot:

    • sample_data["project_data"][<file.type>]

  • Or, for merging by category, in the following slot:

    • sample_data[category_level][<file.type>]

Parameters that can be set

  • type: A file type that exists in all samples. Can also be a list of types, each of which will be merged independently

  • script_path: Leave blank

  • scope (project|group): Merge all samples into one project table, or merge sample tables by category.

  • category: If scope is set to group, you must specify the category by which to group the samples for merging. The category must be a string containing one of the categories (columns) in the mapping file

  • header (e.g. 0): The number of header lines each table has. The header will be used for the complete table and all other headers will be removed. If there is no header line, set to 0 or leave out completely. If set but left empty, it defaults to 1.

  • ext: The extension to use for the merged file. If type is a list, ext will be used for all types unless ext itself is a list of the same length as type.

  • add_filename: If set, the source filename will be appended to each line in the resulting table.

Lines for parameter file

Merge sample-scope tables into single project-scope table:

merge_blast_tables:
    module:         merge_table
    base:           merge1
    script_path:
    scope:          project
    type:           blast.prot
    header:         0

Merge sample-scope tables into group-scope table, by category country:

merge_blast_tables:
    module:         merge_table
    base:           merge1
    script_path:
    scope:          group
    category:       country
    type:           blast.prot
    header:         0

split_fasta

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for splitting fasta files into parts.

Convenient for parallelizing processes on the cluster. You can take a project-wide fasta file (such as a transcriptome), split it into sub-fasta files, and run various processes on the sub-files.

The parts can then be combined with merge_table module, which can concatenate any type of file.

Important

When splitting sample-scope fasta files, the subsamples are stored with a source category set to the original sample name. You can use this for merging results at the sample scope downstream. See documentation for merge_table.

Requires

  • A fasta file in one of the following slots (scope = “project”):

    • sample_data["project_data"]["fasta.nucl"]

    • sample_data["project_data"]["fasta.prot"]

  • A fasta file in one of the following slots (scope = “sample”):

    • sample_data[<sample>]["fasta.nucl"]

    • sample_data[<sample>]["fasta.prot"]

Output

  • Puts output files in the following slots:

    • sample_data[<sample>]["fasta.nucl"]

    • sample_data[<sample>]["fasta.prot"]

  • For sample scope, the original sample list will be overridden with the new sample list.

Parameters that can be set

  • type (nucl|prot): The type of fasta file to split

  • subsample_num: Number of fragments

Lines for parameter file

split_fasta1:
    module:         split_fasta
    base:           Trinity1
    script_path:    
    type:           nucl
    subsample_num:      4

fasta_splitter

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for splitting fasta files into parts, using fasta-splitter.pl.

Convenient for parallelizing processes on the cluster. You can take a project-wide fasta file (such as a transcriptome), split it into sub-fasta files, and run various processes on the sub-files.

The parts can then be combined with merge_table module, which can concatenate any type of file.

Attention

The module ships with fasta-splitter.pl version 0.2.6, 2017-08-01.

Leave script_path empty to use the perl script provided. Perl must be in the path!

To use a different version, supply it via script_path.

Usage:

Usage: fasta-splitter [options] <file>...
    Options:
        --n-parts <N>        - Divide into <N> parts
        --part-size <N>      - Divide into parts of size <N>
        --measure (all|seq|count) - Specify whether all data, sequence length, or
                               number of sequences is used for determining part
                               sizes ('all' by default).
        --line-length        - Set output sequence line length, 0 for single line
                               (default: 60).
        --eol (dos|mac|unix) - Choose end-of-line character ('unix' by default).
        --part-num-prefix T  - Put T before part number in file names (def.: .part-)
        --out-dir            - Specify output directory.
        --nopad              - Don't pad part numbers with 0.
        --version            - Show version.
        --help               - Show help.

You cannot use the --part-size method, since it would result in an unknown number of files, which NeatSeq-Flow does not support.

Please do not use the --nopad parameter. There is no reason to…

Important

When splitting sample-scope fasta files, the subsamples are stored with a source category set to the original sample name. You can use this for merging results at the sample scope downstream. See documentation for merge_table.

Requires

  • A fasta file in one of the following slots (scope = “project”):

    • sample_data["project_data"]["fasta.nucl"]

    • sample_data["project_data"]["fasta.prot"]

  • A fasta file in one of the following slots (scope = “sample”):

    • sample_data[<sample>]["fasta.nucl"]

    • sample_data[<sample>]["fasta.prot"]

Output

  • Puts output files in the following slots:

    • sample_data[<sample>]["fasta.nucl"]

    • sample_data[<sample>]["fasta.prot"]

  • For sample scope, the original sample list will be overridden with the new sample list.

Parameters that can be set

  • type (nucl|prot): The type of fasta file to split

  • --n-parts (via redirects): Number of fragments

Lines for parameter file

split_fasta1:
    module:         fasta_splitter
    base:           Trinity1
    script_path:
    type:           nucl
    redirects:
        --n-parts:      4
        --measure:      seq

References

http://kirill-kryukov.com/study/tools/fasta-splitter

ProjectToSample *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A utility module for moving project data to a sample, and back again. It is useful when a module that works on sample data has to be executed on data in the project scope.

For instance, in the STAR 2 pass pipeline, the first stage involves aligning all reads to the reference in order to find splice junctions. The reads can be merged into project-scope fastq.F and fastq.R slots, but all aligners take their reads from the sample scope!

This module overrides the sample list with a single sample containing the project slots (or a subset of the slots). The mapping modules will then take the project-wide reads from the sample representing the project.

Recovering the old sample list is done by setting the direction parameter to smp2proj.

See the STAR2pass workflow for a working example.

Usually, the module should be called twice: once in the proj2smp direction and then in the smp2proj direction. Although it is possible to use smp2proj to move data from sample sample_name to the project, it is better to perform that operation with the manage_types module.

Requires

Output

Parameters that can be set

  • direction (proj2smp|smp2proj): Move project info to sample, or vice versa

  • type: The types to operate on

  • operation (cp|mv): Whether to move the slots or just copy them.

  • sample_name: The name of the new sample to create, or the sample to copy from. Defaults to the project title

Attention

This module does NOT operate on the actual files! It only modifies the internal file types index.

Lines for parameter file

Moving from project to sample:

ProjectToSample:
    module:     ProjectToSample
    base:       merge_table
    script_path:
    direction:  proj2smp
    # sample_name:    fromproj
    operation:  mv   # mv or cp
    type:       [fastq.F,fastq.R]

Copying from sample to project:

SampleToProject:
    module:     ProjectToSample
    base:       STAR_map_proj
    script_path:
    direction:  smp2proj
    operation:  mv   # mv or cp
    type:       SJ.out.tab

Copying and moving from sample to project (just for the example; not necessarily practical):

SampleToProject:
    module:     ProjectToSample
    base:       STAR_map_proj
    script_path:
    direction:  smp2proj
    operation:  [cp, mv, mv]   # mv or cp
    type:       [SJ.out.tab, fastq.F, fastq.R]

Generic Modules

The generic modules, called Generic and Fillout_Generic, do not contain a definition of input and output file types; therefore, the user has to specify the input and output file types in the parameter file.

  • Generic is simpler to use for defining most Linux programs, and has extra file type management capabilities.

  • Fillout_Generic can incorporate more than one command per step, as well as cater to irregular program calls, such as calls including complex pipes; however, using it is slightly more complicated.

Modules included in this section

Generic

Authors

Liron Levin

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Short Description

A generic module that enables the user to design new modules that can handle most cases.

Requires

  • In this module, the user defines the required file types in the inputs section

Output

  • In this module, the user defines the output file types in the outputs section

  • The scope of the output file types is determined by the module scope

Parameters that can be set

  • scope (sample/project): The scope of this module can be sample or project; the default is sample

  • shell (csh/bash): Type of shell [csh OR bash]. bash is the default; only bash can be used in a conda environment

  • inputs_last: The inputs arguments will be placed at the end of the command

Comments

The order of the input/output arguments in the final command will be according to the order of their appearance in the parameter file. The redirect arguments are always first.

Example of usage and implementation of the generic module:

Attention

Generic Module Example

A generic module instance is used to generate a BLAST database for each sample, and a subsequent generic instance queries each database with sequences from an external FASTA file. This is a typical use of BLAST in many biological scenarios, such as searching for virulence/resistance genes (whose sequences are in the external FASTA file) in bacterial genomes.

A. A generic module instance generates a BLAST database (using makeblastdb) from each sample. This step can be used after (base:) any step that creates a nucleotide FASTA file (File_Type: fasta.nucl), e.g. after merge (if the raw files are in nucleotide FASTA format) or after a de novo assembly step. The location of the BLAST database for each sample is saved as a blast_db file type (File_Type: blast_db) for downstream use.

B. A generic module instance performs a BLAST search (tblastn) of an external query protein fasta file (-query: path to the query protein fasta file) against the previously generated BLAST database of each sample. This step can be used after the Make_BLAST_DB step (base: Make_BLAST_DB). The user can pass additional parameters directly to the program in the redirects section (e.g. -dbtype, -evalue, -num_descriptions etc.). A sketch of the two steps follows.
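
A sketch of the two steps described above. The step names, file type names, paths and parameter values are illustrative, not fixed by the module:

Make_BLAST_DB:                        # Step A: build a BLAST database per sample
    module: Generic
    base: merge1
    script_path: makeblastdb
    scope: sample
    inputs:
        -in:
            File_Type: fasta.nucl     # nucleotide FASTA created by a previous step
    outputs:
        -out:
            File_Type: blast_db       # save the database location for downstream steps
    redirects:
        -dbtype: nucl

BLAST_search:                         # Step B: query each database with an external FASTA
    module: Generic
    base: Make_BLAST_DB
    script_path: tblastn
    scope: sample
    inputs:
        -db:
            File_Type: blast_db       # the database generated in step A
    outputs:
        -out:
            File_Type: blast.results  # hypothetical File_Type name for downstream use
    redirects:
        -query: /path/to/query_protein.fasta
        -evalue: 0.0001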

Lines for parameter file

Step_Name:                      # Name of this step
    module: Generic             # Name of module
    base:                       # Name of the step [or list of names] to run after [must be after steps that generate the inputs File_Types]
    script_path:                # Main command for this module
    scope:                      # The scope of this module could be sample/project, the default is by sample
    shell:                      # Type of shell [csh OR bash]. bash is the default. only bash can be used in conda environment  
    arg_separator:              # The separator between the arguments and values [The default is space].
    inputs_last:                # The inputs arguments will be at the end of the command. [The default is inputs arguments at the beginning of the command]
    command_order:              # The order of the command parts as string default 'redirects,inputs,outputs' ignored if inputs_last is set.
    use_base_dir:               # Use the base step directory as the output for this step, it is possible to specify the base to use.
    cd:                         # Change current working directory to the output location.
    no_sample_dir:              # In Sample Scope: will NOT create a dedicated folder for each sample and the location of the base folder will be stored
                                # in a project level 'base_dir' File_Type
    remove_subsamples:          # Will remove subsamples created by previous steps (split_fasta for example)
    subsamples_string:          # A string to identify a subsample, all subsample will start with this string. [default: 'subsample']
    inputs:                     # The inputs for this module
        STR:                    # Input argument, e.g. -i, --input [could be also 'empty1', 'empty2'.. for no input argument string]
            scope:              # The scope of this input argument could be sample/project
                                # If the module scope is project and the argument scope is sample:
                                # all the samples inputs File_Types of this argument will be listed as: [input argument] [File_Type(sample#)] e.g. -i sample1.bam -i sample2.bam ... 
            File_Type:          # The input File_Type could be any File_Type available from previous (in this branch) steps
                                # It is possible to indicate more than one File_Type, separated by commas: 'fastq.F,fastq.R'
            base:               # From which previous step to take the input File_Type. The default is the current step.
            sep:                # If the module scope is project and the argument scope is sample:  
                                #       All the samples inputs File_Types of this argument will be listed delimited by sep. e.g. [sep=,] -i sample1.bam,sample2.bam ... 
                                # If more than one File_Type was specified, the inputs File_Types of this argument will be listed delimited by sep.
            prefix:             # A prefix for this input argument file name
            preprefix:          # A prefix for this input argument file location
            suffix:             # A suffix for this input argument file name
            use_dirname:        # Use only the input directory; add suffix for a constant file name and prefix to add a string before the input directory
            del:                # Delete the files in the input File_Type after the step ends [use to save space for large files you don't need downstream]
                                # Will generate empty file with the same name and a suffix of _DELETED
            constant_value:     # Use a constant value instead of "File_Type".
                                # It is the same as the "redirects".
                                # Use when the order of the inputs is important!
                                # Use '{{sample_name}}' to be replaced with the sample name (or the project name in project scope)
                                # Using the constant_value option will override all other input arguments!
    outputs:                    # The outputs for this module
        STR:                    # Output argument, e.g. -o, --out , the scope of the output arguments is determinant by the module scope
                                # could be also 'empty1', 'empty2'.. for no output argument string OR 'No_run1', 'No_run2'.. for only entering the file information to output File_Type
            File_Type:          # The output File_Type could be any File_Type name for downstream work in the current branch
                                # If the File_Type exists, its content will be overridden for downstream work in the current branch
            prefix:             # A prefix for this output argument file name
            suffix:             # A suffix for this output argument file name
                                # between prefix and suffix will be the sample name [in sample scope] or the project title [in project scope] 
            constant_file_name: # Use constant file name for this output argument [ignore prefix and suffix]
                                # If empty [''] will enter the output directory location
            use_base_name:      # use only the base name of the output file [ignored if constant_file_name is used]
    copy_File_Types:            # Transferring information between File_Types
        STR:                    # Unique name for the transfer
            source:
                File_Type:      # Copy the content of source File_Type to the target File_Type [copy from here]
                scope:          # Copy the source File_Type From this scope [if not specified the default is sample]
                base:           # The source step to copy the File_Type from (from previous steps). The default is the current step.
                constant_value: # Use to transfer information from outside the 'File_Type' system to a File_Type; will always be considered as project scope
                                # Using the constant_value option will override all other source arguments!
            target:
                File_Type:      # Copy the content of source File_Type to the target File_Type [copy to here]
                scope:          # Copy to the target File_Type in this scope [if not specified the default is sample]
    qsub_params:                # Parameters for qsub [number of cpus or memory to reserve etc ]
        STR: 
    redirects:                  # Parameters to pass directly to the command
        STR: 

Fillout_Generic

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Description

This module enables executing any type of bash command, including pipes and multiple steps. File and directory names are embedded in the script by describing the file or directory in a {{}} block, as follows:

1. File names:

Include 4 colon-separated fields: (a) scope, (b) slot, (c) separator and (d) base. For example, {{sample:fastq.F:,:merge1}} is replaced with the sample fastq.F files from the merge1 instance, separated by commas (only for project-scope scripts, of course). Leave fields empty if you do not want to pass a value, e.g. {{sample:fastq.F}} is replaced with the sample fastq.F file.

2. Sample and project names:

You can include the sample or project names in the script by leaving out the file type field. e.g. {{sample}} will be replaced by the sample name.

To get a list of sample names, set the separator field to the separator of your choice, e.g. {{sample::,}} will be replaced with a comma-separated list of sample names.

3. Directories

You can include two directories in your command:

  • {{base_dir}}: Returns the base directory for the step.

  • {{dir}}: Returns the active directory of the script. For project-scope scripts, this is identical to base_dir. For sample-scope scripts, this will be a directory within base_dir for sample-related files.

Tip

You can obtain the base_dir or dir values for a base step, by including the name of the base in the 4th colon separated position, just as you’d do for the file slots. e.g. {{base_dir:::merge1}} will return the base_dir for step merge1 and {{dir:::merge1}} will return the dir for the current sample for step merge1.

4. Outputs

Will be replaced with the filename specified in the named output. e.g. {{o:fasta.nucl}} will be replaced according to the specifications in the output block named fasta.nucl.

Each output block must contain 2 fields: scope and string. The string describes the file to be stored in the equivalent slot. In the example above, there must be a block called fasta.nucl in the output block, which can be defined as shown in the example in section Lines for parameter file below.

5. Examples

The following examples cover most of the options:

  • {{project:fasta.nucl}}: The fasta.nucl slot of the project

  • {{sample:fastq.F}}: The fastq.F slot of the sample

  • {{sample:fastq.F:,}}: A comma-separated list of the fastq.F slots of all samples

  • {{project}}: The project name

  • {{sample}}: The sample name

  • {{sample::,}}: A comma-separated list of sample names

  • {{sample:fastq.F:,:base}}: A comma-separated list of the fastq.F files of all samples, taken from the sample data of step base.

Tip

For a colon-separated list of sample names or files, use the word ‘colon’ in the separator slot.

Note

The separator field is ignored for project-scope slots.

Attention

If a sample-scope slot is used in the inputs or the outputs, the scripts will be sample-scope scripts. Otherwise, one project-scope script will be produced. To override this behaviour, set scope to project. However, you cannot set scope to project if there are sample-scope fields defined.

Requires

Customizable

Output

Customizable

Parameters that can be set

  • output: A block including ‘scope’ and ‘string’ defining the script outputs

  • scope (sample|project): The scope of the resulting scripts. You cannot set scope to project if there are sample-scope fields defined.

Lines for parameter file

Demonstration of embedding various files and titles in a script file:

pipe_gen_3:
    module:             Fillout_Generic
    base:               pipe_gen_2
    script_path: |
        project:                    {{project}}
        fasta.nucl in project:         {{project:fasta.nucl}}
        fasta.nucl in project from base merge1:   {{project:fasta.nucl::merge1}}

        sample names:             {{sample::,}}
        fastq.F in sample:     {{sample:fastq.F}}
        fastq.F in sample from base merge1:     {{sample:fastq.F::merge1}}

        output:fasta.nucl:    {{o:fasta.nucl}}
    output:
        fasta.nucl:
            scope:      project
            string:       "{{base_dir}}{{project}}_new_pipegen3.fasta"
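
A minimal sketch of a piped command, the kind of irregular call this module is intended for. The step name, base and file types are illustrative:

count_contigs:
    module:             Fillout_Generic
    base:               pipe_gen_3
    script_path: |
        cat {{sample:fasta.nucl}} | grep -c '>' > {{o:counts}}
    output:
        counts:
            scope:      sample
            string:     "{{dir}}{{sample}}_contig_count.txt"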