Sequence-Searching Related Tasks

Modules included in this section

  • makeblastdb *

  • blast

  • parse_blast

  • Gassst

  • hmmscan

  • mash_sketch

  • mash_dist

makeblastdb *

Authors: Menachem Sklarz

Affiliation: Bioinformatics core facility

Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

Create a blastdb from a fasta file

Requires

  • fasta files in the following slots:

    • sample_data[<sample>]["fasta.nucl"|"fasta.prot"]

  • Or (if scope is set to project):
    • sample_data["fasta.nucl"|"fasta.prot"]

Output

  • A BLAST database in the following slots:

    • sample_data[<sample>]["blastdb"]

    • sample_data[<sample>]["blastdb.nucl"|"blastdb.prot"]

    • sample_data[<sample>]["blastdb.nucl.log"|"blastdb.prot.log"]

  • Or (if scope is set to project):

    • sample_data["blastdb"]

    • sample_data["blastdb.nucl"|"blastdb.prot"]

    • sample_data["blastdb.nucl.log"|"blastdb.prot.log"]

Parameters that can be set:

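=========  ==============  ===========================================================================================
Parameter  Values          Comments
=========  ==============  ===========================================================================================
scope      sample|project  Sets whether the project-wide or the sample fasta slot should be used.
-dbtype    nucl|prot       This is a compulsory redirected parameter. Helps the module decide which fasta file to use.
=========  ==============  ===========================================================================================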

Lines for parameter file

mkblst1:
    module: makeblastdb
    base: trinity1
    script_path: /path/to/bin/makeblastdb
    redirects:
        -dbtype: nucl
    scope: project
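
The slot selection described under "Requires" can be pictured with a short Python sketch. This is not the module's own code: `fasta_slot` is a hypothetical helper, the paths are made up, and the dictionary layout simply follows the slot notation used in this section.

```python
def fasta_slot(sample_data, dbtype, scope="sample", sample=None):
    """Return the fasta path a makeblastdb step would read (hypothetical helper)."""
    if dbtype not in ("nucl", "prot"):
        raise ValueError("-dbtype must be 'nucl' or 'prot'")
    slot = "fasta." + dbtype
    if scope == "project":
        # Project-wide slots sit at the top level in this section's notation.
        container = sample_data
    elif scope == "sample":
        if sample is None:
            raise ValueError("a sample name is needed when scope is 'sample'")
        container = sample_data[sample]
    else:
        raise ValueError("scope must be 'sample' or 'project'")
    if slot not in container:
        raise KeyError(f"no {slot} slot found in {scope} scope")
    return container[slot]


# Toy sample_data structure; the paths are invented for illustration only.
sample_data = {
    "fasta.nucl": "/data/project/transcripts.fasta",
    "Sample1": {"fasta.prot": "/data/Sample1/proteins.faa"},
}
print(fasta_slot(sample_data, "nucl", scope="project"))
print(fasta_slot(sample_data, "prot", scope="sample", sample="Sample1"))
```

The point of the sketch is that -dbtype does double duty: it is passed on to makeblastdb and it also selects which fasta slot is read.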

References

Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research, 25(17), pp.3389-3402.

blast

Authors: Menachem Sklarz

Affiliation: Bioinformatics core facility

Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for executing BLAST of any type on a nucleotide or protein fasta file. The search can be either on a sample fasta or on a project-wide fasta. It can use the fasta as a database or as a query. If used as a database, you must call the makeblastdb module prior to this step.

Both the query and db parameters must be passed. Each should be set to one of the following values:

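=======  ==========================================================
Value    Description
=======  ==========================================================
sample   The query or db should be taken from the sample scope
project  The query or db should be taken from the project scope
A path   A path to a fasta file or makeblastdb database to use as-is
=======  ==========================================================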

The type of fasta and database to use are set with the querytype and dbtype parameters, respectively.

  • dbtype must be set if db is set to sample or project.

  • querytype must be set regardless. It will determine the type of blast report (i.e. whether it will be stored in blast.nucl or blast.prot)

Requires:

  • fasta files in one of the following slots for sample-wise blast:

    • sample_data[<sample>]["fasta.nucl"]

    • sample_data[<sample>]["fasta.prot"]

  • or fasta files in one of the following slots for project-wise blast:

    • sample_data["fasta.nucl"]

    • sample_data["fasta.prot"]

  • or a makeblastdb index in one of the following slots:

    • When db is set to ‘project’

      • sample_data["blastdb.nucl"|"blastdb.prot"]

    • When db is set to ‘sample’

      • sample_data[<sample>]["blastdb.nucl"|"blastdb.prot"]

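============  ==============  ===================================================
File type     Scope           Comments
============  ==============  ===================================================
fasta.nucl    sample/project  If query is sample or project and querytype is nucl
fasta.prot    sample/project  If query is sample or project and querytype is prot
blastdb.nucl  sample/project  If db is sample or project and dbtype is nucl
blastdb.prot  sample/project  If db is sample or project and dbtype is prot
============  ==============  ===================================================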

Output:

  • puts BLAST output files in the following slots for sample-wise blast:

    • sample_data[<sample>]["blast.nucl"|"blast.prot"]

    • sample_data[<sample>]["blast"]

  • puts BLAST output files in the following slots for project-wise blast:

    • sample_data["blast.nucl"|"blast.prot"]

    • sample_data["blast"]

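==========  ==============  =====================================
File type   Scope           Comments
==========  ==============  =====================================
blast.nucl  sample/project  Blast report if querytype is nucl
blast.prot  sample/project  Blast report if querytype is prot
blast       sample/project  Blast report, regardless of querytype
==========  ==============  =====================================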

Parameters that can be set

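=========  =============================================  ==================================================================================================================
Parameter  Values                                         Comments
=========  =============================================  ==================================================================================================================
dbtype     nucl|prot                                      Helps the module decide which blastdb to use.
querytype  nucl|prot                                      Helps the module decide which fasta file to use.
query      sample|project|<Path to fasta or BLAST index>  Set to sample for sample-scope query, to project for project-scope query, or to a path for an external query file.
db         sample|project|<Path to BLAST index>           Set to sample for sample-scope index, to project for project-scope index, or to a path for an external index.
=========  =============================================  ==================================================================================================================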

Note

You can’t set both db and query to external files. One of them at least has to be sample or project.
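
The rules above for query, db, querytype and dbtype can be written out as a small pre-flight check. This is only a sketch of the documented constraints: `check_blast_params` is a hypothetical helper, not part of NeatSeq-Flow, and the module's own validation may differ.

```python
def check_blast_params(params):
    """Return a list of problems found in a blast step's parameters (hypothetical)."""
    scopes = ("sample", "project")
    types = ("nucl", "prot")
    problems = []
    # Both query and db must be passed.
    for key in ("query", "db"):
        if not params.get(key):
            problems.append(f"'{key}' must be passed")
    query, db = params.get("query"), params.get("db")
    # At least one of query and db has to be 'sample' or 'project'.
    if query and db and query not in scopes and db not in scopes:
        problems.append("query and db cannot both be external paths")
    # querytype is required regardless; it decides the report slot.
    if params.get("querytype") not in types:
        problems.append("'querytype' must be set to nucl or prot")
    # dbtype is required only when the db comes from a slot.
    if db in scopes and params.get("dbtype") not in types:
        problems.append("'dbtype' must be set when db is sample or project")
    return problems


ok = {"query": "sample", "querytype": "prot", "db": "/path/to/index"}
bad = {"query": "/path/to/query.fasta", "db": "/path/to/index"}
print(check_blast_params(ok))
print(check_blast_params(bad))
```

The first dictionary passes every rule. The second breaks two of them: both inputs are external paths, and querytype is missing.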

Lines for parameter file

External query, project-wise nucl-type database (must be preceded by the makeblastdb module):

tbl_blst_int:
    module:             blast
    base:               mkblst1
    script_path:        {Vars.Programs.blast.Bin}/blastn
    query:              /path/to/query.fasta
    querytype:          nucl
    db:                 project
    dbtype:             nucl
    redirects:
        -evalue:        0.0001
        -max_target_seqs: 5
        -num_of_proc:   20
        -num_threads:   20

Sample specific prot-type fasta, external database:

tbl_blst_ext:
    module:             blast
    base:               prokka1
    script_path:        {Vars.Programs.blast.Bin}/blastp
    query:              sample
    querytype:          prot
    db:                 {Vars.Genome.blast_index}
    redirects:
        -evalue:        0.0001

References

Altschul, S.F., Madden, T.L., Schäffer, A.A., Zhang, J., Zhang, Z., Miller, W. and Lipman, D.J., 1997. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic acids research, 25(17), pp.3389-3402.

parse_blast

Authors: Menachem Sklarz

Affiliation: Bioinformatics core facility

Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running parse_blast.R:

The parse_blast.R script is available on github.

The program performs the following tasks:

  1. It adds annotation to raw tabular BLAST output files,

  2. filters the BLAST results by several possible fields,

  3. selects the best hit for a group when passed a grouping field and

  4. extracts the sequences equivalent to the alignments.
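
To make the filtering and best-hit steps concrete, here is a minimal Python sketch. It is not a port of parse_blast.R: the column names are borrowed from the --names value in the parameter file example further down, the three hits are invented, and "best" is assumed here to mean the highest bitscore within each group, which may not match the script's own criterion.

```python
import csv
import io

# Column names as in the --names example of this section.
NAMES = ("qseqid sallseqid qlen slen qstart qend sstart send "
         "length evalue bitscore score pident qframe").split()


def best_hits(handle, names=NAMES, max_evalue=1e-7, min_align_len=30,
              group_by="qseqid"):
    """Filter tabular BLAST rows, then keep the top-bitscore hit per group."""
    best = {}
    for row in csv.DictReader(handle, fieldnames=names, delimiter="\t"):
        if float(row["evalue"]) > max_evalue:
            continue  # hit is not significant enough
        if int(row["length"]) < min_align_len:
            continue  # alignment is too short
        key = row[group_by]
        if key not in best or float(row["bitscore"]) > float(best[key]["bitscore"]):
            best[key] = row
    return best


# Three made-up hits; the q2 hit is dropped by the alignment-length filter.
report = io.StringIO(
    "q1\ts1\t300\t900\t1\t300\t10\t309\t300\t1e-50\t400\t210\t98.0\t1\n"
    "q1\ts2\t300\t800\t1\t250\t5\t254\t250\t1e-20\t180\t95\t90.0\t1\n"
    "q2\ts3\t200\t500\t1\t25\t1\t25\t25\t1e-9\t50\t26\t100.0\t1\n"
)
for qseqid, hit in best_hits(report).items():
    print(qseqid, hit["sallseqid"], hit["evalue"])
```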

Requires

  • Tabular BLAST result files in the following slots:

    • sample_data[<sample>]["blast.nucl"|"blast.prot"] (if scope set to sample)

    • sample_data["project_data"]["blast.nucl"|"blast.prot"] (if scope set to project)

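=====================  ==============  =======================================
File type              Scope           Comments
=====================  ==============  =======================================
blast.nucl|blast.prot  sample/project  A blast report for a nucl or prot query
=====================  ==============  =======================================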

Attention

If both blast.nucl and blast.prot exist, determine which to use by setting fasta2use. See parameter table below.

Output

  • Puts the parsed report in:

    • sample_data[<sample>]["blast.parsed"] if scope = sample

    • sample_data["project_data"]["blast.parsed"] if scope = project

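============  ==============  ==============================
File type     Scope           Comments
============  ==============  ==============================
blast.parsed  sample/project  Results of parsed blast report
============  ==============  ==============================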

Parameters that can be set

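=============  =========  ============================================================================================================================
Parameter      Values     Comments
=============  =========  ============================================================================================================================
fasta2use      nucl|prot  If both nucl and prot BLAST reports exist, you have to specify which one to use with this parameter.
blast_merge               Block with path set to path of compare_blast_parsed_reports.R and redirects set to compare_blast_parsed_reports.R parameters.
extract_fasta             Should the script extract a fasta of the hits?
=============  =========  ============================================================================================================================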

Note

path in blast_merge block can be left empty. The script will be taken from the same location as the main parse_blast.R script. redirects in blast_merge block can be either in string format or the regular block format.

Lines for parameter file

parse_blast_table:
    module: parse_blast
    base: blst_table
    script_path: {Vars.paths.parse_blast}
    scope: sample
    redirects:
        --columns2keep: '"group name accession qseqid sallseqid evalue bitscore score pident coverage align_len"'
        --dbtable: {Vars.databases.gene_list.table}
        --group_dif_name: # See parse_blast.R documentation for how this is to be specified
        --max_evalue: 1e-7
        --merge_blast: qseqid
        --merge_metadata: # See parse_blast.R documentation for how this is to be specified
        --min_align_len: 30
        --min_coverage: 60
        --names: '"qseqid sallseqid qlen slen qstart qend sstart send length evalue bitscore score pident qframe"'
        --num_hits: 1
    extract_fasta:
    blast_merge:
        path: '{Vars.paths.compare_blast_parsed_reports}'
        redirects:
            --variable:     evalue
            --full_txt_output:

Gassst

Authors: Liron Levin

Affiliation: Bioinformatics core facility

Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

Short Description

A module for executing Gassst on a nucleotide fasta file. The search can be either on a sample fasta or on a project-wide fasta. It can use the fasta as a database or as a query.

Requires

  • fasta files in the following slot for sample-wise Gassst:
    • sample_data[<sample>]["fasta.nucl"]

  • or fasta files in the following slots for project-wise Gassst:
    • sample_data["fasta.nucl"]

Output

  • puts Gassst output files in the following slots for sample-wise Gassst:
    • sample_data[<sample>]["blast"]

    • sample_data[<sample>]["blast.nucl"]

  • puts Gassst output files in the following slots for project-wise Gassst:
    • sample_data["blast"]

    • sample_data["blast.nucl"]

Parameters that can be set

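=========  ==============  ==================================================================================================================
Parameter  Values          Comments
=========  ==============  ==================================================================================================================
scope      project|sample  Set to project to use the project-wide fasta.nucl file type. The default is the sample-wide fasta.nucl file type.
=========  ==============  ==================================================================================================================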

Comments

  • This module was tested on:

    Gassst v1.28

  • The following python packages are required:

    pandas

  • Set only one of -d [database] or -i [query], not both

  • The Gassst module will generate blast like output with fields:

    "qseqid sallseqid qlen slen qstart qend sstart send length evalue sseq"

Lines for parameter file

Step_Name:                         # Name of this step
    module: Gassst                 # Name of the module to use
    base:                          # Name of the step [or list of names] to run after [must be after a fasta generating step]
    script_path:                   # Command for running the Gassst script
                                   # The Gassst module will generate blast like output with fields:
                                   # "qseqid sallseqid qlen slen qstart qend sstart send length evalue sseq"
    scope:                         # Set to project to use the project-wide fasta.nucl file type; the default is the sample-wide fasta.nucl file type
    qsub_params:
        -pe:                       # Number of CPUs to reserve for this analysis
    redirects:
        -h:                        # Max hits per query, for downstream best hit will be chosen!
        -i:                        # Only -d [database] or -i [query] not both
        -l:                        # Complexity_filter off
        -d:                        # Only -d [database] or -i [query] not both
        -n:                        # Number of CPUs running Gassst
        -p:                        # Minimum percentage of identity. Must be in the interval [0 100]

References

Rizk, Guillaume, and Dominique Lavenier. “GASSST: global alignment short sequence search tool.” Bioinformatics 26.20 (2010): 2534-2540.

hmmscan

Authors: Menachem Sklarz

Affiliation: Bioinformatics core facility

Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for searching a fasta file with hmmscan.

Requires

  • If scope = sample, fasta files in one of the following slots:

    • sample_data[<sample>]["fasta.nucl"]

    • sample_data[<sample>]["fasta.prot"]

  • If scope = project, fasta files in one of the following slots:

    • sample_data["fasta.nucl"]

    • sample_data["fasta.prot"]

Output:

  • puts hmmscan output files in the following slots:

    • for scope = sample (depending on type passed):

      • sample_data[<sample>]["hmmscan.nucl"]

      • sample_data[<sample>]["hmmscan.prot"]

    • for scope = project (depending on type passed):

      • sample_data["hmmscan.nucl"]

      • sample_data["hmmscan.prot"]

Parameters that can be set

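===========  ===========================  ====================================================================================================================================================================
Parameter    Values                       Comments
===========  ===========================  ====================================================================================================================================================================
scope        sample|project               Search the sample-scope fasta files or the project-scope fasta file.
type         nucl|prot                    Use a prot or nucl fasta file for the search.
output_type  tblout|domtblout|pfamtblout  tblout: parseable table of per-sequence hits to file, domtblout: parseable table of per-domain hits to file, pfamtblout: table of hits and domains in Pfam format
hmmdb                                     A path to the hmmdb to search against.
===========  ===========================  ====================================================================================================================================================================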

Lines for parameter file

trino_hmmscan1_highExpr:
    module:             hmmscan
    base:               trino_Transdecode_highExpr
    script_path:        {Vars.paths.hmmscan}
    scope:              sample
    type:               prot
    output_type:        domtblout 
    hmmdb:              {Vars.databases.trinotate.pfam}
    qsub_params:
        -pe:            shared 10
    redirects:
        --cpu:          1
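
As a rough picture of how these parameters could end up on a command line, the sketch below assembles an hmmscan call in Python. The three output flags (--tblout, --domtblout, --pfamtblout) and the `hmmscan [options] <hmmdb> <seqfile>` argument order are standard HMMER usage. The helper `hmmscan_command`, the file names and the exact assembly order are assumptions, so the script NeatSeq-Flow actually writes may look different.

```python
import shlex

# output_type values from the parameter table, mapped to hmmscan's own flags.
OUTPUT_FLAGS = {
    "tblout": "--tblout",
    "domtblout": "--domtblout",
    "pfamtblout": "--pfamtblout",
}


def hmmscan_command(script_path, hmmdb, fasta, table_file, output_type,
                    redirects=None):
    """Build an hmmscan command line (hypothetical helper, for illustration)."""
    if output_type not in OUTPUT_FLAGS:
        raise ValueError("output_type must be one of: " + ", ".join(OUTPUT_FLAGS))
    cmd = [script_path]
    # Redirected parameters are passed through unchanged.
    for flag, value in (redirects or {}).items():
        cmd += [flag, str(value)]
    # Options come first, then the HMM database, then the sequence file.
    cmd += [OUTPUT_FLAGS[output_type], table_file, hmmdb, fasta]
    return " ".join(shlex.quote(part) for part in cmd)


# File names below are invented for the example.
print(hmmscan_command("hmmscan", "Pfam-A.hmm", "sample1.pep",
                      "sample1.domtblout", "domtblout",
                      redirects={"--cpu": 1}))
```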

References

Finn, Robert D., Jody Clements, and Sean R. Eddy. “HMMER web server: interactive sequence similarity searching.” Nucleic acids research 39.suppl_2 (2011): W29-W37.

mash_sketch

Authors: Menachem Sklarz

Affiliation: Bioinformatics core facility

Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

Build mash sketches from sequence files.

Works in three modes:

  • scope=sample

    Builds a separate sketch for each sample

  • scope=project and src_scope=sample

    Builds a project-wide sketch from sample sequence files. This can be used with the mash_dist module to perform all-against-all comparisons.

  • scope=project

    Builds a sketch from project sequence files.
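
The three modes listed above depend only on scope and src_scope, with src_scope falling back to scope when it is not given. A minimal Python sketch of that decision follows; `sketch_mode` is a hypothetical helper written for this illustration, not a NeatSeq-Flow function.

```python
def sketch_mode(scope, src_scope=None):
    """Describe which mash_sketch mode a scope/src_scope pair selects (hypothetical)."""
    if src_scope is None:
        src_scope = scope  # documented default: same as scope
    valid = ("sample", "project")
    if scope not in valid or src_scope not in valid:
        raise ValueError("scope and src_scope must be 'sample' or 'project'")
    if scope == "sample" and src_scope == "sample":
        return "separate sketch for each sample"
    if scope == "project" and src_scope == "sample":
        return "project-wide sketch from sample sequence files"
    if scope == "project" and src_scope == "project":
        return "sketch from project sequence files"
    # scope=sample with src_scope=project is not one of the documented modes.
    raise ValueError("unsupported combination: scope=sample, src_scope=project")


print(sketch_mode("sample"))
print(sketch_mode("project", src_scope="sample"))
print(sketch_mode("project"))
```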

Requires:

  • fasta files in one of the following slots:

    • sample_data[<sample>]["fasta.nucl"]

  • or fastq files in the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

  • For scope = project, uses project-wide files.

Output:

  • puts ‘msh’ output files in the following slots for (scope=sample):

    • sample_data[<sample>]["msh.fasta"]

    • sample_data[<sample>]["msh.fastq"]

  • puts ‘msh’ output files in the following slots for (scope=project):

    • sample_data["project_data"]["msh.fasta"]

    • sample_data["project_data"]["msh.fastq"]

Parameters that can be set

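=========  ==============  ========================================================================
Parameter  Values          Comments
=========  ==============  ========================================================================
scope      project|sample  The scope for which to build the sketch.
src_scope  project|sample  The scope from which to take the sequence files. Default - same as scope
type       fasta|fastq     Use fastq or fasta files. By default, uses any that exist.
=========  ==============  ========================================================================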

Lines for parameter file

  1. Create sketch for each sample based on fastq files

sketch_smp:
    module:         mash_sketch
    base:           trim_gal
    script_path:    "{Vars.paths.mash} sketch"
    scope:          sample
    type:           fastq
    rm_merged:
    qsub_params:
        -pe:        shared 10
    redirects:
        -m:         2
        -p:         10

  2. Create project sketch for all samples’ fastq files

sketch_proj:
    module:         mash_sketch
    base:           merge1
    script_path:    "{Vars.paths.mash} sketch"
    src_scope:      sample
    scope:          project
    type:           fastq
    rm_merged:
    redirects:
        -m:         2
        -p:         10

References

Ondov, Brian D., et al. “Mash: fast genome and metagenome distance estimation using MinHash.” Genome biology 17.1 (2016): 132.

mash_dist

Authors: Menachem Sklarz

Affiliation: Bioinformatics core facility

Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

Requires:

  • fasta files in one of the following slots:

    • sample_data[<sample>]["fasta.nucl"]

    • sample_data["fasta.nucl"]

  • OR fastq files in one of the following slots (merge fastq files first with mash_sketch or otherwise):

    • sample_data[<sample>]["fastq"]

    • sample_data["fastq"]

  • OR sketch files in one of the following slots:

    • sample_data[<sample>]["msh.fastq"]

    • sample_data[<sample>]["msh.fasta"]

    • sample_data["msh.fastq"]

    • sample_data["msh.fasta"]

Output:

  • puts ‘msh’ output files in the following slots for (scope=sample):

    • sample_data[<sample>]["msh.fasta"]

    • sample_data[<sample>]["msh.fastq"]

  • puts mash dist table files in the following slots for (scope=project and scope=all_samples):

    • sample_data[<sample>]["mash.dist.table"]

    • sample_data["mash.dist.table"]

Parameters that can be set

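=========  ======  =======================================================================================
Parameter  Values  Comments
=========  ======  =======================================================================================
reference          A block including ‘path’ or ‘scope’, ‘type’ and optionally ‘msh’
query              A block including ‘scope’ (sample, project or all_samples), ‘type’ and optionally ‘msh’
=========  ======  =======================================================================================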

Lines for parameter file

  1. External reference. Sample-wise fastq files.

    Returns table of mash dist of sample against external reference. One table per sample

dist:
    module:         mash_dist
    base:           [sketch_proj,sketch_smp]
    script_path:    "{Vars.paths.mash} dist"
    reference:
        path:   /path/to/ref1
    query:
        scope:          sample
        type:           fastq
        msh:

  2. Project mashed fasta reference. Sample mashed fastq query

    Returns table of mash dist of sample against project reference. One table per sample

dist:
    module:         mash_dist
    base:           [sketch_proj,sketch_smp]
    script_path:    "{Vars.paths.mash} dist"
    reference:
        scope:      project
        type:       fasta
        msh:
    query:
        scope:      sample
        type:       fastq
        msh:

  3. Project mashed reference. Project mashed fastq query

    Returns table of mash dist of project sketch against project sketch. One table for the whole project.

    If the project sketch is built from sample sketches, as created by mash_sketch using scope=project and src_scope=sample, the result will be an all-against-all mash dist table.

dist:
    module:         mash_dist
    base:           [sketch_proj,sketch_smp]
    script_path:    "{Vars.paths.mash} dist"
    reference:
        scope:      project
        type:       fastq
        msh:
    query:
        scope:      project
        type:       fastq
        msh:

  4. Project mashed fastq reference. Sample mashed fastq query

    Returns table of mash dist of project sketch against each sample sketch. One table per sample.

dist: 
    module:         mash_dist
    base:           [sketch_proj,sketch_smp]
    script_path:    "{Vars.paths.mash} dist"
    reference:
        scope:      project
        type:       fastq
        msh:
    query:
        scope:      sample
        type:       fastq
        msh:
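
The tables these steps produce can be read back with a few lines of Python. The sketch assumes the standard `mash dist` output, which is tab-delimited with five columns: reference ID, query ID, Mash distance, p-value and matching hashes. `read_mash_dist` is a hypothetical helper and the values in the toy table are invented.

```python
import csv
import io
from collections import defaultdict


def read_mash_dist(handle):
    """Read a mash dist table into a nested dict: matrix[query][reference] = distance."""
    matrix = defaultdict(dict)
    for ref, query, dist, pvalue, hashes in csv.reader(handle, delimiter="\t"):
        matrix[query][ref] = float(dist)
    return matrix


# Toy all-against-all table for two samples; the numbers are made up.
table = io.StringIO(
    "A.fastq\tA.fastq\t0\t0\t1000/1000\n"
    "A.fastq\tB.fastq\t0.0222766\t0\t456/1000\n"
    "B.fastq\tA.fastq\t0.0222766\t0\t456/1000\n"
    "B.fastq\tB.fastq\t0\t0\t1000/1000\n"
)
matrix = read_mash_dist(table)
for query in sorted(matrix):
    print(query, [matrix[query][ref] for ref in sorted(matrix[query])])
```

For an all-against-all table, as in the third example above, the nested dictionary is a full distance matrix and can be passed on to clustering or tree-building code.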

References

Ondov, Brian D., et al. “Mash: fast genome and metagenome distance estimation using MinHash.” Genome biology 17.1 (2016): 132.


© Copyright 2017, Menachem Sklarz.
