Building a workflow

Author: Menachem Sklarz

Affiliation: Bioinformatics Core Facility, National Institute for Biotechnology in the Negev, Ben-Gurion University.

A typical usage of NeatSeq-Flow involves the following steps:

  1. Parameter file definition
  2. Sample file definition
  3. Executing NeatSeq-Flow
  4. Executing the workflow

Parameter file definition

Tip

The parameter file is rarely created from scratch. Take an existing parameter file defined for the analysis you require and modify it to suit your SGE cluster and specific requirements.

The parameter file is a YAML file which must include the two sections described below: global parameters and step-wise parameters.

Global parameters

Several SGE and other parameters can be set globally so that all scripts use them for execution. Overriding the defaults on a step-wise basis is possible in the step-wise section of the parameter file.

All global parameters are set within a Global_params block in YAML format.

Qsub_q
Defines the default queue to send the jobs to (this is the value passed to qsub with the -q parameter).
Qsub_nodes

Limits the nodes to which jobs are sent. These must be nodes available to the queue requested in Qsub_q. The nodes should be passed as a YAML list, e.g.:

Qsub_nodes:
    - node1
    - node2
Qsub_opts
Other SGE parameters to be set as defaults for all scripts, e.g. -V, -cwd, etc. The parameters should be passed as one long string, not as a list.

Attention

It is highly recommended to include the -notify argument for qsub in this string. When it is passed, all modules producing bash-based scripts will report early termination of scripts (e.g. by qdel) in the log file. If -notify is not passed, jobs killed with qdel will have a line in the log file reporting the job start time, but there will be no indication that the job was terminated (apart from the missing finish-time line).

Qsub_path
The path to the qstat command. If not set, qstat will be used as-is, with no path. On some remote nodes the qstat command is not in the search path; if Qsub_path is not set in such cases, logging of step start and stop will fail.

Tip

The correct value for the Qsub_path parameter can be determined by executing the following command:

dirname `which qsub`
Default_wait
The time, in seconds, to wait for jobs to enter the queue before terminating the step-level script. Must be an integer. The default is 10, which is usually a good value to start with. If downstream jobs seem to be sent for execution before earlier jobs have terminated, increase this value.
module_path
Enables including modules that are not part of the main NeatSeq-Flow package. This covers modules downloaded from the NeatSeq-Flow Modules and workflows repository as well as modules you have written yourself (see section For the Programmer - Adding Modules). Keep your modules in a separate path and pass that path to NeatSeq-Flow with module_path. More than one external module path can be passed as a YAML list. The paths are searched in order, with the main NeatSeq-Flow package searched last.

Attention

When executing NeatSeq-Flow within a conda environment, NeatSeq-Flow will add the path to the modules repository automatically (See Install and execute with Conda). You don’t have to worry about setting it in the parameter file unless you have your own modules installed in a different location.

job_limit

If there is an upper limit on the number of jobs you can send to the job manager, you can use the job_limit parameter to pass NeatSeq-Flow a file containing a single line, e.g.:

limit=1000 sleep=60

This makes the scripts check every 60 seconds how many jobs are registered for the user. New jobs are released only when the count drops below the specified limit (1000 in this example).
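For illustration, a minimal sketch of how this might be set in the global block, assuming the one-line file above was saved to a hypothetical path:

Global_params:
    job_limit: /path/to/job_limit_file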

conda

If you want to use a conda environment to execute the scripts, pass this parameter with the following two sub-parameters:

path
The path to the environment you want to use. If left empty and a conda environment is active, NeatSeq-Flow will use the path of the active environment.
env
The name of the environment to use. If absent or left empty, NeatSeq-Flow will extract the name from the path parameter.
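For example, a hedged sketch of a global conda block; both the path and the environment name are placeholders:

Global_params:
    conda:
        path: /path/to/miniconda3/envs/my_env
        env: my_env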

Following is an example of a global-parameters block:

Global_params:
    Default_wait: 10
    Qsub_path: /path/to/qstat
    Qsub_q: queue.q
    Qsub_nodes: [node1,node2,node3]
    Qsub_opts:  -V -cwd -notify
    module_path:
        - /path/to/modules1/
        - /path/to/modules2/

Step-wise parameters

Step-wise parameters define parameters which are specific to the various steps included in the workflow.

All step-wise parameters are set within a Step_params block in YAML format.

A parameter block for a step (a module instance) should look as follows:

Step_params:
    trim1:
        module: trimmo
        base: merge1
        script_path: java -jar trimmomatic-0.32.jar
        qsub_params:
            -pe: shared 20
            node: node1
        todo: LEADING:20 TRAILING:20
        redirects:
            -threads: 20

trim1 is the step name. This should be a single-word, informative name (alphanumeric and underscore are permitted) which will be included in the script names and output directory names.

Following the step name, with indentation, are the step parameters as defined below.

Step parameters can be divided into the following groups:

  1. Required parameters for each step
  2. Additional parameters
  3. Redirected parameters

Required parameters for each step

module
The name of the module of which this step is an instance.
base
The name of the step on which the current step is based (not required for the merge step, which is always the first step and has no base). base can be a YAML-formatted list of base steps.
script_path
The full path to the script executed by this step.

Note

  1. If the program executed by the module is on the search PATH of all the nodes in the queue, you can pass just the program name without the full path. This is, however, not usually recommended.
  2. If the program requires a specific version of python or Rscript, you can prepend the interpreter to the actual path, e.g. /path/to/python /path/to/executable
  3. Sometimes, modules can require a path to a directory rather than to an executable. See, e.g., module UCSC_BW_wig.
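To illustrate item 2 above, a sketch of a step running a python script with a specific interpreter; the step name, module and both paths are placeholders:

my_step:
    module: some_module
    base: merge1
    script_path: /path/to/python /path/to/my_script.py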

Additional parameters

Other parameters you can set for each step to control the execution of the step scripts:

setenv
Set various environment variables for the duration of script execution. This is useful when the software executed by the script requires setting specific environment variables which you do not want to set globally on all nodes.
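For example, a sketch assuming the executed software reads PERL5LIB; the variable and path are placeholders:

setenv: PERL5LIB=/path/to/perl/lib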
qsub_params

Set cluster-related parameters which will be effective for the current step only:

node
A node, or a YAML list of nodes, on which to run the step scripts (overrides the global Qsub_nodes parameter).
queue or -q
Limits the execution of the step's scripts to this queue (overrides the global Qsub_q parameter).
-pe
Sets the -pe parameter for all scripts of this step (see the SGE qsub manual).
-XXX: YYY
Sets the value of qsub parameter -XXX to YYY. This is a way to define any other SGE parameter for all step scripts.
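As a sketch, a step-level qsub_params block combining the options above; the queue name and values are placeholders (-l mem_free= is a standard SGE resource request):

qsub_params:
    queue: high_mem.q
    -pe: shared 20
    -l: mem_free=20G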
scope
Defines whether to use sample-wise files or project-wise files. Check per-module documentation for whether and how this parameter is defined (see, e.g., the blast module).
sample_list
This is an experimental feature. A comma-separated list of samples on which to execute the module. Scripts will be created only for the samples in the list. The selection remains valid for all instances based on this instance, until the value all_samples is passed. Use this option with care, since samples not in the list will not have the step's outputs, which may well be required downstream.
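For example, to create scripts for two samples only (sample names are hypothetical):

sample_list: Sample1,Sample2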

Tip

A use case could be when you want to run a step with different parameters on different sample subsets. Both versions of the instance should inherit from a common base, and the downstream step can inherit from both versions; thus all samples will have all the files, each created with the appropriate parameters. A sketch follows below.
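A minimal sketch of this pattern, reusing the trimmo example from above; step names, sample names, the downstream module and all parameter values are illustrative only:

Step_params:
    trim_strict:
        module: trimmo
        base: merge1
        script_path: java -jar trimmomatic-0.32.jar
        sample_list: Sample1,Sample2
        todo: LEADING:30 TRAILING:30
    trim_lenient:
        module: trimmo
        base: merge1
        script_path: java -jar trimmomatic-0.32.jar
        sample_list: Sample3
        todo: LEADING:10 TRAILING:10
    downstream:
        module: some_module
        base:
            - trim_strict
            - trim_lenient
        script_path: /path/to/tool
        sample_list: all_samples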

conda
Defines step-specific conda parameters. The syntax is the same as for the global conda definition (see above). If set, the path and env will be used to execute the scripts of this step only. If a global conda definition exists, the local definition overrides it.

Attention

If you have set global conda parameters and want a step to execute outside any conda environment, pass an empty conda field, as in the sketch below.
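For example, a sketch of a step opting out of a globally defined environment; names and paths are placeholders:

some_step:
    module: some_module
    base: merge1
    script_path: /path/to/tool
    conda: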

local
A local directory which exists on all cluster nodes. Intermediate files are written to this local directory and the results are copied to their final destination in the data dir when the script completes. This is useful when the cluster manager requires you to limit your I/O to the central disk system.
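For example, assuming a hypothetical scratch directory that exists on every node:

local: /scratch/neatseq_tmp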

Redirected parameters

Parameters to be redirected to the actual program executed by the step.

Redirected parameters are specified within a redirects: block (see example in Step-wise parameters above).

Note

The parameter name must include the - or -- required by the program defined in script_path.

Comments

  1. The local directory passed to local must exist on all nodes in the queue.
  2. For a list of qsub parameters, see the qsub man page.
  3. The list of nodes passed to node within the qsub_params block will be appended to the queue name (global or step specific). Don’t add the queue name to the node names.

Sample file definition

Attention

As of version 1.3.0, NeatSeq-Flow no longer supports the List-format used in previous versions!

Attention

It is recommended to provide full paths to the files listed in the sample file. However, if relative paths are provided, NeatSeq-Flow will attempt to expand them to full paths, using the current directory as the base directory.

Important

When passing URLs as sample locations (see documentation for the merge module), you must include the protocol, or scheme, at the beginning of the URL.

  • Good: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR453/SRR453032/SRR453032_1.fastq.gz
  • Bad: ftp.sra.ebi.ac.uk/vol1/fastq/SRR453/SRR453032/SRR453032_1.fastq.gz

The sample file currently has four sections:

Project title

The project title is supplied in a line with the following structure:

Title       name_of_analysis

Attention

The word Title and the title name must be separated by a single TAB character. This is the rule for all sections of the sample file.

Caution

If more than one title line is included, one of them will be selected and a warning will be generated.

Sample files

The samples themselves are coded in a TAB-separated table with a header, as follows:

#SampleID   Type    Path

The table rows must occupy consecutive lines immediately following the header line.

  • The first field is the sample name (no spaces!),
  • the second field is the file type, and
  • the third field is the file path.

Additional columns are ignored.

You may comment out lines in the table by prepending a # character.

An example of a sample table follows:

#SampleID   Type    Path
Sample1     Forward /full/path/to/Sample1_R1_001.fastq.gz
Sample1     Reverse /full/path/to/Sample1_R2_001.fastq.gz
Sample2     Forward /full/path/to/Sample2_R1_001.fastq.gz
Sample2     Reverse /full/path/to/Sample2_R2_001.fastq.gz

The following file types are recognized by NeatSeq-Flow and will be automatically merged into the correct position in the file index (indicated in the second column):

File types recognized by NeatSeq-Flow:

Source        Target
Forward       fastq.F
Reverse       fastq.R
Single        fastq.S
Nucleotide    fasta.nucl
Protein       fasta.prot
SAM           sam
BAM           bam
REFERENCE     reference
VCF           vcf
G.VCF         g.vcf

Other types can be included as well. For how to merge them correctly into NeatSeq-Flow, see the documentation for the merge module.

Note

  1. Each line represents one file. For samples with multiple files, add lines with the same sample name.
  2. Keep forward and reverse files in pairs. Each forward file should have its reverse file on the following line.
  3. Each sample can contain different combinations of file types, but take care when doing non-standard things like that.
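For example, a sample sequenced in two runs could be coded as follows (paths are placeholders); note that each forward file is directly followed by its reverse mate:

#SampleID   Type    Path
Sample1     Forward /full/path/to/Sample1_run1_R1.fastq.gz
Sample1     Reverse /full/path/to/Sample1_run1_R2.fastq.gz
Sample1     Forward /full/path/to/Sample1_run2_R1.fastq.gz
Sample1     Reverse /full/path/to/Sample1_run2_R2.fastq.gz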

Project files

As of NeatSeq-Flow version 1.3.0, you can pass project-wise files, such as reference files, through the sample file. This is done as above for the sample data, in a separate table with the following structure:

#Type       Path

For example, a project file section could look like this:

#Type       Path
Nucleotide  /path/to/reference.fasta
Protein     /path/to/reference.faa
# This is a comment line

The same file types that can be used in the Sample files section can also be used in the project files section.

Warning

Project files in the sample file are an experimental feature and should be used with caution. See the merge documentation for details on how to import various types of sample files.

Attention

Up to NeatSeq-Flow version 1.2.0, the sample file could only contain sample files. No project files were permitted.

  • If you have project files, create a single sample which will represent your project.
  • If you have mainly sample files, such as fastq files, and some project level files such as reference genomes, pass them to the modules through the parameter file.

ChIP-seq specific definitions

For ChIP-seq experiments, one must define ChIP and Control (‘input’) pairs. This is done in the following manner (in the sample file):

Sample_Control        anti_sample1:input_sample1
Sample_Control        anti_sample2:input_sample2

Just replace anti_sample1 and input_sample1 with the relevant sample names.

Executing NeatSeq-Flow

Executing NeatSeq-Flow is the simplest step in the workflow (make sure python and neatseq_flow.py are in your search path):

python neatseq_flow.py                      \
    -s sample_file.nsfs                     \
    -p param_file1.nsfp,param_file2.nsfp    \
    -m "message"                            \
    -d /path/to/workflow/directory

Comments:

  • NeatSeq-Flow does not require installation. If you have a local copy, invoke neatseq_flow.py with its full path.
  • You can pass a comma-separated list of parameter files. NeatSeq-Flow concatenates the files in the order they are passed. Make sure there are no conflicts or duplicated definitions between the files (this occurs mainly with global parameters).
  • Alternatively, you can pass multiple parameter files by specifying -p more than once (see the example below this list).
  • It is not compulsory to pass a message via -m but it is highly recommended for documentation and reproducibility.
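As referenced in the comments above, an equivalent invocation passes each parameter file with its own -p flag (file names as in the example above):

python neatseq_flow.py          \
    -s sample_file.nsfs         \
    -p param_file1.nsfp         \
    -p param_file2.nsfp         \
    -m "message"                \
    -d /path/to/workflow/directory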

Executing the workflow

The workflow can be executed fully automatically, on a step-by-step basis, or for individual samples separately.

Automatic execution

Execute the following command within the workflow directory:

csh scripts/00.workflow.commands.csh

The scripts/00.workflow.commands.csh script runs all the steps at once, leaving flow control entirely to the cluster job manager.

Step-wise execution

Each line in scripts/00.workflow.commands.csh calls a step-wise script in scripts/, e.g. scripts/01.merge_merge1.csh, which contains a list of qsub commands executing the individual scripts on each sample.

The following command will execute only the merge1 step:

qsub scripts/01.merge_merge1.csh

Sample-wise execution

The individual sample-level scripts are stored in folders within scripts/, e.g. all merge1 scripts are stored in scripts/01.merge_merge1/. To execute the step for a specific sample only, execute the relevant script from within the individual script folder.
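For instance, assuming a sample named Sample1 and the directory layout above, the command might look as follows; the exact script file name depends on your workflow, so check the folder contents first:

qsub scripts/01.merge_merge1/01.merge_merge1_Sample1.csh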