Building a workflow
Author: Menachem Sklarz
A typical usage of NeatSeq-Flow involves the following steps:
The parameter file is rarely created from scratch. Take an existing parameter file defined for the analysis you require and modify it to suit your SGE cluster and specific requirements.
The parameter file is a YAML file which must include two (unequal) sections: global parameters and step-wise parameters.
Several SGE and other parameters can be set globally so that all scripts use them for execution. Overriding the defaults on a step-wise basis is possible in the step-wise section of the parameter file.
All global parameters are set within a Global_params block in YAML format.
- Qsub_q: Defines the default queue to send the jobs to (this is the queue name passed to qsub).
- Qsub_nodes: Limits the nodes to which to send the jobs. Must be nodes that are available to the queue requested in Qsub_q. The nodes should be passed in YAML list format, e.g.
Qsub_nodes:
    - node1
    - node2
- Qsub_opts: Other SGE parameters to be set as default for all scripts, e.g. -V -cwd etc. The parameters should be passed in one long string and not as a list.
It is highly recommended to pass the -notify argument to qsub in this string. If it is passed, all modules producing bash-based scripts will report early termination of scripts with qdel in the log file. If -notify is not passed, jobs killed with qdel will have a line in the log file reporting the job start time, but there will be no indication that the job was terminated (other than the missing line reporting the finish time).
- Qsub_path: The path to the qstat command. If not set, qstat will be used as-is with no path. Sometimes on remote nodes the qstat command is not in the path, and if Qsub_path is not set, the step start and stop logging will fail.
The correct value for the Qsub_path parameter can be determined by executing the following command:
dirname `which qsub`
- Default_wait: The time, in seconds, to wait for jobs to enter the queue before terminating the step-level script. Must be an integer. The default is 10, which is usually a good value to start with. If downstream jobs seem to be sent for execution before earlier jobs have terminated, increase this value.
- Enables including modules not in the main NeatSeq-Flow package. This includes the modules downloaded from the NeatSeq-Flow Modules and workflows repository as well as modules you added yourself (see section For the Programmer - Adding Modules). Keep your modules in a separate path and pass the path to NeatSeq-Flow with module_path. Several of these can be passed in YAML list format for more than one external module path. The list will be searched in order, with the main NeatSeq-Flow package being searched last.
When executing NeatSeq-Flow within a conda environment, NeatSeq-Flow will add the path to the modules repository automatically (See Install and execute with Conda). You don’t have to worry about setting it in the parameter file unless you have your own modules installed in a different location.
If there is an upper limit on the jobs you can send to the job manager, you can use the job_limit parameter to pass NeatSeq-Flow a file with one line, e.g.:
This will make the scripts check every 60 seconds whether fewer than 1000 jobs are registered for the user. New jobs will be released only when the number of registered jobs is below the specified limit.
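The polling behaviour described above can be sketched in Python. This is only an illustration of the idea, not the code NeatSeq-Flow generates: the function name wait_for_free_slot and the count_jobs callable are hypothetical, and how the registered jobs are actually counted is left to the caller.

```python
import time
from typing import Callable


def wait_for_free_slot(
    count_jobs: Callable[[], int],
    limit: int = 1000,
    sleep_seconds: float = 60,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Block until count_jobs() reports fewer than `limit` jobs.

    count_jobs is a hypothetical callable returning the number of jobs
    currently registered for the user. Returns how many times we slept.
    """
    waits = 0
    while count_jobs() >= limit:
        # At or above the limit: wait one interval, then check again.
        sleep(sleep_seconds)
        waits += 1
    return waits
```

Passing the sleep function as an argument makes it possible to try the logic without actually waiting 60 seconds between checks.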
If you want to use a conda environment to execute the scripts, pass the conda parameter with the following two sub-parameters:
- The path to the environment you want to use. If left empty, and a conda environment is active, NeatSeq-Flow will use the path to the active environment.
- The name of the environment to use. If absent or left empty, NeatSeq-Flow will extract the name from the
Following is an example of a global-parameters block:
Global_params:
    Default_wait: 10
    Qsub_path: /path/to/qstat
    Qsub_q: queue.q
    Qsub_nodes: [node1,node2,node3]
    Qsub_opts: -V -cwd -notify
    module_path:
        - /path/to/modules1/
        - /path/to/modules2/
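Some of the constraints stated above (Default_wait must be an integer, Qsub_nodes is a list, Qsub_opts is one long string that should preferably include -notify) can be checked before running NeatSeq-Flow. The following is a minimal sketch that assumes the Global_params block has already been loaded into a Python dict; the function check_global_params is hypothetical and is not part of NeatSeq-Flow.

```python
def check_global_params(params: dict) -> list:
    """Return a list of problems found in a Global_params mapping."""
    problems = []

    wait = params.get("Default_wait", 10)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(wait, bool) or not isinstance(wait, int):
        problems.append("Default_wait must be an integer")

    nodes = params.get("Qsub_nodes")
    if nodes is not None and not isinstance(nodes, list):
        problems.append("Qsub_nodes must be a list")

    opts = params.get("Qsub_opts")
    if opts is not None and not isinstance(opts, str):
        problems.append("Qsub_opts must be one string, not a list")
    elif opts is not None and "-notify" not in opts.split():
        problems.append("Qsub_opts lacks the recommended -notify")

    return problems


# The example block above, written as a dict.
example = {
    "Default_wait": 10,
    "Qsub_path": "/path/to/qstat",
    "Qsub_q": "queue.q",
    "Qsub_nodes": ["node1", "node2", "node3"],
    "Qsub_opts": "-V -cwd -notify",
    "module_path": ["/path/to/modules1/", "/path/to/modules2/"],
}
print(check_global_params(example))
```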
Step-wise parameters define parameters which are specific to the various steps included in the workflow.
All step-wise parameters are set within a Step_params block in YAML format.
A parameter block for a step (a module instance) should look as follows:
Step_params:
    trim1:
        module: trimmo
        base: merge1
        script_path: java -jar trimmomatic-0.32.jar
        qsub_params:
            -pe: shared 20
            node: node1
        todo: LEADING:20 TRAILING:20
        redirects:
            -threads: 20
trim1 is the step name. This should be a single-word, informative name (alphanumeric and underscore are permitted) which will be included in the script names and output directory names.
Following the step name, with indentation, are the step parameters as defined below.
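The naming rule for steps (a single word made of alphanumeric characters and underscores) is easy to check up front. The sketch below is an illustration only; is_valid_step_name is a hypothetical helper, and NeatSeq-Flow's own validation may differ.

```python
import re

# One or more ASCII letters, digits or underscores, and nothing else.
STEP_NAME = re.compile(r"[A-Za-z0-9_]+")


def is_valid_step_name(name: str) -> bool:
    """True if `name` is a single word of letters, digits and underscores."""
    return STEP_NAME.fullmatch(name) is not None
```

For example, trim1 passes the check, while names containing spaces or hyphens do not.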
Step parameters can be divided into the following groups:
Required parameters for each step
- module: The name of the module of which this step is an instance.
- base: The name of the step on which the current step is based (not required for the merge step, which is always first and single). base can be a YAML formatted list of base steps.
- script_path: The full path to the script executed by this step.
- If the program executed by the module is on the search PATH of all the nodes in the queue, you can just pass the program name without the full path. This is not usually recommended.
- If the program requires a specific version of python or Rscript, you can append those before the actual path, e.g.
- Sometimes, modules can require a path to a directory rather than to an executable. See, e.g., module
Other parameters you can set for each step to control the execution of the step scripts:
- Set various environment variables for the duration of script execution. This is useful when the software executed by the script requires setting specific environment variables which you do not want to set globally on all nodes.
Set cluster-related parameters which will be effective for the current step only:
- A node or YAML list of nodes on which to run the step scripts (overrides global parameter Qsub_nodes).
- Will limit the execution of the step’s scripts to this queue (overrides global parameter Qsub_q).
- Will set the -pe parameter for all scripts for this module (see SGE
- Set the value of qsub parameter YYY. This is a way to define other SGE parameters for all step scripts.
- Defines whether to use sample-wise files or project-wise files. Check per-module documentation for whether and how this parameter is defined (see, e.g., the
- This is an experimental feature. A comma-separated list of samples on which to execute the module. Scripts will be created only for the samples in the list. This selection will be valid for all instances based on this instance, until the value all_samples is passed. Use this option with care, since the samples not in the list will not have the step outputs, which may well be required downstream.
A use case could be when you want to run a step with different parameters for different sample subsets. Both versions of the instance should inherit from a common base, and the downstream step can inherit from both versions, so that all samples have all files, created with different parameters.
- Is used to define step-specific conda parameters. The syntax is the same as for the global conda definition (see the global parameters section above). If set, the env will be used to execute the scripts of this step only. If a global conda exists, the local definition will override the global definition.
If you have set global conda parameters, and want a step to execute not within a conda environment, pass an empty
- A local folder which exists on all cluster nodes. Uses a local directory for intermediate files before copying results to the final destination in datadir. This is useful when the cluster manager requires you to limit your IO to the central disk system.
As of version 1.3.0, NeatSeq-Flow no longer supports the List-format used in previous versions!
It is recommended to provide full paths to the files listed in the sample file. However, if relative paths are provided, NeatSeq-Flow will attempt to expand them to full paths, using the current directory as the base directory.
When passing URLs as sample locations (see documentation for merge module), it is compulsory to include the protocol, or scheme, at the beginning of the URL.
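A quick way to check whether a location carries a scheme is Python's standard urllib.parse module. The sketch below is illustrative only: has_scheme is a hypothetical helper, the example.org addresses are made up, and the check NeatSeq-Flow itself performs may be different.

```python
from urllib.parse import urlparse


def has_scheme(location: str) -> bool:
    """True if `location` starts with a protocol such as ftp:// or https://."""
    parsed = urlparse(location)
    # Require both a scheme and a host, so plain file paths are rejected.
    return bool(parsed.scheme) and bool(parsed.netloc)


print(has_scheme("ftp://ftp.example.org/reads/Sample1_R1.fastq.gz"))
print(has_scheme("ftp.example.org/reads/Sample1_R1.fastq.gz"))
```

The first call reports that a scheme is present; the second does not, because without ftp:// the whole string is read as a path.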
The sample file has, at the moment, 4 sections:
The project title is supplied in a line with the following structure:
Title and the title name must be separated by a single TAB character. This is the rule for all sections of the sample file.
If more than one title line is included, one of them will be selected and a warning will be generated.
The samples themselves are coded in a TAB-separated table with a header, as follows:
#SampleID Type Path
The table must be in consecutive lines following the header line.
- The first field is the sample name (no spaces!),
- the 2nd field is the file type and
- the third field is the file path.
Additional columns are ignored.
You may comment out lines in the table by prepending a # character.
An example of a sample table follows:
#SampleID	Type	Path
Sample1	Forward	/full/path/to/Sample1_R1_001.fastq.gz
Sample1	Reverse	/full/path/to/Sample1_R2_001.fastq.gz
Sample2	Forward	/full/path/to/Sample2_R1_001.fastq.gz
Sample2	Reverse	/full/path/to/Sample2_R2_001.fastq.gz
The following file types are recognized by NeatSeq-Flow and will be automatically merged into the correct position in the file index (indicated in the second column):
Other types can be included, as well. For how to merge them correctly into NeatSeq-Flow, see the documentation for
- Each line represents one file. For samples with multiple files, add lines with the same sample name.
- Keep forward and reverse files in pairs. Each forward file should have its reverse file in the following line.
- Each sample can contain different combinations of file types but the user must be careful when doing unexpected things like that…
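The rules for the sample table can be illustrated with a short parser. This is a sketch under stated assumptions, not NeatSeq-Flow's own reader: parse_sample_table is a hypothetical function, the table is assumed to end at the first empty line, and rows starting with # are assumed to be commented out (as in the comment line of the project-file example).

```python
from collections import defaultdict


def parse_sample_table(text: str) -> dict:
    """Map sample name -> file type -> list of paths, in file order."""
    samples = defaultdict(lambda: defaultdict(list))
    in_table = False
    for line in text.splitlines():
        if line.startswith("#SampleID"):
            in_table = True  # header line: the table starts on the next line
            continue
        if not in_table:
            continue
        if not line.strip():
            break  # assumption: an empty line ends the consecutive table
        if line.startswith("#"):
            continue  # assumption: a leading # comments the row out
        # Fields are TAB-separated; columns after the third are ignored.
        name, file_type, path = line.split("\t")[:3]
        samples[name][file_type].append(path)
    return {name: dict(types) for name, types in samples.items()}
```

Because each file type holds a list, a sample with several forward and reverse files keeps them in the order given, so the n-th forward file lines up with the n-th reverse file.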
As of NeatSeq-Flow version 1.3.0, you can pass project-wise files, such as reference files, through the sample file. This is done as above for the sample data, in a separate table with the following structure:
For example, a project file section could look like this:
#Type	Path
Nucleotide	/path/to/reference.fasta
Protein	/path/to/reference.faa
# This is a comment line
The same file types that can be used in the Sample files section can also be used in the project files section.
Passing project files in the sample file is an experimental feature and should be used with caution. See the merge documentation for details on how to import various types of sample files.
Up to NeatSeq-Flow version 1.2.0, the sample file can only contain sample files. No project files are permitted.
- If you have project files, create a single sample which will represent your project.
- If you have mainly sample files, such as fastq files, and some project level files such as reference genomes, pass them to the modules through the parameter file.
For ChIP-seq experiments, one must define ChIP and Control (‘input’) pairs. This is done in the following manner (in the sample file):
Sample_Control	anti_sample1:input_sample1
Sample_Control	anti_sample2:input_sample2
Replace anti_sample1 and input_sample1 with the relevant sample names.
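The ChIP and control pairing can be read from the sample file along the following lines. This is a minimal sketch, assuming that Sample_Control and the pair are separated by a TAB (the rule stated above for all sections of the sample file); parse_sample_control is a hypothetical helper, not a NeatSeq-Flow function.

```python
def parse_sample_control(text: str) -> dict:
    """Map each ChIP sample name to its control ('input') sample name."""
    pairs = {}
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) >= 2 and fields[0] == "Sample_Control":
            # The pair is written as ChIP_sample:control_sample.
            chip, control = fields[1].strip().split(":", 1)
            pairs[chip] = control
    return pairs
```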
Executing NeatSeq-Flow is the simplest step in the workflow (make sure python and neatseq_flow.py are in your search path):
python neatseq_flow.py \
    -s sample_file.nsfs \
    -p param_file1.nsfp,param_file2.nsfp \
    -m "message" \
    -d /path/to/workflow/directory
- NeatSeq-Flow does not require installation. If you have a local copy, append the full path to
- You can pass a comma-separated list of parameter files. NeatSeq-Flow concatenates the files in the order they’re passed. Make sure there are no conflicts or duplicated definitions in the files (this occurs mainly for global parameters)
- Alternatively, you can pass several parameter files by specifying more than one -p.
- It is not compulsory to pass a message via -m but it is highly recommended for documentation and reproducibility.
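When NeatSeq-Flow is launched from another Python script, the command line shown above can be assembled programmatically. The sketch below uses only the options that appear in the command above (-s, -p, -m and -d); build_command is a hypothetical helper, and the file names are the placeholders from the example.

```python
import shlex


def build_command(sample_file, param_files, workdir, message=None):
    """Build the neatseq_flow.py argument list for subprocess.run()."""
    cmd = [
        "python", "neatseq_flow.py",
        "-s", sample_file,
        # Several parameter files are passed as one comma-separated value.
        "-p", ",".join(param_files),
    ]
    if message is not None:
        cmd += ["-m", message]  # optional, but recommended
    cmd += ["-d", workdir]
    return cmd


cmd = build_command(
    "sample_file.nsfs",
    ["param_file1.nsfp", "param_file2.nsfp"],
    "/path/to/workflow/directory",
    message="message",
)
print(shlex.join(cmd))
```

Keeping the command as a list, rather than a single string, avoids quoting problems when the message contains spaces.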
The workflow can be executed fully automatically, on a step-by-step basis, or for individual samples separately.
Execute the following command within the workflow directory:
The scripts/00.workflow.commands.csh script runs all the steps at once, leaving flow control entirely to the cluster job manager.
Each line in scripts/00.workflow.commands.csh calls a step-wise script, such as scripts/01.merge_merge1.csh, which contains a list of qsub commands executing the individual scripts on each sample.
The following command will execute only the