Specifying workflow design and input files

Author: Menachem Sklarz

Introduction

The workflow design information and the specification of input files are written to parameter and sample files, respectively.

Following is a description of the parameter and sample files which are required in order to execute NeatSeq-Flow.

The parameter file is stored in YAML format and the sample file in a tab-delimited format.

The files can be created either through the NeatSeq-Flow GUI or by using a text editor, such as Notepad++.

Parameter file definition

Tip

The parameter file is rarely created from scratch. Take an existing parameter file defined for the analysis you require and modify it to suit your SGE cluster and specific requirements.

The parameter file must include a Global parameters section and a Step-wise parameters section. It may also include a Variables section. All sections are described below:

Global parameters

Attention

In the NeatSeq-Flow GUI, the global parameters described below are set in the Cluster tab.

Several SGE and other parameters can be set globally so that all scripts use them for execution. Overriding the defaults on a step-wise basis is possible in the step-wise section of the parameter file.

All global parameters are set within a Global_params block in YAML format.

Executor
Define the cluster manager to use. Options are SGE (default), SLURM or Local. The SLURM and Local support are in Beta development stage.
Qsub_q
Defines the default queue to send the jobs to (this is the value passed to qsub with the -q parameter).
Qsub_nodes

Limits the nodes to which to send the jobs. Must be nodes that are available to the queue requested in Qsub_q. The nodes should be passed in a YAML list format. e.g.

Qsub_nodes:
    - node1
    - node2
Qsub_opts
Other SGE parameters to be set as default for all scripts, e.g. -V -cwd etc. The parameters should be passed in one long string and not as a list.

Attention

It is highly recommended to pass the -notify argument to qsub in this string. If it is passed, all modules producing bash-based scripts will report early termination of scripts with qdel in the log file. If -notify is not passed, jobs killed with qdel will have a line in the log file reporting the job start time but there will be no indication that the job was terminated (besides it not having a line indicating finish time)

Qsub_path
The path to the qstat command. If not set, qstat will be used as-is with no path. Sometimes in remote nodes the qstat command is not in the path and if Qsub_path is not set, the step start and stop logging will fail.

Tip

The correct value for the Qsub_path parameter can be determined by executing the following command:

dirname `which qsub`

For SLURM:

dirname `which sbatch`
Default_wait
The time, in seconds, to wait for jobs to enter the queue before terminating the step-level script. Must be an integer. The default is 10, which is usually a good value to start with. If downstream jobs seem to be sent for execution before earlier jobs have terminated, increase this value.
module_path
Enables including modules not in the main NeatSeq-Flow package. This includes the modules downloaded from the NeatSeq-Flow Modules and workflows repository as well as modules you added yourself (see section Module and Workflow Repository). Keep your modules in a separate path and pass the path to NeatSeq-Flow with module_path. Several of these can be passed in YAML list format for more than one external module path. The list will be searched in order, with the main NeatSeq-Flow package being searched last.

Attention

When executing NeatSeq-Flow within a conda environment, NeatSeq-Flow will add the path to the modules repository automatically (See Install and execute NeatSeq-Flow with Conda). You don’t have to worry about setting it in the parameter file unless you have your own modules installed in a different location.

job_limit

If there is an upper limit on the jobs you can send to the job manager, you can use the job_limit parameter to pass NeatSeq-Flow a file with one line, e.g.:

limit=1000 sleep=60

This will make the scripts check every 60 seconds if there are less than 1000 jobs registered for the user. New jobs will be released only when there are less than the specified limit.

conda

If you want to use a conda environment to execute the scripts, pass this parameter with the following two sub-parameters:

path

The path to the environment you want to use. If left empty, and a conda environment is active, NeatSeq-Flow will use the path to the active environment. However, you will have to define the base of the conda installation with:

export CONDA_BASE=$(conda info --root)
env
The name of the environment to use. If absent or left empty, NeatSeq-Flow will extract the name from the CONDA_DEFAULT_ENV environment variable, which contains the name of the active conda environment.
setenv
Enables setting environment variables for all steps in the workflow. Is equivalent to setting setenv in all steps (see setenv in step parameters.).

Following is an example of a global-parameters block:

Global_params:
    Default_wait: 10
    Qsub_path: /path/to/qstat
    Qsub_q: queue.q
    Qsub_nodes: [node1,node2,node3]
    Qsub_opts:  -V -cwd -notify
    module_path:
        - /path/to/modules1/
        - /path/to/modules2/

Attention

As of version 1.4.0, NeatSeq-Flow supports SLURM clusters, as well as stand-alone computers. This is done by adding the Executor parameter in the Global_params section, and setting it’s value to SLURM or Local. This is, however, in beta development stage.

Variables

Attention

In the NeatSeq-Flow GUI, the variables are set in the Vars tab.

In this section, you can set values to variables which can then be incorporated in required positions in the parameter file.

The values are incoporated by referencing them in curly braces. e.g. if you set blast: /path/to/blastp in the Vars section, then you you can reference it with {Vars.blastp} in the other global and step-wise parameters sections.

Step-wise parameters

Attention

In the NeatSeq-Flow GUI, the step-wise parameters described below are set in the Work-Flow tab.

Step-wise parameters define parameters which are specific to the various steps included in the workflow.

All step-wise parameters are set within a Step_params block in YAML format.

A parameter block for a step (a module instance) should look as follows:

Step_params:
    trim1:
        module: trimmo
        base: merge1
        script_path: java -jar trimmomatic-0.32.jar
        qsub_params:
            -pe: shared 20
            node: node1
        todo: LEADING:20 TRAILING:20
        redirects:
            -threads: 20

trim1 is the step name. This should be a single-word, informative name (alphanumeric and underscore are permitted) which will be included in the script names and output directory names.

Following the step name, with indentation, are the step parameters as defined below.

Step parameters can be divided into the following groups:

  1. Required parameters for each step
  2. Additional parameters
  3. Redirected parameters

Required parameters for each step

module
The name of the module of which this step is an instance.
base
The name of the step on which the current step is based (not required for the Import step, which is always first and single). base can be a YAML formatted list of base steps.
script_path
The full path to the script executed by this step.

Note

  1. If the program executed by the module is on the search PATH of all the nodes in the queue, you can just pass the program name without the full path. This is not usually recommended.
  2. If the program requires a specific version of python or Rscript, you can append those before the actual path, e.g. /path/to/python /path/to/executable
  3. Sometimes, modules can require a path to a directory rather than to an executable. See, e.g., module UCSC_BW_wig.
  4. Some modules, such as manage_types do not use the script_path parameter. For these modules, you must include an empty script_path, as it is a required parameter.

Additional parameters

Other parameters you can set for each step to control the execution of the step scripts:

tag
Set a tag for the instance. All instances downstream to the tagged instance will have the same tag. The scripts created by all instances with the same tag can be executed at once using the tag master-script created in directory scripts/tags_scripts.

Tip

  1. If an instance has multiple bases, the tag of the first tagged base will be used.
  2. To stop an instance from getting its base’s tag, set an empty tag: parameter.
intermediate
Will add a line to script scripts/95.remove_intermediates.sh for deleting the results of this step. If the data produced by this step is not required in the long term, add this flag and when you are done with the project, you can execute scripts/95.remove_intermediates.sh to remove all intermediate files.
setenv
Set various environment variables for the duration of script execution. This is useful when the software executed by the script requires setting specific environment variables which you do not want to set globally on all nodes. The step setenv takes precedence over gobal setenv settings. If setenv is empty, no variables will be set in scripts even when a global setenv is set.

Note

For bash scripts, export will automatically be used instead of setenv.

precode
Additional code to be added before the actual script, such as unsetting variables and what not. Rarely used.
qsub_params

Set cluster-related parameters which will be effective for the current step only:

node
A node or YAML list of nodes on which to run the step scripts (overrides global parameter Qsub_nodes)
queue or -q
Will limit the execution of the step’s scripts to this queue (overrides global parameter Qsub_q)
-pe
Will set the -pe parameter for all scripts for this module (see SGE qsub manual).
-XXX: YYY
Set the value of qsub parameter -XXX to YYY. This is a way to define other SGE parameters for all step scripts.
scope
Defines whether to use sample-wise files or project-wise files. Check per-module documentation for whether and how this parameter is defined (see, e.g., the blast module).
sample_list

Limit this step to a subset of the samples. Scripts will be created only for the samples in the list. This selection will be valid for all instances based on this instance.

The sample list can be expressed in two ways:

  • A yaml list or a comma-separated list of sample names:

    sample_list: [sample1, sample2]
    
  • A category and level(s) from a mapping file:

    sample_list:
        category:  Category1
        levels:     [level1,level2]
    

For using all but a subset of samples, use exclude_sample_list instead of sample_list.

Tip

A use case could be when you want to run a step with different parameters for different sample subsets. Both versions of the instance should inherit from a common base and the downstream step can inherit both versions, thus all samples will have all files, created with different parameters.

Tip

To return to a wider sample list, add a second base which contains the version of the rest of the samples which you need.

conda
Is used to define step specific conda parameters. The syntax is the same as for the global conda definition (see here). If set, the path and env will be used to execute the scripts of this step only. If a global conda exists, the local definition will override the global definition.

Attention

If you have set global conda parameters, and want a step to execute not within a conda environment, pass an empty conda field.

arg_separator
Sometimes, the delimiter between program argument and value is not blank space (’ ‘) but something else, like ‘=’. For these modules, you should set arg_separator to the separator character. e.g. arg_separator: '=' See PICARD programs for examples.
local
A local folder which exists in all cluster nodes. Uses a local directory for intermediate files before copying results to final destination in data dir. This is useful when the cluster manager requires you to limit your IO to the central disk system.

Redirected parameters

Parameters to be redirected to the actual program executed by the step.

Redirected parameters are specified within a redirects: block (see example in Step-wise parameters above).

Note

the parameter name must include the - or -- required by the program defined in script_path.

Comments

  1. The local directory passed to local must exist on all nodes in the queue.
  2. For a list of qsub parameters, see the qsub man page
  3. The list of nodes passed to node within the qsub_params block will be appended to the queue name (global or step specific). Don’t add the queue name to the node names.

Sample file definition

Attention

In the NeatSeq-Flow GUI, the samples can be defined in the Samples tab.

Attention

As of version 1.3.0, NeatSeq-Flow no longer supports the List-format used in previous versions!

Attention

It is recommended to provide full paths to the files listed in the sample file. However, if relative paths are provided, NeatSeq-Flow will attempt to expand them to full paths, using the current directory as the base directory.

Important

When passing URLs as sample locations (see documentation for Import module), it is compulsory to append the protocol, or scheme, at the beginning of the URL.

  • Good: ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR453/SRR453032/SRR453032_1.fastq.gz
  • Bad: ftp.sra.ebi.ac.uk/vol1/fastq/SRR453/SRR453032/SRR453032_1.fastq.gz

The sample file has, at the moment, 4 sections:

Project title

The project title is supplied in a line with the following structure:

Title       name_of_analysis

Attention

The word Title and the title name must be separated by a single TAB character. This is the rule for all sections of the sample file.

Caution

If more that one title line is included, one of them will be selected and a warning will be generated.

Sample files

The samples themselves are coded in a TAB-separated table with a header, as follows:

#SampleID   Type    Path

The table must be in consecutive lines following the header line.

  • The first field is the sample name (no spaces!),
  • the 2nd field is the file type and
  • the third field is the file path.

Additional columns are ignored.

You may comment out lines in the table by prepending a # character.

An example of a sample table follows:

#SampleID   Type    Path
Sample1     Forward /full/path/to/Sample1_R1_001.fastq.gz
Sample1     Reverse /full/path/to/Sample1_R2_001.fastq.gz
Sample2     Forward /full/path/to/Sample2_R1_001.fastq.gz
Sample2     Reverse /full/path/to/Sample2_R2_001.fastq.gz

The following file types are recognized by NeatSeq-Flow and will be automatically imported into the correct position in the file index (indicated in the second column):

File types recognized by NeatSeq-Flow
Source Target
Forward fastq.F
Reverse fastq.R
Single fastq.S
Nucleotide fasta.nucl
Protein fasta.prot
SAM sam
BAM bam
REFERENCE reference
VCF vcf
G.VCF g.vcf
GTF gtf
GFF gff
GFF3 gff3
manifest qiime2.manifest
barcodes barcodes

Other types can be included, as well. For how to import them correctly into NeatSeq-Flow, see the documentation for Import module.

Note

  1. Each line represents one file. For samples with multiple files, add lines with the same sample name.
  2. Keep forward and reverse files in pairs. Each forward file should have it’s reverse file in the following line.
  3. Each sample can contain different combinations of file types but the user must be careful when doing unexpected things like that…

Project files

As of NeatSeq-Flow version 1.3.0, you can pass project-wise files, such as reference files, through the sample file. This is done as above for the sample data, in a separate table with the following structure:

#Type       Path

For example, a project file section could look like this:

#Type       Path
Nucleotide  /path/to/reference.fasta
Protein     /path/to/reference.faa
# This is a comment line

The same file types that can be used in the Sample files section, can also be used in the project files section.

Attention

Up to NeatSeq-Flow version 1.2.0, the sample file can only contain sample files. No project files are permitted.

  • If you have project files, create a single sample which will represent your project.
  • If you have mainly sample files, such as fastq files, and some project level files such as reference genomes, pass them to the modules through the parameter file.

ChIP-seq specific definitions

For ChIP-seq experiments, one must define ChIP and Control (‘input’) pairs. This is done in the following manner (in the sample file):

Sample_Control        anti_sample1:input_sample1
Sample_Control        anti_sample2:input_sample2

Just replace anti_sample1 and input_sample1 with the relevant sample names.