NeatSeq-Flow concept

Author: Menachem Sklarz

Affiliation: Bioinformatics Core Facility, National Institute of Biotechnology in the Negev, Ben-Gurion University.

Introduction

NeatSeq-Flow can be operated at three levels, from easy to advanced:

  1. Basic usage: A workflow is pre-coded in a parameter file downloaded from the Workflows repository or elsewhere. All the user has to do is modify the program paths in the parameter file and create a sample file describing the samples and their associated files. (See Sample file definition.)
  2. Workflow construction: The user defines a sample file as above, but also defines a workflow parameter file based on existing modules (see Modules and workflows repository). Of course, the user can take an existing workflow and modify and expand it according to the job at hand.
  3. Adding modules: Creating a workflow based on tools not yet included in NeatSeq-Flow. This requires the user to add modules for each program they expect to run. Alternatively, the user can use the Generic module to include programs without defining modules.

Basic usage

Basic usage does not require much.

  1. Copy a ready-made parameter file (you can find some in the Workflows dir in the main NeatSeq-Flow directory and at the Modules and workflows repository).
  2. Adjust the script paths to the correct paths on your system (These are usually coded as variables in the Vars section at the top of the parameter file).
  3. Create a sample file following the directions in Sample file definition.
  4. Execute NeatSeq-Flow (see Executing NeatSeq-Flow) to create the workflow scripts.
  5. Execute the workflow scripts by executing the scripts/00.workflow.commands.csh script.

Workflow construction

In order to construct workflows, one needs to combine modules in such a way that files are transferred seamlessly between them.

The module output filenames are organised within NeatSeq-Flow in a python dictionary called sample_data (after executing NeatSeq-Flow, this dictionary can be found in the JSON file WorkflowData.json in the objects directory). The dictionary tracks the locations of the output files from each module, saving each file type in a dedicated slot. For instance, fastq reads are stored in the fastq.F, fastq.R and fastq.S slots for forward-, reverse- and single-end reads, respectively. FASTA, SAM and BAM files, too, have dedicated slots.
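As a toy illustration, a snapshot of the dictionary might look like the following (the sample names and file paths are invented for illustration; they are not produced by any actual workflow):

```python
# Hypothetical snapshot of the sample_data dictionary after an import
# step has registered the raw reads. All names and paths are invented.
sample_data = {
    "Sample1": {
        "fastq.F": "/data/Sample1_R1.fastq.gz",  # forward reads
        "fastq.R": "/data/Sample1_R2.fastq.gz",  # reverse reads
    },
    "Sample2": {
        "fastq.S": "/data/Sample2.fastq.gz",     # single-end reads
    },
}
```

Each downstream module simply looks up the slot it needs rather than being told file names explicitly.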

A workflow is a combination of module instances that inherit the above-mentioned dictionary from other modules (these are called the base steps of the instance). Each module expects to find specific files in specific slots in the sample_data dictionary, which should be put there by one of the modules it inherits from. The instance then stores the filenames of its scripts' outputs in slots in the dictionary. These requirements are listed in each module's documentation, in the Requires and Output sections.

Often, the files are sample-specific, such as fastq files. In this case, they are stored in a dedicated sample slot in the dictionary, e.g. sample_data["Sample1"]. Project-wide files, such as an assembly created from all the project fastq files, are stored in the main sample_data dictionary.

Some of the modules take their inputs and put their outputs in the sample-specific slots and some use the project-wide slots. The sample-specific slots are indicated in the documentation as sample_data[<sample>]. Some modules can do both, and their exact behaviour is either controlled by a module parameter (e.g. scope in bowtie2_mapper) or guessed at by the module based on the dictionary structure.
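Schematically, the two scopes coexist in the same dictionary: project-wide slots sit directly in sample_data, alongside the per-sample sub-dictionaries (paths below are invented):

```python
# Toy sketch of sample scope vs. project scope; all paths are invented.
sample_data = {
    "Sample1": {"fastq.F": "reads/Sample1_R1.fastq"},  # sample scope
    "Sample2": {"fastq.F": "reads/Sample2_R1.fastq"},  # sample scope
    "fasta.nucl": "assembly/project_assembly.fasta",   # project scope
}
```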

Creating a workflow is then like assembling a puzzle. Each instance of a module must have an ancestor module (base module) that puts files in the slots required by the module. e.g. when the samtools module is executed, it expects to find a SAM file in sample_data[<sample>]["sam"]. It, in turn, produces a BAM and puts it in sample_data[<sample>]["bam"] for use by other modules downstream.
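In rough pseudocode terms, the samtools example can be sketched like this (an invented illustration of the slot bookkeeping, not the actual NeatSeq-Flow module code):

```python
def samtools_view(sample_data, sample):
    """Invented sketch: consume the 'sam' slot, register a 'bam' slot."""
    sam = sample_data[sample]["sam"]          # required input slot
    bam = sam.rsplit(".", 1)[0] + ".bam"      # derive the output file name
    # ... the real module would write a shell script running samtools here ...
    sample_data[sample]["bam"] = bam          # register output for downstream modules

sample_data = {"Sample1": {"sam": "/results/Sample1.sam"}}
samtools_view(sample_data, "Sample1")
# sample_data["Sample1"]["bam"] is now "/results/Sample1.bam"
```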

Sometimes, modules overwrite existing slots. This does not mean the files will be overwritten. It only means that access to these slots in downstream modules will refer to the newer files. e.g. the trimmo module puts its outputs in the same slots as the merge module. Therefore, a fastqc_html instance based on the merge instance will use the files created by merge while a fastqc_html instance based on the trimmo instance will use the files created by trimmo.
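The overwriting behaviour amounts to rebinding a dictionary key (file names below are invented):

```python
# Toy illustration of slot overwriting; paths are invented.
sample_data = {"Sample1": {"fastq.F": "merge/Sample1.fastq"}}
# After a trimming step runs, the same slot points at the newer file:
sample_data["Sample1"]["fastq.F"] = "trimmo/Sample1.trimmed.fastq"
# The merge output still exists on disk; it is simply no longer the
# file that downstream modules will find in this slot.
```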

Note

This might seem complicated, but once you get used to the dictionary structure you will see how flexible the whole thing really is.

Tip

Module instances can be based on more than one instance. e.g. if instance i is based on instances j,k, it is the same as having j based on k and i based on j. In other words, if both k and j write to the same slot, i will have access only to the output from j.

If k and j are independent of each other, then basing i on j,k enables j and k to run in parallel, possibly reducing runtime.
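One plausible way to model the precedence described above (not the actual NeatSeq-Flow merging code; file names are invented):

```python
# If i is based on "j,k" and both bases write to the same slot,
# i sees j's version. Model this by applying k first, then j.
k_out = {"Sample1": {"bam": "k/Sample1.bam"}}
j_out = {"Sample1": {"bam": "j/Sample1.bam"}}

i_view = {}
for base in (k_out, j_out):          # k applied first, then j, so j wins
    for sample, slots in base.items():
        i_view.setdefault(sample, {}).update(slots)
# i_view["Sample1"]["bam"] is now "j/Sample1.bam"
```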

Tip

If you add stop_and_show: to one of the instances' parameter blocks and run NeatSeq-Flow, it will terminate at that instance and show the structure of the sample_data dictionary. You can use this output to decide which modules can inherit from the instance.

Adding modules

Adding modules is the most difficult part of creating a workflow. Please make sure a module does not already exist for the program you want to run before trying to create a module.

It is our hope that a community of users will provide access to a wide range of modules, making the process of developing new workflows more straightforward for non-programmers.

For detailed instructions on writing modules, see For the Programmer - Adding Modules. The idea is to use the sample_data dictionary for input and output files while leaving as many of the other parameters as possible to the user. This will enable as much flexibility as possible while relieving the user of the need to track input and output files.

For standard file types, you should use the appropriate slots (check out similar modules for proper slots to use).

Slots for commonly used files

  File type          Scope    Slot
  -----------------  -------  ------------------------------------------------
  fastq              Sample   sample_data[<sample>]['fastq.F|fastq.R|fastq.S']
  fasta              Sample   sample_data[<sample>]['fasta.nucl|fasta.prot']
  fasta              Project  sample_data['fasta.nucl|fasta.prot']
  SAM                Sample   sample_data[<sample>]['sam']
  BAM                Sample   sample_data[<sample>]['bam']
  Aligner index      Sample   sample_data[<sample>]['<aligner name>_index']
  Aligner index      Project  sample_data['<aligner name>_index']
  Aligner reference  Sample   sample_data[<sample>]['reference']
  GFF                Sample   sample_data[<sample>]['gff']
  GFF                Project  sample_data['gff']

Tip

As mentioned above, module instances can be based on more than one instance, i.e. i can be based on j,k. In this case, if both j and k write to the same slot, i will have access only to the version created by j.

However, you can write modules such that i has access to the same slot in both k and j: previous versions of the sample_data dict are stored in the base_sample_data dictionary in the module class. The base_sample_data dict is keyed by the base instance name, and can be used to access overridden versions of files created by instances upstream of the present module.

For example, if base contains the name of the base instance (e.g. merge1), you can access the base’s sample data as follows:

self.base_sample_data[base]
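As a minimal sketch of this mechanism (the class layout, instance names and paths below are invented for illustration; they are not the actual NeatSeq-Flow module class):

```python
# Invented sketch of per-base snapshots kept alongside the merged dict.
class MyModule:
    def __init__(self):
        # sample_data as seen after both bases were applied (j overrode k):
        self.sample_data = {"Sample1": {"fastq.F": "j/Sample1.fastq"}}
        # Snapshots of each base's dictionary, keyed by base instance name:
        self.base_sample_data = {
            "j": {"Sample1": {"fastq.F": "j/Sample1.fastq"}},
            "k": {"Sample1": {"fastq.F": "k/Sample1.fastq"}},
        }

    def file_from_base(self, base, sample, slot):
        """Reach an overridden slot through a specific base's snapshot."""
        return self.base_sample_data[base][sample][slot]

mod = MyModule()
# mod.file_from_base("k", "Sample1", "fastq.F") retrieves k's version
# even though j's version overrode it in the merged sample_data.
```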