Author: Menachem Sklarz
NeatSeq-Flow can be operated at three levels, from easy to advanced:
- Basic usage: A workflow is pre-coded in a parameter file downloaded from the Workflows repository or elsewhere. All the user has to do is modify the program paths in the parameter file and create a sample file describing the samples and their associated files. (See Sample file definition.)
- Workflow construction: The user defines a sample file as above, but also defines a workflow parameter file based on existing modules (see Modules and workflows repository). Of course, the user can take an existing workflow and modify and expand it according to the job at hand.
- Adding modules: Creating a workflow based on tools not yet included in NeatSeq-Flow. This requires the user to add modules for each program they expect to run. Alternatively, the user can use the Generic module to include programs without defining modules.
Basic usage does not require much:
- Copy a ready-made parameter file (you can find some in the Workflows dir in the main NeatSeq-Flow directory and at the Modules and workflows repository).
- Adjust the script paths to the correct paths on your system (these are usually coded as variables in the Vars section at the top of the parameter file).
- Create a sample file following the directions in Sample file definition.
- Execute NeatSeq-Flow (see Executing NeatSeq-Flow) to create the workflow scripts.
- Execute the workflow scripts by running the master script that NeatSeq-Flow creates (it calls the individual step scripts in the correct order).
In order to construct workflows, one needs to combine modules in such a way that files are transferred seamlessly between them.
The module output filenames are organised within NeatSeq-Flow in a python dictionary called sample_data (after executing NeatSeq-Flow, this dictionary can be found in the JSON file WorkflowData.json in the objects directory). The dictionary tracks the locations of the output files from each module, saving each file type in a dedicated slot. For instance, fastq reads are stored in reads.X slots, where X is either F, R or S for forward-, reverse- and single-end reads, respectively. FASTA, SAM and BAM files, too, have dedicated slots.
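As a toy sketch of the structure just described (the paths and sample names are made up, and a real workflow may contain additional slots such as fasta, sam or bam), the dictionary might look like this:

```python
# Illustrative sketch of the sample_data dictionary described above.
# Slot names follow the reads.X convention from the text; the actual
# layout in WorkflowData.json may differ between versions.
sample_data = {
    "Sample1": {
        "reads.F": "/path/to/Sample1_R1.fastq.gz",  # forward reads
        "reads.R": "/path/to/Sample1_R2.fastq.gz",  # reverse reads
    },
    "Sample2": {
        "reads.S": "/path/to/Sample2.fastq.gz",     # single-end reads
    },
}

# Downstream modules look files up by slot, not by filename:
forward = sample_data["Sample1"]["reads.F"]
```

This is what makes modules interchangeable: a module never needs to know which upstream program produced a file, only which slot it lives in.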
A workflow is a combination of module instances that inherit the above-mentioned dictionary from other modules (these are called the base step of the instance). Each module expects to find specific files in specific slots in the sample_data dictionary, which should be put there by one of the modules it inherits from. The instance then stores the filenames of its scripts' outputs in slots in the dictionary. You can see these requirements in the module documentation, in the Requires and Output sections.
Often, the files are sample-specific, such as fastq files. In this case, they are stored in a dedicated sample slot in the dictionary, e.g. sample_data["Sample1"]. Project-wide files, such as an assembly created from all the project fastq files, are stored in the main part of the dictionary. Some of the modules take their inputs and put their outputs in the sample-specific slots and some use the project-wide slots. The sample-specific slots are indicated in the documentation as sample_data[<sample>]. Some modules can do both, and their exact behaviour is either controlled by a module parameter (e.g. bowtie2_mapper) or guessed by the module based on the dictionary structure.
Creating a workflow is then like assembling a puzzle. Each instance of a module must have an ancestor module (its base module) that puts files in the slots required by the module. E.g. when the samtools module is executed, it expects to find a SAM file in sample_data[<sample>]["sam"]. It, in turn, produces a BAM and puts it in sample_data[<sample>]["bam"] for use by other modules downstream.
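The slot contract can be sketched as follows. This is a toy stand-in, not NeatSeq-Flow's actual module API; it only mimics the require-a-slot, fill-a-slot behaviour described above:

```python
# Toy sketch of the slot contract: a samtools-like step requires a
# "sam" slot for each sample and fills a "bam" slot for downstream use.
def samtools_like_step(sample_data):
    for sample, slots in sample_data.items():
        if "sam" not in slots:
            raise KeyError(f"{sample}: no SAM file; base this step on a mapper")
        # A real module would write a script that runs samtools; here we
        # only record where the resulting BAM file will live.
        slots["bam"] = slots["sam"].replace(".sam", ".bam")
    return sample_data

sample_data = {"Sample1": {"sam": "Sample1.sam"}}
samtools_like_step(sample_data)
print(sample_data["Sample1"]["bam"])  # Sample1.bam
```

If the required slot is missing, the step fails immediately, which is exactly how mismatched base steps surface when assembling a workflow.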
Sometimes, modules overwrite existing slots. This does not mean the files will be overwritten; it only means that access to these slots in downstream modules will refer to the newer files. E.g. the trimmo module puts its outputs in the same slots as the merge module. Therefore, a fastqc_html instance based on the merge instance will use the files created by merge, while a fastqc_html instance based on the trimmo instance will use the files created by trimmo.
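The overwriting behaviour can be illustrated with a small sketch (the filenames are made up; each instance effectively sees its own view of the dictionary):

```python
import copy

# Toy illustration of slot "overwriting": a trimmo-like step writes to
# the same reads slot that merge filled, so what a downstream instance
# sees depends on which instance it is based on.
merge_view = {"Sample1": {"reads.F": "merged_R1.fastq"}}

# An instance based on merge inherits a copy of the dictionary and
# overwrites the slot in its own view; merge's files are not deleted.
trimmo_view = copy.deepcopy(merge_view)
trimmo_view["Sample1"]["reads.F"] = "trimmed_R1.fastq"

# A fastqc_html based on merge sees the merged file, while one based
# on trimmo sees the trimmed file:
print(merge_view["Sample1"]["reads.F"])   # merged_R1.fastq
print(trimmo_view["Sample1"]["reads.F"])  # trimmed_R1.fastq
```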
This might seem complicated, but once you get used to the dictionary structure you will see how flexible the whole thing really is.
Module instances can be based on more than one instance, e.g. instance i can be based on instances j,k. This is the same as having j based on k and i based on j. In other words, if both j and k write to the same slot, i will have access only to the output from j. However, if the outputs of j and k are independent of each other, then basing i on j,k enables j and k to run in parallel, possibly reducing runtime.
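The merging rule can be sketched like this (a toy model, not NeatSeq-Flow's actual merging code; it only reproduces the "first-listed base wins" behaviour described above):

```python
# Sketch of basing instance i on bases j,k: the base dictionaries are
# merged so that, for a shared slot, the first base listed (j) wins,
# matching "the same as j based on k and i based on j" above.
def merge_bases(*base_dicts):
    merged = {}
    for base in reversed(base_dicts):       # later bases first...
        for sample, slots in base.items():  # ...so earlier ones override
            merged.setdefault(sample, {}).update(slots)
    return merged

j = {"Sample1": {"reads.F": "from_j.fastq"}}
k = {"Sample1": {"reads.F": "from_k.fastq", "bam": "from_k.bam"}}

i_view = merge_bases(j, k)
# Shared slot: j's version wins; k's independent "bam" slot survives.
print(i_view["Sample1"])  # {'reads.F': 'from_j.fastq', 'bam': 'from_k.bam'}
```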
If you run NeatSeq-Flow with the word stop_and_show: in one of the instances' parameter blocks, NeatSeq-Flow will terminate at that instance and show the structure of the sample_data dictionary at that point. You can use the output to decide which modules can inherit from the instance.
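For illustration, a parameter block using stop_and_show: might look like the following. The instance name, base name and script path are made up; see the parameter-file documentation for the exact syntax expected on your system:

```yaml
# Hypothetical step block in a workflow parameter file
trim1:
    module:         trimmo
    base:           merge1
    script_path:    /path/to/trimmomatic
    stop_and_show:      # no value needed; presence of the key suffices
```

Once you have inspected the printed dictionary structure, remove the stop_and_show: line and re-run NeatSeq-Flow to generate the full workflow.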
Adding modules is the most difficult part of creating a workflow. Please make sure a module does not already exist for the program you want to run before trying to create a module.
It is our hope that a community of users will provide access to a wide range of modules, making the process of developing new workflows more straightforward for non-programmers.
For detailed instructions for writing modules, see For the Programmer - Adding Modules. The idea is to use the sample_data dictionary for input and output files while leaving as many of the other parameters as possible to the user. This will enable as much flexibility as possible while relieving the user of the need to track input and output files.
For standard file types, you should use the appropriate slots (check out similar modules for proper slots to use).
As mentioned above, module instances can be based on more than one instance, i.e. i can be based on j,k. It was stated that in this case, if both the j and k instances write to the same slot, i will have access only to the version created by j. However, you can write modules such that i has access to the same slot both in k and in j: previous versions of the sample_data dict are stored in the base_sample_data dictionary in the module class. The base_sample_data dict is keyed by the base instance name. This can be used to access overwritten versions of files created by instances upstream of the present module.
For example, if base contains the name of the base instance (e.g. merge1), you can access the base's sample data as follows:
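A rough sketch of that lookup follows. The surrounding class is a toy stand-in, not NeatSeq-Flow's actual module class; only the base_sample_data attribute and its keying by base instance name follow the text above:

```python
# Toy stand-in for a module class, showing the base_sample_data lookup
# described above. Instance names and filenames are made up.
class ToyModule:
    def __init__(self):
        # Keyed by base instance name, as the text describes:
        self.base_sample_data = {
            "merge1":  {"Sample1": {"reads.F": "merged_R1.fastq"}},
            "trimmo1": {"Sample1": {"reads.F": "trimmed_R1.fastq"}},
        }
        # The merged view the module normally sees (trimmo1 wins here):
        self.sample_data = {"Sample1": {"reads.F": "trimmed_R1.fastq"}}

mod = ToyModule()
base = "merge1"  # name of the base instance, as in the text
# Access the otherwise-overwritten version produced by merge1:
original = mod.base_sample_data[base]["Sample1"]["reads.F"]
print(original)  # merged_R1.fastq
```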