How NeatSeq-Flow works
Author: Menachem Sklarz
A detailed description of how NeatSeq-Flow works is provided in the NeatSeq-Flow article on bioRxiv. Pay special attention to Supplementary Figures S3 and S4.
Here we describe how file locations are internally managed and how they are transferred between workflow steps.
In NeatSeq-Flow, the locations of files produced by the executed programs are stored in a Python dictionary called sample_data (after executing NeatSeq-Flow, this dictionary can be found in the JSON file WorkflowData.json in the objects directory). The dictionary stores each file type in a dedicated slot. For instance, fastq reads are stored in fastq.X slots, where X is F, R or S for forward, reverse and single-end reads, respectively. FASTA, SAM and BAM files also have dedicated slots.
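The saved dictionary can be inspected after a run. The following is a minimal sketch, assuming only what is stated above (a JSON file named WorkflowData.json inside the objects directory). The helper names load_sample_data and list_keys are hypothetical, and the internal layout of the JSON file is not assumed beyond it being nested dictionaries.

```python
import json
from pathlib import Path


def load_sample_data(objects_dir):
    """Load the sample_data dictionary saved by a NeatSeq-Flow run.

    objects_dir is the workflow's objects directory; where it lives on
    disk is an assumption and depends on where the workflow was generated.
    """
    json_path = Path(objects_dir) / "WorkflowData.json"
    with json_path.open() as handle:
        return json.load(handle)


def list_keys(data, prefix=""):
    """Return every nested key path in the dictionary, e.g. 'Sample1/fastq.F'."""
    paths = []
    for key, value in data.items():
        path = f"{prefix}/{key}" if prefix else str(key)
        paths.append(path)
        if isinstance(value, dict):
            # Descend into nested dictionaries to expose their slots too
            paths.extend(list_keys(value, path))
    return paths
```

Calling list_keys(load_sample_data("path/to/objects")) gives a quick overview of which slots exist, without relying on any particular slot name.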
A workflow is a combination of module instances, each of which inherits the above-mentioned dictionary from other instances (these are called the base steps of the instance). Each module expects to find files in specific slots in the sample_data dictionary, which should be put there by one of the instances it inherits from. The instance then stores the filenames of its scripts’ outputs in slots in the dictionary. These requirements are listed in the Requires and Output sections of each module's documentation.
Often, the files are sample-specific, such as fastq files. In this case, they are stored in a dedicated sample slot in the dictionary, e.g. sample_data["Sample1"]. Project-wide files, such as an assembly created from all the project fastq files, are stored in project-wide slots.
Some modules take their inputs and put their outputs in the sample-specific slots, while others use the project-wide slots. The sample-specific slots are indicated in the documentation as sample_data[<sample>]. Some modules can do both; their exact behaviour is either controlled by a module parameter (e.g. bowtie2_mapper) or inferred by the module from the dictionary structure.
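The distinction between sample-specific and project-wide slots can be pictured with a small toy dictionary. This is an illustrative sketch, not the actual NeatSeq-Flow data structure: the key used for the project-wide scope (PROJECT_KEY), the slot name used for the assembly, the file names and the get_slot helper are all assumptions made for the example.

```python
# Hypothetical key under which project-wide slots are kept (an assumption)
PROJECT_KEY = "project_data"

# Toy dictionary: per-sample fastq slots plus one project-wide assembly slot
sample_data = {
    "Sample1": {"fastq.F": "Sample1_F.fastq", "fastq.R": "Sample1_R.fastq"},
    "Sample2": {"fastq.F": "Sample2_F.fastq", "fastq.R": "Sample2_R.fastq"},
    PROJECT_KEY: {"fasta.nucl": "assembly.fasta"},
}


def get_slot(data, slot, sample=None):
    """Return the file stored in a slot, at sample scope or project scope."""
    scope = sample if sample is not None else PROJECT_KEY
    try:
        return data[scope][slot]
    except KeyError:
        raise KeyError(f"slot {slot!r} not found in scope {scope!r}") from None


print(get_slot(sample_data, "fastq.F", sample="Sample1"))  # Sample1_F.fastq
print(get_slot(sample_data, "fasta.nucl"))                 # assembly.fasta
```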
Creating a workflow is then like assembling a puzzle. Each instance of a module must have an ancestor instance (base module) that puts files in the slots required by the module. For example, when the samtools module is executed, it expects to find a SAM file in sample_data[<sample>]["sam"]. It, in turn, produces a BAM file and puts it in sample_data[<sample>]["bam"] for use by other modules that are based on it.
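The puzzle analogy can be sketched as a function that checks for a required slot and then fills an output slot. This is a simplified model under stated assumptions, not NeatSeq-Flow's implementation: run_instance, the instance name samtools1 and the file paths are hypothetical, and only sample-level slots are handled.

```python
import copy


def run_instance(base_data, name, requires, produces):
    """Model a per-sample module instance that reads one slot and writes another."""
    # The instance inherits a copy of the dictionary from its base instance
    data = copy.deepcopy(base_data)
    for sample, slots in data.items():
        if requires not in slots:
            raise KeyError(
                f"{name}: no base instance put a file in "
                f"sample_data[{sample!r}][{requires!r}]"
            )
        # Record where this instance's output for the sample would be written
        slots[produces] = f"{name}/{sample}.{produces}"
    return data


# A mapper-like base instance has already filled the "sam" slot
mapper_data = {"Sample1": {"sam": "mapper/Sample1.sam"}}
samtools_data = run_instance(mapper_data, "samtools1", requires="sam", produces="bam")
print(samtools_data["Sample1"]["bam"])  # samtools1/Sample1.bam
```

Basing the same instance on a step that never wrote a "sam" slot raises the KeyError, which corresponds to a puzzle piece that does not fit.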
Sometimes, module instances overwrite existing slots. This does not mean the files themselves will be overwritten; it only means that downstream instances accessing these slots will be referred to the newer files. For example, the trimmo module puts its outputs in the same slots as the Import module. Therefore, a fastqc_html instance based on the Import instance will use the files created by Import, while a fastqc_html instance based on the trimmo instance will use the files created by trimmo.
This might seem complicated, but once you get used to the dictionary structure you will see how flexible the whole thing really is.
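The slot-overwriting behaviour described above can be illustrated with plain dictionaries. In this sketch the derive helper and all file paths are hypothetical; the point is only that each instance sees its own view of the slots, and the raw files stay on disk.

```python
import copy


def derive(base_data, new_slots):
    """Return a new dictionary: the base instance's slots, updated by new outputs."""
    data = copy.deepcopy(base_data)
    for sample, slots in new_slots.items():
        data.setdefault(sample, {}).update(slots)
    return data


# Dictionary as left by an Import-like instance
import_data = {"Sample1": {"fastq.F": "raw/Sample1_F.fastq"}}

# A trimmo-like instance writes its output to the same slot
trimmo_data = derive(import_data, {"Sample1": {"fastq.F": "trimmo/Sample1_F.fastq"}})

# A fastqc_html-like instance reads whichever file its base exposes
print(import_data["Sample1"]["fastq.F"])  # raw/Sample1_F.fastq
print(trimmo_data["Sample1"]["fastq.F"])  # trimmo/Sample1_F.fastq
```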
Module instances can be based on more than one instance. For example, if instance i is based on instances j,k, it is the same as having j based on k and i based on j. In other words, if both k and j write to the same slot, i will have access only to the output from j.
If k and j are independent of each other, then basing i on j,k allows j and k to run in parallel, thus reducing runtime.
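The precedence rule for multiple bases can be sketched as a dictionary merge in which bases listed earlier win. This is a toy model of the rule as stated above, not the actual merging code; merge_bases and the file names are hypothetical.

```python
def merge_bases(*bases):
    """Merge the dictionaries of several base instances.

    Bases are given in the order they are listed (e.g. j, k). Applying them
    in reverse means that slots written by earlier bases take precedence.
    """
    merged = {}
    for base in reversed(bases):
        for sample, slots in base.items():
            merged.setdefault(sample, {}).update(slots)
    return merged


j_data = {"Sample1": {"bam": "j/Sample1.bam"}}
k_data = {"Sample1": {"bam": "k/Sample1.bam", "sam": "k/Sample1.sam"}}

# Instance i is based on j,k: it sees j's "bam" and k's "sam"
i_view = merge_bases(j_data, k_data)
print(i_view["Sample1"]["bam"])  # j/Sample1.bam
print(i_view["Sample1"]["sam"])  # k/Sample1.sam
```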
If you run NeatSeq-Flow with the word stop_and_show: in one of the instances’ parameter blocks, NeatSeq-Flow will terminate at that instance and show the structure of the sample_data dictionary. You can use the output to decide which modules can inherit from the instance.
As of version 1.4, the stop_and_show: output includes the provenance of the file types, i.e. the history of instances that modified each file type. For example, consider the following output:
Samples: Sample1, Sample2, Sample3
- fastq.R.unpaired (>trim_gal)
- Reverse (>merge1)
- fastq.F (>merge1->trim_gal)
- fastq.F.unpaired (>trim_gal)
- Forward (>merge1)
- fastq.R (>merge1->trim_gal)
In this output, fastq.R files were created by merge1 and modified by trim_gal, while fastq.F.unpaired files were created by trim_gal.
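Provenance lines of this form can also be read programmatically. The sketch below assumes the line layout matches the example output shown above ("- slot (>step1->step2)"); the parse_provenance function and its regular expression are hypothetical helpers, not part of NeatSeq-Flow.

```python
import re

# Matches lines such as "- fastq.F (>merge1->trim_gal)"
LINE_RE = re.compile(r"^-\s+(?P<slot>\S+)\s+\((?P<chain>>[^)]*)\)\s*$")


def parse_provenance(text):
    """Map each file type to the ordered list of instances that wrote it."""
    provenance = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            # Split on ">" or "->"; the first entry is the creating instance
            steps = [s for s in re.split(r"-?>", match.group("chain")) if s]
            provenance[match.group("slot")] = steps
    return provenance


output = """Samples: Sample1, Sample2, Sample3
- fastq.R.unpaired (>trim_gal)
- Reverse (>merge1)
- fastq.F (>merge1->trim_gal)
- fastq.F.unpaired (>trim_gal)
- Forward (>merge1)
- fastq.R (>merge1->trim_gal)"""

provenance = parse_provenance(output)
print(provenance["fastq.R"])           # ['merge1', 'trim_gal']
print(provenance["fastq.F.unpaired"])  # ['trim_gal']
```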
Read more on how NeatSeq-Flow works in the NeatSeq-Flow article on bioRxiv.