How NeatSeq-Flow works
Author: Menachem Sklarz
A detailed description of how NeatSeq-Flow works is provided in NeatSeq-Flow article on BioRXiv. Pay special attention to Supplementary Figures S3 and S4.
Here we describe how file locations are internally managed and how they are transferred between workflow steps.
In NeatSeq-Flow, locations of files produced by the programs being executed are stored in a python dictionary called sample_data
(after executing NeatSeq-Flow, this dictionary can be found in the JSON file WorkflowData.json
in the objects
directory). The dictionary stores each file type in a dedicated slot. For instance, fastq reads are stored in fastq.X
slots, where X
is either F
, R
or S
for forward-, reverse- and single-end reads, respectively. FASTA, SAM and BAM files, too, have dedicated slots.
A workflow is a combination of module instances that inherit the above-mentioned dictionary from other modules (these are called the base step
of the instance). Each module expects to find files in specific slots in the sample_data
dictionary, which should be put there by one of the modules it inherits from. The instance then stores the filenames of its scripts’ outputs in slots in the dictionary. You can see these requirements in the module documentation, in the Requires and Output sections.
Often, the files are sample-specific, such as fastq files. In this case, they are stored in a dedicated sample slot in the dictionary, e.g. sample_data["Sample1"]
. Project-wide files, such as an assembly created from all the project fastq files, are stored in the sample_data["project_data"]
dictionary.
Some of the modules take their inputs and put their outputs in the sample-specific slots and some use the project-wide slots. The sample-specific slots are indicated in the documentation as sample_data[<sample>]
. Some modules can do both, and their exact behaviour is either controlled by a module parameter (e.g. scope
in bowtie2_mapper
) or guessed at by the module based on the dictionary structure.
Creating a workflow is then like assembling a puzzle. Each instance of a module must have an ancestor module (base
module) that puts files in the slots required by the module. e.g. when the samtools
module is executed, it expects to find a SAM file in sample_data[<sample>]["sam"]
. It, in turn, produces a BAM and puts it in sample_data[<sample>]["bam"]
for use by other modules that are based on it.
Sometimes, module instances overwrite existing slots. This does not mean the files will be overwritten. It only means that access to these slots in downstream instances will refer to the newer files. e.g. the trimmo
module puts its outputs in the same slots as the Import
module. Therefore, a fastqc_html
instance based on the Import
instance will use the files created by Import
while a fastqc_html
instance based on the trimmo
instance will use the files created by trimmo
.
Note
This might seem complicated, but once you get used to the dictionary structure you will see how flexible the whole thing really is.
Tip
Module instances can be based on more than one instance. e.g. if instance i is based on instances j,k, it is the same as having j based on k and i based on j. In other words, if both k and j write to the same slot, i will have access only to the output from j.
If k and j are independent of each other, then basing i on j,k enables j and k to run in parallel, thus reducing runtime.
Tip
If you run NeatSeq-Flow with the word stop_and_show:
in one of the instances’ parameter blocks, NeatSeq-Flow will terminate at that instance and show the structure of the sample_data
dictionary. You can use the output to decide which modules can inherit from the instance.
As of version 1.4, stop_and_show:
output includes the provenance of the file types, i.e. the histroy of instances modifying the file types. For examples, the following output:
Samples: Sample1, Sample2, Sample3
Slots:
- fastq.R.unpaired (>trim_gal)
- Reverse (>merge1)
- fastq.F (>merge1->trim_gal)
- fastq.F.unpaired (>trim_gal)
- Forward (>merge1)
- fastq.R (>merge1->trim_gal)
shows that fastq.F
and fastq.R
files were created by merge1
and modified by trim_gal
, while files fastq.R.unpaired
and fastq.F.unpaired
were created by trim_gal
instance.
Read more on how NeatSeq-Flow works: NeatSeq-Flow article on BioRXiv