How NeatSeq-Flow works

Author: Menachem Sklarz

A detailed description of how NeatSeq-Flow works is provided in the NeatSeq-Flow article on bioRxiv. Pay special attention to Supplementary Figures S3 and S4.

Here we describe how file locations are internally managed and how they are transferred between workflow steps.

In NeatSeq-Flow, the locations of files produced by the programs being executed are stored in a Python dictionary called sample_data (after executing NeatSeq-Flow, this dictionary can be found in the JSON file WorkflowData.json in the objects directory). The dictionary stores each file type in a dedicated slot. For instance, fastq reads are stored in fastq.X slots, where X is F, R or S for forward, reverse and single-end reads, respectively. FASTA, SAM and BAM files also have dedicated slots.
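To inspect the dictionary after a run, the JSON file can be loaded directly. The sketch below relies only on what is stated above (a file named WorkflowData.json inside an objects directory). The helper name load_sample_data and the workflow_dir argument are hypothetical, and the exact layout of the loaded dictionary may differ between versions.

```python
import json
from pathlib import Path


def load_sample_data(workflow_dir):
    """Load the saved sample_data dictionary.

    Hypothetical helper: it assumes the objects directory sits directly
    under workflow_dir, which may need adjusting for your setup.
    """
    json_path = Path(workflow_dir) / "objects" / "WorkflowData.json"
    with json_path.open() as handle:
        return json.load(handle)
```

Calling load_sample_data on the directory in which NeatSeq-Flow was executed should return a regular Python dictionary that can be explored interactively.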

A workflow is a combination of module instances, each of which inherits the above-mentioned dictionary from one or more other instances (these are called the base steps of the instance). Each module expects to find files in specific slots in the sample_data dictionary, which should be put there by one of the instances it inherits from. The instance then stores the filenames of its scripts’ outputs in slots in the dictionary. You can see these requirements in the module documentation, in the Requires and Output sections.

Often, the files are sample-specific, such as fastq files. In this case, they are stored in a dedicated sample slot in the dictionary, e.g. sample_data["Sample1"]. Project-wide files, such as an assembly created from all the project fastq files, are stored in the sample_data["project_data"] dictionary.
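As a rough illustration of this layout, the dictionary might look something like the sketch below. The sample names, file paths and the fasta.nucl slot name are assumptions made for the example, and the real dictionary may hold additional keys.

```python
# Illustrative layout only: paths, sample names and the "fasta.nucl"
# slot name are made up for this example.
sample_data = {
    "Sample1": {
        "fastq.F": "raw/Sample1_R1.fastq",  # forward reads
        "fastq.R": "raw/Sample1_R2.fastq",  # reverse reads
    },
    "Sample2": {
        "fastq.S": "raw/Sample2.fastq",     # single-end reads
    },
    "project_data": {
        "fasta.nucl": "assembly/project_assembly.fasta",  # project-wide file
    },
}

# Sample-specific slots are reached through the sample name,
# project-wide slots through the "project_data" key.
sample_names = [key for key in sample_data if key != "project_data"]
forward_reads = sample_data["Sample1"]["fastq.F"]
assembly = sample_data["project_data"]["fasta.nucl"]
```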

Some modules take their inputs from, and put their outputs in, the sample-specific slots, while others use the project-wide slots. The sample-specific slots are indicated in the documentation as sample_data[<sample>]. Some modules can do both, and their exact behaviour is either controlled by a module parameter (e.g. scope in bowtie2_mapper) or guessed by the module based on the dictionary structure.

Creating a workflow is then like assembling a puzzle. Each instance of a module must have an ancestor module (base module) that puts files in the slots required by the module. For example, when the samtools module is executed, it expects to find a SAM file in sample_data[<sample>]["sam"]. It, in turn, produces a BAM file and puts it in sample_data[<sample>]["bam"] for use by other modules that are based on it.
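The requires/produces contract described above can be sketched in a few lines of Python. This is not the samtools module's code: the function name, the out_dir argument and the output file naming are hypothetical, and only the "sam" and "bam" slot names come from the text.

```python
import copy


def sam_to_bam_instance(base_data, samples, out_dir):
    """Mimic an instance that requires a 'sam' slot and produces a 'bam' slot.

    Hypothetical sketch: it only records file names, it does not run samtools.
    """
    data = copy.deepcopy(base_data)  # work on an inherited copy of the dictionary
    for sample in samples:
        slots = data[sample]
        if "sam" not in slots:
            # The base instance did not provide the required slot.
            raise KeyError(f"{sample}: no 'sam' slot found in the base instance")
        slots["bam"] = f"{out_dir}/{sample}.bam"
    return data


mapper_data = {"Sample1": {"sam": "mapping/Sample1.sam"}}
samtools_data = sam_to_bam_instance(mapper_data, ["Sample1"], "samtools")
```

Basing the sketch on a dictionary without a "sam" slot raises an error, which mirrors the puzzle analogy: an instance only fits after a base that fills the slots it requires.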

Sometimes, module instances overwrite existing slots. This does not mean the files will be overwritten. It only means that access to these slots in downstream instances will refer to the newer files. For example, the trimmo module puts its outputs in the same slots as the Import module. Therefore, a fastqc_html instance based on the Import instance will use the files created by Import, while a fastqc_html instance based on the trimmo instance will use the files created by trimmo.
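The effect of overwriting a slot can be illustrated with two copies of the dictionary, one per instance. The paths and the fastqc_input helper are hypothetical. The point is only that each downstream instance sees the slot contents of its own base.

```python
import copy

# Slots as left by the Import instance (hypothetical path).
import_data = {"Sample1": {"fastq.F": "raw/Sample1_R1.fastq"}}

# The trimmo instance inherits a copy and overwrites the same slot.
# The raw file itself stays on disk; only the slot now points elsewhere.
trimmo_data = copy.deepcopy(import_data)
trimmo_data["Sample1"]["fastq.F"] = "trimmo/Sample1_R1.trimmed.fastq"


def fastqc_input(base_data, sample):
    """Return the forward-read file a fastqc_html instance would pick up."""
    return base_data[sample]["fastq.F"]


raw_input = fastqc_input(import_data, "Sample1")      # based on Import
trimmed_input = fastqc_input(trimmo_data, "Sample1")  # based on trimmo
```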

Note

This might seem complicated, but once you get used to the dictionary structure you will see how flexible the whole thing really is.

Tip

Module instances can be based on more than one instance. For example, basing instance i on instances j,k is the same as having j based on k and i based on j. In other words, if both k and j write to the same slot, i will have access only to the output from j.

If k and j are independent of each other, then basing i on j,k enables j and k to run in parallel, thus reducing runtime.
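The precedence rule for multiple bases can be sketched as a dictionary merge in which the first listed base wins. The merge_bases function is a hypothetical illustration of the behaviour described in this tip, not NeatSeq-Flow's own merging code.

```python
import copy


def merge_bases(*bases):
    """Merge base dictionaries so that earlier bases override later ones."""
    merged = {}
    # Apply the last base first, so that earlier bases overwrite its slots.
    for base in reversed(bases):
        for key, slots in base.items():
            merged.setdefault(key, {}).update(copy.deepcopy(slots))
    return merged


# Both j and k write to the "bam" slot; only k provides "sam".
j_data = {"Sample1": {"bam": "j/Sample1.bam"}}
k_data = {"Sample1": {"bam": "k/Sample1.bam", "sam": "k/Sample1.sam"}}

i_data = merge_bases(j_data, k_data)  # instance i based on j,k
```

Here i_data["Sample1"]["bam"] refers to the file from j, while the "sam" slot, which only k filled, remains available to i.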

Tip

If you run NeatSeq-Flow with the word stop_and_show: in one of the instances’ parameter blocks, NeatSeq-Flow will terminate at that instance and show the structure of the sample_data dictionary. You can use the output to decide which modules can inherit from the instance.

As of version 1.4, stop_and_show: output includes the provenance of the file types, i.e. the history of instances modifying the file types. For example, the following output:

Samples: Sample1, Sample2, Sample3
Slots:
- fastq.R.unpaired (>trim_gal)
- Reverse (>merge1)
- fastq.F (>merge1->trim_gal)
- fastq.F.unpaired (>trim_gal)
- Forward (>merge1)
- fastq.R (>merge1->trim_gal)

shows that the fastq.F and fastq.R files were created by merge1 and modified by trim_gal, while the fastq.R.unpaired and fastq.F.unpaired files were created by the trim_gal instance.
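If you want to process this provenance programmatically, the slot lines can be split into a slot name and a list of instances. The parser below is a minimal sketch that assumes the line format shown in the example output above. The function name parse_slot_line is hypothetical, and other versions may format the output differently.

```python
def parse_slot_line(line):
    """Split a line such as '- fastq.F (>merge1->trim_gal)' into its parts.

    Returns (slot_name, [instances in the order they touched the slot]).
    """
    body = line.strip().lstrip("-").strip()  # drop the leading list marker
    slot, _, history = body.partition(" (")  # slot name, then the provenance
    history = history.rstrip(")")
    instances = [name.lstrip(">") for name in history.split("->") if name]
    return slot, instances


slot, instances = parse_slot_line("- fastq.F (>merge1->trim_gal)")
creator = instances[0]     # instance that first created the slot
modifiers = instances[1:]  # instances that later modified it
```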

Read more on how NeatSeq-Flow works: the NeatSeq-Flow article on bioRxiv