For the Programmer - Adding Modules

Author: Menachem Sklarz

Affiliation: Bioinformatics Core Facility, National Institute of Biotechnology in the Negev, Ben-Gurion University.

Steps in writing NeatSeq-Flow modules

Preparing the module file

  1. Choose a name for the module. e.g. bowtie2_mapper

  2. Decide which level the module will work on: samples or project-wide?

  3. Change the name of the template file to <module_name>.py.

  4. Make sure the file is in a directory that contains an empty __init__.py file. This directory is passed to NeatSeq-Flow through the module_path global parameter (see Parameter file definition).

  5. Change the class name to Step_<module_name> in the line beginning with class Step_.... Make sure <module_name> here is identical to the one you used in the filename above.
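The naming rules above can be checked mechanically. The following is a minimal sketch, not part of NeatSeq-Flow: check_module_file() is a hypothetical helper that reports whether a module file sits next to an __init__.py file and whether it defines a class named Step_<module_name> matching the file name.

```python
import os
import re


def check_module_file(path):
    """Return a list of problems found with a module file (empty if none).

    Hypothetical helper for illustration; it is not part of NeatSeq-Flow.
    """
    problems = []
    module_name = os.path.splitext(os.path.basename(path))[0]
    directory = os.path.dirname(os.path.abspath(path))

    # The module directory must contain an __init__.py file.
    if not os.path.isfile(os.path.join(directory, "__init__.py")):
        problems.append("no __init__.py next to %s" % os.path.basename(path))

    # The class name must be Step_<module_name>, matching the file name.
    with open(path) as handle:
        source = handle.read()
    pattern = r"^class\s+Step_%s\b" % re.escape(module_name)
    if not re.search(pattern, source, flags=re.MULTILINE):
        problems.append("class Step_%s not found" % module_name)

    return problems
```

Calling check_module_file("bowtie2_mapper.py") on a correctly prepared module returns an empty list.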

Things to modify in the actual code

  1. In step_specific_init(), set self.shell to csh or bash, depending on the shell language you want your scripts to be written in. bash is recommended, because it is compatible with Conda (see Install and execute with Conda).
  2. In the step_sample_initiation() method, you can operate on the sample_data structure before the scripts are built. This is the place for assertion checks (see Exceptions and Warnings) that make sure the data the step requires exists in the sample_data structure.
  3. build_scripts() is the actual place to put the step script building code. See Instructions for build_scripts() function.
  4. make_sample_file_index() is a place to put code that produces an index file of the files produced by this step (BLAST uses this function, so you can check it out in blast.py)
  5. In create_spec_preliminary_script() you create the code for a script that will be run before all other step scripts are executed. If not defined or returns nothing, it will be ignored (i.e. you can set it to pass). This is useful if you need to prepare a database, for example, before the other scripts use it.
  6. In create_spec_wrapping_up_script() you create the code for a script that will be run after all other step scripts are executed. If not defined or returns nothing, it will be ignored (i.e. you can set it to pass). This is the place to call make_sample_file_index() to create an index of the files produced in this step, and to call a script that takes the index file and performs some kind of data agglomeration.
  7. It is highly recommended to create a class-level list of the redirected parameters that the user should not pass because they are dealt with by your module. The list should be called auto_redirs and you should place it directly after the class definition line (i.e. the line beginning with class Step_...). After instance creation, the list is checked by NeatSeq-Flow to make sure the user did not pass forbidden parameters.
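The list above can be summarized as a module skeleton. This is only a sketch: the Step class below is an empty stand-in so that the example runs on its own (a real module inherits from the NeatSeq-Flow base class imported in the template file), and the module name my_module and the -i and -o flags in auto_redirs are hypothetical.

```python
class Step(object):
    """Empty stand-in for the NeatSeq-Flow base class (assumption, not the real class)."""
    pass


class Step_my_module(Step):
    # Redirected parameters handled by the module itself; users must not pass them.
    auto_redirs = ["-i", "-o"]  # hypothetical flags

    def step_specific_init(self):
        # Shell language in which the scripts will be written.
        self.shell = "bash"

    def step_sample_initiation(self):
        # Check that the data this step requires exists in sample_data.
        pass

    def build_scripts(self):
        # Build the step scripts here.
        pass

    def make_sample_file_index(self):
        # Optionally write an index of the files produced by this step.
        pass

    def create_spec_preliminary_script(self):
        # Optional script run before all other step scripts.
        pass

    def create_spec_wrapping_up_script(self):
        # Optional script run after all other step scripts.
        pass
```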

Instructions for build_scripts() function

  • If sample-level scripts are required, the function should contain a loop:

    for sample in self.sample_data["samples"]:
    
  • Set self.script to contain the command/s executed by the script (This will go inside the for loop for sample-level steps)

    1. Initialize it with self.script = ""

    2. Calling self.script += self.get_script_const() will add the setenv parameter, if it exists, followed by the script_path parameter and the redirected parameters. After that, only the input and output parameters remain to be added.

    3. The input parameter, typically -i, is usually based on the sample data structure, e.g.:

      self.script += "-i %s \\\n\t" % self.sample_data[sample]["fasta.nucl"]
      

    Note

    The "\\\n\t" at the end of the string makes the final script more readable.

    4. The output parameter (typically -o) should be set to a filename within self.base_dir. If the step is a sample-level step, get a directory for the output files by calling:

      sample_dir = self.make_folder_for_sample(sample)
      
  • Place the output file somewhere in the sample_data structure. e.g.:

    self.sample_data[sample]["bam"] = (sample_dir + os.path.basename(output_filename))
    
  • If the output is a standard file, e.g. a BAM or fastq file, put it in the respective place in sample_data. See the documentation for similar modules to find out the naming scheme. Otherwise, use a concise file-type descriptor for the file and specify the location you decided on in the module documentation.

  • You can add more than one command in the self.script variable if the two commands are typically executed together. See samtools module for an example.

  • The function should end with the following line (within the sample-loop, if one exists):

    self.create_low_level_script()
    
  • These steps, together with a little debugging, are usually all that is required to add a module to the pipeline.
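Put together, the instructions above give a build_scripts() function along the following lines. This is a sketch under stated assumptions: StubStep is a hypothetical stand-in for the NeatSeq-Flow base class, and its three methods only imitate the real ones so that the example runs on its own. The program name my_program, the output file name and the "my_output" descriptor are also made up for illustration.

```python
import os


class StubStep(object):
    """Hypothetical stand-in for the NeatSeq-Flow base class (not the real API)."""

    def __init__(self, sample_data, base_dir):
        self.sample_data = sample_data
        self.base_dir = base_dir
        self.script = ""
        self.written = []  # scripts "created" so far, kept for inspection

    def get_script_const(self):
        # The real method adds setenv, script_path and the redirected parameters.
        return "my_program \\\n\t"

    def make_folder_for_sample(self, sample):
        # Assumed to return a path ending in a separator; no directory is created here.
        return os.path.join(self.base_dir, sample) + os.sep

    def create_low_level_script(self):
        # The real method writes the script file; the stub just stores the text.
        self.written.append(self.script)


class Step_my_module(StubStep):

    def build_scripts(self):
        for sample in self.sample_data["samples"]:
            # Start a fresh script for each sample.
            self.script = ""
            sample_dir = self.make_folder_for_sample(sample)
            output_filename = sample + ".out"  # hypothetical output name

            self.script += self.get_script_const()
            self.script += "-i %s \\\n\t" % self.sample_data[sample]["fasta.nucl"]
            self.script += "-o %s\n\n" % (sample_dir + output_filename)

            # Record the output file so that downstream steps can find it.
            self.sample_data[sample]["my_output"] = (
                sample_dir + os.path.basename(output_filename))

            self.create_low_level_script()
```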

Attention

The steps above assume you don’t want to support the option of working on a local directory and transferring the finished results to the final location (see local parameter). If you do want to support it, you have to create a temporary directory with:

use_dir = self.local_start(sample_dir)

or:

use_dir = self.local_start(self.base_dir)

Use use_dir when defining the script, but use sample_dir and self.base_dir when assigning to self.sample_data (see the templates for examples).

Finally, add the following line before self.create_low_level_script():

self.local_finish(use_dir, sample_dir)

Note that the above procedure lets the user decide whether to run locally, by adding the local parameter to the step's parameter block in the parameter file.
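The following sketch shows where the calls fit in a sample-level build_scripts() function. It rests on several assumptions: StubStep is a hypothetical stand-in for the NeatSeq-Flow base class, its local_start() and local_finish() are simplified imitations (the real methods decide based on the local parameter and may generate different commands), and the local_dir argument, my_program and "my_output" are invented for illustration.

```python
import os


class StubStep(object):
    """Hypothetical stand-in for the NeatSeq-Flow base class (not the real API)."""

    def __init__(self, sample_data, base_dir, local_dir=None):
        self.sample_data = sample_data
        self.base_dir = base_dir
        self.local_dir = local_dir  # assumed scratch directory; None means not local
        self.script = ""
        self.written = []

    def make_folder_for_sample(self, sample):
        return os.path.join(self.base_dir, sample) + os.sep

    def local_start(self, final_dir):
        # Simplified: work in final_dir unless a scratch directory was given.
        if self.local_dir is None:
            return final_dir
        name = os.path.basename(final_dir.rstrip(os.sep))
        return os.path.join(self.local_dir, name) + os.sep

    def local_finish(self, use_dir, final_dir):
        # Simplified: copy the results to the final location when working locally.
        if use_dir != final_dir:
            self.script += "cp -r %s* %s\n" % (use_dir, final_dir)

    def create_low_level_script(self):
        self.written.append(self.script)


class Step_my_module(StubStep):

    def build_scripts(self):
        for sample in self.sample_data["samples"]:
            self.script = ""
            sample_dir = self.make_folder_for_sample(sample)
            use_dir = self.local_start(sample_dir)
            output_filename = sample + ".out"  # hypothetical output name

            # The script itself writes into use_dir ...
            self.script += "my_program -o %s\n" % (use_dir + output_filename)
            # ... while sample_data records the final location.
            self.sample_data[sample]["my_output"] = sample_dir + output_filename

            self.local_finish(use_dir, sample_dir)
            self.create_low_level_script()
```

With this layout, downstream steps always see the final path in sample_data, whether or not the step ran in a local directory.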

Exceptions and Warnings

When programming a module, the programmer usually has certain requirements from the user, for instance parameters that are required to be set in the parameter file, sets of parameters which the user has to choose from and parameters which can take only specific values.

This kind of condition is typically programmed in Python using assertions.

In NeatSeq-Flow, assertions are managed with the AssertionExcept exception class. For testing the parameters, create an if condition which raises an AssertionExcept. The arguments to AssertionExcept are as follows:

  1. An error message to be displayed. AssertionExcept will automatically add the step name to the message.
  2. Optional: The sample name, in case the condition failed for a particular sample (e.g. a particular sample does not have a BAM file defined.)

A typical condition testing code snippet:

for sample in self.sample_data["samples"]:
    if not CONDITION:
        raise AssertionExcept("INFORMATIVE error message\n", sample)

Note

The reason for using if not CONDITION rather than if CONDITION is that the condition is then phrased as a condition for success rather than for failure, which is more intuitive (for me at least).

If you only want to warn the user about a certain issue, rather than failing, you can induce NeatSeq-Flow to produce a warning message with the same format as an AssertionExcept message, as follows:

for sample in self.sample_data["samples"]:
    if CONDITION:
        self.write_warning("Warning message.\n", sample)

Note

As for AssertionExcept, the sample argument is optional.
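As a concrete illustration of both patterns, the sketch below checks that every sample has a BAM file and warns when a nucleotide fasta file is missing. It is not NeatSeq-Flow code: AssertionExcept and StubStep are simplified stand-ins written so that the example runs on its own, and the particular checks and messages are hypothetical.

```python
class AssertionExcept(Exception):
    """Stand-in for NeatSeq-Flow's AssertionExcept: a message plus an optional sample."""

    def __init__(self, comment, sample=None):
        super(AssertionExcept, self).__init__(comment)
        self.comment = comment
        self.sample = sample


class StubStep(object):
    """Hypothetical stand-in for the NeatSeq-Flow base class (not the real API)."""

    def __init__(self, sample_data):
        self.sample_data = sample_data
        self.warnings = []  # the real method reports warnings; the stub collects them

    def write_warning(self, message, sample=None):
        self.warnings.append((message, sample))

    def step_sample_initiation(self):
        for sample in self.sample_data["samples"]:
            # Condition for success: the sample has a BAM file. Fail otherwise.
            if not "bam" in self.sample_data[sample]:
                raise AssertionExcept("No BAM file defined.\n", sample)
            # Non-fatal issue: warn and carry on.
            if "fasta.nucl" not in self.sample_data[sample]:
                self.write_warning("No nucleotide fasta file defined.\n", sample)
```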