Preparation and QC

Import *

Authors: Menachem Sklarz
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for importing and merging files from the sample file into NeatSeq-Flow.

Files can be imported in three ways:

  1. If there is a single file of a given type (per sample), it can be imported directly, i.e. the existing file path will be used as the source for the workflow.
  2. If there are multiple files of a given type, or you would like to make a local copy of the raw files, the file(s) can be copied and concatenated to the workflow directory.
  3. If the raw files are compressed, importing can include decompression as well as concatenation.

Tip

If you have plenty of disk space, the 2nd and 3rd options are the recommended approach. They ensure that the original files remain untouched, and when the workflow is complete you can discard the copies produced by NeatSeq-Flow.

The Import module can be used in two modes:

The Basic mode

NeatSeq-Flow will attempt to guess all the parameters it requires. Multiple files will be concatenated and stored in the file type index according to the table below. File types not included in the table will be stored in the file type index by the type specified in the sample file.

You have to make sure that all files of each file type have the same extension for NeatSeq-Flow to guess the script_path and pipe parameters.

The Advanced mode

This mode is used when more control over data importing and concatenation is required. It enables full control over which file types are imported, how they are copied and in which slots they are placed in the file type index. It also enables importing file types not recognized by NeatSeq-Flow (see list below).

In this mode, you have to define the following lists: src, trg, script_path, scope and ext. For each file type in the sample file, you should have an entry in the src list. The other lists should apply to the equivalent entry in src. trg is the target file type (in the file type index) for the imported files, script_path is the shell command to use to concatenate the source type files, scope is the scope for which the source type is defined and ext is the suffix to append to the final filenames. Strings are expanded to the length of src list, so if script_path is the same for all source types, it is enough to specify it once.
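The expansion rule for single strings can be pictured with a short Python sketch. This is only an illustration of the behaviour described above, not NeatSeq-Flow's actual code; the function name expand_to_src is hypothetical.

```python
def expand_to_src(value, src):
    """Return a list with one entry per source type.

    A single string is repeated len(src) times; a list is accepted only if
    it already has the same length as src.
    """
    if isinstance(value, str):
        return [value] * len(src)
    if len(value) != len(src):
        raise ValueError(
            f"expected {len(src)} entries to match src, got {len(value)}"
        )
    return list(value)


src = ["UR1", "UR2"]
params = {
    "script_path": expand_to_src("cat", src),            # one string for all
    "scope": expand_to_src(["sample", "project"], src),  # explicit per-entry list
}
print(params)
```

In this sketch a list of the wrong length is rejected rather than padded, so every entry in src always has exactly one matching value.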

The Advanced mode is triggered by passing the src list. In principle you should then define the other lists as well, i.e. trg, ext, scope and script_path. However, for any list you leave out, NeatSeq-Flow will try to guess the values based on the lists of recognized file types and extensions.

If some of the file types in src are recognized and some are not, you can pass the lists mentioned above with values for the unrecognized types, leaving null in the positions of the recognized types. These null values will be guessed by NeatSeq-Flow.
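As a hedged illustration of how null positions could be filled, the sketch below completes a trg list from a small lookup of recognized source types. The dictionary is a subset of the "File types recognized by NeatSeq-Flow" table in the Requires section; the function name fill_targets and the error handling are assumptions made for this example, not the module's real implementation.

```python
# Subset of the "File types recognized by NeatSeq-Flow" table (source -> target).
RECOGNIZED_TARGETS = {
    "Forward": "fastq.F",
    "Reverse": "fastq.R",
    "Single": "fastq.S",
    "Nucleotide": "fasta.nucl",
    "Protein": "fasta.prot",
}


def fill_targets(src, trg):
    """Replace None (YAML null) entries in trg with the recognized target type."""
    filled = []
    for source, target in zip(src, trg):
        if target is not None:
            filled.append(target)  # a user-supplied value always wins
        elif source in RECOGNIZED_TARGETS:
            filled.append(RECOGNIZED_TARGETS[source])
        else:
            raise ValueError(f"cannot guess trg for unrecognized type {source!r}")
    return filled


print(fill_targets(["UR1", "Forward", "Reverse"], ["unrecog1", None, None]))
# ['unrecog1', 'fastq.F', 'fastq.R']
```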

The advanced mode is experimental, and documentation will hopefully improve as we gain experience with it.

Note

Definition of script_path in the import module
script_path should be a shell program that receives a list of files and writes a single combined output to the standard output. Examples of such programs are cat for text files and gzip -cd for gzipped files. Other types of compressed files should have such a command as well.

Tip

NeatSeq-Flow attempts to guess the script_path and pipe values based on the input file extensions. For this to work, leave the script_path and pipe lists empty and make sure all files from the same source have the same extensions (e.g. all gzipped files should have .gz as file extension).

If you want NeatSeq-Flow to guess only some of the script_path values, set them to null or to ..guess.., e.g. if src is [Single,TYP1] and script_path is [null,cat], then the script_path for Single will be guessed and the script_path for TYP1 will be set to cat.

Two more options are available for script path: ..skip.. will skip the type entirely, while ..import.. will import the values from the sample file into the relevant slots without actually producing any scripts (This is useful for including entities which are not files in the sample file. e.g. in the qiime2 pipeline you might want to include a semantic type in the sample file).

The following extensions are recognized:

File extensions recognized by NeatSeq-Flow
Extension script_path pipe
.fasta cat  
.faa cat  
.fna cat  
.txt cat  
.tsv cat  
.csv cat  
.fastq cat  
.fa cat  
.fq cat  
.gz gzip -cd  
.zip echo 'xargs -d " " -I % sh -c "unzip -p %"'
.bz2 bzip2 -cd  
.dsrc2 echo 'xargs -d " " -I % sh -c "dsrc2 d -s %"'
.dsrc echo 'xargs -d " " -I % sh -c "dsrc d -s %"'
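
The extension-based guessing can be sketched in Python as a simple lookup. The code below is an illustration only: it uses a few rows of the table above (script_path column only), treats YAML null as Python None, and the function names guess_script_path and resolve_script_paths are hypothetical. The file names are made up for the example.

```python
import os

# script_path column for a few rows of the table above (pipe column omitted).
SCRIPT_PATH_BY_EXT = {
    ".fasta": "cat",
    ".fastq": "cat",
    ".txt": "cat",
    ".gz": "gzip -cd",
}

GUESS_MARKERS = (None, "..guess..")  # None is how YAML null arrives in Python


def guess_script_path(files):
    """Look up script_path from the extension shared by all files of one source."""
    extensions = {os.path.splitext(name)[1] for name in files}
    if len(extensions) != 1:
        raise ValueError(f"files must share one extension, found {sorted(extensions)}")
    ext = extensions.pop()
    if ext not in SCRIPT_PATH_BY_EXT:
        raise ValueError(f"unrecognized extension {ext!r}")
    return SCRIPT_PATH_BY_EXT[ext]


def resolve_script_paths(script_path, files_per_src):
    """Keep explicit values; guess only where the entry is null or ..guess.. ."""
    return [
        guess_script_path(files) if value in GUESS_MARKERS else value
        for value, files in zip(script_path, files_per_src)
    ]


# src is [Single, TYP1]; script_path is [null, cat]
files_per_src = [["s1_L001.fastq.gz", "s1_L002.fastq.gz"], ["s1.typ1"]]
print(resolve_script_paths([None, "cat"], files_per_src))
# ['gzip -cd', 'cat']
```

The sketch also shows why mixed extensions within one source are a problem: with more than one extension there is no single command to choose.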

Requires

  • For the basic mode:
    • A list of files of the following types, either in [<sample>] or in [project_data]:
File types recognized by NeatSeq-Flow
Source Target
Forward fastq.F
Reverse fastq.R
Single fastq.S
Nucleotide fasta.nucl
Protein fasta.prot
SAM sam
BAM bam
REFERENCE reference
VCF vcf
G.VCF g.vcf
GTF gtf
GFF gff
GFF3 gff3
manifest qiime2.manifest
barcodes barcodes
  • For the Advanced mode:
    • Lists of files in any file type, either in [<sample>] or in [project_data].

Output

  • Imported files of the types in the table above are placed in slots according to the types in the 2nd column of the table.

Attention

If you want to do something more complex with the combined files, you can use the pipe parameter to send extra commands to be piped on the files after the main command. This is an experimental feature and should be used with care.

e.g.: You can get files from a remote location by setting script_path to curl and pipe to gzip -cd. This will download the files with curl, decompress them and concatenate them into the target file. In the sample file, specify remote URLs instead of local paths. This will work only for one file per sample.

As of version 1.3.0, pipe can be a list of the same length as src, and it will be treated like the other lists described above.
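
To make the relationship between script_path and pipe concrete, here is a minimal Python sketch that composes a shell line of the form "script_path files | pipe > target". The exact scripts NeatSeq-Flow generates may well differ; the function name build_import_command, the file names and the example.org URL are placeholders chosen for this illustration.

```python
import shlex


def build_import_command(script_path, sources, target, pipe=None):
    """Compose '<script_path> <sources...> [| <pipe>] > <target>' as one shell line."""
    quoted_sources = " ".join(shlex.quote(s) for s in sources)
    command = f"{script_path} {quoted_sources}"
    if pipe:
        command += f" | {pipe}"
    return f"{command} > {shlex.quote(target)}"


# Local gzipped files: decompress and concatenate in one step.
print(build_import_command("gzip -cd", ["a.fq.gz", "b.fq.gz"], "sample1.fq"))

# Remote file: download with curl, then decompress through the pipe.
print(build_import_command(
    "curl", ["https://example.org/reads.fq.gz"], "sample1.fq", pipe="gzip -cd"
))
```

The same shape covers the .zip row of the extension table, where script_path is echo and the real work happens in the piped xargs command.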

Parameters that can be set

Parameter Values Comments
script_path   The shell command to use for merging the source files.
src   A list of source file types as they appear in the sample file.
trg   A list of target file types for the imported files.
scope sample | project The scope at which each of the sources can be found.
ext   The suffix to append to the imported filename.
pipe   Additional commands to be piped on the files before writing to file.

Lines for parameter file

Basic mode, gzipped files:

import1:
    module: Import
    script_path: gzip -cd

Basic mode, remote files:

Import1:
    module: Import
    script_path: curl
    pipe:  gzip -cd

Advanced mode, mixture of types and scopes:

Import1:
    module:         Import
    src:            [UR1,       UR2]
    script_path:    [gzip -cd,  cat]
    scope:          [sample,    project]
    trg:            [unrecog1,  unrecog2]
    ext:            [ur1,       ur2]

Advanced mode, both recognized and unrecognized file types:

Import1:
    module:         Import
    src:            [UR1,       Forward,    Reverse]
    script_path:    [gzip -cd,  null,       null]
    scope:          # Guess!
    trg:            [unrecog1,  null,       null]
    ext:            [ur1,       null,       null]

Advanced mode, same types in samples and project:

Import1:
    module:         Import
    src:            [Nucleotide,    Nucleotide]
    script_path:    [cat,           cat]
    scope:          [sample,        project]
    trg:            
    ext:            

fastqc_html *

Authors: Menachem Sklarz
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running fastqc.

Creates scripts that run fastqc on all available fastq files.

Requires

  • fastq files in one of the following slots:

    • sample_data[<sample>]["fastq.F"]
    • sample_data[<sample>]["fastq.R"]
    • sample_data[<sample>]["fastq.S"]

Output

  • puts fastqc output files in the following slots:

    • sample_data[<sample>]["fastqc_fastq.F_html"]
    • sample_data[<sample>]["fastqc_fastq.R_html"]
    • sample_data[<sample>]["fastqc_fastq.S_html"]
  • puts fastqc zip files in the following slots:

    • sample_data[<sample>]["fastqc_fastq.F_zip"]
    • sample_data[<sample>]["fastqc_fastq.R_zip"]
    • sample_data[<sample>]["fastqc_fastq.S_zip"]

Lines for parameter file

fqc_merge1:
    module: fastqc_html
    base: merge1
    script_path: /path/to/FastQC/fastqc
    qsub_params:
        -pe: shared 15
    redirects:
        --threads: 15

References

Andrews, S., 2010. FastQC: a quality control tool for high throughput sequence data.

trimmo *

Authors: Menachem Sklarz
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running trimmomatic on fastq files

Requires

  • fastq files in at least one of the following slots:

    • sample_data[<sample>]["fastq.F"]
    • sample_data[<sample>]["fastq.R"]
    • sample_data[<sample>]["fastq.S"]

Output

  • puts fastq output files in the following slots:

    • sample_data[<sample>]["fastq.F"|"fastq.R"|"fastq.S"]

Parameters that can be set

Parameter Values Comments
spec_dir path If trimmomatic must be executed within a particular directory, specify that directory here
todo LEADING:20 TRAILING:20 The trimmomatic arguments

Lines for parameter file

trim1:
    module: trimmo
    base: merge1
    script_path: java -jar trimmomatic-0.32.jar
    qsub_params:
        -pe: shared 20
        node: node1
    spec_dir: /path/to/Trimmomatic_dir/
    todo: LEADING:20 TRAILING:20
    redirects:
        -threads: 20

References

Bolger, A.M., Lohse, M. and Usadel, B., 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30(15), pp.2114-2120.

Multiqc *

Authors: Menachem Sklarz
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for preparing a MultiQC report for all samples.

Tip

By default, the module will search for parsable reports in the directories of all the modules in the branch leading to this instance. To search only in the directories of the explicit base steps, specify the bases_only parameter.

Requires

  • No real requirements. Will give a report with information if one of the base steps produces reports that MultiQC can read, e.g. fastqc, bowtie2, samtools etc.

Output

  • puts report dir in the following slot:

    • self.sample_data[<sample>]["Multiqc_report"]

Parameters that can be set

Parameter Values Comments
bases_only   Search directories of explicit base steps only.

Lines for parameter file

firstMultQC:
    module: Multiqc
    base:
        - sam_bwt2_1
        - fqc_trim1
    bases_only:
    script_path: /path/to/multiqc

References

Ewels, P., Magnusson, M., Lundin, S. and Käller, M., 2016. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19), pp.3047-3048.

Cutadapt

Authors: Liron Levin
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

Short Description

A module for running cutadapt on fastq files

Requires

  • fastq files in at least one of the following slots:
    • sample_data[<sample>]["fastq.F"]
    • sample_data[<sample>]["fastq.R"]
    • sample_data[<sample>]["fastq.S"]

Output

  • puts fastq output files in the following slots:
    • sample_data[<sample>]["fastq.F"]
    • sample_data[<sample>]["fastq.R"]
    • sample_data[<sample>]["fastq.S"]

Parameters that can be set

Parameter Values Comments
     

Comments

  • This module was tested on:
    Cutadapt v1.12.1

Lines for parameter file

Step_Name:                       # Name of this step
    module: Cutadapt             # Name of the module used
    base:                        # Name of the step [or list of names] to run after [must be after a merge step]
    script_path:                 # Command for running the Cutadapt script
    paired:                      # Analyse Forward and Reverse reads together.
    Demultiplexing:              # Use to Demultiplex the adaptors, needs to be in the format of name=adaptor_seq
    qsub_params:
        -pe:                     # Number of CPUs to reserve for this analysis
    redirects:
        --too-short-output:      # will replace @ with the location of the sample dir  [e.g. @too_short.fq] 
        -a:                      # Use to trim poly A in SE reads [e.g. "A{100} -A T{100}"]

References

Martin, M., 2011. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17(1), pp.10-12.

Trim_Galore

Authors: Liron Levin
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

Short Description

A module for running Trim Galore on fastq files

Requires

  • fastq files in at least one of the following slots:
    • sample_data[<sample>]["fastq.F"]
    • sample_data[<sample>]["fastq.R"]
    • sample_data[<sample>]["fastq.S"]

Output

  • puts fastq output files in the following slots:
    • sample_data[<sample>]["fastq.F"]
    • sample_data[<sample>]["fastq.R"]
    • sample_data[<sample>]["fastq.S"]
  • puts unpaired fastq output files in the following slots:
    • sample_data[<sample>]["fastq.F.unpaired"]
    • sample_data[<sample>]["fastq.R.unpaired"]

Parameters that can be set

Parameter Values Comments
     

Comments

  • This module was tested on:
    Trim Galore v0.4.2, Cutadapt v1.12.1

Lines for parameter file

Step_Name:                       # Name of this step
    module: Trim_Galore          # Name of the module used
    base:                        # Name of the step [or list of names] to run after [must be after a merge step]
    script_path:                 # Command for running the Trim Galore script
    qsub_params:
        -pe:                     # Number of CPUs to reserve for this analysis
    cutadapt_path:               # Location of cutadapt executable 
    redirects:
        --length:                # Parameters for running Trim Galore
        -q:                      # Parameters for running Trim Galore

fastq_screen

Authors: Menachem Sklarz
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for executing fastq_screen on sequence files.

Input files are specified with the type parameter or taken from the fastq slots, one script per fastq file.

In regular mode, no output files are produced. However, if the --tag parameter is included, the tagged file will be stored in the equivalent fastq.X slot. If the --filter parameter is included, the filtered file will be stored in the equivalent fastq.X slot.

The parameters can be passed through a configuration file specified in the redirected parameters with the --conf parameter.

Alternatively, if you do not specify the configuration file, one will be produced for you. For this, you must include:

  1. A genomes section specifying genome indices to screen against (see examples below) and
  2. an aligner section specifying the aligning program to use and its path.

Additionally, if a --threads parameter is included in the redirects, it will be incorporated into the configuration file.

Attention

If a --bisulfite redirected parameter is included, it should contain the path to Bismark, which will be included in the configuration file.
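
For orientation, the following Python sketch renders a configuration text from genomes, aligner, --threads and --bisulfite values, assuming the usual fastq_screen configuration keywords (an aligner keyword such as BOWTIE2, plus BISMARK, THREADS and DATABASE). The file NeatSeq-Flow actually writes may be laid out differently; the function name build_fastq_screen_conf and all paths are hypothetical.

```python
def build_fastq_screen_conf(aligner, genomes, threads=None, bismark=None):
    """Render fastq_screen configuration text from the module parameters."""
    lines = []
    for name, path in aligner.items():
        lines.append(f"{name.upper()}\t{path}")  # e.g. BOWTIE2 <path to bowtie2>
    if bismark:
        lines.append(f"BISMARK\t{bismark}")      # path passed via --bisulfite
    if threads is not None:
        lines.append(f"THREADS\t{threads}")      # taken from --threads
    for name, index in genomes.items():
        lines.append(f"DATABASE\t{name}\t{index}")
    return "\n".join(lines) + "\n"


conf = build_fastq_screen_conf(
    aligner={"bowtie2": "/path/to/bowtie2"},      # placeholder path
    genomes={
        "Human": "/path/to/indices/human",        # placeholder index prefixes
        "PhiX": "/path/to/indices/phix",
    },
    threads=60,
)
print(conf)
```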

Requires

  • fastq files in at least one of the following slots:

    • sample_data[<sample>]["fastq.F"]
    • sample_data[<sample>]["fastq.R"]
    • sample_data[<sample>]["fastq.S"]

Output

  • If --tag and/or --filter or --nohits are included, puts output fastq files in:

    • sample_data[<sample>]["fastq.F"]
    • sample_data[<sample>]["fastq.R"]
    • sample_data[<sample>]["fastq.S"]

Parameters that can be set

Parameter Values Comments
genomes name: index pairs (see examples) If --conf not provided, genomes to screen against.
aligner name: path single pair If --conf not provided, the aligner to use and its path.

Lines for parameter file

No configuration file:

fastq_screen:
    module:         fastq_screen
    base:           merge1
    script_path:    {Vars.paths.fastq_screen}
    qsub_params:
        -pe:        shared 60
    aligner:
        bowtie2:    {Vars.paths.bowtie2}
    genomes:
        Human:      {Vars.databases.human}
        Mouse:      {Vars.databases.mouse}
        PhiX:       {Vars.databases.phix}
    redirects:
        --filter:   200
        --tag:
        # --nohits:
        --force: 
        --threads:  60 

With configuration file:

fastq_screen:
    module:         fastq_screen
    base:           merge1
    script_path:    {Vars.paths.fastq_screen}
    qsub_params:
        -pe:        shared 60
    redirects:
        --conf:     {Vars.paths.fastq_screen_conf_file}
        --filter:   200
        --tag:
        # --nohits:
        --force: 

References

Wingett, S.W. and Andrews, S., 2018. FastQ Screen: A tool for multi-genome mapping and quality control. F1000Research, 7.