Preparation and QC

Import *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for importing and merging files from the sample file into NeatSeq-Flow.

Files can be imported in three ways:

  1. If there is a single file of a given type (per sample), it can be imported directly, i.e. the existing file path will be used as the source for the workflow.

  2. If there are multiple files of a given type, or you would like to make a local copy of the raw files, the file(s) can be copied and concatenated to the workflow directory.

  3. If the raw files are compressed, importing can include decompression as well as concatenation.

Tip

If you have plenty of disk space, the 2nd and 3rd options are the recommended approach. They ensure that the original files remain untouched, and when the workflow is complete you can discard the copies produced by NeatSeq-Flow.

The Import module can be used in two modes:

The Basic mode

NeatSeq-Flow will attempt to guess all the parameters it requires. Multiple files will be concatenated and stored in the file type index according to the table below. File types not included in the table will be stored in the file type index by the type specified in the sample file.

You have to make sure that all files of each file type have the same extension for NeatSeq-Flow to guess the script_path and pipe parameters.

The Advanced mode

This mode is used when more control over data importing and concatenation is required. It enables full control over which file types are imported, how they are copied and in which slots they are placed in the file type index. It also enables importing file types not recognized by NeatSeq-Flow (see list below).

In this mode, you have to define the following lists: src, trg, script_path, scope and ext. For each file type in the sample file, you should have an entry in the src list. The other lists should apply to the equivalent entry in src. trg is the target file type (in the file type index) for the imported files, script_path is the shell command to use to concatenate the source type files, scope is the scope for which the source type is defined and ext is the suffix to append to the final filenames. Strings are expanded to the length of src list, so if script_path is the same for all source types, it is enough to specify it once.

The Advanced mode is activated by passing the src list. In principle, you should then define the other lists as well, i.e. trg, ext, scope and script_path. However, NeatSeq-Flow will try to guess any list you leave out, based on the lists of recognized file types and extensions.

If some of the file types in src are recognized and some are not, you can pass the lists mentioned above with values for the unrecognized types, leaving null in the positions of the recognized types. These null values will be guessed by NeatSeq-Flow.
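
The expansion rule described above can be sketched in Python. This is only an illustration of the documented behaviour, not NeatSeq-Flow's own code: the function name expand_lists and the ext value "fq" are hypothetical. A string is repeated for every entry in src, and missing values are kept as None so that they can be guessed later.

```python
def expand_lists(src, **lists):
    """Expand each parameter to the length of src.

    A plain string is repeated for every source type, a list must already
    match src in length, and a missing value becomes a list of None
    (placeholders for values that are to be guessed).
    """
    expanded = {}
    for name, value in lists.items():
        if value is None:
            expanded[name] = [None] * len(src)
        elif isinstance(value, str):
            expanded[name] = [value] * len(src)
        elif len(value) == len(src):
            expanded[name] = list(value)
        else:
            raise ValueError(f"{name} has {len(value)} entries, src has {len(src)}")
    return expanded


# Hypothetical parameters: one unrecognized type and two recognized ones.
params = expand_lists(
    ["UR1", "Forward", "Reverse"],
    script_path=["gzip -cd", None, None],
    scope=None,
    trg=["unrecog1", None, None],
    ext="fq",
)
```

Here params["ext"] holds one "fq" per source type, while the None entries in script_path, scope and trg mark the positions left for guessing.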

The advanced mode is experimental, and documentation will hopefully improve as we gain experience with it.

Note

Definition of script_path in the import module

script_path should be a shell program that receives a list of files and writes one single output stream to the standard output. Examples of such programs are cat for text files and gzip -cd for gzipped files. Other types of compressed files should have such a command as well.
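
As a rough illustration of this contract, the following Python sketch runs a script_path command on a list of files and redirects its standard output to a single target file. The helper name merge_files is hypothetical and the scripts NeatSeq-Flow actually generates may look different; the sketch only assumes that the command accepts file names as arguments and writes the merged content to standard output.

```python
import shlex
import subprocess


def merge_files(script_path, sources, target):
    """Run `script_path src1 src2 ... > target` without invoking a shell."""
    # Split the command string (e.g. "gzip -cd") into program and flags,
    # then append the source files as positional arguments.
    command = shlex.split(script_path) + list(sources)
    with open(target, "wb") as out:
        # Whatever the command writes to standard output becomes the target.
        subprocess.run(command, stdout=out, check=True)
```

For example, merge_files("cat", ["a.txt", "b.txt"], "merged.txt") concatenates two text files, and replacing "cat" with "gzip -cd" does the same for gzipped inputs.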

Tip

NeatSeq-Flow attempts to guess the script_path and pipe values based on the input file extensions. For this to work, leave the script_path and pipe lists empty and make sure all files from the same source have the same extensions (e.g. all gzipped files should have .gz as file extension).

If you want NeatSeq-Flow to guess only some of the script_path values, set them to null or to ..guess.., e.g. if src is [Single,TYP1] and script_path is [null,cat], then the script_path for Single will be guessed and the script_path for TYP1 will be set to cat.

Two more options are available for script_path: ..skip.. will skip the type entirely, while ..import.. will import the values from the sample file into the relevant slots without actually producing any scripts. The latter is useful for including entities in the sample file that are not files, e.g. in the qiime2 pipeline you might want to include a semantic type in the sample file.

The following extensions are recognized:

File extensions recognized by NeatSeq-Flow

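=========  ===========  ========================================
Extension  script_path  pipe
=========  ===========  ========================================
.fasta     cat
.faa       cat
.fna       cat
.txt       cat
.tsv       cat
.csv       cat
.fastq     cat
.fa        cat
.fq        cat
.gz        gzip -cd
.zip       echo         'xargs -d " " -I % sh -c "unzip -p %"'
.bz2       bzip2 -cd
.dsrc2     echo         'xargs -d " " -I % sh -c "dsrc2 d -s %"'
.dsrc      echo         'xargs -d " " -I % sh -c "dsrc d -s %"'
=========  ===========  ========================================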
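
The guessing step can be pictured as a lookup on the shared file extension. The Python sketch below is a hedged illustration only: the dictionary holds a few rows of the extension table, and the function name guess_commands and the error messages are hypothetical, not part of NeatSeq-Flow.

```python
import os

# (script_path, pipe) for a few of the recognized extensions listed above.
RECOGNIZED_EXTENSIONS = {
    ".fastq": ("cat", None),
    ".fq": ("cat", None),
    ".fasta": ("cat", None),
    ".gz": ("gzip -cd", None),
    ".zip": ("echo", 'xargs -d " " -I % sh -c "unzip -p %"'),
}


def guess_commands(filenames):
    """Return (script_path, pipe) for files that all share one extension."""
    # Only the last extension counts, so "reads.fastq.gz" is treated as ".gz".
    extensions = {os.path.splitext(name)[1] for name in filenames}
    if len(extensions) != 1:
        raise ValueError(f"files must share one extension, found {sorted(extensions)}")
    extension = extensions.pop()
    if extension not in RECOGNIZED_EXTENSIONS:
        raise ValueError(f"extension {extension!r} is not recognized")
    return RECOGNIZED_EXTENSIONS[extension]
```

The check on mixed extensions mirrors the requirement that all files from the same source carry the same extension; without it, no single command could be guessed.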

Requires

  • For the basic mode:
    • A list of files of the following types, either in [<sample>] or in [project_data]:

File types recognized by NeatSeq-Flow

==========  ===============
Source      Target
==========  ===============
Forward     fastq.F
Reverse     fastq.R
Single      fastq.S
Nucleotide  fasta.nucl
Protein     fasta.prot
SAM         sam
BAM         bam
REFERENCE   reference
VCF         vcf
G.VCF       g.vcf
GTF         gtf
GFF         gff
GFF3        gff3
manifest    qiime2.manifest
barcodes    barcodes
==========  ===============

  • For the Advanced mode:
    • Lists of files in any file type, either in [<sample>] or in [project_data].

Output

  • Imported files of the types in the table above are placed in slots according to the types in the 2nd column of the table.
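
In the basic mode, the slot for each source type follows from the table of recognized file types, and unrecognized types keep the name given in the sample file. A minimal Python sketch of that mapping (the names RECOGNIZED_TYPES and target_slot are hypothetical and used for illustration only):

```python
# Source type in the sample file -> slot in the file type index.
RECOGNIZED_TYPES = {
    "Forward": "fastq.F",
    "Reverse": "fastq.R",
    "Single": "fastq.S",
    "Nucleotide": "fasta.nucl",
    "Protein": "fasta.prot",
    "SAM": "sam",
    "BAM": "bam",
    "REFERENCE": "reference",
    "VCF": "vcf",
    "G.VCF": "g.vcf",
    "GTF": "gtf",
    "GFF": "gff",
    "GFF3": "gff3",
    "manifest": "qiime2.manifest",
    "barcodes": "barcodes",
}


def target_slot(source_type):
    """Return the slot for a source type; unknown types keep their own name."""
    return RECOGNIZED_TYPES.get(source_type, source_type)
```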

Attention

If you want to do something more complex with the combined files, you can use the pipe parameter to send extra commands to be piped on the files after the main command. This is an experimental feature and should be used with care.

For example, you can get files from a remote location by setting script_path to curl and pipe to gzip -cd. This will download the files with curl, decompress them and concatenate them into the target file. In the sample file, specify remote URLs instead of local paths. This will work only for one file per sample.

As of version 1.3.0, pipe can be a list of the same length as src, and it will be treated like the other lists described above.
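
To make the role of pipe concrete, the sketch below assembles a shell command of the form "script_path files | pipe > target". This shape is an assumption based on the description above, not a copy of the scripts NeatSeq-Flow writes; the function name build_import_command and the example URL are hypothetical.

```python
import shlex


def build_import_command(script_path, sources, target, pipe=None):
    """Compose the main command, the optional pipe stage and the redirect."""
    # Quote each source so paths or URLs with special characters stay intact.
    stages = [" ".join([script_path] + [shlex.quote(s) for s in sources])]
    if pipe:
        stages.append(pipe)
    return " | ".join(stages) + " > " + shlex.quote(target)


# Remote gzipped file: download with curl, decompress through the pipe stage.
command = build_import_command(
    "curl",
    ["https://example.org/sample1.fastq.gz"],
    "sample1.fastq",
    pipe="gzip -cd",
)
```

With these inputs, command is the string "curl https://example.org/sample1.fastq.gz | gzip -cd > sample1.fastq".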

Parameters that can be set

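===========  ================  ===============================================================================
Parameter    Values            Comments
===========  ================  ===============================================================================
script_path                    The shell command to use for merging the source files.
src                            A list of source file types as they appear in the sample file.
trg                            A list of target file types for the imported files.
scope        sample | project  The scope at which each of the sources can be found.
ext                            The suffix to append to the imported filename.
pipe                           Additional commands through which the files are piped before writing to file.
===========  ================  ===============================================================================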

Lines for parameter file

Basic mode, gzipped files:

import1:
    module: Import
    script_path: gzip -cd

Basic mode, remote files:

Import1:
    module: Import
    script_path: curl
    pipe:  gzip -cd

Advanced mode, mixture of types and scopes:

Import1:
    module:         Import
    src:            [UR1,       UR2]
    script_path:    [gzip -cd,  cat]
    scope:          [sample,    project]
    trg:            [unrecog1,  unrecog2]
    ext:            [ur1,       ur2]

Advanced mode, both recognized and unrecognized file types:

Import1:
    module:         Import
    src:            [UR1,       Forward,    Reverse]
    script_path:    [gzip -cd,  null,       null]
    scope:          # Guess!
    trg:            [unrecog1,  null,       null]
    ext:            [ur1,       null,       null]

Advanced mode, same types in samples and project:

Import1:
    module:         Import
    src:            [Nucleotide,    Nucleotide]
    script_path:    [cat,           cat]
    scope:          [sample,        project]
    trg:            
    ext:            

fastqc_html *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running fastqc.

Creates scripts that run fastqc on all available fastq files.

Requires

  • fastq files in one of the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

Output

  • puts fastqc output files in the following slots:

    • sample_data[<sample>]["fastqc_fastq.F_html"]

    • sample_data[<sample>]["fastqc_fastq.R_html"]

    • sample_data[<sample>]["fastqc_fastq.S_html"]

  • puts fastqc zip files in the following slots:

    • sample_data[<sample>]["fastqc_fastq.F_zip"]

    • sample_data[<sample>]["fastqc_fastq.R_zip"]

    • sample_data[<sample>]["fastqc_fastq.S_zip"]
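
The output slot names follow a simple pattern derived from the input slot. A short Python sketch of that naming rule (the function fastqc_output_slots is hypothetical; it only reproduces the slot names listed above):

```python
FASTQ_SLOTS = ("fastq.F", "fastq.R", "fastq.S")


def fastqc_output_slots(sample_slots):
    """Map each fastq slot present in a sample to its html and zip slot names."""
    outputs = {}
    for slot in FASTQ_SLOTS:
        if slot in sample_slots:
            outputs[slot] = {
                "html": f"fastqc_{slot}_html",
                "zip": f"fastqc_{slot}_zip",
            }
    return outputs
```

For a paired-end sample holding fastq.F and fastq.R, this yields fastqc_fastq.F_html, fastqc_fastq.F_zip, fastqc_fastq.R_html and fastqc_fastq.R_zip.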

Lines for parameter file

fqc_merge1:
    module: fastqc_html
    base: merge1
    script_path: /path/to/FastQC/fastqc
    qsub_params:
        -pe: shared 15
    redirects:
        --threads: 15

References

Andrews, S., 2010. FastQC: a quality control tool for high throughput sequence data.

trimmo *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running trimmomatic on fastq files

Requires

  • fastq files in at least one of the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

Output

  • puts fastq output files in the following slots:

    • sample_data[<sample>]["fastq.F"|"fastq.R"|"fastq.S"]

Parameters that can be set

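=========  ======================  ============================================================================================
Parameter  Values                  Comments
=========  ======================  ============================================================================================
spec_dir   path                    If trimmomatic must be executed within a particular directory, specify that directory here.
todo       LEADING:20 TRAILING:20  The trimmomatic arguments.
=========  ======================  ============================================================================================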

Lines for parameter file

trim1:
    module: trimmo
    base: merge1
    script_path: java -jar trimmomatic-0.32.jar
    qsub_params:
        -pe: shared 20
        node: node1
    spec_dir: /path/to/Trimmomatic_dir/
    todo: LEADING:20 TRAILING:20
    redirects:
        -threads: 20

References

Bolger, A.M., Lohse, M. and Usadel, B., 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30(15), pp.2114-2120.

Multiqc *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for preparing a MultiQC report for all samples.

Tip

By default, the module will search for parsable reports in the directories of all the modules in the branch leading to this instance. To search only in the directories of the explicit base steps, specify the bases_only parameter.
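
The difference between the two search modes can be sketched as a walk over the workflow's base relationships. This is an illustration under assumptions, not the module's implementation: the function name report_search_steps is hypothetical, and the sketch assumes the instance's own directory is not part of the search.

```python
def report_search_steps(step, bases, bases_only=False):
    """Return the steps whose directories would be searched for reports.

    `bases` maps each step name to the list of its direct base steps.
    """
    if bases_only:
        # Only the bases named explicitly for this instance.
        return list(bases.get(step, []))
    # Otherwise follow the base links all the way up the branch.
    found = []
    stack = list(bases.get(step, []))
    while stack:
        current = stack.pop()
        if current not in found:
            found.append(current)
            stack.extend(bases.get(current, []))
    return found


# Hypothetical workflow layout used only for this example.
workflow = {
    "firstMultQC": ["sam_bwt2_1", "fqc_trim1"],
    "sam_bwt2_1": ["trim1"],
    "fqc_trim1": ["trim1"],
    "trim1": ["merge1"],
    "merge1": [],
}
```

With bases_only set, only sam_bwt2_1 and fqc_trim1 are searched; without it, trim1 and merge1 are included as well.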

Requires

  • No strict requirements. The report will be informative only if at least one of the base steps produces reports that MultiQC can read, e.g. fastqc, bowtie2, samtools etc.

Output

  • puts report dir in the following slot:

    • self.sample_data[<sample>]["Multiqc_report"]

Parameters that can be set

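==========  ======  ================================================
Parameter   Values  Comments
==========  ======  ================================================
bases_only          Search directories of explicit base steps only.
==========  ======  ================================================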

Lines for parameter file

firstMultQC:
    module: Multiqc
    base:
        - sam_bwt2_1
        - fqc_trim1
    bases_only:
    script_path: /path/to/multiqc

References

Ewels, P., Magnusson, M., Lundin, S. and Käller, M., 2016. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19), pp.3047-3048.

Cutadapt

Authors

Liron Levin

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Short Description

A module for running cutadapt on fastq files

Requires

  • fastq files in at least one of the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

Output

  • puts fastq output files in the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

Parameters that can be set

Parameter

Values

Comments

Comments

  • This module was tested on:

    Cutadapt v1.12.1

Lines for parameter file

Step_Name:                       # Name of this step
    module: Cutadapt             # Name of the module used
    base:                        # Name of the step [or list of names] to run after [must be after a merge step]
    script_path:                 # Command for running the Cutadapt script
    paired:                      # Analyse Forward and Reverse reads together.
    Demultiplexing:              # Use to Demultiplex the adaptors, needs to be in the format of name=adaptor_seq
    qsub_params:
        -pe:                     # Number of CPUs to reserve for this analysis
    redirects:
        --too-short-output:      # will replace @ with the location of the sample dir  [e.g. @too_short.fq] 
        -a:                      # Use to trim poly A in SE reads [e.g. "A{100} -A T{100}"]
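
The @ placeholder in --too-short-output can be read as a simple string substitution. The Python sketch below assumes that @ is replaced by the sample directory followed by a path separator; the exact rule used by the module is not stated here, and the function name expand_sample_dir and the directory path are hypothetical.

```python
def expand_sample_dir(value, sample_dir):
    """Replace every '@' in a redirect value with the sample directory path."""
    # Normalize to exactly one trailing slash before substituting.
    prefix = sample_dir.rstrip("/") + "/"
    return value.replace("@", prefix)


# Hypothetical sample directory, for illustration only.
too_short = expand_sample_dir("@too_short.fq", "/data/cutadapt/Sample1")
```

Here too_short is "/data/cutadapt/Sample1/too_short.fq", so each sample gets its own file for discarded reads.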

References

Martin, M., 2011. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal, 17(1), pp.10-12.

Trim_Galore

Authors

Liron Levin

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Short Description

A module for running Trim Galore on fastq files

Requires

  • fastq files in at least one of the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

Output

  • puts fastq output files in the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

  • puts unpaired fastq output files in the following slots:

    • sample_data[<sample>]["fastq.F.unpaired"]

    • sample_data[<sample>]["fastq.R.unpaired"]

Parameters that can be set

Parameter

Values

Comments

Comments

  • This module was tested on:

    Trim Galore v0.4.2

    Cutadapt v1.12.1

Lines for parameter file

Step_Name:                       # Name of this step
    module: Trim_Galore          # Name of the module used
    base:                        # Name of the step [or list of names] to run after [must be after a merge step]
    script_path:                 # Command for running the Trim Galore script
    qsub_params:
        -pe:                     # Number of CPUs to reserve for this analysis
    cutadapt_path:               # Location of cutadapt executable 
    redirects:
        --length:                # Parameters for running Trim Galore
        -q:                      # Parameters for running Trim Galore

References

fastq_screen

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for executing fastq_screen on sequence files.

Input files are specified with the type parameter or taken from the fastq slots, one script per fastq file.

In regular mode, no output files are produced. However, if --tag is included, the tagged file will be stored in the equivalent fastq.X slot. If --filter is included, the filtered file will be stored in the equivalent fastq.X slot.

The parameters can be passed through a configuration file specified in the redirected parameters with the --conf parameter.

Alternatively, if you do not specify the configuration file, one will be produced for you. For this, you must include:

  1. A genomes section specifying genome indices to screen against (see examples below) and

  2. an aligner section specifying the aligning program to use and its path.

Additionally, if a --threads parameter is included in the redirects, it will be incorporated into the configuration file.

Attention

If a --bisulfite redirected parameter is included, it should contain the path to Bismark, which will be included in the configuration file.
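
For orientation, a FastQ Screen configuration file consists of keyword lines such as BOWTIE2, BISMARK, THREADS and DATABASE. The Python sketch below shows how such a file could be assembled from the aligner and genomes sections. It is not the module's code: the function name build_fastq_screen_conf, the line order and all paths are assumptions made for illustration.

```python
def build_fastq_screen_conf(aligner, genomes, threads=None, bismark=None):
    """Assemble configuration text from the aligner and genomes sections."""
    lines = []
    # Aligner keyword is the upper-cased program name, e.g. BOWTIE2.
    for name, path in aligner.items():
        lines.append(f"{name.upper()}\t{path}")
    if bismark:
        lines.append(f"BISMARK\t{bismark}")
    if threads:
        lines.append(f"THREADS\t{threads}")
    # One DATABASE line per genome: label followed by the index location.
    for name, index in genomes.items():
        lines.append(f"DATABASE\t{name}\t{index}")
    return "\n".join(lines) + "\n"


# Hypothetical paths, for illustration only.
conf = build_fastq_screen_conf(
    aligner={"bowtie2": "/opt/bowtie2/bowtie2"},
    genomes={"Human": "/db/human/GRCh38", "PhiX": "/db/phix/phix"},
    threads=60,
)
```

The resulting text contains one BOWTIE2 line, one THREADS line and one DATABASE line per genome, which matches the information that must be supplied when --conf is not given.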

Requires

  • fastq files in at least one of the following slots:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

Output

  • If --tag and/or --filter or --nohits are included, puts output fastq files in:

    • sample_data[<sample>]["fastq.F"]

    • sample_data[<sample>]["fastq.R"]

    • sample_data[<sample>]["fastq.S"]

Parameters that can be set

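=========  ================================  ===================================================
Parameter  Values                            Comments
=========  ================================  ===================================================
genomes    name: index pairs (see examples)  If --conf not provided, genomes to screen against.
aligner    name: path (single pair)          If --conf not provided, path to aligner to use.
=========  ================================  ===================================================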

Lines for parameter file

No configuration file:

fastq_screen:
    module:         fastq_screen
    base:           merge1
    script_path:    {Vars.paths.fastq_screen}
    qsub_params:
        -pe:        shared 60
    aligner:
        bowtie2:    {Vars.paths.bowtie2}
    genomes:
        Human:      {Vars.databases.human}
        Mouse:      {Vars.databases.mouse}
        PhiX:       {Vars.databases.phix}
    redirects:
        --filter:   200
        --tag:
        # --nohits:
        --force: 
        --threads:  60 

With configuration file:

fastq_screen:
    module:         fastq_screen
    base:           merge1
    script_path:    {Vars.paths.fastq_screen}
    qsub_params:
        -pe:        shared 60
    redirects:
        --conf:     {Vars.paths.fastq_screen_conf_file}
        --filter:   200
        --tag:
        # --nohits:
        --force: 

References

Wingett, S.W. and Andrews, S., 2018. FastQ Screen: A tool for multi-genome mapping and quality control. F1000Research, 7.