Preparation and QC¶
Modules included in this section
Import *¶
Authors: | Menachem Sklarz |
---|---|
Affiliation: | Bioinformatics core facility |
Organization: | National Institute of Biotechnology in the Negev, Ben Gurion University. |
A module for importing and merging files from the sample file into NeatSeq-Flow.
Files can be imported in three ways:
- If there is a single file of the type (per sample), it can be imported as is, i.e. the existing file path will be used as the source for the workflow.
- If there are multiple files in the type, or you would like to make a local copy of the raw files, the file(s) can be copied and concatenated to the workflow directory.
- If the raw files are compressed, importing can include decompression as well as concatenation.
Tip
If you have plenty of disk space, the 2nd and 3rd options are the recommended approach. They ensure the original files remain untouched, and when the workflow is complete you can discard the copies produced by NeatSeq-Flow.
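The copy-and-concatenate options above can be pictured with a minimal Python sketch. It is not NeatSeq-Flow's own code (the module generates shell scripts); the function name `import_files` and the file names are hypothetical, and the sketch only mimics the effect of `cat f1 f2 > target` or `gzip -cd f1 f2 > target`.

```python
import gzip
import shutil
from pathlib import Path


def import_files(sources, target):
    """Decompress (if gzipped) and concatenate `sources` into `target`.

    Hypothetical helper: the original files are only read, never modified,
    and the combined copy is written to `target`.
    """
    with open(target, "wb") as out:
        for src in sources:
            # Choose a reader by extension, as the module does for .gz files.
            opener = gzip.open if Path(src).suffix == ".gz" else open
            with opener(src, "rb") as handle:
                shutil.copyfileobj(handle, out)
    return target
```

Because the sources are opened read-only, the copy in the workflow directory can be deleted at the end of the run without affecting the raw data.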
The Import module can be used in two modes:
- The Basic mode
  NeatSeq-Flow will attempt to guess all the parameters it requires. Multiple files will be concatenated and stored in the file type index according to the table below. File types not included in the table will be stored in the file type index by the type specified in the sample file.
  You have to make sure that all files of each file type have the same extension for NeatSeq-Flow to guess the script_path and pipe parameters.
- The Advanced mode
  is used when more control over data importing and concatenation is required. It enables full control over which file types are imported, how they are copied and in which slots they are placed in the file type index. It also enables importing file types not recognized by NeatSeq-Flow (see list below).
  In this mode, you have to define the following lists: src, trg, script_path, scope and ext. For each file type in the sample file, you should have an entry in the src list. The other lists should apply to the equivalent entry in src. trg is the target file type (in the file type index) for the imported files, script_path is the shell command to use to concatenate the source type files, scope is the scope for which the source type is defined and ext is the suffix to append to the final filenames. Strings are expanded to the length of the src list, so if script_path is the same for all source types, it is enough to specify it once.
  Passing the src list selects the Advanced mode, in which the other lists (trg, ext, scope and script_path) are expected as well. NeatSeq-Flow will, however, try to guess them based on its lists of recognized file types and extensions.
  If some of the file types in src are recognized and some are not, you can pass the lists mentioned above with values for the unrecognized types, leaving null in the positions of the recognized types. These null values will be guessed by NeatSeq-Flow.
  The Advanced mode is experimental, and documentation will hopefully improve as we gain experience with it.
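The way a single string is expanded to the length of the src list, and the way null entries mark positions to be guessed, can be sketched as follows. This is an illustration under stated assumptions, not the module's implementation: `expand_lists` is a hypothetical name, and treating an empty parameter as "guess every position" is an assumption.

```python
def expand_lists(src, **params):
    """Broadcast scalar parameters to len(src) and check list lengths.

    A string (or None, i.e. YAML null) is repeated once per source type;
    None entries stand for values that are left to be guessed.
    """
    expanded = {}
    for name, value in params.items():
        if value is None or isinstance(value, str):
            value = [value] * len(src)
        if len(value) != len(src):
            raise ValueError(
                f"{name} has {len(value)} entries but src has {len(src)}"
            )
        expanded[name] = list(value)
    return expanded


lists = expand_lists(
    ["UR1", "Forward", "Reverse"],
    script_path=["gzip -cd", None, None],  # None: to be guessed
    trg=["unrecog1", None, None],
    ext=["ur1", None, None],
    scope=None,                            # assumed to mean: guess all
)
```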
Note
- Definition of script_path in the Import module
  script_path should be a shell program that receives a list of files and writes a single, combined output to standard output. Examples of such programs are cat for text files and gzip -cd for gzipped files. Other types of compressed files should have such a command as well.
Tip
NeatSeq-Flow attempts to guess the script_path and pipe values based on the input file extensions. For this to work, leave the script_path and pipe lists empty and make sure all files from the same source have the same extension (e.g. all gzipped files should have .gz as their file extension).
If you want NeatSeq-Flow to guess only some of the script_path values, set them to null or to ..guess.., e.g. if src is [Single,TYP1] and script_path is [null,cat], then the script_path for Single will be guessed and the script_path for TYP1 will be set to cat.
Two more options are available for script_path: ..skip.. will skip the type entirely, while ..import.. will import the values from the sample file into the relevant slots without actually producing any scripts. This is useful for including entities which are not files in the sample file, e.g. in the qiime2 pipeline you might want to include a semantic type in the sample file.
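The four kinds of script_path entries described in this tip (null or ..guess.., ..skip.., ..import.., and an explicit command) can be summarised in a short sketch. The function `plan_actions` and the action labels are hypothetical; only the special strings come from the text above.

```python
GUESS, SKIP, IMPORT = "..guess..", "..skip..", "..import.."


def plan_actions(src, script_path):
    """Pair each source type with what would happen to it.

    Returns a list of (source_type, action, command) tuples so that a
    source type may appear more than once (e.g. in two scopes).
    """
    plan = []
    for file_type, command in zip(src, script_path):
        if command is None or command == GUESS:
            plan.append((file_type, "guess", None))      # guess from extension
        elif command == SKIP:
            plan.append((file_type, "skip", None))       # ignore this type
        elif command == IMPORT:
            plan.append((file_type, "import", None))     # copy value, no script
        else:
            plan.append((file_type, "run", command))     # use command as given
    return plan
```

With the example from the tip, `plan_actions(["Single", "TYP1"], [None, "cat"])` marks Single for guessing and TYP1 for `cat`.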
The following extensions are recognized:
Extension | script_path | pipe |
---|---|---|
.fasta | cat | |
.faa | cat | |
.fna | cat | |
.txt | cat | |
.tsv | cat | |
.csv | cat | |
.fastq | cat | |
.fa | cat | |
.fq | cat | |
.gz | gzip -cd | |
.zip | echo | 'xargs -d " " -I % sh -c "unzip -p %"' |
.bz2 | bzip -cd | |
.dsrc2 | echo | 'xargs -d " " -I % sh -c "dsrc2 d -s %"' |
.dsrc | echo | 'xargs -d " " -I % sh -c "dsrc d -s %"' |
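A lookup of this kind might be sketched as below. The dictionary is transcribed from the table of recognized extensions; the function `guess_commands` and its error handling are assumptions made for illustration and do not reproduce the module's code. The sketch also shows why all files of one type need the same extension: with mixed extensions there is no single command to choose.

```python
import os

# Transcribed from the table: extension -> (script_path, pipe).
_PLAIN_TEXT = (".fasta", ".faa", ".fna", ".txt", ".tsv", ".csv",
               ".fastq", ".fa", ".fq")
RECOGNIZED = {ext: ("cat", None) for ext in _PLAIN_TEXT}
RECOGNIZED.update({
    ".gz": ("gzip -cd", None),
    ".zip": ("echo", 'xargs -d " " -I % sh -c "unzip -p %"'),
    ".bz2": ("bzip -cd", None),
    ".dsrc2": ("echo", 'xargs -d " " -I % sh -c "dsrc2 d -s %"'),
    ".dsrc": ("echo", 'xargs -d " " -I % sh -c "dsrc d -s %"'),
})


def guess_commands(files):
    """Return (script_path, pipe) for files sharing one recognized extension."""
    extensions = {os.path.splitext(name)[1] for name in files}
    if len(extensions) != 1:
        raise ValueError(f"mixed extensions, cannot guess: {sorted(extensions)}")
    (ext,) = extensions
    if ext not in RECOGNIZED:
        raise KeyError(f"unrecognized extension: {ext}")
    return RECOGNIZED[ext]
```

Note that only the last extension is inspected, so `reads.fastq.gz` is treated as `.gz`.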
Requires¶
- For the basic mode:
  - A list of files of the following types, either in [<sample>] or in [project_data]:
Source | Target |
---|---|
Forward | fastq.F |
Reverse | fastq.R |
Single | fastq.S |
Nucleotide | fasta.nucl |
Protein | fasta.prot |
SAM | sam |
BAM | bam |
REFERENCE | reference |
VCF | vcf |
G.VCF | g.vcf |
GTF | gtf |
GFF | gff |
GFF3 | gff3 |
manifest | qiime2.manifest |
barcodes | barcodes |
- For the Advanced mode:
  - Lists of files in any file type, either in [<sample>] or in [project_data].
Output¶
- Imported files of the types in the table above are placed in slots according to the types in the 2nd column of the table.
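The source-to-target mapping used in the basic mode can be written as a plain dictionary. The sketch below is transcribed from the table above; the helper `target_slot` is a hypothetical name, and its fallback follows the statement that unlisted types are stored under the type given in the sample file.

```python
# Source type in the sample file -> slot in the file type index.
BASIC_TARGETS = {
    "Forward": "fastq.F",
    "Reverse": "fastq.R",
    "Single": "fastq.S",
    "Nucleotide": "fasta.nucl",
    "Protein": "fasta.prot",
    "SAM": "sam",
    "BAM": "bam",
    "REFERENCE": "reference",
    "VCF": "vcf",
    "G.VCF": "g.vcf",
    "GTF": "gtf",
    "GFF": "gff",
    "GFF3": "gff3",
    "manifest": "qiime2.manifest",
    "barcodes": "barcodes",
}


def target_slot(source_type):
    """Return the slot for a source type; unlisted types keep their own name."""
    return BASIC_TARGETS.get(source_type, source_type)
```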
Attention
If you want to do something more complex with the combined files, you can use the pipe parameter to send extra commands to be piped on the files after the main command. This is an experimental feature and should be used with care.
e.g.: You can get files from a remote location by setting script_path to curl and pipe to gzip -cd. This will download the files with curl, decompress them and concatenate them into the target file. In the sample file, specify remote URLs instead of local paths. This will work only for one file per sample.
As of version 1.3.0, pipe can be a list of the same length as src and it will be treated like the other lists described above.
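How script_path, the source files and pipe combine into one shell command can be sketched as follows. The exact layout of the scripts NeatSeq-Flow writes is not shown in this document, so the command shape here is an assumption; `build_import_command` and the URL are hypothetical.

```python
import shlex


def build_import_command(script_path, sources, target, pipe=None):
    """Compose `script_path sources [| pipe] > target` as a shell string."""
    command = " ".join([script_path] + [shlex.quote(s) for s in sources])
    if pipe:
        # The pipe command receives the output of the main command.
        command += f" | {pipe}"
    return f"{command} > {shlex.quote(target)}"


# Remote gzipped file: download with curl, decompress in the pipe.
remote = build_import_command(
    "curl", ["https://example.org/sample1.fq.gz"], "sample1.fq", pipe="gzip -cd"
)
```

Here `remote` is `curl https://example.org/sample1.fq.gz | gzip -cd > sample1.fq`, which matches the one-file-per-sample restriction mentioned above.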
Parameters that can be set¶
Parameter | Values | Comments |
---|---|---|
script_path | | The shell command to use for merging the source files. |
src | | A list of source file types as they appear in the sample file. |
trg | | A list of target file types for the imported files. |
scope | sample or project | The scope at which each of the sources can be found. |
ext | | The suffix to append to the imported filename. |
pipe | | Additional commands to be piped on the files before writing to file. |
Lines for parameter file¶
Basic mode, gzipped files:
import1:
    module: Import
    script_path: gzip -cd
Basic mode, remote files:
Import1:
    module: Import
    script_path: curl
    pipe: gzip -cd
Advanced mode, mixture of types and scopes:
Import1:
    module: Import
    src: [UR1, UR2]
    script_path: [gzip -cd, cat]
    scope: [sample, project]
    trg: [unrecog1, unrecog2]
    ext: [ur1, ur2]
Advanced mode, both recognized and unrecognized file types:
Import1:
    module: Import
    src: [UR1, Forward, Reverse]
    script_path: [gzip -cd, null, null]
    scope: # Guess!
    trg: [unrecog1, null, null]
    ext: [ur1, null, null]
Advanced mode, same types in samples and project:
Import1:
    module: Import
    src: [Nucleotide, Nucleotide]
    script_path: [cat, cat]
    scope: [sample, project]
    trg:
    ext:
fastqc_html *¶
Authors: | Menachem Sklarz |
---|---|
Affiliation: | Bioinformatics core facility |
Organization: | National Institute of Biotechnology in the Negev, Ben Gurion University. |
A module for running fastqc.
Creates scripts that run fastqc on all available fastq files.
Requires¶
fastq files in one of the following slots:
sample_data[<sample>]["fastq.F"]
sample_data[<sample>]["fastq.R"]
sample_data[<sample>]["fastq.S"]
Output¶
puts fastqc output files in the following slots:
sample_data[<sample>]["fastqc_fastq.F_html"]
sample_data[<sample>]["fastqc_fastq.R_html"]
sample_data[<sample>]["fastqc_fastq.S_html"]
puts fastqc zip files in the following slots:
sample_data[<sample>]["fastqc_fastq.F_zip"]
sample_data[<sample>]["fastqc_fastq.R_zip"]
sample_data[<sample>]["fastqc_fastq.S_zip"]
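The output slot names follow a regular pattern built from the input slot name. A small sketch of that pattern, with the hypothetical helper `fastqc_output_slots`, might look like this; it only derives names and does not run fastqc.

```python
FASTQ_SLOTS = ("fastq.F", "fastq.R", "fastq.S")


def fastqc_output_slots(sample_slots):
    """Map each available fastq slot to its (html, zip) output slot names."""
    return {
        slot: (f"fastqc_{slot}_html", f"fastqc_{slot}_zip")
        for slot in FASTQ_SLOTS
        if slot in sample_slots
    }
```

For a paired-end sample holding `fastq.F` and `fastq.R`, this yields the four slot names listed above for those two directions.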
Lines for parameter file¶
fqc_merge1:
    module: fastqc_html
    base: merge1
    script_path: /path/to/FastQC/fastqc
    qsub_params:
        -pe: shared 15
    redirects:
        --threads: 15
References¶
Andrews, S., 2010. FastQC: a quality control tool for high throughput sequence data.
trimmo *¶
Authors: | Menachem Sklarz |
---|---|
Affiliation: | Bioinformatics core facility |
Organization: | National Institute of Biotechnology in the Negev, Ben Gurion University. |
A module for running trimmomatic on fastq files
Requires¶
fastq files in at least one of the following slots:
sample_data[<sample>]["fastq.F"]
sample_data[<sample>]["fastq.R"]
sample_data[<sample>]["fastq.S"]
Output¶
puts fastq output files in the following slots:
sample_data[<sample>]["fastq.F"|"fastq.R"|"fastq.S"]
Parameters that can be set¶
Parameter | Values | Comments |
---|---|---|
spec_dir | path | If trimmomatic must be executed within a particular directory, specify that directory here |
todo | LEADING:20 TRAILING:20 | The trimmomatic arguments |
Lines for parameter file¶
trim1:
    module: trimmo
    base: merge1
    script_path: java -jar trimmomatic-0.32.jar
    qsub_params:
        -pe: shared 20
        node: node1
    spec_dir: /path/to/Trimmomatic_dir/
    todo: LEADING:20 TRAILING:20
    redirects:
        -threads: 20
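To show where the script_path, todo and -threads values from the example end up, the sketch below assembles a Trimmomatic argument list in the tool's general command-line form (mode, options, inputs, outputs, trimming steps). It is an illustration only: the scripts actually written by the trimmo module may differ, and `trimmomatic_argv` and the file names are hypothetical.

```python
import shlex


def trimmomatic_argv(script_path, todo, inputs, outputs, threads=1):
    """Build a Trimmomatic argument list for single- or paired-end input.

    SE takes 1 input and 1 output; PE takes 2 inputs and 4 outputs
    (paired and unpaired files for each direction).
    """
    if len(inputs) == 2 and len(outputs) == 4:
        mode = "PE"
    elif len(inputs) == 1 and len(outputs) == 1:
        mode = "SE"
    else:
        raise ValueError("expected 1 input/1 output (SE) or 2 inputs/4 outputs (PE)")
    return (shlex.split(script_path) + [mode, "-threads", str(threads)]
            + list(inputs) + list(outputs) + todo.split())


argv = trimmomatic_argv(
    "java -jar trimmomatic-0.32.jar", "LEADING:20 TRAILING:20",
    ["sample.fq"], ["sample.trimmed.fq"], threads=20,
)
```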
References¶
Bolger, A.M., Lohse, M. and Usadel, B., 2014. Trimmomatic: a flexible trimmer for Illumina sequence data. Bioinformatics, 30(15), pp.2114-2120.
Multiqc *¶
Authors: | Menachem Sklarz |
---|---|
Affiliation: | Bioinformatics core facility |
Organization: | National Institute of Biotechnology in the Negev, Ben Gurion University. |
A module for preparing a MultiQC report for all samples.
Tip
By default, the module will search for parsable reports in the directories of all the modules in the branch leading to this instance. To search only in the directories of the explicit base steps, specify the bases_only parameter.
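The difference between the default search and bases_only can be illustrated with a small traversal over a step-to-bases mapping. This is a sketch, not the module's code: `report_search_steps` is a hypothetical name, and the dependencies assumed for `trim1` and `merge1` are made up for the example.

```python
def report_search_steps(step, bases, bases_only=False):
    """Return the steps whose directories would be searched for reports.

    `bases` maps each step name to the list of its direct base steps.
    """
    direct = bases.get(step, [])
    if bases_only:
        return sorted(direct)
    found = set()
    stack = list(direct)
    while stack:  # walk up the whole branch leading to `step`
        current = stack.pop()
        if current not in found:
            found.add(current)
            stack.extend(bases.get(current, []))
    return sorted(found)


# Hypothetical workflow graph for illustration.
workflow = {
    "firstMultQC": ["sam_bwt2_1", "fqc_trim1"],
    "sam_bwt2_1": ["trim1"],
    "fqc_trim1": ["trim1"],
    "trim1": ["merge1"],
    "merge1": [],
}
```

With this graph, the default search covers `fqc_trim1`, `merge1`, `sam_bwt2_1` and `trim1`, while `bases_only=True` restricts it to `fqc_trim1` and `sam_bwt2_1`.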
Requires¶
- No real requirements. Will give a report with information if one of the base steps produces reports that MultiQC can read, e.g. fastqc, bowtie2, samtools etc.
Output¶
puts report dir in the following slot:
self.sample_data[<sample>]["Multiqc_report"]
Parameters that can be set¶
Parameter | Values | Comments |
---|---|---|
bases_only | | Search directories of explicit base steps only. |
Lines for parameter file¶
firstMultQC:
    module: Multiqc
    base:
        - sam_bwt2_1
        - fqc_trim1
    bases_only:
    script_path: /path/to/multiqc
References¶
Ewels, P., Magnusson, M., Lundin, S. and Käller, M., 2016. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19), pp.3047-3048.
Cutadapt¶
Authors: | Levin Liron |
---|---|
Affiliation: | Bioinformatics core facility |
Organization: | National Institute of Biotechnology in the Negev, Ben Gurion University. |
Short Description¶
A module for running cutadapt on fastq files
Requires¶
- fastq files in at least one of the following slots:
sample_data[<sample>]["fastq.F"]
sample_data[<sample>]["fastq.R"]
sample_data[<sample>]["fastq.S"]
Output¶
- puts fastq output files in the following slots:
sample_data[<sample>]["fastq.F"]
sample_data[<sample>]["fastq.R"]
sample_data[<sample>]["fastq.S"]
Parameters that can be set¶
Parameter | Values | Comments |
---|---|---|
Comments¶
- This module was tested on:
Cutadapt v1.12.1
Lines for parameter file¶
Step_Name:                        # Name of this step
    module: Cutadapt              # Name of the module used
    base:                         # Name of the step [or list of names] to run after [must be after a merge step]
    script_path:                  # Command for running the Cutadapt script
    paired:                       # Analyse Forward and Reverse reads together.
    Demultiplexing:               # Use to Demultiplex the adaptors, needs to be in the format of name=adaptor_seq
    qsub_params:
        -pe:                      # Number of CPUs to reserve for this analysis
    redirects:
        --too-short-output:       # will replace @ with the location of the sample dir [e.g. @too_short.fq]
        -a:                       # Use to trim poly A in SE reads [e.g. "A{100} -A T{100}"]
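The `@` placeholder mentioned for --too-short-output can be pictured as a simple string substitution. The sketch below assumes the replacement is the sample directory with a trailing path separator; `expand_sample_dir` and the directory name are hypothetical.

```python
import os


def expand_sample_dir(redirects, sample_dir):
    """Replace '@' in string-valued redirects with the sample directory."""
    prefix = os.path.join(sample_dir, "")  # ensure a trailing separator
    return {
        flag: value.replace("@", prefix) if isinstance(value, str) else value
        for flag, value in redirects.items()
    }


expanded = expand_sample_dir(
    {"--too-short-output": "@too_short.fq", "-a": "A{100}"},
    "/path/to/workdir/Sample1",
)
```

Values without an `@`, such as the adapter passed to -a, are left unchanged.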
References¶
Martin, Marcel. “Cutadapt removes adapter sequences from high-throughput sequencing reads.” EMBnet.journal 17.1 (2011): pp. 10-12.
Trim_Galore¶
Authors: | Liron Levin |
---|---|
Affiliation: | Bioinformatics core facility |
Organization: | National Institute of Biotechnology in the Negev, Ben Gurion University. |
Short Description¶
A module for running Trim Galore on fastq files
Requires¶
- fastq files in at least one of the following slots:
sample_data[<sample>]["fastq.F"]
sample_data[<sample>]["fastq.R"]
sample_data[<sample>]["fastq.S"]
Output¶
- puts fastq output files in the following slots:
sample_data[<sample>]["fastq.F"]
sample_data[<sample>]["fastq.R"]
sample_data[<sample>]["fastq.S"]
- puts unpaired fastq output files in the following slots:
sample_data[<sample>]["fastq.F.unpaired"]
sample_data[<sample>]["fastq.R.unpaired"]
Parameters that can be set¶
Parameter | Values | Comments |
---|---|---|
Comments¶
- This module was tested on:
Trim Galore v0.4.2
Cutadapt v1.12.1
Lines for parameter file¶
Step_Name:                        # Name of this step
    module: Trim_Galore           # Name of the module used
    base:                         # Name of the step [or list of names] to run after [must be after a merge step]
    script_path:                  # Command for running the Trim Galore script
    qsub_params:
        -pe:                      # Number of CPUs to reserve for this analysis
    cutadapt_path:                # Location of cutadapt executable
    redirects:
        --length:                 # Parameters for running Trim Galore
        -q:                       # Parameters for running Trim Galore
References¶
- Cutadapt:
  - Martin, Marcel. “Cutadapt removes adapter sequences from high-throughput sequencing reads.” EMBnet.journal 17.1 (2011): pp. 10-12.
- Trim Galore:
  - Krueger F: Trim Galore. [http://www.bioinformatics.babraham.ac.uk/projects/]
fastq_screen¶
Authors: | Menachem Sklarz |
---|---|
Affiliation: | Bioinformatics core facility |
Organization: | National Institute of Biotechnology in the Negev, Ben Gurion University. |
A module for executing fastq_screen on sequence files.
Input files are specified with the type parameter or taken from the fastq slots, one script per fastq file.
In regular mode, no output files are produced. However, if the --tag parameter is included, the tagged file will be stored in the equivalent fastq.X slot.
If a --filter parameter is included, the filtered file will be stored in the equivalent fastq.X slot.
The parameters can be passed through a configuration file specified in the redirected parameters with the --conf parameter.
Alternatively, if you do not specify the configuration file, one will be produced for you. For this, you must include:
- A genomes section specifying genome indices to screen against (see examples below) and
- an aligner section specifying the aligning program to use and its path.
Additionally, if a --threads parameter is included in the redirects, it will be incorporated into the configuration file.
Attention
If a --bisulfite redirected parameter is included, it should contain the path to Bismark, which will be included in the configuration file.
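As a rough picture of the configuration file that is produced when --conf is not given, the sketch below renders the genomes, aligner, --threads and --bisulfite values in the keyword style used by FastQ Screen configuration files (BOWTIE2, BISMARK, THREADS, DATABASE). The function `make_fastq_screen_conf` and the paths are hypothetical, and the file the module actually writes may be laid out differently.

```python
def make_fastq_screen_conf(aligner, genomes, threads=None, bismark=None):
    """Render a FastQ Screen style configuration as a string.

    aligner: {"bowtie2": "/path/to/bowtie2"} (a single name: path pair)
    genomes: {"Human": "/path/to/index", ...}
    """
    lines = [f"{name.upper()}\t{path}" for name, path in aligner.items()]
    if bismark:
        lines.append(f"BISMARK\t{bismark}")
    if threads:
        lines.append(f"THREADS\t{threads}")
    for name, index in genomes.items():
        lines.append(f"DATABASE\t{name}\t{index}")
    return "\n".join(lines) + "\n"


conf = make_fastq_screen_conf(
    {"bowtie2": "/path/to/bowtie2"},
    {"Human": "/path/to/human_index", "PhiX": "/path/to/phix_index"},
    threads=60,
)
```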
Requires¶
fastq files in at least one of the following slots:
sample_data[<sample>]["fastq.F"]
sample_data[<sample>]["fastq.R"]
sample_data[<sample>]["fastq.S"]
Output¶
If --tag and/or --filter or --nohits are included, puts output fastq files in:
sample_data[<sample>]["fastq.F"]
sample_data[<sample>]["fastq.R"]
sample_data[<sample>]["fastq.S"]
Parameters that can be set¶
Parameter | Values | Comments |
---|---|---|
genomes | name: index pairs (see examples) | If --conf not provided, genomes to screen against. |
aligner | single name: path pair | If --conf not provided, path to aligner to use. |
Lines for parameter file¶
No configuration file:
fastq_screen:
    module: fastq_screen
    base: merge1
    script_path: {Vars.paths.fastq_screen}
    qsub_params:
        -pe: shared 60
    aligner:
        bowtie2: {Vars.paths.bowtie2}
    genomes:
        Human: {Vars.databases.human}
        Mouse: {Vars.databases.mouse}
        PhiX: {Vars.databases.phix}
    redirects:
        --filter: 200
        --tag:
        # --nohits:
        --force:
        --threads: 60
With configuration file:
fastq_screen:
    module: fastq_screen
    base: merge1
    script_path: {Vars.paths.fastq_screen}
    qsub_params:
        -pe: shared 60
    redirects:
        --conf: {Vars.paths.fastq_screen_conf_file}
        --filter: 200
        --tag:
        # --nohits:
        --force:
References¶
Wingett, S.W. and Andrews, S., 2018. FastQ Screen: A tool for multi-genome mapping and quality control. F1000Research, 7.