Miscellaneous Modules
manage_types
- Authors: Menachem Sklarz
- Affiliation: Bioinformatics core facility
- Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.
A module for managing file types without creating scripts.
It supports adding, deleting, copying and moving file types.
Requires
Output
Parameters that can be set
Parameter | Values | Comments |
---|---|---|
operation | add, del, mv, cp | The operation to perform on the file type dictionary |
scope | project, sample | The scope on which to perform the operation. For ‘mv’ and ‘cp’ this is the source scope |
type | | The type on which to perform the operation. For ‘mv’ and ‘cp’ this is the source type |
scope_trg | project, sample | The destination scope for ‘mv’ and ‘cp’ operations |
type_trg | | The destination type for ‘mv’ and ‘cp’ operations |
path | | For ‘add’ operation, the value to insert in the file type. |
Attention
The operations do NOT operate on the actual files! They only modify the internal file type index.
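To make the distinction concrete, here is a minimal Python sketch that treats the file type index as a nested dictionary and applies the four operations to it. The function name manage_type, its arguments and the example paths are hypothetical; this illustrates the idea and is not the module's actual code. Only dictionary entries change, and nothing is read from or written to disk.

```python
def manage_type(index, operation, scope, type_name,
                scope_trg=None, type_trg=None, path=None):
    """Apply one operation to a nested {scope: {type: path}} index.

    Hypothetical helper: 'scope' and 'scope_trg' are keys of the index
    (e.g. "project_data" or a sample name). No file is touched.
    """
    source = index[scope]
    if operation == "add":
        source[type_name] = path
    elif operation == "del":
        del source[type_name]
    elif operation in ("mv", "cp"):
        # 'mv' removes the source entry, 'cp' leaves it in place
        value = source.pop(type_name) if operation == "mv" else source[type_name]
        index[scope_trg][type_trg] = value
    else:
        raise ValueError(f"Unknown operation: {operation}")
    return index


# Illustrative index with made-up paths
index = {
    "project_data": {"fasta.nucl": "/path/to/contigs.fasta"},
    "sample1": {},
}
manage_type(index, "mv", "project_data", "fasta.nucl",
            scope_trg="sample1", type_trg="transcripts.nucl")
manage_type(index, "add", "project_data", "bam", path="/path/to/mapping.bam")
```

After the two calls, the fasta entry is registered under sample1 as transcripts.nucl and the project holds a new bam entry, while the files themselves stay where they were.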
Tip
You can combine several operations in one module instance by passing lists to the parameters in the table above. Each list must have the same length as the others, or a length of 1 (i.e. a plain string). Plain strings are applied to all operations. For example, to delete one file type and add another, both at the project scope, pass [del,add] to the ‘operation’ parameter and ‘project’ to the ‘scope’ parameter. The ‘path’ parameter can also be a plain string: it is passed to ‘del’ as well, which ignores it. See the example lines below.
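The list rule in this tip can be sketched in Python as follows. The helper broadcast_params is hypothetical and only mirrors the rule as described (lists of equal length, plain values repeated for every operation); the module's own implementation may differ.

```python
def broadcast_params(params):
    """Expand plain values so that every parameter has one value per operation.

    Hypothetical helper illustrating the rule in the tip, not the module's code.
    """
    as_lists = {key: value if isinstance(value, list) else [value]
                for key, value in params.items()}
    n_ops = max(len(values) for values in as_lists.values())
    expanded = {}
    for key, values in as_lists.items():
        if len(values) == 1:
            values = values * n_ops          # repeat a plain value for all operations
        elif len(values) != n_ops:
            raise ValueError(
                f"'{key}' has {len(values)} values, expected 1 or {n_ops}")
        expanded[key] = values
    # One dictionary per operation, in the order given
    return [dict(zip(expanded, combo)) for combo in zip(*expanded.values())]


ops = broadcast_params({
    "operation": ["del", "add"],
    "scope": "project",
    "type": ["fasta.nucl", "bam"],
    "path": "/path/to/mapping.bam",
})
```

Here ops holds two operations: a ‘del’ and an ‘add’, both at the project scope and both carrying the same ‘path’ value, which only ‘add’ would use.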
Lines for parameter file
    manage_types1:
        module:         manage_types
        base:           STAR_bld_ind
        script_path:
        scope:          project
        operation:      mv
        type:           trinity.contigs
        type_trg:       trinity.contigs
        scope_trg:      sample

    manage_types1:
        module:         manage_types
        base:           trinity1
        script_path:
        scope:
            - project
            - sample
            - sample
            - project
        operation:
            - mv
            - del
            - cp
            - add
        type:
            - fasta.nucl
            - fasta.nucl
            - fastq.F
            - bam
        type_trg:       [transcripts.nucl, None, fastq.main, None]
        scope_trg:      sample
        path:           /path/to/mapping.bam
merge_table
- Authors: Menachem Sklarz
- Affiliation: Bioinformatics core facility
- Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.
A module for merging sample tables into a single project-wide table, or into group tables by category.
The tables can be with or without a header line.
The module can be used for merging fasta and fastq files as well.
Important
When merging by category, the sample names will be set to the category level names for all subsequent steps.
Tip
You can merge several types at once by passing them as a list to type. If the type files have different numbers of header lines, pass a list with the number of header lines for each type with header. The header list must be of length 1 or identical to the length of type.
The extension of the resulting file will be the same as that of the files being merged, if they all share the same extension. If they do not, no extension is added. To change the default behaviour, set an ext parameter with the extension to use, e.g. fna. When several types are being merged and ext is a string, that string is used for all types. For a different ext for each file type, use a list of strings, in the same order as the type parameter.
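The extension rule described above can be sketched in Python like this. The function merged_extension is a hypothetical helper, and taking only the last suffix of each file name is an assumption; the module may determine extensions differently.

```python
from pathlib import Path


def merged_extension(filenames, ext=None):
    """Pick the extension for a merged file (hypothetical helper).

    If 'ext' is given it wins; otherwise the shared extension of the
    inputs is used, or no extension when the inputs disagree.
    """
    if ext is not None:
        return "." + ext.lstrip(".")
    suffixes = {Path(name).suffix for name in filenames}
    return suffixes.pop() if len(suffixes) == 1 else ""


same = merged_extension(["sample1.blast.tsv", "sample2.blast.tsv"])
mixed = merged_extension(["sample1.blast.tsv", "sample2.blast.txt"])
forced = merged_extension(["sample1.fasta", "sample2.fa"], ext="fna")
```

With these made-up file names, the first call keeps the shared extension, the second yields no extension, and the third uses the value passed as ext.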
Attention
If you split sample-scope fasta files with the fasta_splitter or split_fasta modules, the new subsamples are stored with a source category containing the name of the sample from which the subsample was produced. When merging back into the sample scope, use scope: group and category: source.
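As an illustration of merging by category, the following Python sketch groups subsamples by the value of their source category, so that each group corresponds to one original sample. The helper name and the subsample names are hypothetical; how the module stores category values internally is not shown in this document.

```python
from collections import defaultdict


def group_samples_by_category(sample_to_level):
    """Group sample names by their category level (hypothetical helper)."""
    groups = defaultdict(list)
    for sample, level in sample_to_level.items():
        groups[level].append(sample)
    return dict(groups)


# Made-up subsample names mapped to their 'source' category value
source_category = {
    "sampleA.part1": "sampleA",
    "sampleA.part2": "sampleA",
    "sampleB.part1": "sampleB",
}
groups = group_samples_by_category(source_category)
```

Each key of groups is a category level, which becomes the sample name after a merge with scope: group.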
Requires
A table file in any slot:
sample_data[<sample>][<file.type>]
Output
Puts output files in the following slot:
sample_data["project_data"][<file.type>]
Or, for merging by category, in the following slot:
sample_data[category_level][<file.type>]
Parameters that can be set
Parameter | Values | Comments |
---|---|---|
type | | A file type that exists in all samples. Can also be a list of types, each one of which will be merged independently |
script_path | | Leave blank |
scope | project, group | Merge all samples into one project table, or merge sample tables by category. |
category | | If |
header | 0 | The number of header lines each table has. The header will be used for the complete table and all other headers will be removed. If there is no header line, set to 0 or leave out completely. If the parameter is set without a value, it defaults to 1! |
ext | | The extension to use for the merged file. If |
add_filename | | If set, the source filename will be appended to each line in the resulting table. |
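The header behaviour can be illustrated with a short Python sketch: the header lines of the first table are kept and the same number of lines is dropped from every other table. The function merge_tables is hypothetical and works on lists of lines rather than files; it is a sketch of the rule, not the module's implementation.

```python
def merge_tables(tables, header=0):
    """Concatenate tables, keeping only the first table's header lines.

    'tables' is a list of tables, each given as a list of lines.
    Hypothetical helper for illustration.
    """
    merged = []
    for position, lines in enumerate(tables):
        # Keep everything from the first table, skip headers in the rest
        merged.extend(lines if position == 0 else lines[header:])
    return merged


table1 = ["id\tscore", "a\t1"]
table2 = ["id\tscore", "b\t2"]
with_header = merge_tables([table1, table2], header=1)
no_header = merge_tables([table1, table2], header=0)
```

With header=0 every line of every table is kept, which is the right choice for tables without a header line.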
Lines for parameter file
Merge sample-scope tables into single project-scope table:
    merge_blast_tables:
        module:         merge_table
        base:           merge1
        script_path:
        scope:          project
        type:           blast.prot
        header:         0
Merge sample-scope tables into group-scope table, by category country:
    merge_blast_tables:
        module:         merge_table
        base:           merge1
        script_path:
        scope:          group
        category:       country
        type:           blast.prot
        header:         0
split_fasta
- Authors: Menachem Sklarz
- Affiliation: Bioinformatics core facility
- Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.
A module for splitting fasta files into parts.
Convenient for parallelizing processes on the cluster. You can take a project-wide fasta file (such as a transcriptome), split it into sub-fasta files, and run various processes on the sub-files.
The parts can then be combined with the merge_table module, which can concatenate any type of file.
Important
When splitting sample-scope fasta files, the subsamples are stored with a source category set to the original sample name. You can use this for merging results at the sample scope downstream.
See documentation for merge_table.
Requires
A fasta file in one of the following slots (scope = “project”):
sample_data["project_data"]["fasta.nucl"]
sample_data["project_data"]["fasta.prot"]
A fasta file in one of the following slots (scope = “sample”):
sample_data[<sample>]["fasta.nucl"]
sample_data[<sample>]["fasta.prot"]
Output
Puts output files in the following slots:
sample_data[<sample>]["fasta.nucl"]
sample_data[<sample>]["fasta.prot"]
For sample scope, the original sample list will be overridden with the new sample list.
Parameters that can be set
Parameter | Values | Comments |
---|---|---|
type | nucl, prot | The type of fasta file to split |
subsample_num | | Number of fragments |
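To show what splitting into a fixed number of fragments means, here is a minimal Python sketch that distributes fasta records over subsample_num parts. The function name is hypothetical and the round-robin distribution is an assumption made for the example; the module may divide the records differently.

```python
def split_fasta_lines(lines, subsample_num):
    """Distribute fasta records over 'subsample_num' parts, round-robin.

    Hypothetical helper: 'lines' is a list of fasta lines without newlines.
    """
    records = []
    for line in lines:
        if line.startswith(">"):
            records.append([line])        # a header line starts a new record
        elif records:
            records[-1].append(line)      # sequence line of the current record
    parts = [[] for _ in range(subsample_num)]
    for position, record in enumerate(records):
        parts[position % subsample_num].extend(record)
    return parts


fasta = [">seq1", "ACGT", ">seq2", "GGCC", "TTAA", ">seq3", "AAAA"]
parts = split_fasta_lines(fasta, 2)
```

Records are never broken across parts, so each part is itself a valid fasta file that can be processed independently.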
Lines for parameter file
    split_fasta1:
        module:         split_fasta
        base:           Trinity1
        script_path:
        type:           nucl
        subsample_num:  4
fasta_splitter
- Authors: Menachem Sklarz
- Affiliation: Bioinformatics core facility
- Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.
A module for splitting fasta files into parts, using fasta-splitter.pl.
Convenient for parallelizing processes on the cluster. You can take a project-wide fasta file (such as a transcriptome), split it into sub-fasta files, and run various processes on the sub-files.
The parts can then be combined with the merge_table module, which can concatenate any type of file.
Attention
The module ships with fasta-splitter.pl version 0.2.6, 2017-08-01.
Leave script_path empty to use the perl script provided. Perl must be in the path!
To use a different version, supply it via script_path.
Usage:
    Usage: fasta-splitter [options] <file>...
    Options:
        --n-parts <N>             - Divide into <N> parts
        --part-size <N>           - Divide into parts of size <N>
        --measure (all|seq|count) - Specify whether all data, sequence length, or
                                    number of sequences is used for determining part
                                    sizes ('all' by default).
        --line-length             - Set output sequence line length, 0 for single line
                                    (default: 60).
        --eol (dos|mac|unix)      - Choose end-of-line character ('unix' by default).
        --part-num-prefix T       - Put T before part number in file names (def.: .part-)
        --out-dir                 - Specify output directory.
        --nopad                   - Don't pad part numbers with 0.
        --version                 - Show version.
        --help                    - Show help.
You can’t use the --part-size method, since it results in an unknown number of files, which is not supported in Neat-Seq Flow.
Please do not use the --nopad parameter. There is no reason to…
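The following Python sketch shows one way a command line could be assembled from redirected options while rejecting --part-size, in line with the restriction above. The function build_splitter_command, the default script location and the input file name are hypothetical; only the flags themselves come from the usage text. This is not how the module necessarily builds its command.

```python
def build_splitter_command(fasta_file, redirects, script="fasta-splitter.pl"):
    """Assemble an argument list for fasta-splitter.pl (hypothetical helper)."""
    if "--part-size" in redirects:
        # An unknown number of output files cannot be handled downstream
        raise ValueError("--part-size is not supported; use --n-parts instead")
    command = ["perl", script]
    for flag, value in redirects.items():
        command.append(flag)
        if value is not None:             # flags without a value are passed alone
            command.append(str(value))
    command.append(fasta_file)
    return command


command = build_splitter_command(
    "transcripts.fasta",                  # made-up input file name
    {"--n-parts": 4, "--measure": "seq"},
)
```

Building the command as a list avoids quoting problems if it is later passed to something like subprocess.run.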
Important
When splitting sample-scope fasta files, the subsamples are stored with a source category set to the original sample name. You can use this for merging results at the sample scope downstream.
See documentation for merge_table.
Requires
A fasta file in one of the following slots (scope = “project”):
sample_data["project_data"]["fasta.nucl"]
sample_data["project_data"]["fasta.prot"]
A fasta file in one of the following slots (scope = “sample”):
sample_data[<sample>]["fasta.nucl"]
sample_data[<sample>]["fasta.prot"]
Output
Puts output files in the following slots:
sample_data[<sample>]["fasta.nucl"]
sample_data[<sample>]["fasta.prot"]
For sample scope, the original sample list will be overridden with the new sample list.
Parameters that can be set
Parameter | Values | Comments |
---|---|---|
 | nucl, prot | The type of fasta file to split |
 | | Number of fragments |
Lines for parameter file
    split_fasta1:
        module:         fasta_splitter
        base:           Trinity1
        script_path:
        type:           nucl
        redirects:
            --n-parts:      4
            --measure:      seq
ProjectToSample
- Authors: Menachem Sklarz
- Affiliation: Bioinformatics core facility
- Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.
A utility module for moving project data to a sample, and back again. It is useful when a module that works on sample data has to be executed on data in the project scope.
For instance, in the STAR 2 pass pipeline, the first stage involves aligning all reads to the reference in order to find splice junctions.
The reads can be merged into project-scope fastq.F and fastq.R slots, but all aligners take their reads from the sample scope!
This module overrides the sample list with a single sample containing the project slots (or a subset of the slots). The mapping modules will then take the project-wide reads from the sample representing the project.
The old sample list is recovered by setting the direction parameter to smp2proj.
See the STAR2pass workflow for a working example.
Usually, the module should be called twice: once in the proj2smp direction and then in the smp2proj direction.
Although it is possible to use the smp2proj direction to move data from sample sample_name to the project, it is better to do this operation with the manage_types module.
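The round trip described above can be sketched in Python on a nested dictionary. Both functions, the way the old sample list is kept aside, and all file names are hypothetical; the sketch only illustrates the bookkeeping and does not reproduce the module's code.

```python
def proj2smp(sample_data, types, sample_name, operation="mv"):
    """Replace the sample list with one sample holding project slots.

    Returns the new index and the old samples, kept aside for later.
    """
    project = dict(sample_data["project_data"])
    old_samples = {name: slots for name, slots in sample_data.items()
                   if name != "project_data"}
    new_sample = {}
    for file_type in types:
        new_sample[file_type] = (project.pop(file_type) if operation == "mv"
                                 else project[file_type])
    return {"project_data": project, sample_name: new_sample}, old_samples


def smp2proj(sample_data, types, sample_name, old_samples, operation="mv"):
    """Move or copy slots back to the project and restore the old samples."""
    project = dict(sample_data["project_data"])
    sample = dict(sample_data[sample_name])
    for file_type in types:
        project[file_type] = (sample.pop(file_type) if operation == "mv"
                              else sample[file_type])
    return {"project_data": project, **old_samples}


# Made-up index: merged reads at the project scope, two original samples
data = {
    "project_data": {"fastq.F": "all_F.fq", "fastq.R": "all_R.fq"},
    "s1": {"fastq.F": "s1_F.fq"},
    "s2": {"fastq.F": "s2_F.fq"},
}
as_sample, saved = proj2smp(data, ["fastq.F", "fastq.R"], "fromproj")
as_sample["fromproj"]["SJ.out.tab"] = "SJ.out.tab"   # e.g. produced by a mapper
restored = smp2proj(as_sample, ["SJ.out.tab"], "fromproj", saved)
```

After the second call, the original samples s1 and s2 are back and the project holds the SJ.out.tab slot produced while the project was represented as a sample.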
Requires
Output
Parameters that can be set
Parameter | Values | Comments |
---|---|---|
direction | proj2smp, smp2proj | Move project info to sample or vice versa |
type | | The types to operate on |
operation | cp, mv | Whether to move the slots or just copy them. |
sample_name | | The name of the new sample to create or the sample to copy from. Defaults to project title |
Attention
This module does NOT operate on the actual files! It only modifies the internal file type index.
Lines for parameter file
Moving from project to sample:
    ProjectToSample:
        module:         ProjectToSample
        base:           merge_table
        script_path:
        direction:      proj2smp
        # sample_name:  fromproj
        operation:      mv   # mv or cp
        type:           [fastq.F,fastq.R]
Moving from sample to project:
    SampleToProject:
        module:         ProjectToSample
        base:           STAR_map_proj
        script_path:
        direction:      smp2proj
        operation:      mv   # mv or cp
        type:           SJ.out.tab
Copying and moving from sample to project: (Just for the example. Isn’t necessarily practical)
    SampleToProject:
        module:         ProjectToSample
        base:           STAR_map_proj
        script_path:
        direction:      smp2proj
        operation:      [cp, mv, mv]   # mv or cp
        type:           [SJ.out.tab, fastq.F, fastq.R]