Miscellaneous Modules

Modules included in this section

manage_types ^*
merge_table
split_fasta
fasta_splitter
ProjectToSample ^*

`manage_types` ^*

Authors: Menachem Sklarz
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for managing file types without creating scripts.

Supports adding, deleting, copying and moving file types.

Requires

Output

Parameters that can be set

| Parameter | Values | Comments |
|---|---|---|
| operation | add\|del\|mv\|cp | The operation to perform on the file type dictionary |
| scope | project\|sample | The scope on which to perform the operation. For ‘mv’ and ‘cp’ this is the source scope |
| type | | The type on which to perform the operation. For ‘mv’ and ‘cp’ this is the source type |
| scope_trg | project\|sample | The destination scope for ‘mv’ and ‘cp’ operations |
| type_trg | | The destination type for ‘mv’ and ‘cp’ operations |
| path | | For the ‘add’ operation, the value to insert in the file type |

Attention

The operations do NOT operate on the actual files! They only modify the internal file type index.

Tip

You can combine several operations in one module instance by passing lists to the parameters in the table above. All lists must be of the same length; a parameter may also be a plain string (equivalent to a list of length 1), in which case its value is applied to all operations. For example, to delete one file type and add another, both at the project scope, pass [del,add] to the ‘operation’ parameter and ‘project’ to the ‘scope’ parameter. The ‘path’ can also be a plain string: it is passed to ‘del’ as well, which ignores it. See example lines below.

Lines for parameter file

manage_types1:
    module:             manage_types
    base:               STAR_bld_ind
    script_path:        
    scope:              project
    operation:          mv
    type:               trinity.contigs
    type_trg:           trinity.contigs
    scope_trg:          sample

manage_types1:
    module:             manage_types
    base:               trinity1
    script_path:        
    scope:              - project
                        - sample
                        - sample
                        - project
    operation:          - mv
                        - del
                        - cp
                        - add
    type:               - fasta.nucl
                        - fasta.nucl
                        - fastq.F
                        - bam
    type_trg:           [transcripts.nucl, None ,fastq.main, None]
    scope_trg:          sample
    path:               /path/to/mapping.bam   
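
The case described in the tip above (deleting one file type and adding another, both at the project scope) might look as follows. This is only a sketch: the step name `manage_types_del_add`, the base step, and the type names are illustrative assumptions, and the path is a placeholder.

```yaml
# Hypothetical step: delete one project-scope type and add another
manage_types_del_add:
    module:             manage_types
    base:               trinity1               # placeholder base step
    script_path:
    operation:          [del, add]             # two operations in one instance
    scope:              project                # plain string: applied to both operations
    type:               [fasta.nucl, bam]      # del acts on fasta.nucl, add acts on bam
    path:               /path/to/mapping.bam   # used by add, ignored by del
```

Because `scope` and `path` are plain strings, they do not need to be repeated for each operation.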

`merge_table`

Authors: Menachem Sklarz
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for merging sample tables into a single project-wide table, or into group tables by category.

The tables can be with or without header lines.

Can be used for merging fasta and fastq files as well.

Important

When merging by category, the sample names will be set to the category level names for all subsequent steps.

Tip

You can merge several types at once by passing them as a list to type. If the files of the different types have different numbers of header lines, pass a list of header line counts to header. The header list must be of length 1 or of the same length as type.

The extension of the resulting file will be the same as that of the files being merged, if they all share the same extension. If they do not, no extension is added. To change the default behaviour, set an ext parameter with the extension to use, e.g. fna. When several types are being merged and ext is a string, the string will be used for all types. To use a different ext for each file type, pass a list of strings in the same order as the type parameter.

Attention

If you split sample-scope fasta files with the fasta_splitter or split_fasta modules, the new subsamples are stored with a source category, which contains the name of the sample from which each subsample was produced. When merging back into the sample scope, use scope: group and category: source.
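
As a sketch of merging results back into the sample scope after a split, the instance below groups subsample tables by the source category. The step name `merge_split_results` and the base step `blast_on_parts` are hypothetical placeholders for whatever step produced the per-subsample tables.

```yaml
# Hypothetical step: merge per-subsample tables back to the original samples
merge_split_results:
    module:         merge_table
    base:           blast_on_parts    # placeholder: the step that ran on the subsamples
    script_path:
    scope:          group
    category:       source            # category set by split_fasta / fasta_splitter
    type:           blast.prot
    header:         0
```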

Requires

A table file in any slot:
- sample_data[<sample>][<file.type>]

Output

Puts output files in the following slot:
- sample_data["project_data"][<file.type>]
Or, for merging by category, in the following slot:
- sample_data[category_level][<file.type>]

Parameters that can be set

| Parameter | Values | Comments |
|---|---|---|
| type | | A file type that exists in all samples. Can also be a list of types, each of which will be merged independently |
| script_path | | Leave blank |
| scope | project\|group | Merge all samples into one project table, or merge sample tables by category |
| category | | If `scope` is set to `group`, you must specify the category by which to divide the samples for merging. The category must be a string containing one of the categories (columns) in the mapping file |
| header | 0 | The number of header lines each table has. The header is used for the merged table and all other headers are removed. If there is no header line, set to 0 or leave the parameter out completely. If the parameter is passed without a value, it defaults to 1! |
| ext | | The extension to use for the merged file. If `type` is a list, `ext` will be used for all types unless `ext` itself is a list of the same length as `type` |
| add_filename | | If set, the source filename will be appended to each line in the resulting table |

Lines for parameter file

Merge sample-scope tables into single project-scope table:

merge_blast_tables:
    module:         merge_table
    base:           merge1
    script_path:
    scope:          project
    type:           blast.prot
    header:         0

Merge sample-scope tables into group-scope table, by category country:

merge_blast_tables:
    module:         merge_table
    base:           merge1
    script_path:
    scope:          group
    category:       country
    type:           blast.prot
    header:         0
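
Merge several types at once, with a separate header count and extension for each type. This is a sketch of the tip on list-valued parameters: the step name, the choice of types, the assumption that the first type carries one header line, and the `tsv` extension are illustrative only.

```yaml
# Hypothetical step: merge two types in one instance
merge_several_types:
    module:         merge_table
    base:           merge1
    script_path:
    scope:          project
    type:           [blast.prot, fasta.nucl]   # each type is merged independently
    header:         [1, 0]                     # same order and length as type
    ext:            [tsv, fna]                 # same order and length as type
```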

`split_fasta`

Authors: Menachem Sklarz
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for splitting fasta files into parts.

Convenient for parallelizing processes on the cluster. You can take a project-wide fasta file (such as a transcriptome), split it into sub-fasta files, and run various processes on the sub-files.

The parts can then be combined with the merge_table module, which can concatenate any type of file.

Important

When splitting sample-scope fasta files, the subsamples are stored with a source category set to the original sample name. You can use this for merging results at the sample scope downstream. See documentation for merge_table.

Requires

A fasta file in one of the following slots (scope = “project”):
- sample_data["project_data"]["fasta.nucl"]
- sample_data["project_data"]["fasta.prot"]
A fasta file in one of the following slots (scope = “sample”):
- sample_data[<sample>]["fasta.nucl"]
- sample_data[<sample>]["fasta.prot"]

Output

Puts output files in the following slots:
- sample_data[<sample>]["fasta.nucl"]
- sample_data[<sample>]["fasta.prot"]
For sample scope, the original sample list will be overridden with the new sample list.

Parameters that can be set

| Parameter | Values | Comments |
|---|---|---|
| type | nucl\|prot | The type of fasta file to split |
| subsample_num | | Number of fragments |

Lines for parameter file

split_fasta1:
    module:         split_fasta
    base:           Trinity1
    script_path:    
    type:           nucl
    subsample_num:      4
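
A split step is usually paired with a later merge_table step. The sketch below shows only the wiring between the two; the step name `merge_fasta_parts` is hypothetical, and in a real workflow the merge step would typically be based on the last step that ran on the parts and would merge that step's output type.

```yaml
split_fasta1:
    module:         split_fasta
    base:           Trinity1
    script_path:
    type:           nucl
    subsample_num:  4

# Hypothetical step: concatenate the parts back into one project-scope file
merge_fasta_parts:
    module:         merge_table
    base:           split_fasta1    # placeholder: usually a step run on the parts
    script_path:
    scope:          project
    type:           fasta.nucl
    header:         0               # fasta files have no header line
```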

`fasta_splitter`

Authors: Menachem Sklarz
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for splitting fasta files into parts, using fasta-splitter.pl.

Convenient for parallelizing processes on the cluster. You can take a project-wide fasta file (such as a transcriptome), split it into sub-fasta files, and run various processes on the sub-files.

The parts can then be combined with the merge_table module, which can concatenate any type of file.

Attention

The module ships with fasta-splitter.pl version 0.2.6, 2017-08-01.

Leave script_path empty to use the perl script provided. Perl must be in the path!

To use a different version, supply it via script_path.

Usage:

Usage: fasta-splitter [options] <file>...
    Options:
        --n-parts <N>        - Divide into <N> parts
        --part-size <N>      - Divide into parts of size <N>
        --measure (all|seq|count) - Specify whether all data, sequence length, or
                               number of sequences is used for determining part
                               sizes ('all' by default).
        --line-length        - Set output sequence line length, 0 for single line
                               (default: 60).
        --eol (dos|mac|unix) - Choose end-of-line character ('unix' by default).
        --part-num-prefix T  - Put T before part number in file names (def.: .part-)
        --out-dir            - Specify output directory.
        --nopad              - Don't pad part numbers with 0.
        --version            - Show version.
        --help               - Show help.

You cannot use the --part-size option, since it produces an unknown number of files, and the number of parts must be defined in Neat-Seq Flow.

Please do not use the --nopad parameter. There is no reason to…

Important

When splitting sample-scope fasta files, the subsamples are stored with a source category set to the original sample name. You can use this for merging results at the sample scope downstream. See documentation for merge_table.

Requires

A fasta file in one of the following slots (scope = “project”):
- sample_data["project_data"]["fasta.nucl"]
- sample_data["project_data"]["fasta.prot"]
A fasta file in one of the following slots (scope = “sample”):
- sample_data[<sample>]["fasta.nucl"]
- sample_data[<sample>]["fasta.prot"]

Output

Puts output files in the following slots:
- sample_data[<sample>]["fasta.nucl"]
- sample_data[<sample>]["fasta.prot"]
For sample scope, the original sample list will be overridden with the new sample list.

Parameters that can be set

| Parameter | Values | Comments |
|---|---|---|
| `type` | nucl\|prot | The type of fasta file to split |
| `redirects: --n-parts` | | Number of fragments |

Lines for parameter file

split_fasta1:
    module:         fasta_splitter
    base:           Trinity1
    script_path:
    type:           nucl
    redirects:
        --n-parts:      4
        --measure:      seq
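
Other options from the usage text above can be passed through `redirects` in the same way. The following sketch (the step name `split_fasta_by_count` is hypothetical) balances the parts by number of sequences and writes each sequence on a single line:

```yaml
# Hypothetical step: split into 8 parts balanced by sequence count
split_fasta_by_count:
    module:         fasta_splitter
    base:           Trinity1
    script_path:
    type:           nucl
    redirects:
        --n-parts:      8
        --measure:      count     # use number of sequences to size the parts
        --line-length:  0         # 0 writes each sequence on a single line
```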

References

http://kirill-kryukov.com/study/tools/fasta-splitter

`ProjectToSample` ^*

Authors: Menachem Sklarz
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A utility module for moving project data to a sample, and back again. It is useful when a module that works on sample data has to be executed on data in the project scope.

For instance, in the STAR 2 pass pipeline, the first stage involves aligning all reads to the reference in order to find splice junctions. The reads can be merged into project-scope fastq.F and fastq.R slots, but all aligners take their reads from the sample scope!

This module overrides the sample list with a single sample containing the project slots (or a subset of the slots). Then, the mapping modules will take the project-wide reads from the sample representing the project.

Recovering the old sample list is done by setting the direction parameter to smp2proj.

See the STAR2pass workflow for a working example.

Usually, the module should be called twice, once in the proj2smp direction and then in the smp2proj direction. Although it is possible to use smp2proj to move data from the sample sample_name to the project, it is better to do this operation with the manage_types module.

Requires

Output

Parameters that can be set

| Parameter | Values | Comments |
|---|---|---|
| direction | proj2smp\|smp2proj | Move project info to sample or vice versa |
| type | | The types to operate on |
| operation | cp\|mv | Whether to move the slots or just copy them |
| sample_name | | The name of the new sample to create or the sample to copy from. Defaults to project title |

Attention

This module does NOT operate on the actual files! It only modifies the internal file type index.

Lines for parameter file

Moving from project to sample:

ProjectToSample:
    module:     ProjectToSample
    base:       merge_table
    script_path:
    direction:  proj2smp
    # sample_name:    fromproj
    operation:  mv   # mv or cp
    type:       [fastq.F,fastq.R]

Moving from sample to project:

SampleToProject:
    module:     ProjectToSample
    base:       STAR_map_proj
    script_path:
    direction:  smp2proj
    operation:  mv   # mv or cp
    type:       SJ.out.tab

Copying and moving from sample to project: (Just for the example. Isn’t necessarily practical)

SampleToProject:
    module:     ProjectToSample
    base:       STAR_map_proj
    script_path:
    direction:  smp2proj
    operation:  [cp, mv, mv]   # mv or cp
    type:       [SJ.out.tab, fastq.F, fastq.R]
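
Using an explicit sample name in both directions. This sketch assumes that the same `sample_name` should be given to the proj2smp and smp2proj calls, so that the second call reads from the sample created by the first; the step names and the name `fromproj` are illustrative.

```yaml
# Hypothetical pair of steps sharing one explicit sample name
ProjToSmp_named:
    module:         ProjectToSample
    base:           merge_table
    script_path:
    direction:      proj2smp
    sample_name:    fromproj          # name of the new sample to create
    operation:      cp                # copy, leaving the project slots in place
    type:           [fastq.F, fastq.R]

SmpToProj_named:
    module:         ProjectToSample
    base:           STAR_map_proj
    script_path:
    direction:      smp2proj
    sample_name:    fromproj          # the sample to take the data from
    operation:      mv
    type:           SJ.out.tab
```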
