Miscellaneous Modules

manage_types *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for managing file types without creating scripts.

Supports adding, deleting, copying and moving file types.

Requires

Output

Parameters that can be set

Parameter    Values            Comments
operation    add|del|mv|cp     The operation to perform on the file type dictionary.
scope        project|sample    The scope on which to perform the operation. For ‘mv’ and ‘cp’, this is the source scope.
type                           The type on which to perform the operation. For ‘mv’ and ‘cp’, this is the source type.
scope_trg    project|sample    The destination scope for ‘mv’ and ‘cp’ operations.
type_trg                       The destination type for ‘mv’ and ‘cp’ operations.
path                           For the ‘add’ operation, the value to insert in the file type.

Attention

The operations do NOT operate on the actual files! They only modify the internal file type index.

Tip

You can combine several operations in one module instance by passing lists to the parameters in the table above. All lists must be of the same length; a plain string counts as a list of length 1 and is extrapolated to all operations. For example, to delete one file type and add another, both at the project scope, pass [del,add] to the ‘operation’ parameter and ‘project’ to the ‘scope’ parameter. The ‘path’ can also be a plain string: it is extrapolated to ‘del’ as well, which ignores it. See the example lines below.
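The delete-and-add case described in this tip could look like the following sketch. The instance name, the base step and the two type names are illustrative assumptions, and the path is a placeholder.

```yaml
# Sketch only: instance name, base and type names are hypothetical.
manage_types_del_add:
    module:             manage_types
    base:               trinity1
    script_path:
    scope:              project                 # plain string, applied to both operations
    operation:          [del, add]
    type:               [fasta.nucl, bam]       # delete fasta.nucl, add bam
    path:               /path/to/mapping.bam    # used by 'add', ignored by 'del'
```

Since neither operation is ‘mv’ or ‘cp’, type_trg and scope_trg are left out.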

Lines for parameter file

manage_types1:
    module:             manage_types
    base:               STAR_bld_ind
    script_path:        
    scope:              project
    operation:          mv
    type:               trinity.contigs
    type_trg:           trinity.contigs
    scope_trg:          sample

manage_types1:
    module:             manage_types
    base:               trinity1
    script_path:        
    scope:              [project, sample, sample, project]
    operation:          [mv, del, cp, add]
    type:               [fasta.nucl, fasta.nucl, fastq.F, bam]
    type_trg:           [transcripts.nucl, None, fastq.main, None]
    scope_trg:          sample
    path:               /path/to/mapping.bam

merge_table

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for merging sample tables into a single project-wide table, or into group tables by category.

The tables may or may not have a header line.

Can be used for merging fasta and fastq files as well.

Important

When merging by category, the sample names will be set to the category level names for all subsequent steps.

Tip

You can merge several types at once by passing them as a list to type. If the type files have different numbers of header lines, pass a list of header line counts to header. The header list must be of length 1 or of the same length as type.

The merged file gets the same extension as the files being merged if they all share one; otherwise, no extension is added. To override this default, set the ext parameter to the extension to use, e.g. fna. When several types are merged, a plain string is used for all types; for a different extension per file type, pass a list of strings in the same order as the type parameter.
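As a sketch of the list forms described in this tip, the following instance would merge two types at once, each with its own header count and extension. The instance name, the base step and the choice of types (blast.prot with one header line, fasta.nucl with none) are illustrative assumptions.

```yaml
# Sketch only: instance name, base and types are hypothetical.
merge_two_types:
    module:         merge_table
    base:           merge1
    script_path:
    scope:          project
    type:           [blast.prot, fasta.nucl]
    header:         [1, 0]        # one header line for blast.prot, none for fasta.nucl
    ext:            [tsv, fna]    # one extension per type, in the same order as type
```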

Attention

If you split sample-scope fasta files with the fasta_splitter or split_fasta modules, the new subsamples are stored with a source category containing the name of the sample from which each subsample was produced. When merging back into the sample scope, use scope: group and category: source.
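A minimal sketch of that merge-back step, assuming an upstream step (here called blast_subsamples, a hypothetical name) has produced a blast.prot table for each subsample:

```yaml
# Sketch only: instance name and base are hypothetical.
merge_back_to_samples:
    module:         merge_table
    base:           blast_subsamples
    script_path:
    scope:          group
    category:       source    # category holding the original sample name
    type:           blast.prot
    header:         0
```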

Requires

  • A table file in any slot:

    • sample_data[<sample>][<file.type>]

Output

  • Puts output files in the following slot:

    • sample_data["project_data"][<file.type>]

  • Or, for merging by category, in the following slot:

    • sample_data[category_level][<file.type>]

Parameters that can be set

Parameter       Values           Comments
type                             A file type that exists in all samples. Can also be a list of types, each of which is merged independently.
script_path                      Leave blank.
scope           project|group    Merge all samples into one project table, or merge sample tables by category.
category                         If scope is set to group, you must specify the category by which to divide the samples for merging. The category must be a string naming one of the categories (columns) in the mapping file.
header          0                The number of header lines in each table. The header is used for the complete table and all other headers are removed. If there is no header line, set to 0 or leave the parameter out completely. If the parameter is set without a value, it defaults to 1!
ext                              The extension to use for the merged file. If type is a list, ext is used for all types unless ext itself is a list of the same length as type.
add_filename                     If set, the source filename is appended to each line in the resulting table.

Lines for parameter file

Merge sample-scope tables into single project-scope table:

merge_blast_tables:
    module:         merge_table
    base:           merge1
    script_path:
    scope:          project
    type:           blast.prot
    header:         0

Merge sample-scope tables into group-scope table, by category country:

merge_blast_tables:
    module:         merge_table
    base:           merge1
    script_path:
    scope:          group
    category:       country
    type:           blast.prot
    header:         0

split_fasta

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for splitting fasta files into parts.

Convenient for parallelizing processes on the cluster. You can take a project-wide fasta file (such as a transcriptome), split it into sub-fasta files, and run various processes on the sub-files.

The parts can then be combined with the merge_table module, which can concatenate any type of file.
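The split-then-merge pattern described above might be laid out as in the following sketch. The merge_parts instance name is hypothetical, and in a real workflow its base would be the last step that ran on the sub-files rather than the split step itself.

```yaml
# Sketch only: 'merge_parts' is a hypothetical instance name.
split_fasta1:
    module:         split_fasta
    base:           Trinity1
    script_path:
    type:           nucl
    subsample_num:  4

# ... steps that process each sub-fasta file go here ...

merge_parts:
    module:         merge_table
    base:           split_fasta1    # assumption: replace with the last per-part step
    script_path:
    scope:          project
    type:           fasta.nucl
    header:         0
```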

Important

When splitting sample-scope fasta files, the subsamples are stored with a source category set to the original sample name. You can use this for merging results at the sample scope downstream. See documentation for merge_table.

Requires

  • A fasta file in one of the following slots (scope = “project”):

    • sample_data["project_data"]["fasta.nucl"]

    • sample_data["project_data"]["fasta.prot"]

  • A fasta file in one of the following slots (scope = “sample”):

    • sample_data[<sample>]["fasta.nucl"]

    • sample_data[<sample>]["fasta.prot"]

Output

  • Puts output files in the following slots:

    • sample_data[<sample>]["fasta.nucl"]

    • sample_data[<sample>]["fasta.prot"]

  • For sample scope, the original sample list will be overridden with the new sample list.

Parameters that can be set

Parameter        Values       Comments
type             nucl|prot    The type of fasta file to split.
subsample_num                 Number of fragments.

Lines for parameter file

split_fasta1:
    module:         split_fasta
    base:           Trinity1
    script_path:    
    type:           nucl
    subsample_num:  4

fasta_splitter

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for splitting fasta files into parts, using fasta-splitter.pl.

Convenient for parallelizing processes on the cluster. You can take a project-wide fasta file (such as a transcriptome), split it into sub-fasta files, and run various processes on the sub-files.

The parts can then be combined with the merge_table module, which can concatenate any type of file.

Attention

The module ships with fasta-splitter.pl version 0.2.6, 2017-08-01.

Leave script_path empty to use the perl script provided. Perl must be in the path!

To use a different version, supply it via script_path.

Usage:

Usage: fasta-splitter [options] <file>...
    Options:
        --n-parts <N>        - Divide into <N> parts
        --part-size <N>      - Divide into parts of size <N>
        --measure (all|seq|count) - Specify whether all data, sequence length, or
                               number of sequences is used for determining part
                               sizes ('all' by default).
        --line-length        - Set output sequence line length, 0 for single line
                               (default: 60).
        --eol (dos|mac|unix) - Choose end-of-line character ('unix' by default).
        --part-num-prefix T  - Put T before part number in file names (def.: .part-)
        --out-dir            - Specify output directory.
        --nopad              - Don't pad part numbers with 0.
        --version            - Show version.
        --help               - Show help.

You can’t use the --part-size method, since it produces an unknown number of files, which cannot be defined in NeatSeq-Flow.

Please do not use the --nopad parameter. There is no reason to…

Important

When splitting sample-scope fasta files, the subsamples are stored with a source category set to the original sample name. You can use this for merging results at the sample scope downstream. See documentation for merge_table.

Requires

  • A fasta file in one of the following slots (scope = “project”):

    • sample_data["project_data"]["fasta.nucl"]

    • sample_data["project_data"]["fasta.prot"]

  • A fasta file in one of the following slots (scope = “sample”):

    • sample_data[<sample>]["fasta.nucl"]

    • sample_data[<sample>]["fasta.prot"]

Output

  • Puts output files in the following slots:

    • sample_data[<sample>]["fasta.nucl"]

    • sample_data[<sample>]["fasta.prot"]

  • For sample scope, the original sample list will be overridden with the new sample list.

Parameters that can be set

Parameter               Values       Comments
type                    nucl|prot    The type of fasta file to split.
redirects: --n-parts                 Number of fragments.

Lines for parameter file

split_fasta1:
    module:         fasta_splitter
    base:           Trinity1
    script_path:
    type:           nucl
    redirects:
        --n-parts:      4
        --measure:      seq

References

http://kirill-kryukov.com/study/tools/fasta-splitter

ProjectToSample *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A utility module for moving project data to a sample, and back again. It is useful when a module that works on sample data has to be executed on data in the project scope.

For instance, in the STAR 2-pass pipeline, the first stage involves aligning all reads to the reference in order to find splice junctions. The reads can be merged into project-scope fastq.F and fastq.R slots, but all aligners take their reads from the sample scope!

This module overrides the sample list with a single sample containing the project slots (or a subset of the slots). Then, the mapping modules will take the project-wide reads from the sample representing the project.

Recovering the old sample list is done by setting the direction parameter to smp2proj.

See the STAR2pass workflow for a working example.

Usually, the module should be called twice: once in the proj2smp direction and then in the smp2proj direction. Although it is possible to use smp2proj to move data from the sample sample_name to the project, it is better to do this with the manage_types module.

Requires

Output

Parameters that can be set

Parameter      Values               Comments
direction      proj2smp|smp2proj    Move project info to sample or vice versa.
type                                The types to operate on.
operation      cp|mv                Whether to move the slots or just copy them.
sample_name                         The name of the new sample to create or the sample to copy from. Defaults to the project title.

Attention

This module does NOT operate on the actual files! It only modifies the internal file type index.

Lines for parameter file

Moving from project to sample:

ProjectToSample:
    module:     ProjectToSample
    base:       merge_table
    script_path:
    direction:  proj2smp
    # sample_name:    fromproj
    operation:  mv   # mv or cp
    type:       [fastq.F,fastq.R]

Moving from sample to project:

SampleToProject:
    module:     ProjectToSample
    base:       STAR_map_proj
    script_path:
    direction:  smp2proj
    operation:  mv   # mv or cp
    type:       SJ.out.tab

Copying and moving from sample to project (for illustration only; not necessarily practical):

SampleToProject:
    module:     ProjectToSample
    base:       STAR_map_proj
    script_path:
    direction:  smp2proj
    operation:  [cp, mv, mv]   # mv or cp
    type:       [SJ.out.tab, fastq.F, fastq.R]