Various Reporting Programs

Modules included in this section

NGSplot

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running NGSplot:

Runs NGSplot on existing sorted BAM files.

Make sure the BAM files are sorted, for example by running this module after the samtools module.

If this is a ChIP-seq experiment and you have controls defined, it will also run NGSplot for the sample:control comparison.

At the moment, the module works only at the sample scope. (BAM files in the project scope are rare!)

Requires

  • BAM files in the following slots:

    • sample_data[<sample>]["bam"]

Output

  • Puts output NGS reports in the following slots:

    • self.sample_data[<sample>]["NGSplot"]

  • For ChIP-seq data, puts comparison reports in

    • self.sample_data[<sample>]["NGSplot_vs_control"]
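The slot notation above can be read as a nested dictionary keyed by sample name. The following is a minimal sketch of that layout, not the module's code: the sample names, file paths and the helper `record_ngsplot_output` are hypothetical and serve only to show where the input and output slots sit.

```python
# Hypothetical illustration of the slot layout; sample names and paths are made up.
sample_data = {
    "Sample1": {"bam": "data/Sample1/Sample1.sorted.bam"},
    "Sample2": {"bam": "data/Sample2/Sample2.sorted.bam"},
}


def record_ngsplot_output(sample_data, sample, report_path, control_report=None):
    """Store NGSplot report paths in the slots named in this section."""
    if "bam" not in sample_data[sample]:
        raise KeyError(f"No BAM file defined for sample {sample}")
    sample_data[sample]["NGSplot"] = report_path
    if control_report is not None:
        # Only filled for ChIP-seq samples that have a control defined.
        sample_data[sample]["NGSplot_vs_control"] = control_report
    return sample_data


record_ngsplot_output(sample_data, "Sample1", "ngsplot/Sample1")
record_ngsplot_output(sample_data, "Sample2", "ngsplot/Sample2",
                      control_report="ngsplot/Sample2_vs_control")
```

In this sketch only `Sample2` receives an `NGSplot_vs_control` entry, mirroring the case where a control is defined for one sample but not the other.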

Parameters that can be set

Parameter   Values                      Comments
setenv      NGSPLOT=/path/to/ngsplot    Running NGSplot requires setting this environment variable.

Lines for parameter file

NGSplot_genebody:
    module:             NGSplot
    base:               sam_base
    script_path:        Rscript /path/to/ngsplot-2.61/bin/ngs.plot.r
    setenv:             NGSPLOT=/path/to/ngsplot-2.61
    redirects:
        -G:             mm10
        -R:             genebody
        -P:             20
        -GO:            hc
    qsub_params:
        -pe:            shared 20
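To make the role of script_path, setenv and redirects more concrete, here is a rough sketch of how such a parameter block could be turned into a shell command. It is an assumption about the general idea, not the module's actual implementation: the function `build_command` is hypothetical, and the arguments the module adds itself (such as the input BAM file and the output name) are left out.

```python
import shlex


def build_command(script_path, redirects, setenv=None):
    """Join script_path and redirects into shell lines (hypothetical helper).

    A redirect whose value is None is treated as a bare flag.
    """
    lines = []
    if setenv:
        # Export the environment variable before the command runs.
        lines.append(f"export {setenv}")
    args = [script_path]
    for flag, value in redirects.items():
        if value is None:
            args.append(flag)
        else:
            args.append(f"{flag} {shlex.quote(str(value))}")
    lines.append(" ".join(args))
    return "\n".join(lines)


command = build_command(
    script_path="Rscript /path/to/ngsplot-2.61/bin/ngs.plot.r",
    redirects={"-G": "mm10", "-R": "genebody", "-P": 20, "-GO": "hc"},
    setenv="NGSPLOT=/path/to/ngsplot-2.61",
)
print(command)
```

The redirects are passed through in the order they are written, so the parameter file reads much like the final command line.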

References

Shen, L., Shao, N., Liu, X. and Nestler, E., 2014. ngs.plot: Quick mining and visualization of next-generation sequencing data by integrating genomic databases. BMC genomics, 15(1), p.284.

Multiqc *

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for preparing a MultiQC report for all samples.

Tip

By default, the module will search for parsable reports in the directories of all the modules in the branch leading to this instance. To search only in the directories of the explicit base steps, specify the bases_only parameter.
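The difference between the default search and bases_only can be sketched as a walk over the step graph. This is only an illustration under assumptions: the function `search_steps` and the step names other than sam_bwt2_1 and fqc_trim1 are hypothetical, and the module's real directory lookup may differ.

```python
def search_steps(step_bases, step, bases_only=False):
    """Return the steps whose directories would be searched (hypothetical helper).

    step_bases maps each step name to the list of its direct base steps.
    """
    direct = list(step_bases.get(step, []))
    if bases_only:
        # Only the bases listed explicitly for this step.
        return direct
    found = []
    stack = direct[::-1]
    while stack:
        current = stack.pop()
        if current in found:
            continue
        found.append(current)
        # Continue up the branch through the bases of this base.
        stack.extend(reversed(step_bases.get(current, [])))
    return found


# Hypothetical workflow; only sam_bwt2_1 and fqc_trim1 appear in this section.
step_bases = {
    "merge1": [],
    "fqc_merge1": ["merge1"],
    "trim1": ["merge1"],
    "fqc_trim1": ["trim1"],
    "bwt2_1": ["trim1"],
    "sam_bwt2_1": ["bwt2_1"],
    "firstMultQC": ["sam_bwt2_1", "fqc_trim1"],
}

whole_branch = search_steps(step_bases, "firstMultQC")
explicit_only = search_steps(step_bases, "firstMultQC", bases_only=True)
```

Here `whole_branch` also includes the trimming, mapping and merging steps upstream of the two bases, while `explicit_only` contains just the two bases. A step such as fqc_merge1, which is not on the branch leading to the MultiQC step, is excluded in both cases.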

Requires

  • No strict requirements. The report is informative only if at least one of the base steps produces reports that MultiQC can parse, e.g. fastqc, bowtie2 or samtools.

Output

  • Puts the report directory in the following slot:

    • self.sample_data[<sample>]["Multiqc_report"]

Parameters that can be set

Parameter    Values   Comments
bases_only            Search directories of explicit base steps only.

Lines for parameter file

firstMultQC:
    module: Multiqc
    base:
        - sam_bwt2_1
        - fqc_trim1
    bases_only:
    script_path: /path/to/multiqc

References

Ewels, P., Magnusson, M., Lundin, S. and Käller, M., 2016. MultiQC: summarize analysis results for multiple tools and samples in a single report. Bioinformatics, 32(19), pp.3047-3048.

Collect_results

Authors

Liron Levin

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

Short Description

A module to collect and merge or append results from the directories of all base steps. For each base step, the module searches the step's result directories for result files whose names match a common pattern (a regular expression). The sample name can be inferred for each result file from the name of its parent directory and added to the merged file as a new column named "Samples". The result files are appended (the default) or merged by a common column name. Each merged file can then be converted to a pivot table file.
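The default append behaviour, with the sample name taken from the parent directory, can be sketched as follows. This is not the module's script: the module relies on pandas, whereas this sketch uses only the standard library, and the function `collect_results`, the directory layout and the file contents are hypothetical.

```python
import csv
import re
import tempfile
from pathlib import Path


def collect_results(base_dir, pattern, add_samples_names=True, sep="\t"):
    """Append all files under base_dir whose name matches the regex pattern.

    Returns a list of row dictionaries. When add_samples_names is True, the
    name of each file's parent directory is stored in a "Samples" column.
    """
    regex = re.compile(pattern)
    rows = []
    for path in sorted(Path(base_dir).rglob("*")):
        if not path.is_file() or not regex.search(path.name):
            continue
        with path.open(newline="") as handle:
            for row in csv.DictReader(handle, delimiter=sep):
                if add_samples_names:
                    row["Samples"] = path.parent.name
                rows.append(row)
    return rows


# Build a small, made-up result tree: one directory per sample.
with tempfile.TemporaryDirectory() as tmp:
    for sample, value in [("SampleA", "10"), ("SampleB", "20")]:
        sample_dir = Path(tmp) / sample
        sample_dir.mkdir()
        (sample_dir / "result.out").write_text(f"gene\tcount\ngeneX\t{value}\n")
    merged = collect_results(tmp, r"\.out$")
```

Each row of `merged` carries the original columns plus the "Samples" column, which is what makes the appended rows distinguishable after they are combined into one file.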

Requires

  • Tab-delimited files with a common name pattern, found within the base step data directories:

  • For example, files ending with .out

Output

  • Generates merged tab-delimited files:

  • Generates a file for each of the base steps, with the file name ending with .merg

  • Can also generate an Excel file with a sheet for each base step

  • Puts the results file in:

    self.sample_data["project_data"]["results"]

Parameters that can be set

Parameter

Values

Comments

Comments

  • The following python packages are required:

    pandas openpyxl

Lines for parameter file

Step_Name:                            # Name of this step
    module: Collect_results           # Name of the used module
    base:                             # Name of the step [or list of names] to run after and collect results from [must be after a merge step]
    script_path:                      # Command for running the merging script
                                      # If this line is empty or missing it will try using the module's associated script
    redirects:
        -R:                           # Regular expression to find result files
        --Merge_by:                   # Merge files by common column
        --header:                     # Don't use a header row, use integers instead [0,1,2,3...], easy to use with --pivot option
        --Excel:                      # Collect all results to excel file split by sheets
        --add_samples_names:          # Infer and add samples names from file parent directory to "Samples" column
        --pivot:                      # Convert to pivot table by [index columns values]
                                      # If used with the options --add_samples_names and --header, it is possible to use: '''Samples'' '5' '0''
        --MetaData:                   # Use external MetaData file as the base for merging
        --split_by:                   # Split the data in the columns [index <columns> values] before pivot
        --sep:                        # Columns separator for input file
        -T:                           # Write transposed output
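The --pivot option takes three values: index, columns and values. The reshaping this implies can be sketched in plain Python. This is an illustration under assumptions, not the module's script (which relies on pandas): the function `pivot` and the example rows are hypothetical, and when several rows share the same index and column the last one wins in this sketch.

```python
def pivot(rows, index, columns, values, fill=""):
    """Reshape long-format rows into a wide table (hypothetical helper).

    The result has one row per distinct `index` value and one column per
    distinct `columns` value; cells hold the matching `values` entry.
    """
    column_names = []
    table = {}
    for row in rows:
        if row[columns] not in column_names:
            column_names.append(row[columns])
        table.setdefault(row[index], {})[row[columns]] = row[values]
    header = [index] + column_names
    body = [[key] + [cells.get(name, fill) for name in column_names]
            for key, cells in table.items()]
    return [header] + body


# Made-up merged rows with a "Samples" column.
rows = [
    {"Samples": "SampleA", "gene": "geneX", "count": "10"},
    {"Samples": "SampleA", "gene": "geneY", "count": "3"},
    {"Samples": "SampleB", "gene": "geneX", "count": "20"},
]
wide = pivot(rows, index="Samples", columns="gene", values="count", fill="0")
```

With the "Samples" column as the index, each sample becomes one row of the pivot table, which is why adding sample names pairs naturally with pivoting.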

Tree_plot

Authors

Liron Levin

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

Short Description

A module for plotting a tree file in newick format together with MetaData information and, optionally, additional matrix information.

Requires

  • A tree file in newick format in:

    self.sample_data["project_data"]["newick"]

  • A tab-delimited file with sample names in one of its columns, taken from:

    self.sample_data["project_data"]["MetaData"], self.sample_data["project_data"]["results"] or an external file.
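Because the plot joins the tree tips to rows of the tab-delimited file through an ID column, it can help to check beforehand that every tip has a matching row. The following is a minimal sketch of such a check and is not part of the module: the functions `newick_tips` and `missing_ids`, the tree and the metadata are hypothetical, and the simple regular expression handles only unquoted tip labels.

```python
import re


def newick_tips(newick):
    """Return the tip labels of a Newick string (unquoted labels only)."""
    # A tip label directly follows "(" or "," and ends at ":", ",", ")" or ";".
    labels = re.findall(r"[(,]\s*([^(),:;]+)", newick)
    return [label.strip() for label in labels if label.strip()]


def missing_ids(newick, metadata_rows, id_field):
    """Return tip labels that have no matching row in the metadata."""
    known = {row[id_field] for row in metadata_rows}
    return [tip for tip in newick_tips(newick) if tip not in known]


# Made-up tree and metadata for illustration.
tree = "((SampleA:0.1,SampleB:0.2):0.3,SampleC:0.4);"
metadata = [
    {"ID": "SampleA", "Source": "clinical"},
    {"ID": "SampleB", "Source": "food"},
]
tips = newick_tips(tree)
unmatched = missing_ids(tree, metadata, id_field="ID")
```

In this example `unmatched` lists the one tip without a metadata row, which would otherwise appear in the plot without annotation.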

Output

  • Generates a PDF file of the tree with the MetaData information.

Parameters that can be set

Parameter

Values

Comments

Comments

  • The following R packages are required:

    optparse ape ggtree openxlsx

Lines for parameter file

Step_Name:                            # Name of this step
    module: Tree_plot                 # Name of the used module
    base:                             # Name of the step [or list of names] to run after and generate a Tree plot [must be after a tree making step]
                                      # If more than one base is specified, slots from the first base overwrite overlapping slots from the other bases
    script_path:                      # Command for running the Tree plot script
                                      # If this line is empty or missing it will try using the module's associated script
    iterate_on_bases:                 # If set will iterate over the step's bases and generate a plot for each base. 
    tree_by_heatmap:                  # Generate additional tree using Hierarchical Clustering of the heatmap
    redirects:
        --layout:                     # Tree layout [fan or rectangular (default)]
        --Meta_Data:                  # Path to tab-delimited Meta Data file with header line. 
                                      # If this line is empty or missing it will try searching for results data.
        --M_Excel:                    # If the Meta_Data input is an Excel file indicate the sheet name to use
        --ID_field:                   # Column name in the Meta Data file for IDs found in the tips of the tree
        --cols_to_use:                # Columns in the Meta Data file to use and the order from the center up  
        --open.angle:                 # Tree open angle.
        --branch.length:              # Don't use branch length [cladogram]
        --conect.tip:                 # Connect the tip to its label
        --pre_spacer:                 # Space before the label text [default=0.05]
        --post_spacer:                # Space after the label text [default=0.01]
        --OTU:                        # Column name in the Meta Data file to use as OTU annotation
        --labels:                     # Use branch length labels
        --Tip_labels:                 # Show tip labels
        --heatmap:                    # Path to Data file to generate a heatmap 
                                      # If this line is empty it will try searching for results data.
        --H_Excel:                    # If the heatmap input is an Excel file indicate the sheet name to use
        --heatmap_cell_border:        # Color of heatmap cell border [default='white']
        --heatmap_lowest_value:       # Color of heatmap lowest value [default='white']
        --heatmap_highest_value:      # Color of heatmap highest value [default='red']
        --cols_to_use_heatmap:        # Columns in the heatmap Data file to use and the order from the center up
        --ID_heatmap_field:           # Column name for IDs found in the tips of the tree in the heatmap Data file
        --heatmap_variable:           # Use only variable columns in the heatmap
        --heatmap_count_by_sep:       # Count the sep in each cell to generate the values for the heatmap
        --heatmap_HC_dist:            # The heatmap Hierarchical Clustering dist method
        --heatmap_HC_agg:             # The heatmap Hierarchical Clustering agglomeration method

BUSCO

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A class that defines a module for running BUSCO.

BUSCO searches for predefined sequences in an assembly. See the BUSCO website.

This module creates scripts for running BUSCO on a fasta file against a BUSCO lineage database.

The lineage can be specified in two ways:

  1. Specify the path to the lineage file with the --lineage redirected argument.

  2. Specify the URL of the database (e.g. http://busco.ezlab.org/v2/datasets/eukaryota_odb9.tar.gz) with the get_lineage parameter. The file will be downloaded and unpacked.
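The two ways of specifying the lineage can be sketched as a small resolution step: an explicit --lineage path is used as is, and otherwise the archive is downloaded and unpacked. This is an assumption about the general logic rather than the module's generated script. The functions `unpack_lineage` and `resolve_lineage` are hypothetical, and the sketch assumes the archive contains a single top-level directory.

```python
import tarfile
import urllib.request
from pathlib import Path


def unpack_lineage(archive, dest_dir):
    """Extract a .tar.gz lineage archive and return its top-level directory."""
    with tarfile.open(archive, "r:gz") as tar:
        # Assumes all members share one top-level directory.
        top = tar.getnames()[0].split("/")[0]
        # Only unpack archives from a trusted source.
        tar.extractall(dest_dir)
    return Path(dest_dir) / top


def resolve_lineage(redirects, get_lineage=None, dest_dir="."):
    """Pick the lineage directory; an explicit --lineage wins over get_lineage."""
    if redirects.get("--lineage"):
        return Path(redirects["--lineage"])
    if get_lineage is None:
        raise ValueError("Specify either --lineage or get_lineage")
    archive = Path(dest_dir) / get_lineage.rsplit("/", 1)[-1]
    urllib.request.urlretrieve(get_lineage, archive)  # download the archive
    return unpack_lineage(archive, dest_dir)
```

Calling `resolve_lineage({"--lineage": "/path/to/lineage"})` returns the given path without any download, while `resolve_lineage({}, get_lineage=<URL>)` fetches and unpacks the archive first.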

Requires

  • fasta files in one of the following slots for sample-wise BUSCO:

    • sample_data[<sample>]["fasta.nucl"]

    • sample_data[<sample>]["fasta.prot"]

  • or fasta files in one of the following slots for project-wise BUSCO:

    • sample_data["fasta.nucl"]

    • sample_data["fasta.prot"]

Output:

  • Stores output directory in:

    • self.sample_data[<sample>]["BUSCO"] (scope = sample)

    • self.sample_data["project_data"]["BUSCO"] (scope = project)

Parameters that can be set

Parameter     Values             Comments
scope         sample | project   Use the sample-scope or project-scope fasta file.
get_lineage                      Path (URL) to one of the lineages to download from https://busco.ezlab.org/frame_wget.html. The file is downloaded, unpacked and used if no --lineage is passed.

Lines for parameter file

Run BUSCO on project-scope fasta file, using a pre-downloaded BUSCO database:

BUSCO1:
    module:             BUSCO
    base:               Trinity_assembl
    script_path:        {Vars.paths.BUSCO} 
    scope:              project
    redirects:
        --mode:         transcriptome
        --lineage:      {Vars.databases.BUSCO}
        --cpu:          65
        --force:
        --restart:

Run BUSCO on project-scope fasta file, including downloading the BUSCO database:

BUSCO1:
    module:             BUSCO
    base:               Trinity_assembl
    script_path:        {Vars.paths.BUSCO}
    scope:              project
    get_lineage:        http://busco.ezlab.org/v2/datasets/eukaryota_odb9.tar.gz
    redirects:
        --mode:         transcriptome
        --cpu:          65
        --force:
        --restart:

References