RNASeq

Modules included in this section

DeSeq2

`DeSeq2`

Authors: Liron Levin
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

Short Description

A module to perform:

* Gene-level differential expression analysis using DESeq2.
* Gene annotation.
* PCA plot.
* Clustering of significant genes.
* Heatmaps of significant genes by cluster.
* Expression pattern plots by cluster.
* Enrichment analysis (KEGG/GO).

Requires

Searches for count data in:

* self.sample_data[<sample>]["RSEM"]
* self.sample_data[<sample>]["genes.counts"]
* self.sample_data[<sample>]["HTSeq.counts"]
* self.sample_data["project_data"]["results"]
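
The lookup described above can be sketched in Python as follows. This is a minimal illustration, not the module's actual code: the function name `find_count_sources`, the precedence order of the slots, and the example paths are all assumptions made for the sketch.

```python
from typing import Dict, Optional, Tuple

# Per-sample slots named in this section. The precedence order used here
# is an assumption of this sketch, not documented module behaviour.
SAMPLE_SLOTS = ("RSEM", "genes.counts", "HTSeq.counts")


def find_count_sources(
    sample_data: dict,
) -> Tuple[Dict[str, Tuple[str, str]], Optional[str]]:
    """Return ({sample: (slot, path)}, project-level results path or None)."""
    per_sample = {}
    for name, slots in sample_data.items():
        # Skip the project-level entry and anything that is not a slot dict.
        if name == "project_data" or not isinstance(slots, dict):
            continue
        for slot in SAMPLE_SLOTS:
            if slot in slots:
                per_sample[name] = (slot, slots[slot])
                break  # keep the first matching slot only
    project_results = sample_data.get("project_data", {}).get("results")
    return per_sample, project_results


# Hypothetical sample_data structure with made-up paths.
example = {
    "sampleA": {"RSEM": "/path/to/sampleA.genes.results"},
    "sampleB": {"HTSeq.counts": "/path/to/sampleB.counts"},
    "project_data": {"results": "/path/to/merged_counts.tsv"},
}
per_sample, project_results = find_count_sources(example)
```

Here `per_sample` maps each sample to the first slot found for it, and `project_results` holds the project-level path when one exists.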

Parameters that can be set

Parameter	Values	Comments
use_click		Will use the CLICK clustering program (Shamir et al. 2000)

Note

If you are using the use_click option, please cite:

* Expander: Ulitsky I, Maron-Katz A, Shavit S, Sagir D, Linhart C, Elkon R, Tanay A, Sharan R, Shiloh Y, Shamir R. Expander: from expression microarrays to networks and functions. Nature Protocols, Vol 5, pp 303-322, 2010.
* CLICK: Shamir R. and Sharan R. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings ISMB 2000, pp. 307-316 (2000).

Comments

The following R packages are required:
DESeq2 ggplot2 pheatmap mclust factoextra cowplot gridExtra biomaRt clusterProfiler KEGGREST scater sva rmarkdown plotly DT xml2 dplyr RColorBrewer colorspace stringr

Note

It is possible to use conda to install all dependencies:

wget https://raw.githubusercontent.com/bioinfo-core-BGU/neatseq-flow-modules/master/neatseq_flow_modules/Liron/DeSeq2_module/DeSeq2_env_install.yaml
conda env create -f DeSeq2_env_install.yaml

Follow this tutorial for more information.

Lines for parameter file

Step_Name:                              # Name of this step
    module: DeSeq2                      # Name of the used module
    base:                               # Name of the step [or list of names] to run after with count results.
    script_path:                        # Command for running the DeSeq2 script
                                        # If this line is empty or missing it will try using the module's associated script
    use_click:                          # Will use the CLICK clustering program (Shamir et al. 2000). 
    redirects:
        --SAMPLE_DATA_FILE:             # Path to Samples Information File
        --GENE_ID_TYPE:                 # The Gene ID Type, e.g. 'ENSEMBL' [for Bioconductor] or 'ensembl_gene_id'/'ensembl_transcript_id' [for ENSEMBL]
        --Annotation_db:                # Bioconductor Annotation Data Base Name from https://bioconductor.org/packages/release/BiocViews.html#___OrgDb  
        --Species:                      # Species Name to Retrieve Annotation Data from ENSEMBL
        --KEGG_Species:                 # Species Name to Retrieve Annotation Data from KEGG
        --KEGG_KAAS:                    # Gene to KO file from KEGG KAAS [first column gene id, second column KO number]
        --Trinotate:                    # Path to a Trinotate annotation file in which the first column is the genes names
        --FILTER_SAMPLES:               # Filter Samples with Low Number of expressed genes OR with Small Library size using 'scater' package 
        --FILTER_GENES:                 # Filter Low-Abundance Genes using 'scater' package
        --NORMALIZATION_TYPE:           # The DeSeq2 Normalization Type To Use [VSD , RLOG] The Default is VSD
        --BLIND_NORM:                   # Perform Blind Normalization
        --DESIGN:                       # The Main DeSeq2 Design [ ~ Group ]
        --removeBatchEffect:            # Remove batch effects from the normalized count data.
                                        # Takes up to two batch-effect fields [from the Sample Data], separated by ','
                                        # [two fields with the limma package, only one with the sva package]
        --removeBatchEffect_method:     # The method used to remove batch effects from the normalized count data: limma or sva [sva is the default]
        --LRT:                          # The LRT DeSeq2 Design
        --ALPHA:                        # Significant Level Cutoff, The Default is 0.05
        --Post_statistical_ALPHA:       # Post Statistical P-value Filtering
        --FoldChange:                   # Fold change Cutoff [testing for fold changes greater in absolute value], The Default is 1
        --Post_statistical_FoldChange:  # Post Statistical Fold change Filtering
        --CONTRAST:                     # The DeSeq Contrast Design ["Group,Treatment,Control"] [Not For LRT].
                                        # It is possible to define more than one contrast Design ["Group,Treatment1,Control1|Group,Treatment2,Control2|..."]
        --SPLIT_BY_CONTRAST:            # Only use Samples found in the relevant contrast for Clustering and Enrichment Analysis
        --modelMatrixType:              # How the DeSeq model matrix of the GLM formula is formed [standard or expanded] ,The Default is standard
        --GENES_PLOT:                   # Genes Id To Plot count Data [separated by ','] 
        --X_AXIS:                       # The Field In the Sample Data To Use as X Axis
        --GROUP:                        # The Field In the Sample Data To Group By [can be two fields separated by ',']
        --SPLIT_BY:                     # The Field In the Sample Data To Split the Analysis By.
        --FUNcluster:                   # A clustering function including [kmeans,pam,clara,fanny,hclust,agnes,diana,click]. The default is hclust
                                        # If the 'use_click' option is used the '--FUNcluster' option is set to 'click' 
        --hc_metric:                    # Hierarchical clustering metric to be used for calculating dissimilarities between observations. The default is pearson
        --hc_method:                    # Hierarchical clustering agglomeration method to be used. The default is ward.D2
        --k.max:                        # The maximum number of clusters to consider, must be at least two. The default is 20
        --nboot:                        # Number of Monte Carlo (bootstrap) samples for determining the number of clusters [Not For Mclust]. The default is 10 
        --stand:                        # The Data will be Standardized Before Clustering.
        --Mclust:                       # Use Mclust for determining the number of clusters.
        --CLICK_HOMOGENEITY:            # The HOMOGENEITY [0-1] of clusters using CLICK program (Shamir et al. 2000). The default is 0.5 
        --PCA_COLOR:                    # The Field In the Sample Data To Determine Color In The PCA Plot
        --PCA_SHAPE:                    # The Field In the Sample Data To Determine Shape In The PCA Plot
        --PCA_SIZE:                     # The Field In the Sample Data To Determine Size In The PCA Plot. The default is Library Size
        --Enriched_terms_overlap:       # Test for genes overlap in enriched terms
        --USE_INPUT_GENES_AS_BACKGROUND: # Use The input Genes as the Background for Enrichment Analysis
        --only_clustering:              # Don't Perform Differential Analysis!!!
        --significant_genes:            # Use these genes as the set of significant genes [a comma separated list]
        --collapseReplicates:           # Will collapse technical replicates using a Sample Data field indicating which samples are technical replicates
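
The --CONTRAST value described above joins the three parts of each contrast with ',' and separates multiple contrasts with '|'. The Python sketch below shows one way such a string could be composed and checked before it is pasted into the parameter file. The helper names `build_contrast` and `parse_contrast` are hypothetical and are not part of the module; the field and level names are taken from the placeholder text in the comments above.

```python
from typing import Iterable, List, Tuple

# (sample-data field, treatment level, control level)
Contrast = Tuple[str, str, str]


def build_contrast(contrasts: Iterable[Contrast]) -> str:
    """Join contrast triples into a single --CONTRAST string."""
    parts = []
    for contrast in contrasts:
        if len(contrast) != 3:
            raise ValueError(f"expected 3 items per contrast, got {contrast!r}")
        for item in contrast:
            # ',' and '|' are separators, so they cannot appear inside a name.
            if not item or "," in item or "|" in item:
                raise ValueError(f"invalid contrast item: {item!r}")
        parts.append(",".join(contrast))
    return "|".join(parts)


def parse_contrast(value: str) -> List[Contrast]:
    """Split a --CONTRAST string back into triples, validating its shape."""
    result = []
    for part in value.split("|"):
        fields = tuple(part.split(","))
        if len(fields) != 3:
            raise ValueError(f"malformed contrast: {part!r}")
        result.append(fields)
    return result


contrast_value = build_contrast(
    [("Group", "Treatment1", "Control1"), ("Group", "Treatment2", "Control2")]
)
```

With the two triples shown, `contrast_value` is `Group,Treatment1,Control1|Group,Treatment2,Control2`, which matches the multi-contrast form given in the comment for --CONTRAST.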