Short Description

A module to perform:

  • Gene-level differential expression using DESeq2.
  • Gene annotation.
  • PCA plot.
  • Clustering of significant genes.
  • Heatmaps of significant genes by cluster.
  • Expression pattern plots by cluster.
  • Enrichment analysis (KEGG/GO).


  • Search for count data in:

    self.sample_data[<sample>]["RSEM"]
    self.sample_data[<sample>]["genes.counts"]
    self.sample_data[<sample>]["HTSeq.counts"]
    self.sample_data["project_data"]["results"]
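
The lookup described above can be pictured with a short Python sketch. It is only an illustration, not the module's actual code: the function name `find_count_data`, the example paths, and the rule that the first matching per-sample slot wins are all assumptions; only the slot names come from the list above.

```python
# Per-sample slot names taken from the list above.
# The precedence order is an assumption made for this sketch.
SAMPLE_SLOTS = ("RSEM", "genes.counts", "HTSeq.counts")


def find_count_data(sample_data):
    """Return {owner: (slot, path)} for every count file found (hypothetical helper)."""
    found = {}
    for name, slots in sample_data.items():
        # Skip the project-level entry and anything that is not a slot dictionary.
        if name == "project_data" or not isinstance(slots, dict):
            continue
        for slot in SAMPLE_SLOTS:
            if slot in slots:
                found[name] = (slot, slots[slot])
                break  # assumed: the first matching slot wins
    # Project-level count results, if present.
    project = sample_data.get("project_data", {})
    if "results" in project:
        found["project_data"] = ("results", project["results"])
    return found


# Hypothetical example data; the paths are placeholders.
sample_data = {
    "Sample1": {"HTSeq.counts": "/path/Sample1.counts"},
    "Sample2": {"RSEM": "/path/Sample2.genes.results"},
    "project_data": {"results": "/path/merged_counts.tsv"},
}

for owner, (slot, path) in find_count_data(sample_data).items():
    print(owner, slot, path)
```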

If the use_click option is set, the module will use the CLICK clustering program (Shamir et al. 2000).


If you are using the use_click option, cite:

  • Expander: Ulitsky I, Maron-Katz A, Shavit S, Sagir D, Linhart C, Elkon R, Tanay A, Sharan R, Shiloh Y, Shamir R. Expander: from expression microarrays to networks and functions. Nature Protocols, Vol. 5, pp. 303-322, 2010.
  • Click: Shamir, R. and Sharan, R. CLICK: A Clustering Algorithm with Applications to Gene Expression Analysis. Proceedings ISMB 2000, pp. 307-316 (2000).


  • The following R packages are required:

    DESeq2 ggplot2 pheatmap mclust factoextra cowplot gridExtra biomaRt clusterProfiler KEGGREST scater sva rmarkdown plotly DT xml2 dplyr RColorBrewer colorspace stringr


It is possible to use Conda to install all dependencies:

wget https://raw.githubusercontent.com/bioinfo-core-BGU/neatseq-flow-modules/master/neatseq_flow_modules/Liron/DeSeq2_module/DeSeq2_env_install.yaml
conda env create -f DeSeq2_env_install.yaml

Follow this tutorial for more information.

Lines for parameter file

Step_Name:                              # Name of this step
    module: DeSeq2                      # Name of the used module
    base:                               # Name of the step [or list of names] with count results to run after
    script_path:                        # Command for running the DeSeq2 script
                                        # If this line is empty or missing, the module's associated script is used
    use_click:                          # Will use the CLICK clustering program (Shamir et al. 2000).
    redirects:
        --SAMPLE_DATA_FILE:             # Path to Samples Information File
        --GENE_ID_TYPE:                 # The Gene ID Type, e.g. 'ENSEMBL' [for Bioconductor] OR 'ensembl_gene_id'/'ensembl_transcript_id' [for ENSEMBL]
        --Annotation_db:                # Bioconductor Annotation Database Name from https://bioconductor.org/packages/release/BiocViews.html#___OrgDb
        --Species:                      # Species Name to Retrieve Annotation Data from ENSEMBL
        --KEGG_Species:                 # Species Name to Retrieve Annotation Data from KEGG
        --KEGG_KAAS:                    # Gene to KO file from KEGG KAAS [first column gene id, second column KO number]
        --Trinotate:                    # Path to a Trinotate annotation file in which the first column is the genes names
        --FILTER_SAMPLES:               # Filter Samples with Low Number of expressed genes OR with Small Library size using 'scater' package 
        --FILTER_GENES:                 # Filter Low-Abundance Genes using 'scater' package
        --NORMALIZATION_TYPE:           # The DeSeq2 Normalization Type To Use [VSD , RLOG] The Default is VSD
        --BLIND_NORM:                   # Perform Blind Normalization
        --DESIGN:                       # The Main DeSeq2 Design [ ~ Group ]
        --removeBatchEffect:            # Remove Batch Effect from the Normalized counts data. Takes the Batch Effect fields
                                        # [from the Sample Data] separated by ','
                                        # [up to 2 fields using the limma package, only one using the sva package]
        --removeBatchEffect_method:     # The method to Remove Batch Effect from the Normalized counts data, using the limma or sva packages [sva is the default]
        --LRT:                          # The LRT DeSeq2 Design
        --ALPHA:                        # Significant Level Cutoff, The Default is 0.05
        --Post_statistical_ALPHA:       # Post Statistical P-value Filtering
        --FoldChange:                   # Fold change Cutoff [testing for fold changes greater in absolute value], The Default is 1
        --Post_statistical_FoldChange:  # Post Statistical Fold change Filtering
        --CONTRAST:                     # The DeSeq2 Contrast Design ["Group,Treatment,Control"] [Not for LRT].
                                        # It is possible to define more than one contrast Design ["Group,Treatment1,Control1|Group,Treatment2,Control2|..."]
        --SPLIT_BY_CONTRAST:            # Only use Samples found in the relevant contrast for Clustering and Enrichment Analysis
        --modelMatrixType:              # How the DeSeq2 model matrix of the GLM formula is formed [standard or expanded]. The Default is standard
        --GENES_PLOT:                   # Genes Id To Plot count Data [separated by ','] 
        --X_AXIS:                       # The Field In the Sample Data To Use as X Axis
        --GROUP:                        # The Field In the Sample Data To Group By [can be two fields separated by ',']
        --SPLIT_BY:                     # The Field In the Sample Data To Split the Analysis By.
        --FUNcluster:                   # A clustering function including [kmeans,pam,clara,fanny,hclust,agnes,diana,click]. The default is hclust
                                        # If the 'use_click' option is used the '--FUNcluster' option is set to 'click' 
        --hc_metric:                    # Hierarchical clustering metric to be used for calculating dissimilarities between observations. The default is pearson
        --hc_method:                    # Hierarchical clustering agglomeration method to be used. The default is ward.D2
        --k.max:                        # The maximum number of clusters to consider, must be at least two. The default is 20
        --nboot:                        # Number of Monte Carlo (bootstrap) samples for determining the number of clusters [Not For Mclust]. The default is 10 
        --stand:                        # The Data will be Standardized Before Clustering.
        --Mclust:                       # Use Mclust for determining the number of clusters.
        --CLICK_HOMOGENEITY:            # The HOMOGENEITY [0-1] of clusters using CLICK program (Shamir et al. 2000). The default is 0.5 
        --PCA_COLOR:                    # The Field In the Sample Data To Determine Color In The PCA Plot
        --PCA_SHAPE:                    # The Field In the Sample Data To Determine Shape In The PCA Plot
        --PCA_SIZE:                     # The Field In the Sample Data To Determine Size In The PCA Plot. The default is Library Size
        --Enriched_terms_overlap:       # Test for genes overlap in enriched terms
        --USE_INPUT_GENES_AS_BACKGROUND: # Use The input Genes as the Background for Enrichment Analysis
        --only_clustering:              # Don't Perform Differential Analysis!!!
        --significant_genes:            # Use these genes as the set of significant genes [a comma separated list]
        --collapseReplicates:           # Will collapse technical replicates using a Sample Data field indicating which samples are technical replicates
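
To illustrate the --CONTRAST value format described in the listing above, here is a minimal Python sketch that splits such a string into individual contrasts. The helper `parse_contrasts` is hypothetical and is not part of the module, whose own script handles this parsing; the sketch only shows how the ',' and '|' separators combine.

```python
def parse_contrasts(value):
    """Split 'Field,Treatment,Control|...' into a list of 3-tuples (hypothetical helper)."""
    contrasts = []
    for chunk in value.split("|"):
        parts = [p.strip() for p in chunk.split(",")]
        # Each contrast needs exactly three non-empty items:
        # the Sample Data field, the treatment level and the control level.
        if len(parts) != 3 or not all(parts):
            raise ValueError(f"expected 'Field,Treatment,Control', got {chunk!r}")
        contrasts.append(tuple(parts))
    return contrasts


print(parse_contrasts("Group,Treatment1,Control1|Group,Treatment2,Control2"))
# prints [('Group', 'Treatment1', 'Control1'), ('Group', 'Treatment2', 'Control2')]
```

A single contrast such as "Group,Treatment,Control" yields a one-element list, and a malformed entry (for example, one with only two items) raises a ValueError instead of being silently accepted.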