Microbiology

Modules included in this section

CARD_RGI
cgMLST_and_MLST_typing
Roary
Snippy
Gubbins
Tree_plot

`CARD_RGI`

Authors: Menachem Sklarz
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

A module for running CARD RGI:

RGI is executed on the contigs stored in a Nucleotide fasta file.

Requires

A nucleotide fasta file in one of the following slots:
- sample_data[<sample>]["fasta.nucl"]
- sample_data["fasta.nucl"]

Output

If scope is set to sample:
- Puts output files in:
  
  sample_data[<sample>]["CARD_RGI.json"]
  sample_data[<sample>]["CARD_RGI.tsv"]
- Puts index of output files in:
  
  self.sample_data["project_data"]["CARD_RGI.files_index"]
- If merge_script_path is specified in parameters, puts the merged file in
  
  self.sample_data["project_data"]["CARD_RGI.merged_reports"]
If scope is set to project:
- Puts output files in:
  
  sample_data["CARD_RGI.json"]
  sample_data["CARD_RGI.tsv"]

Parameters that can be set

Parameter	Values	Comments
JSON2tsv_script	path	The path to the CARD script for converting the JSON output to tsv (find ‘convertJsonToTSV.py’ in your RGI installation)
merge_script_path	path	Path to a script that takes an index of RGI output files ('--ind') and a location for the output ('--output'). This script is executed in the wrapping-up stage. (Note: the script can take more parameters. These should be passed together with the path in the parameter file, e.g. 'python /path/to/script --param1 val1 --param2 val2'.) If this parameter is not passed, no action is taken on the output files.
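As an illustration of the interface described for merge_script_path, here is a minimal sketch of a merge script that accepts `--ind` and `--output`. The layout of the index file is not specified above; the sketch ASSUMES each index line holds a sample name and the path to that sample's tsv report, separated by a tab. The function names are hypothetical and this is not the script used in the original study.

```python
import argparse
import csv


def merge_reports(index_path, output_path):
    """Concatenate per-sample TSV reports into one table with a leading Sample column."""
    header_written = False
    with open(output_path, "w", newline="") as out_handle:
        writer = csv.writer(out_handle, delimiter="\t")
        with open(index_path, newline="") as index_handle:
            # ASSUMPTION: each index line is "<sample name><TAB><path to report>".
            for row in csv.reader(index_handle, delimiter="\t"):
                if len(row) < 2:
                    continue  # skip blank or malformed index lines
                sample, report_path = row[0], row[1]
                with open(report_path, newline="") as report_handle:
                    report = csv.reader(report_handle, delimiter="\t")
                    header = next(report, None)
                    if header is None:
                        continue  # empty report, nothing to merge
                    if not header_written:
                        # The header of the first non-empty report is reused for the merged table.
                        writer.writerow(["Sample"] + header)
                        header_written = True
                    for record in report:
                        writer.writerow([sample] + record)


def main(argv=None):
    parser = argparse.ArgumentParser(description="Merge RGI reports (illustrative sketch)")
    parser.add_argument("--ind", required=True, help="index of RGI output files")
    parser.add_argument("--output", required=True, help="where to write the merged table")
    args = parser.parse_args(argv)
    merge_reports(args.ind, args.output)
```

Calling `main()` at the bottom of the file would turn the sketch into a script that could be passed as 'python /path/to/script' in merge_script_path.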

Comments

Lines for parameter file

rgi_inst:
    module: CARD_RGI
    base: spades1
    script_path: python /path/to/rgi.py
    qsub_params:
        -pe: shared 15
    JSON2tsv_script: python /path/to/convertJsonToTSV.py
    merge_script_path: Rscript /path/to/merge_reports.R --variable bit_score
    orf_to_use: -x
    scope: sample
    redirects:
        -n: 20
        -x: 1

References

McArthur, A.G., Waglechner, N., Nizam, F., Yan, A., Azad, M.A., Baylay, A.J., Bhullar, K., Canova, M.J., De Pascale, G., Ejim, L. and Kalan, L., 2013. The comprehensive antibiotic resistance database. Antimicrobial agents and chemotherapy, 57(7), pp.3348-3357.

`cgMLST_and_MLST_typing`

Authors: Liron Levin
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad. The MLST typing R script was created by Menachem Sklarz & Michal Gordon

Short Description

A module for MLST and cgMLST typing.

Requires

Blast results after parsing in:
self.sample_data[<sample>]["blast.parsed"]

Output

Typing results in:
self.sample_data[<sample>]["Typing"]

Merge of typing results in:
self.sample_data["project_data"]["Typing"]

Files for phyloviz in:
self.sample_data["project_data"]["phyloviz_MetaData"]
self.sample_data["project_data"]["phyloviz_Alleles"]

Tree file in newick format (if the --Tree flag is set) in:
self.sample_data["project_data"]["newick"]

Parameters that can be set

Parameter	Values	Comments
cut_samples_not_in_metadata		In the final merge file consider only samples found in the Meta-Data file
sample_cutoff	[0-1]	In the final merge file consider only samples that have at least this fraction of identified alleles
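The sample_cutoff filter can be pictured with a short sketch. It ASSUMES that each sample's typing result is a mapping from locus name to allele call and that unidentified alleles are coded as an empty string or "N"; the actual encoding used by the module is not stated here, and `samples_passing_cutoff` is a hypothetical name.

```python
def samples_passing_cutoff(profiles, sample_cutoff, missing=("", "N")):
    """Return the samples whose fraction of identified alleles is at least sample_cutoff."""
    kept = []
    for sample, alleles in profiles.items():
        if not alleles:
            continue  # no loci at all, nothing to evaluate
        identified = sum(1 for call in alleles.values() if call not in missing)
        if identified / len(alleles) >= sample_cutoff:
            kept.append(sample)
    return kept


# Hypothetical allele calls for four loci.
profiles = {
    "s1": {"locus1": "3", "locus2": "7", "locus3": "1", "locus4": "12"},
    "s2": {"locus1": "3", "locus2": "N", "locus3": "", "locus4": "12"},
}
print(samples_passing_cutoff(profiles, 0.75))  # ['s1']
```

Here s1 has all four alleles identified (fraction 1.0) and s2 only two of four (fraction 0.5), so a cutoff of 0.75 keeps s1 alone.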

Comments

The following python packages are required:
- pandas
The following R packages are required:
- magrittr
- plyr
- optparse
- tools

Note

If using a conda environment with R installed, the R packages will be installed automatically inside the environment.

Lines for parameter file

Step_Name:                                   # Name of this step
    module: cgMLST_and_MLST_typing           # Name of the module to use
    base:                                    # Name of the step [or list of names] to run after [must be after a step that generates blast.parsed File_Types]
    script_path:                             # Leave blank
    metadata:                                # Path to Meta-Data file
    metadata_samples_ID_field:               # Column name in the Meta-Data file of the samples ID
    cut_samples_not_in_metadata:             # In the final merge file consider only samples found in the Meta-Data file
    sample_cutoff:                           # In the final merge file consider only samples that have at least this fraction of identified alleles
    Tree:                                    # Generate newick Tree using hierarchical-clustering [Hamming distance]
    Tree_method:                             # The hierarchical-clustering linkage method [default=complete]
    redirects:
        --scheme:                            # Path to the Typing scheme file [Tab delimited]
        --Type_col_name:                     # Column/s name/s in the scheme file that are not locus names
        --ignore_unidentified_alleles:       # Remove columns with unidentified alleles [default=False]

References

`Roary`

Authors: Liron Levin
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad. The Bi_clustering R script was created by Eliad Levi

Short Description

A module for running Roary on GFF files

Requires

For each Sample, GFF file location in:
- sample_data[<sample>]["GFF"]
If there is a GFF directory in the following slot, no new GFF directory will be created and ONLY the GFF files in this directory will be analysed.
- sample_data["GFF_dir"]
If the search_GFF flag is on, GFF files will be searched for in the directory of the last base step.

Output

Puts the output GFF directory location in the following slot:
- sample_data["GFF"]
Puts the output pan_genome results directory location in the following slot:
- sample_data["pan_genome_results_dir"]
Puts the output pan_genome presence_absence_matrix file location in the following slot:
- sample_data["presence_absence_matrix"]
Puts the output pan_genome clustered_proteins file location in the following slot:
- sample_data["clustered_proteins"]
Puts the output GWAS directory location in the following slot:
- sample_data["GWAS_results_dir"]
Puts the output Biclustering directory location in the following slot:
- sample_data["Bicluster_results_dir"]
Puts the output Biclustering cluster file location in the following slot:
- sample_data["Bicluster_clusters"]
Puts the output Gecko directory location in the following slot:
- sample_data["Gecko_results_dir"]
Puts the hierarchical-clustering tree file of the Accessory genes or the virulence/resistance genes in the following slot:
- self.sample_data["project_data"]["newick"]

Parameters that can be set

Parameter	Values	Comments

Comments

This module was tested on:
- Roary v3.10.2
- Roary v1.006924
- Scoary v1.6.11
- Scoary v1.6.9
- Gecko3
For the Bi_cluster analysis the following R packages are required:
- optparse
- eisa
- ExpressionView
- openxlsx
- clusterProfiler
- org.Hs.eg.db
To plot the pan-genome matrix the following python packages are required:
- pandas
- patsy
- seaborn
- matplotlib
- numpy
- scipy
For the scoary analysis the following python packages are required:
- pandas
For the Gecko analysis the following python packages are required:
- pandas

Note

If using a conda environment with R installed, the R packages will be installed automatically inside the environment.

Lines for parameter file

Step_Name:                                   # Name of this step
    module: Roary                            # Name of the module used
    base:                                    # Name of the step [or list of names] to run after [must be after a GFF file generator step like Prokka]
    script_path:                             # Command for running the Roary script 
    env:                                     # env parameters that need to be in the PATH for running this module
    qsub_params:                             
        -pe:                                 # Number of CPUs to reserve for this analysis
    virulence_resistance_tag:                # Use the name of the db used in prokka or use "VFDB" if you used the VFDB built-in Prokka module DB 
    search_GFF:                              # Search for GFF files?
    Bi_cluster:                              # Run the Bi_cluster analysis on the Roary results; if empty or if this line does not exist, the Bi_cluster analysis is not run
        --Annotation:                        # location of virulence annotation file to use to annotate the clusters or use "VFDB" if you used the VFDB built-in Prokka module DB
        --ID_field:                          # The column name in the MetaData file of the samples IDs
        --cols_to_use:                       # list of the MetaData columns to use to annotate the clusters  example: '"ST","CC","source","host","geographic.location","Date"'
        --metadata:                          # location of MetaData file to use to annotate the clusters
    plot:                                    # plot gene presence/absence matrix
        format:                              # The gene presence/absence matrix plot output format. example: pdf
        Clustering_method:                   # The gene presence/absence matrix plot hierarchical-clustering method. example: ward
        Tree:                                # Save a tree in newick format of the 'Accessory' genes or the 'virulence_resistance_tag' genes hierarchical-clustering
                                             # example: Tree: Accessory 
    scoary:
        script_path:                         # Command for running the scoary script; if empty or if this line does not exist, scoary is not run
        BH_cutoff:                           # Scoary BH correction for multiple testing cut-off
        Bonferroni_cutoff:                   # Scoary Bonferroni correction for multiple testing cut-off
        metadata_file:                       # location of MetaData file to use to create the scoary traits file
        metadata_samples_ID_field:           # The column name in the MetaData file of the sample's IDs
        traits_file:                         # Path to a traits file
        traits_to_pars:                      # If a traits file is not provided, use a list of conditions to create the scoary traits file from the MetaData file. example: "source/=='blood'"  "source/=='wound'"
                                             # Pairs of field and operator + value to convert to boolean traits: field_name1/op_value1 .. field_nameN/op_valueN Example: "field_1/>=val_1<val_2"    "field_2/=='str_val'"
                                             # A Filter can be used by FILTER_field_name1/FILTER_op_value1&field_name1/op_value1
                                             # Note that Gecko can't run if the Bi_clustering was not run
    Gecko:
        script_path:                         # Command for running the Gecko script; if empty or if this line does not exist, Gecko is not run
        -d:                                  # Parameters for running Gecko
        -s:                                  # Parameters for running Gecko
        -q:                                  # Parameters for running Gecko
    redirects:
        -k:                                  # Parameters for running Roary
        -p:                                  # Parameters for running Roary
        -qc:                                 # Parameters for running Roary
        -s:                                  # Parameters for running Roary
        -v:                                  # Parameters for running Roary
        -y:                                  # Parameters for running Roary
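The traits_to_pars conditions above pair a MetaData field with an operator and a value, such as "source/=='blood'". The sketch below shows how such simple conditions could be turned into 0/1 traits per sample. It is an illustration under assumptions, not the module's parser: it handles a single comparison per condition only (not the chained "field_1/>=val_1<val_2" form and not the FILTER syntax), and the `age` column, the row layout, and all function names are hypothetical.

```python
import operator
import re

OPERATORS = {
    "==": operator.eq, "!=": operator.ne,
    ">=": operator.ge, "<=": operator.le,
    ">": operator.gt, "<": operator.lt,
}
# field name, a slash, one comparison operator, then the value
CONDITION = re.compile(r"^(?P<field>[^/]+)/(?P<op>==|!=|>=|<=|>|<)(?P<value>.+)$")


def parse_condition(condition):
    """Split 'field/op_value' into (field, operator symbol, value)."""
    match = CONDITION.match(condition)
    if match is None:
        raise ValueError("unsupported condition: " + condition)
    raw = match.group("value").strip()
    if len(raw) >= 2 and raw[0] == raw[-1] and raw[0] in "'\"":
        value = raw[1:-1]   # quoted value: compare as a string
    else:
        value = float(raw)  # unquoted value: compare as a number
    return match.group("field"), match.group("op"), value


def build_traits(metadata_rows, id_field, conditions):
    """Return {sample ID: {condition: 0 or 1}} for every metadata row."""
    traits = {}
    for row in metadata_rows:
        flags = {}
        for condition in conditions:
            field, op, value = parse_condition(condition)
            cell = float(row[field]) if isinstance(value, float) else row[field]
            flags[condition] = int(OPERATORS[op](cell, value))
        traits[row[id_field]] = flags
    return traits


# Hypothetical MetaData rows; "age" is an invented column for the numeric case.
rows = [
    {"sample": "s1", "source": "blood", "age": "40"},
    {"sample": "s2", "source": "wound", "age": "12"},
]
traits = build_traits(rows, "sample", ["source/=='blood'", "age/>=18"])
print(traits["s1"])  # {"source/=='blood'": 1, 'age/>=18': 1}
```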

References

Roary program: Page, Andrew J., et al. “Roary: rapid large-scale prokaryote pan genome analysis.” Bioinformatics 31.22 (2015): 3691-3693.

Scoary program: Brynildsrud, Ola, et al. “Rapid scoring of genes in microbial pan-genome-wide association studies with Scoary.” Genome biology 17.1 (2016): 238.

Gecko program: Winter, Sascha, et al. “Finding approximate gene clusters with Gecko 3.” Nucleic acids research 44.20 (2016): 9600-9610.

`Snippy`

Authors: Liron Levin
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

Short Description

A module for running Snippy on fastq files

Requires

fastq files in at least one of the following slots:
self.sample_data[<sample>]["fastq.F"]
self.sample_data[<sample>]["fastq.R"]
self.sample_data[<sample>]["fastq.S"]

Output

Puts the results directory location in:
self.sample_data[<sample>]["Snippy"]

Puts the vcf file location for each sample in:
self.sample_data[<sample>]["vcf"]

If snippy_core is set to run:

Puts the core Multi-FASTA alignment location in:
self.sample_data["project_data"]["fasta.nucl"]

Puts the core vcf file location of all analyzed samples in:
self.sample_data["project_data"]["vcf"]

If Gubbins is set to run:

Puts the resulting tree file location of all analyzed samples in:
self.sample_data["project_data"]["newick"]

Updates the core Multi-FASTA alignment in:
self.sample_data["project_data"]["fasta.nucl"]

Updates the core vcf file in:
self.sample_data["project_data"]["vcf"]

If pars is set to run, puts phyloviz ready-to-use files in:

Alleles:
self.sample_data["project_data"]["phyloviz_Alleles"]

MetaData:
self.sample_data["project_data"]["phyloviz_MetaData"]

Parameters that can be set

Parameter	Values	Comments

Comments

This module was tested on:
Snippy v3.2
gubbins v2.2.0

For the pars analysis the following python packages are required:
pandas

Lines for parameter file

Step_Name:                                  # Name of this step
    module: Snippy                          # Name of the module used
    base:                                   # Name of the step [or list of names] to run after [must be after a merge step]
    script_path:                            # Command for running the Snippy script
    env:                                    # env parameters that need to be in the PATH for running this module
    qsub_params:
        -pe:                                # Number of CPUs to reserve for this analysis
    gubbins:
        script_path:                        # Command for running the gubbins script; if empty or if this line does not exist, gubbins is not run
        --STR:                              # More redirects arguments for running gubbins
    phyloviz:                               # Generate phyloviz ready-to-use files
        -M:                                 # Location of a MetaData file
        --Cut:                              # Use only samples found in the metadata file
        --S_MetaData:                       # The name of the samples ID column
        -C:                                 # Use only samples that have at least this fraction of identified alleles
    snippy_core:
        script_path:                        # Command for running the snippy-core script; if empty or if this line does not exist, snippy-core is not run
        --noref:                            # Exclude reference 
    redirects:
        --cpus:                             # Parameters for running Snippy
        --force:                            # Force overwrite of existing output folder (default OFF)
        --mapqual:                          # Minimum mapping quality to allow
        --mincov:                           # Minimum coverage of variant site
        --minfrac:                          # Minimum proportion for variant evidence
        --reference:                        # Reference Genome location
        --cleanup:                          # Remove all non-SNP files: BAMs, indices etc (default OFF)

References

Snippy:
https://github.com/tseemann/snippy

gubbins:
Croucher N. J., Page A. J., Connor T. R., Delaney A. J., Keane J. A., Bentley S. D., Parkhill J., Harris S.R. “Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins”. doi:10.1093/nar/gku1196, Nucleic Acids Research, 2014

`Gubbins`

Authors: Liron Levin
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

Short Description

A module for running Gubbins on a project level nucleotide Multi-FASTA alignment file.

Requires

Project level nucleotide Multi-FASTA alignment file in the following slot:
sample_data["fasta.nucl"]

Output

Puts the resulting tree file location of all analyzed samples in the slot:
self.sample_data["project_data"]["newick"]

Updates the Multi-FASTA alignment in the slot:
self.sample_data["project_data"]["fasta.nucl"]

Puts the filtered vcf file in the slot:
self.sample_data["project_data"]["vcf"]

If pars is set to run, puts phyloviz ready-to-use files in the slots:

Alleles:
self.sample_data["project_data"]["phyloviz_Alleles"]

MetaData:
self.sample_data["project_data"]["phyloviz_MetaData"]

Parameters that can be set

Parameter	Values	Comments

Comments

This module was tested on:
gubbins v2.2.0

For the pars analysis the following python packages are required:
pandas

Lines for parameter file

Step_Name:                                  # Name of this step
    module: Gubbins                         # Name of the module used
    base:                                   # Name of the step [or list of names] to run after [must be after a step that generates a Project level nucleotide Multi-FASTA alignment]
    script_path:                            # Command for running the gubbins script; if empty or if this line does not exist, gubbins is not run
    env:                                    # env parameters that need to be in the PATH for running this module
    qsub_params:
        -pe:                                # Number of CPUs to reserve for this analysis
    phyloviz:                               # Generate phyloviz ready-to-use files
        -M:                                 # Location of a MetaData file
        --Cut:                              # Use only samples found in the metadata file
        --S_MetaData:                       # The name of the samples ID column
        -C:                                 # Use only samples that have at least this fraction of identified alleles
    redirects:
        --threads:                          # Parameters for running Gubbins

References

gubbins:
Croucher N. J., Page A. J., Connor T. R., Delaney A. J., Keane J. A., Bentley S. D., Parkhill J., Harris S.R. “Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins”. doi:10.1093/nar/gku1196, Nucleic Acids Research, 2014

`Tree_plot`

Authors: Liron Levin
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

Note

This module was developed as part of a study led by Dr. Jacob Moran Gilad

Short Description

A module for plotting tree file in newick format together with MetaData information and possible additional matrix information.

Requires

A tree file in newick format in:
self.sample_data["project_data"]["newick"]

A tab-delimited file with sample names in one of its columns, taken from:
self.sample_data["project_data"]["MetaData"]
self.sample_data["project_data"]["results"]
or from an external file.

Output

Generates a PDF file of the tree with the MetaData information.

Parameters that can be set

Parameter	Values	Comments

Comments

The following R packages are required:
optparse
ape
ggtree
openxlsx

Lines for parameter file

Step_Name:                            # Name of this step
    module: Tree_plot                 # Name of the used module
    base:                             # Name of the step [or list of names] to run after and generate a Tree plot [must be after a tree making step]
                                      # If more than one base is specified, the first overwrites the overlapping slots of the other bases
    script_path:                      # Command for running the Tree plot script
                                      # If this line is empty or missing it will try using the module's associated script
    iterate_on_bases:                 # If set will iterate over the step's bases and generate a plot for each base. 
    tree_by_heatmap:                  # Generate additional tree using Hierarchical Clustering of the heatmap
    redirects:
        --layout:                     # Tree layout [fan or rectangular (default)]
        --Meta_Data:                  # Path to tab-delimited Meta Data file with header line. 
                                      # If this line is empty or missing it will try searching for results data.
        --M_Excel:                    # If the Meta_Data input is an Excel file indicate the sheet name to use
        --ID_field:                   # Column name in the Meta Data file for IDs found in the tips of the tree
        --cols_to_use:                # Columns in the Meta Data file to use and the order from the center up  
        --open.angle:                 # Tree open angle.
        --branch.length:              # Don't use branch length [cladogram]
        --conect.tip:                 # Connect the tip to its label
        --pre_spacer:                 # Space before the label text [default=0.05]
        --post_spacer:                # Space after the label text [default=0.01]
        --OTU:                        # Column name in the Meta Data file to use as OTU annotation
        --labels:                     # Use branch length labels
        --Tip_labels:                 # Show tip labels
        --heatmap:                    # Path to Data file to generate a heatmap 
                                      # If this line is empty it will try searching for results data.
        --H_Excel:                    # If the heatmap input is an Excel file indicate the sheet name to use
        --heatmap_cell_border:        # Color of heatmap cell border [default='white']
        --heatmap_lowest_value:       # Color of heatmap lowest value [default='white']
        --heatmap_highest_value:      # Color of heatmap highest value [default='red']
        --cols_to_use_heatmap:        # Columns in the heatmap Data file to use and the order from the center up
        --ID_heatmap_field:           # Column name for IDs found in the tips of the tree in the heatmap Data file
        --heatmap_variable:           # Use only variable columns in the heatmap
        --heatmap_count_by_sep:       # Count the sep in each cell to generate the values for the heatmap
        --heatmap_HC_dist:            # The heatmap Hierarchical Clustering dist method
        --heatmap_HC_agg:             # The heatmap Hierarchical Clustering agglomeration method
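Because --ID_field must name a Meta Data column whose values match the tip labels of the tree, it can help to check the two against each other before plotting. The following is a minimal sketch of such a check and is not part of the module. It ASSUMES unquoted tip labels without whitespace and Meta Data rows already loaded as dictionaries; the function names are hypothetical.

```python
import re

# A tip label directly follows "(" or ","; internal node labels follow ")" and are skipped.
TIP_LABEL = re.compile(r"[(,]\s*([^(),:;\s]+)")


def tip_labels(newick):
    """Return the tip labels of a newick string, in order of appearance."""
    return TIP_LABEL.findall(newick)


def missing_from_metadata(newick, metadata_rows, id_field):
    """Return the tip labels that have no matching ID in the metadata rows."""
    known = {row[id_field] for row in metadata_rows}
    return [tip for tip in tip_labels(newick) if tip not in known]


# Hypothetical tree and Meta Data rows.
tree = "((s1:0.1,s2:0.2):0.3,s3:0.4);"
metadata = [{"Sample": "s1", "host": "human"}, {"Sample": "s3", "host": "cattle"}]
print(tip_labels(tree))                                   # ['s1', 's2', 's3']
print(missing_from_metadata(tree, metadata, "Sample"))    # ['s2']
```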
