Microbiology
CARD_RGI
- Authors
Menachem Sklarz
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
Note
This module was developed as part of a study led by Dr. Jacob Moran Gilad
A module for running CARD RGI:
RGI is executed on the contigs stored in a nucleotide fasta file.
Requires
A nucleotide fasta file in one of the following slots:
sample_data[<sample>]["fasta.nucl"]
sample_data["fasta.nucl"]
Output
If scope is set to sample:
Puts output files in:
sample_data[<sample>]["CARD_RGI.json"]
sample_data[<sample>]["CARD_RGI.tsv"]
Puts index of output files in:
self.sample_data["project_data"]["CARD_RGI.files_index"]
If merge_script_path is specified in the parameters, puts the merged file in:
self.sample_data["project_data"]["CARD_RGI.merged_reports"]
If scope is set to project:
Puts output files in:
sample_data["CARD_RGI.json"]
sample_data["CARD_RGI.tsv"]
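The merging of per-sample reports is done by the external script passed in merge_script_path (an R script in the parameter-file example). As an illustration only, here is a minimal Python sketch of the same idea; the column names Best_Hit_ARO and Best_Hit_Bitscore are assumed RGI-style names, not guaranteed by this module:

```python
# Hypothetical sketch: merge per-sample RGI TSV reports into one
# gene-by-sample bit-score matrix. Column names are assumptions.
import pandas as pd

def merge_rgi_reports(tsv_paths, sample_names,
                      gene_col="Best_Hit_ARO", value_col="Best_Hit_Bitscore"):
    """Return a DataFrame: rows = ARO hits, columns = samples,
    values = best bit score per gene (0 when the gene is absent)."""
    columns = []
    for path, sample in zip(tsv_paths, sample_names):
        report = pd.read_csv(path, sep="\t")
        # Keep the best-scoring hit per gene, labelled with the sample name
        columns.append(report.groupby(gene_col)[value_col].max().rename(sample))
    return pd.concat(columns, axis=1).fillna(0)
```

The resulting matrix is the shape the merged-reports slot is meant to hold: one row per resistance gene, one column per sample.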
Parameters that can be set
Parameter | Values | Comments
---|---|---
JSON2tsv_script | path | Path to the script that converts the RGI JSON output to TSV (e.g. convertJsonToTSV.py)
merge_script_path | path | Path to the script that merges the per-sample reports into one file
Lines for parameter file
rgi_inst:
    module: CARD_RGI
    base: spades1
    script_path: python /path/to/rgi.py
    qsub_params:
        -pe: shared 15
    JSON2tsv_script: python /path/to/convertJsonToTSV.py
    merge_script_path: Rscript /path/to/merge_reports.R --variable bit_score
    orf_to_use: -x
    scope: sample
    redirects:
        -n: 20
        -x: 1
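The JSON2tsv_script step is performed by the convertJsonToTSV.py script shipped with RGI; the general flattening idea can be sketched roughly like this (a hypothetical helper with invented field names, not the real converter's logic):

```python
# Hypothetical sketch of the JSON-to-TSV idea: flatten a JSON report of
# per-contig hits into tab-separated rows. Field names are invented for
# illustration; the real converter ships with RGI.
import json

def json_report_to_tsv(json_path, tsv_path, fields):
    """Write one TSV row per hit, keeping the requested fields."""
    with open(json_path) as fh:
        report = json.load(fh)
    with open(tsv_path, "w") as out:
        out.write("\t".join(["contig"] + fields) + "\n")
        for contig, hits in report.items():
            for hit in hits.values():
                row = [contig] + [str(hit.get(f, "")) for f in fields]
                out.write("\t".join(row) + "\n")
```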
References
McArthur, A.G., Waglechner, N., Nizam, F., Yan, A., Azad, M.A., Baylay, A.J., Bhullar, K., Canova, M.J., De Pascale, G., Ejim, L. and Kalan, L., 2013. The comprehensive antibiotic resistance database. Antimicrobial agents and chemotherapy, 57(7), pp.3348-3357.
cgMLST_and_MLST_typing
- Authors
Liron Levin
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
Note
This module was developed as part of a study led by Dr. Jacob Moran Gilad. The MLST typing R script was created by Menachem Sklarz & Michal Gordon
Short Description
A module for MLST and cgMLST typing
Requires
- Blast results after parsing in:
self.sample_data[<sample>]["blast.parsed"]
Output
- Typing results in:
self.sample_data[<sample>]["Typing"]
- Merge of typing results in:
self.sample_data["project_data"]["Typing"]
- Files for phyloviz in:
self.sample_data["project_data"]["phyloviz_MetaData"]
self.sample_data["project_data"]["phyloviz_Alleles"]
- Tree file (if the Tree flag is set) in newick format in:
self.sample_data["project_data"]["newick"]
Parameters that can be set
Parameter | Values | Comments
---|---|---
cut_samples_not_in_metadata | | In the final merged file, consider only samples found in the metadata file
sample_cutoff | [0-1] | In the final merged file, consider only samples that have at least this fraction of identified alleles
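The sample_cutoff behaviour can be pictured with a small sketch (a hypothetical pandas helper; the module implements this in its own scripts):

```python
# Hypothetical sketch of the sample_cutoff idea: drop samples whose
# fraction of identified (non-missing) alleles falls below the cutoff.
import pandas as pd

def filter_by_allele_fraction(typing, cutoff):
    """typing: DataFrame of samples (rows) x loci (columns); missing
    alleles are NaN. Keep rows with at least `cutoff` identified."""
    identified_fraction = typing.notna().mean(axis=1)
    return typing.loc[identified_fraction >= cutoff]
```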
Comments
- The following python packages are required:
pandas
- The following R packages are required:
magrittr
plyr
optparse
tools
Note
If using a conda environment with R installed, the R packages will be installed automatically inside the environment.
Lines for parameter file
Step_Name:  # Name of this step
    module: cgMLST_and_MLST_typing  # Name of the module to use
    base:  # Name of the step [or list of names] to run after [must be after steps that generate blast.parsed file types]
    script_path:  # Leave blank
    metadata:  # Path to metadata file
    metadata_samples_ID_field:  # Column name in the metadata file of the sample IDs
    cut_samples_not_in_metadata:  # In the final merged file, consider only samples found in the metadata file
    sample_cutoff:  # In the final merged file, consider only samples that have at least this fraction of identified alleles
    Tree:  # Generate a newick tree using hierarchical clustering [Hamming distance]
    Tree_method:  # The hierarchical-clustering linkage method [default=complete]
    redirects:
        --scheme:  # Path to the typing scheme file [tab delimited]
        --Type_col_name:  # Column name(s) in the scheme file that are not locus names
        --ignore_unidentified_alleles  # Remove columns with unidentified alleles [default=False]
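The Tree option clusters the allele profiles with Hamming distance and writes a newick tree. A minimal sketch of that idea (a hypothetical scipy-based helper, not the module's own R code):

```python
# Hypothetical sketch: hierarchical clustering of allele profiles with
# Hamming distance, emitted as a newick string. This only illustrates
# the idea behind the Tree option.
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree
from scipy.spatial.distance import pdist

def typing_tree_newick(allele_matrix, sample_names, method="complete"):
    """allele_matrix: samples x loci array of allele numbers."""
    distances = pdist(np.asarray(allele_matrix), metric="hamming")
    root = to_tree(linkage(distances, method=method))

    def newick(node, parent_dist):
        branch = parent_dist - node.dist     # branch length to the parent
        if node.is_leaf():
            return "%s:%.4f" % (sample_names[node.id], branch)
        return "(%s,%s):%.4f" % (newick(node.left, node.dist),
                                 newick(node.right, node.dist), branch)

    return newick(root, root.dist) + ";"
```

The `method` argument corresponds to the Tree_method parameter above (default complete).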
References
Roary
- Authors
Liron Levin
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
Note
This module was developed as part of a study led by Dr. Jacob Moran Gilad. The Bi_clustering R script was created by Eliad Levi
Short Description
A module for running Roary on GFF files
Requires
- For each sample, the GFF file location in:
sample_data[<sample>]["GFF"]
- If a GFF directory exists in the following slot, no new GFF directory will be created, and ONLY the GFF files in this directory will be analysed:
sample_data["GFF_dir"]
If the search_GFF flag is on, GFF files will be searched for in the last base name directory.
Output
- Puts the output GFF directory location in the following slot:
sample_data["GFF"]
- Puts the output pan-genome results directory location in the following slot:
sample_data["pan_genome_results_dir"]
- Puts the output pan-genome presence_absence_matrix file location in the following slot:
sample_data["presence_absence_matrix"]
- Puts the output pan-genome clustered_proteins file location in the following slot:
sample_data["clustered_proteins"]
- Puts the output GWAS directory location in the following slot:
sample_data["GWAS_results_dir"]
- Puts the output Biclustering directory location in the following slot:
sample_data["Bicluster_results_dir"]
- Puts the output Biclustering cluster file location in the following slot:
sample_data["Bicluster_clusters"]
- Puts the output Gecko directory location in the following slot:
sample_data["Gecko_results_dir"]
- Puts the accessory genes or virulence/resistance hierarchical-clustering tree file in the following slot:
self.sample_data["project_data"]["newick"]
Parameters that can be set
Parameter | Values | Comments
---|---|---
Comments
- This module was tested on:
Roary v3.10.2
Roary v1.006924
Scoary v1.6.11
Scoary v1.6.9
Gecko3
- For the Bi_cluster analysis the following R packages are required:
optparse
eisa
ExpressionView
openxlsx
clusterProfiler
org.Hs.eg.db
- To plot the pan-genome matrix the following python packages are required:
pandas
patsy
seaborn
matplotlib
numpy
scipy
- For the scoary analysis the following python packages are required:
pandas
- For the Gecko analysis the following python packages are required:
pandas
Note
If using a conda environment with R installed, the R packages will be installed automatically inside the environment.
Lines for parameter file
Step_Name:  # Name of this step
    module: Roary  # Name of the module used
    base:  # Name of the step [or list of names] to run after [must be after a GFF file generator step such as Prokka]
    script_path:  # Command for running the Roary script
    env:  # env parameters that need to be in the PATH for running this module
    qsub_params:
        -pe:  # Number of CPUs to reserve for this analysis
    virulence_resistance_tag:  # Name of the DB used in Prokka, or "VFDB" if you used the VFDB built-in Prokka module DB
    search_GFF:  # Search for GFF files?
    Bi_cluster:  # Do Bi_cluster analysis using the Roary results; if empty or absent, the Bi_cluster analysis will not run
        --Annotation:  # Location of a virulence annotation file used to annotate the clusters, or "VFDB" if you used the VFDB built-in Prokka module DB
        --ID_field:  # The column name in the metadata file of the sample IDs
        --cols_to_use:  # List of metadata columns used to annotate the clusters, e.g. '"ST","CC","source","host","geographic.location","Date"'
        --metadata:  # Location of a metadata file used to annotate the clusters
    plot:  # Plot the gene presence/absence matrix
        format:  # The gene presence/absence matrix plot output format, e.g. pdf
        Clustering_method:  # The gene presence/absence matrix plot hierarchical-clustering method, e.g. ward
    Tree:  # Save a tree in newick format of the hierarchical clustering of the 'Accessory' genes or the 'virulence_resistance_tag' genes
           # example: Tree: Accessory
    scoary:
        script_path:  # Command for running the Scoary script; if empty or absent, Scoary will not run
        BH_cutoff:  # Scoary BH correction for multiple testing cut-off
        Bonferroni_cutoff:  # Scoary Bonferroni correction for multiple testing cut-off
        metadata_file:  # Location of a metadata file used to create the Scoary traits file
        metadata_samples_ID_field:  # The column name in the metadata file of the sample IDs
        traits_file:  # Path to a traits file
        traits_to_pars:  # If a traits file is not provided, a list of conditions used to create the Scoary traits file from the metadata file, e.g. "source/=='blood'" "source/=='wound'"
                         # Pairs of field and operator + value to convert to boolean traits: field_name1/op_value1 .. field_nameN/op_valueN  Example: "field_1/>=val_1<val_2" "field_2/=='str_val'"
                         # A filter can be used by FILTER_field_name1/FILTER_op_value1&field_name1/op_value1
    Gecko:  # Note that Gecko can't run if the Bi_clustering was not run
        script_path:  # Command for running the Gecko script; if empty or absent, Gecko will not run
        -d:  # Parameters for running Gecko
        -s:  # Parameters for running Gecko
        -q:  # Parameters for running Gecko
    redirects:
        -k:  # Parameters for running Roary
        -p:  # Parameters for running Roary
        -qc:  # Parameters for running Roary
        -s:  # Parameters for running Roary
        -v:  # Parameters for running Roary
        -y:  # Parameters for running Roary
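The plot option visualises the binary form of Roary's presence_absence_matrix output. A rough sketch of that conversion (assuming Roary's usual 14 leading metadata columns; verify the count against your Roary version):

```python
# Hypothetical sketch: read Roary's gene_presence_absence.csv into a
# binary gene x sample matrix (1 = present). The number of metadata
# columns before the sample columns (14) is an assumption.
import pandas as pd

def presence_absence_matrix(csv_path, n_meta_cols=14):
    table = pd.read_csv(csv_path)
    samples = table.iloc[:, n_meta_cols:]   # per-sample gene-name columns
    binary = samples.notna().astype(int)    # non-empty cell -> gene present
    binary.index = table.iloc[:, 0]         # first column holds gene names
    return binary
```

A matrix like this is also what the hierarchical clustering behind the Tree and plot options would operate on.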
References
Roary program: Page, Andrew J., et al. “Roary: rapid large-scale prokaryote pan genome analysis.” Bioinformatics 31.22 (2015): 3691-3693.
Scoary program: Brynildsrud, Ola, et al. “Rapid scoring of genes in microbial pan-genome-wide association studies with Scoary.” Genome biology 17.1 (2016): 238.
Gecko program: Winter, Sascha, et al. “Finding approximate gene clusters with Gecko 3.” Nucleic acids research 44.20 (2016): 9600-9610.
Snippy
- Authors
Liron Levin
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
Note
This module was developed as part of a study led by Dr. Jacob Moran Gilad
Short Description
A module for running Snippy on fastq files
Requires
- fastq files in at least one of the following slots:
self.sample_data[<sample>]["fastq.F"]
self.sample_data[<sample>]["fastq.R"]
self.sample_data[<sample>]["fastq.S"]
Output
- Puts the results directory location in:
self.sample_data[<sample>]["Snippy"]
- Puts each sample's vcf file location in:
self.sample_data[<sample>]["vcf"]
- If snippy_core is set to run:
- Puts the core Multi-FASTA alignment location in:
self.sample_data["project_data"]["fasta.nucl"]
- Puts the core vcf file location of all analyzed samples in the following slot:
self.sample_data["project_data"]["vcf"]
- If Gubbins is set to run:
- Puts the resulting tree file location of all analyzed samples in:
self.sample_data["project_data"]["newick"]
- Updates the core Multi-FASTA alignment in:
self.sample_data["project_data"]["fasta.nucl"]
- Updates the core vcf file in the slot:
self.sample_data["project_data"]["vcf"]
- If pars is set to run, puts phyloviz ready-to-use files in:
- Alleles:
self.sample_data["project_data"]["phyloviz_Alleles"]
- MetaData:
self.sample_data["project_data"]["phyloviz_MetaData"]
Parameters that can be set
Parameter | Values | Comments
---|---|---
Comments
- This module was tested on:
Snippy v3.2
gubbins v2.2.0
- For the pars analysis the following python packages are required:
pandas
Lines for parameter file
Step_Name:  # Name of this step
    module: Snippy  # Name of the module used
    base:  # Name of the step [or list of names] to run after [must be after a merge step]
    script_path:  # Command for running the Snippy script
    env:  # env parameters that need to be in the PATH for running this module
    qsub_params:
        -pe:  # Number of CPUs to reserve for this analysis
    gubbins:
        script_path:  # Command for running the Gubbins script; if empty or absent, Gubbins will not run
        --STR:  # More redirect arguments for running Gubbins
    phyloviz:  # Generate phyloviz ready-to-use files
        -M:  # Location of a metadata file
        --Cut:  # Use only samples found in the metadata file
        --S_MetaData:  # The name of the sample ID column
        -C:  # Use only samples that have at least this fraction of identified alleles
    snippy_core:
        script_path:  # Command for running the snippy-core script; if empty or absent, snippy-core will not run
        --noref:  # Exclude reference
    redirects:
        --cpus:  # Parameters for running Snippy
        --force:  # Force overwrite of existing output folder (default OFF)
        --mapqual:  # Minimum mapping quality to allow
        --mincov:  # Minimum coverage of variant site
        --minfrac:  # Minimum proportion for variant evidence
        --reference:  # Reference genome location
        --cleanup  # Remove all non-SNP files: BAMs, indices etc. (default OFF)
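The pars step builds phyloviz tables from the core alignment. The transformation can be sketched roughly like this (a hypothetical helper; the actual file format the module writes may differ):

```python
# Hypothetical sketch: encode each variable column of a core Multi-FASTA
# alignment as integer allele codes, one row per sample - roughly the
# shape of a phyloviz alleles table.
import pandas as pd

def alignment_to_allele_codes(sequences):
    """sequences: dict of sample name -> aligned sequence (equal lengths)."""
    names = sorted(sequences)
    length = len(next(iter(sequences.values())))
    table = {}
    for pos in range(length):
        column = [sequences[name][pos] for name in names]
        if len(set(column)) > 1:            # keep only variable sites
            codes = {}                      # first base seen -> 1, next -> 2 ...
            table["pos_%d" % (pos + 1)] = [
                codes.setdefault(base, len(codes) + 1) for base in column]
    return pd.DataFrame(table, index=names)
```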
References
- Snippy:
- gubbins:
Croucher N. J., Page A. J., Connor T. R., Delaney A. J., Keane J. A., Bentley S. D., Parkhill J., Harris S.R. “Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins”. doi:10.1093/nar/gku1196, Nucleic Acids Research, 2014
Gubbins
- Authors
Liron Levin
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
Note
This module was developed as part of a study led by Dr. Jacob Moran Gilad
Short Description
A module for running Gubbins on a project level nucleotide Multi-FASTA alignment file.
Requires
- Project level nucleotide Multi-FASTA alignment file in the following slot:
sample_data["fasta.nucl"]
Output
- Puts the resulting tree file location of all analyzed samples in the slot:
self.sample_data["project_data"]["newick"]
- Updates the Multi-FASTA alignment in the slot:
self.sample_data["project_data"]["fasta.nucl"]
- Puts the filtered vcf file in the slot:
self.sample_data["project_data"]["vcf"]
- If pars is set to run, puts phyloviz ready-to-use files in the slots:
- Alleles:
self.sample_data["project_data"]["phyloviz_Alleles"]
- MetaData:
self.sample_data["project_data"]["phyloviz_MetaData"]
Parameters that can be set
Parameter | Values | Comments
---|---|---
Comments
- This module was tested on:
gubbins v2.2.0
- For the pars analysis the following python packages are required:
pandas
Lines for parameter file
Step_Name:  # Name of this step
    module: Gubbins  # Name of the module used
    base:  # Name of the step [or list of names] to run after [must be after a step that generates a project-level nucleotide Multi-FASTA alignment]
    script_path:  # Command for running the Gubbins script; if empty or missing, Gubbins will not run
    env:  # env parameters that need to be in the PATH for running this module
    qsub_params:
        -pe:  # Number of CPUs to reserve for this analysis
    phyloviz:  # Generate phyloviz ready-to-use files
        -M:  # Location of a metadata file
        --Cut:  # Use only samples found in the metadata file
        --S_MetaData:  # The name of the sample ID column
        -C:  # Use only samples that have at least this fraction of identified alleles
    redirects:
        --threads:  # Parameters for running Gubbins
References
- gubbins:
Croucher N. J., Page A. J., Connor T. R., Delaney A. J., Keane J. A., Bentley S. D., Parkhill J., Harris S.R. “Rapid phylogenetic analysis of large samples of recombinant bacterial whole genome sequences using Gubbins”. doi:10.1093/nar/gku1196, Nucleic Acids Research, 2014
Tree_plot
- Authors
Liron Levin
- Affiliation
Bioinformatics core facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
Note
This module was developed as part of a study led by Dr. Jacob Moran Gilad
Short Description
A module for plotting a tree file in newick format together with metadata information and optional additional matrix information.
Requires
- A tree file in newick format in:
self.sample_data["project_data"]["newick"]
- A tab-delimited file with sample names in one of its columns, from:
self.sample_data["project_data"]["MetaData"]
self.sample_data["project_data"]["results"]
or from an external file.
Output
Generates a pdf file of the tree with the metadata information.
Parameters that can be set
Parameter | Values | Comments
---|---|---
Comments
- The following R packages are required:
optparse
ape
ggtree
openxlsx
Lines for parameter file
Step_Name:  # Name of this step
    module: Tree_plot  # Name of the module used
    base:  # Name of the step [or list of names] to run after and generate a tree plot for [must be after a tree-making step]
           # If more than one base is specified, the first overwrites the overlapping slots of the other bases
    script_path:  # Command for running the tree plot script
                  # If this line is empty or missing, the module's associated script will be used
    iterate_on_bases:  # If set, iterate over the step's bases and generate a plot for each base
    tree_by_heatmap:  # Generate an additional tree using hierarchical clustering of the heatmap
    redirects:
        --layout:  # Tree layout [fan or rectangular (default)]
        --Meta_Data:  # Path to a tab-delimited metadata file with a header line
                      # If this line is empty or missing, results data will be searched for
        --M_Excel:  # If the Meta_Data input is an Excel file, the sheet name to use
        --ID_field:  # Column name in the metadata file for the IDs found in the tips of the tree
        --cols_to_use:  # Columns in the metadata file to use, in order from the center up
        --open.angle:  # Tree open angle
        --branch.length:  # Don't use branch length [cladogram]
        --conect.tip:  # Connect the tip to its label
        --pre_spacer:  # Space before the label text [default=0.05]
        --post_spacer:  # Space after the label text [default=0.01]
        --OTU:  # Column name in the metadata file to use as OTU annotation
        --labels:  # Use branch length labels
        --Tip_labels:  # Show tip labels
        --heatmap:  # Path to a data file used to generate a heatmap
                    # If this line is empty, results data will be searched for
        --H_Excel:  # If the heatmap input is an Excel file, the sheet name to use
        --heatmap_cell_border:  # Color of the heatmap cell border [default='white']
        --heatmap_lowest_value:  # Color of the heatmap lowest value [default='white']
        --heatmap_highest_value:  # Color of the heatmap highest value [default='red']
        --cols_to_use_heatmap:  # Columns in the heatmap data file to use, in order from the center up
        --ID_heatmap_field:  # Column name in the heatmap data file for the IDs found in the tips of the tree
        --heatmap_variable:  # Use only variable columns in the heatmap
        --heatmap_count_by_sep:  # Count the sep in each cell to generate the values for the heatmap
        --heatmap_HC_dist:  # The heatmap hierarchical-clustering distance method
        --heatmap_HC_agg:  # The heatmap hierarchical-clustering agglomeration method
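The --heatmap_count_by_sep option turns text cells into counts before the heatmap is drawn; the intended transformation is roughly this (a hypothetical pandas sketch, not the module's R code):

```python
# Hypothetical sketch of the --heatmap_count_by_sep idea: replace each
# cell with the number of sep-separated items it contains, so "A;B;C"
# becomes 3 and an empty cell becomes 0.
import pandas as pd

def count_by_sep(table, sep=";"):
    def n_items(cell):
        if pd.isna(cell) or str(cell) == "":
            return 0
        return str(cell).count(sep) + 1
    return table.apply(lambda col: col.map(n_items))
```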