Shotgun Metagenomics
- Author
Menachem Sklarz
- Affiliation
Bioinformatics Core Facility
- Organization
National Institute of Biotechnology in the Negev, Ben Gurion University.
Module categories
A workflow for executing various analyses on metagenomics data.
The workflow uses two approaches:
Analysis of the raw reads and
assembly of the reads and analysis of the assembled contigs.
Developed as part of a study led by Prof. Jacob Moran-Gilad.
Steps
- Analysis of the raw reads with:
kraken2
metaphlan2
kaiju
HUMAnN2
The output from the former three programs is also plotted with
krona
.
- Assembly and analysis of the assembled reads:
Assembly is done per-sample with
spades
.The assemblies are quality-tested with
quast
.Assemblies are annotated with
Prokka
.Antibiotic resistance is determined with
CARD_RGI
.Not included. Resistance and virulence can also be determined by BLASTing AR and virulence databases against the assemblies. See module BLAST.
Workflow Schema
Requires
fastq files. Paired end or single-end.
Programs required
All the programs used in this workflow can be installed with conda. See section Quick start with conda.
Example of Sample File
Title Metagenomics
#SampleID Type Path lane
Sample1 Forward /path/to/Sample1_F1.fastq.gz 1
Sample1 Forward /path/to/Sample1_F2.fastq.gz 2
Sample1 Reverse /path/to/Sample1_R1.fastq.gz 1
Sample1 Reverse /path/to/Sample1_R2.fastq.gz 2
Sample2 Forward /path/to/Sample2_F1.fastq.gz 1
Sample2 Reverse /path/to/Sample2_R1.fastq.gz 1
Sample2 Forward /path/to/Sample2_F2.fastq.gz 2
Sample2 Reverse /path/to/Sample2_R2.fastq.gz 2
Download
The workflow file is available here
Quick start with conda
For easy setup of the workflow, including a sample dataset, use the following instructions for complete installation with conda:
Download the conda environment definition file:
You can
download the Metagenomics_conda.yaml file
here, or programatically with:curl -LO https://raw.githubusercontent.com/bioinfo-core-BGU/neatseq-flow-modules/master/docs/source/_extra/Metagenomics_conda.yaml
Create and activate a conda environment with all the required programs:
conda env create -f Metagenomics_conda.yaml source activate Metagenomics
Create a sample file. It should look like the following, only the file names should be replaced with absolute file names:
Title Trinity_example #SampleID Type Path Sample1 Forward 00.Raw_reads/reads.left.fq.gz Sample1 Reverse 00.Raw_reads/reads.right.fq.gz
Tip
To get the full path to a file, use the following command:
readlink -f 00.Raw_reads/reads.left.fq.gz
Create a directory for your databases. Save the location of the directory in
$DBDIR
.export DBDIR=/path/to/databases_dir mkdir -p $DBDIR
Install required databases
Warning
Installing the databases requires about 220 GB of disk space!
Tip
File
Metagenomics_DBinstall_cmds.sh
contains a script for installing all the databases described below.Execution might take a while due to the large datasetb being downloaded, therefore it is recommended to execute as follows (After setting $DBDIR!!!):
curl -LO https://raw.githubusercontent.com/bioinfo-core-BGU/neatseq-flow-modules/master/docs/source/_extra/Metagenomics_DBinstall_cmds.sh nohup bash Metagenomics_DBinstall_cmds.sh &
- MetaPhlAn2
Running MetaPhlAn2 will download the database for you:
metaphlan2.py \ --input_type fastq \ --bowtie2_exe bowtie2 \ --bowtie2db $DBDIR/MetaPhlAn_temp
- Kraken2
Installing Kraken2 database takes a long time and requires about 100 GB of disk space.
mkdir -p $DBDIR/kraken2 kraken2-build \ --standard \ --threads 10 \ --db $DBDIR/kraken2
Attention
If
rsync
dosen’t work for you, you can try adding the--use-ftp
to thekraken2-build
command to usewget
instead.- krona
ktUpdateTaxonomy.sh $DBDIR/krona/taxonomy
- Kaiju
Kaiju provides different databases for downloading. To get a list of options, just execute
kaiju-makedb
with no arguments:The following commands demonstrate how to get the
nr
database including eukaryotes (nr_euk
) and theprogenomes
database.mkdir -p $DBDIR/kaiju cd $DBDIR/kaiju kaiju-makedb -s progenomes -t 10 kaiju-makedb -s nr_euk -t 10 cd -
- HUMAnN2
Online help on downloading databases.
mkdir -p databases/HUMAnN2 humann2_databases --download chocophlan full $DBDIR/HUMAnN2 humann2_databases --download uniref uniref90_diamond $DBDIR/HUMAnN2/uniref90 humann2_databases --download uniref uniref50_diamond $DBDIR/HUMAnN2/uniref50 humann2_config --update database_folders nucleotide $DBDIR/HUMAnN2/chocophlan humann2_config --update database_folders protein $DBDIR/HUMAnN2/uniref90
Attention
The commands download the recommended translated databases. For other options, see the Download a translated search database section of the HUMAnN2 tutorial.
Get
the parameter file
with:curl -LO https://raw.githubusercontent.com/bioinfo-core-BGU/neatseq-flow-modules/master/Workflows/Metagenomics.yaml
Settings to set in the parameter file
You will have to make some changes to the parameter file to suit your needs:
Set the parameters in the
Global_params
section to suit your cluster. Alternatively, setExecutor
toLocal
for running on a single machine.In the
Vars
section, setdatabase_prefix
to the location of your databases dir, which is the value of$DBDIR
set above. If$DBDIR
is set, you can use the followingsed
command to set thedatabase_prefix
correctly:sed -i s+\$DBdir+$DBDIR+ Metagenomics.yaml
In
Vars.databases.kaiju
, you will have to make sure the value offmi
fits the database you decide to use. In the provided parameter file, thenr_euk
is set. The equivalentfmi
value for theprogenomes
database is commented out.Go over the
redirects
sections in the parameter file and make sure they are set according to your requirements.If you have a fasta file with sequences to search for within your metagenome assemblies, set the
proteins_of_interest
variable to the full path to that file. If not, you can delete or uncomment theSKIP
line in stepsmake_blast_db_per_assembly
,blast_proteins_vs_assemblies
andparse_blast
.
In the conda definitions (line 46), set
base:
to the path to the conda installation which you used to install the environment.You can get the path by executing the following command, when inside the Metagenomics conda environment:
echo $CONDA_EXE | sed -e 's/\/bin\/conda$//g'
Tip
See also this nice presentation by Galeb Abu-Ali, Eric Franzosa and Curtis Huttenhower