Bioinformatics Core Facility
National Institute of Biotechnology in the Negev, Ben Gurion University.
A workflow for executing various analyses on metagenomics data.
The workflow uses two approaches:
Analysis of the raw reads and
assembly of the reads and analysis of the assembled contigs.
Developed as part of a study led by Prof. Jacob Moran-Gilad.
- Analysis of the raw reads with:
The output from the former three programs is also plotted with
- Assembly and analysis of the assembled reads:
Assembly is done per-sample with
The assemblies are quality-tested with
Assemblies are annotated with
Antibiotic resistance is determined with
Not included. Resistance and virulence can also be determined by BLASTing AR and virulence databases against the assemblies. See module BLAST.
fastq files. Paired end or single-end.
All the programs used in this workflow can be installed with conda. See section Quick start with conda.
Title Metagenomics #SampleID Type Path lane Sample1 Forward /path/to/Sample1_F1.fastq.gz 1 Sample1 Forward /path/to/Sample1_F2.fastq.gz 2 Sample1 Reverse /path/to/Sample1_R1.fastq.gz 1 Sample1 Reverse /path/to/Sample1_R2.fastq.gz 2 Sample2 Forward /path/to/Sample2_F1.fastq.gz 1 Sample2 Reverse /path/to/Sample2_R1.fastq.gz 1 Sample2 Forward /path/to/Sample2_F2.fastq.gz 2 Sample2 Reverse /path/to/Sample2_R2.fastq.gz 2
The workflow file is available
For easy setup of the workflow, including a sample dataset, use the following instructions for complete installation with conda:
Download the conda environment definition file:
download the Metagenomics_conda.yaml filehere, or programatically with:
curl -LO https://raw.githubusercontent.com/bioinfo-core-BGU/neatseq-flow-modules/master/docs/source/_extra/Metagenomics_conda.yaml
Create and activate a conda environment with all the required programs:
conda env create -f Metagenomics_conda.yaml source activate Metagenomics
Create a sample file. It should look like the following, only the file names should be replaced with absolute file names:
Title Trinity_example #SampleID Type Path Sample1 Forward 00.Raw_reads/reads.left.fq.gz Sample1 Reverse 00.Raw_reads/reads.right.fq.gz
To get the full path to a file, use the following command:
readlink -f 00.Raw_reads/reads.left.fq.gz
Create a directory for your databases. Save the location of the directory in
export DBDIR=/path/to/databases_dir mkdir -p $DBDIR
Install required databases
Installing the databases requires about 220 GB of disk space!
Metagenomics_DBinstall_cmds.shcontains a script for installing all the databases described below.
Execution might take a while due to the large datasetb being downloaded, therefore it is recommended to execute as follows (After setting $DBDIR!!!):
curl -LO https://raw.githubusercontent.com/bioinfo-core-BGU/neatseq-flow-modules/master/docs/source/_extra/Metagenomics_DBinstall_cmds.sh nohup bash Metagenomics_DBinstall_cmds.sh &
Running MetaPhlAn2 will download the database for you:
metaphlan2.py \ --input_type fastq \ --bowtie2_exe bowtie2 \ --bowtie2db $DBDIR/MetaPhlAn_temp
Installing Kraken2 database takes a long time and requires about 100 GB of disk space.
mkdir -p $DBDIR/kraken2 kraken2-build \ --standard \ --threads 10 \ --db $DBDIR/kraken2
rsyncdosen’t work for you, you can try adding the
kraken2-buildcommand to use
Kaiju provides different databases for downloading. To get a list of options, just execute
kaiju-makedbwith no arguments:
The following commands demonstrate how to get the
nrdatabase including eukaryotes (
nr_euk) and the
mkdir -p $DBDIR/kaiju cd $DBDIR/kaiju kaiju-makedb -s progenomes -t 10 kaiju-makedb -s nr_euk -t 10 cd -
mkdir -p databases/HUMAnN2 humann2_databases --download chocophlan full $DBDIR/HUMAnN2 humann2_databases --download uniref uniref90_diamond $DBDIR/HUMAnN2/uniref90 humann2_databases --download uniref uniref50_diamond $DBDIR/HUMAnN2/uniref50 humann2_config --update database_folders nucleotide $DBDIR/HUMAnN2/chocophlan humann2_config --update database_folders protein $DBDIR/HUMAnN2/uniref90
The commands download the recommended translated databases. For other options, see the Download a translated search database section of the HUMAnN2 tutorial.
the parameter filewith:
curl -LO https://raw.githubusercontent.com/bioinfo-core-BGU/neatseq-flow-modules/master/Workflows/Metagenomics.yaml
Settings to set in the parameter file
You will have to make some changes to the parameter file to suit your needs:
Set the parameters in the
Global_paramssection to suit your cluster. Alternatively, set
Localfor running on a single machine.
database_prefixto the location of your databases dir, which is the value of
$DBDIRset above. If
$DBDIRis set, you can use the following
sedcommand to set the
sed -i s+\$DBdir+$DBDIR+ Metagenomics.yaml
Vars.databases.kaiju, you will have to make sure the value of
fmifits the database you decide to use. In the provided parameter file, the
nr_eukis set. The equivalent
fmivalue for the
progenomesdatabase is commented out.
Go over the
redirectssections in the parameter file and make sure they are set according to your requirements.
If you have a fasta file with sequences to search for within your metagenome assemblies, set the
proteins_of_interestvariable to the full path to that file. If not, you can delete or uncomment the
SKIPline in steps
In the conda definitions (line 46), set
base:to the path to the conda installation which you used to install the environment.
You can get the path by executing the following command, when inside the Metagenomics conda environment:
echo $CONDA_EXE | sed -e 's/\/bin\/conda$//g'
See also this nice presentation by Galeb Abu-Ali, Eric Franzosa and Curtis Huttenhower