Shotgun Metagenomics

Author

Menachem Sklarz

Affiliation

Bioinformatics Core Facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A workflow for executing various analyses on metagenomics data.

The workflow uses two approaches:

  1. Analysis of the raw reads, and

  2. Assembly of the reads followed by analysis of the assembled contigs.

Developed as part of a study led by Prof. Jacob Moran-Gilad.

Steps

  1. Analysis of the raw reads with:
    • kraken2

    • metaphlan2

    • kaiju

    • HUMAnN2

    The output of the first three programs is also plotted with krona.

  2. Assembly and analysis of the assembled reads:
    1. Assembly is done per-sample with spades.

    2. The assemblies are quality-tested with quast.

    3. Assemblies are annotated with Prokka.

    4. Antibiotic resistance is determined with CARD_RGI.

    5. Not included: resistance and virulence can also be determined by BLASTing antibiotic-resistance (AR) and virulence databases against the assemblies. See module BLAST. (Representative per-sample commands for both approaches are sketched after this list.)
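
To make the two approaches concrete, the following is a rough, hand-written sketch of the kind of commands the workflow runs per sample. It is not the workflow's exact module invocations: NeatSeq-Flow generates the real commands from the parameter file, and the sample names, thread counts and output paths below are placeholders ($DBDIR refers to the databases directory set up in the Quick start below).

    # Hedged sketch only -- placeholder names; the workflow generates the actual commands.
    # Approach 1: raw-read taxonomic profiling (kraken2 shown here), plotted with krona
    kraken2 --db $DBDIR/kraken2 --threads 10 --paired \
        --report Sample1.kraken2.report --output Sample1.kraken2.out \
        Sample1_F.fastq.gz Sample1_R.fastq.gz
    ktImportTaxonomy -q 2 -t 3 Sample1.kraken2.out -o Sample1.krona.html

    # Approach 2: per-sample assembly, QC, annotation and antibiotic-resistance detection
    spades.py --meta -1 Sample1_F.fastq.gz -2 Sample1_R.fastq.gz -o Sample1_assembly
    quast.py -o Sample1_quast Sample1_assembly/contigs.fasta
    prokka --metagenome --outdir Sample1_prokka --prefix Sample1 Sample1_assembly/contigs.fasta
    rgi main --input_type contig --input_sequence Sample1_assembly/contigs.fasta --output_file Sample1_rgi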

Workflow Schema

Metagenomics DAG

Requires

fastq files, either paired-end or single-end.

Programs required

All the programs used in this workflow can be installed with conda. See section Quick start with conda.

Example of Sample File

Title       Metagenomics

#SampleID   Type    Path    lane
Sample1     Forward /path/to/Sample1_F1.fastq.gz 1
Sample1     Forward /path/to/Sample1_F2.fastq.gz 2
Sample1     Reverse /path/to/Sample1_R1.fastq.gz 1
Sample1     Reverse /path/to/Sample1_R2.fastq.gz 2
Sample2     Forward /path/to/Sample2_F1.fastq.gz 1
Sample2     Reverse /path/to/Sample2_R1.fastq.gz 1
Sample2     Forward /path/to/Sample2_F2.fastq.gz 2
Sample2     Reverse /path/to/Sample2_R2.fastq.gz 2

Download

The workflow file is available here.

Quick start with conda

For easy setup of the workflow, including a sample dataset, use the following instructions for complete installation with conda:

  1. Download the conda environment definition file:

    You can download the Metagenomics_conda.yaml file here, or programmatically with:

    curl -LO https://raw.githubusercontent.com/bioinfo-core-BGU/neatseq-flow-modules/master/docs/source/_extra/Metagenomics_conda.yaml
    
  2. Create and activate a conda environment with all the required programs:

    conda env create -f Metagenomics_conda.yaml
    source activate Metagenomics
    
  3. Create a sample file. It should look like the following, but with the file paths replaced by absolute paths to your own files:

    Title   Metagenomics_example
    
    #SampleID       Type    Path
    Sample1 Forward 00.Raw_reads/reads.left.fq.gz
    Sample1 Reverse 00.Raw_reads/reads.right.fq.gz
    

    Tip

    To get the full path to a file, use the following command:

    readlink -f 00.Raw_reads/reads.left.fq.gz
    
  4. Create a directory for your databases. Save the location of the directory in $DBDIR.

    export DBDIR=/path/to/databases_dir
    mkdir -p $DBDIR
    
  5. Install required databases

    Warning

    Installing the databases requires about 220 GB of disk space!

    Tip

    The file Metagenomics_DBinstall_cmds.sh contains a script for installing all the databases described below.

    Execution might take a while due to the large datasets being downloaded; it is therefore recommended to run the script in the background, as follows (after setting $DBDIR!):

    curl -LO https://raw.githubusercontent.com/bioinfo-core-BGU/neatseq-flow-modules/master/docs/source/_extra/Metagenomics_DBinstall_cmds.sh
    nohup bash Metagenomics_DBinstall_cmds.sh &
    
    MetaPhlAn2

    Running MetaPhlAn2 will download the database for you:

    metaphlan2.py \
        --input_type fastq \
        --bowtie2_exe bowtie2 \
        --bowtie2db $DBDIR/MetaPhlAn_temp
    
    Kraken2

    Installing the Kraken2 database takes a long time and requires about 100 GB of disk space.

    mkdir -p $DBDIR/kraken2
    kraken2-build \
        --standard \
        --threads 10 \
        --db $DBDIR/kraken2
    

    Attention

    If rsync doesn't work for you, you can try adding the --use-ftp flag to the kraken2-build command to use wget instead.
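
    For example (the same build command as above, with the FTP fallback added):

    kraken2-build \
        --standard \
        --use-ftp \
        --threads 10 \
        --db $DBDIR/kraken2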

    Krona

    Update the krona taxonomy database used for the plots:

    ktUpdateTaxonomy.sh $DBDIR/krona/taxonomy
    
    Kaiju

    Kaiju provides several databases for download. To see the list of available options, execute kaiju-makedb with no arguments:
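
    kaiju-makedb        # with no arguments, prints the list of available source databases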

    The following commands demonstrate how to get the nr database including eukaryotes (nr_euk) and the progenomes database.

    mkdir -p $DBDIR/kaiju
    cd $DBDIR/kaiju
    kaiju-makedb -s progenomes -t 10
    kaiju-makedb -s nr_euk -t 10
    cd -
    
    HUMAnN2

    Online help on downloading the databases is available in the HUMAnN2 documentation.

    mkdir -p $DBDIR/HUMAnN2
    humann2_databases --download chocophlan full  $DBDIR/HUMAnN2
    humann2_databases --download uniref uniref90_diamond  $DBDIR/HUMAnN2/uniref90
    humann2_databases --download uniref uniref50_diamond  $DBDIR/HUMAnN2/uniref50
    
    humann2_config --update database_folders nucleotide $DBDIR/HUMAnN2/chocophlan
    humann2_config --update database_folders protein $DBDIR/HUMAnN2/uniref90
    

    Attention

    The commands download the recommended translated databases. For other options, see the Download a translated search database section of the HUMAnN2 tutorial.

  6. Get the parameter file with:

    curl -LO https://raw.githubusercontent.com/bioinfo-core-BGU/neatseq-flow-modules/master/Workflows/Metagenomics.yaml
    
  7. Adjust the settings in the parameter file

    You will have to make some changes to the parameter file to suit your needs:

    • Set the parameters in the Global_params section to suit your cluster. Alternatively, set Executor to Local for running on a single machine.

    • In the Vars section, set database_prefix to the location of your databases directory, i.e. the value of $DBDIR set above. If $DBDIR is set in your current shell, you can use the following sed command to set database_prefix correctly:

      sed -i s+\$DBdir+$DBDIR+ Metagenomics.yaml
      
    • In Vars.databases.kaiju, make sure the value of fmi matches the database you decided to use. In the provided parameter file, the nr_euk database is set; the equivalent fmi value for the progenomes database is commented out. (See the listing sketched after this list for locating your .fmi files.)

    • Go over the redirects sections in the parameter file and make sure they are set according to your requirements.

    • If you have a fasta file with sequences to search for within your metagenome assemblies, set the proteins_of_interest variable to the full path to that file. If not, either delete the steps make_blast_db_per_assembly, blast_proteins_vs_assemblies and parse_blast, or uncomment the SKIP line in each of them.
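
      A quick way to see which .fmi files you actually have, and hence what to put in Vars.databases.kaiju.fmi, is to list them. The file names in the comment below are only illustrative; they depend on the kaiju version and on which sources you built in step 5:

      ls $DBDIR/kaiju/*/*.fmi
      # e.g. .../kaiju/nr_euk/kaiju_db_nr_euk.fmi  or  .../kaiju/progenomes/kaiju_db_progenomes.fmi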

  8. In the conda definitions (line 46), set base: to the path of the conda installation that you used to install the environment.

    You can get the path by executing the following command while inside the Metagenomics conda environment:

    echo $CONDA_EXE | sed -e 's/\/bin\/conda$//g'
    
  9. Execute NeatSeq-Flow.
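
    A minimal sketch of a typical invocation, assuming the sample file you created above is saved as sample_data.nsfs and the parameter file as Metagenomics.yaml (these names, and the location of the generated master script, are assumptions; adjust them to your setup):

    neatseq_flow.py \
        --sample_file sample_data.nsfs \
        --param_file Metagenomics.yaml \
        --message "shotgun metagenomics run"
    bash scripts/00.workflow.commands.sh    # runs the generated scripts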

Tip

See also this nice presentation by Galeb Abu-Ali, Eric Franzosa and Curtis Huttenhower.