Shotgun Metagenomics

Author

Menachem Sklarz

Affiliation

Bioinformatics Core Facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A workflow for executing various analyses on metagenomics data.

The workflow uses two approaches:

  1. Analysis of the raw reads, and

  2. Assembly of the reads followed by analysis of the assembled contigs.

Developed as part of a study led by Prof. Jacob Moran-Gilad.

Steps

  1. Analysis of the raw reads with:
    • kraken2

    • metaphlan2

    • kaiju

    • HUMAnN2

    The output of the first three programs is also plotted with krona.

  2. Assembly and analysis of the assembled reads:
    1. Assembly is done per-sample with spades.

    2. The assemblies are quality-tested with quast.

    3. Assemblies are annotated with Prokka.

    4. Antibiotic resistance is determined with CARD_RGI.

    5. Not included: resistance and virulence can also be determined by BLASTing antibiotic-resistance (AR) and virulence databases against the assemblies. See module BLAST. (Representative per-sample commands for both approaches are sketched after this list.)
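
To make the two approaches concrete, the following is a rough, hand-written sketch of the kind of commands the workflow runs per sample. It is not the workflow's exact module invocations: NeatSeq-Flow generates the real commands from the parameter file, and the sample names, thread counts and output paths below are placeholders ($DBDIR refers to the databases directory set up in the Quick start below).

    # Hedged sketch only -- placeholder names; the workflow generates the actual commands.
    # Approach 1: raw-read taxonomic profiling (kraken2 shown here), plotted with krona
    kraken2 --db $DBDIR/kraken2 --threads 10 --paired \
        --report Sample1.kraken2.report --output Sample1.kraken2.out \
        Sample1_F.fastq.gz Sample1_R.fastq.gz
    ktImportTaxonomy -q 2 -t 3 Sample1.kraken2.out -o Sample1.krona.html

    # Approach 2: per-sample assembly, QC, annotation and antibiotic-resistance detection
    spades.py --meta -1 Sample1_F.fastq.gz -2 Sample1_R.fastq.gz -o Sample1_assembly
    quast.py -o Sample1_quast Sample1_assembly/contigs.fasta
    prokka --metagenome --outdir Sample1_prokka --prefix Sample1 Sample1_assembly/contigs.fasta
    rgi main --input_type contig --input_sequence Sample1_assembly/contigs.fasta --output_file Sample1_rgi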

Workflow Schema

Metagenomics DAG

Requires

fastq files, either paired-end or single-end.

Programs required

All the programs used in this workflow can be installed with conda. See section Quick start with conda.

Example of Sample File

Title       Metagenomics

#SampleID   Type    Path    lane
Sample1     Forward /path/to/Sample1_F1.fastq.gz 1
Sample1     Forward /path/to/Sample1_F2.fastq.gz 2
Sample1     Reverse /path/to/Sample1_R1.fastq.gz 1
Sample1     Reverse /path/to/Sample1_R2.fastq.gz 2
Sample2     Forward /path/to/Sample2_F1.fastq.gz 1
Sample2     Reverse /path/to/Sample2_R1.fastq.gz 1
Sample2     Forward /path/to/Sample2_F2.fastq.gz 2
Sample2     Reverse /path/to/Sample2_R2.fastq.gz 2

Download

The workflow file is available here.

Quick start with conda

For easy setup of the workflow, including a sample dataset, use the following instructions for complete installation with conda:

  1. Download the conda environment definition file:

    You can download the Metagenomics_conda.yaml file here, or programmatically with:

    curl -LO https://raw.githubusercontent.com/bioinfo-core-BGU/neatseq-flow-modules/master/docs/source/_extra/Metagenomics_conda.yaml
    
  2. Create and activate a conda environment with all the required programs:

    conda env create -f Metagenomics_conda.yaml
    source activate Metagenomics
    
  3. Create a sample file. It should look like the following, but with the file paths replaced by absolute paths to your own files:

    Title   Metagenomics_example
    
    #SampleID       Type    Path
    Sample1 Forward 00.Raw_reads/reads.left.fq.gz
    Sample1 Reverse 00.Raw_reads/reads.right.fq.gz
    

    Tip

    To get the full path to a file, use the following command:

    readlink -f 00.Raw_reads/reads.left.fq.gz
    
  4. Create a directory for your databases. Save the location of the directory in $DBDIR.

    export DBDIR=/path/to/databases_dir
    mkdir -p $DBDIR
    
  5. Install required databases

    Warning

    Installing the databases requires about 220 GB of disk space!

    Tip

    The file Metagenomics_DBinstall_cmds.sh contains a script for installing all the databases described below.

    Execution might take a while due to the large datasets being downloaded; it is therefore recommended to run the script in the background, as follows (after setting $DBDIR!):

    curl -LO https://raw.githubusercontent.com/bioinfo-core-BGU/neatseq-flow-modules/master/docs/source/_extra/Metagenomics_DBinstall_cmds.sh
    nohup bash Metagenomics_DBinstall_cmds.sh &
    
    MetaPhlAn2

    Running MetaPhlAn2 will download the database for you:

    metaphlan2.py \
        --input_type fastq \
        --bowtie2_exe bowtie2 \
        --bowtie2db $DBDIR/MetaPhlAn_temp
    
    Kraken2

    Installing the Kraken2 database takes a long time and requires about 100 GB of disk space.

    mkdir -p $DBDIR/kraken2
    kraken2-build \
        --standard \
        --threads 10 \
        --db $DBDIR/kraken2
    

    Attention

    If rsync doesn't work for you, you can try adding the --use-ftp flag to the kraken2-build command to use wget instead.
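
    For example (the same build command as above, with the FTP fallback added):

    kraken2-build \
        --standard \
        --use-ftp \
        --threads 10 \
        --db $DBDIR/kraken2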

    Krona

    Update the krona taxonomy database used for the plots:

    ktUpdateTaxonomy.sh $DBDIR/krona/taxonomy
    
    Kaiju

    Kaiju provides several databases for download. To see the list of available options, execute kaiju-makedb with no arguments:
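
    kaiju-makedb        # with no arguments, prints the list of available source databases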

    The following commands demonstrate how to get the nr database including eukaryotes (nr_euk) and the progenomes database.

    mkdir -p $DBDIR/kaiju
    cd $DBDIR/kaiju
    kaiju-makedb -s progenomes -t 10
    kaiju-makedb -s nr_euk -t 10
    cd -
    
    HUMAnN2

    Online help on downloading the databases is available in the HUMAnN2 documentation.

    mkdir -p $DBDIR/HUMAnN2
    humann2_databases --download chocophlan full  $DBDIR/HUMAnN2
    humann2_databases --download uniref uniref90_diamond  $DBDIR/HUMAnN2/uniref90
    humann2_databases --download uniref uniref50_diamond  $DBDIR/HUMAnN2/uniref50
    
    humann2_config --update database_folders nucleotide $DBDIR/HUMAnN2/chocophlan
    humann2_config --update database_folders protein $DBDIR/HUMAnN2/uniref90
    

    Attention

    The commands download the recommended translated databases. For other options, see the Download a translated search database section of the HUMAnN2 tutorial.

  6. Get the parameter file with:

    curl -LO https://raw.githubusercontent.com/bioinfo-core-BGU/neatseq-flow-modules/master/Workflows/Metagenomics.yaml
    
  7. Adjust the settings in the parameter file

    You will have to make some changes to the parameter file to suit your needs:

    • Set the parameters in the Global_params section to suit your cluster. Alternatively, set Executor to Local for running on a single machine.

    • In the Vars section, set database_prefix to the location of your databases directory, i.e. the value of $DBDIR set above. If $DBDIR is set in your current shell, you can use the following sed command to set database_prefix correctly:

      sed -i s+\$DBdir+$DBDIR+ Metagenomics.yaml
      
    • In Vars.databases.kaiju, make sure the value of fmi matches the database you decided to use. In the provided parameter file, the nr_euk database is set; the equivalent fmi value for the progenomes database is commented out. (See the listing sketched after this list for locating your .fmi files.)

    • Go over the redirects sections in the parameter file and make sure they are set according to your requirements.

    • If you have a fasta file with sequences to search for within your metagenome assemblies, set the proteins_of_interest variable to the full path to that file. If not, either delete the steps make_blast_db_per_assembly, blast_proteins_vs_assemblies and parse_blast, or uncomment the SKIP line in each of them.
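
      A quick way to see which .fmi files you actually have, and hence what to put in Vars.databases.kaiju.fmi, is to list them. The file names in the comment below are only illustrative; they depend on the kaiju version and on which sources you built in step 5:

      ls $DBDIR/kaiju/*/*.fmi
      # e.g. .../kaiju/nr_euk/kaiju_db_nr_euk.fmi  or  .../kaiju/progenomes/kaiju_db_progenomes.fmi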

  8. In the conda definitions (line 46), set base: to the path of the conda installation that you used to install the environment.

    You can get the path by executing the following command while inside the Metagenomics conda environment:

    echo $CONDA_EXE | sed -e 's/\/bin\/conda$//g'
    
  9. Execute NeatSeq-Flow.
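
    A minimal sketch of a typical invocation, assuming the sample file you created above is saved as sample_data.nsfs and the parameter file as Metagenomics.yaml (these names, and the location of the generated master script, are assumptions; adjust them to your setup):

    neatseq_flow.py \
        --sample_file sample_data.nsfs \
        --param_file Metagenomics.yaml \
        --message "shotgun metagenomics run"
    bash scripts/00.workflow.commands.sh    # runs the generated scripts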

Tip

See also this nice presentation by Galeb Abu-Ali, Eric Franzosa and Curtis Huttenhower.