Sequence Clustering

Modules included in this section

cd_hit

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for clustering with cd-hit/cd-hit-est:

This module runs both cd-hit and cd-hit-est. The type of sequence (nucl or prot) will be determined by the program supplied in script_path.

You must make sure that the required file exists: if clustering protein sequences with cd-hit, make sure there is a fasta.prot file; if clustering nucleotide sequences with cd-hit-est, make sure there is a fasta.nucl file.
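
The relationship between the program in script_path and the fasta slot it needs can be sketched as below. This is only an illustration: the helper name fasta_slot_for is hypothetical, and the module's own logic for detecting the program may differ. It relies on cd-hit-est being the nucleotide program and cd-hit the protein program.

```python
from pathlib import PurePosixPath


def fasta_slot_for(script_path: str) -> str:
    """Guess which fasta slot a cd-hit program reads, from its file name."""
    program = PurePosixPath(script_path).name
    # cd-hit-est clusters nucleotide sequences; plain cd-hit clusters proteins.
    if program == "cd-hit-est":
        return "fasta.nucl"
    if program == "cd-hit":
        return "fasta.prot"
    raise ValueError(f"unrecognised cd-hit program: {program!r}")


print(fasta_slot_for("path/to/cd-hit-est"))     # fasta.nucl
print(fasta_slot_for("/usr/local/bin/cd-hit"))  # fasta.prot
```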

CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences, Weizhong Li & Adam Godzik. Bioinformatics, (2006) 22:1658-1659

CD-HIT: accelerated for clustering the next generation sequencing data, Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu & Weizhong Li. Bioinformatics, (2012) 28:3150-3152

Requires

  • fasta files in the following slot (scope = sample):

    • sample_data[<sample>]["fasta.nucl"|"fasta.prot"]

  • fasta files in the following slot (scope = project):

    • sample_data["fasta.nucl"|"fasta.prot"]

Output

  • Puts the output fasta file in the fasta slot:

    self.sample_data[<sample>]["fasta.nucl"|"fasta.prot"]

  • Or

    self.sample_data["project_data"]["fasta.nucl"|"fasta.prot"]
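
The slot notation above can be read as nested dictionary lookups. The sketch below assumes sample_data is a plain dict with one entry per sample plus a "project_data" entry, as the Output notation suggests. The helper get_fasta, the sample name and the file paths are hypothetical.

```python
from typing import Optional


def get_fasta(sample_data: dict, slot: str, scope: str,
              sample: Optional[str] = None) -> str:
    """Return the fasta path stored in `slot` for the requested scope."""
    if scope == "project":
        return sample_data["project_data"][slot]
    if scope == "sample":
        if sample is None:
            raise ValueError("a sample name is required when scope is 'sample'")
        return sample_data[sample][slot]
    raise ValueError(f"scope must be 'project' or 'sample', got {scope!r}")


# Toy data; the paths are placeholders, not real module output.
sample_data = {
    "project_data": {"fasta.nucl": "project/all_reads.fasta"},
    "sampleA": {"fasta.nucl": "sampleA/reads.fasta"},
}

print(get_fasta(sample_data, "fasta.nucl", scope="project"))
print(get_fasta(sample_data, "fasta.nucl", scope="sample", sample="sampleA"))
```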

Parameters that can be set

Parameter    Values              Comments
scope        project | sample    Indicates whether to use a project or sample fasta.

Lines for parameter file

clust_proj:
    module: cd_hit
    base: derepel_proj
    script_path: 'path/to/cd-hit-est'
    qsub_params:
        -pe: shared 40
    scope: project
    redirects:
        -T: 40

References

Fu, L., Niu, B., Zhu, Z., Wu, S. and Li, W., 2012. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28(23), pp.3150-3152.

vsearch_cluster

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running vsearch clustering:

The reads stored in fasta files are clustered with one of the 3 methods available: cluster_fast, cluster_size or cluster_smallmem.

Note

At the moment this works on the nucl fasta only. See https://github.com/torognes/vsearch/issues/42

Output types are defined with the outputs parameter, which can be a comma-separated list of the following:

biomout,mothur_shared_out,otutabout,profile,uc

Fasta output files are defined with the fasta_outputs parameter, which can be a comma-separated list of the following:

centroids,consout,msaout

By default, the centroids file is stored in the fasta slot. Change this by setting store_fasta to one of the fasta types listed above, i.e. centroids, consout or msaout.
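
A minimal sketch of how these three parameters fit together is given below. The helper names parse_list and resolve_store_fasta are hypothetical. Requiring store_fasta to be among the requested fasta_outputs is an assumption made for the sketch, not documented module behaviour.

```python
OUTPUT_TYPES = {"biomout", "mothur_shared_out", "otutabout", "profile", "uc"}
FASTA_TYPES = {"centroids", "consout", "msaout"}


def parse_list(value: str, allowed: set) -> list:
    """Split a comma-separated parameter value and reject unknown types."""
    items = [item.strip() for item in value.split(",") if item.strip()]
    unknown = [item for item in items if item not in allowed]
    if unknown:
        raise ValueError(f"unknown output type(s): {unknown}")
    return items


def resolve_store_fasta(fasta_outputs: list, store_fasta: str = "centroids") -> str:
    """Pick the fasta type to store; centroids is the documented default."""
    # Assumption: the stored type must be one of the files actually produced.
    if store_fasta not in fasta_outputs:
        raise ValueError(f"{store_fasta!r} is not among fasta_outputs {fasta_outputs}")
    return store_fasta


outputs = parse_list("uc", OUTPUT_TYPES)
fasta_outputs = parse_list("centroids,consout", FASTA_TYPES)
print(outputs, fasta_outputs, resolve_store_fasta(fasta_outputs))
```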

Requires

  • fasta files in the following slot (scope = sample):

    • sample_data[<sample>]["fasta.nucl"]

  • fasta files in the following slot (scope = project):

    • sample_data["fasta.nucl"]

Output

  • Puts required output in similarly named slots, e.g.:

    self.sample_data[<sample>]["vsearch.centroids"] or self.sample_data["project_data"]["vsearch.centroids"]

  • Puts the required fasta in the fasta slot:

    self.sample_data[<sample>]["fasta.nucl"] or self.sample_data["project_data"]["fasta.nucl"]

Parameters that can be set

Parameter        Values                                            Comments
outputs          biomout,mothur_shared_out,otutabout,profile,uc    List of outputs other than fasta type outputs (see fasta_outputs).
fasta_outputs    centroids,consout,msaout                          A list of fasta types to produce.
store_fasta      centroids|consout|msaout                          The fasta type to store in the fasta slot.
scope            project | sample                                  Indicates whether to use a project or sample nucl fasta.

Lines for parameter file

clust_proj:
    module: vsearch_cluster
    base: derepel_proj
    script_path: '{Vars.vsearch_path}/vsearch'
    qsub_params:
        -pe: shared 40
    fasta_outputs: centroids,consout
    outputs: uc
    store_fasta: centroids
    scope: project
    type: cluster_fast
    redirects:
        --id: 0.85  # From ipyrad defaults
        --qmask: dust
        --strand: both
        --threads: 40
        --sizein:
        --sizeout:

References

Rognes, T., Flouri, T., Nichols, B., Quince, C. and Mahé, F., 2016. VSEARCH: a versatile open source tool for metagenomics. PeerJ, 4, p.e2584.

vsearch_derepel

Authors

Menachem Sklarz

Affiliation

Bioinformatics core facility

Organization

National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running vsearch read dereplication:

Performs dereplication on fastq and fasta files.

Note

Dereplication with vsearch is not defined for paired-end reads.

At the moment, this module is defined only for fasta.nucl or for fastq.S.
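
As a rough picture of what full-length dereplication does, the toy function below collapses identical sequences and records how many reads each unique sequence represents, in the ";size=N" style written by --sizeout. It is not vsearch's implementation: the function name and the records are made up for the sketch, and it ignores existing abundances (--sizein) and vsearch's exact tie-breaking rules.

```python
from collections import Counter


def derep_fulllength(records):
    """Collapse identical sequences; return (label;size=N, sequence) pairs."""
    counts = Counter()
    first_label = {}
    for label, seq in records:
        key = seq.upper()                  # compare case-insensitively
        counts[key] += 1
        first_label.setdefault(key, label)  # keep the first label seen
    # Most abundant first; the stable sort keeps first-seen order for ties.
    ordered = sorted(counts.items(), key=lambda item: -item[1])
    return [(f"{first_label[seq]};size={n}", seq) for seq, n in ordered]


reads = [("r1", "ACGT"), ("r2", "acgt"), ("r3", "GGCC"), ("r4", "ACGT")]
for label, seq in derep_fulllength(reads):
    print(f">{label}\n{seq}")
```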

Requires

  • fastq files in the following slots:

    • sample_data[<sample>]["fastq.S"]

  • or fasta files in the following slot:

    • sample_data[<sample>]["fasta.nucl"]

Output

  • Puts output fasta file in the following slots:

    • self.sample_data[<sample>]["fasta.nucl"]

    • self.sample_data[<sample>]["vsearch_derepl"]

Parameters that can be set

Parameter    Values                             Comments
scope        sample | project                   Which files to use for dereplication: sample-wise or project-wise files.
uc                                              Save UCLUST-like dereplication output? (See --uc in the manual.)
type         derep_fulllength | derep_prefix    Type of dereplication strategy. See the manual.

Lines for parameter file

For external index:

derepel_proj:
    module: vsearch_derepel
    base: merge_proj
    script_path: '{Vars.vsearch_path}/vsearch'
    scope: project
    type: derep_fulllength
    uc: 
    redirects:
        --sizein:
        --sizeout:

References

Rognes, T., Flouri, T., Nichols, B., Quince, C. and Mahé, F., 2016. VSEARCH: a versatile open source tool for metagenomics. PeerJ, 4, p.e2584.