Sequence Clustering

Modules included in this section

cd_hit
vsearch_cluster
vsearch_derepel

`cd_hit`

Authors: Menachem Sklarz
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for clustering with cd-hit/cd-hit-est:

This module runs both cd-hit and cd-hit-est. The type of sequence (nucl or prot) will be determined by the program supplied in script_path.

You must make sure that the required file exists: if clustering nucleotide sequences with cd-hit-est, make sure there is a fasta.nucl file; if clustering protein sequences with cd-hit, make sure there is a fasta.prot file.

CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences, Weizhong Li & Adam Godzik. Bioinformatics, (2006) 22:1658-1659

CD-HIT: accelerated for clustering the next generation sequencing data, Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu & Weizhong Li. Bioinformatics, (2012) 28:3150-3152

Requires

fasta files in the following slot (scope = sample):
- sample_data[<sample>]["fasta.nucl"|"fasta.prot"]
fasta files in the following slot (scope = project):
- sample_data["fasta.nucl"|"fasta.prot"]

Output

Puts the output fasta file in the fasta slot:

self.sample_data[<sample>]["fasta.nucl"|"fasta.prot"]
Or

self.sample_data["project_data"]["fasta.nucl"|"fasta.prot"]

Parameters that can be set

Parameter	Values	Comments
scope	project | sample	Indicates whether to use a project or sample fasta.

Lines for parameter file

clust_proj:
    module: cd_hit
    base: derepel_proj
    script_path: 'path/to/cd-hit-est'
    qsub_params:
        -pe: shared 40
    scope: project
    redirects:
        -T: 40
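
As a complementary sketch, the following step clusters protein sequences per sample. It assumes that pointing script_path at cd-hit (rather than cd-hit-est) selects protein clustering, as described above, and that the module supplies the input and output arguments itself. The step name clust_prot_sample and the base step annotate_sample are hypothetical; the base is assumed to leave a fasta.prot file in each sample. The values passed through redirects are illustrative only.

```yaml
clust_prot_sample:                  # hypothetical step name
    module: cd_hit
    base: annotate_sample           # hypothetical upstream step providing fasta.prot per sample
    script_path: 'path/to/cd-hit'   # cd-hit, not cd-hit-est, for protein input
    scope: sample
    redirects:
        -c: 0.9     # sequence identity threshold (illustrative value)
        -n: 5       # word length; cd-hit recommends 5 for protein thresholds of 0.7 and above
        -T: 8       # number of threads
```

With scope set to sample, the clustered sequences would be expected in self.sample_data[<sample>]["fasta.prot"].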

References

Fu, L., Niu, B., Zhu, Z., Wu, S. and Li, W., 2012. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28(23), pp.3150-3152.

`vsearch_cluster`

Authors: Menachem Sklarz
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running vsearch clustering:

The reads stored in fasta files are clustered with one of the 3 methods available: cluster_fast, cluster_size or cluster_smallmem.

Note

At the moment this works on nucleotide fasta files only. See https://github.com/torognes/vsearch/issues/42

Output types are defined with the outputs parameter which can be a comma separated list of the following:

biomout,mothur_shared_out,otutabout,profile,uc

Fasta output files are defined with the fasta_outputs parameter which can be a comma separated list of the following:

centroids,consout,msaout

By default, the centroids file is stored in the fasta slot. To change this, set store_fasta to one of the types listed above: centroids, consout or msaout.

Requires

fasta files in the following slot (scope = sample):
- sample_data[<sample>]["fasta.nucl"]
fasta files in the following slot (scope = project):
- sample_data["fasta.nucl"]

Output

Puts required output in similarly named slots, e.g.:

self.sample_data[<sample>]["vsearch.centroids"] or self.sample_data["project_data"]["vsearch.centroids"]

Puts the required fasta in the fasta slot:

self.sample_data[<sample>]["fasta.nucl"] or self.sample_data["project_data"]["fasta.nucl"]

Parameters that can be set

Parameter	Values	Comments
outputs	biomout,mothur_shared_out,otutabout,profile,uc	Comma-separated list of outputs other than fasta type outputs (see fasta_outputs).
fasta_outputs	centroids,consout,msaout	Comma-separated list of fasta types to produce.
store_fasta	centroids | consout | msaout	The fasta type to store in the fasta slot.
scope	project | sample	Indicates whether to use a project or sample nucl fasta.

Lines for parameter file

clust_proj:
    module: vsearch_cluster
    base: derepel_proj
    script_path: '{Vars.vsearch_path}/vsearch'
    qsub_params:
        -pe: shared 40
    fasta_outputs: centroids,consout
    outputs: uc
    store_fasta: centroids
    scope: project
    type: cluster_fast
    redirects:
        --id: 0.85  # From ipyrad defaults
        --qmask: dust
        --strand: both
        --threads: 40
        --sizein:
        --sizeout:
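
A sample-scope variant might look like the sketch below. It uses cluster_size, produces two fasta outputs and stores the consensus sequences (consout) in the fasta slot instead of the default centroids. The step name clust_sample and the base step derepel_sample are hypothetical, and the identity threshold and thread count are illustrative rather than recommended values.

```yaml
clust_sample:                       # hypothetical step name
    module: vsearch_cluster
    base: derepel_sample            # hypothetical upstream step providing fasta.nucl per sample
    script_path: '{Vars.vsearch_path}/vsearch'
    fasta_outputs: centroids,consout
    outputs: uc,otutabout
    store_fasta: consout            # store consensus sequences rather than centroids
    scope: sample
    type: cluster_size              # sequences are processed by decreasing abundance
    redirects:
        --id: 0.97                  # illustrative identity threshold
        --threads: 8
        --sizein:                   # read abundance annotations from the input headers
        --sizeout:                  # write abundance annotations to the output headers
```

Under these assumptions, each sample would end up with its consout file in self.sample_data[<sample>]["fasta.nucl"], with the other requested outputs in similarly named slots.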

References

Rognes, T., Flouri, T., Nichols, B., Quince, C. and Mahé, F., 2016. VSEARCH: a versatile open source tool for metagenomics. PeerJ, 4, p.e2584.

`vsearch_derepel`

Authors: Menachem Sklarz
Affiliation: Bioinformatics core facility
Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.

A module for running vsearch read dereplication:

Performs dereplication on fastq and fasta files.

Note

Dereplication with vsearch is not defined on paired end reads.

At the moment, this module is defined only for fasta.nucl or for fastq.S.

Requires

fastq files in the following slots:
- sample_data[<sample>]["fastq.S"]
or fasta files in the following slot:
- sample_data[<sample>]["fasta.nucl"]

Output

Puts output fasta file in the following slots:
- self.sample_data[<sample>]["fasta.nucl"]
- self.sample_data[<sample>]["vsearch_derepl"]

Parameters that can be set

Parameter	Values	Comments
scope	sample | project	Which file to use for dereplication: sample-wise or project-wise files.
uc		Save UCLUST-like dereplication output? (See --uc in the vsearch manual.)
type	derep_fulllength | derep_prefix	Type of dereplication strategy. See the vsearch manual.

Lines for parameter file

For external index:

derepel_proj:
    module: vsearch_derepel
    base: merge_proj
    script_path: '{Vars.vsearch_path}/vsearch'
    scope: project
    type: derep_fulllength
    uc: 
    redirects:
        --sizein:
        --sizeout:
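
For comparison, a sample-scope sketch using prefix dereplication on fasta input could look like this. The step name derepel_sample and the base step merge_sample are hypothetical, and the base is assumed to provide a fasta.nucl file per sample. The --minuniquesize value is illustrative.

```yaml
derepel_sample:                     # hypothetical step name
    module: vsearch_derepel
    base: merge_sample              # hypothetical upstream step providing fasta.nucl per sample
    script_path: '{Vars.vsearch_path}/vsearch'
    scope: sample
    type: derep_prefix              # merge sequences that are prefixes of longer sequences
    redirects:
        --minuniquesize: 2          # discard sequences seen fewer than 2 times (illustrative)
        --sizeout:                  # write abundance annotations to the output headers
```

Here the uc parameter is left out, so no UCLUST-like output would be requested.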

References

Rognes, T., Flouri, T., Nichols, B., Quince, C. and Mahé, F., 2016. VSEARCH: a versatile open source tool for metagenomics. PeerJ, 4, p.e2584.
