Sequence Clustering
Modules included in this section
cd_hit
- Authors: Menachem Sklarz
- Affiliation: Bioinformatics core facility
- Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.
A module for clustering with cd-hit/cd-hit-est:
This module runs both cd-hit and cd-hit-est. The type of sequence (nucl or prot) will be determined by the program supplied in script_path.
You must make sure that the required file exists: if clustering protein sequences with cd-hit, make sure there is a fasta.prot file; if clustering nucleotide sequences with cd-hit-est, make sure there is a fasta.nucl file.
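The relationship between the program in script_path and the required fasta slot can be sketched as follows. This is only an illustration: the helper name fasta_slot_for and the lookup table are hypothetical and are not taken from the module's source.

```python
import os

# Hypothetical lookup table: executable name -> fasta slot it needs.
PROGRAM_TO_SLOT = {
    "cd-hit": "fasta.prot",      # protein clustering
    "cd-hit-est": "fasta.nucl",  # nucleotide clustering
}


def fasta_slot_for(script_path):
    """Return the fasta slot implied by the program named in script_path."""
    program = os.path.basename(script_path)
    try:
        return PROGRAM_TO_SLOT[program]
    except KeyError:
        raise ValueError("Unrecognised cd-hit program: %r" % program)
```

With the script_path from the parameter file example, 'path/to/cd-hit-est', this sketch selects the fasta.nucl slot.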
CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences, Weizhong Li & Adam Godzik. Bioinformatics, (2006) 22:1658-1659
CD-HIT: accelerated for clustering the next generation sequencing data, Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu & Weizhong Li. Bioinformatics, (2012) 28:3150-3152
Requires
fasta files in the following slot (scope = sample):
sample_data[<sample>]["fasta.nucl"|"fasta.prot"]
fasta files in the following slot (scope = project):
sample_data["fasta.nucl"|"fasta.prot"]
Output
Puts the output fasta file in the fasta slot:
self.sample_data[<sample>]["fasta.nucl"|"fasta.prot"]
Or
self.sample_data["project_data"]["fasta.nucl"|"fasta.prot"]
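A minimal sketch of how the output slot depends on scope, assuming sample_data is a nested dictionary keyed by sample name, with project-level data under "project_data" as in the lines above. The function name store_fasta_path and the file names are hypothetical.

```python
def store_fasta_path(sample_data, scope, slot, path, sample=None):
    """Store a clustered fasta path in the slot matching the given scope."""
    if scope == "project":
        target = sample_data.setdefault("project_data", {})
    elif scope == "sample":
        if sample is None:
            raise ValueError("A sample name is required when scope is 'sample'")
        target = sample_data.setdefault(sample, {})
    else:
        raise ValueError("scope must be 'project' or 'sample'")
    target[slot] = path
    return sample_data


# Example with made-up names: one project-level and one sample-level result.
data = {}
store_fasta_path(data, "project", "fasta.nucl", "clustered.fasta")
store_fasta_path(data, "sample", "fasta.prot", "s1.clustered.faa", sample="Sample1")
```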
Parameters that can be set
Parameter | Values | Comments |
---|---|---|
scope | project or sample | Indicates whether to use a project or sample fasta. |
Lines for parameter file
clust_proj:
    module: cd_hit
    base: derepel_proj
    script_path: 'path/to/cd-hit-est'
    qsub_params:
        -pe: shared 40
    scope: project
    redirects:
        -T: 40
References
Fu, L., Niu, B., Zhu, Z., Wu, S. and Li, W., 2012. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28(23), pp.3150-3152.
vsearch_cluster
- Authors: Menachem Sklarz
- Affiliation: Bioinformatics core facility
- Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.
A module for running vsearch clustering:
The reads stored in fasta files are clustered with one of the 3 methods available: cluster_fast, cluster_size or cluster_smallmem.
Note: At the moment this works on the nucl fasta only. See https://github.com/torognes/vsearch/issues/42
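The three clustering methods differ mainly in the order in which vsearch processes the input sequences: cluster_fast sorts by decreasing length, cluster_size sorts by decreasing abundance, and cluster_smallmem keeps the input order. The sketch below illustrates only this ordering step, not the clustering itself. The function names are hypothetical, and reading the abundance from a size=N header annotation is a simplification.

```python
import re


def abundance(header):
    """Read the abundance from a 'size=N' annotation; default to 1 if absent."""
    match = re.search(r"size=(\d+)", header)
    return int(match.group(1)) if match else 1


def processing_order(method, records):
    """Order (header, sequence) records the way each method would visit them."""
    if method == "cluster_fast":
        return sorted(records, key=lambda rec: len(rec[1]), reverse=True)
    if method == "cluster_size":
        return sorted(records, key=lambda rec: abundance(rec[0]), reverse=True)
    if method == "cluster_smallmem":
        return list(records)  # input order is kept as supplied
    raise ValueError("Unknown clustering method: %r" % method)


records = [("a;size=5", "ACG"), ("b;size=9", "ACGTAC"), ("c;size=1", "ACGT")]
```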
Output types are defined with the outputs parameter which can be a comma separated list of the following:
biomout,mothur_shared_out,otutabout,profile,uc
Fasta output files are defined with the fasta_outputs parameter which can be a comma separated list of the following:
centroids,consout,msaout
By default, the centroids file is stored in the fasta slot. To change this, set store_fasta to one of the fasta types listed above: centroids, consout or msaout.
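The interplay of outputs, fasta_outputs and store_fasta can be sketched as a small validation step. This is an illustration under stated assumptions: the function names are hypothetical, and the rule that store_fasta must name one of the requested fasta_outputs is an assumption, not confirmed behaviour of the module.

```python
FASTA_TYPES = ("centroids", "consout", "msaout")
OUTPUT_TYPES = ("biomout", "mothur_shared_out", "otutabout", "profile", "uc")


def parse_list(value, allowed):
    """Split a comma-separated parameter value and reject unknown entries."""
    items = [item.strip() for item in value.split(",") if item.strip()]
    unknown = [item for item in items if item not in allowed]
    if unknown:
        raise ValueError("Unknown output type(s): %s" % ", ".join(unknown))
    return items


def resolve_store_fasta(fasta_outputs, store_fasta="centroids"):
    """Return the fasta type to store, assuming it must be among those requested."""
    requested = parse_list(fasta_outputs, FASTA_TYPES)
    if store_fasta not in requested:
        raise ValueError("store_fasta %r is not in fasta_outputs" % store_fasta)
    return store_fasta
```

For the parameter file example, fasta_outputs is centroids,consout and store_fasta is centroids, so the centroids file is the one placed in the fasta slot.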
Requires
fasta files in the following slot (scope = sample):
sample_data[<sample>]["fasta.nucl"]
fasta files in the following slot (scope = project):
sample_data["fasta.nucl"]
Output
Puts required output in similarly named slots, e.g.:
self.sample_data[<sample>]["vsearch.centroids"]
or
self.sample_data["project_data"]["vsearch.centroids"]
Puts the required fasta in the fasta slot:
self.sample_data[<sample>]["fasta.nucl"]
or
self.sample_data["project_data"]["fasta.nucl"]
Parameters that can be set
Parameter | Values | Comments |
---|---|---|
outputs | biomout,mothur_shared_out,otutabout,profile,uc | List of outputs other than fasta type outputs (see fasta_outputs). |
fasta_outputs | centroids,consout,msaout | A list of fasta types to produce. |
store_fasta | centroids, consout or msaout | The fasta type to store in the fasta slot. |
scope | project or sample | Indicates whether to use a project or sample nucl fasta. |
Lines for parameter file
clust_proj:
    module: vsearch_cluster
    base: derepel_proj
    script_path: '{Vars.vsearch_path}/vsearch'
    qsub_params:
        -pe: shared 40
    fasta_outputs: centroids,consout
    outputs: uc
    store_fasta: centroids
    scope: project
    type: cluster_fast
    redirects:
        --id: 0.85 # From ipyrad defaults
        --qmask: dust
        --strand: both
        --threads: 40
        --sizein:
        --sizeout:
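As a rough illustration of how the redirects in this example might translate into a vsearch command line, the sketch below appends each redirect as a flag, with a value only when one is given. The function build_command and the input file name reads.fasta are hypothetical, the output options (such as those for centroids or uc files) are left out, and the exact command the module writes may differ.

```python
def build_command(script_path, cluster_type, fasta_in, redirects):
    """Assemble an argument list from the type and redirects parameters."""
    cmd = [script_path, "--" + cluster_type, fasta_in]
    for flag, value in redirects.items():
        cmd.append(flag)
        if value is not None:  # flags such as --sizein take no value
            cmd.append(str(value))
    return cmd


redirects = {
    "--id": 0.85,
    "--qmask": "dust",
    "--strand": "both",
    "--threads": 40,
    "--sizein": None,
    "--sizeout": None,
}
command = " ".join(build_command("vsearch", "cluster_fast", "reads.fasta", redirects))
```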
References
Rognes, T., Flouri, T., Nichols, B., Quince, C. and Mahé, F., 2016. VSEARCH: a versatile open source tool for metagenomics. PeerJ, 4, p.e2584.
vsearch_derepel
- Authors: Menachem Sklarz
- Affiliation: Bioinformatics core facility
- Organization: National Institute of Biotechnology in the Negev, Ben Gurion University.
A module for running vsearch read dereplication:
Performs dereplication on fastq and fasta files.
Note
Dereplication with vsearch is not defined on paired end reads.
At the moment, this module is defined only for fasta.nucl or for fastq.S.
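For readers unfamiliar with the term, full-length dereplication collapses identical sequences into a single record and keeps track of how many copies were seen. The sketch below shows the idea in a simplified form. It is not vsearch's implementation: the function name is reused only for illustration, comparison is reduced to case-insensitive string equality, and existing size annotations (as read with --sizein) are ignored.

```python
from collections import Counter


def derep_fulllength(records):
    """Collapse identical sequences; return (header;size=N, sequence) pairs.

    records is an iterable of (header, sequence) tuples. The first header
    seen for each distinct sequence is kept, and results are ordered by
    decreasing abundance (ties keep first-seen order).
    """
    counts = Counter()
    first_header = {}
    for header, seq in records:
        key = seq.upper()
        counts[key] += 1
        first_header.setdefault(key, header)
    ordered = sorted(counts.items(), key=lambda item: -item[1])
    return [("%s;size=%d" % (first_header[seq], n), seq) for seq, n in ordered]


reads = [("r1", "ACGT"), ("r2", "TTTT"), ("r3", "acgt"), ("r4", "ACGT")]
unique = derep_fulllength(reads)
```

The size=N annotation added here mirrors what the --sizeout redirect requests from vsearch, so that downstream abundance-aware steps can use the counts.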
Requires
fastq files in the following slots:
sample_data[<sample>]["fastq.S"]
or fasta files in the following slot:
sample_data[<sample>]["fasta.nucl"]
Output
Puts output fasta file in the following slots:
self.sample_data[<sample>]["fasta.nucl"]
self.sample_data[<sample>]["vsearch_derepl"]
Parameters that can be set
Parameter | Values | Comments |
---|---|---|
scope | sample or project | Which file to use for dereplication: sample-wise or project-wise files. |
uc | | Save UCLUST-like dereplication output? (see --uc in the manual) |
type | derep_fulllength or derep_prefix | Type of dereplication strategy. See the manual. |
Lines for parameter file
For external index:
derepel_proj:
    module: vsearch_derepel
    base: merge_proj
    script_path: '{Vars.vsearch_path}/vsearch'
    scope: project
    type: derep_fulllength
    uc:
    redirects:
        --sizein:
        --sizeout:
References
Rognes, T., Flouri, T., Nichols, B., Quince, C. and Mahé, F., 2016. VSEARCH: a versatile open source tool for metagenomics. PeerJ, 4, p.e2584.