Sequence Clustering¶
Modules included in this section: cd_hit, vsearch_cluster, vsearch_derepel
cd_hit¶
Authors: | Menachem Sklarz |
---|---|
Affiliation: | Bioinformatics core facility |
Organization: | National Institute of Biotechnology in the Negev, Ben Gurion University. |
A module for clustering with cd-hit/cd-hit-est:
This module runs both cd-hit and cd-hit-est. The type of sequence (nucl or prot) will be determined by the program supplied in script_path.
You must make sure that the required file exists: if clustering protein sequences with cd-hit, make sure there is a fasta.prot file; if clustering nucleotide sequences with cd-hit-est, make sure there is a fasta.nucl file.
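The relationship between the program and the required slot can be sketched as follows. This is only an illustration: the mapping table and the helper name `required_fasta_slot` are hypothetical and are not part of the module.

```python
import os

# Hypothetical mapping (an assumption for illustration): the executable
# named in script_path decides which fasta slot must be present.
PROGRAM_TO_SLOT = {
    "cd-hit": "fasta.prot",      # protein clustering
    "cd-hit-est": "fasta.nucl",  # nucleotide clustering
}


def required_fasta_slot(script_path):
    """Return the fasta slot expected for the given cd-hit executable."""
    program = os.path.basename(script_path)
    if program not in PROGRAM_TO_SLOT:
        raise ValueError(f"Unrecognised cd-hit program: {program!r}")
    return PROGRAM_TO_SLOT[program]


print(required_fasta_slot("path/to/cd-hit-est"))
```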
CD-HIT: a fast program for clustering and comparing large sets of protein or nucleotide sequences, Weizhong Li & Adam Godzik. Bioinformatics, (2006) 22:1658-1659
CD-HIT: accelerated for clustering the next generation sequencing data, Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu & Weizhong Li. Bioinformatics, (2012) 28:3150-3152
Requires¶
fasta files in the following slot (scope = sample):
sample_data[<sample>]["fasta.nucl"|"fasta.prot"]
fasta files in the following slot (scope = project):
sample_data["fasta.nucl"|"fasta.prot"]
Output¶
Puts the output fasta file in the fasta slot:
self.sample_data[<sample>]["fasta.nucl"|"fasta.prot"]
Or
self.sample_data["project_data"]["fasta.nucl"|"fasta.prot"]
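As a rough sketch of how the two scopes address the output slot, the lookup below follows the layout shown in the Output lines above. The helper `fasta_slot`, the sample name and the file paths are made up for illustration.

```python
def fasta_slot(sample_data, fasta_type, scope, sample=None):
    """Return the fasta path stored for one sample or for the whole project."""
    if scope == "sample":
        return sample_data[sample][fasta_type]
    if scope == "project":
        # Assumes project-level data lives under the "project_data" key,
        # as in the Output description.
        return sample_data["project_data"][fasta_type]
    raise ValueError(f"scope must be 'sample' or 'project', got {scope!r}")


# Illustrative structure only; names and paths are hypothetical.
sample_data = {
    "Sample1": {"fasta.nucl": "data/Sample1.clustered.fasta"},
    "project_data": {"fasta.nucl": "data/project.clustered.fasta"},
}

print(fasta_slot(sample_data, "fasta.nucl", "sample", sample="Sample1"))
print(fasta_slot(sample_data, "fasta.nucl", "project"))
```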
Parameters that can be set¶
Parameter | Values | Comments |
---|---|---|
scope | project | sample | Indicates whether to use a project or sample fasta. |
Lines for parameter file¶
clust_proj:
    module: cd_hit
    base: derepel_proj
    script_path: 'path/to/cd-hit-est'
    qsub_params:
        -pe: shared 40
    scope: project
    redirects:
        -T: 40
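To give a feel for what these parameters amount to, here is a minimal sketch of a cd-hit-est command line assembled from script_path and redirects. The `-i`, `-o` and `-T` options are standard cd-hit options; the helper `cd_hit_command` and the file names are assumptions, and the script the module actually generates may differ.

```python
import shlex


def cd_hit_command(script_path, input_fasta, output_fasta, redirects):
    """Assemble a cd-hit style command string (illustrative only)."""
    args = [script_path, "-i", input_fasta, "-o", output_fasta]
    for flag, value in redirects.items():
        args.append(flag)
        if value is not None:  # flags without a value are passed bare
            args.append(str(value))
    return " ".join(shlex.quote(arg) for arg in args)


# Hypothetical input and output file names.
cmd = cd_hit_command(
    "path/to/cd-hit-est", "derep.fasta", "clustered.fasta", {"-T": 40}
)
print(cmd)
```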
References¶
Fu, L., Niu, B., Zhu, Z., Wu, S. and Li, W., 2012. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics, 28(23), pp.3150-3152.
vsearch_cluster¶
Authors: | Menachem Sklarz |
---|---|
Affiliation: | Bioinformatics core facility |
Organization: | National Institute of Biotechnology in the Negev, Ben Gurion University. |
A module for running vsearch clustering:
The reads stored in fasta files are clustered with one of the 3 methods available: cluster_fast, cluster_size or cluster_smallmem.
Note
At the moment, this works on nucl fasta only. See https://github.com/torognes/vsearch/issues/42
Output types are defined with the outputs parameter, which takes a comma-separated list of any of the following:
biomout,mothur_shared_out,otutabout,profile,uc
Fasta output files are defined with the fasta_outputs parameter, which takes a comma-separated list of any of the following:
centroids,consout,msaout
By default, the centroids file is stored in the fasta slot. To change this, set store_fasta to one of the fasta types listed above: centroids, consout or msaout.
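The way the three parameters interact can be sketched as a small validation routine. The function names are hypothetical, and the check that store_fasta must be one of the requested fasta_outputs is an assumption, not documented behaviour.

```python
ALLOWED_OUTPUTS = {"biomout", "mothur_shared_out", "otutabout", "profile", "uc"}
ALLOWED_FASTA_OUTPUTS = {"centroids", "consout", "msaout"}


def parse_list(value, allowed, name):
    """Split a comma-separated parameter and reject unknown entries."""
    items = [item.strip() for item in value.split(",") if item.strip()]
    unknown = [item for item in items if item not in allowed]
    if unknown:
        raise ValueError(f"Unknown value(s) for {name}: {', '.join(unknown)}")
    return items


def resolve_outputs(outputs="", fasta_outputs="centroids", store_fasta="centroids"):
    """Return (outputs, fasta_outputs, store_fasta) after basic checks."""
    outs = parse_list(outputs, ALLOWED_OUTPUTS, "outputs")
    fastas = parse_list(fasta_outputs, ALLOWED_FASTA_OUTPUTS, "fasta_outputs")
    # Assumption: the stored fasta must be one of the requested fasta outputs.
    if store_fasta not in fastas:
        raise ValueError(f"store_fasta={store_fasta!r} is not in fasta_outputs")
    return outs, fastas, store_fasta


print(resolve_outputs("uc", "centroids,consout", "centroids"))
```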
Requires¶
fasta files in the following slot (scope = sample):
sample_data[<sample>]["fasta.nucl"]
fasta files in the following slot (scope = project):
sample_data["fasta.nucl"]
Output¶
Puts required output in similarly named slots, e.g.:
self.sample_data[<sample>]["vsearch.centroids"]
or
self.sample_data["project_data"]["vsearch.centroids"]
Puts the required fasta in the fasta slot:
self.sample_data[<sample>]["fasta.nucl"]
or
self.sample_data["project_data"]["fasta.nucl"]
Parameters that can be set¶
Parameter | Values | Comments |
---|---|---|
outputs | biomout,mothur_shared_out,otutabout,profile,uc | List of outputs other than fasta-type outputs (see fasta_outputs). |
fasta_outputs | centroids,consout,msaout | A list of fasta types to produce. |
store_fasta | centroids|consout|msaout | The fasta type to store in the fasta slot. |
scope | project | sample | Indicates whether to use a project or sample nucl fasta. |
Lines for parameter file¶
clust_proj:
    module: vsearch_cluster
    base: derepel_proj
    script_path: '{Vars.vsearch_path}/vsearch'
    qsub_params:
        -pe: shared 40
    fasta_outputs: centroids,consout
    outputs: uc
    store_fasta: centroids
    scope: project
    type: cluster_fast
    redirects:
        --id: 0.85  # From ipyrad defaults
        --qmask: dust
        --strand: both
        --threads: 40
        --sizein:
        --sizeout:
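For orientation, the sketch below shows the kind of vsearch argument list these parameters correspond to. All options used are real vsearch options, but the helper `vsearch_cluster_args`, the file names and the output naming scheme are assumptions; the module may build its script differently.

```python
def vsearch_cluster_args(script_path, cluster_type, input_fasta, prefix,
                         fasta_outputs, outputs, redirects):
    """Build an argument list for a vsearch clustering run (illustrative)."""
    args = [script_path, f"--{cluster_type}", input_fasta]
    for flag, value in redirects.items():
        args.append(flag)
        if value is not None:  # e.g. --sizein and --sizeout take no value
            args.append(str(value))
    # Each requested output type maps to a vsearch option of the same name.
    for name in list(fasta_outputs) + list(outputs):
        args += [f"--{name}", f"{prefix}.{name}"]  # hypothetical file naming
    return args


args = vsearch_cluster_args(
    "vsearch", "cluster_fast", "derep.fasta", "clusters",
    fasta_outputs=["centroids", "consout"],
    outputs=["uc"],
    redirects={"--id": 0.85, "--qmask": "dust", "--strand": "both",
               "--threads": 40, "--sizein": None, "--sizeout": None},
)
print(" ".join(args))
```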
References¶
Rognes, T., Flouri, T., Nichols, B., Quince, C. and Mahé, F., 2016. VSEARCH: a versatile open source tool for metagenomics. PeerJ, 4, p.e2584.
vsearch_derepel¶
Authors: | Menachem Sklarz |
---|---|
Affiliation: | Bioinformatics core facility |
Organization: | National Institute of Biotechnology in the Negev, Ben Gurion University. |
A module for running vsearch read dereplication:
Performs dereplication on fastq and fasta files.
Note
Dereplication with vsearch is not defined for paired-end reads.
At the moment, this module is defined only for fasta.nucl or for fastq.S.
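For readers new to the term, the following pure-Python sketch illustrates the idea behind full-length dereplication: identical sequences are collapsed into one record annotated with its abundance. It is a simplified illustration and not vsearch's implementation; in particular, it ignores any size annotations already present in the input labels.

```python
from collections import Counter


def derep_fulllength(records):
    """Collapse identical sequences and annotate each with its abundance.

    records: iterable of (label, sequence) pairs.
    Returns (label;size=N, sequence) pairs, most abundant first.
    """
    counts = Counter()
    first_label = {}
    for label, seq in records:
        key = seq.upper()  # compare sequences case-insensitively
        counts[key] += 1
        first_label.setdefault(key, label)  # keep the first label seen
    return [
        (f"{first_label[seq]};size={n}", seq)
        for seq, n in counts.most_common()
    ]


reads = [("r1", "ACGT"), ("r2", "acgt"), ("r3", "TTGA"), ("r4", "ACGT")]
unique = derep_fulllength(reads)
print(unique)
```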
Requires¶
fastq files in the following slots:
sample_data[<sample>]["fastq.S"]
or fasta files in the following slot:
sample_data[<sample>]["fasta.nucl"]
Output¶
Puts output fasta file in the following slots:
self.sample_data[<sample>]["fasta.nucl"]
self.sample_data[<sample>]["vsearch_derepl"]
Parameters that can be set¶
Parameter | Values | Comments |
---|---|---|
scope | sample | project | Which file to use for dereplication: sample-wise or project-wise files |
uc | | Save UCLUST-like dereplication output? (see --uc in the manual) |
type | derep_fulllength | derep_prefix | Type of dereplication strategy. See the manual. |
Lines for parameter file¶
For external index:
derepel_proj:
    module: vsearch_derepel
    base: merge_proj
    script_path: '{Vars.vsearch_path}/vsearch'
    scope: project
    type: derep_fulllength
    uc:
    redirects:
        --sizein:
        --sizeout:
References¶
Rognes, T., Flouri, T., Nichols, B., Quince, C. and Mahé, F., 2016. VSEARCH: a versatile open source tool for metagenomics. PeerJ, 4, p.e2584.