OptiFit: an improved method for fitting amplicon sequences to existing OTUs

Assigning amplicon sequences to operational taxonomic units (OTUs) is often an important step in characterizing the composition of microbial communities across large datasets. OptiClust, a de novo OTU clustering method, has been shown to produce higher quality OTU assignments than other methods and at comparable or faster speeds. A notable difference between de novo clustering and database-dependent reference clustering methods is that OTU assignments from de novo methods may change when new sequences are added to a dataset. However, in some cases one may wish to incorporate new samples into a previously clustered dataset without performing clustering again on all sequences, such as when comparing across datasets or deploying machine learning models where OTUs are features. Existing reference-based clustering methods produce consistent OTUs, but they only consider the similarity of each query sequence to a single reference sequence in an OTU, thus resulting in OTU assignments that are significantly worse than those generated by de novo methods. To provide an efficient and robust method to fit amplicon sequence data to existing OTUs, we developed the OptiFit algorithm. Inspired by OptiClust, OptiFit considers the similarity of all pairs of reference and query sequences in an OTU to produce OTUs of the best possible quality. We tested OptiFit using four microbiome datasets with two different strategies: by clustering to an external reference database or by splitting the dataset into a reference and query set and clustering the query sequences to the reference set after clustering it using OptiClust. The result is an improved implementation of closed and open-reference clustering. OptiFit produces OTUs of similar quality to OptiClust and at faster speeds when using the split dataset strategy, although the OTU quality and processing speed depend on the database chosen when using the external database strategy.
OptiFit provides a suitable option for users who require consistent OTU assignments at the same quality afforded by de novo clustering methods.

Variant calling for cpn60 barcode sequence-based microbiome profiling

10.1101/749267 ◽

2019 ◽

Author(s):

Sarah J. Vancuren ◽

Scott J. Dos Santos ◽

Janet E. Hill ◽

Keyword(s):

De Novo ◽

Sequence Data ◽

Variant Calling ◽

Taxonomic Composition ◽

Species Level ◽

Reference Sequence ◽

Sequence Length ◽

Sequence Variant ◽

Operational Taxonomic Units ◽

Microbiome Profiling

Abstract: Amplification and sequencing of conserved genetic barcodes such as the cpn60 gene is a common approach to determining the taxonomic composition of microbiomes. Exact sequence variant calling has been proposed as an alternative to previously established methods for aggregation of sequence reads into operational taxonomic units (OTU). We investigated the utility of variant calling for cpn60 barcode sequences and determined the minimum sequence length required to provide species-level resolution. Sequence data from the 5’ region of the cpn60 barcode amplified from the human vaginal microbiome (n=45) and a mock community were used to compare variant calling to de novo assembly of reads and to mapping to a reference sequence database, in terms of the number of OTU formed and overall community composition. Variant calling resulted in microbiome profiles that were consistent in apparent composition with those generated with the other methods but with significant logistical advantages. Variant calling is rapid, achieves high resolution of taxa, and does not require reference sequence data. Our results further demonstrate that 150 bp from the 5’ end of the cpn60 barcode sequence is sufficient to provide species-level resolution of microbiota.

Peer Review #1 of "De novo clustering methods outperform reference-based methods for assigning 16S rRNA gene sequences to operational taxonomic units (v0.2)"

10.7287/peerj.1487v0.2/reviews/1 ◽

2015 ◽

Keyword(s):

16S Rrna ◽

16S Rrna Gene ◽

Peer Review ◽

De Novo ◽

Rrna Gene ◽

Clustering Methods ◽

Gene Sequences ◽

16S Rrna Gene Sequences ◽

Operational Taxonomic Units

CD-HIT-OTU-MiSeq, an Improved Approach for Clustering and Analyzing Paired End MiSeq 16S rRNA Sequences

10.1101/153783 ◽

2017 ◽

Cited By ~ 3

Author(s):

Weizhong Li ◽

Yuanyuan Chang

Keyword(s):

16S Rrna ◽

High Speed ◽

De Novo ◽

Sequence Data ◽

Illumina Miseq ◽

Poor Quality ◽

Reference Database ◽

Rrna Gene ◽

Variable Regions ◽

Novel Approach

Abstract: In recent years, Illumina MiSeq sequencers have replaced pyrosequencing platforms and become dominant in 16S rRNA sequencing. One unique feature of MiSeq technology, compared with pyrosequencing, is its paired-end (PE) reads: each read can be sequenced to 250-300 bases to cover multiple variable regions of the 16S rRNA gene. However, the PE reads need to be assembled into a single contig at the beginning of the analysis. Although many methods are capable of assembling PE reads into contigs, a large portion of PE reads cannot be accurately assembled because of the poor quality at the 3’ ends of both PE reads in the overlapping region. As a result, many sequences are discarded in the analysis. In this study, we developed a novel approach for clustering and annotating MiSeq-based 16S sequence data, CD-HIT-OTU-MiSeq. This new approach has four distinct novel features. (1) The package can cluster PE reads without joining them into contigs. (2) Users can choose a high-quality portion of the PE reads for analysis (e.g. first 200 / 150 bases from forward / reverse reads), according to the base quality profile. (3) We implemented a tool that can splice out the target region (e.g. V3-V4) from a full-length 16S reference database into the PE sequences. CD-HIT-OTU-MiSeq can cluster the spliced PE reference database together with samples, so we can derive Operational Taxonomic Units (OTUs) and annotate these OTUs concurrently. (4) Chimeric sequences are effectively identified through a de novo approach. The package offers high speed and high accuracy. The software package is freely available as an open source package and is distributed along with CD-HIT from http://cd-hit.org. Within the CD-HIT package, CD-HIT-OTU-MiSeq is in the usecase folder.

Peer Review #2 of "De novo clustering methods outperform reference-based methods for assigning 16S rRNA gene sequences to operational taxonomic units (v0.2)"

10.7287/peerj.1487v0.2/reviews/2 ◽

2015 ◽

Author(s):

TS Schmidt

Keyword(s):

16S Rrna ◽

16S Rrna Gene ◽

Peer Review ◽

De Novo ◽

Rrna Gene ◽

Clustering Methods ◽

Gene Sequences ◽

16S Rrna Gene Sequences ◽

Operational Taxonomic Units

Swarm: robust and fast clustering method for amplicon-based studies

10.7287/peerj.preprints.386 ◽

2014 ◽

Author(s):

Frédéric Mahé ◽

Torbjørn Rognes ◽

Christopher Quince ◽

Colomban de Vargas ◽

Micah Dunthorn

Keyword(s):

Internal Structure ◽

De Novo ◽

Biological Information ◽

Clustering Methods ◽

Clustering Method ◽

Operational Taxonomic Units ◽

Local Threshold

Popular de novo amplicon clustering methods suffer from two fundamental flaws: arbitrary global clustering thresholds, and input-order dependency induced by centroid selection. Swarm was developed to address these issues by first clustering nearly identical amplicons iteratively using a local threshold, and then by using clusters' internal structure and amplicon abundances to refine its results. This fast, scalable, and input-order independent approach reduces the influence of clustering parameters and produces robust operational taxonomic units, improving the amount of meaningful biological information that can be extracted from amplicon-based studies.
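
The first phase described in this abstract, growing clusters iteratively with a local threshold, can be illustrated with a minimal Python sketch. This is not Swarm's implementation: the function names are hypothetical, mismatch counting on equal-length strings stands in for a proper count of differences, the seed ordering by abundance is an assumption made to keep the result independent of input order, and the abundance-based refinement phase is omitted.

```python
from collections import deque


def differences(a, b):
    """Count mismatches between two strings (simplified stand-in distance)."""
    if len(a) != len(b):
        # Toy rule (assumption): unequal lengths are treated as far apart.
        return max(len(a), len(b))
    return sum(x != y for x, y in zip(a, b))


def local_threshold_clusters(abundances, d=1):
    """Grow each cluster by linking amplicons within d differences of any member."""
    # Deterministic seed order (assumption): most abundant first, ties by sequence.
    pending = sorted(abundances, key=lambda s: (-abundances[s], s))
    unassigned = set(pending)
    clusters = []
    for seed in pending:
        if seed not in unassigned:
            continue
        unassigned.discard(seed)
        cluster, queue = [seed], deque([seed])
        while queue:
            current = queue.popleft()
            # The threshold is applied locally, to the current member only.
            neighbours = sorted(s for s in unassigned if differences(current, s) <= d)
            for s in neighbours:
                unassigned.discard(s)
                cluster.append(s)
                queue.append(s)
        clusters.append(cluster)
    return clusters


reads = {"ACGT": 10, "ACGA": 3, "ACCA": 1, "TTTT": 5, "TTTA": 2}
clusters = local_threshold_clusters(reads, d=1)
shuffled = dict(reversed(list(reads.items())))
clusters_shuffled = local_threshold_clusters(shuffled, d=1)
```

In this toy input, "ACCA" differs from the seed "ACGT" at two positions but still joins its cluster through the intermediate "ACGA", which is the effect of applying the threshold locally instead of globally around a centroid.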

Consistent, comprehensive and computationally efficient OTU definitions

10.7287/peerj.preprints.411 ◽

2014 ◽

Author(s):

Jai Ram Rideout ◽

Yan He ◽

Jose Antonio Navas-Molina ◽

William A Walters ◽

Luke K Ursell ◽

...

Keyword(s):

16S Rrna ◽

De Novo ◽

Sequence Data ◽

Marker Gene ◽

Community Analysis ◽

Microbial Community Analysis ◽

Reference Database ◽

Data Sets ◽

Computationally Efficient ◽

Sequencing Platforms

We present a performance-optimized algorithm, subsampled open-reference OTU picking, for assigning marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis. This algorithm provides benefits over de novo OTU picking (clustering can be performed largely in parallel, reducing runtime) and closed-reference OTU picking (all reads are clustered, not only those that match a reference database sequence with high similarity). Because more of our algorithm can be run in parallel relative to “classic” open-reference OTU picking, it makes open-reference OTU picking tractable on massive amplicon sequence data sets (though on smaller data sets, “classic” open-reference OTU clustering is often faster). We illustrate that here by applying it to the first 15,000 samples sequenced for the Earth Microbiome Project (1.3 billion V4 16S rRNA amplicons). To the best of our knowledge, this is the largest OTU picking run ever performed, and we estimate that our new algorithm runs in less than 1/5 of the time that would be required for “classic” open-reference OTU picking. We show that subsampled open-reference OTU picking yields results that are highly correlated with those generated by “classic” open-reference OTU picking through comparisons on three well-studied datasets. An implementation of this algorithm is provided in the popular QIIME software package, which uses uclust for read clustering. All analyses were performed using QIIME’s uclust wrappers, though we provide details (aided by the open-source code in our GitHub repository) that will allow implementation of subsampled open-reference OTU picking independently of QIIME (e.g., in a compiled programming language, where runtimes should be further reduced). Our analyses should generalize to other implementations of these OTU picking algorithms.
Finally, we present a comparison of parameter settings in QIIME’s OTU picking workflows and make recommendations on settings for these free parameters to optimize runtime without reducing the quality of the results. These optimized parameters can vastly decrease the runtime of uclust-based OTU picking in QIIME.

Accuracy of microbial community diversity estimated by closed- and open-reference OTUs

PeerJ ◽

10.7717/peerj.3889 ◽

2017 ◽

Vol 5 ◽

pp. e3889 ◽

Cited By ~ 69

Author(s):

Robert C. Edgar

Keyword(s):

Ribosomal Rna ◽

De Novo ◽

Community Diversity ◽

Reference Database ◽

Mock Community ◽

Variable Regions ◽

Operational Taxonomic Units ◽

Sequencing Technologies ◽

Generation Sequencing ◽

Mock Communities

Next-generation sequencing of 16S ribosomal RNA is widely used to survey microbial communities. Sequences are typically assigned to Operational Taxonomic Units (OTUs). Closed- and open-reference OTU assignment matches reads to a reference database at 97% identity (closed), then clusters unmatched reads using a de novo method (open). Implementations of these methods in the QIIME package were tested on several mock community datasets with 20 strains using different sequencing technologies and primers. Richness (number of reported OTUs) was often greatly exaggerated, with hundreds or thousands of OTUs generated on Illumina datasets. Between-sample diversity was also found to be highly exaggerated in many cases, with weighted Jaccard distances between identical mock samples often close to one, indicating very low similarity. Non-overlapping hyper-variable regions in 70% of species were assigned to different OTUs. On mock communities with Illumina V4 reads, 56% to 88% of predicted genus names were false positives. Biological inferences obtained using these methods are therefore not reliable.
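
The two assignment modes defined in this abstract (match reads to a reference at 97% identity, then cluster the unmatched reads de novo) can be sketched in a few lines of Python. This is a hedged illustration, not QIIME's implementation: the function names are hypothetical, identity is computed as the fraction of matching positions on pre-aligned, equal-length sequences, and a greedy centroid rule is assumed for the de novo step.

```python
def identity(a, b):
    """Fraction of matching positions; assumes pre-aligned, equal-length sequences."""
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def closed_reference(reads, references, threshold=0.97):
    """Assign each read to its most similar reference if identity >= threshold."""
    assigned, unmatched = {}, []
    for read_id, seq in reads.items():
        best_id, best = None, 0.0
        for ref_id, ref_seq in references.items():
            score = identity(seq, ref_seq)
            if score > best:
                best_id, best = ref_id, score
        if best_id is not None and best >= threshold:
            assigned[read_id] = best_id
        else:
            unmatched.append(read_id)
    return assigned, unmatched


def open_reference(reads, references, threshold=0.97):
    """Closed-reference step, then greedy de novo clustering of the leftovers."""
    assigned, unmatched = closed_reference(reads, references, threshold)
    centroids = {}  # new OTU id -> centroid sequence (greedy rule is an assumption)
    for read_id in unmatched:
        seq = reads[read_id]
        for otu_id, centroid in centroids.items():
            if identity(seq, centroid) >= threshold:
                assigned[read_id] = otu_id
                break
        else:
            # No existing centroid is close enough: this read founds a new OTU.
            otu_id = f"denovo{len(centroids)}"
            centroids[otu_id] = seq
            assigned[read_id] = otu_id
    return assigned


references = {"ref1": "A" * 100}
reads = {
    "r1": "A" * 98 + "CC",           # 98% identical to ref1
    "r2": "A" * 90 + "C" * 10,       # 90% identical to ref1: no reference match
    "r3": "A" * 90 + "C" * 9 + "G",  # 99% identical to r2
}
closed, leftovers = closed_reference(reads, references)
combined = open_reference(reads, references)
```

In closed mode, r2 and r3 are simply left unassigned; in open mode they form a new OTU together. Note that each read is compared only with a single reference or centroid sequence, so two reads in the same OTU may be less than 97% identical to each other.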

Metagenome sequence clustering with hash-based canopies

Journal of Bioinformatics and Computational Biology ◽

10.1142/s0219720017400066 ◽

2017 ◽

Vol 15 (06) ◽

pp. 1740006 ◽

Cited By ~ 6

Author(s):

Mohammad Arifur Rahman ◽

Nathan LaPierre ◽

Huzefa Rangwala ◽

Daniel Barbara

Keyword(s):

Large Scale ◽

Clustering Algorithm ◽

State Of The Art ◽

Sequence Data ◽

Clustering Algorithms ◽

Clustering Methods ◽

Operational Taxonomic Units ◽

Sequence Clustering ◽

Scalable Clustering ◽

Metagenome Sequence

Metagenomics is the collective sequencing of co-existing microbial communities which are ubiquitous across various clinical and ecological environments. Due to the large volume and random short sequences (reads) obtained from community sequences, analyses of the diversity, abundance and functions of different organisms within these communities are challenging tasks. We present a fast and scalable clustering algorithm for analyzing large-scale metagenome sequence data. Our approach achieves efficiency by partitioning the large number of sequence reads into groups (called canopies) using hashing. These canopies are then refined by using state-of-the-art sequence clustering algorithms. This canopy-clustering (CC) algorithm can be used as a pre-processing phase for computationally expensive clustering algorithms. We use and compare three hashing schemes for canopy construction with five popular and state-of-the-art sequence clustering methods. We evaluate our clustering algorithm on synthetic and real-world 16S and whole metagenome benchmarks. We demonstrate the ability of our proposed approach to determine meaningful Operational Taxonomic Units (OTU) and observe significant speedup with regard to run time when compared to different clustering algorithms. We also make our source code publicly available on GitHub.
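
The two-stage idea in this abstract (hash reads into canopies, then refine each canopy with a conventional clustering method) can be sketched as follows. The abstract does not name its three hashing schemes, so the key used here, the lexicographically smallest k-mer of each read, is purely a hypothetical stand-in, as are all function names; the refinement step is passed in as a callable.

```python
from collections import defaultdict


def signature(seq, k=4):
    """Hypothetical hash key: the lexicographically smallest k-mer in the read."""
    if len(seq) < k:
        return seq
    return min(seq[i:i + k] for i in range(len(seq) - k + 1))


def build_canopies(reads, k=4):
    """Partition reads into canopies keyed by their signature."""
    canopies = defaultdict(list)
    for read in reads:
        canopies[signature(read, k)].append(read)
    return dict(canopies)


def cluster_with_canopies(reads, refine, k=4):
    """Run an expensive clustering function separately inside each canopy."""
    canopies = build_canopies(reads, k)
    clusters = []
    for key in sorted(canopies):
        clusters.extend(refine(canopies[key]))
    return clusters


def one_cluster_per_canopy(group):
    """Placeholder refinement step; a real clustering method would go here."""
    return [group]


reads = ["AACGTT", "TTAACG", "GGGTCC", "TGGGTC"]
canopies = build_canopies(reads)
clusters = cluster_with_canopies(reads, one_cluster_per_canopy)
```

The speedup comes from the refinement step only ever comparing reads that share a canopy: with four reads split into two canopies of two, a pairwise method performs 2 comparisons instead of 6.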

Rapid, raw-read reference and identification (R4IDs): A flexible platform for rapid generic species ID using long-read sequencing technology

10.1101/281048 ◽

2018 ◽

Cited By ~ 2

Author(s):

Joe Parker ◽

Andrew Helmstetter ◽

James Crowe ◽

John Iacona ◽

Dion Devey ◽

...

Keyword(s):

Dna Sequencing ◽

Species Identification ◽

Sequence Data ◽

Vascular Plant ◽

Reference Sequence ◽

Read Length ◽

Reference Database ◽

Sequencing Technology ◽

Long Read ◽

Suitable Reference

Abstract: The versatility of the current DNA sequencing platforms and the development of portable nanopore sequencers mean that it has never been easier to collect genetic data for unknown sample ID. DNA barcoding and meta-barcoding have become increasingly popular and barcode databases continue to grow at an impressive rate. However, the number of canonical genome assemblies (reference or draft) that are publicly available is relatively tiny, hindering the more widespread use of genome-scale DNA sequencing technology for accurate species identification and discovery. Here, we show that rapid raw-read reference datasets, or R4IDs for short, generated in a matter of hours on the Oxford Nanopore MinION, can bridge this gap and accelerate the generation of usable reference sequence data. By exploiting the long read length of this technology, shotgun genomic sequencing of a small portion of an organism’s genome can act as a suitable reference database despite the low sequencing coverage. These R4IDs can then be used for accurate species identification with minimal amounts of re-sequencing effort (1000s of reads). We demonstrated the capabilities of this approach with six vascular plant species for which we created R4IDs in the laboratory and then re-sequenced, live at the Kew Science Festival 2016. We further validated our method using simulations to determine the broader applicability of the approach. Our data analysis pipeline has been made available as a Dockerised workflow for simple, scalable deployment for a range of uses.

Comparison of three clustering approaches for detecting novel environmental microbial diversity

10.7287/peerj.preprints.1414 ◽

2015 ◽

Author(s):

Dominik Forster ◽

Micah Dunthorn ◽

Thorsten Stoeck ◽

Frédéric Mahé

Keyword(s):

Graph Theory ◽

Microbial Ecology ◽

High Throughput Sequencing ◽

De Novo ◽

Sequence Similarity ◽

Pairwise Alignment ◽

Reference Database ◽

Clustering Methods ◽

Underlying Network ◽

Network Topologies

Discovery of novel diversity in high-throughput sequencing (HTS) studies is a central task in environmental microbial ecology. To evaluate the effects that amplicon clustering methods have on novel diversity discovery, we clustered an environmental marine HTS dataset of protist reads together with accessions from the taxonomically curated PR2 reference database using three de novo approaches: sequence similarity networks, USEARCH, and Swarm. The novel diversity uncovered by each clustering approach differed drastically in the number of operational taxonomic units (OTUs) and the number of environmental amplicons in these novel diversity OTUs. Global pairwise alignment comparisons revealed that numerous amplicons classified as novel by USEARCH and Swarm were actually highly similar to reference accessions. Using graph theory, we found additional novel diversity within OTUs that would have gone unnoticed without further use of their underlying network topologies. Our results suggest that novel diversity inferred from clustering approaches requires further validation, whereas graph theory provides a powerful tool for microbial ecology and the analyses of environmental HTS datasets.
