CD-HIT-OTU-MiSeq, an Improved Approach for Clustering and Analyzing Paired End MiSeq 16S rRNA Sequences

2017 ◽  
Author(s):  
Weizhong Li ◽  
Yuanyuan Chang

Abstract: In recent years, Illumina MiSeq sequencers have replaced pyrosequencing platforms and become dominant in 16S rRNA sequencing. One unique feature of MiSeq technology, compared with pyrosequencing, is its Paired End (PE) reads, each of which can be sequenced to 250-300 bases to cover multiple variable regions of the 16S rRNA gene. However, the PE reads need to be assembled into a single contig at the beginning of the analysis. Although many methods can assemble PE reads into contigs, a large portion of PE reads cannot be accurately assembled because of the poor quality at the 3’ ends of both PE reads in the overlapping region. As a result, many sequences are discarded in the analysis. In this study, we developed a novel approach for clustering and annotating MiSeq-based 16S sequence data, CD-HIT-OTU-MiSeq. This new approach has four distinct novel features. (1) The package can cluster PE reads without joining them into contigs. (2) Users can choose a high-quality portion of the PE reads for analysis (e.g. the first 200 / 150 bases from forward / reverse reads), according to the base quality profile. (3) We implemented a tool that can splice out the target region (e.g. V3-V4) from a full-length 16S reference database into PE sequences. CD-HIT-OTU-MiSeq can cluster the spliced PE reference database together with samples, so we can derive Operational Taxonomic Units (OTUs) and annotate these OTUs concurrently. (4) Chimeric sequences are effectively identified through a de novo approach. The package offers high speed and high accuracy. The software package is freely available as an open-source package and is distributed along with CD-HIT from http://cd-hit.org. Within the CD-HIT package, CD-HIT-OTU-MiSeq is in the usecase folder.
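Feature (2) above, keeping only a high-quality prefix of each mate instead of joining the reads, can be illustrated with a minimal Python sketch. The function name `trim_pair`, its parameters, and the choice to drop pairs that are too short are assumptions made for illustration; they are not taken from the CD-HIT-OTU-MiSeq code.

```python
from typing import Optional, Tuple


def trim_pair(forward: str, reverse: str,
              fwd_len: int = 200, rev_len: int = 150) -> Optional[Tuple[str, str]]:
    """Keep the first fwd_len / rev_len bases of a read pair (hypothetical helper).

    Returns None when either mate is shorter than the requested length.
    Dropping such pairs is an assumption of this sketch, not a documented
    behaviour of the package.
    """
    if len(forward) < fwd_len or len(reverse) < rev_len:
        return None
    # The mates are kept as a pair and are never merged into a contig.
    return forward[:fwd_len], reverse[:rev_len]


if __name__ == "__main__":
    pair = trim_pair("A" * 250, "C" * 250)
    print(len(pair[0]), len(pair[1]))
```

The default lengths of 200 and 150 mirror the example given in the abstract; in practice they would be chosen from the base quality profile of the run.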

Algologia ◽  
2021 ◽  
Vol 31 (1) ◽  
pp. 93-113
Author(s):  
A.R. Nur Fadzliana ◽  
W.O. Wan Maznah ◽  
S.A.M. Nor ◽  
Choon Pin Foong ◽  
...  

Cyanobacteria are the most widespread group of photosynthetic prokaryotes. They are primary producers in a wide variety of habitats and are able to thrive in harsh environments, including polluted waters; therefore, this study was conducted to explore the cyanobacterial populations inhabiting river tributaries with different levels of pollution. Sediment samples (epipelon) were collected from selected tributaries of the Pinang River basin. The Air Terjun (T1) and Air Itam rivers (T2) represent the upper streams of the Pinang River basin, while the Dondang (T3) and Jelutong rivers (T4) are located in the middle of the river basin. The Pinang River (T5) is located near the estuary and is subjected to saline water intrusion during high tides. The cyanobacterial community was determined by identifying the taxa via 16S rRNA gene amplicon sequence data. 16S rRNA gene amplicons generated from collected samples were sequenced using Illumina MiSeq, with the targeted V3 and V4 regions yielding approximately 1 million reads per sample. Synechococcus, Phormidium, Arthronema and Leptolyngbya were found in all samples. The Shannon-Wiener diversity index was highest (H’ = 1.867) at the clean upstream station (T1), while the moderately polluted stream (T3) recorded the lowest diversity (H’ = 0.399), and the relatively polluted stations (T4 and T5) recorded fairly high values of H’. This study provides insights into the cyanobacterial community structure in the Pinang River basin via cultivation-independent techniques using 16S rRNA gene amplicon sequences. The occurrence of some morphospecies at specific locations showed that the cyanobacterial communities are quite distinct and have specific ecological demands. Some species that were ubiquitous might be able to tolerate varied environmental conditions.
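For readers unfamiliar with the H’ values reported above, the Shannon-Wiener index is H’ = -Σ p_i ln p_i, where p_i is the proportion of reads assigned to taxon i. The Python sketch below computes it from raw counts. The natural logarithm is an assumption, since the abstract does not state the log base, and the function name is illustrative.

```python
import math
from typing import Iterable


def shannon_index(counts: Iterable[int]) -> float:
    """Shannon-Wiener index H' = -sum(p_i * ln p_i) over taxa with non-zero counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if total == 0:
        return 0.0  # an empty sample has no diversity to measure
    h = 0.0
    for c in counts:
        p = c / total
        h -= p * math.log(p)  # natural log (assumed base)
    return h


if __name__ == "__main__":
    # Two equally abundant taxa give H' = ln 2.
    print(round(shannon_index([50, 50]), 4))
```

A sample dominated by a single taxon yields a value near zero, while an even spread across many taxa yields a higher value.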


2014 ◽  
Author(s):  
Jai Ram Rideout ◽  
Yan He ◽  
Jose Antonio Navas-Molina ◽  
William A Walters ◽  
Luke K Ursell ◽  
...  

We present a performance-optimized algorithm, subsampled open-reference OTU picking, for assigning marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis. This algorithm provides benefits over de novo OTU picking (clustering can be performed largely in parallel, reducing runtime) and closed-reference OTU picking (all reads are clustered, not only those that match a reference database sequence with high similarity). Because more of our algorithm can be run in parallel relative to “classic” open-reference OTU picking, it makes open-reference OTU picking tractable on massive amplicon sequence data sets (though on smaller data sets, “classic” open-reference OTU clustering is often faster). We illustrate that here by applying it to the first 15,000 samples sequenced for the Earth Microbiome Project (1.3 billion V4 16S rRNA amplicons). To the best of our knowledge, this is the largest OTU picking run ever performed, and we estimate that our new algorithm runs in less than one-fifth of the time that would be required by “classic” open-reference OTU picking. We show that subsampled open-reference OTU picking yields results that are highly correlated with those generated by “classic” open-reference OTU picking through comparisons on three well-studied datasets. An implementation of this algorithm is provided in the popular QIIME software package, which uses uclust for read clustering. All analyses were performed using QIIME’s uclust wrappers, though we provide details (aided by the open-source code in our GitHub repository) that will allow implementation of subsampled open-reference OTU picking independently of QIIME (e.g., in a compiled programming language, where runtimes should be further reduced). Our analyses should generalize to other implementations of these OTU picking algorithms. 
Finally, we present a comparison of parameter settings in QIIME’s OTU picking workflows and make recommendations on settings for these free parameters to optimize runtime without reducing the quality of the results. These optimized parameters can vastly decrease the runtime of uclust-based OTU picking in QIIME.
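The contrast drawn above between closed-reference and open-reference picking can be made concrete with a toy Python sketch. It is not the QIIME implementation and it omits the subsampling step entirely. Exact string equality stands in for the similarity search that a tool such as uclust performs, and the function name and the `denovo` label scheme are hypothetical.

```python
from typing import Dict, List


def toy_open_reference_pick(reads: Dict[str, str],
                            reference: Dict[str, str]) -> Dict[str, List[str]]:
    """Assign reads to reference OTUs, then cluster the leftovers de novo.

    Toy assumption: a read "hits" a reference only on exact sequence match,
    and de novo clusters group reads with identical sequences.
    """
    ref_by_seq = {seq: ref_id for ref_id, seq in reference.items()}
    denovo_by_seq: Dict[str, str] = {}
    otus: Dict[str, List[str]] = {}
    for read_id, seq in reads.items():
        if seq in ref_by_seq:
            # Closed-reference step: the read matches a known reference OTU.
            otu_id = ref_by_seq[seq]
        else:
            # Open-reference step: the read is kept and clustered de novo
            # instead of being discarded.
            if seq not in denovo_by_seq:
                denovo_by_seq[seq] = f"denovo{len(denovo_by_seq)}"
            otu_id = denovo_by_seq[seq]
        otus.setdefault(otu_id, []).append(read_id)
    return otus


if __name__ == "__main__":
    reads = {"r1": "ACGT", "r2": "TTTT", "r3": "TTTT"}
    print(toy_open_reference_pick(reads, {"ref1": "ACGT"}))
```

In a pure closed-reference run, reads `r2` and `r3` would be dropped; here they form a new OTU, which is the property the abstract highlights.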


2014 ◽  
Author(s):  
Catherine Burke ◽  
Aaron E Darling

We describe a method for sequencing full-length 16S rRNA gene amplicons using the high throughput Illumina MiSeq platform. The resulting sequences have about 100-fold higher accuracy than standard Illumina reads and are chimera filtered using information from a single molecule dual tagging scheme that boosts the signal available for chimera detection. We demonstrate that the data provides fine scale phylogenetic resolution not available from Illumina amplicon methods targeting smaller variable regions of the 16S rRNA gene.


2014 ◽  
Author(s):  
Jai Ram Rideout ◽  
Yan He ◽  
Jose Antonio Navas-Molina ◽  
William A Walters ◽  
Luke K Ursell ◽  
...  

We present a performance-optimized algorithm, subsampled open-reference OTU picking, for assigning marker gene (e.g., 16S rRNA) sequences generated on next-generation sequencing platforms to operational taxonomic units (OTUs) for microbial community analysis. This algorithm provides benefits over de novo OTU picking (clustering can be performed largely in parallel, reducing runtime) and closed-reference OTU picking (all reads are clustered, not only those that match a reference database sequence with high similarity). Because parts of our algorithm can be run in parallel, it makes open-reference OTU picking tractable on massive amplicon sequence data sets. We illustrate that here by applying it to the first 15,000 samples sequenced for the Earth Microbiome Project (1.3 billion V4 16S rRNA amplicons). To the best of our knowledge, this is the largest OTU picking run ever performed. We show that subsampled open-reference OTU picking yields results that are highly correlated with those generated by “legacy” open-reference OTU picking, where less of the process can be parallelized, through comparisons on three well-studied datasets. We therefore recommend that subsampled open-reference OTU picking always be applied in preference to “legacy” open-reference OTU picking. An implementation of this algorithm is provided in the popular QIIME software package. Finally, we present a comparison of parameter settings in QIIME’s OTU picking workflows and make recommendations on settings for these free parameters.


2021 ◽  
Vol 12 ◽  
Author(s):  
Chiron J. Anderson ◽  
Lucas R. Koester ◽  
Stephan Schmitz-Esser

In this meta-analysis, 17 rumen epithelial 16S rRNA gene Illumina MiSeq amplicon sequencing data sets were analyzed to identify a core rumen epithelial microbiota and core rumen epithelial OTUs shared between the different studies included. Sequences were quality-filtered and screened for chimeric sequences before performing closed-reference 97% OTU clustering, and de novo 97% OTU clustering. Closed-reference OTU clustering identified the core rumen epithelial OTUs, defined as any OTU present in ≥ 80% of the samples, while the de novo data was randomly subsampled to 10,000 reads per sample to generate phylum- and genus-level distributions and beta diversity metrics. 57 core rumen epithelial OTUs were identified including metabolically important taxa such as Ruminococcus, Butyrivibrio, and other Lachnospiraceae, as well as sulfate-reducing bacteria Desulfobulbus and Desulfovibrio. Two Betaproteobacteria OTUs (Neisseriaceae and Burkholderiaceae) were core rumen epithelial OTUs, in contrast to rumen content where previous literature indicates they are rarely found. Two core OTUs were identified as the methanogenic archaea Methanobrevibacter and Methanomethylophilaceae. These core OTUs are consistently present across the many variables between studies which include different host species, geographic region, diet, age, farm management practice, time of year, hypervariable region sequenced, and more. When considering only cattle samples, the number of core rumen epithelial OTUs expands to 147, highlighting the increased similarity within host species despite geographical location and other variables. De novo OTU clustering revealed highly similar rumen epithelial communities, predominated by Firmicutes, Bacteroidetes, and Proteobacteria at the phylum level which comprised 79.7% of subsampled sequences. The 15 most abundant genera represented an average of 54.5% of sequences in each individual study. 
These abundant taxa broadly overlap with the core rumen epithelial OTUs, with the exception of Prevotellaceae which were abundant, but not identified within the core OTUs. Our results describe the core and abundant bacteria found in the rumen epithelial environment and will serve as a basis to better understand the composition and function of rumen epithelial communities.
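The core definition used above (an OTU present in at least 80% of samples) reduces to a simple prevalence filter. The Python sketch below assumes an OTU table stored as a dictionary mapping sample IDs to per-OTU read counts. The data layout and function name are illustrative and are not taken from the study's pipeline.

```python
from typing import Dict, Set


def core_otus(table: Dict[str, Dict[str, int]],
              min_fraction: float = 0.8) -> Set[str]:
    """Return OTUs detected (count > 0) in at least min_fraction of samples."""
    n_samples = len(table)
    if n_samples == 0:
        return set()
    prevalence: Dict[str, int] = {}
    for counts in table.values():
        for otu, count in counts.items():
            if count > 0:
                prevalence[otu] = prevalence.get(otu, 0) + 1
    return {otu for otu, n in prevalence.items() if n / n_samples >= min_fraction}


if __name__ == "__main__":
    table = {
        "s1": {"otu_a": 12, "otu_b": 3},
        "s2": {"otu_a": 7, "otu_b": 0},
        "s3": {"otu_a": 1, "otu_b": 5},
        "s4": {"otu_a": 9, "otu_b": 2},
        "s5": {"otu_a": 0, "otu_b": 0},
    }
    print(sorted(core_otus(table)))
```

Here `otu_a` appears in 4 of 5 samples and meets the 80% threshold, while `otu_b` appears in 3 of 5 and does not.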


2013 ◽  
Vol 79 (17) ◽  
pp. 5112-5120 ◽  
Author(s):  
James J. Kozich ◽  
Sarah L. Westcott ◽  
Nielson T. Baxter ◽  
Sarah K. Highlander ◽  
Patrick D. Schloss

ABSTRACT Rapid advances in sequencing technology have changed the experimental landscape of microbial ecology. In the last 10 years, the field has moved from sequencing hundreds of 16S rRNA gene fragments per study using clone libraries to the sequencing of millions of fragments per study using next-generation sequencing technologies from 454 and Illumina. As these technologies advance, it is critical to assess the strengths, weaknesses, and overall suitability of these platforms for the interrogation of microbial communities. Here, we present an improved method for sequencing variable regions within the 16S rRNA gene using Illumina's MiSeq platform, which is currently capable of producing paired 250-nucleotide reads. We evaluated three overlapping regions of the 16S rRNA gene that vary in length (i.e., V34, V4, and V45) by resequencing a mock community and natural samples from human feces, mouse feces, and soil. By titrating the concentration of 16S rRNA gene amplicons applied to the flow cell and using a quality score-based approach to correct discrepancies between reads used to construct contigs, we were able to reduce error rates by as much as two orders of magnitude. Finally, we reprocessed samples from a previous study to demonstrate that large numbers of samples could be multiplexed and sequenced in parallel with shotgun metagenomes. These analyses demonstrate that our approach can provide data that are at least as good as those generated by the 454 platform while providing considerably higher sequencing coverage for a fraction of the cost.
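The idea of using quality scores to resolve disagreements between overlapping reads can be sketched in Python as follows. This is a simplified illustration, not the rule used in the paper: the inputs are assumed to be already aligned and of equal length, the higher-quality base always wins, and exact ties become `N`. The function name is hypothetical.

```python
from typing import List


def resolve_overlap(fwd_bases: str, fwd_quals: List[int],
                    rev_bases: str, rev_quals: List[int]) -> str:
    """Build a consensus over an aligned overlap, preferring the higher-quality base.

    Assumes rev_bases is already reverse-complemented and aligned to fwd_bases.
    """
    if not (len(fwd_bases) == len(rev_bases) == len(fwd_quals) == len(rev_quals)):
        raise ValueError("bases and quality lists must all have the same length")
    consensus = []
    for fb, fq, rb, rq in zip(fwd_bases, fwd_quals, rev_bases, rev_quals):
        if fb == rb:
            consensus.append(fb)   # reads agree, nothing to correct
        elif fq > rq:
            consensus.append(fb)   # forward read is more trustworthy here
        elif rq > fq:
            consensus.append(rb)   # reverse read is more trustworthy here
        else:
            consensus.append("N")  # tie: leave the base ambiguous
    return "".join(consensus)


if __name__ == "__main__":
    print(resolve_overlap("ACGT", [30, 30, 10, 20], "ACTT", [30, 30, 35, 20]))
```

A production tool would typically require the quality difference to exceed some margin before trusting one base over the other; that threshold is left out of this sketch.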


mSystems ◽  
2021 ◽  
Author(s):  
Farnaz Fouladi ◽  
Jacqueline B. Young ◽  
Anthony A. Fodor

Recent bioinformatics development has enabled the detection of sequence variants with a high resolution of only one single-nucleotide difference in 16S rRNA gene sequence data. Despite this progress, there are several limitations that can be associated with variant calling pipelines, such as producing a large number of low-abundance sequence variants which need to be filtered out with arbitrary thresholds in downstream analyses or having a slow runtime.


2009 ◽  
Vol 75 (23) ◽  
pp. 7537-7541 ◽  
Author(s):  
Patrick D. Schloss ◽  
Sarah L. Westcott ◽  
Thomas Ryabin ◽  
Justine R. Hall ◽  
Martin Hartmann ◽  
...  

ABSTRACT mothur aims to be a comprehensive software package that allows users to use a single piece of software to analyze community sequence data. It builds upon previous tools to provide a flexible and powerful software package for analyzing sequencing data. As a case study, we used mothur to trim, screen, and align sequences; calculate distances; assign sequences to operational taxonomic units; and describe the α and β diversity of eight marine samples previously characterized by pyrosequencing of 16S rRNA gene fragments. This analysis of more than 222,000 sequences was completed in less than 2 h with a laptop computer.

