Systematic processing of ribosomal RNA gene amplicon sequencing data

GigaScience ◽  
2019 ◽  
Vol 8 (12) ◽  
Author(s):  
Julien Tremblay ◽  
Etienne Yergeau

Abstract Background: With the advent of high-throughput sequencing, microbiology is becoming increasingly data-intensive. Because of its low cost, robust databases, and established bioinformatic workflows, sequencing of 16S/18S/ITS ribosomal RNA (rRNA) gene amplicons, which provides a marker of choice for phylogenetic studies, has become ubiquitous. Many established end-to-end bioinformatic pipelines are available to perform short amplicon sequence data analysis. These pipelines suit a general audience, but few options exist for more specialized users who are experienced in code scripting, Linux-based systems, and high-performance computing (HPC) environments. For such an audience, existing pipelines can make it difficult to fully leverage modern HPC capabilities and to perform tweaking and optimization operations. Moreover, a wealth of stand-alone software packages that perform specific targeted bioinformatic tasks are increasingly accessible, and finding a way to easily integrate these applications in a pipeline is critical to the evolution of bioinformatic methodologies. Results: Here we describe AmpliconTagger, a short rRNA marker gene amplicon pipeline coded in a Python framework that enables fine tuning and integration of virtually any potential rRNA gene amplicon bioinformatic procedure. It is designed to work within an HPC environment, supporting a complex network of job dependencies with a smart-restart mechanism in case of job failure or parameter modifications. As proof of concept, we present end results obtained with AmpliconTagger using 16S, 18S, ITS rRNA short gene amplicons and Pacific Biosciences long-read amplicon data types as input. Conclusions: Using a selection of published algorithms for generating operational taxonomic units and amplicon sequence variants and for computing downstream taxonomic summaries and diversity metrics, we demonstrate the performance and versatility of our pipeline for systematic analyses of amplicon sequence data.

2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Eric J. Raes ◽  
Kristen Karsh ◽  
Swan L. S. Sow ◽  
Martin Ostrowski ◽  
Mark V. Brown ◽  
...  

Abstract Global oceanographic monitoring initiatives originally measured abiotic essential ocean variables but are currently incorporating biological and metagenomic sampling programs. There is, however, a large knowledge gap on how to infer bacterial functions, the information sought by biogeochemists, ecologists, and modelers, from the bacterial taxonomic information (produced by bacterial marker gene surveys). Here, we provide a correlative understanding of how a bacterial marker gene (16S rRNA) can be used to infer latitudinal trends for metabolic pathways in global monitoring campaigns. From a transect spanning 7000 km in the South Pacific Ocean we infer ten metabolic pathways from 16S rRNA gene sequences and 11 corresponding metagenome samples, which relate to metabolic processes of primary productivity, temperature-regulated thermodynamic effects, coping strategies for nutrient limitation, energy metabolism, and organic matter degradation. This study demonstrates that low-cost, high-throughput bacterial marker gene data can be used to infer shifts in the metabolic strategies at the community scale.


2009 ◽  
Vol 75 (23) ◽  
pp. 7537-7541 ◽  
Author(s):  
Patrick D. Schloss ◽  
Sarah L. Westcott ◽  
Thomas Ryabin ◽  
Justine R. Hall ◽  
Martin Hartmann ◽  
...  

ABSTRACT mothur aims to be a comprehensive software package that allows users to use a single piece of software to analyze community sequence data. It builds upon previous tools to provide a flexible and powerful software package for analyzing sequencing data. As a case study, we used mothur to trim, screen, and align sequences; calculate distances; assign sequences to operational taxonomic units; and describe the α and β diversity of eight marine samples previously characterized by pyrosequencing of 16S rRNA gene fragments. This analysis of more than 222,000 sequences was completed in less than 2 h with a laptop computer.


PeerJ ◽  
2018 ◽  
Vol 6 ◽  
pp. e5382 ◽  
Author(s):  
Fernanda Cornejo-Granados ◽  
Luigui Gallardo-Becerra ◽  
Miriam Leonardo-Reza ◽  
Juan Pablo Ochoa-Romo ◽  
Adrian Ochoa-Leyva

The shrimp or prawn is the most valuable traded marine product in the world market today and its microbiota plays an essential role in its development, physiology, and health. The technological advances and dropping costs of high-throughput sequencing have increased the number of studies characterizing the shrimp microbiota. However, the application of different experimental and bioinformatics protocols makes it difficult to compare different studies to reach general conclusions about shrimp microbiota. To address this need, we report the first meta-analysis of the microbiota from freshwater and marine shrimps using all publicly available sequences of the 16S ribosomal gene (16S rRNA gene). We obtained data for 199 samples, of which 63.3% were from marine (Alvinocaris longirostris, Litopenaeus vannamei and Penaeus monodon), and 36.7% were from freshwater (Macrobrachium asperulum, Macrobrachium nipponense, Macrobrachium rosenbergii, and Neocaridina denticulata) shrimps. Technical variations among studies, such as selected primers, hypervariable region, and sequencing platform showed a significant impact on the microbiota structure. Additionally, the ANOSIM and PERMANOVA analyses revealed that the most important biological factor in structuring the shrimp microbiota was the marine and freshwater environment (ANOSIM R = 0.54, P = 0.001; PERMANOVA pseudo-F = 21.8, P = 0.001), with freshwater shrimps showing higher bacterial diversity than marine shrimps. Then, for marine shrimps, the most relevant biological factors impacting the microbiota composition were lifestyle (ANOSIM R = 0.341, P = 0.001; PERMANOVA pseudo-F = 8.50, P = 0.0001), organ (ANOSIM R = 0.279, P = 0.001; PERMANOVA pseudo-F = 6.68, P = 0.001) and developmental stage (ANOSIM R = 0.240, P = 0.001; PERMANOVA pseudo-F = 5.05, P = 0.001).
According to the lifestyle, organ, developmental stage, diet, and health status, the highest diversity was found in wild-type, intestine, adult, wild-type diet, and healthy samples, respectively. Additionally, we used PICRUSt to predict the potential functions of the microbiota, and we found that the organ had more differentially enriched functions (93), followed by developmental stage (12) and lifestyle (9). Our analysis demonstrated that despite the impact of technical and bioinformatics factors, the biological factors were also statistically significant in shaping the microbiota. These results show that cross-study comparisons are a valuable resource for the improvement of the shrimp microbiota and microbiome fields. Thus, it is important that future studies make public their sequencing data, allowing other researchers to reach more powerful conclusions about the microbiota in this non-model organism. To our knowledge, this is the first meta-analysis that aims to define the shrimp microbiota.


2021 ◽  
Author(s):  
Jiaqi Li ◽  
Lei Wei ◽  
Xianglin Zhang ◽  
Wei Zhang ◽  
Haochen Wang ◽  
...  

ABSTRACT Detecting cancer signals in cell-free DNA (cfDNA) high-throughput sequencing data is emerging as a novel non-invasive cancer detection method. Due to the high cost of sequencing, it is crucial to make robust and precise predictions with low-depth cfDNA sequencing data. Here we propose a novel approach named DISMIR, which can provide ultrasensitive and robust cancer detection by integrating DNA sequence and methylation information in plasma cfDNA whole genome bisulfite sequencing (WGBS) data. DISMIR introduces a new feature termed “switching region” to define cancer-specific differentially methylated regions, which can enrich the cancer-related signal at read-resolution. DISMIR applies a deep learning model to predict the source of every single read based on its DNA sequence and methylation state, and then predicts the risk that the plasma donor is suffering from cancer. DISMIR exhibited high accuracy and robustness on hepatocellular carcinoma detection by plasma cfDNA WGBS data even at ultra-low sequencing depths. Analysis showed that DISMIR tends to be insensitive to alterations of single CpG sites’ methylation states, which suggests DISMIR could resist the technical noise of WGBS. All these results show that DISMIR has the potential to be a precise and robust method for low-cost early cancer detection.


2018 ◽  
Author(s):  
Nathaniel R. Glasser ◽  
Ryan C. Hunter ◽  
Theodore G. Liou ◽  
Dianne K. Newman ◽  

Summary Pseudomonas aeruginosa lung infections are a leading cause of morbidity and mortality in cystic fibrosis (CF) patients (1, 2). Our laboratory has studied a class of small molecules produced by P. aeruginosa known as phenazines, including pyocyanin and its biogenic precursor phenazine-1-carboxylic acid (PCA). As phenazines are known virulence factors (3), we and others have explored the possibility of using phenazine concentrations as a marker for disease progression (4–6). Previously, we reported that sputum concentrations of pyocyanin and PCA negatively correlate with lung function in cystic fibrosis patients (6). Our study used high performance liquid chromatography (HPLC) to quantify phenazines by UV–vis absorbance after extraction from lung sputum. Since our initial study, methods for metabolite analysis have advanced considerably, aided in large part by usage of mass spectrometry (LC-MS) and tandem mass spectrometry (LC-MS/MS). Because a more recent study employing LC-MS/MS revealed a surprising decoupling of P. aeruginosa metabolites in sputum and the detection of P. aeruginosa through culturing or microbiome profiles (4), we decided to check whether we could reproduce our previous findings by analyzing sputum samples from a different patient cohort with a new LC-MS instrument in our laboratory. Our new samples were provided by the Mountain West CF Consortium Sputum Biomarker study (7). In the course of performing our new analyses, comparison of our old HPLC data to our new LC-MS data led us to realize that the peak previously assigned to PCA instead originates from heme, and the peak assigned to pyocyanin originates from an as-yet unknown compound. This correction only affects the measurements of phenazines in sputum, and we are confident in the phenazine measurements from isolated cultures and the 16S rRNA gene sequencing data from that study (6).
Here we outline the basis for our correction and present additional data showing that heme concentration negatively correlates with lung function in cystic fibrosis patients.


2021 ◽  
Author(s):  
Yiheng Hu ◽  
Laszlo Irinyi ◽  
Minh Thuy Vi Hoang ◽  
Tavish Eenjes ◽  
Abigail Graetz ◽  
...  

Background: The kingdom Fungi is crucial for life on earth and is highly diverse. Yet fungi are challenging to characterize. They can be difficult to culture and may be morphologically indistinct in culture. They can have complex genomes of over 1 Gb in size and are still underrepresented in whole genome sequence databases. Overall, their description and analysis lag far behind those of other microbes such as bacteria. At the same time, classification of species via high throughput sequencing without prior purification is increasingly becoming the norm for pathogen detection, microbiome studies, and environmental monitoring. However, standardized procedures for characterizing unknown fungi from complex sequencing data have not yet been established. Results: We compared different metagenomics sequencing and analysis strategies for the identification of fungal species. Using two fungal mock communities of 44 phylogenetically diverse species, we compared species classification and community composition analysis pipelines using shotgun metagenomics and amplicon sequencing data generated from both short and long read sequencing technologies. We show that regardless of the sequencing methodology used, the highest accuracy of species identification was achieved by sequence alignment against a fungi-specific database. During the assessment of classification algorithms, we found that applying cut-offs to the query coverage of each read or contig significantly improved the classification accuracy and community composition analysis without significant data loss. Conclusion: Overall, our study expands the toolkit for identifying fungi by improving sequence-based fungal classification, and provides a practical guide for the design of metagenomics analyses.
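The query-coverage cut-off mentioned in the Results can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the `Hit` record, its field names, and the 0.8 threshold are hypothetical and are not taken from the study, which does not state its cut-off values in this abstract. Query coverage is taken to mean the fraction of the read or contig spanned by the alignment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hit:
    """Hypothetical alignment record; field names are illustrative only."""
    query_id: str
    taxon: str
    query_start: int   # 1-based, inclusive
    query_end: int     # 1-based, inclusive
    query_length: int  # full length of the read or contig


def query_coverage(hit: Hit) -> float:
    """Fraction of the query sequence covered by the alignment."""
    return (hit.query_end - hit.query_start + 1) / hit.query_length


def filter_by_coverage(hits, min_coverage=0.8):
    """Keep only hits whose query coverage meets the (assumed) cut-off."""
    return [h for h in hits if query_coverage(h) >= min_coverage]


hits = [
    Hit("read1", "Species A", 1, 95, 100),  # 95% of the read aligned
    Hit("read2", "Species B", 1, 40, 100),  # only 40% aligned
]
kept = filter_by_coverage(hits, min_coverage=0.8)
print([h.query_id for h in kept])
```

With this toy input only `read1` survives the filter, because a short partial alignment such as that of `read2` is more likely to reflect a conserved region than a reliable species assignment.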


2019 ◽  
Author(s):  
Sarah Unruh

[ACCESS RESTRICTED TO THE UNIVERSITY OF MISSOURI AT REQUEST OF AUTHOR.] Phylogenetic trees show us how organisms are related and provide frameworks for studying and testing evolutionary hypotheses. To better understand the evolution of orchids and their mycorrhizal fungi, I used high-throughput sequencing data and bioinformatic analyses to build phylogenetic hypotheses. In Chapter 2, I used transcriptome sequences to both build a phylogeny of the slipper orchid genera and to confirm the placement of a polyploidy event at the base of the orchid family. Polyploidy is hypothesized to be a strong driver of evolution and a source of unique traits, so confirming this event leads us closer to explaining extant orchid diversity. The list of orthologous genes generated from this study will provide a less expensive and more powerful method for researchers examining the evolutionary relationships in Orchidaceae. In Chapter 3, I generated genomic sequence data for 32 fungal isolates that were collected from orchids across North America. I inferred the first multi-locus nuclear phylogenetic tree for these fungal clades. The phylogenetic structure of these fungi will improve the taxonomy of these clades by providing evidence for new species and for revising problematic species designations. A robust taxonomy is necessary for studying the role of fungi in the orchid mycorrhizal symbiosis. In Chapter 4, I summarize my work and outline the future directions of my lab at Illinois College, including addressing the remaining aims of my Community Sequencing Proposal with the Joint Genome Institute by analyzing the 15 fungal reference genomes I generated during my PhD. Together these chapters are the start of a life-long research project into the evolution and function of the orchid/fungal symbiosis.


2018 ◽  
Vol 100 ◽  
Author(s):  
Xiangyu Liao ◽  
Xingyu Liao ◽  
Wufei Zhu ◽  
Lu Fang ◽  
Xing Chen

Abstract With the advancement of high-throughput sequencing technologies, the amount of available sequencing data is growing at a pace that has now begun to greatly challenge the data processing and storage capacities of modern computer systems. Removing redundancy from such data by clustering could be crucial for reducing memory, disk space and running time consumption. In addition, clustering also performs well at reducing dataset noise in some analysis applications. In this study, we propose a high-performance short sequence classification algorithm (HSC) for next generation sequencing (NGS) data based on an efficient hash function and text similarity. First, HSC converts all reads into k-mers, then it forms a unique k-mer set by merging the duplicated and reverse complementary elements. Second, all unique k-mers are stored in a hash table, where the k-mer string is stored in the key field, and the IDs of the reads containing the k-mer are stored in the value field. Third, each hash unit is transformed into a short text consisting of reads. Fourth, texts that satisfy the similarity threshold are combined into a long text; the merge operation is executed iteratively until there is no text that satisfies the merge condition. Finally, the long text is transformed into a cluster consisting of reads. We tested HSC using five real datasets. The experimental results showed that HSC clustered 100 million short reads within 2 hours, and it has excellent performance in reducing memory consumption. HSC is much faster than existing tools and can easily handle tens of millions of sequences. In addition, when HSC is used as a preprocessing tool to produce assembly data, the memory and time consumption of the assembler is greatly reduced. It can help the assembler to achieve better assemblies in terms of N50, NA50 and genome fraction.
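The five steps described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the authors' implementation: the abstract does not say which text-similarity measure HSC uses, so Jaccard similarity between sets of read IDs is assumed here, and the function names, the value of `k`, and the threshold are all hypothetical. The naive pairwise merge loop is quadratic and is meant for toy inputs only.

```python
from collections import defaultdict

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Merge a k-mer with its reverse complement by keeping the smaller one."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


def build_kmer_table(reads, k):
    """Hash table: canonical k-mer (key) -> IDs of reads containing it (value)."""
    table = defaultdict(set)
    for read_id, seq in enumerate(reads):
        for i in range(len(seq) - k + 1):
            table[canonical(seq[i:i + k])].add(read_id)
    return table


def jaccard(a, b):
    """Assumed similarity measure between two 'texts' (sets of read IDs)."""
    return len(a & b) / len(a | b)


def cluster_reads(reads, k=4, threshold=0.5):
    """Merge per-k-mer read groups until no pair meets the threshold."""
    # Each hash unit becomes a 'short text': the set of reads sharing a k-mer.
    groups = list({frozenset(ids) for ids in build_kmer_table(reads, k).values()})
    merged = True
    while merged:
        merged = False
        for i in range(len(groups)):
            for j in range(i + 1, len(groups)):
                if jaccard(groups[i], groups[j]) >= threshold:
                    groups[i] = groups[i] | groups[j]  # combine into a longer text
                    del groups[j]
                    merged = True
                    break
            if merged:
                break
    return sorted(sorted(g) for g in groups)


reads = ["ACGTAC", "ACGTAC", "GGGGGG"]
print(cluster_reads(reads, k=4, threshold=0.5))
```

In this toy run the two identical reads end up in one group and the unrelated read in another. Note that this simplified sketch does not guarantee that the resulting groups form a strict partition of the reads; a read sharing k-mers with several dissimilar groups can appear in more than one of them.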


2016 ◽  
Vol 82 (24) ◽  
pp. 7217-7226 ◽  
Author(s):  
D. Lee Taylor ◽  
William A. Walters ◽  
Niall J. Lennon ◽  
James Bochicchio ◽  
Andrew Krohn ◽  
...  

ABSTRACT While high-throughput sequencing methods are revolutionizing fungal ecology, recovering accurate estimates of species richness and abundance has proven elusive. We sought to design internal transcribed spacer (ITS) primers and an Illumina protocol that would maximize coverage of the kingdom Fungi while minimizing nontarget eukaryotes. We inspected alignments of the 5.8S and large subunit (LSU) ribosomal genes and evaluated potential primers using PrimerProspector. We tested the resulting primers using tiered-abundance mock communities and five previously characterized soil samples. We recovered operational taxonomic units (OTUs) belonging to all 8 members in both mock communities, despite DNA abundances spanning 3 orders of magnitude. The expected and observed read counts were strongly correlated (r = 0.94 to 0.97). However, several taxa were consistently over- or underrepresented, likely due to variation in rRNA gene copy numbers. The Illumina data resulted in clustering of soil samples identical to that obtained with Sanger sequence clone library data using different primers. Furthermore, the two methods produced distance matrices with a Mantel correlation of 0.92. Nonfungal sequences comprised less than 0.5% of the soil data set, with most attributable to vascular plants. Our results suggest that high-throughput methods can produce fairly accurate estimates of fungal abundances in complex communities. Further improvements might be achieved through corrections for rRNA copy number and utilization of standardized mock communities. IMPORTANCE Fungi play numerous important roles in the environment. Improvements in sequencing methods are providing revolutionary insights into fungal biodiversity, yet accurate estimates of the number of fungal species (i.e., richness) and their relative abundances in an environmental sample (e.g., soil, roots, water, etc.) remain difficult to obtain.
We present improved methods for high-throughput Illumina sequencing of the species-diagnostic fungal ribosomal marker gene that improve the accuracy of richness and abundance estimates. The improvements include new PCR primers and library preparation, validation using a known mock community, and bioinformatic parameter tuning.
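As a reminder of what the reported correlation between expected and observed read counts measures, the following sketch computes a Pearson correlation coefficient from scratch. The function name and the numbers are made up for illustration and are not data from the study; whether the authors transformed the counts before computing r is not stated in the abstract, so no transformation is applied here.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


# Toy values only: expected versus observed counts for four hypothetical taxa.
expected = [10, 100, 1000, 10000]
observed = [14, 80, 1300, 9000]
print(round(pearson_r(expected, observed), 3))
```

A value close to 1 indicates that observed counts rise almost linearly with expected counts, even if individual taxa are over- or underrepresented.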

