Simulated rRNA/DNA Ratios Show Potential To Misclassify Active Populations as Dormant

ABSTRACT The use of rRNA/DNA ratios derived from surveys of rRNA sequences in RNA and DNA extracts is an appealing but poorly validated approach to infer the activity status of environmental microbes. To improve the interpretation of rRNA/DNA ratios, we performed simulations to investigate the effects of community structure, rRNA amplification, and sampling depth on the accuracy of rRNA/DNA ratios in classifying bacterial populations as “active” or “dormant.” Community structure was an insignificant factor. In contrast, the extent of rRNA amplification that occurs as cells transition from dormant to growing had a significant effect (P < 0.0001) on classification accuracy, with misclassification errors ranging from 16 to 28%, depending on the rRNA amplification model. The error rate increased to 47% when communities included a mixture of rRNA amplification models, but most of the inflated error was false negatives (i.e., active populations misclassified as dormant). Sampling depth also affected error rates (P < 0.001). Inadequate sampling depth produced various artifacts that are characteristic of rRNA/DNA ratios generated from real communities. These data show important constraints on the use of rRNA/DNA ratios to infer activity status. Whereas classification of populations as active based on rRNA/DNA ratios appears generally valid, classification of populations as dormant is potentially far less accurate.

IMPORTANCE The rRNA/DNA ratio approach is appealing because it extracts an extra layer of information from high-throughput DNA sequencing data, offering a means to determine not only the seedbank of taxa present in communities but also the subset of taxa that are metabolically active. This study provides crucial insights into the use of rRNA/DNA ratios to infer the activity status of microbial taxa in complex communities. Our study shows that the approach may not be as robust as previously supposed, particularly in complex communities composed of populations employing different growth strategies, and identifies factors that inflate the erroneous classification of active populations as dormant.
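
The classification step discussed in this abstract can be illustrated with a minimal Python sketch. It is not the authors' simulation: it assumes (the abstract does not say so) that a population is called active when its relative abundance in the RNA library exceeds its relative abundance in the DNA library, i.e. a ratio threshold of 1.0. The function names and read counts are hypothetical.

```python
from typing import List, Optional


def rrna_dna_ratio(rna_reads: int, rna_total: int,
                   dna_reads: int, dna_total: int) -> Optional[float]:
    """Relative abundance in the RNA library divided by that in the DNA library.

    Returns None when the taxon has no DNA reads, since the ratio is undefined;
    this can happen when sampling depth is too shallow to detect a rare taxon.
    """
    if dna_reads == 0:
        return None
    # Cross-multiply so that only one floating-point division is performed.
    return (rna_reads * dna_total) / (dna_reads * rna_total)


def classify(ratio: Optional[float], threshold: float = 1.0) -> str:
    """Label a population from its ratio; the 1.0 threshold is an assumption."""
    if ratio is None:
        return "undefined"
    return "active" if ratio > threshold else "dormant"


def false_negative_rate(truth: List[str], predicted: List[str]) -> float:
    """Fraction of truly active populations that were not called active."""
    calls_for_active = [p for t, p in zip(truth, predicted) if t == "active"]
    if not calls_for_active:
        return 0.0
    return sum(p != "active" for p in calls_for_active) / len(calls_for_active)


# Hypothetical (rna_reads, dna_reads) pairs, each out of 1000 reads per library.
counts = [(30, 10), (5, 20), (8, 10), (4, 0)]
truth = ["active", "dormant", "active", "active"]
predicted = [classify(rrna_dna_ratio(r, 1000, d, 1000)) for r, d in counts]
print(predicted)
print(false_negative_rate(truth, predicted))
```

In this toy input, the third population is truly active but has a ratio below the assumed threshold, and the fourth has no DNA reads at all, so two of the three active populations are missed. These are the kinds of false negatives and sampling-depth artifacts the abstract describes.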

Faculty Opinions recommendation of Error rates, PCR recombination, and sampling depth in HIV-1 whole genome deep sequencing.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.727160877.793531007 ◽

2017 ◽

Author(s):

Sarah Rowland-Jones ◽

Sophie Andrews

Keyword(s):

Deep Sequencing ◽

Error Rates ◽

Whole Genome ◽

Sampling Depth ◽

HIV-1

Development of a User-Friendly Pipeline for Mutational Analyses of HIV Using Ultra-Accurate Maximum-Depth Sequencing

Viruses ◽

10.3390/v13071338 ◽

2021 ◽

Vol 13 (7) ◽

pp. 1338

Author(s):

Morgan E. Meissner ◽

Emily J. Julik ◽

Jonathan P. Badalamenti ◽

William G. Arndt ◽

Lauren J. Mills ◽

...

Keyword(s):

Error Rates ◽

Maximum Depth ◽

Sequencing Data ◽

Background Error ◽

High Background ◽

Immunodeficiency Virus ◽

User Friendly ◽

Viral Mutagenesis ◽

HIV-1

Human immunodeficiency virus type 2 (HIV-2) accumulates fewer mutations during replication than HIV type 1 (HIV-1). Advanced studies of HIV-2 mutagenesis, however, have historically been confounded by high background error rates in traditional next-generation sequencing techniques. In this study, we describe the adaptation of the previously described maximum-depth sequencing (MDS) technique to studies of both HIV-1 and HIV-2 for the ultra-accurate characterization of viral mutagenesis. We also present the development of a user-friendly Galaxy workflow for the bioinformatic analyses of sequencing data generated using the MDS technique, designed to improve replicability and accessibility for molecular virologists. This adapted MDS technique and analysis pipeline were validated by comparisons with previously published analyses of the frequency and spectra of mutations in HIV-1 and HIV-2 and are readily expandable to studies of viral mutation across the genomes of both viruses. Using this novel sequencing pipeline, we observed that the background error rate was reduced 100-fold relative to standard Illumina error rates and 10-fold relative to traditional unique molecular identifier (UMI)-based sequencing. This technical advancement will allow for the exploration of novel and previously unrecognized sources of viral mutagenesis in both HIV-1 and HIV-2, which will expand our understanding of retroviral diversity and evolution.

Software Benchmark—Classification Tree Algorithms for Cell Atlases Annotation Using Single-Cell RNA-Sequencing Data

Microbiology Research ◽

10.3390/microbiolres12020022 ◽

2021 ◽

Vol 12 (2) ◽

pp. 317-334

Author(s):

Omar Alaqeeli ◽

Li Xing ◽

Xuekui Zhang

Keyword(s):

Single Cell ◽

RNA Sequencing ◽

Classification Tree ◽

Area Under The Curve ◽

Data Sets ◽

Sequencing Data ◽

Single Cell RNA Sequencing ◽

Tree Algorithms ◽

R Packages

The classification tree is a widely used machine learning method. It has multiple implementations as R packages: rpart, ctree, evtree, tree, and C5.0. These implementations differ in their details, and hence their performance differs from one application to another. We are interested in their performance in classifying cells using single-cell RNA-sequencing data. In this paper, we conducted a benchmark study using 22 single-cell RNA-sequencing data sets. Using cross-validation, we compared the packages’ prediction performance based on Precision, Recall, F1-score, and Area Under the Curve (AUC). We also compared the Complexity and Run-time of these R packages. Our study shows that rpart and evtree have the best Precision; evtree is the best in Recall, F1-score, and AUC; C5.0 prefers more complex trees; and tree is consistently much faster than the others, although its complexity is often higher.
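
For readers unfamiliar with the metrics named in this abstract, the following Python sketch shows how Precision, Recall, and F1-score follow from confusion counts for a single class. It is only an illustration: the benchmark itself used R packages, and the function name and counts here are hypothetical.

```python
from typing import Dict


def binary_metrics(tp: int, fp: int, fn: int) -> Dict[str, float]:
    """Precision, recall, and F1 from true-positive, false-positive,
    and false-negative counts for one class."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall, written in count form.
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for one cell type in one cross-validation fold.
print(binary_metrics(tp=8, fp=2, fn=2))
```

Returning 0.0 when a denominator is zero is a convention chosen for this sketch; other tools may report such cases as undefined.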

Estimating sequencing error rates using families

BioData Mining ◽

10.1186/s13040-021-00259-6 ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Kelley Paskov ◽

Jae-Yoon Jung ◽

Brianna Chrisman ◽

Nate T. Stockham ◽

Peter Washington ◽

...

Keyword(s):

Whole Genome Sequencing ◽

Exome Sequencing ◽

Genome Sequencing ◽

Variant Calling ◽

Error Rates ◽

Sequencing Error ◽

Whole Genome ◽

Sequencing Data ◽

Sequencing Platform ◽

Whole Exome

Abstract

Background As next-generation sequencing technologies make their way into the clinic, knowledge of their error rates is essential if they are to be used to guide patient care. However, sequencing platforms and variant-calling pipelines are continuously evolving, making it difficult to accurately quantify error rates for the particular combination of assay and software parameters used on each sample. Family data provide a unique opportunity for estimating sequencing error rates since they allow us to observe a fraction of sequencing errors as Mendelian errors in the family, which we can then use to produce genome-wide error estimates for each sample.

Results We introduce a method that uses Mendelian errors in sequencing data to make highly granular per-sample estimates of precision and recall for any set of variant calls, regardless of sequencing platform or calling methodology. We validate the accuracy of our estimates using monozygotic twins, and we use a set of monozygotic quadruplets to show that our predictions closely match the consensus method. We demonstrate our method’s versatility by estimating sequencing error rates for whole genome sequencing, whole exome sequencing, and microarray datasets, and we highlight its sensitivity by quantifying performance increases between different versions of the GATK variant-calling pipeline. We then use our method to demonstrate that: 1) Sequencing error rates between samples in the same dataset can vary by over an order of magnitude. 2) Variant calling performance decreases substantially in low-complexity regions of the genome. 3) Variant calling performance in whole exome sequencing data decreases with distance from the nearest target region. 4) Variant calls from lymphoblastoid cell lines can be as accurate as those from whole blood. 5) Whole-genome sequencing can attain microarray-level precision and recall at disease-associated SNV sites.

Conclusion Genotype datasets from families are powerful resources that can be used to make fine-grained estimates of sequencing error for any sequencing platform and variant-calling methodology.
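
The notion of a Mendelian error that this abstract relies on can be sketched in a few lines of Python. This is only the basic consistency check for a mother, father, and child trio at a biallelic autosomal site, not the authors' estimation method, which builds per-sample precision and recall estimates on top of such observations. The genotype coding and names are assumptions made for the illustration.

```python
from typing import Set

# Genotypes are coded as the number of alternate alleles at a biallelic
# autosomal site: 0 (hom-ref), 1 (het), 2 (hom-alt).
# Each parent transmits exactly one allele (0 = ref, 1 = alt).
TRANSMITTABLE = {0: {0}, 1: {0, 1}, 2: {1}}


def is_mendelian_error(mother: int, father: int, child: int) -> bool:
    """True if the child's genotype cannot be formed from one allele per parent."""
    possible: Set[int] = {m + f
                          for m in TRANSMITTABLE[mother]
                          for f in TRANSMITTABLE[father]}
    return child not in possible


# Hypothetical (mother, father, child) genotype calls.
trios = [(0, 0, 1), (1, 1, 2), (2, 2, 1), (0, 2, 1)]
print([is_mendelian_error(*t) for t in trios])
```

Only some genotyping errors produce an inconsistency like this (for example, a miscalled child in a het-by-het trio is always consistent), which is why the abstract speaks of observing "a fraction" of sequencing errors as Mendelian errors.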

Computational classification of microRNAs in next-generation sequencing data

Theoretical Chemistry Accounts ◽

10.1007/s00214-009-0684-z ◽

2009 ◽

Vol 125 (3-6) ◽

pp. 637-642

Author(s):

Joshua Riback ◽

Artemis G. Hatzigeorgiou ◽

Martin Reczko

Keyword(s):

Next Generation Sequencing ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Generation Sequencing

Estimating error rates in the classification of paired organs

Statistics in Medicine ◽

10.1002/sim.3310 ◽

2008 ◽

Vol 27 (22) ◽

pp. 4515-4531 ◽

Cited By ~ 17

Author(s):

Alexander Brenning ◽

Berthold Lausen

Keyword(s):

Error Rates

A METAGENOMIC ASSESSMENT OF BACTERIAL CONTAMINATION OF DUST EVENTS IN SENEGAL

International Journal of Advanced Research ◽

10.21474/ijar01/12610 ◽

2021 ◽

Vol 9 (03) ◽

pp. 509-526

Author(s):

Alioune Marone ◽

Malick Mbengue ◽

Gregory Jenkins ◽

Demba Ndao Niang ◽

...

Keyword(s):

Respiratory Diseases ◽

rRNA Gene ◽

Source Population ◽

Sequencing Analysis ◽

Human Pathogens ◽

Sequencing Data ◽

Bacterial Populations ◽

African Dust ◽

Hypervariable Regions ◽

Dust Events

Previous work in the Caribbean and West Africa has shown that air samples taken during dust events contain microorganisms (bacteria, fungi, viruses), including human pathogens that can cause many respiratory diseases. To better understand the potential downstream effect of dust-borne bacteria on human health and public ecosystems, it is important to characterize the source population. In this study, we aimed to explore the bacterial populations of African dust samples collected between 2013 and 2017. The dust samples were collected using the spatula method, then the hypervariable regions (V3 and V4) of the 16S rRNA gene were amplified using PCR, followed by Illumina MiSeq sequencing. Analysis of the sequencing data was performed using MG-RAST. At the phylum level, Actinobacteria (22%), Firmicutes (20%), Proteobacteria (19%), and Bacteroidetes (13%) predominated in all dust samples. At the genus level, Bacillus (16%), Pseudomonas (10%), Nocardioides and Exiguobacterium (5%) were the most dominant genera in the African dust samples collected in this study. The study showed that molecular characterization of the dust microbial population remains a very efficient method, also applicable to the search for viruses and fungi in this type of sample. It is important to note that the majority of microorganisms identified in this study can cause respiratory diseases.

Streaming histogram sketching for rapid microbiome analytics

10.1101/408070 ◽

2018 ◽

Author(s):

Will P. M. Rowe ◽

Anna Paola Carrieri ◽

Cristina Alcon-Giner ◽

Shabhonam Caim ◽

Alex Shaw ◽

...

Keyword(s):

Locality Sensitive Hashing ◽

Genomic Research ◽

Compact Representation ◽

Sample Type ◽

Sequencing Data ◽

Similarity Estimation ◽

Microbiome Research ◽

Microbiome Data ◽

Similarity Searches

Abstract

Motivation The growth in publicly available microbiome data in recent years has yielded an invaluable resource for genomic research, allowing for the design of new studies, augmentation of novel datasets and reanalysis of published works. This vast amount of microbiome data, as well as the widespread proliferation of microbiome research and the looming era of clinical metagenomics, means there is an urgent need to develop analytics that can process huge amounts of data in a short amount of time. To address this need, we propose a new method for the compact representation of microbiome sequencing data using similarity-preserving sketches of streaming k-mer spectra. These sketches allow for dissimilarity estimation, rapid microbiome catalogue searching, and classification of microbiome samples in near real-time.

Results We apply streaming histogram sketching to microbiome samples as a form of dimensionality reduction, creating a compressed ‘histosketch’ that can be used to efficiently represent microbiome k-mer spectra. Using public microbiome datasets, we show that histosketches can be clustered by sample type using pairwise Jaccard similarity estimation, consequently allowing for rapid microbiome similarity searches via a locality sensitive hashing indexing scheme. Furthermore, we show that histosketches can be used to train machine learning classifiers to accurately label microbiome samples. Specifically, using a collection of 108 novel microbiome samples from a cohort of premature neonates, we trained and tested a Random Forest Classifier that could accurately predict whether the neonate had received antibiotic treatment (95% accuracy, precision 97%) and could subsequently be used to classify microbiome data streams in less than 12 seconds. We provide our implementation, Histosketching Using Little K-mers (HULK), which can histosketch a typical 2 GB microbiome in 50 seconds on a standard laptop using 4 cores, with the sketch occupying 3000 bytes of disk space.

Availability Our implementation (HULK) is written in Go and is available at: https://github.com/will-rowe/hulk (MIT License)
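
The quantity that these sketches approximate, Jaccard similarity between k-mer sets, can be computed exactly for small inputs. The Python sketch below does only that; it is not HULK's histosketch algorithm and does not reproduce its streaming or hashing, and the sequences and function names are hypothetical.

```python
from typing import Set


def kmers(seq: str, k: int) -> Set[str]:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0  # convention chosen for this sketch: two empty sets are identical
    return len(a & b) / len(a | b)


# Two toy sequences sharing the 3-mers ACG and CGT.
a = kmers("ACGTAC", 3)
b = kmers("ACGTTT", 3)
print(jaccard(a, b))
```

Exact sets grow with the data, which is the cost that fixed-size, similarity-preserving sketches are meant to avoid.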

Supervised linear classification of Gaussian spatio-temporal data

Lietuvos matematikos rinkinys ◽

10.15388/lmr.2021.25214 ◽

2021 ◽

Vol 62 ◽

pp. 9-15

Author(s):

Marta Karaliutė ◽

Kęstutis Dučinskas

Keyword(s):

Time Moment ◽

Gaussian Random Field ◽

Covariance Structure ◽

Simulated Data ◽

Spatial Location ◽

Error Rates ◽

Prior Probabilities ◽

Structure Factors ◽

Spatio Temporal

In this article we focus on the problem of supervised classification of a spatio-temporal Gaussian random field observation into one of two classes, specified by different mean parameters. The main distinctive feature of the proposed approach is that it allows the class label to depend on spatial location as well as on time moment. It is assumed that the spatio-temporal covariance structure factors into a purely spatial component and a purely temporal component following an AR(p) model. In numerical illustrations with simulated data, the influence of the values of the spatial and temporal covariance parameters on the derived error rates is studied for several models of prior probabilities.

PhredEM: A Phred-Score-Informed Genotype-Calling Approach for Next-Generation Sequencing Studies

10.1101/046136 ◽

2016 ◽

Author(s):

Peizhou Liao ◽

Glen A. Satten ◽

Yi-juan Hu

Keyword(s):

Logistic Regression ◽

Next Generation Sequencing ◽

EM Algorithm ◽

Error Rates ◽

Next Generation Sequencing Data ◽

Next Generation ◽

Sequencing Data ◽

Genotype Calling ◽

Sequencing Studies ◽

Generation Sequencing

ABSTRACT A fundamental challenge in analyzing next-generation sequencing data is to determine an individual’s genotype correctly as the accuracy of the inferred genotype is essential to downstream analyses. Some genotype callers, such as GATK and SAMtools, directly calculate the base-calling error rates from phred scores or recalibrated base quality scores. Others, such as SeqEM, estimate error rates from the read data without using any quality scores. It is also a common quality control procedure to filter out reads with low phred scores. However, choosing an appropriate phred score threshold is problematic as a too-high threshold may lose data while a too-low threshold may introduce errors. We propose a new likelihood-based genotype-calling approach that exploits all reads and estimates the per-base error rates by incorporating phred scores through a logistic regression model. The algorithm, which we call PhredEM, uses the Expectation-Maximization (EM) algorithm to obtain consistent estimates of genotype frequencies and logistic regression parameters. We also develop a simple, computationally efficient screening algorithm to identify loci that are estimated to be monomorphic, so that only loci estimated to be non-monomorphic require application of the EM algorithm. We evaluate the performance of PhredEM using both simulated data and real sequencing data from the UK10K project. The results demonstrate that PhredEM is an improved, robust and widely applicable genotype-calling approach for next-generation sequencing studies. The relevant software is freely available.
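
As background to the phred scores mentioned in this abstract, the Python sketch below shows the standard conversion from a phred quality score Q to a nominal base-calling error probability, P = 10^(-Q/10), and the decoding of a FASTQ quality string. It assumes the common Phred+33 ASCII encoding. It does not implement PhredEM, whose logistic regression model estimates error rates rather than taking this nominal value at face value. The function names are hypothetical.

```python
from typing import List


def phred_to_error_prob(q: float) -> float:
    """Nominal error probability implied by a phred score: 10^(-Q/10)."""
    return 10 ** (-q / 10)


def decode_quals(quals: str, offset: int = 33) -> List[int]:
    """Convert a FASTQ quality string to phred scores (Phred+33 assumed)."""
    return [ord(c) - offset for c in quals]


# '5', '?', and 'I' encode Q20, Q30, and Q40 under Phred+33.
scores = decode_quals("5?I")
print(scores)
print([phred_to_error_prob(q) for q in scores])
```

A hard filter at, say, Q20 would discard every base with a nominal error probability above 1 in 100, which is the threshold trade-off the abstract describes.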
