Accurate estimation of microbial sequence diversity with Distanced

2019 ◽  
Author(s):  
Timothy J Hackmann

Abstract Motivation Microbes are the most diverse organisms on the planet. Deep sequencing of ribosomal DNA (rDNA) suggests thousands of different microbes may be present in a single sample. However, errors in sequencing have made any estimate of within-sample (alpha) diversity uncertain. Results We developed a tool to estimate alpha diversity of rDNA sequences from microbes (and other sequences). Our tool, Distanced, calculates how different (distant) sequences would be without sequencing errors. It does this using a Bayesian approach. Using this approach, Distanced accurately estimated alpha diversity of rDNA sequences from bacteria and fungi. It had lower root mean square prediction error (RMSPE) than when using no tool (leaving sequencing errors uncorrected). It was also accurate with non-microbial sequences (antibody mRNA). State-of-the-art tools (DADA2 and Deblur) were far less accurate. They often had higher RMSPE than when using no tool. Distanced thus represents an improvement over existing tools. Distanced will be useful to several disciplines, given microbial diversity affects everything from human health to ecosystem function. Availability and implementation Distanced is freely available at https://github.com/thackmann/Distanced. Contact [email protected] Supplementary information Supplementary data are available at Bioinformatics online.

2018 ◽  
Author(s):  
Timothy J. Hackmann

Microbes are the most diverse organisms on Earth. Sequencing their DNA suggests thousands of different microbes could be present in a single sample. Errors in sequencing, however, make it challenging to estimate exactly how diverse microbes are. Here we developed a tool that estimates diversity accurately, even in the presence of sequencing errors. We first evaluated two existing tools, DADA2 and Deblur, which work by correcting sequencing errors. We found that these tools estimated within-sample (alpha) diversity poorly. In fact, we obtained better estimates if we did not use the tools at all (left errors uncorrected). These tools performed poorly because they changed the relative abundance of different sequences; this is a side effect of correcting errors and discarding up to 90% of sequence reads in the process. Previous evaluations ignored sequence abundance when calculating diversity, overlooking this problem. Our tool, Distanced, differs from existing tools because it does not correct sequencing errors. Instead, it corrects sequence distances, which are used to calculate diversity. It does this correction with Phred quality scores and Bayes theorem. No sequence reads are discarded in the process. In our evaluation, Distanced accurately estimated diversity of bacterial DNA, fungal DNA, and even antibody mRNA. Given its accuracy, Distanced will help investigators answer important questions about microbial diversity. For example, it could answer how important diversity is for the planet's ecosystems and human health.
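The core idea can be sketched under simplifying assumptions (independent errors, uniform over the three alternative bases, and a plain expectation rather than Distanced's full Bayesian estimator): Phred quality scores give per-base error probabilities, so the mismatches expected from error alone can be subtracted from the observed distance.

```python
def phred_to_prob(q):
    """Convert a Phred quality score Q to an error probability 10^(-Q/10)."""
    return 10 ** (-q / 10)

def expected_error_mismatch(q1, q2):
    """Probability that two aligned bases with the same true value appear
    different due to sequencing error alone. Assumes independent errors,
    uniform over the three alternative bases (a simplifying assumption,
    not Distanced's exact model)."""
    e1, e2 = phred_to_prob(q1), phred_to_prob(q2)
    # Exactly one read errs, or both err and land on different bases.
    return e1 * (1 - e2) + e2 * (1 - e1) + (2 / 3) * e1 * e2

def corrected_distance(seq1, seq2, qual1, qual2):
    """Naive distance correction: observed mismatch count minus the
    mismatches expected from sequencing error alone."""
    observed = sum(a != b for a, b in zip(seq1, seq2))
    expected = sum(expected_error_mismatch(q1, q2)
                   for q1, q2 in zip(qual1, qual2))
    return max(observed - expected, 0.0) / len(seq1)
```

With high-quality reads (Q40) the correction is small; with low-quality reads, error can account for most of the observed distance.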


2019 ◽  
Author(s):  
Soumyabrata Dev ◽  
Florian M. Savoy ◽  
Yee Hui Lee ◽  
Stefan Winkler

Abstract. Ground-based whole sky cameras are nowadays extensively used for localized monitoring of clouds. They capture hemispherical images of the sky at regular intervals using a fisheye lens. In this paper, we derive a model for estimating the solar irradiance using pictures taken by those imagers. Unlike pyranometers, these sky images contain information about cloud coverage and can be used to derive cloud movement. An accurate estimation of the solar irradiance using solely those images is thus a first step towards short-term solar energy generation forecasting. We derive and validate our model using pyranometers co-located with our whole sky imagers. We achieve a root mean square error of 178 W/m2 between estimated and measured solar irradiance, outperforming state-of-the-art methods using other weather instruments. Our method shows a significant improvement in estimating strong short-term variations.
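The evaluation metric quoted above is the root mean square error between estimated and measured irradiance; a minimal implementation:

```python
import math

def rmse(estimated, measured):
    """Root mean square error between estimated and measured irradiance
    values (W/m2)."""
    assert len(estimated) == len(measured)
    return math.sqrt(sum((e - m) ** 2 for e, m in zip(estimated, measured))
                     / len(estimated))
```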


2019 ◽  
Vol 35 (22) ◽  
pp. 4624-4631 ◽  
Author(s):  
Xin Li ◽  
Samaneh Saadat ◽  
Haiyan Hu ◽  
Xiaoman Li

Abstract Motivation The bacterial haplotype reconstruction is critical for selecting proper treatments for diseases caused by unknown haplotypes. Existing methods and tools do not work well on this task, because they are usually developed for viral instead of bacterial populations. Results In this study, we developed BHap, a novel algorithm based on fuzzy flow networks, for reconstructing bacterial haplotypes from next generation sequencing data. Tested on simulated and experimental datasets, we showed that BHap was capable of reconstructing haplotypes of bacterial populations with an average F1 score of 0.87, an average precision of 0.87 and an average recall of 0.88. We also demonstrated that BHap had a low susceptibility to sequencing errors, was capable of reconstructing haplotypes with low coverage and could handle a wide range of mutation rates. Compared with existing approaches, BHap outperformed them in terms of higher F1 scores, better precision, better recall and more accurate estimation of the number of haplotypes. Availability and implementation The BHap tool is available at http://www.cs.ucf.edu/∼xiaoman/BHap/. Supplementary information Supplementary data are available at Bioinformatics online.
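For reference, the F1 score reported above is the harmonic mean of precision and recall; with P = 0.87 and R = 0.88 it works out to roughly 0.87, consistent with the abstract:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```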


Author(s):  
Matteo Chiara ◽  
Federico Zambelli ◽  
Marco Antonio Tangaro ◽  
Pietro Mandreoli ◽  
David S Horner ◽  
...  

Abstract Summary While over 200 000 genomic sequences are currently available through dedicated repositories, ad hoc methods for the functional annotation of SARS-CoV-2 genomes do not harness all currently available resources for the annotation of functionally relevant genomic sites. Here, we present CorGAT, a novel tool for the functional annotation of SARS-CoV-2 genomic variants. By comparisons with other state-of-the-art methods we demonstrate that, by providing a more comprehensive and rich annotation, our method can facilitate the identification of evolutionary patterns in the genome of SARS-CoV-2. Availability and implementation Galaxy: http://corgat.cloud.ba.infn.it/galaxy; software: https://github.com/matteo14c/CorGAT/tree/Revision_V1; docker: https://hub.docker.com/r/laniakeacloud/galaxy_corgat. Supplementary information Supplementary data are available at Bioinformatics online.


2015 ◽  
Vol 32 (6) ◽  
pp. 821-827 ◽  
Author(s):  
Enrique Audain ◽  
Yassel Ramos ◽  
Henning Hermjakob ◽  
Darren R. Flower ◽  
Yasset Perez-Riverol

Abstract Motivation: In any macromolecular polyprotic system—for example protein, DNA or RNA—the isoelectric point—commonly referred to as the pI—can be defined as the point of singularity in a titration curve, corresponding to the solution pH value at which the net overall surface charge—and thus the electrophoretic mobility—of the ampholyte sums to zero. Different modern analytical biochemistry and proteomics methods depend on the isoelectric point as a principal feature for protein and peptide characterization. Protein separation by isoelectric point is a critical part of 2-D gel electrophoresis, a key precursor of proteomics, where discrete spots can be digested in-gel, and proteins subsequently identified by analytical mass spectrometry. Peptide fractionation according to pI is also widely used in current proteomics sample preparation procedures prior to LC-MS/MS analysis. Therefore, accurate theoretical prediction of pI would expedite such analysis. While such pI calculation is widely used, it remains largely untested, motivating our efforts to benchmark pI prediction methods. Results: Using data from the database PIP-DB and one publicly available dataset as our reference gold standard, we have undertaken the benchmarking of pI calculation methods. We find that methods vary in their accuracy and are highly sensitive to the choice of basis set. The machine-learning algorithms, especially the SVM-based algorithm, showed a superior performance when studying peptide mixtures. In general, learning-based pI prediction methods (such as Cofactor, SVM and Branca) require a large training dataset and their resulting performance will strongly depend on the quality of that data. In contrast with iterative methods, machine-learning algorithms have the advantage of being able to add new features to improve the accuracy of prediction.
Contact: [email protected] Availability and Implementation: The software and data are freely available at https://github.com/ypriverol/pIR. Supplementary information: Supplementary data are available at Bioinformatics online.
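The iterative approach covered by the benchmark can be sketched as a bisection on the Henderson-Hasselbalch net-charge curve. The pKa values below are one possible basis set (EMBOSS-like); as the paper stresses, predictions are highly sensitive to this choice:

```python
# One illustrative pKa basis set (EMBOSS-like); other sets give different pI values.
PKA = {'D': 3.9, 'E': 4.1, 'C': 8.5, 'Y': 10.1,   # acidic side chains
       'H': 6.5, 'K': 10.8, 'R': 12.5}            # basic side chains
PKA_NTERM, PKA_CTERM = 8.6, 3.6

def net_charge(seq, ph):
    """Net charge of a peptide at a given pH via Henderson-Hasselbalch."""
    pos = 1 / (1 + 10 ** (ph - PKA_NTERM))
    pos += sum(1 / (1 + 10 ** (ph - PKA[a])) for a in seq if a in 'HKR')
    neg = 1 / (1 + 10 ** (PKA_CTERM - ph))
    neg += sum(1 / (1 + 10 ** (PKA[a] - ph)) for a in seq if a in 'DECY')
    return pos - neg

def isoelectric_point(seq, tol=1e-4):
    """Bisect for the pH at which the net charge crosses zero
    (the charge is monotonically decreasing in pH)."""
    lo, hi = 0.0, 14.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if net_charge(seq, mid) > 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2
```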


2020 ◽  
Vol 36 (10) ◽  
pp. 3011-3017 ◽  
Author(s):  
Olga Mineeva ◽  
Mateo Rojas-Carulla ◽  
Ruth E Ley ◽  
Bernhard Schölkopf ◽  
Nicholas D Youngblut

Abstract Motivation Methodological advances in metagenome assembly are rapidly increasing the number of published metagenome assemblies. However, identifying misassemblies is challenging due to a lack of closely related reference genomes that can act as pseudo ground truth. Existing reference-free methods are no longer maintained, can make strong assumptions that may not hold across a diversity of research projects, and have not been validated on large-scale metagenome assemblies. Results We present DeepMAsED, a deep learning approach for identifying misassembled contigs without the need for reference genomes. Moreover, we provide an in silico pipeline for generating large-scale, realistic metagenome assemblies for comprehensive model training and testing. DeepMAsED accuracy substantially exceeds the state-of-the-art when applied to large and complex metagenome assemblies. Our model estimates a 1% contig misassembly rate in two recent large-scale metagenome assembly publications. Conclusions DeepMAsED accurately identifies misassemblies in metagenome-assembled contigs from a broad diversity of bacteria and archaea without the need for reference genomes or strong modeling assumptions. Running DeepMAsED is straightforward, as is model re-training with our dataset generation pipeline. Therefore, DeepMAsED is a flexible misassembly classifier that can be applied to a wide range of metagenome assembly projects. Availability and implementation DeepMAsED is available from GitHub at https://github.com/leylabmpi/DeepMAsED. Supplementary information Supplementary data are available at Bioinformatics online.


2014 ◽  
Vol 2014 ◽  
pp. 1-19 ◽  
Author(s):  
Mark J. van der Laan ◽  
Richard J. C. M. Starmans

This outlook paper reviews the research of van der Laan's group on Targeted Learning, a subfield of statistics concerned with the construction of data-adaptive estimators of user-supplied target parameters of the probability distribution of the data, and of corresponding confidence intervals, while relying only on realistic statistical assumptions. Targeted Learning fully utilizes the state of the art in machine learning tools, while still preserving the important identity of statistics as a field concerned with both accurate estimation of the true target parameter value and assessment of uncertainty in order to draw sound statistical conclusions. We also provide a philosophical and historical perspective on Targeted Learning, relating it to new developments in Big Data, and conclude with some remarks explaining its immediate relevance to the current Big Data movement.


2019 ◽  
Author(s):  
Christina Huan Shi ◽  
Kevin Y. Yip

Abstract K-mer counting has many applications in sequencing data processing and analysis. However, sequencing errors can produce many false k-mers that substantially increase the memory requirement during counting. We propose a fast k-mer counting method, CQF-deNoise, which has a novel component for dynamically identifying and removing false k-mers while preserving counting accuracy. Compared with four state-of-the-art k-mer counting methods, CQF-deNoise consumed 49-76% less memory than the second-best method, but still ran competitively fast. The k-mer counts from CQF-deNoise produced cell clusters from single-cell RNA-seq data highly consistent with CellRanger but required only 5% of the running time at the same memory consumption, suggesting that CQF-deNoise can be used for a preview of cell clusters for an early detection of potential data problems, before running a much more time-consuming full analysis pipeline.
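A toy illustration of the underlying problem: naive k-mer counting keeps every error-induced k-mer, and a simple fix is to drop low-count k-mers after the fact. CQF-deNoise instead removes them dynamically during counting, which is what saves memory; the post-pass below is only a sketch of the idea.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count all overlapping k-mers across a collection of reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts

def remove_false_kmers(counts, min_count=2):
    """Drop k-mers below a count threshold; in deep data, singletons are
    overwhelmingly products of sequencing error."""
    return Counter({kmer: c for kmer, c in counts.items() if c >= min_count})
```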


2010 ◽  
Vol 15 (2) ◽  
pp. 113-133 ◽  
Author(s):  
Muhammad Zakaria ◽  
Shujat Ali

Using Theil’s inequality coefficient based on the mean square prediction error, this paper evaluates the forecasting efficiency of the central government budget and revised budget estimates in Pakistan for the period 1987/88 to 2007/08 and decomposes the errors into biasedness, unequal variation and random components to analyze the source of error. The results reveal that budgetary forecasting is inefficient in Pakistan and the error is due mainly to exogenous variables (random factors). We also find that neither the budget nor revised budget estimates of revenue and expenditure satisfy the criteria of rational expectations of forecasting. Further, there is very little evidence of improvement in the efficiency of budgetary forecasts over time.
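The decomposition used in the paper splits the mean square prediction error into bias, variance, and covariance proportions that sum to one; a sketch using population (1/n) moments:

```python
import math

def theil_decomposition(pred, actual):
    """Decompose mean square prediction error into bias (U_M), variance
    (U_S) and covariance (U_C) proportions, which sum to one. Uses
    population (1/n) moments; assumes non-degenerate inputs."""
    n = len(pred)
    mp = sum(pred) / n
    ma = sum(actual) / n
    sp = math.sqrt(sum((p - mp) ** 2 for p in pred) / n)
    sa = math.sqrt(sum((a - ma) ** 2 for a in actual) / n)
    r = sum((p - mp) * (a - ma) for p, a in zip(pred, actual)) / (n * sp * sa)
    mse = sum((p - a) ** 2 for p, a in zip(pred, actual)) / n
    # MSE = (mp - ma)^2 + (sp - sa)^2 + 2(1 - r) * sp * sa
    um = (mp - ma) ** 2 / mse
    us = (sp - sa) ** 2 / mse
    uc = 2 * (1 - r) * sp * sa / mse
    return um, us, uc
```

A large U_C (the "random" component) relative to U_M and U_S is what the paper reports for Pakistan's budget forecasts.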


2020 ◽  
Vol 36 (Supplement_2) ◽  
pp. i745-i753
Author(s):  
Yisu Peng ◽  
Shantanu Jain ◽  
Yong Fuga Li ◽  
Michal Greguš ◽  
Alexander R. Ivanov ◽  
...  

Abstract Motivation Accurate estimation of false discovery rate (FDR) of spectral identification is a central problem in mass spectrometry-based proteomics. Over the past two decades, target-decoy approaches (TDAs) and decoy-free approaches (DFAs) have been widely used to estimate FDR. TDAs use a database of decoy species to faithfully model score distributions of incorrect peptide-spectrum matches (PSMs). DFAs, on the other hand, fit two-component mixture models to learn the parameters of correct and incorrect PSM score distributions. While conceptually straightforward, both approaches lead to problems in practice, particularly in experiments that push instrumentation to the limit and generate low fragmentation-efficiency and low signal-to-noise-ratio spectra. Results We introduce a new decoy-free framework for FDR estimation that generalizes present DFAs while exploiting more search data in a manner similar to TDAs. Our approach relies on multi-component mixtures, in which score distributions corresponding to the correct PSMs, best incorrect PSMs and second-best incorrect PSMs are modeled by the skew normal family. We derive EM algorithms to estimate parameters of these distributions from the scores of best and second-best PSMs associated with each experimental spectrum. We evaluate our models on multiple proteomics datasets and a HeLa cell digest case study consisting of more than a million spectra in total. We provide evidence of improved performance over existing DFAs and improved stability and speed over TDAs without any performance degradation. We propose that the new strategy has the potential to extend beyond peptide identification and reduce the need for TDA on all analytical platforms. Availability and implementation https://github.com/shawn-peng/FDR-estimation. Supplementary information Supplementary data are available at Bioinformatics online.
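As a simplified stand-in for the paper's skew-normal mixtures, the EM updates for a plain two-component Gaussian mixture over PSM scores look as follows (the function name and initialization are illustrative, not the authors' implementation):

```python
import math

def em_two_gaussians(scores, iters=200):
    """EM for a two-component Gaussian mixture: a simplified stand-in for
    the skew-normal mixtures in the paper. Returns (weights, means, sds),
    with component 0 initialized as the lower-scoring ("incorrect") one."""
    mu = [min(scores), max(scores)]
    sigma = [1.0, 1.0]
    w = [0.5, 0.5]
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each score
        resp = []
        for x in scores:
            dens = [w[k] / (sigma[k] * math.sqrt(2 * math.pi))
                    * math.exp(-(x - mu[k]) ** 2 / (2 * sigma[k] ** 2))
                    for k in range(2)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: update mixing weights, means and standard deviations
        for k in range(2):
            rk = sum(r[k] for r in resp)
            w[k] = rk / len(scores)
            mu[k] = sum(r[k] * x for r, x in zip(resp, scores)) / rk
            sigma[k] = math.sqrt(sum(r[k] * (x - mu[k]) ** 2
                                     for r, x in zip(resp, scores)) / rk) or 1e-6
    return w, mu, sigma
```

With the fitted parameters, the FDR at a score threshold follows from the incorrect component's weight and tail mass above the threshold.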

