Detecting and correcting misclassified sequences in the large-scale public databases

2020
Vol 36 (18)
pp. 4699-4705
Author(s):
Hamid Bagheri
Andrew J Severin
Hridesh Rajan

Abstract Motivation As the cost of sequencing decreases, the amount of data being deposited into public repositories is increasing rapidly. Public databases rely on users to provide metadata for each submission, a process that is prone to user error. Unfortunately, most public databases, such as the non-redundant (NR) database, rely on user input and do not have methods for identifying errors in the provided metadata, leading to the potential for error propagation. Previous research analyzed misclassification in a small subset of the NR database based on sequence similarity. To the best of our knowledge, the amount of misclassification in the entire database has not been quantified. We propose a heuristic method to detect potentially misclassified taxonomic assignments in the NR database. We applied a curation technique and quality control to find the most probable taxonomic assignment. Our method incorporates the provenance and frequency of each annotation from manually and computationally created databases, together with clustering information at the 95% similarity level. Results We found more than two million potentially taxonomically misclassified proteins in the NR database. Using simulated data, we show a high precision of 97% and a recall of 87% for detecting taxonomically misclassified proteins. The proposed approach and findings could also be applied to other databases. Availability and implementation Source code, dataset, documentation, Jupyter notebooks and Docker container are available at https://github.com/boalang/nr. Supplementary information Supplementary data are available at Bioinformatics online.
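A minimal sketch of the kind of provenance-weighted consensus check described above, assuming proteins are already grouped into 95%-identity clusters; the source names, weights and data layout below are hypothetical and do not reproduce the authors' implementation:

```python
from collections import Counter

# Hypothetical provenance weights: curated sources count more than computational ones.
PROVENANCE_WEIGHT = {"swissprot": 3.0, "refseq": 2.0, "genbank": 1.0}

def cluster_consensus(members):
    """Weighted vote over the taxonomic labels of one 95%-identity cluster.

    `members` is a list of dicts with keys 'id', 'taxon' and 'source'.
    Returns (consensus_taxon, support_fraction).
    """
    votes = Counter()
    for m in members:
        votes[m["taxon"]] += PROVENANCE_WEIGHT.get(m["source"], 1.0)
    taxon, weight = votes.most_common(1)[0]
    return taxon, weight / sum(votes.values())

def flag_misclassified(clusters, min_support=0.8):
    """Flag proteins whose taxon disagrees with a well-supported cluster consensus."""
    flagged = []
    for cid, members in clusters.items():
        consensus, support = cluster_consensus(members)
        if support < min_support:
            continue  # no confident consensus for this cluster
        flagged.extend(m["id"] for m in members if m["taxon"] != consensus)
    return flagged

# Toy example: one cluster with a single discordant annotation.
clusters = {
    "c1": [
        {"id": "P1", "taxon": "Escherichia coli", "source": "swissprot"},
        {"id": "P2", "taxon": "Escherichia coli", "source": "genbank"},
        {"id": "P3", "taxon": "Homo sapiens", "source": "genbank"},
    ]
}
print(flag_misclassified(clusters))  # -> ['P3']
```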

2018
Author(s):
Lucas Czech
Alexandros Stamatakis

Abstract Motivation In most metagenomic sequencing studies, the initial analysis step consists in assessing the evolutionary provenance of the sequences. Phylogenetic (or evolutionary) placement methods can be employed to determine the evolutionary position of sequences with respect to a given reference phylogeny. These placement methods do, however, face certain limitations: the manual selection of reference sequences is labor-intensive; the computational effort to infer reference phylogenies is substantially larger than for methods that rely on sequence similarity; and the number of taxa in the reference phylogeny must be small enough to allow for visual inspection of the results. Results We present algorithms to overcome the above limitations. First, we introduce a method to automatically construct representative sequences from databases to infer reference phylogenies. Second, we present an approach for conducting large-scale phylogenetic placements on nested phylogenies. Third, we describe a preprocessing pipeline that allows for handling huge sequence datasets. Our experiments on empirical data show that our methods substantially accelerate the workflow and yield highly accurate placement results. Implementation Freely available under GPLv3 at http://github.com/lczech/ Contact [email protected] Supplementary Information Supplementary data are available at Bioinformatics online.
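For illustration only, the construction of representative sequences could be approximated by a per-taxon majority-rule consensus over pre-aligned sequences; the sketch below assumes equal-length aligned inputs and is not the authors' algorithm:

```python
from collections import Counter, defaultdict

def majority_consensus(aligned_seqs, min_freq=0.5):
    """Column-wise majority-rule consensus of equal-length aligned sequences.

    Columns whose most frequent character falls below `min_freq`
    are reported as 'N' (ambiguous).
    """
    consensus = []
    for column in zip(*aligned_seqs):
        char, count = Counter(column).most_common(1)[0]
        consensus.append(char if count / len(column) >= min_freq else "N")
    return "".join(consensus)

def representatives_per_taxon(records):
    """Build one representative sequence per taxon from (taxon, aligned_seq) pairs."""
    by_taxon = defaultdict(list)
    for taxon, seq in records:
        by_taxon[taxon].append(seq)
    return {taxon: majority_consensus(seqs) for taxon, seqs in by_taxon.items()}

# Toy example with three pre-aligned fragments from one taxon.
records = [
    ("Bacillus", "ACGT-ACGT"),
    ("Bacillus", "ACGTTACGT"),
    ("Bacillus", "ACGT-ACGA"),
]
print(representatives_per_taxon(records))  # {'Bacillus': 'ACGT-ACGT'}
```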


2018
Vol 35 (11)
pp. 1901-1906
Author(s):
Mary D Fortune
Chris Wallace

Abstract Motivation Methods for analysis of GWAS summary statistics have encouraged data sharing and democratized the analysis of different diseases. Ideal validation for such methods is application to simulated data, where some ‘truth’ is known. As GWAS increase in size, so does the computational complexity of such evaluations; standard practice repeatedly simulates and analyses genotype data for all individuals in an example study. Results We have developed a novel method based on an alternative approach, directly simulating GWAS summary data, without individual data as an intermediate step. We mathematically derive the expected statistics for any set of causal variants and their effect sizes, conditional upon control haplotype frequencies (available from public reference datasets). Simulation of GWAS summary output can be conducted independently of sample size by simulating random variates about these expected values. Across a range of scenarios, our method produces very similar output to that from simulating individual genotypes, with a substantial gain in speed even for modest sample sizes. Fast simulation of GWAS summary statistics will enable more complete and rapid evaluation of summary statistic methods as well as opening new potential avenues of research in fine mapping and gene set enrichment analysis. Availability and implementation Our method is available under a GPL license as an R package from http://github.com/chr1swallace/simGWAS. Supplementary information Supplementary data are available at Bioinformatics online.
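As a rough illustration of simulating summary statistics directly (this is not the simGWAS package itself), marginal Z-scores can be drawn from their asymptotic multivariate normal distribution, with the mean induced by the causal variants through a reference-panel LD matrix; all numbers below are toy values:

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_summary_z(R, causal_idx, causal_z, n_rep=1000):
    """Draw simulated GWAS Z-scores directly, without genotype simulation.

    R          : SNP x SNP LD (correlation) matrix, e.g. estimated from
                 reference-panel haplotypes.
    causal_idx : indices of the causal SNPs.
    causal_z   : expected (non-centrality) Z-scores at the causal SNPs,
                 which scale with effect size and sqrt(sample size).

    Marginal Z-scores are ~ MVN(mu, R) with mu = R[:, causal] @ causal_z.
    """
    mu = R[:, causal_idx] @ np.asarray(causal_z)
    return rng.multivariate_normal(mu, R, size=n_rep)

# Toy example: 5 SNPs with exchangeable LD of 0.4 and one causal SNP.
n_snps = 5
R = np.full((n_snps, n_snps), 0.4) + 0.6 * np.eye(n_snps)
z = simulate_summary_z(R, causal_idx=[2], causal_z=[5.0])
print(z.mean(axis=0).round(2))  # close to [2.0, 2.0, 5.0, 2.0, 2.0]
```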


Author(s):  
Tizian Schulz
Roland Wittler
Sven Rahmann
Faraz Hach
Jens Stoye

Abstract Motivation Increasing amounts of individual genomes sequenced per species motivate the usage of pangenomic approaches. Pangenomes may be represented as graphical structures, e.g. compacted colored de Bruijn graphs, which offer a low memory usage and facilitate reference-free sequence comparisons. While sequence-to-graph mapping to graphical pangenomes has been studied for some time, no local alignment search tool in the vein of BLAST has been proposed yet. Results We present a new heuristic method to find maximum scoring local alignments of a DNA query sequence to a pangenome represented as a compacted colored de Bruijn graph. Our approach additionally allows a comparison of similarity among sequences within the pangenome. We show that local alignment scores follow an exponential-tail distribution similar to BLAST scores, and we discuss how to estimate its parameters to separate local alignments representing sequence homology from spurious findings. An implementation of our method is presented, and its performance and usability are shown. Our approach scales sublinearly in running time and memory usage with respect to the number of genomes under consideration. This is an advantage over classical methods that do not make use of sequence similarity within the pangenome. Availability Source code and test data are available from https://gitlab.ub.uni-bielefeld.de/gi/plast. Supplementary information Supplementary data are available at Bioinformatics online.
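For illustration, exponential-tail (Gumbel-like) score statistics of the kind mentioned above can be fitted from scores of shuffled queries and turned into a BLAST-style E-value; this generic sketch is not PLAST's estimator, and the numbers are toy values:

```python
import math
import numpy as np

def fit_gumbel_tail(random_scores, m, n):
    """Fit Karlin-Altschul-style parameters (lambda, K) to maximal local
    alignment scores obtained from random or shuffled queries.

    Method-of-moments fit of a Gumbel distribution:
    scale beta = sqrt(6)*std/pi, location mu = mean - 0.5772*beta,
    then lambda = 1/beta and K = exp(lambda*mu) / (m*n).
    """
    s = np.asarray(random_scores, dtype=float)
    beta = math.sqrt(6.0) * s.std(ddof=1) / math.pi
    mu = s.mean() - 0.5772 * beta
    lam = 1.0 / beta
    K = math.exp(lam * mu) / (m * n)
    return lam, K

def e_value(score, lam, K, m, n):
    """Expected number of chance alignments reaching at least this score."""
    return K * m * n * math.exp(-lam * score)

# Toy example: simulated 'random' scores for a query of length m against a
# pangenome of total length n (hypothetical numbers).
rng = np.random.default_rng(0)
m, n = 1_000, 5_000_000
random_scores = rng.gumbel(loc=38.0, scale=3.0, size=500)
lam, K = fit_gumbel_tail(random_scores, m, n)
print(round(e_value(60.0, lam, K, m, n), 4))  # small E-value -> likely homology
```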


2018
Author(s):
Mary D. Fortune
Chris Wallace

Abstract Motivation Methods for analysis of GWAS summary statistics have encouraged data sharing and democratised the analysis of different diseases. Ideal validation for such methods is application to simulated data, where some “truth” is known. As GWAS increase in size, so does the computational complexity of such evaluations; standard practice repeatedly simulates and analyses genotype data for all individuals in an example study. Results We have developed a novel method based on an alternative approach, directly simulating GWAS summary data, without individual data as an intermediate step. We mathematically derive the expected statistics for any set of causal variants and their effect sizes, conditional upon control haplotype frequencies (available from public reference datasets). Simulation of GWAS summary output can be conducted independently of sample size by simulating random variates about these expected values. Across a range of scenarios, our method produces very similar output to that from simulating individual genotypes, with a substantial gain in speed even for modest sample sizes. Fast simulation of GWAS summary statistics will enable more complete and rapid evaluation of summary statistic methods as well as opening new potential avenues of research in fine mapping and gene set enrichment analysis. Availability and Implementation Our method is available under a GPL license as an R package from http://github.com/chr1swallace/simGWAS. Contact [email protected] Supplementary Information Supplementary Information is appended.


Author(s):  
Daniel A Nissley
Anna Carbery
Mark Chonofsky
Charlotte M Deane

Abstract Motivation Protein synthesis is a non-equilibrium process, meaning that the speed of translation can influence the ability of proteins to fold and function. Assuming that structurally similar proteins fold by similar pathways, the profile of translation speed along an mRNA should be evolutionarily conserved between related proteins to direct correct folding and downstream function. The only evidence to date for such conservation of translation speed between homologous proteins has used codon rarity as a proxy for translation speed. There are, however, many other factors including mRNA structure and the chemistry of the amino acids in the A- and P-sites of the ribosome that influence the speed of amino acid addition. Results Ribosome profiling experiments provide a signal directly proportional to the underlying translation times at the level of individual codons. We compared ribosome occupancy profiles (extracted from five different large-scale yeast ribosome profiling studies) between related protein domains to more directly test if their translation schedule was conserved. Our analysis reveals that the ribosome occupancy profiles of paralogous domains tend to be significantly more similar to one another than to profiles of non-paralogous domains. This trend does not depend on domain length, structural classes, amino acid composition or sequence similarity. Our results indicate that entire ribosome occupancy profiles and not just rare codon locations are conserved between even distantly related domains in yeast, providing support for the hypothesis that translation schedule is conserved between structurally related domains to retain folding pathways and facilitate efficient folding. Availability and implementation Python3 code is available on GitHub at https://github.com/DanNissley/Compare-ribosome-occupancy. Supplementary information Supplementary data are available at Bioinformatics online.
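One simple way to compare occupancy profiles of domains with different lengths, sketched below on synthetic data, is to interpolate each per-codon profile onto a common relative coordinate and compute a rank correlation; this is only an illustration of such a comparison, not the authors' pipeline:

```python
import numpy as np
from scipy.stats import spearmanr

def profile_similarity(profile_a, profile_b, n_points=100):
    """Compare two per-codon ribosome occupancy profiles of different lengths.

    Each profile is interpolated onto a common relative coordinate
    (0..1 along the domain) and compared with Spearman correlation.
    """
    grid = np.linspace(0.0, 1.0, n_points)
    a = np.interp(grid, np.linspace(0.0, 1.0, len(profile_a)), profile_a)
    b = np.interp(grid, np.linspace(0.0, 1.0, len(profile_b)), profile_b)
    rho, _ = spearmanr(a, b)
    return rho

# Toy example: two noisy versions of the same underlying occupancy shape.
rng = np.random.default_rng(0)
base = np.sin(np.linspace(0, 3 * np.pi, 120)) ** 2
paralog_1 = base[:110] + rng.normal(0, 0.1, 110)
paralog_2 = base + rng.normal(0, 0.1, 120)
print(round(profile_similarity(paralog_1, paralog_2), 2))  # high correlation
```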


2021
Vol 11 (1)
Author(s):
Monica R. Young
Jeremy R. deWaard
Paul D. N. Hebert

Abstract Although mites (Acari) are abundant in many terrestrial and freshwater ecosystems, their diversity is poorly understood. Since most mite species can be distinguished by variation in the DNA barcode region of cytochrome c oxidase I, the Barcode Index Number (BIN) system provides a reliable species proxy that facilitates large-scale surveys. Such analysis reveals many new BINs that can only be identified as Acari until they are examined by a taxonomic specialist. This study demonstrates that the Barcode of Life Data System's identification engine (BOLD ID) generally delivers correct ordinal and family assignments from both full-length DNA barcodes and their truncated versions gathered in metabarcoding studies. This result was demonstrated by examining BOLD ID's capacity to assign 7021 mite BINs to their correct order (4 orders) and family (189 families). Identification success improved with sequence length and taxon coverage but varied among orders, indicating the need for lineage-specific thresholds. A strict sequence similarity threshold (86.6%) prevented all ordinal misassignments and allowed the identification of 78.6% of the 7021 BINs. However, higher thresholds were required to eliminate family misassignments for Sarcoptiformes (89.9%) and Trombidiformes (91.4%), consequently reducing the proportion of BINs identified to 68.6%. Lineages with low barcode coverage in the reference library should be prioritized for barcode library expansion to improve assignment success.
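The threshold analysis can be illustrated by filtering top-hit identifications at a similarity cut-off and tallying coverage versus misassignments; the records below are hypothetical, with only the 86.6% cut-off taken from the abstract:

```python
def evaluate_threshold(hits, threshold):
    """Apply a top-hit similarity threshold to barcode identifications.

    `hits` is a list of dicts with keys:
      'similarity'     : % identity of the top reference hit,
      'predicted_order': order of the top hit,
      'true_order'     : order confirmed by a taxonomic specialist.
    Returns (proportion identified, misassignment rate among identified).
    """
    identified = [h for h in hits if h["similarity"] >= threshold]
    if not identified:
        return 0.0, 0.0
    wrong = sum(h["predicted_order"] != h["true_order"] for h in identified)
    return len(identified) / len(hits), wrong / len(identified)

# Toy example: at a strict 86.6% threshold the low-similarity misassignment
# is dropped, trading coverage for accuracy.
hits = [
    {"similarity": 97.1, "predicted_order": "Sarcoptiformes", "true_order": "Sarcoptiformes"},
    {"similarity": 88.0, "predicted_order": "Trombidiformes", "true_order": "Trombidiformes"},
    {"similarity": 84.2, "predicted_order": "Mesostigmata",   "true_order": "Trombidiformes"},
]
print(evaluate_threshold(hits, 86.6))  # 2/3 identified, no misassignments
```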


2020
Author(s):
Hamid Bagheri
Robert Dyer
Andrew Severin
Hridesh Rajan

Abstract Background: Scientists around the world use NCBI’s non-redundant (NR) database to identify the taxonomic origin and functional annotation of their favorite protein sequences using BLAST. Unfortunately, due to the exponential growth of this database, many scientists do not have a good understanding of its contents. There is a need for tools to explore the contents of large biological datasets, such as NR, to better understand the assumptions and limitations of the data they contain. Results: Protein sequence data, protein functional annotation and taxonomic assignment from NCBI’s NR database were placed into a BoaG database (BoaG is a domain-specific language and shared data science infrastructure for genomics), along with a CD-HIT clustering of all these protein sequences at different sequence similarity levels. We show that BoaG can efficiently perform queries on this large dataset to determine the average length of protein sequences and to identify the most common taxonomic assignments and functional annotations. Using the clustering information, we also show that the NR database has a considerable amount of annotation redundancy at the 95% similarity level. Conclusions: We implemented BoaG and provided a web-based interface to BoaG’s infrastructure that will help researchers explore the dataset further. Researchers can submit queries and download the results or share them with others. Availability and implementation: The web interface of the BoaG infrastructure can be accessed at http://boa.cs.iastate.edu/boag. Please use user = boag and password = boag to log in. Source code and other documentation are also provided in a GitHub repository: https://github.com/boalang/NR_Dataset.
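The aggregate queries described (average protein length, most common taxa and annotations, annotation redundancy within 95% clusters) could be expressed, for illustration, over a hypothetical tab-separated export of the dataset; the sketch below is Python, not BoaG syntax, and the file and column names are assumptions:

```python
import csv
from collections import Counter

def summarize_nr(tsv_path):
    """Aggregate queries over a hypothetical TSV export with columns:
    protein_id, length, taxon, annotation, cluster95 (95%-identity cluster id).
    """
    lengths, taxa, annots = [], Counter(), Counter()
    cluster_annots = {}  # cluster id -> set of distinct annotations
    with open(tsv_path, newline="") as fh:
        for row in csv.DictReader(fh, delimiter="\t"):
            lengths.append(int(row["length"]))
            taxa[row["taxon"]] += 1
            annots[row["annotation"]] += 1
            cluster_annots.setdefault(row["cluster95"], set()).add(row["annotation"])
    # Clusters whose members all share one annotation indicate annotation redundancy.
    redundant = sum(len(a) == 1 for a in cluster_annots.values())
    return {
        "mean_length": sum(lengths) / len(lengths),
        "top_taxa": taxa.most_common(5),
        "top_annotations": annots.most_common(5),
        "single_annotation_clusters": redundant / len(cluster_annots),
    }

# Usage with a hypothetical export file:
# print(summarize_nr("nr_export.tsv"))
```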


2019
Vol 35 (19)
pp. 3651-3662
Author(s):
F J Campos-Laborie
A Risueño
M Ortiz-Estévez
B Rosón-Burgo
C Droste
...  

Abstract Motivation Patient and sample diversity is one of the main challenges when dealing with clinical cohorts in biomedical genomics studies. During the last decade, several methods have been developed to identify biomarkers assigned to specific individuals or subtypes of samples. However, current methods still fail to discover markers in complex scenarios where heterogeneity or hidden phenotypic factors are present. Here, we propose a method to analyze and understand heterogeneous data while avoiding classical normalization approaches that reduce or remove variation. Results DEcomposing heterogeneous Cohorts using Omic data profiling (DECO) is a method to find significant associations among biological features (biomarkers) and samples (individuals) by analyzing large-scale omic data. The method identifies and categorizes biomarkers of specific phenotypic conditions based on a recurrent differential analysis integrated with a non-symmetrical correspondence analysis. DECO integrates both omic data dispersion and the predictor-response relationship from non-symmetrical correspondence analysis into a single statistic (the h-statistic), allowing the identification of closely related sample categories within complex cohorts. The performance is demonstrated using simulated data and five experimental transcriptomic datasets, and compared with seven other methods. We show that DECO greatly enhances the discovery and subtle identification of biomarkers, making it especially suited for deep and accurate patient stratification. Availability and implementation DECO is freely available as an R package (including a practical vignette) from the Bioconductor repository (http://bioconductor.org/packages/deco/). Supplementary information Supplementary data are available at Bioinformatics online.
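The recurrent differential analysis component can be illustrated by repeatedly subsampling the cohort, testing each feature, and recording how often it reaches significance; this sketch is only a rough illustration of that idea, not the DECO h-statistic, and all parameters and data are toy values:

```python
import numpy as np
from scipy.stats import mannwhitneyu

def recurrence_frequency(X, labels, n_iter=200, frac=0.7, alpha=0.01, seed=0):
    """Recurrent differential analysis sketch: how often is each feature
    significant across random subsamples of the cohort?

    X      : samples x features expression matrix.
    labels : binary group labels per sample (0/1).
    Returns the per-feature recurrence frequency.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    counts = np.zeros(X.shape[1])
    for _ in range(n_iter):
        idx = rng.choice(len(labels), size=int(frac * len(labels)), replace=False)
        g0, g1 = X[idx][labels[idx] == 0], X[idx][labels[idx] == 1]
        if len(g0) < 3 or len(g1) < 3:
            continue  # skip unbalanced subsamples
        for j in range(X.shape[1]):
            _, p = mannwhitneyu(g0[:, j], g1[:, j])
            counts[j] += p < alpha
    return counts / n_iter

# Toy cohort: feature 0 differs between groups only in a hidden subgroup.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
labels = np.repeat([0, 1], 30)
X[45:, 0] += 3.0  # shift in half of group 1 only
print(recurrence_frequency(X, labels).round(2))
```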


Author(s):  
Jose S. Hleap
Joanne E. Littlefair
Dirk Steinke
Paul D. N. Hebert
Melania E. Cristescu

Abstract The effective use of metabarcoding in biodiversity science has brought important analytical challenges due to the need to generate accurate taxonomic assignments. The assignment of sequences to the genus or species level is critical for biodiversity surveys and biomonitoring, but it is particularly challenging. Researchers must select the approach that best recovers information on species composition. This study evaluates the performance and accuracy of seven methods in recovering the species composition of mock communities that vary in species number and specimen abundance, while holding upstream molecular and bioinformatic variables constant. It also evaluates the impact of parameter optimization on the quality of the predictions. Despite the general belief that the BLAST top hit approach underperforms newer methods, our results indicate that it competes well with more complex approaches if optimized for the mock community under study. For example, the two machine learning methods that were benchmarked proved more sensitive to the heterogeneity and completeness of the reference database than methods based on sequence similarity. The accuracy of assignments was impacted by both species and specimen counts, which will influence the selection of appropriate software. We urge the use of realistic mock communities to allow parameter optimization, regardless of the taxonomic assignment method used.
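Recovery of a mock community's composition is typically scored with precision, recall and F1 against the known species list; the sketch below uses hypothetical species names and presence/absence only, as one plausible way to score the comparison described:

```python
def composition_scores(predicted, expected):
    """Precision, recall and F1 of a recovered species list against the
    known composition of a mock community (presence/absence only)."""
    predicted, expected = set(predicted), set(expected)
    tp = len(predicted & expected)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(expected) if expected else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Toy example with a hypothetical mock community of four species.
expected = {"Daphnia pulex", "Bosmina longirostris", "Chydorus sphaericus", "Cyclops scutifer"}
predicted = {"Daphnia pulex", "Bosmina longirostris", "Daphnia magna"}
print(composition_scores(predicted, expected))  # (~0.67, 0.5, ~0.57)
```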


2019
Author(s):
Ryther Anderson
Achay Biong
Diego Gómez-Gualdrón

Tailoring the structure and chemistry of metal-organic frameworks (MOFs) enables the manipulation of their adsorption properties to suit specific energy and environmental applications. As there are millions of possible MOFs (with tens of thousands already synthesized), molecular simulation, such as grand canonical Monte Carlo (GCMC), has frequently been used to rapidly evaluate the adsorption performance of a large set of MOFs. This allows subsequent experiments to focus only on a small subset of the most promising MOFs. In many instances, however, even molecular simulation becomes prohibitively time consuming, underscoring the need for alternative screening methods, such as machine learning, to precede molecular simulation efforts. In this study, as a proof of concept, we trained a neural network as the first example of a machine learning model capable of predicting full adsorption isotherms of different molecules not included in the training of the model. To achieve this, we trained our neural network only on alchemical species, represented solely by their geometry and force field parameters, and used this network to predict the loadings of real adsorbates. We focused on predicting room-temperature adsorption of small (one- and two-atom) molecules relevant to chemical separations, namely argon, krypton, xenon, methane, ethane and nitrogen. However, we also observed surprisingly promising predictions for more complex molecules, whose properties lie outside the range spanned by the alchemical adsorbates. Prediction accuracies suitable for large-scale screening were achieved using simple MOF descriptors (e.g. geometric properties and chemical moieties) and adsorbate descriptors (e.g. force field parameters and geometry). Our results illustrate a new philosophy of training that opens the path toward the development of machine learning models that can predict the adsorption loading of any new adsorbate at any new operating conditions in any new MOF.
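The learning setup described (MOF descriptors plus adsorbate descriptors plus a pressure point, mapped to a predicted loading) can be sketched with a small feed-forward regressor on synthetic data; the descriptors, the Langmuir-like toy target and the scikit-learn model below are assumptions, not the authors' network or dataset:

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Synthetic stand-in data: each row concatenates MOF descriptors
# (e.g. void fraction, surface area), adsorbate force-field descriptors
# (e.g. Lennard-Jones epsilon/sigma), and the pressure of the isotherm point.
n_samples, n_mof, n_ads = 2000, 4, 3
X = rng.uniform(size=(n_samples, n_mof + n_ads + 1))

# Fake "loading" with a saturating, Langmuir-like dependence on pressure
# so the toy target at least resembles an adsorption isotherm.
capacity = 1.0 + 4.0 * X[:, 0]        # depends on a MOF descriptor
affinity = 0.5 + 2.0 * X[:, n_mof]    # depends on an adsorbate descriptor
pressure = X[:, -1]
y = capacity * affinity * pressure / (1.0 + affinity * pressure)

# Train on one slice of the data, predict a held-out point.
model = MLPRegressor(hidden_layer_sizes=(64, 64), max_iter=2000, random_state=0)
model.fit(X[:1500], y[:1500])
print(float(model.predict(X[1500:1501])[0]), float(y[1500]))
```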

