Searching sequence databases for functional homologs using profile HMMs: how to set bit score thresholds?

2021 ◽  
Author(s):  
Jaya Srivastava ◽  
Ritu Hembrom ◽  
Ankita Kumawat ◽  
Petety V. Balaji

UniProt and BFD databases together have 2.5 billion protein sequences. A large majority of these proteins have been electronically annotated. Automated annotation pipelines, vis-à-vis manual curation, have the advantage of scale and speed but are fraught with relatively higher error rates. This is because sequence homology does not necessarily translate to functional homology, molecular function specification is hierarchic, and not all functional families have the same amount of experimental data that one can exploit for annotation. Consequently, customization of the annotation workflow is inevitable to minimize annotation errors. In this study, we illustrate possible ways of customizing the search of sequence databases for functional homologs using profile HMMs. Choosing an optimal bit score threshold is a critical step in the application of HMMs. We illustrate ways in which an optimal bit score threshold can be arrived at using four Case Studies. These are the single-domain nucleotide sugar 6-dehydrogenase and lysozyme-C families, and the SH3 and GT-A domains, which are typically found as a part of multi-domain proteins. We also discuss the limitations of using profile HMMs for functional annotation and suggest some possible ways to partially overcome such limitations.
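As a rough illustration of what choosing a bit score threshold can involve, the sketch below scans candidate thresholds over a set of labelled hits and keeps the one that maximises the F1 score. The function name, the toy scores and the choice of F1 as the criterion are all assumptions made for illustration; the abstract does not state which criterion the authors use.

```python
def best_bit_score_threshold(scored_hits):
    """Return (threshold, f1) maximising F1 over labelled hits.

    scored_hits: list of (bit_score, is_true_homolog) pairs.
    A hit is called positive when bit_score >= threshold.
    """
    total_pos = sum(1 for _, label in scored_hits if label)
    best = (None, -1.0)
    # Every distinct observed score is tried as a candidate threshold.
    for threshold in sorted({score for score, _ in scored_hits}):
        tp = sum(1 for s, label in scored_hits if s >= threshold and label)
        fp = sum(1 for s, label in scored_hits if s >= threshold and not label)
        fn = total_pos - tp
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best[1]:
            best = (threshold, f1)
    return best


# Hypothetical bit scores; True marks an experimentally supported homolog.
hits = [
    (310.0, True), (295.5, True), (240.2, True), (180.0, False),
    (150.3, True), (120.0, False), (90.1, False),
]
threshold, f1 = best_bit_score_threshold(hits)
print(threshold, round(f1, 3))
```

With real data, the labelled hits would come from sequences whose function is known experimentally, and a different criterion (for example, zero false positives) may suit a given family better.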

2018 ◽  
Author(s):  
Robert C. Edgar

Abstract Sequencing of the 16S ribosomal RNA (rRNA) gene and the fungal Internal Transcribed Spacer (ITS) region is widely used to survey microbial communities. Specialized ribosomal sequence databases have been developed to support this approach including Greengenes, SILVA and RDP. Most taxonomy annotations in these databases are predictions from sequence rather than authoritative assignments based on studies of type strains or isolates. Here, I investigate the error rates of taxonomy annotations in these databases. I found 253,485 sequences with conflicting annotations in SILVA v128 and Greengenes v13.5 at ranks up to phylum (9,644 conflicts), indicating that the annotation error rate in these databases is ~15%. I found that 34% of non-singleton genera have overlapping subtrees in the Greengenes tree from 2001 according to the RDP taxonomy, most of which are probably due to branching order errors in the Greengenes tree, which is therefore an unreliable guide to phylogeny. Using a blinded test, I estimated that the annotation error rate of the RDP database is ~10%.


2020 ◽  
Author(s):  
Yarden Cohen ◽  
David Nicholson ◽  
Alexa Sanchioni ◽  
Emily K. Mallaber ◽  
Viktoriya Skidanova ◽  
...  

Abstract Songbirds have long been studied as a model system of sensory-motor learning. Many analyses of birdsong require time-consuming manual annotation of the individual elements of song, known as syllables or notes. Here we describe the first automated algorithm for birdsong annotation that is applicable to complex song such as canary song. We developed a neural network architecture, “TweetyNet”, that is trained with a small amount of hand-labeled data using supervised learning methods. We first show TweetyNet achieves significantly lower error on Bengalese finch song than a similar method, using less training data, and maintains low error rates across days. Applied to canary song, TweetyNet achieves fully automated annotation of canary song, accurately capturing the complex statistical structure previously discovered in a manually annotated dataset. We conclude that TweetyNet will make it possible to ask a wide range of new questions focused on complex songs where manual annotation was impractical.


2020 ◽  
Vol 49 (D1) ◽  
pp. D475-D479
Author(s):  
Joicymara S Xavier ◽  
Thanh-Binh Nguyen ◽  
Malancha Karmarkar ◽  
Stephanie Portelli ◽  
Pâmela M Rezende ◽  
...  

Abstract Proteins are intricate, dynamic structures, and small changes in their amino acid sequences can lead to large effects on their folding, stability and dynamics. To facilitate the further development and evaluation of methods to predict these changes, we have developed ThermoMutDB, a manually curated database containing >14,669 experimental data of thermodynamic parameters for wild type and mutant proteins. This represents an increase of 83% in unique mutations over previous databases and includes thermodynamic information on 204 new proteins. During manual curation we have also corrected annotation errors in previously curated entries. Associated with each entry, we have included information on the unfolding Gibbs free energy and melting temperature change, and have associated entries with available experimental structural information. ThermoMutDB allows users to contribute new data points and supports programmatic access to the database via a RESTful API. ThermoMutDB is freely available at: http://biosig.unimelb.edu.au/thermomutdb.


2005 ◽  
Vol 193 (2) ◽  
pp. 223-234 ◽  
Author(s):  
Walter R. Gilks ◽  
Benjamin Audit ◽  
Daniela de Angelis ◽  
Sophia Tsoka ◽  
Christos A. Ouzounis

Author(s):  
Thomas Adejoh ◽  
Chukwuemeka H. Elugwu ◽  
Mohammed Sidi ◽  
Emeka E. Ezugwu ◽  
Chijioke O. Asogwa ◽  
...  

Abstract Background: Errors in radiographic image annotation by radiographers could potentially lead to misdiagnoses by radiologists and wrong-side surgery by surgeons. Such medical negligence has dire medico-legal consequences. It was hypothesized that the newer technology of computed radiography (CR) and direct digital radiography (DDR) image annotation would potentially lead to a change in practice with a subsequent reduction in annotation errors. Following installation of computed radiography, a modality with electronic, post-processing image annotation, the hypothesis was investigated in our study centre. Results: A total of 72,602 and 126,482 images were documented for film-screen radiography (FSR) and computed radiography (CR), respectively, in the department. From these, a sample size of 9452, made up of 4726 each for FSR and CR, was drawn. Anatomical side marker errors were common in every anatomy imaged, with more errors seen in FSR (4.6%) than CR (0.6%). Collectively, an error rate of 3.0% was observed. Errors noticed were a result of marker burnout due to over-exposure as well as marker cone off due to tight beam collimation. Conclusion: Error rates were considerably reduced following a change from film-screen radiography (FSR) to computed radiography (CR) at the study centre. This change was, however, influenced more by a team of quality control radiographers stationed at the CR workstation than by actual practice in the x-ray imaging suite. The presence of anthropomorphic phantoms in university teaching laboratories for demonstrations will significantly inculcate the skill needed to completely eliminate anatomical side marker (ASM) error in practice.


2018 ◽  
Author(s):  
Lucas Beasley ◽  
Prashanti Manda

Manual curation of scientific literature for ontology-based knowledge representation has proven infeasible and unscalable to the large and growing volume of scientific literature. Automated annotation solutions that leverage text mining and Natural Language Processing (NLP) have been developed to ameliorate the problem of literature curation. These NLP approaches use parsing, syntactical, and lexical analysis of text to recognize and annotate pieces of text with ontology concepts. Here, we conduct a comparison of four state of the art NLP tools at the task of recognizing Gene Ontology concepts from biomedical literature using the Colorado Richly Annotated Full-Text (CRAFT) corpus as a gold standard reference. We demonstrate the use of semantic similarity metrics to compare NLP tool annotations to the gold standard.


2021 ◽  
Author(s):  
Joseph Robinson ◽  
Yun Fu ◽  
Samson Timoner ◽  
Yann Henon ◽  
Can Qin

There are demographic biases in current models used for facial recognition (FR). Our Balanced Faces In the Wild (BFW) dataset serves as a proxy to measure bias across ethnicity and gender subgroups, allowing one to characterize FR performance per subgroup. We show that performance is non-optimal when a single score threshold is used to determine whether sample pairs are genuine or imposter. Across subgroups, performance ratings vary from those reported across the entire dataset. Thus, claims of specific error rates only hold true for populations matching that of the validation data. We mitigate the imbalanced performance using a novel domain adaptation learning scheme on the facial features extracted using state-of-the-art methods. Not only does this technique balance performance, but it also boosts the overall performance. A benefit of the proposed scheme is that it preserves identity information in facial features while removing demographic knowledge in the lower dimensional features. The removal of demographic knowledge prevents future potential biases from being injected into decision-making. This removal satisfies privacy concerns. We explore why this works qualitatively; we also show quantitatively that subgroup classifiers can no longer learn from the features mapped by the proposed scheme.
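The observation about a single score threshold can be made concrete with a small sketch. It picks a threshold that meets a target false match rate (FMR) on pooled imposter scores, then shows that the same threshold yields different FMRs per subgroup. The subgroup names, scores and helper functions are hypothetical, and the sketch does not reproduce the domain adaptation scheme described in the abstract.

```python
def false_match_rate(impostor_scores, threshold):
    """Fraction of imposter pairs whose score is at or above the threshold."""
    return sum(1 for s in impostor_scores if s >= threshold) / len(impostor_scores)


def threshold_for_fmr(impostor_scores, target_fmr):
    """Smallest observed score whose FMR does not exceed target_fmr."""
    for t in sorted(set(impostor_scores)):
        if false_match_rate(impostor_scores, t) <= target_fmr:
            return t
    # No observed score meets the target, so reject every observed pair.
    return float("inf")


# Hypothetical imposter similarity scores for two subgroups.
impostors = {
    "A": [0.10, 0.20, 0.30, 0.40, 0.50],
    "B": [0.30, 0.40, 0.50, 0.60, 0.70],
}
target = 0.2

# One global threshold chosen on the pooled scores.
pooled = [s for scores in impostors.values() for s in scores]
global_t = threshold_for_fmr(pooled, target)

# FMR each subgroup actually experiences under the global threshold.
fmr_under_global = {g: false_match_rate(s, global_t) for g, s in impostors.items()}

# Thresholds chosen separately for each subgroup.
per_group_t = {g: threshold_for_fmr(s, target) for g, s in impostors.items()}

print(global_t, fmr_under_global, per_group_t)
```

In this toy data the pooled threshold meets the target overall, yet one subgroup sees no false matches while the other sees twice the target rate, which is the kind of imbalance the abstract describes.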


Author(s):  
William A. Heeschen

Two new morphological measurements based on digital image analysis, CoContinuity and CoContinuity Balance, have been developed and implemented for quantitative measurement of morphology in polymer blends. The morphology of polymer blends varies with phase ratio, composition and processing. A typical morphological evolution for increasing phase ratio of polymer A to polymer B starts with discrete domains of A in a matrix of B (A/B < 1), moves through a cocontinuous distribution of A and B (A/B ≈ 1) and finishes with discrete domains of B in a matrix of A (A/B > 1). For low phase ratios, A is often seen as solid convex particles embedded in the continuous B phase. As the ratio increases, A domains begin to evolve into irregular shapes, though still recognizable as separate domains. Further increase in the phase ratio leads to A domains which extend into and surround the B phase while the B phase simultaneously extends into and surrounds the A phase.

