annotation errors
Recently Published Documents



2021 ◽  
Author(s):  
Jaya Srivastava ◽  
Ritu Hembrom ◽  
Ankita Kumawat ◽  
Petety V. Balaji

UniProt and BFD databases together have 2.5 billion protein sequences. A large majority of these proteins have been electronically annotated. Automated annotation pipelines, vis-à-vis manual curation, have the advantage of scale and speed but are fraught with relatively higher error rates. This is because sequence homology does not necessarily translate to functional homology, molecular function specification is hierarchic, and not all functional families have the same amount of experimental data that one can exploit for annotation. Consequently, customization of the annotation workflow is inevitable to minimize annotation errors. In this study, we illustrate possible ways of customizing the search of sequence databases for functional homologs using profile HMMs. Choosing an optimal bit score threshold is a critical step in the application of HMMs. We illustrate ways in which an optimal bit score can be arrived at using four Case Studies. These are the single-domain nucleotide sugar 6-dehydrogenase and lysozyme-C families, and the SH3 and GT-A domains, which are typically found as part of multi-domain proteins. We also discuss the limitations of using profile HMMs for functional annotation and suggest some possible ways to partially overcome such limitations.
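To make the idea of a bit score threshold concrete, the following is a minimal sketch and not the authors' procedure. It assumes that bit scores are already available for a set of sequences known to belong to the family and a set known not to, and it picks the cutoff that maximizes F1. The function name and all scores are hypothetical.

```python
from typing import Sequence, Tuple


def best_bit_score_threshold(
    positives: Sequence[float], negatives: Sequence[float]
) -> Tuple[float, float]:
    """Return (threshold, F1) for the cutoff that maximizes F1.

    A hit is accepted when its bit score is >= threshold.
    On ties, the lowest such threshold is kept.
    """
    candidates = sorted(set(positives) | set(negatives))
    best_t, best_f1 = candidates[0], -1.0
    for t in candidates:
        tp = sum(s >= t for s in positives)   # family members accepted
        fp = sum(s >= t for s in negatives)   # non-members accepted
        fn = len(positives) - tp              # family members rejected
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


# Hypothetical bit scores, invented purely for illustration.
known_members = [312.4, 287.9, 251.0, 198.3, 176.5]
known_non_members = [142.1, 97.6, 60.2, 181.0]

threshold, f1 = best_bit_score_threshold(known_members, known_non_members)
print(threshold, round(f1, 3))
```

Maximizing F1 is only one possible criterion. When false positives are costlier than missed homologs, a stricter cutoff (for example, the lowest score above every known non-member) may be preferable.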


Genes ◽  
2021 ◽  
Vol 12 (7) ◽  
pp. 963
Author(s):  
Friedhelm Pfeiffer ◽  
Mike Dyall-Smith

Background: Annotation ambiguities and annotation errors are a general challenge in genomics. While a reliable protein function assignment can be obtained by experimental characterization, this is expensive and time-consuming, and the number of such Gold Standard Proteins (GSP) with experimental support remains very low compared to proteins annotated by sequence homology, usually through automated pipelines. Even a GSP may give a misleading assignment when used as a reference: the homolog may be close enough to support isofunctionality, but the substrate of the GSP is absent from the species being annotated. In such cases, the enzymes cannot be isofunctional. Here, we examined a variety of such issues in halophilic archaea (class Halobacteria), with a strong focus on the model haloarchaeon Haloferax volcanii. Results: Annotated proteins of Hfx. volcanii were identified for which public databases tend to assign a function that is probably incorrect. In some cases, an alternative, probably correct, function can be predicted or inferred from the available evidence, but this has not been adopted by public databases because experimental validation is lacking. In other cases, a probably invalid specific function is predicted by homology, and while there is evidence that this assigned function is unlikely, the true function remains elusive. We listed 50 of those cases, each with detailed background information, so that a conclusion about the most likely biological function can be drawn. For reasons of brevity and comprehension, only the key aspects are listed in the main text, with detailed information being provided in a corresponding section of the Supplementary Materials. Conclusions: Compiling, describing and summarizing these open annotation issues and functional predictions will benefit the scientific community in the general effort to improve the evaluation of protein function assignments and more thoroughly detail them. 
By highlighting the gaps and likely annotation errors currently in the databases, we hope this study will provide a framework for experimentalists to systematically confirm (or disprove) our function predictions or to uncover yet more unexpected functions.


2021 ◽  
Author(s):  
Friedhelm Pfeiffer ◽  
Mike Dyall-Smith

Background: Annotation ambiguities and annotation errors are a general challenge in genomics. While a reliable protein function assignment can be obtained by experimental characterization, this is expensive and time-consuming, and the number of such Gold Standard Proteins (GSP) with experimental support remains very low compared to proteins annotated by sequence homology, usually through automated pipelines. Even a GSP may give a misleading assignment when used as a reference: the homolog may be close enough to support isofunctionality, but the substrate of the GSP is absent from the species being annotated. In such cases the enzymes cannot be isofunctional. Here, we examine a variety of such issues in halophilic archaea (class Halobacteria), with a strong focus on the model haloarchaeon Haloferax volcanii. Results: Annotated proteins of Hfx. volcanii were identified for which public databases tend to assign a function that is probably incorrect. In some cases, an alternative, probably correct, function can be predicted or inferred from the available evidence but this has not been adopted by public databases because experimental validation is lacking. In other cases, a probably invalid specific function is predicted by homology, and while there is evidence that this assigned function is unlikely, the true function remains elusive. We list 50 of those cases, each with detailed background information so that a conclusion about the most likely biological function can be drawn. For reasons of brevity and comprehension, only key aspects are listed in the main text, with detailed information being provided in a corresponding section of the Supplementary Material. Conclusions: Compiling, describing and summarizing these open annotation issues and functional predictions will benefit the scientific community in the general effort to improve the evaluation of protein function assignments and more thoroughly detail them. 
By highlighting the gaps and likely annotation errors currently in the databases, we hope this study will provide a framework for experimentalists to systematically confirm (or disprove) our function predictions or to uncover yet unexpected functions.


2021 ◽  
Author(s):  
Michael Y. Galperin ◽  
Yuri I. Wolf ◽  
Sofya K. Garushyants ◽  
Roberto Vera Alvarez ◽  
Eugene V. Koonin

Ribosomal proteins (RPs) are highly conserved across the bacterial and archaeal domains. Although many RPs are essential for survival, genome analysis demonstrates the absence of some RP genes in many bacterial and archaeal genomes. Furthermore, global transposon mutagenesis and/or targeted deletion showed that elimination of some RP genes had only a moderate effect on the bacterial growth rate. Here, we systematically analyze the evolutionary conservation of RPs in prokaryotes by compiling the list of the ribosomal genes that are missing from one or more genomes in the recently updated version of the Clusters of Orthologous Genes (COG) database. Some of these absences occurred because the respective genes carried frameshifts, presumably, resulting from sequencing errors, while others were overlooked and not translated during genome annotation. Apart from these annotation errors, we identified multiple genuine losses of RP genes in a variety of bacteria and archaea. Some of these losses are clade-specific, whereas others occur in symbionts and parasites with dramatically reduced genomes. The lists of computationally and experimentally defined non-essential ribosomal genes show a substantial overlap, revealing a common trend in prokaryote ribosome evolution that could be linked to the architecture and assembly of the ribosomes. Thus, RPs that are located at the surface of the ribosome and/or are incorporated at a late stage of ribosome assembly are more likely to be non-essential and to be lost during microbial evolution, particularly, in the course of genome compaction. IMPORTANCE In many prokaryote genomes, one or more ribosomal protein (RP) genes are missing. Analysis of 1,309 prokaryote genomes included in the COG database shows that only about half of the RPs are universally conserved in bacteria and archaea. In contrast, up to 16 other RPs are missing in some genomes, primarily, tiny (<1 Mb) genomes of host-associated bacteria and archaea. 
Ten universal and nine archaea-specific ribosomal proteins show clear patterns of lineage-specific gene loss. Most of the RPs that are frequently lost from bacterial genomes are located on the ribosome periphery and are non-essential in Escherichia coli and Bacillus subtilis. These results reveal general trends and common constraints in the architecture and evolution of ribosomes in prokaryotes.


2020 ◽  
Vol 11 (1) ◽  
pp. 24
Author(s):  
Jin Tao ◽  
Kelly Brayton ◽  
Shira Broschat

Advances in genome sequencing technology and computing power have brought about the explosive growth of sequenced genomes in public repositories with a concomitant increase in annotation errors. Many protein sequences are annotated using computational analysis rather than experimental verification, leading to inaccuracies in annotation. Confirmation of existing protein annotations is urgently needed before misannotation becomes even more prevalent due to error propagation. In this work we present a novel approach for automatically confirming the existence of manually curated information with experimental evidence of protein annotation. Our ensemble learning method uses a combination of recurrent convolutional neural network, logistic regression, and support vector machine models. Natural language processing in the form of word embeddings is used with journal publication titles retrieved from the UniProtKB database. Importantly, we use recall as our most significant metric to ensure the maximum number of verifications possible; results are reported to a human curator for confirmation. Our ensemble model achieves 91.25% recall, 71.26% accuracy, 65.19% precision, and an F1 score of 76.05% and outperforms the Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) model with fine-tuning using the same data.
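The reported F1 score can be cross-checked against the reported precision and recall, since F1 is their harmonic mean. The snippet below is a small sketch of that arithmetic. The variable names are illustrative, and the inputs are the figures quoted in the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values quoted in the abstract above.
reported_precision = 0.6519
reported_recall = 0.9125

f1 = f1_score(reported_precision, reported_recall)
print(f"{f1:.2%}")  # prints 76.05%
```

The result agrees with the 76.05% stated in the abstract.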


Author(s):  
Thomas Adejoh ◽  
Chukwuemeka H. Elugwu ◽  
Mohammed Sidi ◽  
Emeka E. Ezugwu ◽  
Chijioke O. Asogwa ◽  
...  

Background: Errors in radiographic image annotation by radiographers could potentially lead to misdiagnoses by radiologists and wrong-side surgery by surgeons. Such medical negligence has dire medico-legal consequences. It was hypothesized that the newer technology of computed radiography (CR) and direct digital radiography (DDR) image annotation would potentially lead to a change in practice with a subsequent reduction in annotation errors. Following installation of computed radiography, a modality with electronic, post-processing image annotation, the hypothesis was investigated in our study centre. Results: A total of 72,602 and 126,482 images were documented for film-screen radiography (FSR) and computed radiography (CR), respectively, in the department. From these, a sample of 9452 images, made up of 4726 each for FSR and CR, was drawn. Anatomical side marker errors were common in every anatomy imaged, with more errors seen in FSR (4.6%) than CR (0.6%). Collectively, an error rate of 3.0% was observed. The errors resulted from marker burnout due to over-exposure as well as marker cone-off due to tight beam collimation. Conclusion: Error rates were considerably reduced following a change from film-screen radiography (FSR) to computed radiography (CR) at the study centre. This change was, however, influenced more by a team of quality control radiographers stationed at the CR workstation than by actual practice in the x-ray imaging suite. The presence of anthropomorphic phantoms in university teaching laboratories for demonstrations will significantly inculcate the skill needed to completely eliminate anatomical side marker (ASM) error in practice.


2020 ◽  
Vol 49 (D1) ◽  
pp. D475-D479
Author(s):  
Joicymara S Xavier ◽  
Thanh-Binh Nguyen ◽  
Malancha Karmarkar ◽  
Stephanie Portelli ◽  
Pâmela M Rezende ◽  
...  

Proteins are intricate, dynamic structures, and small changes in their amino acid sequences can lead to large effects on their folding, stability and dynamics. To facilitate the further development and evaluation of methods to predict these changes, we have developed ThermoMutDB, a manually curated database containing >14,669 experimental data of thermodynamic parameters for wild type and mutant proteins. This represents an increase of 83% in unique mutations over previous databases and includes thermodynamic information on 204 new proteins. During manual curation we have also corrected annotation errors in previously curated entries. Associated with each entry, we have included information on the unfolding Gibbs free energy and melting temperature change, and have associated entries with available experimental structural information. ThermoMutDB supports user contribution of new data points and programmatic access to the database via a RESTful API. ThermoMutDB is freely available at: http://biosig.unimelb.edu.au/thermomutdb.


2020 ◽  
Author(s):  
Tapan Kumar Mohanta ◽  
Abeer Hashem ◽  
Elsayed Fathi Abd_Allah ◽  
Ahmed AL Harrasi

Background: Genome sequencing data are accumulating at a rapid pace, with genome sequence data of more than 5780 species currently publicly available at the National Center for Biotechnology Information (NCBI) database alone. However, for research communities to use these data, an error-free functional annotation report is a must. Results: The whole-proteome sequence data of 689 fungal species (7.15 million protein sequences) were analyzed to find functional annotation errors in several species. Calcium-dependent protein kinases (CDPKs) and selenoproteins were targeted for the analysis because they are absent across the fungal kingdom. The analyses revealed the presence of proteins with the functional annotation name CDPK. InterProScan analysis revealed that none of the protein sequences tagged with the name “calcium dependent protein kinase” was found to encode calcium-binding EF-hands at the regulatory domain. Similarly, none of the protein sequences with an annotation name associated with “selenocysteine” was found to encode the Sec (U) amino acid. Conclusion: The presence of such functional annotation errors in the fungal kingdom raises great concern and needs to be addressed at the earliest possible time.
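The selenocysteine check described above can be illustrated with a short sketch. This is an assumed, simplified screen and not the authors' pipeline: it flags records whose description mentions selenocysteine or selenoprotein but whose sequence contains no `U` (the one-letter code for Sec). The record type, accessions, descriptions and sequences are all hypothetical. A name that mentions selenocysteine does not by itself imply that the protein contains Sec, so hits are candidates for manual review rather than confirmed errors.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class ProteinRecord:
    accession: str     # hypothetical identifier
    description: str   # free-text functional annotation
    sequence: str      # amino acid sequence, one-letter codes


def flag_sec_annotations_without_u(records: Iterable[ProteinRecord]) -> List[str]:
    """Return accessions annotated with a selenocysteine-related name
    whose sequence contains no 'U' (Sec) residue."""
    flagged = []
    for rec in records:
        desc = rec.description.lower()
        mentions_sec = "selenocysteine" in desc or "selenoprotein" in desc
        if mentions_sec and "U" not in rec.sequence.upper():
            flagged.append(rec.accession)
    return flagged


# Made-up records for illustration only; none is a real database entry.
records = [
    ProteinRecord("HYP_0001", "selenocysteine-containing protein", "MKTAYIAKQR"),
    ProteinRecord("HYP_0002", "hypothetical selenoprotein", "MALAVRVVYCGAUGYKSKY"),
    ProteinRecord("HYP_0003", "hexokinase", "MVHLGPKKPQ"),
]

flagged = flag_sec_annotations_without_u(records)
print(flagged)  # prints ['HYP_0001']
```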

