Metric Labeling and Semimetric Embedding for Protein Annotation Prediction

Author(s):  
Emre Sefer ◽  
Carl Kingsford
Keyword(s):  
2020 ◽  
Vol 11 (1) ◽  
pp. 24
Author(s):  
Jin Tao ◽  
Kelly Brayton ◽  
Shira Broschat

Advances in genome sequencing technology and computing power have brought about the explosive growth of sequenced genomes in public repositories with a concomitant increase in annotation errors. Many protein sequences are annotated using computational analysis rather than experimental verification, leading to inaccuracies in annotation. Confirmation of existing protein annotations is urgently needed before misannotation becomes even more prevalent due to error propagation. In this work we present a novel approach for automatically confirming the existence of manually curated information with experimental evidence of protein annotation. Our ensemble learning method uses a combination of recurrent convolutional neural network, logistic regression, and support vector machine models. Natural language processing in the form of word embeddings is used with journal publication titles retrieved from the UniProtKB database. Importantly, we use recall as our most significant metric to ensure the maximum number of verifications possible; results are reported to a human curator for confirmation. Our ensemble model achieves 91.25% recall, 71.26% accuracy, 65.19% precision, and an F1 score of 76.05% and outperforms the Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT) model with fine-tuning using the same data.
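The reported metrics can be checked against one another, since F1 is the harmonic mean of precision and recall. The following is a minimal sketch, not code from the authors; the helper name `f1_score` is introduced here for illustration, and the only inputs are the precision and recall figures quoted in the abstract.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0  # convention: F1 is 0 when both inputs are 0
    return 2 * precision * recall / (precision + recall)


# Figures quoted in the abstract, converted from percentages to fractions.
precision = 0.6519
recall = 0.9125

f1 = f1_score(precision, recall)
print(f"F1 = {f1:.2%}")  # prints "F1 = 76.05%"
```

The result agrees with the reported F1 score of 76.05%, which also shows how a deliberately recall-heavy model is pulled down by its lower precision.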


2011 ◽  
pp. 106-122
Author(s):  
Amandeep S. Sidhu ◽  
Tharam S. Dillon ◽  
Elizabeth Chang

Traditional approaches to integrating protein data have generally relied on keyword searches, which immediately exclude unannotated or poorly annotated data. An alternative protein annotation approach is to rely on sequence identity, structural similarity, or functional identification. Some proteins have a high degree of sequence identity, structural similarity, or similarity in functions that are unique to members of that family alone. Consequently, this approach cannot be generalized to integrate protein data. Clearly, these traditional approaches have limitations in capturing and integrating data for protein annotation. For these reasons, we have adopted an alternative method that does not rely on keywords or similarity metrics but instead uses an ontology. In this chapter we discuss the conceptual framework of Protein Ontology, which has a hierarchical classification of concepts represented as classes, from general to specific; a list of attributes related to each concept, for each class; a set of relations between classes that link concepts in the ontology in more complicated ways than implied by the hierarchy, to promote reuse of concepts in the ontology; and a set of algebraic operators for querying protein ontology instances.
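The ingredients listed in this abstract (a class hierarchy, per-class attributes, and relations that cut across the hierarchy) can be pictured with a small data structure. This is only an illustrative sketch under stated assumptions: the `Concept` class, its fields, and the example concept names are hypothetical and are not taken from Protein Ontology itself, and the single `ancestors` query stands in for the much richer algebraic operators the chapter describes.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Concept:
    """Hypothetical ontology concept: a named class with an optional parent."""
    name: str
    parent: Optional["Concept"] = None                 # is-a link, general to specific
    attributes: List[str] = field(default_factory=list)
    relations: Dict[str, "Concept"] = field(default_factory=dict)  # non-hierarchical links

    def ancestors(self) -> List[str]:
        """Names of all more general concepts, nearest first."""
        names = []
        node = self.parent
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names


# Illustrative concepts only; these names are assumptions, not ontology content.
protein = Concept("Protein", attributes=["sequence"])
enzyme = Concept("Enzyme", parent=protein, attributes=["ec_number"])
kinase = Concept("Kinase", parent=enzyme)
structure = Concept("Structure", attributes=["resolution"])

# A relation that links concepts outside the is-a hierarchy.
protein.relations["has_structure"] = structure

print(kinase.ancestors())  # prints ['Enzyme', 'Protein']
```

Keeping the hierarchy (`parent`) separate from the named `relations` mirrors the abstract's distinction between the classification itself and the links that go beyond it.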


2004 ◽  
Vol 20 (Suppl 1) ◽  
pp. i342-i347 ◽  
Author(s):  
D. Wieser ◽  
E. Kretschmann ◽  
R. Apweiler

2002 ◽  
Vol 30 (17) ◽  
pp. 3901-3916 ◽  
Author(s):  
I. Rigoutsos

2005 ◽  
Vol 21 (16) ◽  
pp. 3450-3451 ◽  
Author(s):  
G. Dieterich ◽  
U. Karst ◽  
J. Wehland ◽  
L. Jansch

2005 ◽  
Vol 6 (Suppl 1) ◽  
pp. S20 ◽  
Author(s):  
Karin Verspoor ◽  
Judith Cohn ◽  
Cliff Joslyn ◽  
Sue Mniszewski ◽  
Andreas Rechtsteiner ◽  
...  
