Benchmarking gene ontology function predictions using negative annotations

2020 ◽  
Vol 36 (Supplement_1) ◽  
pp. i210-i218 ◽  
Author(s):  
Alex Warwick Vesztrocy ◽  
Christophe Dessimoz

Abstract Motivation With the ever-increasing number and diversity of sequenced species, the challenge to characterize genes with functional information is even more important. In most species, this characterization almost entirely relies on automated electronic methods. As such, it is critical to benchmark the various methods. The Critical Assessment of protein Function Annotation algorithms (CAFA) series of community experiments provide the most comprehensive benchmark, with a time-delayed analysis leveraging newly curated experimentally supported annotations. However, the definition of a false positive in CAFA has not fully accounted for the open world assumption (OWA), leading to a systematic underestimation of precision. The main reason for this limitation is the relative paucity of negative experimental annotations. Results This article introduces a new, OWA-compliant, benchmark based on a balanced test set of positive and negative annotations. The negative annotations are derived from expert-curated annotations of protein families on phylogenetic trees. This approach results in a large increase in the average information content of negative annotations. The benchmark has been tested using the naïve and BLAST baseline methods, as well as two orthology-based methods. This new benchmark could complement existing ones in future CAFA experiments. Availability and Implementation All data, as well as code used for analysis, is available from https://lab.dessimoz.org/20_not. Supplementary information Supplementary data are available at Bioinformatics online.
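The precision issue described in this abstract can be illustrated with a minimal sketch. This is not the authors' benchmark code; the function names and the toy term identifiers are hypothetical. It contrasts a closed-world precision, where every prediction without a positive annotation counts as a false positive, with an OWA-compliant precision, where only predictions contradicted by an explicit negative annotation count as false positives.

```python
def closed_world_precision(predicted, positives):
    """Precision when any prediction lacking a positive annotation is a false positive."""
    if not predicted:
        return 0.0
    true_pos = len(predicted & positives)
    return true_pos / len(predicted)


def owa_precision(predicted, positives, negatives):
    """Precision over predictions whose truth value is actually known.

    Predictions that are neither positively nor negatively annotated are
    treated as unknown and left out of the denominator.
    """
    true_pos = len(predicted & positives)
    false_pos = len(predicted & negatives)
    if true_pos + false_pos == 0:
        return 0.0
    return true_pos / (true_pos + false_pos)


# Toy data (hypothetical term identifiers, not real GO terms).
predicted = {"GO:A", "GO:B", "GO:C", "GO:D"}
positives = {"GO:A", "GO:B"}   # experimentally supported annotations
negatives = {"GO:C"}           # explicit negative annotation

cw = closed_world_precision(predicted, positives)
ow = owa_precision(predicted, positives, negatives)
print(f"closed-world precision: {cw:.3f}")
print(f"OWA-compliant precision: {ow:.3f}")
```

In this toy case the unannotated prediction `GO:D` lowers the closed-world precision but does not affect the OWA-compliant value, which is the kind of underestimation the abstract refers to.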

2020 ◽  
Author(s):  
Yue Cao ◽  
Yang Shen

Abstract Motivation Facing the increasing gap between high-throughput sequence data and limited functional insights, computational protein function annotation provides a high-throughput alternative to experimental approaches. However, current methods can have limited applicability while relying on data besides sequences, or lack generalizability to novel sequences, species and functions. Results To overcome aforementioned barriers in applicability and generalizability, we propose a novel deep learning model, named Transformer-based protein function Annotation through joint sequence–Label Embedding (TALE). For generalizability to novel sequences we use self attention-based transformers to capture global patterns in sequences. For generalizability to unseen or rarely seen functions, we also embed protein function labels (hierarchical GO terms on directed graphs) together with inputs/features (sequences) in a joint latent space. Combining TALE and a sequence similarity-based method, TALE+ outperformed competing methods when only sequence input is available. It even outperformed a state-of-the-art method using network information besides sequence, in two of the three gene ontologies. Furthermore, TALE and TALE+ showed superior generalizability to proteins of low homology and never/rarely annotated novel species or functions compared to training data, revealing deep insights into the protein sequence–function relationship. Ablation studies elucidated contributions of algorithmic components toward the accuracy and the generalizability. Availability The data, source codes and models are available at https://github.com/Shen-Lab/[email protected] Supplementary information Supplementary data are available at Bioinformatics online.


Author(s):  
Yue Cao ◽  
Yang Shen

Abstract Motivation Facing the increasing gap between high-throughput sequence data and limited functional insights, computational protein function annotation provides a high-throughput alternative to experimental approaches. However, current methods can have limited applicability while relying on protein data besides sequences, or lack generalizability to novel sequences, species and functions. Results To overcome aforementioned barriers in applicability and generalizability, we propose a novel deep learning model using only sequence information for proteins, named Transformer-based protein function Annotation through joint sequence–Label Embedding (TALE). For generalizability to novel sequences we use self attention-based transformers to capture global patterns in sequences. For generalizability to unseen or rarely seen functions (tail labels), we embed protein function labels (hierarchical GO terms on directed graphs) together with inputs/features (1D sequences) in a joint latent space. Combining TALE and a sequence similarity-based method, TALE+ outperformed competing methods when only sequence input is available. It even outperformed a state-of-the-art method using network information besides sequence, in two of the three gene ontologies. Furthermore, TALE and TALE+ showed superior generalizability to proteins of low similarity, new species, or rarely annotated functions compared to training data, revealing deep insights into the protein sequence–function relationship. Ablation studies elucidated contributions of algorithmic components toward the accuracy and the generalizability. Availability The data, source codes and models are available at https://github.com/Shen-Lab/TALE Supplementary information Supplementary data are available at Bioinformatics online.


2021 ◽  
Vol 118 (12) ◽  
pp. e2021244118
Author(s):  
Alessio Caminata ◽  
Noah Giansiracusa ◽  
Han-Bom Moon ◽  
Luca Schaffler

In 2004, Pachter and Speyer introduced the higher dissimilarity maps for phylogenetic trees and asked two important questions about their relation to the tropical Grassmannian. Multiple authors, using independent methods, answered the first of these questions affirmatively, showing that dissimilarity vectors lie on the tropical Grassmannian, but the second question, whether the set of dissimilarity vectors forms a tropical subvariety, remained open. We resolve this question by showing that the tropical balancing condition fails. However, by replacing the definition of the dissimilarity map with a weighted variant, we show that weighted dissimilarity vectors form a tropical subvariety of the tropical Grassmannian in exactly the way that Pachter and Speyer envisioned. Moreover, we provide a geometric interpretation in terms of configurations of points on rational normal curves and construct a finite tropical basis that yields an explicit characterization of weighted dissimilarity vectors.
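For orientation, the unweighted higher dissimilarity map mentioned in this abstract can be written as follows. The notation is an assumption chosen for this note, and the weighted variant introduced by the authors is not reproduced here. For a tree $T$ with leaves labelled $1,\dots,n$ and edge weights $w(e)$, the $m$-dissimilarity map sends each $m$-subset of leaves to the total weight of the smallest subtree containing them:

```latex
D_m(T)\colon \binom{[n]}{m} \to \mathbb{R}, \qquad
\{i_1,\dots,i_m\} \mapsto \sum_{e \in E\left(T_{i_1 \dots i_m}\right)} w(e),
```

where $T_{i_1 \dots i_m}$ denotes the smallest subtree of $T$ containing the leaves $i_1,\dots,i_m$. For $m = 2$ this reduces to the usual tree metric, the length of the path between two leaves.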


Entropy ◽  
2020 ◽  
Vol 22 (9) ◽  
pp. 993 ◽  
Author(s):  
Bin Yang ◽  
Dingyi Gan ◽  
Yongchuan Tang ◽  
Yan Lei

Quantifying uncertainty is a hot topic for uncertain information processing in the framework of evidence theory, but there is limited research on belief entropy in the open world assumption. In this paper, an uncertainty measurement method based on Deng entropy, named Open Deng entropy (ODE), is proposed. In the open world assumption, the frame of discernment (FOD) may be incomplete, and ODE can reasonably and effectively quantify uncertain incomplete information. On the basis of Deng entropy, the ODE adopts the mass value of the empty set, the cardinality of the FOD, and the natural constant e to construct a new uncertainty factor for modeling the uncertainty in the FOD. A numerical example shows that, in the closed world assumption, ODE degenerates to Deng entropy. An ODE-based information fusion method for sensor data fusion in uncertain environments is also proposed. By applying it to a sensor data fusion experiment, the rationality and effectiveness of ODE and its application in uncertain information fusion are verified.
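As background for this abstract, here is a minimal sketch of the closed-world Deng entropy on which ODE builds, $E_d(m) = -\sum_{A} m(A)\,\log_2 \frac{m(A)}{2^{|A|}-1}$. It does not implement ODE itself: the abstract does not give the additional uncertainty factor, so it is left out rather than guessed. The function name and the dictionary representation of the mass function are assumptions made for illustration.

```python
import math


def deng_entropy(bpa):
    """Closed-world Deng entropy of a basic probability assignment.

    `bpa` maps focal elements (frozensets of hypotheses) to mass values.
    The empty set is skipped because the term 2**0 - 1 is zero; handling
    mass on the empty set is what the open-world extension addresses.
    """
    total = 0.0
    for focal, mass in bpa.items():
        if mass > 0 and len(focal) > 0:
            total -= mass * math.log2(mass / (2 ** len(focal) - 1))
    return total


# Mass only on singletons: Deng entropy coincides with Shannon entropy.
singletons = {frozenset({"a"}): 0.5, frozenset({"b"}): 0.5}
# All mass on a two-element set: the non-specificity raises the entropy.
vague = {frozenset({"a", "b"}): 1.0}

print(deng_entropy(singletons))
print(deng_entropy(vague))
```

The second case evaluates to $\log_2 3$, since a two-element focal set has $2^2 - 1 = 3$ non-empty subsets.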


2020 ◽  
Vol 36 (10) ◽  
pp. 3263-3265 ◽  
Author(s):  
Lucas Czech ◽  
Pierre Barbera ◽  
Alexandros Stamatakis

Abstract Summary We present genesis, a library for working with phylogenetic data, and gappa, an accompanying command-line tool for conducting typical analyses on such data. The tools target phylogenetic trees and phylogenetic placements, sequences, taxonomies and other relevant data types, offer high-level simplicity as well as low-level customizability, and are computationally efficient, well-tested and field-proven. Availability and implementation Both genesis and gappa are written in modern C++11, and are freely available under GPLv3 at http://github.com/lczech/genesis and http://github.com/lczech/gappa. Supplementary information Supplementary data are available at Bioinformatics online.


2009 ◽  
pp. 257-281
Author(s):  
Cristiano Fugazza ◽  
Stefano David ◽  
Anna Montesanto ◽  
Cesare Rocchi

There are different approaches to modeling a computational system, each providing different semantics. We present a comparison among different approaches to semantics and we aim at identifying which peculiarities are needed to provide a system with uniquely interpretable semantics. We discuss different approaches, namely, Description Logics, Artificial Neural Networks, and relational database management systems. We identify classification (the process of building a taxonomy) as a common trait. However, in this chapter we also argue that classification is not enough to provide a system with a Semantics, which emerges only when relations among classes are established and used among instances. Our contribution also analyses additional features of the formalisms that distinguish the approaches: closed versus open world assumption, dynamic versus static nature of knowledge, the management of knowledge, and the learning process.


2013 ◽  
Vol 11 (Suppl 1) ◽  
pp. S1 ◽  
Author(s):  
Alfredo Benso ◽  
Stefano Di Carlo ◽  
Hafeez ur Rehman ◽  
Gianfranco Politano ◽  
Alessandro Savino ◽  
...  
