Improving the Measurement of Semantic Similarity between Gene Ontology Terms and Gene Products: Insights from an Edge- and IC-Based Hybrid Method

Existing methods for calculating semantic similarity between pairs of Gene Ontology (GO) terms and between gene products often rely on external databases, such as Gene Ontology Annotation (GOA), that annotate gene products with GO terms. This dependency limits their use in real applications. Here, we present a semantic similarity algorithm (SSA) that relies exclusively on the GO. When calculating the semantic similarity between a pair of input GO terms, SSA takes into account the shortest path between them, the depth of their nearest common ancestor, and a novel similarity score calculated between the definitions of the GO terms involved. We use SSA to calculate semantic similarities between pairs of proteins by combining the pairwise semantic similarities between the GO terms that annotate those proteins. The reliability of SSA was evaluated by comparing the resulting semantic similarities between proteins with functional similarities derived from expert annotations or sequence similarity. Comparisons showed that SSA is highly competitive with existing state-of-the-art methods. SSA provides a reliable measure of semantic similarity that is independent of external databases of functional-annotation observations.
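
The two structural ingredients named in the abstract above, the shortest path between two terms and the depth of their nearest common ancestor, can be illustrated on a toy hierarchy. The sketch below is not the SSA formula, which the abstract does not give: the term names and the `PARENTS` graph are hypothetical, the definition-based score is left out, and the final combination is a Wu-Palmer-style ratio used purely as a stand-in.

```python
from collections import deque

# Toy is_a hierarchy (child -> parents). Names are hypothetical, not real GO IDs.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "C": ["A"],
    "D": ["A"],
    "E": ["C", "B"],
}

def ancestors(term):
    """Return the term itself plus every term reachable through parent links."""
    seen, stack = {term}, [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def depth(term):
    """Length of the shortest parent chain from the term up to the root."""
    queue, seen = deque([(term, 0)]), {term}
    while queue:
        node, dist = queue.popleft()
        if not PARENTS[node]:  # only the root has no parents
            return dist
        for parent in PARENTS[node]:
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, dist + 1))

def shortest_path(a, b):
    """Number of edges on the shortest path between a and b, ignoring edge direction."""
    neighbours = {t: set(ps) for t, ps in PARENTS.items()}
    for child, ps in PARENTS.items():
        for parent in ps:
            neighbours[parent].add(child)
    queue, seen = deque([(a, 0)]), {a}
    while queue:
        node, dist = queue.popleft()
        if node == b:
            return dist
        for nxt in neighbours[node] - seen:
            seen.add(nxt)
            queue.append((nxt, dist + 1))
    raise ValueError("terms are not connected")

def nearest_common_ancestor(a, b):
    """Deepest term that is an ancestor of both a and b."""
    return max(ancestors(a) & ancestors(b), key=depth)

def toy_similarity(a, b):
    """Stand-in score (assumption): 2*depth(NCA) / (2*depth(NCA) + path length)."""
    if a == b:
        return 1.0
    d = depth(nearest_common_ancestor(a, b))
    return 2 * d / (2 * d + shortest_path(a, b))
```

For the sibling terms `C` and `D`, the nearest common ancestor is `A` at depth 1 and the shortest path has two edges, so the stand-in score is 0.5. Deeper shared ancestors and shorter paths both push the score towards 1.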

Correlating Information Contents of Gene Ontology Terms to Infer Semantic Similarity of Gene Products

Computational and Mathematical Methods in Medicine ◽ 10.1155/2014/891842 ◽ 2014 ◽ Vol 2014 ◽ pp. 1-9 ◽ Cited By ~ 4

Author(s): Mingxin Gan

Keyword(s): Gene Ontology ◽ Gene Product ◽ Correlation Coefficient ◽ Semantic Similarity ◽ Biological Process ◽ Jaccard Index ◽ Biological Knowledge ◽ Gene Products ◽ Functional Relationships ◽ Information Contents

Successful applications of the gene ontology to the inference of functional relationships between gene products in recent years have raised the need for computational methods that automatically calculate semantic similarity between gene products from the semantic similarity of gene ontology terms. Existing methods, although widely used in a variety of applications, may significantly overestimate the semantic similarity between genes that are not actually functionally related, thereby yielding misleading results. To overcome this limitation, we propose to represent a gene product as a vector composed of the information contents of the gene ontology terms annotated to it, and to calculate the similarity between two gene products as the relatedness of their corresponding vectors using three measures: Pearson’s correlation coefficient, cosine similarity, and the Jaccard index. We focus on the biological process domain of the gene ontology and on annotations of yeast proteins to study the effectiveness of the proposed measures. Results show that semantic similarity scores calculated using the proposed measures are more consistent with known biological knowledge than those derived using a list of existing methods, suggesting that our method is effective in characterizing functional relationships between gene products.
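
The vector representation described above can be sketched in a few lines. Everything concrete here is an assumption made for illustration: the term names and information-content values are invented, unannotated terms are given an entry of 0, and the Jaccard index is computed in its weighted (min/max) form, which may differ from the variant used in the paper.

```python
import math

# Hypothetical information contents for four made-up terms (not real GO data).
IC = {"t1": 1.0, "t2": 2.0, "t3": 3.0, "t4": 4.0}
TERMS = sorted(IC)

def ic_vector(annotated_terms):
    """One entry per term: its IC if the gene product is annotated with it, else 0."""
    return [IC[t] if t in annotated_terms else 0.0 for t in TERMS]

def cosine(u, v):
    """Cosine of the angle between u and v (0.0 if either vector is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def pearson(u, v):
    """Pearson correlation, computed as the cosine of the mean-centred vectors."""
    mean_u, mean_v = sum(u) / len(u), sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])

def weighted_jaccard(u, v):
    """Sum of element-wise minima divided by sum of element-wise maxima."""
    top = sum(min(a, b) for a, b in zip(u, v))
    bottom = sum(max(a, b) for a, b in zip(u, v))
    return top / bottom if bottom else 0.0

# Two hypothetical gene products sharing only term t2.
g1 = ic_vector({"t1", "t2"})
g2 = ic_vector({"t2", "t3"})
scores = {
    "pearson": pearson(g1, g2),
    "cosine": cosine(g1, g2),
    "jaccard": weighted_jaccard(g1, g2),
}
```

Because only `t2` is shared, the weighted Jaccard score is 2 / 6, and the shared term contributes in proportion to its information content rather than as a simple count.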

The Gene Ontology resource: enriching a GOld mine

Nucleic Acids Research ◽ 10.1093/nar/gkaa1113 ◽ 2020 ◽ Vol 49 (D1) ◽ pp. D325-D334

Author(s): Seth Carbon ◽ Eric Douglass ◽ Benjamin M Good ◽ Deepak R Unni ◽ ...

Keyword(s): Experimental Data ◽ Gene Ontology ◽ Gold Mine ◽ Gene Products ◽ Gene Ontology Consortium ◽ The Past ◽ Historical Archive ◽ File Structure

The Gene Ontology Consortium (GOC) provides the most comprehensive resource currently available for computable knowledge regarding the functions of genes and gene products. Here, we report the advances of the consortium over the past two years. The new GO-CAM annotation framework was notably improved, and we formalized the model with a computational schema to check and validate the rapidly increasing repository of 2838 GO-CAMs. In addition, we describe the impacts of several collaborations to refine GO and report a 10% increase in the number of GO annotations, a 25% increase in annotated gene products, and over 9,400 new scientific articles annotated. As the project matures, we continue our efforts to review older annotations in light of newer findings and to maintain consistency with other ontologies. As a result, 20,000 annotations derived from experimental data were reviewed, corresponding to 2.5% of experimental GO annotations. The website (http://geneontology.org) was redesigned for quick access to documentation, downloads and tools. To maintain an accurate resource and support traceability and reproducibility, we have made available a historical archive covering the past 15 years of GO data with a consistent format and file structure for both the ontology and annotations.

MeSH-Informed Enrichment Analysis and MeSH-Guided Semantic Similarity Among Functional Terms and Gene Products in Chicken

G3 Genes|Genome|Genetics ◽ 10.1534/g3.116.031096 ◽ 2016 ◽ Vol 6 (8) ◽ pp. 2447-2453 ◽ Cited By ~ 6

Author(s): Gota Morota ◽ Timothy M. Beissinger ◽ Francisco Peñagaricano

Keyword(s): Semantic Similarity ◽ Enrichment Analysis ◽ Gene Products

Word and Sentence Embedding Tools to Measure Semantic Similarity of Gene Ontology Terms by Their Definitions

Journal of Computational Biology ◽ 10.1089/cmb.2018.0093 ◽ 2019 ◽ Vol 26 (1) ◽ pp. 38-52 ◽ Cited By ~ 3

Author(s): Dat Duong ◽ Wasi Uddin Ahmad ◽ Eleazar Eskin ◽ Kai-Wei Chang ◽ Jingyi Jessica Li

Keyword(s): Gene Ontology ◽ Semantic Similarity

An integrated information-based similarity measurement of gene ontology terms

Computer Science and Information Systems ◽ 10.2298/csis141130053z ◽ 2015 ◽ Vol 12 (4) ◽ pp. 1235-1253 ◽ Cited By ~ 1

Author(s): Shu-Bo Zhang ◽ Jian-Huang Lai

Keyword(s): Gene Ontology ◽ Semantic Similarity ◽ Semantic Information ◽ Gene Expression Dataset ◽ Similarity Measurement ◽ Depth Information ◽ Go Terms ◽ Validation Experiments ◽ Integrated Information ◽ Common Ancestors

Measuring the semantic similarity between pairs of terms in the Gene Ontology (GO) can help to compare genes that cannot be compared by other computational methods. In this study, we propose an integrated information-based similarity measurement (IISM) that calculates the semantic similarity between two GO terms by taking into account the multiple common ancestors they share and aggregating the semantic information and depth information of the non-redundant common ancestors. Our method searches for non-redundant common ancestors efficiently. Validation experiments were conducted on both a gene expression dataset and a pathway dataset, and the experimental results suggest that our method outperforms some existing methods.
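
The idea of keeping only non-redundant common ancestors can be illustrated on a toy graph. The abstract does not say how IISM defines redundancy, so the sketch below adopts one hypothetical criterion: a common ancestor is dropped when it is itself an ancestor of another common ancestor. The term names and the `PARENTS` graph are invented, and the aggregation of semantic and depth information is not attempted.

```python
# Toy hierarchy (child -> parents) in which X and Y share two parents.
# Names are hypothetical, not real GO IDs.
PARENTS = {
    "root": [],
    "P": ["root"],
    "Q": ["root"],
    "X": ["P", "Q"],
    "Y": ["P", "Q"],
}

def strict_ancestors(term):
    """All terms reachable through parent links, excluding the term itself."""
    seen, stack = set(), [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def common_ancestors(a, b):
    """Terms that are ancestors of (or equal to) both a and b."""
    return (strict_ancestors(a) | {a}) & (strict_ancestors(b) | {b})

def non_redundant_common_ancestors(a, b):
    """Assumed criterion: keep a common ancestor only if no other common
    ancestor lies below it in the hierarchy."""
    common = common_ancestors(a, b)
    return {
        c for c in common
        if not any(c in strict_ancestors(other) for other in common if other != c)
    }
```

For `X` and `Y` the common ancestors are `P`, `Q` and `root`. Under the assumed criterion `root` is discarded because it sits above `P`, leaving the two ancestors `P` and `Q` that a single-ancestor method would not consider together.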

Annotation of Gene Products in the Literature with Gene Ontology Terms Using Syntactic Dependencies

Natural Language Processing – IJCNLP 2004 - Lecture Notes in Computer Science ◽ 10.1007/978-3-540-30211-7_84 ◽ 2005 ◽ pp. 787-796 ◽ Cited By ~ 1

Author(s): Jung-jae Kim ◽ Jong C. Park

Keyword(s): Gene Ontology ◽ Gene Products ◽ Syntactic Dependencies

Simwos: Improving Semantic Similarity Between Gene Ontology Terms Based On Pfam Clans And Pathway Analysis

International Journal of Pharmaceutical Research ◽ 10.31838/ijpr/2020.12.04.598 ◽ 2020 ◽ Vol 12 (04)

Keyword(s): Gene Ontology ◽ Semantic Similarity ◽ Pathway Analysis

Term Matrix: a novel Gene Ontology annotation quality control system based on ontology term co-annotation patterns

Open Biology ◽ 10.1098/rsob.200149 ◽ 2020 ◽ Vol 10 (9) ◽ pp. 200149 ◽ Cited By ~ 1

Author(s): Valerie Wood ◽ Seth Carbon ◽ Midori A. Harris ◽ Antonia Lock ◽ Stacia R. Engel ◽ ...

Keyword(s): Quality Control ◽ Gene Ontology ◽ Biological Processes ◽ Gene Products ◽ Model Species ◽ Ontology Term ◽ Protein Coding ◽ Ontology Structure ◽ Exclusive Processes ◽ Annotation Quality

Biological processes are accomplished by the coordinated action of gene products. Gene products often participate in multiple processes, and can therefore be annotated to multiple Gene Ontology (GO) terms. Nevertheless, processes that are functionally, temporally and/or spatially distant may have few gene products in common, and co-annotation to unrelated processes probably reflects errors in literature curation, ontology structure or automated annotation pipelines. We have developed an annotation quality control workflow that uses rules based on mutually exclusive processes to detect annotation errors, based on and validated by case studies including the three we present here: fission yeast protein-coding gene annotations over time; annotations for cohesin complex subunits in human and model species; and annotations using a selected set of GO biological process terms in human and five model species. For each case study, we reviewed available GO annotations, identified pairs of biological processes which are unlikely to be correctly co-annotated to the same gene products (e.g. amino acid metabolism and cytokinesis), and traced erroneous annotations to their sources. To date we have generated 107 quality control rules, and corrected 289 manual annotations in eukaryotes and over 52 700 automatically propagated annotations across all taxa.
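
A rule of the kind described above, a pair of processes that should not be co-annotated to the same gene product, can be checked with a very small amount of code. This is a minimal sketch rather than the Term Matrix workflow itself: the gene names and annotations are invented, the single rule reuses the example pair from the abstract, and terms are compared by exact label without propagating annotations through the ontology.

```python
from itertools import combinations

# One mutual-exclusion rule, using the example pair mentioned in the abstract.
RULES = {frozenset({"amino acid metabolism", "cytokinesis"})}

# Hypothetical annotations (gene product -> annotated process labels).
ANNOTATIONS = {
    "geneA": {"amino acid metabolism", "cytokinesis"},
    "geneB": {"cytokinesis"},
    "geneC": {"amino acid metabolism", "translation"},
}

def flag_violations(annotations, rules):
    """Return, for each gene product, the term pairs that break a rule."""
    flagged = {}
    for gene, terms in annotations.items():
        hits = [
            pair for pair in combinations(sorted(terms), 2)
            if frozenset(pair) in rules
        ]
        if hits:
            flagged[gene] = hits
    return flagged

violations = flag_violations(ANNOTATIONS, RULES)
```

Only `geneA` is flagged here. A flagged pair is a candidate for review rather than a confirmed error, since the abstract describes tracing such annotations back to their sources before correcting them.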

GSAn: an alternative to enrichment analysis for annotating gene sets

NAR Genomics and Bioinformatics ◽ 10.1093/nargab/lqaa017 ◽ 2020 ◽ Vol 2 (2) ◽ Cited By ~ 5

Author(s): Aaron Ayllon-Benitez ◽ Romain Bourqui ◽ Patricia Thébault ◽ Fleur Mougin

Keyword(s): Gene Ontology ◽ Semantic Similarity ◽ A Priori ◽ Similarity Measures ◽ Enrichment Analysis ◽ Biological Information ◽ Underlying Structure ◽ Gene Set ◽ Sequencing Technologies ◽ Gene Coverage

The revolution in sequencing technologies is leading to new understandings of the relations between genotype and phenotype. To interpret and analyze data that are grouped according to a phenotype of interest, methods based on statistical enrichment have become a standard in biology. However, these methods synthesize the biological information by a priori selecting the over-represented terms and may suffer from focusing on the most studied genes, which represent a limited coverage of the annotated genes within a gene set. Semantic similarity measures have shown great results in pairwise gene comparison by taking advantage of the underlying structure of the Gene Ontology. We developed GSAn, a novel gene set annotation method that uses semantic similarity measures to synthesize a priori Gene Ontology annotation terms. The originality of our approach is to identify the best compromise between the number of retained annotation terms, which has to be drastically reduced, and the number of related genes, which has to be as large as possible. Moreover, GSAn offers interactive visualization facilities dedicated to the multi-scale analysis of gene set annotations. Compared to enrichment analysis tools, GSAn has shown excellent results in terms of maximizing the gene coverage while minimizing the number of terms.
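
The trade-off described above, few retained terms against many covered genes, has the shape of a set cover problem. The sketch below uses a plain greedy set cover as a stand-in to make that trade-off concrete. It is not GSAn's algorithm, which according to the abstract relies on semantic similarity measures, and the term and gene names are hypothetical.

```python
def greedy_cover(term_to_genes, gene_set):
    """Pick terms one at a time, each time taking the term that covers the
    most still-uncovered genes; stop when no term adds coverage."""
    uncovered = set(gene_set)
    chosen = []
    while uncovered:
        # sorted() makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(term_to_genes),
                   key=lambda t: len(term_to_genes[t] & uncovered))
        gain = term_to_genes[best] & uncovered
        if not gain:
            break
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered

# Hypothetical annotations (term -> genes of the set annotated with it).
TERM_TO_GENES = {
    "termA": {"g1", "g2", "g3"},
    "termB": {"g3", "g4"},
    "termC": {"g4"},
    "termD": {"g1"},
}
GENE_SET = {"g1", "g2", "g3", "g4", "g5"}

chosen_terms, uncovered_genes = greedy_cover(TERM_TO_GENES, GENE_SET)
```

Two of the four terms are enough to cover every gene that can be covered, and `g5` stays uncovered because no term annotates it.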
