Correlating Information Contents of Gene Ontology Terms to Infer Semantic Similarity of Gene Products

2014 ◽  
Vol 2014 ◽  
pp. 1-9 ◽  
Author(s):  
Mingxin Gan

Successful applications of the gene ontology to the inference of functional relationships between gene products in recent years have raised the need for computational methods to automatically calculate semantic similarity between gene products based on semantic similarity of gene ontology terms. Nevertheless, existing methods, though widely used in a variety of applications, may significantly overestimate semantic similarity between genes that are not actually functionally related, thereby yielding misleading results. To overcome this limitation, we propose to represent a gene product as a vector composed of the information contents of the gene ontology terms annotated to the gene product, and we suggest calculating similarity between two gene products as the relatedness of their corresponding vectors using three measures: Pearson’s correlation coefficient, cosine similarity, and the Jaccard index. We focus on the biological process domain of the gene ontology and annotations of yeast proteins to study the effectiveness of the proposed measures. Results show that semantic similarity scores calculated using the proposed measures are more consistent with known biological knowledge than those derived using several existing methods, suggesting the effectiveness of our method in characterizing functional relationships between gene products.
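The vector representation described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the term identifiers (`T1` to `T4`), the information content values, and the function names are hypothetical, and the abstract does not say how the Jaccard index is applied to real-valued vectors, so the weighted (min/max) form below is an assumption.

```python
import math

def ic_vector(annotated_terms, ic, term_order):
    """Build a vector holding the IC of each annotated term, 0.0 otherwise."""
    return [ic[t] if t in annotated_terms else 0.0 for t in term_order]

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def pearson(u, v):
    """Pearson correlation, computed as the cosine of the mean-centred vectors."""
    mean_u = sum(u) / len(u)
    mean_v = sum(v) / len(v)
    return cosine([a - mean_u for a in u], [b - mean_v for b in v])

def weighted_jaccard(u, v):
    """Assumed generalisation of the Jaccard index: sum of minima over sum of maxima."""
    num = sum(min(a, b) for a, b in zip(u, v))
    den = sum(max(a, b) for a, b in zip(u, v))
    return num / den if den else 0.0

# Hypothetical terms and information contents, for illustration only.
term_order = ["T1", "T2", "T3", "T4"]
ic = {"T1": 1.0, "T2": 2.0, "T3": 3.0, "T4": 4.0}
gene_a = ic_vector({"T1", "T2"}, ic, term_order)  # [1.0, 2.0, 0.0, 0.0]
gene_b = ic_vector({"T2", "T3"}, ic, term_order)  # [0.0, 2.0, 3.0, 0.0]

print(pearson(gene_a, gene_b), cosine(gene_a, gene_b), weighted_jaccard(gene_a, gene_b))
```

In this toy case the two gene products share only one term, so all three scores fall well below 1; two gene products with identical annotations would score 1 on every measure.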

Author(s):  
JAMES M. KELLER ◽  
JAMES C. BEZDEK ◽  
MIHAIL POPESCU ◽  
NIKHIL R. PAL ◽  
JOYCE A. MITCHELL ◽  
...  

The standard method for comparing gene products (proteins or RNA) is to compare their DNA or amino acid sequences. Additional information about some gene products may come from multiple sources, including the set of Gene Ontology (GO) annotations and the set of journal abstracts related to each gene product. Gene product similarity measures can be based on evaluating sets of descriptor terms found in the GO taxonomy, and/or the index term sets of the related documents (MeSH annotations). While our techniques can be applied to term sets from any taxonomy, we restrict our examples in this article to GO annotations. We investigate the use of linear order statistics (LOS) to build similarity relations on pairs of terms that are used in the GO as linguistic descriptors of genes and gene products. One of our objectives is to investigate the construction and utility of visual assessments of relational data (in this case, dissimilarity matrices) for discovering tendencies of groups of gene products to "cluster together". We use gene product data derived from a group of 194 gene products representing three protein families extracted from ENSEMBL. Our examples suggest that LOS similarity measures are more effective than traditional sequence-based similarity measures at capturing relationships between pairs of gene products in ENSEMBL families when annotation information is available. We show examples of how these similarity measures can assist in knowledge discovery and gene product family validation.


2011 ◽  
Vol 09 (06) ◽  
pp. 681-695 ◽  
Author(s):  
MARCO A. ALVAREZ ◽  
CHANGHUI YAN

Existing methods for calculating semantic similarities between pairs of Gene Ontology (GO) terms and gene products often rely on external databases like Gene Ontology Annotation (GOA) that annotate gene products using the GO terms. This dependency leads to some limitations in real applications. Here, we present a semantic similarity algorithm (SSA) that relies exclusively on the GO. When calculating the semantic similarity between a pair of input GO terms, SSA takes into account the shortest path between them, the depth of their nearest common ancestor, and a novel similarity score calculated between the definitions of the involved GO terms. In our work, we use SSA to calculate semantic similarities between pairs of proteins by combining pairwise semantic similarities between the GO terms that annotate the involved proteins. The reliability of SSA was evaluated by comparing the resulting semantic similarities between proteins with the functional similarities between proteins derived from expert annotations or sequence similarity. Comparisons with existing state-of-the-art methods showed that SSA is highly competitive with the other methods. SSA provides a reliable measure for semantic similarity independent of external databases of functional-annotation observations.
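The abstract does not state how the pairwise term similarities are combined into a protein-level score. As one common possibility, the sketch below uses a best-match average; this choice is an assumption, not the SSA method itself, and `term_sim`, `toy_sim`, and the term identifiers are hypothetical stand-ins.

```python
def best_match_average(terms_a, terms_b, term_sim):
    """Combine pairwise term similarities into one score for two proteins.

    Each term is matched to its most similar term in the other set,
    and the best-match scores from both directions are averaged.
    """
    if not terms_a or not terms_b:
        return 0.0
    best_a = [max(term_sim(a, b) for b in terms_b) for a in terms_a]
    best_b = [max(term_sim(a, b) for a in terms_a) for b in terms_b]
    return (sum(best_a) + sum(best_b)) / (len(best_a) + len(best_b))

def toy_sim(a, b):
    """Toy term similarity for illustration: 1.0 for identical terms, else 0.0."""
    return 1.0 if a == b else 0.0

score = best_match_average(["T1", "T2"], ["T2", "T3"], toy_sim)
print(score)
```

Any term-level measure with the same two-argument shape could be passed in place of `toy_sim`.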


2012 ◽  
Vol 12 (1) ◽  
pp. 101-108 ◽  
Author(s):  
Diane O. Inglis ◽  
Marek S. Skrzypek ◽  
Martha B. Arnaud ◽  
Jonathan Binkley ◽  
Prachi Shah ◽  
...  

The opportunistic fungal pathogen Candida albicans is a significant medical threat, especially for immunocompromised patients. Experimental research has focused on specific areas of C. albicans biology, with the goal of understanding the multiple factors that contribute to its pathogenic potential. Some of these factors include cell adhesion, invasive or filamentous growth, and the formation of drug-resistant biofilms. The Gene Ontology (GO) (www.geneontology.org) is a standardized vocabulary that the Candida Genome Database (CGD) (www.candidagenome.org) and other groups use to describe the functions of gene products. To improve the breadth and accuracy of pathogenicity-related gene product descriptions and to facilitate the description of as yet uncharacterized but potentially pathogenicity-related genes in Candida species, CGD undertook a three-part project: first, the addition of terms to the biological process branch of the GO to improve the description of fungus-related processes; second, manual recuration of gene product annotations in CGD to use the improved GO vocabulary; and third, computational ortholog-based transfer of GO annotations from experimentally characterized gene products, using these new terms, to uncharacterized orthologs in other Candida species. Through genome annotation and analysis, we identified candidate pathogenicity genes in seven non-C. albicans Candida species and in one additional C. albicans strain, WO-1. We also defined a set of C. albicans genes at the intersection of biofilm formation, filamentous growth, pathogenesis, and phenotypic switching of this opportunistic fungal pathogen, which provides a compelling list of candidates for further experimentation.


2013 ◽  
Vol 2013 ◽  
pp. 1-10 ◽  
Author(s):  
Richard W. Francis

The Gene Ontology (GO) provides a resource for consistent annotation of genes and gene products that is extensively used by numerous large public repositories. The GO is constructed of three subontologies describing the cellular component of action, molecular function, and overall biological process of a gene or gene product. Querying across the subontologies is problematic and no standard method exists to, for example, find all molecular functions occurring in a particular cellular component. GOLink addresses this problem by finding terms from all subontologies cooccurring with a term of interest in annotation across the entire GO database. Genes annotated with this term are exported and their GO annotation is assigned to three separate GOLink terms lists based on specific criteria. The software was used to predict the most likely Biological Process for a group of genes using just their Molecular Function terms giving sensitivity, specificity, and accuracy between 80 and 90% across all the terms lists. GOLink is made freely available for noncommercial use and can be downloaded from the project website.


Genes ◽  
2018 ◽  
Vol 9 (12) ◽  
pp. 593 ◽  
Author(s):  
Barbara Kramarz ◽  
Paola Roncaglia ◽  
Birgit H. M. Meldal ◽  
Rachael P. Huntley ◽  
Maria J. Martin  ◽  
...  

The analysis and interpretation of high-throughput datasets rely on access to high-quality bioinformatics resources, as well as processing pipelines and analysis tools. Gene Ontology (GO, geneontology.org) is a major resource for gene enrichment analysis. The aim of this project, funded by the Alzheimer’s Research United Kingdom (ARUK) foundation and led by the University College London (UCL) biocuration team, was to enhance the GO resource by developing new neurological GO terms, and use GO terms to annotate gene products associated with dementia. Specifically, proteins and protein complexes relevant to processes involving amyloid-beta and tau have been annotated and the resulting annotations are denoted in GO databases as ‘ARUK-UCL’. Biological knowledge presented in the scientific literature was captured through the association of GO terms with dementia-relevant protein records; GO itself was revised, and new GO terms were added. This literature biocuration increased the number of Alzheimer’s-relevant gene products that were being associated with neurological GO terms, such as ‘amyloid-beta clearance’ or ‘learning or memory’, as well as neuronal structures and their compartments. Of the total 2055 annotations that we contributed for the prioritised gene products, 526 have associated proteins and complexes with neurological GO terms. To ensure that these descriptive annotations could be provided for Alzheimer’s-relevant gene products, over 70 new GO terms were created. Here, we describe how the improvements in ontology development and biocuration resulting from this initiative can benefit the scientific community and enhance the interpretation of dementia data.


2015 ◽  
Author(s):  
Gota Morota ◽  
Timothy M Beissinger ◽  
Francisco Peñagaricano

Biomedical vocabularies and ontologies aid in recapitulating biological knowledge. The annotation of gene products is mainly accelerated by Gene Ontology (GO) and more recently by Medical Subject Headings (MeSH). Here we report a suite of MeSH packages for chicken in Bioconductor and illustrate some features of different MeSH-based analyses, including MeSH-informed enrichment analysis and MeSH-guided semantic similarity among terms and gene products, using two lists of chicken genes available in public repositories. The two published datasets that were employed represent (i) differentially expressed genes and (ii) candidate genes under selective sweep or epistatic selection. The comparison of MeSH with GO overrepresentation analyses suggested not only that MeSH supports the findings obtained from GO analysis but also that MeSH is able to further enrich the representation of biological knowledge and often provide more interpretable results. Based on the hierarchical structures of MeSH and GO, we computed semantic similarities among vocabularies as well as semantic similarities among selected genes. These yielded the similarity levels between significant functional terms, and the annotation of each gene yielded the measures of gene similarity. Our findings show the benefits of using MeSH as an alternative choice of annotation in order to draw biological inferences from a list of genes of interest. We argue that the use of MeSH in conjunction with GO will be instrumental in facilitating the understanding of the genetic basis of complex traits.

