Multi-faceted Semantic Clustering With Text-derived Phenotypes

Author(s):  
Luke T Slater ◽  
John A Williams ◽  
Andreas Karwath ◽  
Hilary Fanning ◽  
Simon Ball ◽  
...  

Identification of ontology concepts in clinical narrative text enables the creation of phenotype profiles that can be associated with clinical entities, such as patients or drugs. Constructing patient phenotype profiles using formal ontologies enables their analysis via semantic similarity, in turn enabling the use of background knowledge in clustering or classification analyses. However, traditional semantic similarity approaches collapse complex relationships between patient phenotypes into a single similarity score for each pair of patients. Moreover, single scores may be based only on matching terms with the greatest information content (IC), ignoring other dimensions of patient similarity. This process necessarily leads to a loss of information in the resulting representation of patient similarity, and the loss is especially apparent when using very large text-derived and highly multi-morbid phenotype profiles. It also makes finding a biological explanation for similarity very difficult: the black box problem. In this article, we explore the generation of multiple semantic similarity scores for patients based on different facets of their phenotypic manifestation, which we define through different sub-graphs in the Human Phenotype Ontology. We further present a new methodology for deriving sets of qualitative class descriptions for groups of entities described by ontology terms. Leveraging this strategy to obtain meaningful explanations for our semantic clusters alongside other evaluation techniques, we show that semantic clustering with ontology-derived facets enables the representation, and thus the identification, of clinically relevant phenotype relationships not easily recoverable using overall clustering alone. In this way, we demonstrate the potential of faceted semantic clustering for gaining a deeper and more nuanced understanding of text-derived patient phenotypes.
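
The idea of scoring patient pairs once per facet, rather than once overall, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy hierarchy and its term names are invented (they are not real Human Phenotype Ontology identifiers), each facet is taken to be the sub-graph below one chosen root term, and Jaccard overlap of ancestor-closed term sets stands in for whichever IC-based measure the authors actually used.

```python
from typing import Dict, Iterable, Set

# Toy is-a hierarchy (child -> parents). Term names are invented for
# illustration and are NOT real HPO identifiers.
PARENTS: Dict[str, Set[str]] = {
    "root": set(),
    "cardio": {"root"},
    "neuro": {"root"},
    "arrhythmia": {"cardio"},
    "afib": {"arrhythmia"},
    "heart_failure": {"cardio"},
    "seizure": {"neuro"},
    "headache": {"neuro"},
}


def ancestors(term: str) -> Set[str]:
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def facet_profile(profile: Iterable[str], facet_root: str) -> Set[str]:
    """Ancestor-closed profile restricted to terms at or below facet_root."""
    closed: Set[str] = set()
    for term in profile:
        for anc in ancestors(term):
            if facet_root in ancestors(anc):
                closed.add(anc)
    return closed


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard overlap; 0.0 when neither profile has terms in the facet."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def faceted_similarity(p1: Iterable[str], p2: Iterable[str],
                       facet_roots: Iterable[str]) -> Dict[str, float]:
    """One similarity score per facet instead of a single overall score."""
    p1, p2 = list(p1), list(p2)
    return {root: jaccard(facet_profile(p1, root), facet_profile(p2, root))
            for root in facet_roots}


patient_a = {"afib", "seizure"}
patient_b = {"heart_failure", "seizure"}
scores = faceted_similarity(patient_a, patient_b, ["cardio", "neuro"])
```

In this toy case the two patients are identical in the neurological facet but only weakly similar in the cardiovascular one, a distinction that a single overall score would blur.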

2021 ◽  
Author(s):  
◽  
Chia-wen Fang

Ontologies are formal specifications of shared conceptualizations of a domain. Important applications of ontologies include distributed knowledge-based systems, such as the semantic web, and the evaluation of modelling languages, e.g. for business process or conceptual modelling. These applications require formal ontologies of good quality. In this thesis, we present a multi-method ontology evaluation methodology, which consists of two techniques (sentence verification task and recall) based on principles of cognitive psychology, to test how well a specification of a formal ontology corresponds to the ontology users' conceptualization of a domain. Two experiments were conducted, each evaluating the SUMO ontology and WordNet with an experimental technique, as demonstrations of the multi-method evaluation methodology. We also tested the applicability of the two evaluation techniques by conducting a replication study for each. The replication studies obtained findings that point in the same direction as the original studies, although they did not reach statistical significance. Overall, the evaluation using the multi-method methodology suggests that neither of the two ontologies we examined is a good specification of the conceptualization of the domain. Both the terminology and the structure of the ontologies may benefit from improvement.


2018 ◽  
Vol 19 (11) ◽  
pp. 3410 ◽  
Author(s):  
Xiujuan Lei ◽  
Zengqiang Fang ◽  
Luonan Chen ◽  
Fang-Xiang Wu

CircRNAs have a particular biological structure and have been shown to play important roles in diseases. Identifying circRNA-disease associations by biological experiments is time-consuming and costly, so computational methods for predicting these associations are appealing. In this study, we propose a new computational path weighted method, PWCDA, for predicting circRNA-disease associations. First, we calculate the functional similarity scores of diseases based on disease-related gene annotations and the semantic similarity scores of circRNAs based on circRNA-related gene ontology. To address missing similarity scores of diseases and circRNAs, we calculate the Gaussian Interaction Profile (GIP) kernel similarity scores for diseases and circRNAs, based on the circRNA-disease associations downloaded from the circR2Disease database (http://bioinfo.snnu.edu.cn/CircR2Disease/). Then, we integrate disease functional similarity scores and circRNA semantic similarity scores with their related GIP kernel similarity scores to construct a heterogeneous network made up of three sub-networks: a disease similarity network, a circRNA similarity network and a circRNA-disease association network. Finally, we compute an association score for each circRNA-disease pair based on the paths connecting them in the heterogeneous network to determine whether the pair is associated. We adopt leave-one-out cross-validation (LOOCV) and five-fold cross-validation to evaluate the performance of our proposed method. In addition, three common diseases, Breast Cancer, Gastric Cancer and Colorectal Cancer, are used for case studies. Experimental results illustrate the reliability and usefulness of our computational method in terms of different validation measures, which indicates that PWCDA can effectively predict potential circRNA-disease associations.
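
The abstract above does not give the GIP kernel formula, so the following is only a sketch of the formulation commonly used in the association-prediction literature: the similarity of two entities is exp(-gamma * squared distance between their interaction profiles), with gamma set to a constant divided by the mean squared norm of the profiles. The function name `gip_kernel`, the parameter `gamma_prime` and its default of 1.0 are assumptions for illustration, not details taken from the paper.

```python
import math
from typing import List


def gip_kernel(assoc: List[List[int]], gamma_prime: float = 1.0) -> List[List[float]]:
    """Gaussian Interaction Profile similarity between the rows of a
    binary association matrix (one row per entity, e.g. per disease).

    gamma_prime is an assumed bandwidth constant; the bandwidth is
    normalised by the mean squared norm of the interaction profiles.
    """
    n = len(assoc)
    mean_sq_norm = sum(sum(v * v for v in row) for row in assoc) / n
    if mean_sq_norm == 0:
        raise ValueError("association matrix contains no known associations")
    gamma = gamma_prime / mean_sq_norm

    sim = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            sq_dist = sum((a - b) ** 2 for a, b in zip(assoc[i], assoc[j]))
            sim[i][j] = math.exp(-gamma * sq_dist)
    return sim


# Rows: three hypothetical diseases; columns: three hypothetical circRNAs.
associations = [
    [1, 0, 1],
    [1, 0, 1],
    [0, 1, 0],
]
disease_similarity = gip_kernel(associations)
```

Calling the same function on the transposed matrix would give the corresponding similarity between circRNAs. Entities with identical interaction profiles receive a similarity of 1, and the score decays as the profiles diverge.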


2014 ◽  
Vol 15 (1) ◽  
pp. 248 ◽  
Author(s):  
Aaron J Masino ◽  
Elizabeth T Dechene ◽  
Matthew C Dulik ◽  
Alisha Wilkens ◽  
Nancy B Spinner ◽  
...  

Author(s):  
Marianna Milano ◽  
Pietro Guzzi ◽  
Mario Cannataro

Omics sciences are widely used to analyze diseases at a molecular level. The results of omics experiments are usually sets of candidate genes potentially involved in different diseases. Interpreting these results and filtering the candidate genes or proteins selected in an experiment can be challenging. The problem is particularly evident in clinical environments, in which researchers are interested in the behavior of a few molecules related to a specific disease, while results may contain thousands of data points. The filtering requires domain-specific knowledge that is usually encoded into ontologies. Consequently, to filter out false positive genes, different approaches for selecting genes have been introduced. Such approaches are often referred to as gene prioritization methods. They use computational methods to identify the genes most related to a disease among a larger set of candidate genes. We implemented GoD (Gene ranking based On Diseases), an algorithm that ranks a given set of genes based on ontology annotations. The algorithm orders genes by the semantic similarity between the annotations of each gene and those describing the selected disease. The current version of GoD enables the prioritization of a list of input genes for a selected disease. It uses the HPO (Human Phenotype Ontology), GO (Gene Ontology), and DO (Disease Ontology) ontologies to calculate the ranking. It takes as input a list of genes or gene products annotated with GO terms, HPO terms or DO terms, and a selected disease described by GO, HPO or DO annotations (users may also provide novel annotations). It produces as output the ranking of those genes with respect to the input disease. The package consists of three main functions: hpoGoD (for HPO-based prioritization), goGoD (for GO-based prioritization), and doGoD (for DO-based prioritization).
We tested GoD on Gene Regulatory Networks (GRNs). Biological network inference aims to reconstruct the network of interactions (or associations) among genes starting from experimental observations. We selected three expression datasets: Dataset 1 (GDS3285), related to breast cancer; Dataset 2 (GDS5072), related to prostate cancer; and Dataset 3 (GDS5093), related to Dengue virus (DENV) infection. Initially, experimental data are given as input to five GRN inference algorithms, i.e. ARACNE, CLR, MRNET, GENIE3 and GGM, to produce five inferred GRNs. For each inferred GRN, GoD receives as input the list of top genes and produces for each gene a semantic similarity value for a selected disease, considering one of the ontologies above (e.g. Disease Ontology). For each GRN, the genes are ranked and reordered on the basis of the computed semantic similarity; the rankings are then compared, allowing each GRN inference method to be ranked with respect to the initially selected disease.
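
The ranking step described above can be sketched as follows. This is not the GoD package or its API: the function names, the best-match-average aggregation, and the exact-match term similarity used in the example are all assumptions chosen to keep the illustration small. In practice the term similarity would be an ontology-based measure over HPO, GO or DO terms.

```python
from typing import Callable, Dict, List, Sequence, Tuple

TermSim = Callable[[str, str], float]


def best_match_average(a: Sequence[str], b: Sequence[str], term_sim: TermSim) -> float:
    """Average, in both directions, of each term's best match in the other set."""
    if not a or not b:
        return 0.0
    a_to_b = sum(max(term_sim(x, y) for y in b) for x in a) / len(a)
    b_to_a = sum(max(term_sim(x, y) for x in a) for y in b) / len(b)
    return (a_to_b + b_to_a) / 2


def rank_genes(gene_annotations: Dict[str, Sequence[str]],
               disease_terms: Sequence[str],
               term_sim: TermSim) -> List[Tuple[str, float]]:
    """Order genes by decreasing similarity to the disease annotations.

    Ties are broken alphabetically so the ranking is deterministic.
    """
    scored = [(gene, best_match_average(terms, disease_terms, term_sim))
              for gene, terms in gene_annotations.items()]
    return sorted(scored, key=lambda item: (-item[1], item[0]))


def exact_match(t1: str, t2: str) -> float:
    """Placeholder term similarity: 1.0 for identical terms, else 0.0."""
    return 1.0 if t1 == t2 else 0.0


# Hypothetical gene and term identifiers, for illustration only.
genes = {
    "GENE_A": ["t1", "t2"],
    "GENE_B": ["t1", "t3"],
    "GENE_C": ["t4"],
}
ranking = rank_genes(genes, ["t1", "t2"], exact_match)
```

Passing the term similarity as a function keeps the ranking logic independent of the ontology, which mirrors the way the abstract describes separate HPO, GO and DO variants of the same prioritization.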


Database ◽  
2020 ◽  
Vol 2020 ◽  
Author(s):  
Carlota Cardoso ◽  
Rita T Sousa ◽  
Sebastian Köhler ◽  
Catia Pesquita

The ability to compare entities within a knowledge graph is a cornerstone technique for several applications, ranging from the integration of heterogeneous data to machine learning. It is of particular importance in the biomedical domain, where semantic similarity can be applied to the prediction of protein–protein interactions, associations between diseases and genes, and cellular localization of proteins, among others. In recent years, several knowledge graph-based semantic similarity measures have been developed, but building a gold standard data set to support their evaluation is non-trivial. We present a collection of 21 benchmark data sets that aim at circumventing the difficulties in building benchmarks for large biomedical knowledge graphs by exploiting proxies for biomedical entity similarity. These data sets include data from two successful biomedical ontologies, Gene Ontology and Human Phenotype Ontology, and explore proxy similarities calculated based on protein sequence similarity, protein family similarity, protein–protein interactions and phenotype-based gene similarity. Data sets have varying sizes and cover four different species at different levels of annotation completion. For each data set, we also provide semantic similarity computations with state-of-the-art representative measures. Database URL: https://github.com/liseda-lab/kgsim-benchmark.
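
One plausible way to use such proxies, which the abstract above does not spell out, is to correlate a semantic similarity measure's scores with the proxy similarity over the same entity pairs. The sketch below computes a plain Pearson correlation; the function name and the example values are invented for illustration and do not come from the benchmark.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation between two equally long score lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


# Made-up scores for three entity pairs.
semantic_scores = [0.1, 0.4, 0.8]   # from a semantic similarity measure
proxy_scores = [0.2, 0.5, 0.9]      # e.g. a sequence-similarity proxy
agreement = pearson(semantic_scores, proxy_scores)
```

A higher correlation would suggest that the semantic measure tracks the proxy more closely, under the assumption that the proxy itself is a reasonable stand-in for true entity similarity.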


2013 ◽  
Vol 2013 ◽  
pp. 1-11 ◽  
Author(s):  
Gaston K. Mazandu ◽  
Nicola J. Mulder

Several approaches have been proposed for computing term information content (IC) and semantic similarity scores within the gene ontology (GO) directed acyclic graph (DAG). These approaches contributed to improving protein analyses at the functional level. Considering the recent proliferation of these approaches, a unified theory in a well-defined mathematical framework is necessary in order to provide a theoretical basis for validating these approaches. We review the existing IC-based ontological similarity approaches developed in the biomedical and bioinformatics fields to propose a general framework and unified description of all these measures. We have conducted an experimental evaluation to assess the impact of IC approaches, different normalization models, and correction factors on the performance of a functional similarity metric. Results reveal that considering only parents or only children of terms when assessing information content or semantic similarity scores negatively impacts the approach under consideration. This study produces a unified framework for current and future GO semantic similarity measures and provides a theoretical basis for comparing different approaches. The experimental evaluation of different approaches based on different term information content models paves the way towards a solution to the issue of scoring a term’s specificity in the GO DAG.
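
As a concrete reference point for the IC-based measures this abstract reviews, here is a minimal sketch of one widely used combination: annotation-based information content, IC(t) = -log p(t), and Resnik similarity, which scores two terms by the IC of their most informative common ancestor. The toy DAG, term names and annotation counts are invented, and this is only one of the IC models such a review would cover, not the unified framework the authors propose.

```python
import math
from typing import Dict, List, Set

# Toy DAG (child -> parents) and direct annotation counts; both are
# invented for illustration and are not real GO terms or frequencies.
PARENTS: Dict[str, List[str]] = {
    "root": [],
    "a": ["root"],
    "b": ["root"],
    "a1": ["a"],
    "a2": ["a"],
}
DIRECT_COUNTS: Dict[str, int] = {"root": 0, "a": 1, "b": 4, "a1": 2, "a2": 1}


def ancestors(term: str) -> Set[str]:
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content() -> Dict[str, float]:
    """IC(t) = -log p(t), where p(t) counts annotations to t or any descendant."""
    propagated = {term: 0 for term in PARENTS}
    for term, count in DIRECT_COUNTS.items():
        for anc in ancestors(term):
            propagated[anc] += count
    total = sum(DIRECT_COUNTS.values())
    # Terms with no annotations at all have undefined IC and are skipped.
    return {t: -math.log(c / total) for t, c in propagated.items() if c > 0}


def resnik(t1: str, t2: str, ic: Dict[str, float]) -> float:
    """IC of the most informative common ancestor of t1 and t2."""
    common = ancestors(t1) & ancestors(t2)
    return max(ic.get(term, 0.0) for term in common)


ic = information_content()
sibling_similarity = resnik("a1", "a2", ic)   # shared parent "a"
distant_similarity = resnik("a1", "b", ic)    # only "root" in common
```

Because annotation counts are propagated upward, the root always has an IC of zero and more specific terms have higher IC, which is why two terms that share only the root score zero under Resnik.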


2010 ◽  
Vol 11 (1) ◽  
pp. 290 ◽  
Author(s):  
Jing Wang ◽  
Xianxiao Zhou ◽  
Jing Zhu ◽  
Chenggui Zhou ◽  
Zheng Guo
