Matching Biomedical Ontologies: Construction of Matching Clues and Systematic Evaluation of Different Combinations of Matchers

10.2196/28212 ◽  
2021 ◽  
Vol 9 (8) ◽  
pp. e28212
Author(s):  
Peng Wang ◽  
Yunyan Hu ◽  
Shaochen Bai ◽  
Shiyi Zou

Background Ontology matching seeks to find semantic correspondences between ontologies. With an increasing number of biomedical ontologies being developed independently, matching these ontologies to solve the interoperability problem has become a critical task in biomedical applications. However, some challenges remain. First, extracting and constructing matching clues from biomedical ontologies is a nontrivial problem. Second, it is unknown whether there are dominant matchers when matching biomedical ontologies. Finally, ontology matching also suffers from high computational complexity owing to the large size of biomedical ontologies. Objective To investigate the effectiveness of matching clues and composite matching approaches, this paper presents a spectrum of matchers with different combination strategies and empirically studies their influence on matching biomedical ontologies. In addition, extended reduction anchors are introduced to effectively decrease the time complexity of matching large biomedical ontologies. Methods In this paper, atomic and composite matching clues are first constructed along 4 dimensions: terminology, structure, external knowledge, and representation learning. Then, a spectrum of matchers based on flexible combinations of atomic clues is designed and used to comprehensively study their effectiveness. We also carry out a systematic comparative evaluation of different combinations of matchers. Finally, the extended reduction anchor is proposed to significantly alleviate the time complexity of matching large-scale biomedical ontologies. Results Experimental results show that considering distinguishable matching clues in biomedical ontologies leads to a substantial improvement compared with indiscriminately using all available information. Moreover, incorporating different types of matchers according to their reliability yields a marked improvement, with performance comparable to state-of-the-art methods. The dominant matchers achieve F1 measures of 0.9271, 0.8218, and 0.5 on the Anatomy, FMA-NCI (Foundational Model of Anatomy-National Cancer Institute), and FMA-SNOMED data sets, respectively. The extended reduction anchor solves the scalability problem of matching large biomedical ontologies: it achieves a significant reduction in time complexity with little loss of F1 measure, namely a 0.21% decrease on the Anatomy data set, a 0.84% decrease on the FMA-NCI data set, and even a 2.65% increase on the FMA-SNOMED data set. Conclusions This paper systematically analyzes and compares the effectiveness of different matching clues, matchers, and combination strategies. Multiple empirical studies demonstrate that distinguishable clues have significant implications for matching biomedical ontologies. In contrast to matchers based on a single clue, those combining multiple clues exhibit more stable and accurate performance. In addition, our results provide evidence that the approach based on extended reduction anchors performs well for large ontology matching tasks, offering an effective solution to the scalability problem.
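Concretely, a composite matcher of the kind described above can be pictured as a weighted combination of atomic clue scores. The following is a minimal sketch, not the authors' implementation: the clue functions, the reliability weights, and the acceptance threshold mentioned in the comment are illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation): combining atomic matching
# clues into a weighted composite matcher. Clue functions, weights, and the
# 0.8 acceptance threshold are illustrative assumptions.
from difflib import SequenceMatcher

def label_similarity(label_a: str, label_b: str) -> float:
    """Terminology clue: normalized string similarity between concept labels."""
    return SequenceMatcher(None, label_a.lower(), label_b.lower()).ratio()

def neighbor_similarity(neigh_a: set, neigh_b: set) -> float:
    """Structure clue: Jaccard overlap of already-matched neighbor concepts."""
    if not neigh_a and not neigh_b:
        return 0.0
    return len(neigh_a & neigh_b) / len(neigh_a | neigh_b)

def composite_score(label_a, label_b, neigh_a, neigh_b,
                    weights=(0.7, 0.3)) -> float:
    """Combine atomic clues with reliability weights into one matcher score."""
    w_term, w_struct = weights
    return (w_term * label_similarity(label_a, label_b)
            + w_struct * neighbor_similarity(neigh_a, neigh_b))

# Example: candidate correspondence between two anatomy concepts.
score = composite_score("Heart valve", "Valve of heart",
                        {"heart"}, {"heart", "aorta"})
print(f"composite similarity = {score:.3f}")  # accept if above a threshold, e.g. 0.8
```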


2020 ◽  
pp. 1-51
Author(s):  
Ivan Vulić ◽  
Simon Baker ◽  
Edoardo Maria Ponti ◽  
Ulla Petti ◽  
Ira Leviant ◽  
...  

We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering data sets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language data set is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing a representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness levels. Additionally, owing to the alignment of concepts across languages, we provide a suite of 66 crosslingual semantic similarity data sets. Because of its extensive size and language coverage, Multi-SimLex provides entirely novel opportunities for experimental evaluation and analysis. On its monolingual and crosslingual benchmarks, we evaluate and analyze a wide array of recent state-of-the-art monolingual and crosslingual representation models, including static and contextualized word embeddings (such as fastText, monolingual and multilingual BERT, XLM), externally informed lexical representations, as well as fully unsupervised and (weakly) supervised crosslingual word embeddings. We also present a step-by-step data set creation protocol for creating consistent, Multi-SimLex-style resources for additional languages. We make these contributions (the public release of Multi-SimLex data sets, their creation protocol, strong baseline results, and in-depth analyses that can be helpful in guiding future developments in multilingual lexical semantics and representation learning) available via a Web site that will encourage community effort in further expansion of Multi-SimLex to many more languages. Such a large-scale semantic resource could inspire significant further advances in NLP across languages.
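The standard intrinsic evaluation implied here, scoring an embedding model by the Spearman correlation between its cosine similarities and human similarity ratings, can be sketched as follows; the toy vectors, word pairs, and ratings are placeholders, not Multi-SimLex data.

```python
# Minimal sketch (assumed evaluation protocol, not the released toolkit): scoring a
# static word-embedding model against Multi-SimLex-style human similarity ratings
# with Spearman's rho. The tiny vectors and ratings below are placeholders.
import numpy as np
from scipy.stats import spearmanr

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# word -> embedding (in practice loaded from fastText, BERT pooling, etc.)
vectors = {
    "car":   np.array([0.9, 0.1, 0.0]),
    "auto":  np.array([0.85, 0.15, 0.05]),
    "river": np.array([0.1, 0.8, 0.3]),
    "bank":  np.array([0.2, 0.6, 0.5]),
}
# (word1, word2, human similarity rating on a 0-6 scale) -- illustrative values
pairs = [("car", "auto", 5.8), ("river", "bank", 2.1), ("car", "river", 0.4)]

model_scores = [cosine(vectors[w1], vectors[w2]) for w1, w2, _ in pairs]
human_scores = [rating for _, _, rating in pairs]
rho, _ = spearmanr(model_scores, human_scores)
print(f"Spearman correlation with human judgments: {rho:.3f}")
```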


2019 ◽  
Author(s):  
Dhananjay Kimothi ◽  
Pravesh Biyani ◽  
James M Hogan ◽  
Akshay Soni ◽  
Wayne Kelly

Similarity-based search of sequence collections is a core task in bioinformatics, one dominated for most of the genomic era by exact and heuristic alignment-based algorithms. However, even efficient heuristics such as BLAST may not scale to the data sets now emerging, motivating a range of alignment-free alternatives exploiting the underlying lexical structure of each sequence. In this paper, we introduce SuperVec, a novel supervised approach to learning sequence embeddings. Our method extends earlier Representation Learning (RL) based methods to jointly include contextual and class-related information for each sequence during training. This ensures that related sequence fragments have proximal representations in the target space, better reflecting the structure of the domain. Such representations may be used for downstream machine learning tasks or employed directly. Here, we apply SuperVec embeddings to a sequence retrieval task, where the goal is to retrieve sequences with the same family label as a given query. The SuperVec approach is extended further through H-SuperVec, a tree-based hierarchical method which learns embeddings across a range of feature spaces based on the class labels and their exclusive and exhaustive subsets. Experiments show that supervised learning of embeddings based on sequence labels using SuperVec and H-SuperVec provides a substantial improvement in retrieval performance over existing (unsupervised) RL-based approaches. Further, the new methods are an order of magnitude faster than BLAST for the database retrieval task, supporting hybrid approaches in which SuperVec rapidly filters the collection so that only potentially relevant records remain, allowing slower, more accurate methods to be executed quickly over a far smaller data set. Thus, we may achieve faster query processing and higher precision than before. Finally, for some problems, direct use of embeddings is already sufficient to yield high levels of precision and recall. Extending this work to encompass weaker homology is the subject of ongoing research.
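The retrieval step described above, ranking database sequences by their embedding similarity to a query before handing candidates to a slower aligner, can be sketched as follows; the embeddings are random stand-ins rather than trained SuperVec vectors, and the function name is illustrative.

```python
# Minimal sketch of the retrieval step only (SuperVec training itself is not shown):
# given embeddings for database sequences and a query, rank the database by cosine
# similarity and return the top hits. Embedding values here are random stand-ins.
import numpy as np

rng = np.random.default_rng(0)
db_ids = ["seqA", "seqB", "seqC", "seqD"]
db_embeddings = rng.normal(size=(4, 64))        # one learned vector per sequence
query_embedding = rng.normal(size=64)           # embedding of the query fragment

def top_k_hits(query: np.ndarray, db: np.ndarray, ids: list, k: int = 2):
    """Rank database sequences by cosine similarity to the query embedding."""
    db_norm = db / np.linalg.norm(db, axis=1, keepdims=True)
    q_norm = query / np.linalg.norm(query)
    scores = db_norm @ q_norm
    order = np.argsort(-scores)[:k]
    return [(ids[i], float(scores[i])) for i in order]

# Candidates returned here would then be re-scored by a slower, more accurate aligner.
print(top_k_hits(query_embedding, db_embeddings, db_ids))
```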


2022 ◽  
Vol 23 (1) ◽  
Author(s):  
Hanjing Jiang ◽  
Yabing Huang

Abstract Background Drug-disease associations (DDAs) can provide important information for exploring the potential efficacy of drugs. However, up to now, few DDAs have been verified by experiments. Previous evidence indicates that combining multiple sources of information is conducive to the discovery of new DDAs. How to integrate different biological data sources and identify the most effective drugs for a certain disease based on drug-disease coupled mechanisms remains a challenging problem. Results In this paper, we propose a novel computational model for DDA prediction based on graph representation learning over a multi-biomolecular network (GRLMN). More specifically, we first construct a large-scale molecular association network (MAN) by integrating the associations among drugs, diseases, proteins, miRNAs, and lncRNAs. Then, a graph embedding model is used to learn vector representations for all drugs and diseases in the MAN. Finally, the combined features are fed to a random forest (RF) model to predict new DDAs. The proposed model was evaluated on the SCMFDD-S data set using five-fold cross-validation. Experimental results showed that the GRLMN model was highly accurate, with an area under the ROC curve (AUC) of 87.9%, outperforming all previous works on this benchmark in terms of both accuracy and AUC. To further verify the high performance of GRLMN, we carried out two case studies for two common diseases. As a result, in the rankings of drugs predicted to be related to these diseases (kidney disease and fever), 15 of the top 20 drugs have been experimentally confirmed. Conclusions The experimental results show that our model performs well in predicting DDAs. GRLMN is an effective prioritization tool for screening reliable DDAs for follow-up studies concerning their role in drug repositioning.
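A rough sketch of the final prediction stage, concatenated drug and disease embeddings fed to a random forest under five-fold cross-validation, is given below; the embeddings, labels, and hyperparameters are placeholders, not the GRLMN configuration.

```python
# Minimal sketch (assumed pipeline, not the authors' code): concatenate drug and
# disease node embeddings learned from the molecular association network and feed
# them to a random forest, scored by AUC under five-fold cross-validation.
# Embeddings and labels below are random placeholders for real MAN-derived features.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)
n_pairs, dim = 200, 64
drug_emb = rng.normal(size=(n_pairs, dim))       # graph embedding of the drug node
disease_emb = rng.normal(size=(n_pairs, dim))    # graph embedding of the disease node
X = np.hstack([drug_emb, disease_emb])           # combined drug-disease feature vector
y = rng.integers(0, 2, size=n_pairs)             # 1 = known association, 0 = sampled negative

clf = RandomForestClassifier(n_estimators=200, random_state=0)
auc_scores = cross_val_score(clf, X, y, cv=5, scoring="roc_auc")
print(f"mean AUC over 5 folds: {auc_scores.mean():.3f}")
```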


2010 ◽  
Vol 20 (supp01) ◽  
pp. 1491-1510 ◽  
Author(s):  
ANDREA CAVAGNA ◽  
ALESSIO CIMARELLI ◽  
IRENE GIARDINA ◽  
GIORGIO PARISI ◽  
RAFFAELE SANTAGATI ◽  
...  

Animal groups represent magnificent archetypes of self-organized collective behavior. As such, they have attracted enormous interdisciplinary interest in recent years. From a mechanistic point of view, animal aggregations resemble physical systems of particles or spins, in which the individual constituents interact locally, giving rise to ordering at the global scale. This analogy has fostered important research in which numerical and theoretical approaches from physics have been applied to models of self-organized motion. In this paper, we discuss how the methodology of physics may provide precious conceptual and technical instruments in empirical studies of collective animal behavior. We focus on three-dimensional groups, for which empirical data have been extremely scarce until recently, and describe novel experimental protocols that allow the reconstruction of aggregations of thousands of individuals. We show how an appropriate statistical analysis of these large-scale data allows us to infer important information on the interactions between individuals in a group, a key issue in behavioral studies and a basic ingredient of theoretical models. To this aim, we revisit the approach we recently used on starling flocks and apply it to a much larger data set, never analyzed before. The results confirm our previous findings and indicate that interactions between birds have a topological rather than metric nature, each individual interacting with a fixed number of neighbors irrespective of their distances.
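The distinction between topological and metric interactions can be illustrated with a small sketch on simulated positions; the bird coordinates, the neighbor count k, and the radius are arbitrary choices, and the statistical inference used on the real flock data is not reproduced here.

```python
# Minimal sketch contrasting topological and metric neighborhoods for simulated
# bird positions; the real analysis infers the interaction range statistically
# from reconstructed flock data, which is not reproduced here.
import numpy as np

rng = np.random.default_rng(1)
positions = rng.uniform(0, 50, size=(100, 3))   # 100 birds in a 50 m cube

def topological_neighbors(pos: np.ndarray, i: int, k: int = 7) -> np.ndarray:
    """Indices of the k nearest birds to bird i, regardless of distance."""
    d = np.linalg.norm(pos - pos[i], axis=1)
    d[i] = np.inf
    return np.argsort(d)[:k]

def metric_neighbors(pos: np.ndarray, i: int, radius: float = 5.0) -> np.ndarray:
    """Indices of birds within a fixed metric radius of bird i."""
    d = np.linalg.norm(pos - pos[i], axis=1)
    return np.flatnonzero((d < radius) & (d > 0))

# Under a topological rule the neighbor count stays fixed (here k = 7) while the
# metric count varies with local density.
print(len(topological_neighbors(positions, 0)), len(metric_neighbors(positions, 0)))
```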


2017 ◽  
Vol 43 (4) ◽  
pp. 781-835 ◽  
Author(s):  
Ivan Vulić ◽  
Daniela Gerz ◽  
Douwe Kiela ◽  
Felix Hill ◽  
Anna Korhonen

We introduce HyperLex, a data set and evaluation resource that quantifies the extent of semantic category membership, that is, the type-of relation, also known as the hyponymy–hypernymy or lexical entailment (LE) relation, between 2,616 concept pairs. Cognitive psychology research has established that typicality and category/class membership are computed in human semantic memory as gradual rather than binary relations. Nevertheless, most NLP research and existing large-scale inventories of concept category membership (WordNet, DBPedia, etc.) treat category membership and LE as binary. To address this, we asked hundreds of native English speakers to indicate the typicality and strength of category membership between a diverse range of concept pairs on a crowdsourcing platform. Our results confirm that category membership and LE are indeed more gradual than binary. We then compare these human judgments with the predictions of automatic systems, which reveals a huge gap between human performance and state-of-the-art LE, distributional, and representation learning models, as well as substantial differences between the models themselves. We discuss a pathway for improving semantic models to overcome this discrepancy and indicate future application areas for improved graded LE systems.
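The contrast between binary and graded treatments of lexical entailment can be illustrated with a toy comparison of both against human ratings; all pairs and scores below are invented and do not come from HyperLex.

```python
# Minimal sketch (illustrative, not the HyperLex evaluation code): comparing graded
# human lexical-entailment ratings with a binary resource-style prediction and with
# a graded model score, using Spearman's rho. All numbers are made up.
from scipy.stats import spearmanr

# (hyponym, hypernym, human rating on a 0-10 "X is a type of Y" scale)
pairs = [("chair", "furniture", 9.5), ("penguin", "bird", 7.2),
         ("tomato", "vegetable", 5.1), ("car", "animal", 0.3)]
human = [r for _, _, r in pairs]

binary_wordnet_style = [1, 1, 0, 0]            # hard type-of decision per pair
graded_model = [0.92, 0.66, 0.48, 0.05]        # e.g., a calibrated entailment score

print("binary vs human:", spearmanr(binary_wordnet_style, human).correlation)
print("graded vs human:", spearmanr(graded_model, human).correlation)
```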


2019 ◽  
Vol 20 (S24) ◽  
Author(s):  
Zachary B. Abrams ◽  
Travis S. Johnson ◽  
Kun Huang ◽  
Philip R. O. Payne ◽  
Kevin Coombes

Abstract Background RNA sequencing technologies have allowed researchers to gain a better understanding of how the transcriptome affects disease. However, sequencing technologies often unintentionally introduce experimental error into RNA sequencing data. To counteract this, normalization methods are standardly applied with the intent of reducing the non-biologically derived variability inherent in transcriptomic measurements. However, the comparative efficacy of the various normalization techniques has not been tested in a standardized manner. Here we propose tests that evaluate numerous normalization techniques and apply them to a large-scale standard data set. These tests comprise a protocol that allows researchers to measure the amount of non-biological variability that is present in any data set after normalization has been performed, a crucial step in assessing the biological validity of data following normalization. Results In this study we present two tests to assess the validity of normalization methods applied to a large-scale data set collected for systematic evaluation purposes. We tested various RNASeq normalization procedures and concluded that transcripts per million (TPM) was the best-performing normalization method based on its preservation of biological signal compared with the other methods tested. Conclusion Normalization is of vital importance for accurately interpreting the results of genomic and transcriptomic experiments. More work, however, needs to be performed to optimize normalization methods for RNASeq data. The present effort helps pave the way for more systematic evaluations of normalization methods across different platforms. With our proposed schema, researchers can evaluate their own or future normalization methods to further improve the field of RNASeq normalization.
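For reference, the TPM transformation singled out above normalizes counts by gene length and then rescales each sample to sum to one million; a minimal sketch with toy counts and lengths follows.

```python
# Minimal sketch of transcripts-per-million (TPM) normalization, the method the
# study found best preserved biological signal. Counts and gene lengths below are
# toy values; a real matrix would have one column per sample.
import numpy as np

counts = np.array([500.0, 1200.0, 300.0])       # raw read counts for three genes
lengths_kb = np.array([2.0, 4.0, 1.5])          # gene lengths in kilobases

rate = counts / lengths_kb                      # length-normalized read rate per gene
tpm = rate / rate.sum() * 1_000_000             # scale so each sample sums to one million

print(tpm, tpm.sum())                           # sums to 1e6 by construction
```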


2021 ◽  
pp. 1-48
Author(s):  
Olga Majewska ◽  
Diana McCarthy ◽  
Jasper J. F. van den Bosch ◽  
Nikolaus Kriegeskorte ◽  
Ivan Vulić ◽  
...  

Research into representation learning models of lexical semantics usually utilizes some form of intrinsic evaluation to ensure that the learned representations reflect human semantic judgments. Lexical semantic similarity estimation is a widely used evaluation method, but efforts have typically focused on pairwise judgments of words in isolation, or are limited to specific contexts and lexical stimuli. These approaches are limited in that they either provide no context for judgments, thereby ignoring ambiguity, or provide very specific sentential contexts that cannot then be used to generate a larger lexical resource. Furthermore, similarity between more than two items is not considered. We provide a full description and analysis of our recently proposed methodology for large-scale data set construction that produces a semantic classification of a large sample of verbs in the first phase, as well as multi-way similarity judgments made within the resultant semantic classes in the second phase. The methodology uses a spatial multi-arrangement approach proposed in the field of cognitive neuroscience for capturing multi-way similarity judgments of visual stimuli. We have adapted this method to handle polysemous linguistic stimuli and much larger samples than previous work. We specifically target verbs, but the method can equally be applied to other parts of speech. We perform cluster analysis on the data from the first phase and demonstrate how this might be useful in the construction of a comprehensive verb resource. We also analyze the semantic information captured by the second phase and discuss the potential of the spatially induced similarity judgments to better reflect human notions of word similarity. We demonstrate how the resultant data set can be used for fine-grained analyses and evaluation of representation learning models on the intrinsic tasks of semantic clustering and semantic similarity. In particular, we find that stronger static word embedding methods still outperform lexical representations emerging from more recent pre-training methods, both on word-level similarity and clustering. Moreover, thanks to the data set's vast coverage, we are able to compare the benefits of specializing vector representations for a particular type of external knowledge by evaluating FrameNet- and VerbNet-retrofitted models on specific semantic domains such as “Heat” or “Motion.”
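One of the intrinsic evaluations mentioned above, semantic clustering against gold verb classes, can be sketched as follows; the verbs, embeddings, and gold labels are tiny placeholders rather than data from the resource.

```python
# Minimal sketch (assumed evaluation setup, not the paper's scripts): cluster verb
# embeddings and score the result against gold semantic classes with the adjusted
# Rand index, mirroring the intrinsic semantic-clustering evaluation described above.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

verbs = ["boil", "fry", "simmer", "run", "jog", "sprint"]
embeddings = np.array([
    [0.9, 0.1], [0.85, 0.2], [0.8, 0.15],   # cooking / "Heat" verbs
    [0.1, 0.9], [0.15, 0.85], [0.2, 0.8],   # motion verbs
])
gold_classes = [0, 0, 0, 1, 1, 1]

predicted = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(embeddings)
print("ARI:", adjusted_rand_score(gold_classes, predicted))
```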


Author(s):  
Hao Liu ◽  
Satoshi Oyama ◽  
Masahito Kurihara ◽  
Haruhiko Sato

Clustering is an important tool for data analysis, and many clustering techniques have been proposed over the past years. Among them are density-based clustering methods, which have several benefits: the number of clusters is not required before carrying out clustering, the detected clusters can have arbitrary shapes, and outliers can be detected and removed. Recently, density-based algorithms have been extended with fuzzy set theory, which has made them more robust. However, density-based clustering algorithms usually require a time complexity of O(n²), where n is the number of points in the data set, implying that they are not suitable for large-scale data sets. In this paper, a novel clustering algorithm called landmark fuzzy neighborhood DBSCAN (landmark FN-DBSCAN) is proposed. The concept of a landmark is used to represent a subset of the input data set, which makes the algorithm efficient on large-scale data sets. We give a theoretical analysis of time complexity and space complexity, which shows that both are linear in the size of the data set. The experiments show that landmark FN-DBSCAN is much faster than FN-DBSCAN and provides very good clustering quality.
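The landmark idea can be illustrated with a simplified sketch that omits the fuzzy-neighborhood component: plain DBSCAN is run on a small set of landmarks, and every remaining point inherits the label of its nearest landmark, so the expensive step touches only m << n points. The parameters and data below are arbitrary.

```python
# Minimal sketch of the landmark idea only (the fuzzy-neighborhood part of
# FN-DBSCAN is omitted): cluster a small set of landmark points with plain DBSCAN,
# then assign every remaining point the label of its nearest landmark, so the
# expensive step runs on m << n points.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 0.5, size=(5000, 2)),
                  rng.normal(5, 0.5, size=(5000, 2))])   # two dense blobs, n = 10000

m = 200                                                   # number of landmarks
landmarks = data[rng.choice(len(data), size=m, replace=False)]
landmark_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(landmarks)

# Assign each point to the cluster of its nearest landmark.
dists = np.linalg.norm(data[:, None, :] - landmarks[None, :, :], axis=2)
point_labels = landmark_labels[np.argmin(dists, axis=1)]
print(np.unique(point_labels, return_counts=True))
```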


2009 ◽  
Vol 28 (11) ◽  
pp. 2737-2740
Author(s):  
Xiao ZHANG ◽  
Shan WANG ◽  
Na LIAN
