Semantic Data Set Construction from Human Clustering and Spatial Arrangement

2021
pp. 1-48
Author(s):
Olga Majewska
Diana McCarthy
Jasper J. F. van den Bosch
Nikolaus Kriegeskorte
Ivan Vulić
...

Research into representation learning models of lexical semantics usually utilizes some form of intrinsic evaluation to ensure that the learned representations reflect human semantic judgments. Lexical semantic similarity estimation is a widely used evaluation method, but efforts have typically focused on pairwise judgments of words in isolation, or are limited to specific contexts and lexical stimuli. These approaches have limitations: they either provide no context for judgments, thereby ignoring ambiguity, or provide very specific sentential contexts that cannot then be used to generate a larger lexical resource. Furthermore, similarity between more than two items is not considered. We provide a full description and analysis of our recently proposed methodology for large-scale data set construction that produces a semantic classification of a large sample of verbs in the first phase, as well as multi-way similarity judgments made within the resultant semantic classes in the second phase. The methodology uses a spatial multi-arrangement approach proposed in the field of cognitive neuroscience for capturing multi-way similarity judgments of visual stimuli. We have adapted this method to handle polysemous linguistic stimuli and much larger samples than previous work. We specifically target verbs, but the method can equally be applied to other parts of speech. We perform cluster analysis on the data from the first phase and demonstrate how this might be useful in the construction of a comprehensive verb resource. We also analyze the semantic information captured by the second phase and discuss the potential of the spatially induced similarity judgments to better reflect human notions of word similarity. We demonstrate how the resultant data set can be used for fine-grained analyses and evaluation of representation learning models on the intrinsic tasks of semantic clustering and semantic similarity. In particular, we find that stronger static word embedding methods still outperform lexical representations emerging from more recent pre-training methods, both on word-level similarity and clustering. Moreover, thanks to the data set’s vast coverage, we are able to compare the benefits of specializing vector representations for a particular type of external knowledge by evaluating FrameNet- and VerbNet-retrofitted models on specific semantic domains such as “Heat” or “Motion.”
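
To make the clustering step concrete, here is a minimal sketch (not the authors' code) of how verbs could be grouped from a pairwise dissimilarity matrix of the kind that spatial arrangements yield; the verbs and distance values are hypothetical stand-ins.

```python
# Illustrative sketch: cluster verbs from a dissimilarity matrix, such as one
# aggregated from participants' spatial arrangements (values are made up).
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

verbs = ["boil", "fry", "simmer", "run", "walk", "jog"]
# Hypothetical pairwise dissimilarities (0 = placed together, 1 = far apart).
dissim = np.array([
    [0.0, 0.2, 0.1, 0.9, 0.8, 0.9],
    [0.2, 0.0, 0.3, 0.9, 0.9, 0.8],
    [0.1, 0.3, 0.0, 0.8, 0.9, 0.9],
    [0.9, 0.9, 0.8, 0.0, 0.2, 0.1],
    [0.8, 0.9, 0.9, 0.2, 0.0, 0.3],
    [0.9, 0.8, 0.9, 0.1, 0.3, 0.0],
])

# Condense the symmetric matrix and apply average-linkage clustering.
Z = linkage(squareform(dissim), method="average")
labels = fcluster(Z, t=2, criterion="maxclust")
for verb, label in zip(verbs, labels):
    print(verb, "-> cluster", label)
```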

2020
pp. 1-51
Author(s):
Ivan Vulić
Simon Baker
Edoardo Maria Ponti
Ulla Petti
Ira Leviant
...

We introduce Multi-SimLex, a large-scale lexical resource and evaluation benchmark covering data sets for 12 typologically diverse languages, including major languages (e.g., Mandarin Chinese, Spanish, Russian) as well as less-resourced ones (e.g., Welsh, Kiswahili). Each language data set is annotated for the lexical relation of semantic similarity and contains 1,888 semantically aligned concept pairs, providing a representative coverage of word classes (nouns, verbs, adjectives, adverbs), frequency ranks, similarity intervals, lexical fields, and concreteness levels. Additionally, owing to the alignment of concepts across languages, we provide a suite of 66 crosslingual semantic similarity data sets. Because of its extensive size and language coverage, Multi-SimLex provides entirely novel opportunities for experimental evaluation and analysis. On its monolingual and crosslingual benchmarks, we evaluate and analyze a wide array of recent state-of-the-art monolingual and crosslingual representation models, including static and contextualized word embeddings (such as fastText, monolingual and multilingual BERT, XLM), externally informed lexical representations, as well as fully unsupervised and (weakly) supervised crosslingual word embeddings. We also present a step-by-step data set creation protocol for creating consistent, Multi-SimLex-style resources for additional languages. We make these contributions—the public release of the Multi-SimLex data sets, their creation protocol, strong baseline results, and in-depth analyses, which can be helpful in guiding future developments in multilingual lexical semantics and representation learning—available via a Web site that will encourage community effort in further expansion of Multi-SimLex to many more languages. Such a large-scale semantic resource could inspire significant further advances in NLP across languages.
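
As an illustration of the standard evaluation protocol on a benchmark of this kind, the following sketch correlates a model's cosine similarities with human ratings via Spearman's rho; the word pairs, scores, and random vectors are placeholders for real Multi-SimLex data and embeddings such as fastText.

```python
# Minimal sketch of intrinsic similarity evaluation: Spearman correlation
# between human ratings and a model's cosine similarities (toy data).
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
pairs = [("car", "automobile", 9.1), ("cup", "mug", 8.4), ("car", "cup", 1.2)]
# Random vectors stand in for real embeddings.
vectors = {w: rng.normal(size=300) for w in {w for p in pairs for w in p[:2]}}

def cosine(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

human = [score for _, _, score in pairs]
model = [cosine(vectors[a], vectors[b]) for a, b, _ in pairs]
rho, _ = spearmanr(human, model)
print(f"Spearman's rho: {rho:.3f}")
```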


Author(s):  
Olga N. Nasonova
Yeugeniy M. Gusev
Evgeny E. Kovalev
Georgy V. Ayzel

Abstract. Climate change impact on river runoff was investigated within the framework of the second phase of the Inter-Sectoral Impact Model Intercomparison Project (ISI-MIP2) using the physically based land surface model Soil Water – Atmosphere – Plants (SWAP), developed in the Institute of Water Problems of the Russian Academy of Sciences, and meteorological projections (for 2006–2099) simulated by five General Circulation Models (GCMs) (GFDL-ESM2M, HadGEM2-ES, IPSL-CM5A-LR, MIROC-ESM-CHEM, and NorESM1-M) for each of four Representative Concentration Pathway (RCP) scenarios (RCP2.6, RCP4.5, RCP6.0, and RCP8.5). Eleven large-scale river basins were used in this study. First, SWAP was calibrated and validated against monthly values of measured river runoff using forcing data from the WATCH data set, and all GCM projections were bias-corrected to the WATCH data. Then, for each basin, 20 projections of possible changes in river runoff during the 21st century were simulated by SWAP. Analysis of the obtained hydrological projections allowed us to estimate the uncertainties resulting from the application of different GCMs and RCP scenarios. On average, the contribution of the GCMs to the uncertainty of the projected river runoff is nearly twice as large as the contribution of the RCP scenarios. At the same time, the contribution of the GCMs slightly decreases with time.
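
One simple way to compare the two uncertainty sources, sketched below under the assumption that the spread (standard deviation) across ensemble members is the uncertainty measure (the paper's exact metric may differ), is to contrast the spread across GCMs with the spread across RCP scenarios; the runoff values are synthetic.

```python
# Hedged sketch: apportion projection spread between GCMs and RCP scenarios
# for the 5 GCM x 4 RCP ensemble of one basin (synthetic values).
import numpy as np

rng = np.random.default_rng(42)
# Hypothetical projected runoff changes (%) for 5 GCMs x 4 RCP scenarios.
runoff_change = rng.normal(loc=10.0, scale=5.0, size=(5, 4))

gcm_spread = runoff_change.std(axis=0).mean()  # spread across GCMs, per RCP
rcp_spread = runoff_change.std(axis=1).mean()  # spread across RCPs, per GCM
print(f"GCM-related spread: {gcm_spread:.2f}")
print(f"RCP-related spread: {rcp_spread:.2f}")
print(f"ratio (GCM / RCP): {gcm_spread / rcp_spread:.2f}")
```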


2022
Vol 23 (1)
Author(s):
Hanjing Jiang
Yabing Huang

Abstract Background Drug-disease associations (DDAs) can provide important information for exploring the potential efficacy of drugs. However, up to now, few DDAs have been verified by experiments. Previous evidence indicates that combining multiple sources of information is conducive to the discovery of new DDAs. How to integrate different biological data sources and identify the most effective drugs for a certain disease based on drug-disease coupled mechanisms is still a challenging problem. Results In this paper, we propose a novel computational model for DDA prediction based on graph representation learning over a multi-biomolecular network (GRLMN). More specifically, we first construct a large-scale molecular association network (MAN) by integrating the associations among drugs, diseases, proteins, miRNAs, and lncRNAs. Then, a graph embedding model is used to learn vector representations for all drugs and diseases in the MAN. Finally, the combined features are fed to a random forest (RF) model to predict new DDAs. The proposed model was evaluated on the SCMFDD-S data set using five-fold cross-validation. Experimental results showed that GRLMN is highly accurate, with an area under the ROC curve (AUC) of 87.9%, outperforming all previous works on this benchmark in terms of both accuracy and AUC. To further verify the high performance of GRLMN, we carried out case studies for two common diseases. In the resulting rankings of drugs predicted to be related to certain diseases (such as kidney disease and fever), 15 of the top 20 drugs have been experimentally confirmed. Conclusions The experimental results show that our model performs well in DDA prediction. GRLMN is an effective prioritization tool for screening reliable DDAs for follow-up studies concerning their participation in drug repositioning.
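
A minimal sketch of a pipeline in this spirit (not the authors' implementation) is shown below: random-walk node embeddings over a toy association network, with concatenated drug and disease vectors fed to a random forest; all node names, edges, and labels are invented for illustration.

```python
# Illustrative GRLMN-style pipeline: embed nodes of a heterogeneous
# association network via random walks, then classify drug-disease pairs
# with a random forest. All data below are toy stand-ins.
import random
import networkx as nx
import numpy as np
from gensim.models import Word2Vec
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

G = nx.Graph()
G.add_edges_from([
    ("drug:aspirin", "protein:COX1"), ("protein:COX1", "disease:fever"),
    ("drug:aspirin", "disease:fever"), ("drug:ibuprofen", "protein:COX2"),
    ("protein:COX2", "disease:fever"), ("drug:metformin", "disease:diabetes"),
])

def random_walks(graph, num_walks=20, walk_length=10, seed=0):
    """Uniform random walks from every node, as Word2Vec 'sentences'."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for node in graph.nodes:
            walk = [node]
            while len(walk) < walk_length:
                walk.append(rng.choice(list(graph.neighbors(walk[-1]))))
            walks.append(walk)
    return walks

emb = Word2Vec(random_walks(G), vector_size=32, window=3, min_count=1, sg=1)

# Features for candidate pairs: concatenated drug and disease embeddings.
pairs = [("drug:aspirin", "disease:fever", 1),
         ("drug:metformin", "disease:diabetes", 1),
         ("drug:ibuprofen", "disease:diabetes", 0),
         ("drug:metformin", "disease:fever", 0)]
X = np.array([np.concatenate([emb.wv[d], emb.wv[s]]) for d, s, _ in pairs])
y = np.array([label for _, _, label in pairs])

clf = RandomForestClassifier(n_estimators=100, random_state=0)
scores = cross_val_score(clf, X, y, cv=2)  # the paper uses five-fold CV
print("CV accuracy:", scores.mean())
```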


2017
Vol 43 (4)
pp. 781-835
Author(s):
Ivan Vulić
Daniela Gerz
Douwe Kiela
Felix Hill
Anna Korhonen

We introduce HyperLex—a data set and evaluation resource that quantifies the extent of semantic category membership, that is, the type-of relation, also known as the hyponymy–hypernymy or lexical entailment (LE) relation, between 2,616 concept pairs. Cognitive psychology research has established that typicality and category/class membership are computed in human semantic memory as a gradual rather than binary relation. Nevertheless, most NLP research and existing large-scale inventories of concept category membership (WordNet, DBPedia, etc.) treat category membership and LE as binary. To address this, we asked hundreds of native English speakers to indicate the typicality and strength of category membership between a diverse range of concept pairs on a crowdsourcing platform. Our results confirm that category membership and LE are indeed more gradual than binary. We then compare these human judgments with the predictions of automatic systems, which reveals a huge gap between human performance and state-of-the-art LE, distributional, and representation learning models, and substantial differences between the models themselves. We discuss a pathway for improving semantic models to overcome this discrepancy, and indicate future application areas for improved graded LE systems.
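
As a toy illustration of graded LE evaluation, the sketch below correlates human ratings with an assumed asymmetric score (cosine scaled by a vector-norm generality proxy); the words, vectors, and ratings are made up, and the scoring function is a simple stand-in rather than any published LE model.

```python
# Illustrative graded lexical entailment evaluation (all data synthetic).
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(1)
words = ["animal", "dog", "poodle", "chair"]
vec = {w: rng.normal(size=100) for w in words}

def le_score(x, y):
    """How strongly x is a type of y: cosine scaled by relative generality
    of y (hypothetical proxy based on vector norms)."""
    cos = vec[x] @ vec[y] / (np.linalg.norm(vec[x]) * np.linalg.norm(vec[y]))
    gen = np.linalg.norm(vec[y]) / (np.linalg.norm(vec[x]) + np.linalg.norm(vec[y]))
    return cos * gen

# Hypothetical HyperLex-style graded judgments on a 0-10 scale.
pairs = [("dog", "animal", 9.8), ("poodle", "dog", 9.5), ("chair", "animal", 0.3)]
human = [r for _, _, r in pairs]
model = [le_score(x, y) for x, y, _ in pairs]
rho, _ = spearmanr(human, model)
print("Spearman's rho:", rho)
```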


Author(s):
Ishtiaque Ahmed
Manan Darda
Neha Tikyani
Rachit Agrawal
...

The COVID-19 pandemic has caused large-scale outbreaks in more than 150 countries worldwide, causing massive damage to the livelihood of many people. The ability to identify infected patients early and provide appropriate treatment is one of the most important steps in the battle against COVID-19. One of the quickest ways to diagnose patients is to use radiography and radiology images to detect the disease. Early studies have shown that chest X-rays of patients infected with COVID-19 have unique abnormalities. To identify COVID-19 patients from chest X-ray images, we used various deep learning models based on previous studies. We first compiled a data set of 2,815 chest radiographs from public sources. The model produces reliable and stable results with an accuracy of 91.6%, a positive predictive value of 80%, a negative predictive value of 100%, a specificity of 87.50%, and a sensitivity of 100%. It is observed that the CNN-based architecture can diagnose COVID-19 disease. These outcomes can be further improved by increasing the data set size and by further developing the CNN-based architecture for training the model.
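
For reference, all of the reported screening metrics can be derived from a binary confusion matrix, as in this short sketch with made-up predictions:

```python
# Quick sketch of the reported screening metrics, computed from a binary
# confusion matrix (counts below are made up for illustration).
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([1, 1, 1, 0, 0, 0, 0, 1])  # 1 = COVID-19 positive
y_pred = np.array([1, 1, 1, 0, 0, 0, 1, 1])

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
sensitivity = tp / (tp + fn)   # recall for the positive class
specificity = tn / (tn + fp)
ppv = tp / (tp + fp)           # positive predictive value (precision)
npv = tn / (tn + fn)           # negative predictive value
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(f"acc={accuracy:.3f} sens={sensitivity:.3f} spec={specificity:.3f} "
      f"ppv={ppv:.3f} npv={npv:.3f}")
```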


10.2196/28212
2021
Vol 9 (8)
pp. e28212
Author(s):
Peng Wang
Yunyan Hu
Shaochen Bai
Shiyi Zou

Background Ontology matching seeks to find semantic correspondences between ontologies. With an increasing number of biomedical ontologies being developed independently, matching these ontologies to solve the interoperability problem has become a critical task in biomedical applications. However, some challenges remain. First, extracting and constructing matching clues from biomedical ontologies is a nontrivial problem. Second, it is unknown whether there are dominant matchers for biomedical ontologies. Finally, ontology matching also suffers from high computational complexity owing to the large size of biomedical ontologies. Objective To investigate the effectiveness of matching clues and composite match approaches, this paper presents a spectrum of matchers with different combination strategies and empirically studies their influence on matching biomedical ontologies. In addition, extended reduction anchors are introduced to effectively decrease the time complexity of matching large biomedical ontologies. Methods Atomic and composite matching clues are first constructed in 4 dimensions: terminology, structure, external knowledge, and representation learning. Then, a spectrum of matchers based on flexible combinations of atomic clues is designed and used to comprehensively study their effectiveness, and we carry out a systematic comparative evaluation of different combinations of matchers. Finally, extended reduction anchors are proposed to significantly alleviate the time complexity of matching large-scale biomedical ontologies. Results Experimental results show that considering distinguishable matching clues in biomedical ontologies leads to a substantial improvement over using all available information indiscriminately. Moreover, incorporating different types of matchers according to their reliability yields a marked improvement that is comparable to state-of-the-art methods. The dominant matchers achieve F1 measures of 0.9271, 0.8218, and 0.5 on the Anatomy, FMA-NCI (Foundational Model of Anatomy-National Cancer Institute), and FMA-SNOMED data sets, respectively. Extended reduction anchors are able to solve the scalability problem of matching large biomedical ontologies: they achieve a significant reduction in running time with little loss of F1 measure (a 0.21% decrease on the Anatomy data set and a 0.84% decrease on the FMA-NCI data set, but a 2.65% increase on the FMA-SNOMED data set). Conclusions This paper systematically analyzes and compares the effectiveness of different matching clues, matchers, and combination strategies. Multiple empirical studies demonstrate that distinguishing clues have significant implications for matching biomedical ontologies. In contrast to matchers based on a single clue, those combining multiple clues exhibit more stable and accurate performance. In addition, our results provide evidence that the approach based on extended reduction anchors performs well for large ontology matching tasks, demonstrating an effective solution to the problem.
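
A hedged sketch of the composite-matcher idea follows: a terminology clue (string similarity) and a representation-learning clue (embedding cosine) combined by a weighted sum, with illustrative weights, threshold, and concept labels rather than the paper's tuned values.

```python
# Illustrative composite matcher over two clue types (toy data: random
# embeddings, invented concept labels, arbitrary weights and threshold).
import difflib
import numpy as np

rng = np.random.default_rng(2)
emb = {name: rng.normal(size=50) for name in
       ["Heart", "Cardiac structure", "Kidney", "Renal structure"]}

def term_sim(a, b):
    """Terminology clue: normalized string similarity."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def emb_sim(a, b):
    """Representation-learning clue: embedding cosine similarity."""
    u, v = emb[a], emb[b]
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def composite_sim(a, b, w_term=0.6, w_emb=0.4):
    return w_term * term_sim(a, b) + w_emb * emb_sim(a, b)

for a in ["Heart", "Kidney"]:
    for b in ["Cardiac structure", "Renal structure"]:
        s = composite_sim(a, b)
        print(f"{a:>7} ~ {b:<16} score={s:.3f}", "MATCH" if s > 0.5 else "")
```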


Animals
2019
Vol 9 (7)
pp. 470
Author(s):
Xiaoping Huang
Zelin Hu
Xiaorun Wang
Xuanjiang Yang
Jian Zhang
...

The body condition score (BCS) is an important parameter that is highly correlated with the health status of a dairy cow, metabolic disorders, and milk composition during the production period. Traditional BCS evaluation methods rely on veterinary experts or skilled staff to look at a cow and touch it; these methods have low efficiency, especially on large-scale farms. Computer vision methods are widely used, but there is still room to improve BCS accuracy. In this study, a low-cost BCS evaluation method based on deep learning and machine vision is proposed. First, the back-view images of the cows are captured by network cameras, resulting in 8,972 images that constitute the sample data set. The camera is a common 2D camera, which is cheaper and easier to install than 3D cameras. Second, key body parts such as the tail, pins, and rump in the images were labeled manually, and the Single Shot MultiBox Detector (SSD) method was used to detect the tail and evaluate the BCS. Inspired by DenseNet and Inception-v4, a new SSD was introduced by changing the network connection method of the original SSD. Finally, the experiments show that the improved SSD method can achieve 98.46% classification accuracy and 89.63% location accuracy, with (1) a faster detection speed of 115 fps and (2) a smaller model size of 23.1 MB compared with the original SSD and YOLO-v3. These are significant advantages for reducing hardware costs.
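
Location accuracy for a detector of this kind is typically scored by intersection-over-union (IoU) against ground-truth boxes; the sketch below assumes a 0.5 IoU threshold (the paper's criterion may differ) and uses made-up boxes.

```python
# Sketch of IoU-based location scoring for a detector (toy boxes).
def iou(box_a, box_b):
    """Boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Hypothetical ground-truth and predicted tail-region boxes.
gt_boxes = [(10, 10, 50, 60), (100, 20, 150, 80)]
pred_boxes = [(12, 12, 52, 58), (90, 25, 140, 85)]

# A prediction counts as correct when IoU with the ground truth >= 0.5.
correct = sum(iou(g, p) >= 0.5 for g, p in zip(gt_boxes, pred_boxes))
print(f"location accuracy: {correct / len(gt_boxes):.2%}")
```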


2021
Author(s):
Norberto Sánchez-Cruz
Jose L. Medina-Franco

Epigenetic targets are a significant focus for drug discovery research, as demonstrated by the eight approved epigenetic drugs for the treatment of cancer and the increasing availability of chemogenomic data related to epigenetics. These data represent a large number of structure-activity relationships that have not been exploited thus far for the development of predictive models to support medicinal chemistry efforts. Herein, we report the first large-scale study of 26,318 compounds with a quantitative measure of biological activity for 55 protein targets with epigenetic activity. Through a systematic comparison of machine learning models trained on molecular fingerprints of different design, we built predictive models with high accuracy for the epigenetic target profiling of small molecules. The models were thoroughly validated, showing mean precisions up to 0.952 for the epigenetic target prediction task. Our results indicate that the models reported herein have considerable potential to identify small molecules with epigenetic activity. Therefore, our results were implemented as a freely accessible and easy-to-use web application.
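
A minimal sketch of the fingerprint-plus-classifier setup (not the study's code) follows, using RDKit Morgan fingerprints and a random forest; the SMILES strings and activity labels are toy stand-ins, not the study's 26,318-compound data set.

```python
# Illustrative fingerprint-based activity prediction for one target.
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem
from sklearn.ensemble import RandomForestClassifier

smiles = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O", "CCN"]
active = np.array([0, 1, 1, 0])  # hypothetical activity labels

def fingerprint(smi, radius=2, n_bits=2048):
    """Morgan (circular) fingerprint as a binary feature vector."""
    mol = Chem.MolFromSmiles(smi)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    return np.array(fp)

X = np.array([fingerprint(s) for s in smiles])
clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, active)
print("predicted activity:", clf.predict(X))
```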

