Expert Concept-Modeling Ground Truth Construction for Word Embeddings Evaluation in Concept-Focused Domains

Arianna Betti ◽  
Martin Reynaert ◽  
Thijs Ossenkoppele ◽  
Yvette Oortwijn ◽  
Andrew Salway ◽  
2020 ◽  
Mohammed Ibrahim ◽  
Susan Gauch ◽  
Omar Salman ◽  
Mohammed Alqahtani

BACKGROUND Clear language makes communication easier between any two parties. A layman may have difficulty communicating with a professional because of unfamiliarity with the specialized terms common to the domain. In healthcare, it is rare to find a layman knowledgeable in medical jargon, which can lead to poor understanding of their condition and/or treatment. To bridge this gap, several professional vocabularies and ontologies have been created to map laymen medical terms to professional medical terms and vice versa. OBJECTIVE Many of the presented vocabularies are built manually or semi-automatically, requiring large investments of time and human effort; consequently, these vocabularies grow slowly. In this paper, we present an automatic method to enrich laymen's vocabularies that can be applied to vocabularies in any domain. METHODS Our entirely automatic approach uses machine learning, specifically Global Vectors for Word Embeddings (GloVe), on a corpus collected from a social media healthcare platform to extend and enhance consumer health vocabularies (CHV). Our approach further improves the CHV by incorporating synonyms and hyponyms from the WordNet ontology. The basic GloVe and our novel algorithms incorporating WordNet were evaluated using two laymen datasets from the National Library of Medicine (NLM): the Open-Access Consumer Health Vocabulary (OAC CHV) and the MedlinePlus Healthcare Vocabulary. RESULTS The results show that GloVe was able to find new laymen terms with an F-score of 48.44%. Furthermore, our enhanced GloVe approach outperformed basic GloVe with an average F-score of 61%, a relative improvement of 25%. CONCLUSIONS This paper presents an automatic approach to enrich consumer health vocabularies using GloVe word embeddings and an auxiliary lexical source, WordNet. 
Our approach was evaluated using healthcare text downloaded from, a healthcare social media platform, with two standard laymen vocabularies, OAC CHV and MedlinePlus. We used the WordNet ontology to expand the healthcare corpus by including synonyms, hyponyms, and hypernyms for each CHV layman term occurrence in the corpus. Given a seed term selected from a concept in the ontology, we measured our algorithms’ ability to automatically extract synonyms for those terms that appeared in the ground truth concept. We found that enhanced GloVe outperformed GloVe with a relative improvement of 25% in the F-score.

2021 ◽  
Vol 3 ◽  
Sola S. Shirai ◽  
Oshani Seneviratne ◽  
Minor E. Gordon ◽  
Ching-Hua Chen ◽  
Deborah L. McGuinness

People can effect change in their eating patterns by substituting ingredients in recipes. Such substitutions may be motivated by specific goals, like modifying the intake of a specific nutrient or avoiding a particular category of ingredients. Determining how to modify a recipe can be difficult because people need to 1) identify which ingredients can act as valid replacements for the original and 2) figure out whether the substitution is “good” for their particular context, which may depend on factors such as allergies, the nutritional content of individual ingredients, and other dietary restrictions. We propose an approach that leverages both explicit semantic information about ingredients, encapsulated in a knowledge graph of food, and implicit semantics, captured through word embeddings, to develop a substitutability heuristic that ranks plausible substitute options automatically. Our proposed system also helps determine which ingredient substitution options are “healthy” using nutritional information and food classification constraints. We evaluate our substitutability heuristic, the diet-improvement ingredient substitutability heuristic (DIISH), using a dataset of ground-truth substitutions scraped from ingredient substitution guides and user reviews of recipes, demonstrating that our approach can help reduce the human effort required to make recipes more suitable for specific dietary needs.
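The general idea of combining implicit semantics (embedding similarity) with explicit constraints (food categories) can be sketched as follows. This is a minimal illustration and not the DIISH heuristic itself, whose details the abstract does not give. The function names, the toy three-dimensional vectors, and the category labels are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_substitutes(target, embeddings, categories, banned_categories=frozenset()):
    """Rank candidate ingredients by embedding similarity to `target`,
    skipping any candidate whose category is banned (e.g. an allergen group)."""
    target_vec = embeddings[target]
    scored = [
        (cosine(target_vec, vec), name)
        for name, vec in embeddings.items()
        if name != target and categories[name] not in banned_categories
    ]
    # Highest similarity first.
    return [name for _, name in sorted(scored, reverse=True)]

# Hypothetical toy data: real embeddings would come from a trained model,
# and categories from a food knowledge graph.
embeddings = {
    "butter": [0.9, 0.1, 0.0],
    "cream": [0.85, 0.15, 0.05],
    "margarine": [0.8, 0.2, 0.1],
    "olive oil": [0.6, 0.5, 0.1],
    "sugar": [0.0, 0.1, 0.9],
}
categories = {
    "butter": "dairy",
    "cream": "dairy",
    "margarine": "fat",
    "olive oil": "oil",
    "sugar": "sweetener",
}

ranked = rank_substitutes("butter", embeddings, categories, banned_categories={"dairy"})
print(ranked)
```

Here the category filter stands in for a dietary restriction: "cream" is the closest vector to "butter" but is excluded because dairy is banned.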

2021 ◽  
Aditya Jadhav ◽  
Tarun Kumar ◽  
Mohit Raghavendra ◽  
Tamizhini Loganathan ◽  
Manikandan Narayanan

Motivation: Large volumes of biomedical literature present an opportunity to build whole-body human models comprising both within-tissue and across-tissue interactions among genes. Current studies have mostly focused on identifying within-tissue or tissue-agnostic associations, with a heavy emphasis on associations among diseases, genes and drugs. Literature mining studies that extract relations pertaining to inter-tissue communication, such as between genes and hormones, are sorely missing. Results: We present here a first study to identify from literature the genes involved in inter-tissue signaling via a hormone in the human body. Our models BioEmbedS and BioEmbedS-TS respectively predict whether a hormone-gene pair is associated, and whether an associated gene is involved in the hormone’s production or response. Our models are classifiers trained on word embeddings that we carefully balanced across different strata of the training data, such as production vs. response genes of a hormone or well-studied vs. poorly-represented hormones in the literature. Model training and evaluation are enabled by a unified dataset called HGv1 of ground-truth associations between genes and known endocrine hormones that we compiled. Our models not only recapitulate known gene mediators of tissue-tissue signaling (e.g., at an average accuracy of 70.4% for BioEmbedS), but also predict novel genes involved in inter-tissue communication in humans. Furthermore, the species-agnostic nature of our ground-truth HGv1 data and our predictive modeling approach, demonstrated concretely using human data and generalized to mouse, holds much promise for future work on elucidating inter-tissue signaling in other multi-cellular organisms. Availability: The proposed HGv1 dataset along with our models’ predictions, and the associated code to reproduce this work, are available respectively at, and[email protected]

2021 ◽  
Vol 7 ◽  
pp. e668
Mohammed Ibrahim ◽  
Susan Gauch ◽  
Omar Salman ◽  
Mohammed Alqahtani

Background Clear language makes communication easier between any two parties. A layman may have difficulty communicating with a professional because of unfamiliarity with the specialized terms common to the domain. In healthcare, it is rare to find a layman knowledgeable in medical terminology, which can lead to poor understanding of their condition and/or treatment. To bridge this gap, several professional vocabularies and ontologies have been created to map laymen medical terms to professional medical terms and vice versa. Objective Many of the presented vocabularies are built manually or semi-automatically, requiring large investments of time and human effort; consequently, these vocabularies grow slowly. In this paper, we present an automatic method to enrich laymen’s vocabularies that can be applied to vocabularies in any domain. Methods Our entirely automatic approach uses machine learning, specifically Global Vectors for Word Embeddings (GloVe), on a corpus collected from a social media healthcare platform to extend and enhance consumer health vocabularies. Our approach further improves the consumer health vocabularies by incorporating synonyms and hyponyms from the WordNet ontology. The basic GloVe and our novel algorithms incorporating WordNet were evaluated using two laymen datasets from the National Library of Medicine (NLM): the Open-Access Consumer Health Vocabulary (OAC CHV) and the MedlinePlus Healthcare Vocabulary. Results The results show that GloVe was able to find new laymen terms with an F-score of 48.44%. Furthermore, our enhanced GloVe approach outperformed basic GloVe with an average F-score of 61%, a relative improvement of 25%. The improvement of enhanced GloVe was statistically significant on both ground truth datasets (P < 0.001). Conclusions This paper presents an automatic approach to enrich consumer health vocabularies using GloVe word embeddings and an auxiliary lexical source, WordNet. 
Our approach was evaluated using healthcare text downloaded from, a healthcare social media platform, with two standard laymen vocabularies, OAC CHV and MedlinePlus. We used the WordNet ontology to expand the healthcare corpus by including synonyms, hyponyms, and hypernyms for each layman term occurrence in the corpus. Given a seed term selected from a concept in the ontology, we measured our algorithms’ ability to automatically extract synonyms for those terms that appeared in the ground truth concept. We found that enhanced GloVe outperformed GloVe with a relative improvement of 25% in the F-score.
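The evaluation described above, in which extracted terms are scored against a ground-truth concept and two F-scores are then compared, can be illustrated with a short sketch. It assumes the standard set-based definitions of precision, recall, and F1. The abstract does not specify the authors' exact scoring procedure, and the example terms are hypothetical.

```python
def f_score(candidates, ground_truth):
    """Return (precision, recall, F1) for extracted terms scored
    against the terms of one ground-truth concept."""
    candidates, ground_truth = set(candidates), set(ground_truth)
    true_positives = len(candidates & ground_truth)
    precision = true_positives / len(candidates) if candidates else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

def relative_improvement(baseline, improved):
    """Fractional gain of `improved` over `baseline`."""
    return (improved - baseline) / baseline

# Hypothetical ground-truth concept and hypothetical extracted candidates.
truth = {"nosebleed", "nose bleed", "bloody nose", "epistaxis"}
extracted = {"nose bleed", "bloody nose", "headache"}

precision, recall, f1 = f_score(extracted, truth)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")

# F-scores reported in the abstract: 48.44% (basic GloVe) and 61% (enhanced GloVe).
gain = relative_improvement(48.44, 61.0)
print(f"relative improvement = {gain:.1%}")
```

On the rounded figures quoted in the abstract, the relative gain comes out at roughly 26%, close to the reported 25%.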

Methodology ◽  
2019 ◽  
Vol 15 (Supplement 1) ◽  
pp. 43-60 ◽  
Florian Scharf ◽  
Steffen Nestler

Abstract. It is challenging to apply exploratory factor analysis (EFA) to event-related potential (ERP) data because such data are characterized by substantial temporal overlap (i.e., large cross-loadings) between the factors, and because researchers are typically interested in the results of subsequent analyses (e.g., experimental condition effects on the level of the factor scores). In this context, relatively small deviations of the estimated factor solution from the unknown ground truth may result in substantially biased estimates of condition effects (rotation bias). Thus, in order to apply EFA to ERP data, researchers need rotation methods that are able both to recover perfect simple structure where it exists and to tolerate substantial cross-loadings between the factors where appropriate. We had two aims in the present paper. First, to extend previous research, we wanted to better understand the behavior of the rotation bias for typical ERP data. To this end, we compared the performance of a variety of factor rotation methods under conditions of varying amounts of temporal overlap between the factors. Second, we wanted to investigate whether the recently proposed component loss rotation is better able to decrease the bias than traditional simple structure rotation. The results showed that no single rotation method was generally superior across all conditions. Component loss rotation showed the best all-round performance across the investigated conditions. We conclude that component loss rotation is a suitable alternative to simple structure rotation. We discuss this result in the light of recently proposed sparse factor analysis approaches.

2020 ◽  
Vol 77 (4) ◽  
pp. 1609-1622
Franziska Mathies ◽  
Catharina Lange ◽  
Anja Mäurer ◽  
Ivayla Apostolova ◽  
Susanne Klutmann ◽  

Background: Positron emission tomography (PET) of the brain with 2-[F-18]-fluoro-2-deoxy-D-glucose (FDG) is widely used for the etiological diagnosis of clinically uncertain cognitive impairment (CUCI). Acute full-blown delirium can cause reversible alterations of FDG uptake that mimic neurodegenerative disease. Objective: This study tested whether delirium in remission affects the performance of FDG PET for differentiation between neurodegenerative and non-neurodegenerative etiology of CUCI. Methods: The study included 88 patients (82.0±5.7 y) with newly detected CUCI during hospitalization in a geriatric unit. Twenty-seven (31%) of the patients were diagnosed with delirium during their current hospital stay, which, however, at time of enrollment was in remission so that delirium was not considered the primary cause of the CUCI. Cases were categorized as neurodegenerative or non-neurodegenerative etiology based on visual inspection of FDG PET. The diagnosis at clinical follow-up after ≥12 months served as ground truth to evaluate the diagnostic performance of FDG PET. Results: FDG PET was categorized as neurodegenerative in 51 (58%) of the patients. Follow-up after 16±3 months was obtained in 68 (77%) of the patients. The clinical follow-up diagnosis confirmed the FDG PET-based categorization in 60 patients (88%, 4 false negative and 4 false positive cases with respect to detection of neurodegeneration). The fraction of correct PET-based categorization did not differ between patients with delirium in remission and patients without delirium (86% versus 89%, p = 0.666). Conclusion: Brain FDG PET is useful for the etiological diagnosis of CUCI in hospitalized geriatric patients, as well as in patients with delirium in remission.

2020 ◽  
Vol 64 (5) ◽  
pp. 50411-1-50411-8
Hoda Aghaei ◽  
Brian Funt

Abstract. For research in the field of illumination estimation and color constancy, there is a need for ground-truth measurement of the illumination color at many locations within multi-illuminant scenes. A practical approach to obtaining such ground-truth illumination data is presented here. The proposed method involves using a drone to carry a gray ball of known percent surface spectral reflectance throughout a scene while photographing it frequently during the flight using a calibrated camera. The captured images are then post-processed. In the post-processing step, machine vision techniques are used to detect the gray ball within each frame. The camera RGB of light reflected from the gray ball provides a measure of the illumination color at that location. In total, the dataset contains 30 scenes with 100 illumination measurements on average per scene. The dataset is available for download free of charge.
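The last measurement step, turning the camera RGB of the gray ball into an illumination color, might look like the sketch below. It assumes the ball has already been detected, that the pixel values are linear camera RGB, and that the ball is spectrally neutral, so the mean RGB is proportional to the illuminant. The function name and pixel values are hypothetical, and the abstract does not describe the authors' actual post-processing.

```python
def illumination_chromaticity(ball_pixels):
    """Estimate illumination chromaticity (r, g, b summing to 1) from the
    RGB values of pixels that lie on a neutral gray ball."""
    if not ball_pixels:
        raise ValueError("no gray-ball pixels supplied")
    n = len(ball_pixels)
    # Average each channel over the ball to suppress sensor noise.
    mean_rgb = [sum(pixel[c] for pixel in ball_pixels) / n for c in range(3)]
    total = sum(mean_rgb)
    if total == 0:
        raise ValueError("gray ball is completely dark; chromaticity undefined")
    # Normalizing removes overall intensity and keeps only the color.
    return tuple(channel / total for channel in mean_rgb)

# Hypothetical linear RGB samples from the detected ball region in one frame.
pixels = [(200, 160, 120), (210, 170, 130), (190, 150, 110)]
r, g, b = illumination_chromaticity(pixels)
print(f"r={r:.4f} g={g:.4f} b={b:.4f}")
```

Working in chromaticity rather than raw RGB makes measurements from different frames comparable even when exposure or distance to the light source changes.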

2020 ◽  
Jingbai Li ◽  
Patrick Reiser ◽  
André Eberhard ◽  
Pascal Friederich ◽  
Steven Lopez

Photochemical reactions are being increasingly used to construct complex molecular architectures with mild and straightforward reaction conditions. Computational techniques are increasingly important to understand the reactivities and chemoselectivities of photochemical isomerization reactions because they offer molecular bonding information along the excited-state(s) of photodynamics. These photodynamics simulations are resource-intensive and are typically limited to 1–10 picoseconds and 1,000 trajectories due to high computational cost. Most organic photochemical reactions have excited-state lifetimes exceeding 1 picosecond, which places them outside possible computational studies. Westermayr et al. demonstrated that a machine learning approach could significantly lengthen photodynamics simulation times for a model system, methylenimmonium cation (CH2NH2+).
We have developed a Python-based code, Python Rapid Artificial Intelligence Ab Initio Molecular Dynamics (PyRAI2MD), to accomplish an unprecedented 10 ns cis-trans photodynamics simulation of trans-hexafluoro-2-butene (CF3–CH=CH–CF3) in 3.5 days. The same simulation would take approximately 58 years with ground-truth multiconfigurational dynamics. We proposed an innovative scheme combining Wigner sampling, geometrical interpolations, and short-time quantum chemical trajectories to effectively sample the initial data, facilitating the adaptive sampling to generate an informative and data-efficient training set with 6,232 data points. Our neural networks achieved chemical accuracy (mean absolute error of 0.032 eV). Our 4,814 trajectories reproduced the S1 half-life (60.5 fs) and the photochemical product ratio (trans:cis = 2.3:1), and autonomously discovered a pathway towards a carbene. The neural networks have also shown the capability of generalizing the full potential energy surface with chemically incomplete data (trans → cis but not cis → trans pathways), which may enable future automated photochemical reaction discoveries.
