Machine learning with biomedical ontologies

Author(s):  
Maxat Kulmanov ◽  
Fatima Zohra Smaili ◽  
Xin Gao ◽  
Robert Hoehndorf

Ontologies have long been employed in the life sciences to formally represent and reason over domain knowledge, and they are employed in almost every major biological database. Recently, ontologies have increasingly been used to provide background knowledge in similarity-based analyses and machine learning models. The methods for combining ontologies and machine learning are still novel and under active development. We provide an overview of the methods that use ontologies to compute similarity and incorporate them in machine learning methods; in particular, we outline how semantic similarity measures and ontology embeddings can exploit the background knowledge in biomedical ontologies, and how ontologies can provide constraints that improve machine learning models. The methods and experiments we describe are available as a set of executable notebooks, and we also provide a set of slides and additional resources at https://github.com/bio-ontology-research-group/machine-learning-with-ontologies.

Key points

Ontologies provide background knowledge that can be exploited in machine learning models.

Ontology embeddings are structure-preserving maps from ontologies into vector spaces and provide an important method for utilizing ontologies in machine learning. Embeddings can preserve different structures in ontologies, including their graph structure, syntactic regularities, or their model-theoretic semantics.

Axioms in ontologies, in particular those involving negation, can be used as constraints in optimization and machine learning to reduce the search space.
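One of the simplest semantic similarity measures the abstract alludes to compares the sets of superclasses two terms inherit in the ontology. The sketch below is our own toy illustration, not code from the paper: the ontology, its class names, and the Jaccard-style measure are hypothetical stand-ins (real measures such as Resnik's also weight terms by information content).

```python
# Toy sketch: ancestor-set similarity over a tiny hand-made is-a hierarchy.
# The ontology and term names below are invented for illustration.

TOY_IS_A = {                      # child -> list of parents (subsumption edges)
    "apoptosis": ["cell_death"],
    "necrosis": ["cell_death"],
    "cell_death": ["biological_process"],
    "biological_process": [],
}

def ancestors(term, onto):
    """All superclasses of `term` in the hierarchy, including the term itself."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in onto[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def jaccard_similarity(t1, t2, onto):
    """Overlap of ancestor sets: 1.0 for identical terms, lower for distant ones."""
    a1, a2 = ancestors(t1, onto), ancestors(t2, onto)
    return len(a1 & a2) / len(a1 | a2)
```

Sibling terms that share most of their ancestry (here, apoptosis and necrosis) score higher than unrelated terms, which is the intuition ontology embeddings generalize into vector spaces.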



Author(s):  
Or Biran ◽  
Kathleen McKeown

Human decision makers in many domains can make use of predictions made by machine learning models in their decision-making process, but the usability of these predictions is limited if the human is unable to justify their trust in the prediction. We propose a novel approach to producing justifications that is geared towards users without machine learning expertise, focusing on domain knowledge and human reasoning, and utilizing natural language generation. Through a task-based experiment, we show that our approach significantly helps humans to correctly decide whether or not predictions are accurate, and significantly increases their satisfaction with the justification.


2018 ◽  
Author(s):  
Maxat Kulmanov ◽  
Senay Kafkas ◽  
Andreas Karwath ◽  
Alexander Malic ◽  
Georgios V Gkoutos ◽  
...  

AbstractRecent developments in machine learning have lead to a rise of large number of methods for extracting features from structured data. The features are represented as a vectors and may encode for some semantic aspects of data. They can be used in a machine learning models for different tasks or to compute similarities between the entities of the data. SPARQL is a query language for structured data originally developed for querying Resource Description Framework (RDF) data. It has been in use for over a decade as a standardized NoSQL query language. Many different tools have been developed to enable data sharing with SPARQL. For example, SPARQL endpoints make your data interoperable and available to the world. SPARQL queries can be executed across multiple endpoints. We have developed a Vec2SPARQL, which is a general framework for integrating structured data and their vector space representations. Vec2SPARQL allows jointly querying vector functions such as computing similarities (cosine, correlations) or classifications with machine learning models within a single SPARQL query. We demonstrate applications of our approach for biomedical and clinical use cases. Our source code is freely available at https://github.com/bio-ontology-research-group/vec2sparql and we make a Vec2SPARQL endpoint available at http://sparql.bio2vec.net/.
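The core vector function a Vec2SPARQL-style endpoint exposes is a similarity over stored entity vectors. The following sketch is purely illustrative and assumes nothing about Vec2SPARQL's actual API: the entity IRIs and vectors are invented, and `most_similar` stands in for what a similarity-ranking SPARQL extension function might compute server-side.

```python
# Illustrative sketch (hypothetical data): cosine similarity over entity
# vectors, the kind of function a vector-aware SPARQL endpoint could expose.

import math

VECTORS = {                       # entity IRI -> feature vector (made up)
    "ex:geneA": [1.0, 0.0, 1.0],
    "ex:geneB": [1.0, 0.0, 1.0],
    "ex:geneC": [0.0, 1.0, 0.0],
}

def cosine(u, v):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def most_similar(entity, k=2):
    """Rank the other entities by cosine similarity to `entity`."""
    query = VECTORS[entity]
    others = [(e, cosine(query, v)) for e, v in VECTORS.items() if e != entity]
    return sorted(others, key=lambda pair: pair[1], reverse=True)[:k]
```

In the framework described above, such a ranking would be requested inside a single SPARQL query rather than in application code, so graph patterns and vector similarity can be combined in one filter.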


2020 ◽  
Author(s):  
Robin Whytock ◽  
Jędrzej Świeżewski ◽  
Joeri A. Zwerts ◽  
Tadeusz Bara-Słupski ◽  
Aurélie Flore Koumba Pambo ◽  
...  

Abstract Ecological data are increasingly collected over vast geographic areas using arrays of digital sensors. Camera trap arrays have become the 'gold standard' method for surveying many terrestrial mammals and birds, but these arrays often generate millions of images that are challenging to process. This causes significant latency between data collection and subsequent inference, which can impede conservation at a time of ecological crisis. Machine learning algorithms have been developed to improve camera trap data processing speeds, but these models are not considered accurate enough for fully automated labeling of images.

Here, we present a new approach to building and testing a high-performance machine learning model for fully automated labeling of camera trap images. As a case study, the model classifies 26 Central African forest mammal and bird species (or groups). The model was trained on a relatively small dataset (c. 300,000 images) but generalizes to fully independent data and outperforms humans in several respects (e.g. detecting 'invisible' animals). We show how the model's precision and accuracy can be evaluated in an ecological modeling context by comparing species richness, activity patterns (n = 4 species tested) and occupancy (n = 4 species tested) derived from machine learning labels with the same estimates derived from expert labels.

Results show that fully automated labels can be equivalent to expert labels when calculating species richness and activity patterns (n = 4 species tested) and when estimating occupancy (n = 3 of 4 species tested) in completely out-of-sample test data (n = 227 camera stations, n = 23,868 images). Simple thresholding (discarding uncertain labels) improved the model's performance when calculating activity patterns and estimating occupancy, but did not improve estimates of species richness. We provide the user community with a multi-platform, multi-language user interface for running the model offline, and conclude that high-performance machine learning models can fully automate labeling of camera trap data.
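The "simple thresholding" step mentioned above amounts to accepting a model label only when its confidence clears a cutoff and deferring the rest to human review. The sketch below is our own minimal rendering of that idea, not the study's code; the image IDs, species labels, scores, and the 0.9 cutoff are all invented for illustration.

```python
# Hedged sketch of confidence thresholding for automated image labels.
# Predictions below a cutoff are deferred rather than discarded silently.

def threshold_labels(predictions, cutoff=0.9):
    """predictions: iterable of (image_id, label, confidence) triples.
    Returns (accepted, deferred): labels kept automatically vs. flagged
    for expert review."""
    accepted, deferred = [], []
    for image_id, label, confidence in predictions:
        if confidence >= cutoff:
            accepted.append((image_id, label))
        else:
            deferred.append((image_id, label))
    return accepted, deferred
```

Raising the cutoff trades coverage for label quality, which is consistent with the finding that thresholding helped activity and occupancy estimates but not species richness, where rare detections matter most.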


2021 ◽  
Author(s):  
Katie Walker ◽  
Jirayus Jiarpakdee ◽  
Anne Loupis ◽  
Chakkrit Tantithamthavorn ◽  
Keith Joe ◽  
...  

Abstract

Objective: Patients, families and community members would like emergency department wait time visibility, which would improve patient journeys through emergency medicine. The study objective was to derive, and internally and externally validate, machine learning models to predict emergency patient wait times that are applicable to a wide variety of emergency departments.

Methods: Twelve emergency departments provided three years of retrospective administrative data from Australia (2017-19). Descriptive and exploratory analyses were undertaken on the datasets. Statistical and machine learning models were developed to predict wait times at each site and were internally and externally validated. Model performance was tested on COVID-19 period data (January to June 2020).

Results: There were 1,930,609 patient episodes analysed and median site wait times varied from 24 to 54 minutes. Individual site model prediction median absolute errors varied from +/-22.6 minutes (95% CI 22.4, 22.9) to +/-44.0 minutes (95% CI 43.4, 44.4). Global model prediction median absolute errors varied from +/-33.9 minutes (95% CI 33.4, 34.0) to +/-43.8 minutes (95% CI 43.7, 43.9). Random forest and linear regression models performed best; rolling average models under-estimated wait times. Important variables were triage category, last-k patient average wait time, and arrival time. Wait time prediction models are not transferable across hospitals. Models performed well during the COVID-19 lockdown period.

Conclusions: Electronic emergency demographic and flow information can be used to approximate emergency patient wait times. A general model is less accurate if applied without site-specific factors.

What is already known on this subject: Patients and families want to know approximate emergency wait times, which will improve their ability to manage their logistical, physical and emotional needs whilst waiting. There are a few small studies from a limited number of jurisdictions reporting model methods, important predictor variables and the accuracy of derived models.

What this study adds: Our study demonstrates that predicting wait times from simple, readily available data is complex and provides estimates that are not as accurate as patients would like; however, rough estimates may still be better than no information. We present the most influential variables regarding wait times and advise against using rolling average models, preferring random forest or linear regression techniques. Emergency medicine machine learning models may be less generalisable to other sites than we hope for when we read manuscripts or buy commercial off-the-shelf models or algorithms. Models developed for one site lose accuracy at another site, and global models built for whole systems may need customisation to each individual site. This may apply to data science clinical decision instruments as well as operational machine learning models.
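Two concrete quantities from the abstract above are easy to make precise: the median absolute error used to score predictions, and the "last-k patient average wait time" the models found informative. The sketch below is an assumed reconstruction, not the study's code, and the wait times in the test are invented.

```python
# Sketch (our assumption of the definitions, not the paper's code):
# the error metric and one predictor feature from the wait-time study.

import statistics

def median_abs_error(actual, predicted):
    """Median of |actual - predicted| over paired observations, in minutes."""
    return statistics.median(abs(a - p) for a, p in zip(actual, predicted))

def last_k_average_wait(wait_history, k=5):
    """Mean wait time of the k most recently completed patient episodes,
    a simple flow signal available at prediction time."""
    recent = wait_history[-k:]
    return sum(recent) / len(recent)
```

Median absolute error is robust to the long right tail of wait-time distributions, which is presumably why it was preferred over a mean-based metric.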


2018 ◽  
Vol 10 (1) ◽  
Author(s):  
Wang-Chi Cheung ◽  
Weiwen Zhang ◽  
Yong Liu ◽  
Feng Yang ◽  
Rick-Siow-Mong Goh

Recent studies have revealed the success of data-driven machine health monitoring, which motivates the use of machine learning models in machine health prognostic tasks. While the machine learning approach to health monitoring is gaining importance, the construction of machine learning models is often impeded by the difficulty of choosing the underlying hyper-parameter configuration (HP-config), which governs the construction of the machine learning model. While an effective choice of HP-config can be achieved with human effort, such effort is often time-consuming and requires domain knowledge. In this paper, we consider the use of Bayesian optimization algorithms, which automate an effective choice of HP-config by solving the associated hyper-parameter optimization problem. Numerical experiments on data from the PHM 2016 Data Challenge demonstrate the salience of the proposed automatic framework, and exhibit improvement over default HP-configs in standard machine learning packages or those chosen by a human agent.
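The hyper-parameter optimization problem above has the shape of a sequential search: propose a configuration, evaluate the model, keep the best. The sketch below is a deliberately simplified stand-in, not Bayesian optimization proper: the `objective` is an invented proxy for validation loss, and `propose` samples at random where a real Bayesian optimizer would use a surrogate model (e.g. a Gaussian process) to pick promising configurations.

```python
# Simplified sequential hyper-parameter search (a stand-in for Bayesian
# optimization; the objective and search space below are invented).

import random

def objective(hp):
    """Pretend validation loss, minimized when lr is near 0.1."""
    return (hp["lr"] - 0.1) ** 2

def propose(rng):
    """Random proposal; a Bayesian optimizer would use a surrogate here."""
    return {"lr": rng.uniform(0.0, 1.0)}

def optimize(n_trials=50, seed=0):
    """Evaluate n_trials configurations and return the best one found."""
    rng = random.Random(seed)
    best_hp, best_loss = None, float("inf")
    for _ in range(n_trials):
        hp = propose(rng)
        loss = objective(hp)
        if loss < best_loss:
            best_hp, best_loss = hp, loss
    return best_hp, best_loss
```

The appeal of the Bayesian variant is sample efficiency: when each `objective` call means training a prognostic model, spending proposals wisely matters far more than in this toy.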


Author(s):  
Le-Wen Cai ◽  
Wang-Zhou Dai ◽  
Yu-Xuan Huang ◽  
Yu-Feng Li ◽  
Stephen Muggleton ◽  
...  

Abductive Learning is a framework that combines machine learning with first-order logical reasoning. It allows machine learning models to exploit complex symbolic domain knowledge represented by first-order logic rules. However, in many applications it is challenging to obtain or express the ground-truth domain knowledge explicitly as first-order logic rules; the only accessible knowledge base is implicitly represented by groundings, i.e., propositions or atomic formulas without variables. This paper proposes Grounded Abductive Learning (GABL) to enhance machine learning models with abductive reasoning over a ground domain knowledge base, which offers inexact supervision through a set of logic propositions. We apply GABL to two weakly supervised learning problems and find that the model's initial accuracy plays a crucial role in learning. Results on a real-world OCR task show that GABL can significantly reduce the effort of data labeling compared with the other methods evaluated.
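The abductive step at the heart of such frameworks can be illustrated with a toy: revise a model's scored predictions to the highest-scoring joint labeling that satisfies a grounded constraint. This sketch is our own construction, not GABL itself; the "knowledge base" is a single invented propositional constraint (the digit sum must be even), and brute-force enumeration stands in for a real abduction procedure.

```python
# Toy abductive repair (our illustration, not GABL): pick the consistent
# joint labeling with the highest total model score.

from itertools import product

def consistent(labels):
    """Stand-in grounded knowledge base: the digits must sum to an even number."""
    return sum(labels) % 2 == 0

def abduce(score_tables):
    """score_tables: one dict per position mapping candidate label -> model score.
    Returns the knowledge-consistent labeling maximizing the summed score."""
    best, best_score = None, float("-inf")
    for labels in product(*(table.keys() for table in score_tables)):
        if not consistent(labels):
            continue
        score = sum(table[label] for table, label in zip(score_tables, labels))
        if score > best_score:
            best, best_score = labels, score
    return best
```

Note how the constraint can overrule a locally confident prediction: the model's top choice for one position may be swapped out so that the labeling as a whole satisfies the knowledge base, which is exactly the inexact supervision signal the abstract describes.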


2020 ◽  
Author(s):  
Yan-Ting Wu ◽  
Chen-Jie Zhang ◽  
Ben Willem Mol ◽  
Cheng Li ◽  
Lei Chen ◽  
...  

Abstract

Aims: Gestational diabetes mellitus (GDM) is a pregnancy-specific disorder that can usually be diagnosed after 24 gestational weeks. So far, there is no accurate method to predict GDM in early pregnancy.

Methods: We collected data extracted from the hospital's electronic medical record system, comprising 73 features in the first trimester. We also recorded the occurrence of GDM, diagnosed at 24-28 weeks of pregnancy. We applied a feature selection method to select a panel of the most discriminative features. We then developed machine learning models, using Deep Neural Network (DNN), Support Vector Machine (SVM), K-Nearest Neighbors (KNN), and Logistic Regression (LR), based on these features.

Results: We studied 16,819 women (2,696 GDM) in the training group and 14,992 women (1,837 GDM) in the validation group. DNN, SVM, KNN, and LR models based on the 73-feature set achieved area under the curve (AUC) values of 0.92 (95% CI 0.91, 0.93), 0.82 (95% CI 0.81, 0.83), 0.63 (95% CI 0.62, 0.64), and 0.85 (95% CI 0.84, 0.85), respectively. The 7-feature (selected from the 73-feature set) DNN, SVM, KNN, and LR models achieved AUCs of 0.84 (95% CI 0.83, 0.84), 0.69 (95% CI 0.68, 0.70), 0.68 (95% CI 0.67, 0.69), and 0.84 (95% CI 0.83, 0.85), respectively. The 7-feature LR model had the best Hosmer-Lemeshow test outcome. Notably, the AUCs of existing prediction models did not exceed 0.75.

Conclusions: Our feature selection and machine learning models showed superior predictive power in early GDM detection compared with previous methods; these improved models will better serve clinical practice in preventing GDM.

Research in context

Evidence before this study: A delayed diagnosis of GDM in the third trimester comes too late to prevent exposure of the embryo or fetus to an intrauterine hyperglycemic environment during early pregnancy. Prediction models for gestational diabetes are not uncommon in the literature, but laboratory indicators are rarely included among the predictors. The penetration of AI into the medical field motivated us to introduce it into GDM prediction models.

What is the key question?: Does a GDM prediction model established by machine learning have the ability to surpass the traditional LR model?

Added value of this study: Using machine learning to select features is an effective method. The DNN prediction model has effective discriminative power for predicting GDM in early pregnancy, but it cannot completely replace LR; KNN and SVM performed even worse than LR in this study.

Implications of all the available evidence: The significance of our research lies not only in building a prediction model that surpasses previous ones, but also in demonstrating the advantages and disadvantages of different machine learning methods through a practical case.
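The AUC values the abstract above compares across DNN, SVM, KNN, and LR have a simple probabilistic reading: the chance that a randomly chosen positive (GDM) case receives a higher risk score than a randomly chosen negative one. The sketch below computes exactly that rank statistic; the labels and scores in the test are invented, and real evaluations would use a library implementation (e.g. scikit-learn's `roc_auc_score`).

```python
# Minimal AUC sketch: fraction of (positive, negative) pairs that the
# scorer ranks correctly, with ties counted as half-correct.

def auc(labels, scores):
    """labels: 1 for positive (GDM) cases, 0 for negative; scores: risk scores."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

An AUC of 0.5 means the scores are no better than chance, which puts the reported gap between 0.75 (existing models) and 0.92 (the 73-feature DNN) in perspective.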


2020 ◽  
Vol 2 (1) ◽  
pp. 3-6
Author(s):  
Eric Holloway

Imagination Sampling is the use of a person as an oracle for generating or improving machine learning models. Previous work demonstrated a general system for using Imagination Sampling to obtain multibox models. Here, we explore the possibility of importing such models as the starting point for further automatic enhancement.

