Semi-Automated Identification of Biomedical Literature: A Proof of Concept Study

Abstract Background: The typical approach to literature identification involves two discrete and successive steps: (i) formulating a search strategy (i.e., a set of Boolean queries) and (ii) manually identifying the relevant citations in the corpus returned by the query. We have developed a literature identification system (Pythia) that combines the query formulation and citation screening steps and uses modern approaches to text encoding (dense text embeddings) to represent the text of the citations in a form that can be used by information retrieval and machine learning algorithms. Methods: Pythia combines a set of natural-language questions with machine-learning algorithms to rank all PubMed citations by relevance. Pythia returns the 100 top-ranked citations for all questions combined. These 100 articles are exported, and a human screener adjudicates the relevance of each abstract and tags words that indicate relevance. Pythia then uses the "curated" articles to refine the search and re-rank the abstracts, and a new set of 100 abstracts is exported and screened/tagged. This repeats until convergence (i.e., no further relevant abstracts are retrieved) or for a set number of iterations (batches). Pythia's performance was assessed using seven systematic reviews (three prospectively and four retrospectively). Sensitivity, precision, and the number needed to read (NNR) were calculated for each review. Results: The ability of Pythia to identify the relevant articles (sensitivity) varied across reviews from a low of 0.09 for a sleep apnea review to a high of 0.58 for a diverticulitis review. The NNR, the number of abstracts that a reviewer had to read to find one relevant abstract, was lower than in the manually screened project in four reviews, higher in two, and mixed in one. The reviews that had greater overall sensitivity retrieved more relevant citations in early batches, but neither study design, study size, nor specific key question significantly affected retrieval across all reviews. Conclusions: Future research should explore ways to encode domain knowledge in query formulation, possibly by adding a "reasoning" component to Pythia to elicit more contextual information and by leveraging ontologies and knowledge bases to enrich the questions used in the search.
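The three screening metrics named in the abstract can be illustrated with a minimal sketch. It assumes the usual definitions (sensitivity as the share of all relevant citations that were retrieved, precision as the share of screened abstracts that were relevant, and NNR as the reciprocal of precision); the function name and the counts below are hypothetical and are not taken from the study.

```python
def screening_metrics(retrieved_relevant, total_relevant, total_screened):
    """Return (sensitivity, precision, NNR) for one screening project.

    retrieved_relevant: relevant abstracts found among those screened
    total_relevant:     all relevant abstracts known for the review
    total_screened:     abstracts the reviewer actually read
    """
    sensitivity = retrieved_relevant / total_relevant
    precision = retrieved_relevant / total_screened
    # NNR is the reciprocal of precision; it is undefined (infinite here)
    # when no relevant abstract was found.
    nnr = total_screened / retrieved_relevant if retrieved_relevant else float("inf")
    return sensitivity, precision, nnr


# Hypothetical counts for illustration only.
sens, prec, nnr = screening_metrics(
    retrieved_relevant=12, total_relevant=40, total_screened=300
)
print(f"sensitivity={sens:.2f} precision={prec:.2f} NNR={nnr:.1f}")
```

A lower NNR means the reviewer reads fewer abstracts per relevant hit, which is why the abstract compares NNR against the manually screened projects.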


Land Use/Land Cover Mapping from Airborne Hyperspectral Images with Machine Learning Algorithms and Contextual Information

Geocarto International ◽

10.1080/10106049.2021.1945149 ◽

2021 ◽

pp. 1-40

Author(s):

Ozlem Akar ◽

Esra Tunc Gormus

Keyword(s):

Machine Learning ◽

Land Use ◽

Land Cover ◽

Contextual Information ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Hyperspectral Images ◽

Land Cover Mapping ◽

Land Use Land Cover


Predicting Health Material Cognitive Accessibility Using Multidimensional Semantic Features and Readability Tools as Predicators (Preprint)

10.2196/preprints.29175 ◽

2021 ◽

Author(s):

Meng Ji ◽

Yanmeng Liu ◽

Tianyong Hao

Keyword(s):

Machine Learning ◽

Health Education ◽

Health Information ◽

Domain Knowledge ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Semantic Features ◽

Integrated Models ◽

Advanced Education ◽

Cognitive Accessibility

BACKGROUND Much current research on the understandability of health information uses medical readability formulas (MRF) to assess the cognitive difficulty of health education resources. This rests on an implicit assumption that medical domain knowledge, represented by uncommon words or jargon, forms the sole barrier to health information access among the public. Our study challenged this by showing that, for readers from non-English-speaking backgrounds with higher education attainment, semantic features of English health texts rather than medical jargon can explain the lack of cognitive access to health materials among readers with a better understanding of health terms yet limited exposure to English health education materials. OBJECTIVE Our study explored combining MRF and multidimensional semantic features (MSF) to develop machine learning algorithms that predict the actual level of cognitive accessibility of English health materials on health risks and diseases for specific populations. We compared algorithms for evaluating the cognitive accessibility of specialised health information for non-native English speakers with advanced education levels yet very limited exposure to English health education environments. METHODS We used 108 semantic features to measure the content complexity and accessibility of original English resources. Using 1000 English health texts collected from international health organization websites and rated by international tertiary students, we compared machine learning algorithms (decision tree, SVM, discriminant analysis, ensemble tree, and logistic regression) after automatic hyperparameter optimization (grid search for the hyperparameter combination with the minimal classification error). We applied 10-fold cross-validation on the whole dataset for model training and testing, and calculated the AUC, sensitivity, specificity, and accuracy as measures of model performance.
RESULTS Using two sets of predictor features, the widely tested MRF and the MSF proposed in our study, we developed and compared three sets of machine learning algorithms: the first set used only MRF as predictors, the second set used only MSF, and the third set used both MRF and MSF as integrated models. The results showed that the integrated models outperformed the other two sets in terms of AUC, sensitivity, accuracy, and specificity. CONCLUSIONS Our study showed that the cognitive accessibility of English health texts is not limited to word length and sentence length as conventionally measured by MRF. We compared machine learning algorithms combining MRF and MSF to explore the cognitive accessibility of health information from syntactic and semantic perspectives. The results showed the strength of the integrated models in terms of statistically increased AUC, sensitivity, and accuracy in predicting health resource accessibility for the target readership, indicating that both MRF and MSF contribute to the comprehension of health information, and that for readers with advanced education, semantic features outweigh syntax and domain knowledge.
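The 10-fold cross-validation step described in the methods can be sketched as follows. This is a minimal illustration in plain Python, not the authors' pipeline: the helper names, the single-feature threshold "model", and the synthetic data are all assumptions standing in for the real classifiers and the 1000 rated texts.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(features, labels, train_fn, predict_fn, k=10):
    """Train on k-1 folds, predict the held-out fold, pool all predictions."""
    predictions = [None] * len(labels)
    for held_out in k_fold_indices(len(labels), k):
        held_out_set = set(held_out)
        train_idx = [i for i in range(len(labels)) if i not in held_out_set]
        model = train_fn([features[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        for i in held_out:
            predictions[i] = predict_fn(model, features[i])
    return predictions


def train_threshold(xs, ys):
    """Toy stand-in model: midpoint between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def predict_threshold(threshold, x):
    """Predict class 1 when the feature exceeds the learned threshold."""
    return 1 if x > threshold else 0


# Synthetic, clearly separable data (hypothetical, not from the study).
features = [float(i) for i in range(20)] + [float(i) for i in range(100, 120)]
labels = [0] * 20 + [1] * 20

predictions = cross_validate(features, labels, train_threshold, predict_threshold)
accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)
print(f"pooled 10-fold accuracy: {accuracy:.2f}")
```

Every sample is predicted exactly once by a model that never saw it during training, which is what allows the whole dataset to serve for both training and testing.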


Classification of Children’s Sitting Postures Using Machine Learning Algorithms

Applied Sciences ◽

10.3390/app8081280 ◽

2018 ◽

Vol 8 (8) ◽

pp. 1280 ◽

Cited By ~ 14

Author(s):

Yong Kim ◽

Youngdoo Son ◽

Wonjoon Kim ◽

Byungki Jin ◽

Myung Yun

Keyword(s):

Neural Network ◽

Machine Learning ◽

Monitoring System ◽

Multinomial Logistic Regression ◽

Learning Algorithms ◽

Feedback System ◽

Machine Learning Algorithms ◽

Sensor Data ◽

Future Research ◽

Support Vector

Sitting on a chair in an awkward posture or sitting for a long period of time is a risk factor for musculoskeletal disorders. A postural habit, once formed, cannot be changed easily. It is important to form a proper postural habit from childhood, as lumbar disease caused by improper posture during childhood is very likely to recur. Thus, there is a need for a monitoring system that classifies children's sitting postures. The purpose of this paper is to develop a system for classifying children's sitting postures using machine learning algorithms. The convolutional neural network (CNN) algorithm was used in addition to the conventional algorithms: Naïve Bayes classifier (NB), decision tree (DT), neural network (NN), multinomial logistic regression (MLR), and support vector machine (SVM). To collect data for classifying sitting postures, a sensing cushion was developed by mounting a pressure sensor mat (8 × 8) inside the seat cushion of a children's chair. Ten children participated, and sensor data were collected while each child held a static posture for each of the five prescribed postures. The accuracy of CNN was the highest among the algorithms compared. It is expected that a comprehensive posture monitoring system can be established through future research on enhancing the classification algorithm and providing an effective feedback system.


CoRg: Commonsense Reasoning Using a Theorem Prover and Machine Learning

10.29007/lt5p ◽

2019 ◽

Cited By ~ 1

Author(s):

Sophie Siebert ◽

Frieder Stolzenburg

Keyword(s):

Machine Learning ◽

Question Answering ◽

Learning Algorithms ◽

Knowledge Bases ◽

Black Box ◽

Machine Learning Algorithms ◽

Theorem Prover ◽

Commonsense Reasoning ◽

Probable Answer ◽

Everyday Task

Commonsense reasoning is an everyday task that is intuitive for humans but hard to implement for computers. It requires large knowledge bases to supply the required data, although this data is still incomplete or even inconsistent. While machine learning algorithms perform rather well on these tasks, the reasoning process remains a black box. To close this gap, our system CoRg aims to be an explainable and well-performing system, consisting of both an explainable deductive derivation process and a machine learning part. We conduct our experiments on the COPA question-answering benchmark using the ontologies WordNet, Adimen-SUMO, and ConceptNet. The knowledge is fed into the theorem prover Hyper, and in the end the resulting models are analyzed using machine learning algorithms to derive the most probable answer.


Forecasting Algorithms and Optimization Strategies for Building Energy Management & Demand Response

Proceedings ◽

10.3390/proceedings2151133 ◽

2018 ◽

Vol 2 (15) ◽

pp. 1133 ◽

Cited By ~ 1

Author(s):

Fanlin Meng ◽

Kui Weng ◽

Balsam Shallal ◽

Xiangping Chen ◽

Monjur Mourshed

Keyword(s):

Machine Learning ◽

Energy Management ◽

Demand Response ◽

Learning Algorithms ◽

Building Energy ◽

Machine Learning Algorithms ◽

Future Research ◽

Research Directions ◽

Building Energy Management ◽

Future Research Directions

In this paper, we look at the key forecasting algorithms and optimization strategies for building energy management and demand response management. Through a combined and critical review of forecast learning algorithms and optimization models/algorithms, we identify current research gaps, future research directions, and potential technical routes. More specifically, ensemble/hybrid machine learning algorithms and deep machine learning algorithms are promising for solving challenging energy forecasting problems, while large-scale and distributed optimization algorithms are the future research directions for energy optimization in the context of smart buildings and smart grids.


Developing and evaluating language-based machine learning algorithms for inferring applicant personality in video interviews

10.31234/osf.io/e65qj ◽

2021 ◽

Author(s):

Louis Hickman ◽

Rachel Saef ◽

Vincent Ng ◽

Sang Eun Woo ◽

Louis Tay ◽

...

Keyword(s):

Machine Learning ◽

Personality Traits ◽

Large Scale ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Future Research ◽

Big Five Personality ◽

Test Machine ◽

Self Reports ◽

The Cross

Organizations are increasingly relying on people analytics to aid human resources decision-making. One application involves using machine learning to automatically infer applicant characteristics from employment interview responses. However, management research has provided scant validity evidence to guide organizations' decisions about whether and how best to implement these algorithmic approaches. To address this gap, we use closed vocabulary text mining on mock video interviews to train and test machine learning algorithms for predicting interviewees' self-reported personality traits (automatic personality recognition) and interviewer-rated personality traits (automatic personality perception). We use 10-fold cross-validation to test the algorithms' accuracy for predicting Big Five personality traits across both rating sources. The cross-validated accuracy for predicting self-reports was lower than that found in large-scale investigations using language in social media posts as predictors. The cross-validated accuracy for predicting interviewer ratings of personality was more than double that found for predicting self-reports. We discuss implications for future research and practice.


SciReader: A Cloud-based Recommender System for Biomedical Literature

10.1101/333922 ◽

2018 ◽

Cited By ~ 1

Author(s):

Priya Desai ◽

Natalie Telis ◽

Ben Lehmann ◽

Keith Bettinger ◽

Jonathan K. Pritchard ◽

...

Keyword(s):

Machine Learning ◽

Recommender System ◽

Topic Modeling ◽

Learning Algorithms ◽

Relevant Literature ◽

Machine Learning Algorithms ◽

Biomedical Literature ◽

High Quality ◽

Link Type ◽

Personalized Recommender System

Abstract With the growing number of biomedical papers published each year, keeping up with relevant literature has become increasingly important, yet more challenging. SciReader (www.scireader.com) is a cloud-based personalized recommender system that specifically aims to help biomedical researchers and clinicians identify publications of interest to them. SciReader uses topic modeling and other machine learning algorithms to provide users with recommendations that are recent, relevant, and of high quality.


Anomaly detection in the Zwicky Transient Facility DR3

Monthly Notices of the Royal Astronomical Society ◽

10.1093/mnras/stab316 ◽

2021 ◽

Author(s):

K L Malanchev ◽

M V Pruzhinskaya ◽

V S Korolev ◽

P D Aleo ◽

M V Kornilov ◽

...

Keyword(s):

Machine Learning ◽

Anomaly Detection ◽

Domain Knowledge ◽

Learning Algorithms ◽

Machine Learning Algorithms ◽

Light Curves ◽

Image Subtraction ◽

Scientific Application ◽

Public Data ◽

Expert Analysis

Abstract We present results from applying the SNAD anomaly detection pipeline to the third public data release of the Zwicky Transient Facility (ZTF DR3). The pipeline is composed of three stages: feature extraction, a search for outliers with machine learning algorithms, and anomaly identification with follow-up by human experts. Our analysis concentrates on three ZTF fields, comprising more than 2.25 million objects. A set of four automatic learning algorithms was used to identify 277 outliers, which were subsequently scrutinised by an expert. Of these, 188 (68%) were found to be bogus light curves (including effects from the image subtraction pipeline as well as overlaps between a star and a known asteroid), 66 (24%) were previously reported sources, and 23 (8%) correspond to non-catalogued objects, with the latter two cases being of potential scientific interest (e.g. 1 spectroscopically confirmed RS Canum Venaticorum star, 4 supernova candidates, 1 red dwarf flare). Moreover, using results from the expert analysis, we were able to identify a simple bi-dimensional relation that can be used to help filter potentially bogus light curves in future studies. We provide a complete list of objects with potential scientific application so they can be further scrutinised by the community. These results confirm the importance of combining automatic machine learning algorithms with domain knowledge in the construction of recommendation systems for astronomy. Our code is publicly available.
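The breakdown of the 277 outliers can be checked with a few lines of arithmetic. The sketch below only re-derives the percentages quoted in the abstract from its own counts; the dictionary keys are shorthand labels chosen here, not terms from the SNAD pipeline.

```python
# Counts quoted in the abstract; the keys are shorthand labels chosen here.
outliers = {
    "bogus light curves": 188,
    "previously reported sources": 66,
    "non-catalogued objects": 23,
}

total = sum(outliers.values())
# Whole-number percentage of the total for each category.
shares = {name: round(100 * count / total) for name, count in outliers.items()}

for name, count in outliers.items():
    print(f"{name}: {count} of {total} ({shares[name]}%)")
```

The three categories add up to the 277 outliers reported, and the rounded shares match the 68%, 24%, and 8% given in the text.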


Predicting Health Material Cognitive Accessibility for Non-Native English Speakers Using Multidimensional Semantic Features as Predictors of Machine Learning Algorithms (Preprint)

10.2196/preprints.25110 ◽

2021 ◽

Author(s):

Meng Ji ◽

Yanmeng Liu ◽

Tianyong Hao

Keyword(s):

Machine Learning ◽

Health Education ◽

Decision Tree ◽

Health Information ◽

Domain Knowledge ◽

Learning Algorithms ◽

Native English Speakers ◽

Machine Learning Algorithms ◽

Semantic Features ◽

Cognitive Accessibility

BACKGROUND Much current research on the understandability of health information uses medical readability formulas to assess the cognitive difficulty of health education resources. This rests on an implicit assumption that medical domain knowledge, represented by uncommon words or jargon, forms the sole barrier to health information access among the public. Our study challenged this by showing that, for readers from non-English-speaking backgrounds with higher education attainment, semantic features of English health texts, which underpin the knowledge structure of those texts, rather than medical jargon can explain the cognitive accessibility of health materials among readers with a better understanding of English health terms yet very limited exposure to English-based health education environments and traditions. OBJECTIVE Our study explored multidimensional semantic features for developing machine learning algorithms to predict the perceived level of cognitive accessibility of English health materials on health risks and diseases for young adults enrolled in Australian tertiary institutes. We compared algorithms for evaluating the cognitive accessibility of health information for non-native English speakers with advanced education levels yet very limited exposure to English health education environments. METHODS We used 108 semantic features to measure the content complexity and accessibility of original English resources. Using 1000 English health texts collected from Australian and international health organization websites and rated by overseas tertiary students, we compared machine learning algorithms (decision tree, SVM, ensemble tree, logistic regression) after hyperparameter optimization (grid search for the hyperparameter combination with the minimal classification error). We applied 10-fold cross-validation on the whole dataset for model training and testing, and calculated the AUC, sensitivity, specificity, and accuracy as measures of model performance.
RESULTS We developed and compared four machine learning algorithms using multidimensional semantic features as predictors. The results showed that ensemble tree (LogitBoost) performed best in terms of AUC (0.97), sensitivity (0.966), specificity (0.972), and accuracy (0.969). Decision tree followed closely with AUC (0.924), sensitivity (0.912), specificity (0.9358), and accuracy (0.924), and then SVM with AUC (0.8946), sensitivity (0.8952), specificity (0.894), and accuracy (0.8946). Decision tree, ensemble tree, and SVM achieved statistically significant improvements over logistic regression in AUC, specificity, and accuracy. As the best-performing algorithm, ensemble tree achieved a statistically significant improvement over SVM in AUC, specificity, and accuracy, and a statistically significant improvement over decision tree in sensitivity. CONCLUSIONS Our study showed that the cognitive accessibility of English health texts is not limited to word length and sentence length as conventionally measured by medical readability formulas. We compared machine learning algorithms based on semantic features to explore the cognitive accessibility of health information for non-native English speakers. The results showed that the new models reached statistically increased AUC, sensitivity, and accuracy in predicting health resource accessibility for the target readership. Our study illustrated that semantic features, such as those related to cognitive abilities, communicative actions and processes, power relationships in healthcare settings, and the lexical familiarity and diversity of health texts, are large contributors to the comprehension of health information, and that for readers such as international students, semantic features of health texts outweigh syntax and domain knowledge.
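The four reported measures can be illustrated with a small worked sketch. It assumes binary labels (1 for the positive class), a 0.5 decision threshold, and the pairwise-ranking definition of AUC; the function names, scores, and labels are hypothetical and unrelated to the study's data.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def auc(y_true, scores):
    """AUC as the chance that a positive outranks a negative (ties count half)."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical labels and classifier scores for illustration only.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
sensitivity = tp / (tp + fn)          # share of positives correctly flagged
specificity = tn / (tn + fp)          # share of negatives correctly rejected
accuracy = (tp + tn) / len(y_true)    # share of all predictions that are right
area = auc(y_true, scores)

print(f"AUC={area:.3f} sens={sensitivity:.3f} spec={specificity:.3f} acc={accuracy:.3f}")
```

Sensitivity, specificity, and accuracy depend on the chosen threshold, whereas AUC summarizes ranking quality across all thresholds, which is why abstracts of this kind report all four.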


A Study of Machine Learning Algorithms in Speech Recognition and Language Identification System

Innovations in Computer Science and Engineering - Lecture Notes in Networks and Systems ◽

10.1007/978-981-33-4543-0_54 ◽

2021 ◽

pp. 503-513

Author(s):

Aakansha Mathur ◽

Razia Sultana

Keyword(s):

Machine Learning ◽

Speech Recognition ◽

Learning Algorithms ◽

Language Identification ◽

Machine Learning Algorithms ◽

Identification System
