Machine Learning Approaches to Retrieve High-Quality, Clinically Relevant, Evidence from the Biomedical Literature: A Systematic Review (Preprint)

2021 ◽  
Author(s):  
Wael Abdelkader ◽  
Tamara Navarro ◽  
Rick Parrish ◽  
Chris Cotoi ◽  
Federico Germini ◽  
...  

BACKGROUND The rapid growth of the biomedical literature makes identifying strong evidence a time-consuming task. Applying machine learning to the process could be a viable solution that limits effort while maintaining accuracy. OBJECTIVE To summarize the nature and comparative performance of machine learning approaches that have been applied to retrieve high-quality evidence for clinical consideration from the biomedical literature. METHODS We conducted a systematic review of studies that applied machine learning techniques to identify high-quality clinical articles in the biomedical literature. Multiple databases were searched to July 2020. Extracted data focused on the applied machine learning model, steps in the development of the models, and model performance. RESULTS From 3918 retrieved studies, 10 met our inclusion criteria. All followed a supervised machine learning approach and applied, from a limited range of options, a high-quality standard for the training of their model. The results show that machine learning can achieve a sensitivity of 95% while maintaining a high precision of 86%. CONCLUSIONS Applying machine learning to distinguish studies with strong evidence for clinical care has the potential to decrease the workload of manually identifying these. The evidence base is active and evolving. Reported methods were variable across the studies but focused on supervised machine learning approaches. Performance may improve by applying more sophisticated approaches such as active learning, auto-machine learning, and unsupervised machine learning approaches.

2021 ◽  
Author(s):  
Wael Abdelkader ◽  
Tamara Navarro ◽  
Rick Parrish ◽  
Chris Cotoi ◽  
Federico Germini ◽  
...  

UNSTRUCTURED Due to the continued rapid growth in published biomedical literature, it is increasingly difficult to identify and retrieve high-quality evidence. Machine learning approaches have been applied to address this issue. Some models developed using supervised machine learning approaches have achieved high sensitivity or recall, however precision has been variable. In a series of experiments, we will assess the performance of machine learning models to retrieve high-quality, high relevance evidence for clinical consideration from the biomedical literature. The models will be trained using an automated approach applied to a database of almost 100, 000 articles that have been tagged by highly trained research staff based on criteria for high-quality and assessed for clinical relevance by clinicians. We will evaluate and report on the effects of various classifiers, preprocessing steps, feature selection, and the use of balanced vs unbalanced datasets applied during model development on the performance of the derived supervised machine learning models. The series was devised to improve the precision of the retrieval of high-quality articles by applying a machine learning classifier sequentially after using high sensitivity Boolean search filters to an ongoing literature surveillance process. Our multi-level analysis of the various steps of machine learning model development will help expand the existing knowledge base on the effect of each step on the performance of machine learning models.


Computers ◽  
2021 ◽  
Vol 10 (12) ◽  
pp. 157
Author(s):  
Daniel Santos ◽  
José Saias ◽  
Paulo Quaresma ◽  
Vítor Beires Nogueira

Traffic accidents are one of the most important concerns of the world, since they result in numerous casualties, injuries, and fatalities each year, as well as significant economic losses. There are many factors that are responsible for causing road accidents. If these factors can be better understood and predicted, it might be possible to take measures to mitigate the damages and its severity. The purpose of this work is to identify these factors using accident data from 2016 to 2019 from the district of Setúbal, Portugal. This work aims at developing models that can select a set of influential factors that may be used to classify the severity of an accident, supporting an analysis on the accident data. In addition, this study also proposes a predictive model for future road accidents based on past data. Various machine learning approaches are used to create these models. Supervised machine learning methods such as decision trees (DT), random forests (RF), logistic regression (LR), and naive Bayes (NB) are used, as well as unsupervised machine learning techniques including DBSCAN and hierarchical clustering. Results show that a rule-based model using the C5.0 algorithm is capable of accurately detecting the most relevant factors describing a road accident severity. Further, the results of the predictive model suggests the RF model could be a useful tool for forecasting accident hotspots.


2021 ◽  
Vol 297 ◽  
pp. 01073
Author(s):  
Sabyasachi Pramanik ◽  
K. Martin Sagayam ◽  
Om Prakash Jena

Cancer has been described as a diverse illness with several distinct subtypes that may occur simultaneously. As a result, early detection and forecast of cancer types have graced essentially in cancer fact-finding methods since they may help to improve the clinical treatment of cancer survivors. The significance of categorizing cancer suffers into higher or lower-threat categories has prompted numerous fact-finding associates from the bioscience and genomics field to investigate the utilization of machine learning (ML) algorithms in cancer diagnosis and treatment. Because of this, these methods have been used with the goal of simulating the development and treatment of malignant diseases in humans. Furthermore, the capacity of machine learning techniques to identify important characteristics from complicated datasets demonstrates the significance of these technologies. These technologies include Bayesian networks and artificial neural networks, along with a number of other approaches. Decision Trees and Support Vector Machines which have already been extensively used in cancer research for the creation of predictive models, also lead to accurate decision making. The application of machine learning techniques may undoubtedly enhance our knowledge of cancer development; nevertheless, a sufficient degree of validation is required before these approaches can be considered for use in daily clinical practice. An overview of current machine learning approaches utilized in the simulation of cancer development is presented in this paper. All of the supervised machine learning approaches described here, along with a variety of input characteristics and data samples, are used to build the prediction models. In light of the increasing trend towards the use of machine learning methods in biomedical research, we offer the most current papers that have used these approaches to predict risk of cancer or patient outcomes in order to better understand cancer.


Energies ◽  
2020 ◽  
Vol 13 (3) ◽  
pp. 689 ◽  
Author(s):  
Tyler McCandless ◽  
Susan Dettling ◽  
Sue Ellen Haupt

This work compares the solar power forecasting performance of tree-based methods that include implicit regime-based models to explicit regime separation methods that utilize both unsupervised and supervised machine learning techniques. Previous studies have shown an improvement utilizing a regime-based machine learning approach in a climate with diverse cloud conditions. This study compares the machine learning approaches for solar power prediction at the Shagaya Renewable Energy Park in Kuwait, which is in an arid desert climate characterized by abundant sunshine. The regime-dependent artificial neural network models undergo a comprehensive parameter and hyperparameter tuning analysis to minimize the prediction errors on a test dataset. The final results that compare the different methods are computed on an independent validation dataset. The results show that the tree-based methods, the regression model tree approach, performs better than the explicit regime-dependent approach. These results appear to be a function of the predominantly sunny conditions that limit the ability of an unsupervised technique to separate regimes for which the relationship between the predictors and the predictand would differ for the supervised learning technique.


2018 ◽  
Vol 25 (10) ◽  
pp. 1339-1350 ◽  
Author(s):  
Justin Mower ◽  
Devika Subramanian ◽  
Trevor Cohen

Abstract Objective The aim of this work is to leverage relational information extracted from biomedical literature using a novel synthesis of unsupervised pretraining, representational composition, and supervised machine learning for drug safety monitoring. Methods Using ≈80 million concept-relationship-concept triples extracted from the literature using the SemRep Natural Language Processing system, distributed vector representations (embeddings) were generated for concepts as functions of their relationships utilizing two unsupervised representational approaches. Embeddings for drugs and side effects of interest from two widely used reference standards were then composed to generate embeddings of drug/side-effect pairs, which were used as input for supervised machine learning. This methodology was developed and evaluated using cross-validation strategies and compared to contemporary approaches. To qualitatively assess generalization, models trained on the Observational Medical Outcomes Partnership (OMOP) drug/side-effect reference set were evaluated against a list of ≈1100 drugs from an online database. Results The employed method improved performance over previous approaches. Cross-validation results advance the state of the art (AUC 0.96; F1 0.90 and AUC 0.95; F1 0.84 across the two sets), outperforming methods utilizing literature and/or spontaneous reporting system data. Examination of predictions for unseen drug/side-effect pairs indicates the ability of these methods to generalize, with over tenfold label support enrichment in the top 100 predictions versus the bottom 100 predictions. Discussion and Conclusion Our methods can assist the pharmacovigilance process using information from the biomedical literature. Unsupervised pretraining generates a rich relationship-based representational foundation for machine learning techniques to classify drugs in the context of a putative side effect, given known examples.


2018 ◽  
Vol 8 (12) ◽  
pp. 2570 ◽  
Author(s):  
Yves Rybarczyk ◽  
Rasa Zalakeviciute

Current studies show that traditional deterministic models tend to struggle to capture the non-linear relationship between the concentration of air pollutants and their sources of emission and dispersion. To tackle such a limitation, the most promising approach is to use statistical models based on machine learning techniques. Nevertheless, it is puzzling why a certain algorithm is chosen over another for a given task. This systematic review intends to clarify this question by providing the reader with a comprehensive description of the principles underlying these algorithms and how they are applied to enhance prediction accuracy. A rigorous search that conforms to the PRISMA guideline is performed and results in the selection of the 46 most relevant journal papers in the area. Through a factorial analysis method these studies are synthetized and linked to each other. The main findings of this literature review show that: (i) machine learning is mainly applied in Eurasian and North American continents and (ii) estimation problems tend to implement Ensemble Learning and Regressions, whereas forecasting make use of Neural Networks and Support Vector Machines. The next challenges of this approach are to improve the prediction of pollution peaks and contaminants recently put in the spotlights (e.g., nanoparticles).


2021 ◽  
Author(s):  
Pablo Cresta Morgado ◽  
Martín Carusso ◽  
Laura Alonso Alemany ◽  
Laura Acion

In a continuum with applied statistics, machine learning offers a wide variety of tools to explore, analyze, and understand addiction data. These tools include algorithms that can leverage useful information from data to build models. These models are capable of addressing different scientific problems. In this second part of this two-part machine learning review, we develop how to apply machine learning methods. We explain the main limitations of machine learning approaches and ways to address them. Like other analytical tools, machine learning methods require careful implementation to carry out a reproducible and transparent research process with reliable results. This review describes a helpful workflow to guide the application of machine learning. This workflow has several steps: study design, data collection, data pre-processing, modeling, and communication. How to train, validate and test a model, detect and characterize overfitting, and determine an adequate sample size are some of the key issues to handle when applying machine learning. We also illustrate the process and particular nuances with examples of how researchers in addiction have applied machine learning techniques with different goals, study designs, or data sources.


Mathematics ◽  
2021 ◽  
Vol 9 (11) ◽  
pp. 1166
Author(s):  
Deepak Kumar ◽  
Chaman Verma ◽  
Pradeep Kumar Singh ◽  
Maria Simona Raboaca ◽  
Raluca-Andreea Felseghi ◽  
...  

The present study accentuated a hybrid approach to evaluate the impact, association and discrepancies of demographic characteristics on a student’s job placement. The present study extracted several significant academic features that determine the Master of Business Administration (MBA) student placement and confirm the placed gender. This paper recommended a novel futuristic roadmap for students, parents, guardians, institutions, and companies to benefit at a certain level. Out of seven experiments, the first five experiments were conducted with deep statistical computations, and the last two experiments were performed with supervised machine learning approaches. On the one hand, the Support Vector Machine (SVM) outperformed others with the uppermost accuracy of 90% to predict the employment status. On the other hand, the Random Forest (RF) attained a maximum accuracy of 88% to recognize the gender of placed students. Further, several significant features are also recommended to identify the placement of gender and placement status. A statistical t-test at 0.05 significance level proved that the student’s gender did not influence their offered salary during job placement and MBA specializations Marketing and Finance (Mkt&Fin) and Marketing and Human Resource (Mkt&HR) (p > 0.05). Additionally, the result of the t-test also showed that gender did not affect student’s placement test percentage scores (p > 0.05) and degree streams such as Science and Technology (Sci&Tech), Commerce and Management (Comm&Mgmt). Others did not affect the offered salary (p > 0.05). Further, the χ2 test revealed a significant association between a student’s course specialization and student’s placement status (p < 0.05). It also proved that there is no significant association between a student’s degree and placement status (p > 0.05). The current study recommended automatic placement prediction with demographic impact identification for the higher educational universities and institutions that will help human communities (students, teachers, parents, institutions) to prepare for the future accordingly.


2017 ◽  
Author(s):  
Sabrina Jaeger ◽  
Simone Fulle ◽  
Samo Turk

Inspired by natural language processing techniques we here introduce Mol2vec which is an unsupervised machine learning approach to learn vector representations of molecular substructures. Similarly, to the Word2vec models where vectors of closely related words are in close proximity in the vector space, Mol2vec learns vector representations of molecular substructures that are pointing in similar directions for chemically related substructures. Compounds can finally be encoded as vectors by summing up vectors of the individual substructures and, for instance, feed into supervised machine learning approaches to predict compound properties. The underlying substructure vector embeddings are obtained by training an unsupervised machine learning approach on a so-called corpus of compounds that consists of all available chemical matter. The resulting Mol2vec model is pre-trained once, yields dense vector representations and overcomes drawbacks of common compound feature representations such as sparseness and bit collisions. The prediction capabilities are demonstrated on several compound property and bioactivity data sets and compared with results obtained for Morgan fingerprints as reference compound representation. Mol2vec can be easily combined with ProtVec, which employs the same Word2vec concept on protein sequences, resulting in a proteochemometric approach that is alignment independent and can be thus also easily used for proteins with low sequence similarities.


Sign in / Sign up

Export Citation Format

Share Document