Machine learning approach for the binary classification of biomedical literature

Abstract Background: We have applied machine learning techniques to automate the screening of biomedical literature prior to the manual curation of clinical databases, such as that performed by the Human Gene Mutation Database (HGMD). Methods: We have developed two machine learning models, one based on title and abstract data only, the other on the full text of the article. The models were built using a Natural Language Processing (NLP) pipeline and a logistic regression classifier. Our pipelines are implemented in Python and can be run using Docker. They are made available to the wider community via GitHub (https://github.com/annacprice/nlp-bio-tools) and Docker Hub. Results: During testing, both models performed well, correctly predicting HGMD-relevant articles more than 93% of the time and correctly discarding irrelevant articles more than 96% of the time, with Matthews Correlation Coefficients (MCCs) of over 0.89. Evaluation of the finalised model using an unseen validation dataset demonstrated that the full text model correctly predicted HGMD-relevant articles more than 97% of the time, an accuracy 9.5% higher than that obtained with the title/abstract model. Conclusions: Through this work we have demonstrated that machine learning models can act as an effective pre-screen of biomedical literature, with the results indicating that a full text approach to screening biomedical literature is preferable to using just the title/abstract data.
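The three figures quoted in the Results (the rate of correctly predicted relevant articles, the rate of correctly discarded irrelevant articles, and the MCC) can all be derived from one confusion matrix. The sketch below shows that relationship; the function name and the counts are illustrative assumptions chosen to land near the reported thresholds, not data from the paper.

```python
import math


def screening_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, MCC) from confusion-matrix counts."""
    # Sensitivity: share of relevant articles that were predicted relevant.
    sensitivity = tp / (tp + fn)
    # Specificity: share of irrelevant articles that were discarded.
    specificity = tn / (tn + fp)
    # Matthews Correlation Coefficient; defined as 0 when a margin is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


# Hypothetical counts for 100 relevant and 100 irrelevant articles.
sens, spec, mcc = screening_metrics(tp=93, fp=4, tn=96, fn=7)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} MCC={mcc:.3f}")
```

Unlike plain accuracy, the MCC uses all four cells of the matrix, which makes it a reasonable summary when relevant and irrelevant articles are not equally frequent.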

Twitter Data Sentimental Analysis Using Multiple Classifications

Journal of Computational and Theoretical Nanoscience ◽

10.1166/jctn.2020.9319 ◽

2020 ◽

Vol 17 (8) ◽

pp. 3776-3781

Author(s):

M. Adimoolam ◽

Raghav Sharma ◽

A. John ◽

M. Suresh Kumar ◽

K. Ashok Kumar

Keyword(s):

Machine Learning ◽

Language Processing ◽

Machine Learning Algorithms ◽

Machine Learning Techniques ◽

Human Beings ◽

Learning Models ◽

The Past ◽

Twitter Data ◽

Learning Techniques ◽

Machine Learning Models

In the past few decades, people have experienced a tremendous increase in online interaction, particularly on microblogging websites and other social media. Many kinds of data are involved, and classifying this data for grouping and storage is challenging in real-world scenarios. Various machine learning and Natural Language Processing (NLP) techniques have been applied to analyse sentiment. This work concentrates on using several machine learning algorithms to perform sentiment analysis and on comparing machine learning models for sentiment classification. It analyses sentiment using multiple classification methods. From the evaluation of this experiment, it can be concluded that NLP and machine learning techniques are efficient for sentiment analysis.

Application of Machine Learning Techniques to Predict Binding Affinity for Drug Targets: A Study of Cyclin-Dependent Kinase 2

Current Medicinal Chemistry ◽

10.2174/2213275912666191102162959 ◽

2020 ◽

Vol 28 (2) ◽

pp. 253-265 ◽

Cited By ~ 3

Author(s):

Gabriela Bitencourt-Ferreira ◽

Amauri Duarte da Silva ◽

Walter Filgueira de Azevedo

Keyword(s):

Machine Learning ◽

Binding Affinity ◽

Predictive Performance ◽

Supervised Machine Learning ◽

Machine Learning Techniques ◽

Scoring Functions ◽

Cyclin Dependent Kinase ◽

Learning Models ◽

Learning Techniques ◽

Machine Learning Models

Background: The elucidation of the structure of cyclin-dependent kinase 2 (CDK2) made it possible to develop targeted scoring functions for virtual screening aimed at identifying new inhibitors for this enzyme. CDK2 is a protein target for the development of drugs intended to modulate cell-cycle progression and control. Such drugs have potential anticancer activities. Objective: Our goal here is to review recent applications of machine learning methods to predict ligand-binding affinity for protein targets. To assess the predictive performance of classical scoring functions and targeted scoring functions, we focused our analysis on CDK2 structures. Methods: We have experimental structural data for hundreds of binary complexes of CDK2 with different ligands, many of them with inhibition constant information. We investigate here computational methods to calculate the binding affinity of CDK2 through classical scoring functions and machine learning models. Results: Analysis of the predictive performance of classical scoring functions available in docking programs such as Molegro Virtual Docker, AutoDock4, and AutoDock Vina indicated that these methods failed to predict binding affinity with significant correlation with experimental data. Targeted scoring functions developed through supervised machine learning techniques showed a significant correlation with experimental data. Conclusion: Here, we described the application of supervised machine learning techniques to generate a scoring function to predict binding affinity. Machine learning models showed superior predictive performance when compared with classical scoring functions. Analysis of the computational models obtained through machine learning could capture essential structural features responsible for binding affinity against CDK2.
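The abstract judges scoring functions by their correlation with experimental data but does not name the coefficient used. As a minimal sketch, assuming the Pearson correlation coefficient, agreement between predicted and experimental affinities could be measured as follows. The function name and the affinity values are hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (spread_x * spread_y)


# Hypothetical values on a log-affinity scale (not data from the study).
experimental = [5.0, 6.0, 7.0, 8.0]
well_ranked = [5.5, 6.5, 7.5, 8.5]    # offset but perfectly ordered
badly_ranked = [8.0, 7.0, 6.0, 5.0]   # ordering reversed

r_good = pearson_r(experimental, well_ranked)
r_bad = pearson_r(experimental, badly_ranked)
print(r_good, r_bad)
```

A constant offset does not lower the coefficient, so a scoring function can correlate well with experiment while still misestimating absolute affinities.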

Machine learning models to identify low adherence to influenza vaccination among Korean adults with cardiovascular disease

BMC Cardiovascular Disorders ◽

10.1186/s12872-021-01925-7 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Moojung Kim ◽

Young Jae Kim ◽

Sung Jin Park ◽

Kwang Gi Kim ◽

Pyung Chun Oh ◽

...

Keyword(s):

Machine Learning ◽

Cardiovascular Disease ◽

Influenza Vaccination ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Support Vector ◽

Age Group ◽

Learning Models ◽

Extreme Gradient Boosting ◽

Machine Learning Models

Abstract Background Annual influenza vaccination is an important public health measure to prevent influenza infections and is strongly recommended for cardiovascular disease (CVD) patients, especially in the current coronavirus disease 2019 (COVID-19) pandemic. The aim of this study is to develop a machine learning model to identify Korean adult CVD patients with low adherence to influenza vaccination. Methods Adults with CVD (n = 815) from a nationally representative dataset of the Fifth Korea National Health and Nutrition Examination Survey (KNHANES V) were analyzed. Among these adults, 500 (61.4%) had answered "yes" to whether they had received seasonal influenza vaccinations in the past 12 months. The classification process was performed using the logistic regression (LR), random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGB) machine learning techniques. Because the Ministry of Health and Welfare in Korea offers free influenza immunization for the elderly, separate models were developed for the < 65 and ≥ 65 age groups. Results The accuracy of machine learning models using 16 variables as predictors of low influenza vaccination adherence was compared; for the ≥ 65 age group, XGB (84.7%) and RF (84.7%) have the best accuracies, followed by LR (82.7%) and SVM (77.6%). For the < 65 age group, SVM has the best accuracy (68.4%), followed by RF (64.9%), LR (63.2%), and XGB (61.4%). Conclusions The machine learning models show comparable performance in classifying adult CVD patients with low adherence to influenza vaccination.

Cocrystal Prediction Using Machine Learning Models and Descriptors

Applied Sciences ◽

10.3390/app11031323 ◽

2021 ◽

Vol 11 (3) ◽

pp. 1323

Author(s):

Medard Edmund Mswahili ◽

Min-Jeong Lee ◽

Gati Lother Martin ◽

Junghyun Kim ◽

Paul Kim ◽

...

Keyword(s):

Machine Learning ◽

Academic Research ◽

Pharmaceutical Research ◽

Machine Learning Techniques ◽

Learning Models ◽

Pharmaceutical Ingredients ◽

Learning Techniques ◽

Comparable Performance ◽

Selection Algorithms ◽

Machine Learning Models

Cocrystals are of much interest in industrial applications as well as academic research, and screening of suitable coformers for active pharmaceutical ingredients is the most crucial and challenging step in cocrystal development. Recently, machine learning techniques have attracted researchers in many fields, including pharmaceutical research such as quantitative structure-activity/property relationships. In this paper, we develop machine learning models to predict cocrystal formation. We extract descriptor values from the simplified molecular-input line-entry system (SMILES) representations of compounds and compare the machine learning models by experiments with our collected data of 1476 instances. As a result, we found that an artificial neural network shows great potential, as it has the best accuracy, sensitivity, and F1 score. We also found that the model achieved comparable performance with about half of the descriptors chosen by feature selection algorithms. We believe that this will contribute to faster and more accurate cocrystal development.

Triage and diagnosis of COVID-19 from medical social media (Preprint)

10.2196/preprints.30397 ◽

2021 ◽

Author(s):

Abul Hasan ◽

Mark Levene ◽

David Weston ◽

Renate Fromson ◽

Nicolas Koslover ◽

...

Keyword(s):

Machine Learning ◽

Social Media ◽

Natural Language Processing ◽

Natural Language ◽

Language Processing ◽

Learning Models ◽

Rule Based ◽

Additional Information ◽

Processing Pipeline ◽

Machine Learning Models

BACKGROUND The COVID-19 pandemic has created a pressing need for integrating information from disparate sources, in order to assist decision makers. Social media is important in this respect; however, to make sense of the textual information it provides and be able to automate the processing of large amounts of data, natural language processing methods are needed. Social media posts are often noisy, yet they may provide valuable insights regarding the severity and prevalence of the disease in the population. In particular, machine learning techniques for triage and diagnosis could allow for a better understanding of what social media may offer in this respect. OBJECTIVE This study aims to develop an end-to-end natural language processing pipeline for triage and diagnosis of COVID-19 from patient-authored social media posts, in order to provide researchers and other interested parties with additional information on the symptoms, severity and prevalence of the disease. METHODS The text processing pipeline first extracts COVID-19 symptoms and related concepts such as severity, duration, negations, and body parts from patients’ posts using conditional random fields. An unsupervised rule-based algorithm is then applied to establish relations between concepts in the next step of the pipeline. The extracted concepts and relations are subsequently used to construct two different vector representations of each post. These vectors are applied separately to build support vector machine learning models to triage patients into three categories and diagnose them for COVID-19. RESULTS We report macro- and micro-averaged F1 scores in the range of 71-96% and 61-87%, respectively, for the triage and diagnosis of COVID-19, when the models are trained on human labelled data. Our experimental results indicate that similar performance can be achieved when the models are trained using predicted labels from concept extraction and rule-based classifiers, thus yielding end-to-end machine learning. Also, we highlight important features uncovered by our diagnostic machine learning models and compare them with the most frequent symptoms revealed in another COVID-19 dataset. In particular, we found that the most important features are not always the most frequent ones. CONCLUSIONS Our preliminary results show that it is possible to automatically triage and diagnose patients for COVID-19 from natural language narratives using a machine learning pipeline, in order to provide additional information on the severity and prevalence of the disease through the eyes of social media.
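Because the Results quote both macro- and micro-averaged F1 scores, it may help to see how the two averages differ on a three-class problem. The sketch below is a plain-Python illustration; the class names and labels are hypothetical and are not taken from the study.

```python
from collections import Counter


def macro_micro_f1(y_true, y_pred):
    """Return (macro_f1, micro_f1) for single-label multiclass predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    def f1(t, p, n):
        denom = 2 * t + p + n
        return 2 * t / denom if denom else 0.0

    # Macro: average the per-class F1 scores, every class weighted equally.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts over all classes, then compute a single F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical triage labels for six posts (illustrative only).
y_true = ["home", "home", "home", "home", "gp", "hospital"]
y_pred = ["home", "home", "home", "home", "hospital", "hospital"]
macro, micro = macro_micro_f1(y_true, y_pred)
print(f"macro F1={macro:.3f} micro F1={micro:.3f}")
```

In this toy case the rare "gp" class is never predicted, which pulls the macro average well below the micro average; with single-label data the micro-averaged F1 equals plain accuracy.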

Deep context of citations using machine-learning models in scholarly full-text articles

Scientometrics ◽

10.1007/s11192-018-2944-y ◽

2018 ◽

Vol 117 (3) ◽

pp. 1645-1662 ◽

Cited By ~ 11

Author(s):

Saeed-Ul Hassan ◽

Mubashir Imran ◽

Sehrish Iqbal ◽

Naif Radi Aljohani ◽

Raheel Nawaz

Keyword(s):

Machine Learning ◽

Full Text ◽

Learning Models ◽

Machine Learning Models

Predicting Onset of Dementia Using Clinical Notes and Machine Learning: Case-Control Study (Preprint)

10.2196/preprints.17819 ◽

2020 ◽

Author(s):

Christopher A Hane ◽

Vijay S Nori ◽

William H Crown ◽

Darshak M Sanghavi ◽

Paul Bleicher

Keyword(s):

Machine Learning ◽

Language Processing ◽

Disease Onset ◽

Area Under The Curve ◽

Learning Models ◽

Term Care ◽

Clinical Notes ◽

Patients At Risk ◽

Hospital Systems ◽

Machine Learning Models

BACKGROUND Clinical trials need efficient tools to assist in recruiting patients at risk of Alzheimer disease and related dementias (ADRD). Early detection can also assist patients with financial planning for long-term care. Clinical notes are an important, underutilized source of information in machine learning models because of the cost of collection and complexity of analysis. OBJECTIVE This study aimed to investigate the use of deidentified clinical notes from multiple hospital systems collected over 10 years to augment retrospective machine learning models of the risk of developing ADRD. METHODS We used 2 years of data to predict the future outcome of ADRD onset. Clinical notes are provided in a deidentified format with specific terms and sentiments. Terms in clinical notes are embedded into a 100-dimensional vector space to identify clusters of related terms and abbreviations that differ across hospital systems and individual clinicians. RESULTS When using clinical notes, the area under the curve (AUC) improved from 0.85 to 0.94, and positive predictive value (PPV) increased from 45.07% (25,245/56,018) to 68.32% (14,153/20,717) in the model at disease onset. Models with clinical notes improved in both AUC and PPV in years 3-6 when notes’ volume was largest; results are mixed in years 7 and 8 with the smallest cohorts. CONCLUSIONS Although clinical notes helped in the short term, the presence of ADRD symptomatic terms years earlier than onset adds evidence to other studies that clinicians undercode diagnoses of ADRD. De-identified clinical notes increase the accuracy of risk models. Clinical notes collected across multiple hospital systems via natural language processing can be merged using postprocessing techniques to aid model accuracy.
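The positive predictive values in the Results are given together with their underlying counts, so the percentages can be checked directly. The short sketch below recomputes them, assuming each fraction is true positives over all predicted positives, which is the usual definition of PPV; the function name is illustrative.

```python
def ppv_percent(true_positives, predicted_positives):
    """Positive predictive value as a percentage, rounded to 2 decimals."""
    return round(100 * true_positives / predicted_positives, 2)


# Counts as quoted in the abstract.
ppv_without_notes = ppv_percent(25_245, 56_018)
ppv_with_notes = ppv_percent(14_153, 20_717)
print(ppv_without_notes, ppv_with_notes)
```

The counts also show that the model with clinical notes flags far fewer patients (20,717 versus 56,018), so the higher PPV comes with a smaller set of positive predictions.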

Analysis of Machine Learning Techniques Applied to Sensory Detection of Vehicles in Intelligent Crosswalks

Sensors ◽

10.3390/s20216019 ◽

2020 ◽

Vol 20 (21) ◽

pp. 6019

Author(s):

José Manuel Lozano Domínguez ◽

Faroq Al-Tam ◽

Tomás de J. Mateo Sanguino ◽

Noélia Correia

Keyword(s):

Machine Learning ◽

Smart Cities ◽

Machine Learning Techniques ◽

Support Vector ◽

Learning Models ◽

Fuzzy Classifier ◽

Logistic Regression Models ◽

The Road ◽

Learning Agent ◽

Machine Learning Models

Improving road safety through artificial intelligence-based systems is now crucial to turning smart cities into a reality. Under this highly relevant and extensive heading, an approach is proposed to improve vehicle detection in smart crosswalks using machine learning models. In contrast to classic fuzzy classifiers, machine learning models do not require the readjustment of labels that depend on the location of the system and the road conditions. Several machine learning models were trained and tested using real traffic data taken from urban scenarios in both Portugal and Spain. These include random forest, time-series forecasting, multi-layer perceptron, support vector machine, and logistic regression models. A deep reinforcement learning agent, based on a state-of-the-art double-deep recurrent Q-network, is also designed and compared with the machine learning models just mentioned. Results show that the machine learning models can efficiently replace the classic fuzzy classifier.

Heart Failure Diagnosis, Readmission, and Mortality Prediction Using Machine Learning and Artificial Intelligence Models

Current Epidemiology Reports ◽

10.1007/s40471-020-00259-w ◽

2020 ◽

Vol 7 (4) ◽

pp. 212-219 ◽

Cited By ~ 1

Author(s):

Aixia Guo ◽

Michael Pasque ◽

Francis Loh ◽

Douglas L. Mann ◽

Philip R. O. Payne

Keyword(s):

Machine Learning ◽

Heart Failure ◽

Outcome Prediction ◽

Predictive Accuracy ◽

Imbalanced Data ◽

Mortality Prediction ◽

Machine Learning Techniques ◽

Patient Specific ◽

Learning Models ◽

Machine Learning Models

Abstract Purpose of Review One in five people will develop heart failure (HF), and 50% of HF patients die in 5 years. The HF diagnosis, readmission, and mortality prediction are essential to develop personalized prevention and treatment plans. This review summarizes recent findings and approaches of machine learning models for HF diagnostic and outcome prediction using electronic health record (EHR) data. Recent Findings A set of machine learning models have been developed for HF diagnostic and outcome prediction using diverse variables derived from EHR data, including demographic, medical note, laboratory, and image data, and achieved expert-comparable prediction results. Summary Machine learning models can facilitate the identification of HF patients, as well as accurate patient-specific assessment of their risk for readmission and mortality. Additionally, novel machine learning techniques for integration of diverse data and improvement of model predictive accuracy in imbalanced data sets are critical for further development of these promising modeling methodologies.

CPT Data Interpretation Employing Different Machine Learning Techniques

Geosciences ◽

10.3390/geosciences11070265 ◽

2021 ◽

Vol 11 (7) ◽

pp. 265

Author(s):

Stefan Rauter ◽

Franz Tschuchnigg

Keyword(s):

Machine Learning ◽

Grain Size ◽

Random Forest ◽

Classification Model ◽

Machine Learning Techniques ◽

Support Vector ◽

Learning Models ◽

Cone Penetration ◽

Tip Resistance ◽

Machine Learning Models

The classification of soils into categories with a similar range of properties is a fundamental geotechnical engineering procedure. At present, this classification is based on various types of cost- and time-intensive laboratory and/or in situ tests. These soil investigations are essential for each individual construction site and have to be performed prior to the design of a project. Since Machine Learning could play a key role in reducing the costs and time needed for a suitable site investigation program, the basic ability of Machine Learning models to classify soils from Cone Penetration Tests (CPT) is evaluated. To find an appropriate classification model, 24 different Machine Learning models, based on three different algorithms, are built and trained on a dataset consisting of 1339 CPT. The applied algorithms are a Support Vector Machine, an Artificial Neural Network and a Random Forest. As input features, different combinations of direct cone penetration test data (tip resistance qc, sleeve friction fs, friction ratio Rf, depth d), combined with “defined”, thus, not directly measured data (total vertical stresses σv, effective vertical stresses σ’v and hydrostatic pore pressure u0), are used. Standard soil classes based on grain size distributions and soil classes based on soil behavior types according to Robertson are applied as targets. The different models are compared with respect to their prediction performance and the required learning time. The best results for all targets were obtained with models using a Random Forest classifier. For the soil classes based on grain size distribution, an accuracy of about 75%, and for soil classes according to Robertson, an accuracy of about 97–99%, was reached.
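The input features listed above mix directly measured quantities with derived ones. As a rough sketch, two of the derived quantities follow from standard geotechnical definitions: the friction ratio is Rf = 100 · fs / qc (in percent), and under hydrostatic conditions the effective vertical stress is σ'v = σv − u0. The function name, dictionary keys, units and sample values below are assumptions for illustration, not taken from the paper.

```python
def cpt_feature_row(qc, fs, depth, sigma_v, u0):
    """Assemble one feature row for a CPT reading.

    Assumes qc, fs, sigma_v and u0 share one pressure unit (e.g. kPa)
    and depth is in metres.
    """
    if qc <= 0:
        raise ValueError("tip resistance qc must be positive")
    return {
        "qc": qc,                      # tip resistance (measured)
        "fs": fs,                      # sleeve friction (measured)
        "Rf": 100.0 * fs / qc,         # friction ratio in percent
        "d": depth,                    # depth of the reading
        "sigma_v": sigma_v,            # total vertical stress
        "sigma_v_eff": sigma_v - u0,   # effective vertical stress
        "u0": u0,                      # hydrostatic pore pressure
    }


# Hypothetical reading at 5 m depth (illustrative values only).
row = cpt_feature_row(qc=5000.0, fs=50.0, depth=5.0, sigma_v=90.0, u0=30.0)
print(row["Rf"], row["sigma_v_eff"])
```

Any of the three algorithms named in the abstract could consume rows of this shape; which feature combinations are kept is exactly what the 24 models vary.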
