Prediction accuracy analysis with logistic regression and CART decision tree

Few studies have used deep learning, such as a deep neural network (DNN), to classify and predict the influence of nutritional intake on overweight/obesity, dyslipidemia, hypertension and type 2 diabetes mellitus (T2DM). The present study aims to classify and predict associations between nutritional intake and the risk of overweight/obesity, dyslipidemia, hypertension and T2DM by developing a DNN model, and to compare the DNN model with two of the most popular machine learning models, logistic regression and decision tree. Subjects aged 40 to 69 years in the 4th–7th (2007 through 2018) Korea National Health and Nutrition Examination Survey (KNHANES) were included. Diagnostic criteria of dyslipidemia (n = 10,731), hypertension (n = 10,991), T2DM (n = 3889) and overweight/obesity (n = 10,980) were set as dependent variables. Nutritional intakes were set as independent variables. A DNN model comprising one input layer with 7 nodes, three hidden layers with 30, 12 and 8 nodes, and one output layer with one node was implemented in Python using Keras with a TensorFlow backend. The DNN used the binary cross-entropy loss function for binary classification with the Adam optimizer. To avoid overfitting, dropout was applied to each hidden layer. Structural equation modelling (SEM) was also performed to simultaneously estimate multivariate causal associations between nutritional intake and overweight/obesity, dyslipidemia, hypertension and T2DM. With five-fold cross-validation, the DNN model showed higher prediction accuracy than the two other machine learning models: 0.58654 for dyslipidemia, 0.79958 for hypertension, 0.80896 for T2DM and 0.62496 for overweight/obesity.
Prediction accuracies for dyslipidemia, hypertension, T2DM and overweight/obesity were 0.58448, 0.79929, 0.80818 and 0.62486, respectively, with logistic regression, and 0.52148, 0.66773, 0.71587 and 0.54026, respectively, with a decision tree. In this study, a DNN model with three hidden layers of 30, 12 and 8 nodes had better prediction accuracy than the two conventional machine learning models, logistic regression and decision tree.
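To make the stated architecture concrete, the sketch below counts the trainable parameters of a fully connected network with the layer widths given in the abstract (7, 30, 12, 8, 1) and evaluates the binary cross-entropy loss on a few made-up predictions. It is an illustration only, not the authors' code: the function names, the clipping constant `eps` and the toy labels and probabilities are assumptions, and dropout and the Adam optimizer are not modelled.

```python
import math

# Layer widths as stated in the abstract: 7 inputs, hidden layers of 30, 12 and 8 nodes, 1 output.
LAYER_SIZES = [7, 30, 12, 8, 1]


def count_dense_parameters(layer_sizes):
    """Weights plus biases for a stack of fully connected layers."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy; eps (an assumption) keeps log() away from zero."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


n_params = count_dense_parameters(LAYER_SIZES)

# Toy labels and predicted probabilities, invented for illustration.
loss = binary_cross_entropy([1, 0, 1], [0.9, 0.2, 0.6])
print(n_params, round(loss, 4))
```

The parameter count is small relative to the sample sizes reported above, which may partly explain why the DNN and logistic regression accuracies are so close.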


AN EFFICIENT MACHINE LEARNING MODEL FOR PREDICTION OF ACUTE MYOCARDIAL INFARCTION

Recent Advances in Computer Science and Communications ◽

10.2174/2666255813666200325104317 ◽

2020 ◽

Vol 13 ◽

Author(s):

Dhilsath Fathima.M ◽

S. Justin Samuel ◽

R. Hari Haran

Keyword(s):

Machine Learning ◽

Myocardial Infarction ◽

Acute Myocardial Infarction ◽

Logistic Regression ◽

Decision Tree ◽

Learning Model ◽

Training Dataset ◽

Data Set ◽

Machine Learning Model ◽

Proposed Model

Aim: This work aims to develop an improved and robust machine learning model for predicting myocardial infarction (MI), which could have substantial clinical impact. Objectives: This paper explains how to build a machine learning based computer-aided analysis system for early and accurate prediction of MI, using the Framingham Heart Study dataset for validation and evaluation. The proposed computer-aided analysis model will support medical professionals in predicting myocardial infarction proficiently. Methods: The proposed model uses mean imputation to handle missing values in the data set, then applies principal component analysis (PCA) to extract the optimal features from the data set and enhance the performance of the classifiers. After PCA, the reduced features are partitioned into a training dataset and a testing dataset: 70% of the data is given as input to four well-known classifiers (support vector machine, k-nearest neighbor, logistic regression and decision tree) to train them, and the remaining 30% is used to evaluate the output of the machine learning model using the following performance metrics: confusion matrix, classifier accuracy, precision, sensitivity, F1-score and AUC-ROC curve. Results: The outputs of the classifiers were evaluated using these performance measures. Logistic regression provided higher accuracy than the k-NN, SVM and decision tree classifiers, and PCA performed well as a feature extraction method that enhances the performance of the proposed model. From these analyses, we conclude that logistic regression has a good mean accuracy and standard deviation of accuracy compared with the other three algorithms. The AUC-ROC curves of the classifiers (Figures 4 and 5) show that logistic regression exhibits a good AUC-ROC score, around 70%, compared to the k-NN and decision tree algorithms.
Conclusion: From the result analysis, we infer that the proposed machine learning model can act as an optimal decision-making system that predicts acute myocardial infarction at an earlier stage than existing machine learning based prediction models. It can predict the presence of acute myocardial infarction in humans using heart disease risk factors, in order to decide when to start lifestyle modification and medical treatment to prevent heart disease.
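Two of the preprocessing steps named in the Methods, mean imputation and the 70/30 partition, can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions: the helper names, the use of `None` to mark a missing value, the fixed random seed and the toy rows are all invented here, and the PCA and classifier-training steps are omitted.

```python
import random


def mean_impute(rows):
    """Replace None in each column with the mean of that column's observed values."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if r[j] is not None]
        means.append(sum(observed) / len(observed))
    return [[means[j] if r[j] is None else r[j] for j in range(n_cols)]
            for r in rows]


def train_test_split(rows, train_fraction=0.7, seed=0):
    """Shuffle a copy of the rows and cut it at train_fraction."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]


# Toy data with missing entries, invented for illustration.
toy = [[1.0, None], [3.0, 4.0], [None, 8.0], [5.0, 6.0], [2.0, 2.0],
       [4.0, None], [6.0, 1.0], [None, 3.0], [7.0, 5.0], [8.0, 7.0]]

filled = mean_impute(toy)
train, test = train_test_split(filled)
print(len(train), len(test))
```

In practice the column means would usually be computed on the training portion only and then reused for the test portion, so that no information from the test set leaks into preprocessing; the abstract does not say which order the authors used.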


Utilizing Twitter Data Analysis and Deep Learning to Identify Drug Use (Preprint)

10.2196/preprints.14681 ◽

2019 ◽

Author(s):

Joseph Tassone ◽

Peizhi Yan ◽

Mackenzie Simpson ◽

Chetan Mendhe ◽

Vijay Mago ◽

...

Keyword(s):

Social Media ◽

Logistic Regression ◽

Deep Learning ◽

Decision Tree ◽

Semantic Meaning ◽

Predictive Capability ◽

Logistic Regression Models ◽

Twitter Data ◽

Data Points ◽

Positive Classification

BACKGROUND The collection and examination of social media has become a useful mechanism for studying the mental activity and behavior tendencies of users. OBJECTIVE Through the analysis of a collected set of Twitter data, a model will be developed for predicting positively referenced, drug-related tweets. From this, trends and correlations can be determined. METHODS Twitter tweets and attribute data were collected and processed using topic-related keywords, such as drug slang and use-conditions (methods of drug consumption). Potential candidates were preprocessed, resulting in a dataset of 3,696,150 rows. The predictive classification power of multiple methods was compared, including regression, decision trees, and CNN-based classifiers. For the latter, a deep learning approach was implemented to screen and analyze the semantic meaning of the tweets. RESULTS The logistic regression and decision tree models utilized 12,142 data points for training and 1041 data points for testing. The logistic regression models displayed accuracies of 54.56% and 57.44%, respectively, and an AUC of 0.58. The decision tree was an improvement, with an accuracy of 63.40% and an AUC of 0.68. All these values implied a low predictive capability with little to no discrimination. Conversely, the two CNN-based classifiers tested showed a substantial improvement. The first was trained with 2,661 manually labeled samples, while the other included synthetically generated tweets, culminating in 12,142 samples. The accuracy scores were 76.35% and 82.31%, with AUCs of 0.90 and 0.91. Using association rule mining in conjunction with the CNN-based classifier showed a high likelihood for keywords such as “smoke”, “cocaine”, and “marijuana” triggering a drug-positive classification. CONCLUSIONS Predictive analysis without a CNN is limited and possibly fruitless.
Attribute-based models presented little predictive capability and were not suitable for analyzing this type of data. The semantic meaning of the tweets needed to be utilized, giving the CNN-based classifier an advantage over other solutions. Additionally, commonly mentioned drugs had a level of correspondence with frequently used illicit substances, proving the practical usefulness of this system. Lastly, the synthetically generated set provided increased scores, improving the predictive capability. CLINICALTRIAL None
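The abstract reports both accuracy and AUC, which measure different things: accuracy depends on a decision threshold, while AUC is the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The sketch below computes both from scratch on toy data. The function names, the 0.5 threshold and the example scores are assumptions made for illustration and are not taken from the study.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def accuracy(labels, scores, threshold=0.5):
    """Share of examples whose thresholded score matches the label."""
    hits = sum(1 for y, s in zip(labels, scores) if int(s >= threshold) == y)
    return hits / len(labels)


# Toy labels and classifier scores, invented for illustration.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

auc = roc_auc(labels, scores)
acc = accuracy(labels, scores)
print(auc, acc)
```

The pairwise loop is quadratic in the number of examples, which is fine for a sketch; a rank-based formulation would be preferable at the scale of the test set described above.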


Prediction of short-term mortality in acute heart failure patients using minimal electronic health record data

BioData Mining ◽

10.1186/s13040-021-00255-w ◽

2021 ◽

Vol 14 (1) ◽

Author(s):

Ashwath Radhachandran ◽

Anurag Garikipati ◽

Nicole S. Zelin ◽

Emily Pellegrini ◽

Sina Ghandian ◽

...

Keyword(s):

Heart Failure ◽

Logistic Regression ◽

Decision Tree ◽

Risk Stratification ◽

Acute Heart Failure ◽

Model Performance ◽

Machine Learning Techniques ◽

Electronic Health Record Data ◽

Test Set ◽

Patient Disposition

Abstract Background Acute heart failure (AHF) is associated with significant morbidity and mortality. Effective patient risk stratification is essential to guiding hospitalization decisions and the clinical management of AHF. Clinical decision support systems can be used to improve predictions of mortality made in emergency care settings for the purpose of AHF risk stratification. In this study, several models for the prediction of seven-day mortality among AHF patients were developed by applying machine learning techniques to retrospective patient data from 236,275 total emergency department (ED) encounters, 1881 of which were considered positive for AHF and were used for model training and testing. The models used varying subsets of age, sex, vital signs, and laboratory values. Model performance was compared to the Emergency Heart Failure Mortality Risk Grade (EHMRG) model, a commonly used system for prediction of seven-day mortality in the ED with similar (or, in some cases, more extensive) inputs. Model performance was assessed in terms of area under the receiver operating characteristic curve (AUROC), sensitivity, and specificity. Results When trained and tested on a large academic dataset, the best-performing model and EHMRG demonstrated test set AUROCs of 0.84 and 0.78, respectively, for prediction of seven-day mortality. Given only measurements of respiratory rate, temperature, mean arterial pressure, and FiO2, one model produced a test set AUROC of 0.83. Neither a logistic regression comparator nor a simple decision tree outperformed EHMRG. Conclusions A model using only the measurements of four clinical variables outperforms EHMRG in the prediction of seven-day mortality in AHF. With these inputs, the model could not be replaced by logistic regression or reduced to a simple decision tree without significant performance loss. 
In ED settings, this minimal-input risk stratification tool may assist clinicians in making critical decisions about patient disposition by providing early and accurate insights into individual patients’ risk profiles.


Predicting 30-day Hospital Readmission with Publicly Available Administrative Database

Methods of Information in Medicine ◽

10.3414/me14-02-0017 ◽

2015 ◽

Vol 54 (06) ◽

pp. 560-567 ◽

Cited By ~ 11

Author(s):

K. Zhu ◽

Z. Lou ◽

J. Zhou ◽

N. Ballester ◽

P. Parikh ◽

...

Keyword(s):

Heart Failure ◽

Logistic Regression ◽

Decision Tree ◽

Ad Hoc ◽

Prediction Models ◽

Conditional Logistic Regression ◽

Hospital Readmissions ◽

Decision Rules ◽

Classification Models ◽

Standard Classification

Summary Introduction: This article is part of the Focus Theme of Methods of Information in Medicine on “Big Data and Analytics in Healthcare”. Background: Hospital readmissions raise healthcare costs and cause significant distress to providers and patients. It is, therefore, of great interest to healthcare organizations to predict which patients are at risk of being readmitted to their hospitals. However, current logistic regression based risk prediction models have limited prediction power when applied to hospital administrative data. Meanwhile, although decision trees and random forests have been applied, they tend to be too complex for hospital practitioners to understand. Objectives: Explore the use of conditional logistic regression to increase the prediction accuracy. Methods: We analyzed an HCUP statewide in-patient discharge record dataset, which includes patient demographics, clinical and care utilization data from California. We extracted records of heart failure Medicare beneficiaries who had inpatient experience during an 11-month period. We corrected the data imbalance issue with under-sampling. In our study, we first applied standard logistic regression and decision tree to obtain influential variables and derive practically meaningful decision rules. We then stratified the original data set accordingly and applied logistic regression on each data stratum. We further explored the effect of interacting variables in the logistic regression modeling. We conducted cross validation to assess the overall prediction performance of conditional logistic regression (CLR) and compared it with standard classification models. Results: The developed CLR models outperformed several standard classification models (e.g., straightforward logistic regression, stepwise logistic regression, random forest, support vector machine). For example, the best CLR model improved the classification accuracy by nearly 20% over the straightforward logistic regression model.
Furthermore, the developed CLR models tend to achieve sensitivity more than 10% better than the standard classification models, which translates to correctly labeling an additional 400 – 500 readmissions for heart failure patients in the state of California over a year. Lastly, several key predictors identified from the HCUP data include the disposition location from discharge, the number of chronic conditions, and the number of acute procedures. Conclusions: It would be beneficial to apply simple decision rules obtained from the decision tree in an ad-hoc manner to guide the cohort stratification. It could be potentially beneficial to explore the effect of pairwise interactions between influential predictors when building the logistic regression models for different data strata. Judicious use of the ad-hoc CLR models developed offers insights into future development of prediction models for hospital readmissions, which can lead to better intuition in identifying high-risk patients and developing effective post-discharge care strategies. Lastly, this paper is expected to raise the awareness of collecting data on additional markers and developing necessary database infrastructure for larger-scale exploratory studies on readmission risk prediction.
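The Methods mention correcting class imbalance with under-sampling but do not say which variant was used. One common form, random under-sampling of every class down to the size of the smallest class, might look like the sketch below. The function name, the `readmitted` field, the seed and the toy records are hypothetical and chosen only for illustration.

```python
import random


def undersample(records, label_of, seed=0):
    """Randomly keep as many records per class as the smallest class has."""
    by_class = {}
    for rec in records:
        by_class.setdefault(label_of(rec), []).append(rec)
    n_min = min(len(group) for group in by_class.values())
    rng = random.Random(seed)
    balanced = []
    for group in by_class.values():
        # sample() draws without replacement, so no record is duplicated.
        balanced.extend(rng.sample(group, n_min))
    rng.shuffle(balanced)
    return balanced


# Toy discharge records: 3 readmissions among 20 stays (invented numbers).
records = [{"id": i, "readmitted": 1 if i < 3 else 0} for i in range(20)]

balanced = undersample(records, lambda r: r["readmitted"])
print(len(balanced), sum(r["readmitted"] for r in balanced))
```

Under-sampling discards majority-class records, so it is normally applied to the training folds only; evaluating on a rebalanced test set would misstate real-world accuracy.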


Differential Diagnosis Model of Hypocellular Myelodysplastic Syndrome and Aplastic Anemia Based on the Medical Big Data Platform

Complexity ◽

10.1155/2018/4824350 ◽

2018 ◽

Vol 2018 ◽

pp. 1-12 ◽

Cited By ~ 9

Author(s):

Jianhui Wu ◽

Lu Zhang ◽

Sufeng Yin ◽

Haidong Wang ◽

Guoli Wang ◽

...

Keyword(s):

Neural Network ◽

Logistic Regression ◽

Big Data ◽

Differential Diagnosis ◽

Decision Tree ◽

Cell Morphology ◽

Bp Neural Network ◽

Data Platform ◽

New Ideas ◽

Medical Big Data

The arrival of the era of big data has brought new ideas for solving problems in all walks of life. In the medical field, clinical data are collected and stored using the medical big data platform. Based on medical big data, new ideas and methods for the differential diagnosis of hypocellular myelodysplastic syndrome (hypo-MDS) and aplastic anemia (AA) are studied. Basic information, peripheral blood classification counts, peripheral blood cell morphology, bone marrow cell morphology, and other information were collected from patients diagnosed with hypo-MDS or AA at first diagnosis. First, statistical analysis was performed. Then, a logistic regression model, a decision tree model, a BP neural network model, and a support vector machine (SVM) model of hypo-MDS and AA were established. The sensitivity, specificity, Youden index, positive likelihood ratio (+LR), negative likelihood ratio (−LR), area under curve (AUC), accuracy, Kappa value, positive predictive value (+PV), and negative predictive value (−PV) of the four models were compared on the training set and the test set, respectively. Finally, among the four classification algorithms (logistic regression, decision tree, BP neural network, and SVM), the decision tree algorithm was found to be optimal for the classification of hypo-MDS and AA with the support of medical big data, and the characteristics of the data misjudged by the optimal model were analyzed.
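Most of the evaluation measures listed above follow directly from the four cells of a binary confusion matrix. The sketch below uses the standard textbook definitions; the function name, the dictionary keys and the example counts are assumptions for illustration and are not results from the study. AUC is left out because it needs continuous scores rather than counts.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Standard diagnostic measures from a 2x2 confusion matrix.

    Assumes every denominator is non-zero (for example, specificity below 1).
    """
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / n
    # Agreement expected by chance, used for Cohen's kappa.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "youden": sensitivity + specificity - 1.0,
        "lr_pos": sensitivity / (1.0 - specificity),
        "lr_neg": (1.0 - sensitivity) / specificity,
        "accuracy": accuracy,
        "kappa": (accuracy - expected) / (1.0 - expected),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Invented counts: 45 true cases and 55 non-cases.
m = diagnostic_metrics(tp=40, fp=10, fn=5, tn=45)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Reporting these measures together is useful because accuracy alone can hide a poor trade-off between sensitivity and specificity.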


Improving the Prediction Accuracy of Decision Tree Mining with Data Preprocessing

2017 IEEE 41st Annual Computer Software and Applications Conference (COMPSAC) ◽

10.1109/compsac.2017.146 ◽

2017 ◽

Cited By ~ 16

Author(s):

Priyanga Chandrasekar ◽

Kai Qian ◽

Hossain Shahriar ◽

Prabir Bhattacharya

Keyword(s):

Decision Tree ◽

Prediction Accuracy ◽

Data Preprocessing ◽

Tree Mining


Comparison of the Performance of Machine Learning Algorithms in Predicting Heart Disease

Frontiers in Health Informatics ◽

10.30699/fhi.v10i1.349 ◽

2021 ◽

Vol 10 (1) ◽

pp. 99

Author(s):

Sajad Yousefi

Keyword(s):

Machine Learning ◽

Logistic Regression ◽

Heart Disease ◽

Decision Tree ◽

Roc Curve ◽

Machine Learning Algorithms ◽

Supervised Machine Learning ◽

Learning Models ◽

Algorithm Performance ◽

Machine Learning Models

Introduction: Heart disease is often associated with conditions such as arteries clogged by sediment accumulation, which cause chest pain and heart attack. Many people die of heart disease every year. Most countries have a shortage of cardiovascular specialists, and thus a significant percentage of cases are misdiagnosed. Hence, predicting this disease is a serious issue. By applying machine learning models to a multidimensional dataset, this article aims to find the most efficient and accurate machine learning models for disease prediction. Material and Methods: Several algorithms were used to predict heart disease, among which the supervised machine learning algorithms Decision Tree, Random Forest and KNN are frequently mentioned. The algorithms were applied to a dataset of 294 samples taken from the UCI repository. The dataset includes heart disease features. To enhance algorithm performance, these features were analyzed, and feature importance scores and cross validation were considered. Results: The algorithms were compared with each other based on the ROC curve and criteria such as accuracy, precision, sensitivity and F1 score, which were evaluated for each model. In the evaluation, accuracy and AUC ROC were 83% and 99%, respectively, for the Decision Tree algorithm. The Logistic Regression algorithm, with accuracy and AUC ROC of 88% and 91%, respectively, performed better than the other algorithms. Therefore, these techniques can be useful for physicians to predict heart disease patients and prescribe for them correctly. Conclusion: Machine learning techniques can be used in medicine to analyze data collections related to a disease and to predict it. The area under the ROC curve and the evaluation criteria of a number of machine learning classification algorithms for heart disease prediction were compared to determine the most appropriate classifier.
In the evaluation, better performance was observed in both the Decision Tree and Logistic Regression models.
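The abstract mentions cross validation without giving the number of folds. As a sketch of how k-fold splitting works on a dataset of the stated size (294 samples), the generator below shuffles the sample indices once and yields one train/test index pair per fold. The choice of k = 5, the seed and the function name are assumptions made for illustration.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Striding by k gives folds whose sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


splits = list(k_fold_indices(294, k=5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx))
```

With only 294 samples, each test fold holds fewer than 60 cases, so per-fold scores can vary widely; averaging across folds gives a steadier estimate than a single split.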


Determinates of Employee Voluntary Turnover and Forecasting in R&D Departments: A Case Study

Studies in Engineering and Technology ◽

10.11114/set.v3i1.1635 ◽

2016 ◽

Vol 3 (1) ◽

pp. 64

Author(s):

Xiaojuan Zhu ◽

Rapinder Sawhney ◽

Girish Upreti

Keyword(s):

Logistic Regression ◽

Decision Tree ◽

Human Resource ◽

Voluntary Turnover ◽

Important Variable ◽

Validation Dataset ◽

Tree Model ◽

Job Title ◽

Decision Tree Method ◽

Employee Tenure

employee voluntary turnover factors using logistic regression and forecasts employee tenure using a decision tree for four research and development departments in a large U.S. organization. Company job title, gender, ethnicity, age and years of service significantly affect employee voluntary turnover behavior, as determined by logistic regression. The findings assist managers and human resource departments in designing specific employee retention strategies to reduce R&D departments’ voluntary turnover rate. The decision tree method built a five-level tree model with 17 nodes. This model has the lowest AIC value and the best performance on the validation dataset. Age at hire, job title, division, and race are statistically significant factors for predicting employee tenure. The most important variable is age at hire, located in the decision tree’s first, third, and fourth nodes. Classification rules assist managers and human resource departments in quickly predicting employee tenure and in making hiring decisions.


COMPARATIVE STUDY OF MACHINE LEARNING KNN, SVM, AND DECISION TREE ALGORITHM TO PREDICT STUDENT’S PERFORMANCE

International Journal of Research -GRANTHAALAYAH ◽

10.29121/granthaalayah.v7.i1.2019.1048 ◽

2019 ◽

Vol 7 (1) ◽

pp. 190-196

Author(s):

Slamet Wiyono ◽

Taufiq Abidin

Keyword(s):

Decision Tree ◽

Student Performance ◽

Prediction Accuracy ◽

Model Building ◽

Decision Tree Algorithm ◽

Tree Algorithm ◽

Svm Algorithm ◽

Tree Algorithms ◽

Student’S Performance ◽

Predicting Student Performance

Inactive students reduce the number of students who graduate on time. Students can be kept from becoming inactive by predicting student performance. This study compared the KNN, SVM, and Decision Tree algorithms to find the best predictive model. The model-building process consisted of the following steps: data collection, pre-processing, model building, comparison of models, and evaluation. The results show that the SVM algorithm has the best prediction accuracy, with a precision value of 95%. The Decision Tree algorithm has a prediction accuracy of 93% and the KNN algorithm has a prediction accuracy of 92%.
