Machine Learning Classifier Models Can Identify Delirium in Intensive Care Units

Abstract Purpose: The aim of this study was to use machine learning to construct a model for the analysis of risk factors and prediction of delirium among ICU patients.Methods: We developed a set of real-world data to enable the comparison of the reliability and accuracy of delirium prediction models from the MIMIC-III database, the MIMIC-IV database and the eICU Collaborative Research Database. Significance tests, correlation analysis, and factor analysis were used to individually screen 80 potential risk factors. The predictive algorithms were run using the following models: Logistic regression, naive Bayesian, K-nearest neighbors, support vector machine, random forest, and eXtreme Gradient Boosting. Conventional E-PRE-DELIRIC and eighteen models, including all-factor (AF) models with all potential variables, characteristic variable (CV) models with principal component factors, and rapid predictive (RP) models without laboratory test results, were used to construct the risk prediction model for delirium. The performance of these machine learning models was measured by the area under the receiver operating characteristic curve (AUC) of tenfold cross-validation. The VIMs and SHAP algorithms, feature interpretation and sample prediction interpretation algorithms of the machine learning black box model were implemented.Results: A total of 78,365 patients were enrolled in this study, 22,159 of whom (28.28%) had positive delirium records. The E-PRE-DELIRIC model (AUC, 0.77), CV models (AUC, 0.77-0.93), CV models (AUC, 0.77-0.88) and RP models (AUC, 0.75-0.87) had discriminatory value. The random forest CV model found that the top five factors accounting for the weight of delirium were length of ICU stay, verbal response score, APACHE-III score, urine volume and hemoglobin. The SHAP values in the eXtreme Gradient Boosting CV model showed that the top three features that were negatively correlated with outcomes were verbal response score, urine volume, and hemoglobin; the top three characteristics that were positively correlated with outcomes were length of ICU stay, APACHE-III score, and alanine transaminase.Conclusion: Even with a small number of variables, machine learning has a good ability to predict delirium in critically ill patients. Characteristic variables provide direction for early intervention to reduce the risk of delirium.

Download Full-text

Identification of Potential Type II Diabetes in a Large-Scale Chinese Population Using a Systematic Machine Learning Framework

Journal of Diabetes Research ◽

10.1155/2020/6873891 ◽

2020 ◽

Vol 2020 ◽

pp. 1-12

Author(s):

Mingyue Xue ◽

Yinxia Su ◽

Chen Li ◽

Shuxia Wang ◽

Hua Yao

Keyword(s):

Machine Learning ◽

Risk Factors ◽

Decision Tree ◽

Type Ii Diabetes ◽

Large Scale ◽

Systolic Pressure ◽

Gradient Boosting ◽

Significant Feature ◽

Type Ii ◽

Extreme Gradient Boosting

Background. An estimated 425 million people globally have diabetes, accounting for 12% of the world’s health expenditures, and the number continues to grow, placing a huge burden on the healthcare system, especially in those remote, underserved areas. Methods. A total of 584,168 adult subjects who have participated in the national physical examination were enrolled in this study. The risk factors for type II diabetes mellitus (T2DM) were identified by p values and odds ratio, using logistic regression (LR) based on variables of physical measurement and a questionnaire. Combined with the risk factors selected by LR, we used a decision tree, a random forest, AdaBoost with a decision tree (AdaBoost), and an extreme gradient boosting decision tree (XGBoost) to identify individuals with T2DM, compared the performance of the four machine learning classifiers, and used the best-performing classifier to output the degree of variables’ importance scores of T2DM. Results. The results indicated that XGBoost had the best performance (accuracy=0.906, precision=0.910, recall=0.902, F‐1=0.906, and AUC=0.968). The degree of variables’ importance scores in XGBoost showed that BMI was the most significant feature, followed by age, waist circumference, systolic pressure, ethnicity, smoking amount, fatty liver, hypertension, physical activity, drinking status, dietary ratio (meat to vegetables), drink amount, smoking status, and diet habit (oil loving). Conclusions. We proposed a classifier based on LR-XGBoost which used fourteen variables of patients which are easily obtained and noninvasive as predictor variables to identify potential incidents of T2DM. The classifier can accurately screen the risk of diabetes in the early phrase, and the degree of variables’ importance scores gives a clue to prevent diabetes occurrence.

Download Full-text

Discovery of Depression-Associated Factors From a Nationwide Population-Based Survey: Epidemiological Study Using Machine Learning and Network Analysis (Preprint)

10.2196/preprints.27344 ◽

2021 ◽

Author(s):

Sang Min Nam ◽

Thomas A Peterson ◽

Kyoung Yul Seo ◽

Hyun Wook Han ◽

Jee In Kang

Keyword(s):

Machine Learning ◽

Risk Factors ◽

Network Analysis ◽

Survey Data ◽

Associated Factors ◽

Statistical Tests ◽

Epidemiological Studies ◽

Gradient Boosting ◽

Data Set ◽

Extreme Gradient Boosting

BACKGROUND In epidemiological studies, finding the best subset of factors is challenging when the number of explanatory variables is large. OBJECTIVE Our study had two aims. First, we aimed to identify essential depression-associated factors using the extreme gradient boosting (XGBoost) machine learning algorithm from big survey data (the Korea National Health and Nutrition Examination Survey, 2012-2016). Second, we aimed to achieve a comprehensive understanding of multifactorial features in depression using network analysis. METHODS An XGBoost model was trained and tested to classify “current depression” and “no lifetime depression” for a data set of 120 variables for 12,596 cases. The optimal XGBoost hyperparameters were set by an automated machine learning tool (TPOT), and a high-performance sparse model was obtained by feature selection using the feature importance value of XGBoost. We performed statistical tests on the model and nonmodel factors using survey-weighted multiple logistic regression and drew a correlation network among factors. We also adopted statistical tests for the confounder or interaction effect of selected risk factors when it was suspected on the network. RESULTS The XGBoost-derived depression model consisted of 18 factors with an area under the weighted receiver operating characteristic curve of 0.86. Two nonmodel factors could be found using the model factors, and the factors were classified into direct (<i>P</i><.05) and indirect (<i>P</i>≥.05), according to the statistical significance of the association with depression. Perceived stress and asthma were the most remarkable risk factors, and urine specific gravity was a novel protective factor. The depression-factor network showed clusters of socioeconomic status and quality of life factors and suggested that educational level and sex might be predisposing factors. Indirect factors (eg, diabetes, hypercholesterolemia, and smoking) were involved in confounding or interaction effects of direct factors. Triglyceride level was a confounder of hypercholesterolemia and diabetes, smoking had a significant risk in females, and weight gain was associated with depression involving diabetes. CONCLUSIONS XGBoost and network analysis were useful to discover depression-related factors and their relationships and can be applied to epidemiological studies using big survey data.

Download Full-text

Predicting Adverse Drug Events in Chinese Pediatric Inpatients With the Associated Risk Factors: A Machine Learning Study

Frontiers in Pharmacology ◽

10.3389/fphar.2021.659099 ◽

2021 ◽

Vol 12 ◽

Author(s):

Ze Yu ◽

Huanhuan Ji ◽

Jianwen Xiao ◽

Ping Wei ◽

Lin Song ◽

...

Keyword(s):

Machine Learning ◽

Risk Factors ◽

Adverse Drug Events ◽

Length Of Hospital Stay ◽

Gradient Boosting ◽

Multiple Risk Factors ◽

Extreme Gradient Boosting ◽

Novel Method ◽

Associated Risk Factors ◽

Medical University

The aim of this study was to apply machine learning methods to deeply explore the risk factors associated with adverse drug events (ADEs) and predict the occurrence of ADEs in Chinese pediatric inpatients. Data of 1,746 patients aged between 28 days and 18 years (mean age = 3.84 years) were included in the study from January 1, 2013, to December 31, 2015, in the Children’s Hospital of Chongqing Medical University. There were 247 cases of ADE occurrence, of which the most common drugs inducing ADEs were antibacterials. Seven algorithms, including eXtreme Gradient Boosting (XGBoost), CatBoost, AdaBoost, LightGBM, Random Forest (RF), Gradient Boosting Decision Tree (GBDT), and TPOT, were used to select the important risk factors, and GBDT was chosen to establish the prediction model with the best predicting abilities (precision = 44%, recall = 25%, F1 = 31.88%). The GBDT model has better performance than Global Trigger Tools (GTTs) for ADE prediction (precision 44 vs. 13.3%). In addition, multiple risk factors were identified via GBDT, such as the number of trigger true (TT) (+), number of doses, BMI, number of drugs, number of admission, height, length of hospital stay, weight, age, and number of diagnoses. The influencing directions of the risk factors on ADEs were displayed through Shapley Additive exPlanations (SHAP). This study provides a novel method to accurately predict adverse drug events in Chinese pediatric inpatients with the associated risk factors, which may be applicable in clinical practice in the future.

Download Full-text

Discovery of Depression-Associated Factors From a Nationwide Population-Based Survey: Epidemiological Study Using Machine Learning and Network Analysis

Journal of Medical Internet Research ◽

10.2196/27344 ◽

2021 ◽

Vol 23 (6) ◽

pp. e27344

Author(s):

Sang Min Nam ◽

Thomas A Peterson ◽

Kyoung Yul Seo ◽

Hyun Wook Han ◽

Jee In Kang

Keyword(s):

Machine Learning ◽

Risk Factors ◽

Network Analysis ◽

Survey Data ◽

Associated Factors ◽

Statistical Tests ◽

Epidemiological Studies ◽

Gradient Boosting ◽

Data Set ◽

Extreme Gradient Boosting

Background In epidemiological studies, finding the best subset of factors is challenging when the number of explanatory variables is large. Objective Our study had two aims. First, we aimed to identify essential depression-associated factors using the extreme gradient boosting (XGBoost) machine learning algorithm from big survey data (the Korea National Health and Nutrition Examination Survey, 2012-2016). Second, we aimed to achieve a comprehensive understanding of multifactorial features in depression using network analysis. Methods An XGBoost model was trained and tested to classify “current depression” and “no lifetime depression” for a data set of 120 variables for 12,596 cases. The optimal XGBoost hyperparameters were set by an automated machine learning tool (TPOT), and a high-performance sparse model was obtained by feature selection using the feature importance value of XGBoost. We performed statistical tests on the model and nonmodel factors using survey-weighted multiple logistic regression and drew a correlation network among factors. We also adopted statistical tests for the confounder or interaction effect of selected risk factors when it was suspected on the network. Results The XGBoost-derived depression model consisted of 18 factors with an area under the weighted receiver operating characteristic curve of 0.86. Two nonmodel factors could be found using the model factors, and the factors were classified into direct (P<.05) and indirect (P≥.05), according to the statistical significance of the association with depression. Perceived stress and asthma were the most remarkable risk factors, and urine specific gravity was a novel protective factor. The depression-factor network showed clusters of socioeconomic status and quality of life factors and suggested that educational level and sex might be predisposing factors. Indirect factors (eg, diabetes, hypercholesterolemia, and smoking) were involved in confounding or interaction effects of direct factors. Triglyceride level was a confounder of hypercholesterolemia and diabetes, smoking had a significant risk in females, and weight gain was associated with depression involving diabetes. Conclusions XGBoost and network analysis were useful to discover depression-related factors and their relationships and can be applied to epidemiological studies using big survey data.

Download Full-text

Importance of GWAS risk loci and clinical data in predicting asthma using machine-learning approaches

10.21203/rs.3.rs-21271/v1 ◽

2020 ◽

Author(s):

Si-Qiao Liang ◽

Jian-Xiong Long ◽

Jingmin Deng ◽

Xuan Wei ◽

Mei-Ling Yang ◽

...

Keyword(s):

Machine Learning ◽

Risk Factors ◽

Clinical Data ◽

Genome Wide Association Study ◽

Prediction Models ◽

Area Under The Curve ◽

Gradient Boosting ◽

Support Vector ◽

Learning Approaches ◽

Extreme Gradient Boosting

Abstract Asthma is a serious immune-mediated respiratory airway disease. Its pathological processes involve genetics and the environment, but it remains unclear. To understand the risk factors of asthma, we combined genome-wide association study (GWAS) risk loci and clinical data in predicting asthma using machine-learning approaches. A case–control study with 123 asthma patients and 100 healthy controls was conducted in Zhuang population in Guangxi. GWAS risk loci were detected using polymerase chain reaction, and clinical data were collected. Machine-learning approaches (e.g., extreme gradient boosting [XGBoost], decision tree, support vector machine, and random forest algorithms) were used to identify the major factors that contributed to asthma. A total of 14 GWAS risk loci with clinical data were analyzed on the basis of 10 times of 10-fold cross-validation for all machine-learning models. Using GWAS risk loci or clinical data, the best performances were area under the curve (AUC) values of 64.3% and 71.4%, respectively. Combining GWAS risk loci and clinical data, the XGBoost established the best model with an AUC of 79.7%, indicating that the combination of genetics and clinical data can enable improved performance. We then sorted the importance of features and found that the top six risk factors for predicting asthma were rs3117098, rs7775228, family history, rs2305480, rs4833095, and body mass index. Asthma-prediction models based on GWAS risk loci and clinical data can accurately predict asthma and thus provide insights into the disease pathogenesis of asthma. Further research is required to evaluate more genetic markers and clinical data and predict asthma risk.

Download Full-text

Predicting Undesired Treatment Outcome in Mental Healthcare: Machine Learning Study (Preprint)

10.2196/preprints.17235 ◽

2019 ◽

Author(s):

Kasper Van Mens ◽

Joran Lokkerbol ◽

Richard Janssen ◽

Robert de Lange ◽

Bea Tiemens

Keyword(s):

Machine Learning ◽

Treatment Outcome ◽

Mental Health Treatment ◽

Mental Healthcare ◽

Machine Learning Algorithms ◽

Gradient Boosting ◽

Trade Off ◽

Trade Offs ◽

Outcome Monitoring ◽

Extreme Gradient Boosting

BACKGROUND It remains a challenge to predict which treatment will work for which patient in mental healthcare. OBJECTIVE In this study we compare machine algorithms to predict during treatment which patients will not benefit from brief mental health treatment and present trade-offs that must be considered before an algorithm can be used in clinical practice. METHODS Using an anonymized dataset containing routine outcome monitoring data from a mental healthcare organization in the Netherlands (n = 2,655), we applied three machine learning algorithms to predict treatment outcome. The algorithms were internally validated with cross-validation on a training sample (n = 1,860) and externally validated on an unseen test sample (n = 795). RESULTS The performance of the three algorithms did not significantly differ on the test set. With a default classification cut-off at 0.5 predicted probability, the extreme gradient boosting algorithm showed the highest positive predictive value (ppv) of 0.71(0.61 – 0.77) with a sensitivity of 0.35 (0.29 – 0.41) and area under the curve of 0.78. A trade-off can be made between ppv and sensitivity by choosing different cut-off probabilities. With a cut-off at 0.63, the ppv increased to 0.87 and the sensitivity dropped to 0.17. With a cut-off of at 0.38, the ppv decreased to 0.61 and the sensitivity increased to 0.57. CONCLUSIONS Machine learning can be used to predict treatment outcomes based on routine monitoring data.This allows practitioners to choose their own trade-off between being selective and more certain versus inclusive and less certain.

Download Full-text

Evaluation of Three Different Machine Learning Methods for Object-Based Artificial Terrace Mapping—A Case Study of the Loess Plateau, China

Remote Sensing ◽

10.3390/rs13051021 ◽

2021 ◽

Vol 13 (5) ◽

pp. 1021

Author(s):

Hu Ding ◽

Jiaming Na ◽

Shangjing Jiang ◽

Jie Zhu ◽

Kai Liu ◽

...

Keyword(s):

Machine Learning ◽

Random Forest ◽

Loess Plateau ◽

Water Conservation ◽

Nearest Neighbor ◽

Gradient Boosting ◽

K Nearest Neighbor ◽

The Loess Plateau ◽

Object Based ◽

Extreme Gradient Boosting

Artificial terraces are of great importance for agricultural production and soil and water conservation. Automatic high-accuracy mapping of artificial terraces is the basis of monitoring and related studies. Previous research achieved artificial terrace mapping based on high-resolution digital elevation models (DEMs) or imagery. As a result of the importance of the contextual information for terrace mapping, object-based image analysis (OBIA) combined with machine learning (ML) technologies are widely used. However, the selection of an appropriate classifier is of great importance for the terrace mapping task. In this study, the performance of an integrated framework using OBIA and ML for terrace mapping was tested. A catchment, Zhifanggou, in the Loess Plateau, China, was used as the study area. First, optimized image segmentation was conducted. Then, features from the DEMs and imagery were extracted, and the correlations between the features were analyzed and ranked for classification. Finally, three different commonly-used ML classifiers, namely, extreme gradient boosting (XGBoost), random forest (RF), and k-nearest neighbor (KNN), were used for terrace mapping. The comparison with the ground truth, as delineated by field survey, indicated that random forest performed best, with a 95.60% overall accuracy (followed by 94.16% and 92.33% for XGBoost and KNN, respectively). The influence of class imbalance and feature selection is discussed. This work provides a credible framework for mapping artificial terraces.

Download Full-text

A Machine Learning Method for Predicting Vegetation Indices in China

Remote Sensing ◽

10.3390/rs13061147 ◽

2021 ◽

Vol 13 (6) ◽

pp. 1147

Author(s):

Xiangqian Li ◽

Wenping Yuan ◽

Wenjie Dong

Keyword(s):

Machine Learning ◽

Growing Season ◽

Crop Growth ◽

Spatiotemporal Distribution ◽

Coefficient Of Determination ◽

Gradient Boosting ◽

Severe Drought ◽

Vegetation Growth ◽

Extreme Gradient Boosting ◽

Boosting Method

To forecast the terrestrial carbon cycle and monitor food security, vegetation growth must be accurately predicted; however, current process-based ecosystem and crop-growth models are limited in their effectiveness. This study developed a machine learning model using the extreme gradient boosting method to predict vegetation growth throughout the growing season in China from 2001 to 2018. The model used satellite-derived vegetation data for the first month of each growing season, CO2 concentration, and several meteorological factors as data sources for the explanatory variables. Results showed that the model could reproduce the spatiotemporal distribution of vegetation growth as represented by the satellite-derived normalized difference vegetation index (NDVI). The predictive error for the growing season NDVI was less than 5% for more than 98% of vegetated areas in China; the model represented seasonal variations in NDVI well. The coefficient of determination (R2) between the monthly observed and predicted NDVI was 0.83, and more than 69% of vegetated areas had an R2 > 0.8. The effectiveness of the model was examined for a severe drought year (2009), and results showed that the model could reproduce the spatiotemporal distribution of NDVI even under extreme conditions. This model provides an alternative method for predicting vegetation growth and has great potential for monitoring vegetation dynamics and crop growth.

Download Full-text

Machine learning models to identify low adherence to influenza vaccination among Korean adults with cardiovascular disease

BMC Cardiovascular Disorders ◽

10.1186/s12872-021-01925-7 ◽

2021 ◽

Vol 21 (1) ◽

Author(s):

Moojung Kim ◽

Young Jae Kim ◽

Sung Jin Park ◽

Kwang Gi Kim ◽

Pyung Chun Oh ◽

...

Keyword(s):

Machine Learning ◽

Cardiovascular Disease ◽

Influenza Vaccination ◽

Machine Learning Techniques ◽

Gradient Boosting ◽

Support Vector ◽

Age Group ◽

Learning Models ◽

Extreme Gradient Boosting ◽

Machine Learning Models

Abstract Background Annual influenza vaccination is an important public health measure to prevent influenza infections and is strongly recommended for cardiovascular disease (CVD) patients, especially in the current coronavirus disease 2019 (COVID-19) pandemic. The aim of this study is to develop a machine learning model to identify Korean adult CVD patients with low adherence to influenza vaccination Methods Adults with CVD (n = 815) from a nationally representative dataset of the Fifth Korea National Health and Nutrition Examination Survey (KNHANES V) were analyzed. Among these adults, 500 (61.4%) had answered "yes" to whether they had received seasonal influenza vaccinations in the past 12 months. The classification process was performed using the logistic regression (LR), random forest (RF), support vector machine (SVM), and extreme gradient boosting (XGB) machine learning techniques. Because the Ministry of Health and Welfare in Korea offers free influenza immunization for the elderly, separate models were developed for the < 65 and ≥ 65 age groups. Results The accuracy of machine learning models using 16 variables as predictors of low influenza vaccination adherence was compared; for the ≥ 65 age group, XGB (84.7%) and RF (84.7%) have the best accuracies, followed by LR (82.7%) and SVM (77.6%). For the < 65 age group, SVM has the best accuracy (68.4%), followed by RF (64.9%), LR (63.2%), and XGB (61.4%). Conclusions The machine leaning models show comparable performance in classifying adult CVD patients with low adherence to influenza vaccination.

Download Full-text

Prediction of population behavior of Listeria monocytogenes in food using machine learning and a microbial growth and survival database

Scientific Reports ◽

10.1038/s41598-021-90164-z ◽

2021 ◽

Vol 11 (1) ◽

Author(s):

Satoko Hiura ◽

Shige Koseki ◽

Kento Koyama

Keyword(s):

Machine Learning ◽

Data Mining ◽

Listeria Monocytogenes ◽

Water Activity ◽

Bacterial Population ◽

Gradient Boosting ◽

Initial Cell ◽

Data Mining Approach ◽

Cell Counts ◽

Extreme Gradient Boosting

AbstractIn predictive microbiology, statistical models are employed to predict bacterial population behavior in food using environmental factors such as temperature, pH, and water activity. As the amount and complexity of data increase, handling all data with high-dimensional variables becomes a difficult task. We propose a data mining approach to predict bacterial behavior using a database of microbial responses to food environments. Listeria monocytogenes, which is one of pathogens, population growth and inactivation data under 1,007 environmental conditions, including five food categories (beef, culture medium, pork, seafood, and vegetables) and temperatures ranging from 0 to 25 °C, were obtained from the ComBase database (www.combase.cc). We used eXtreme gradient boosting tree, a machine learning algorithm, to predict bacterial population behavior from eight explanatory variables: ‘time’, ‘temperature’, ‘pH’, ‘water activity’, ‘initial cell counts’, ‘whether the viable count is initial cell number’, and two types of categories regarding food. The root mean square error of the observed and predicted values was approximately 1.0 log CFU regardless of food category, and this suggests the possibility of predicting viable bacterial counts in various foods. The data mining approach examined here will enable the prediction of bacterial population behavior in food by identifying hidden patterns within a large amount of data.

Download Full-text