Predicting Adverse Drug Events in Chinese Pediatric Inpatients With the Associated Risk Factors: A Machine Learning Study

2021 ◽  
Vol 12 ◽  
Author(s):  
Ze Yu ◽  
Huanhuan Ji ◽  
Jianwen Xiao ◽  
Ping Wei ◽  
Lin Song ◽  
...  

The aim of this study was to apply machine learning methods to deeply explore the risk factors associated with adverse drug events (ADEs) and predict the occurrence of ADEs in Chinese pediatric inpatients. Data from 1,746 patients aged between 28 days and 18 years (mean age = 3.84 years) were included in the study from January 1, 2013, to December 31, 2015, at the Children’s Hospital of Chongqing Medical University. There were 247 cases of ADE occurrence, and the drugs most commonly inducing ADEs were antibacterials. Seven algorithms, namely, eXtreme Gradient Boosting (XGBoost), CatBoost, AdaBoost, LightGBM, Random Forest (RF), Gradient Boosting Decision Tree (GBDT), and TPOT, were used to select the important risk factors, and GBDT was chosen to establish the prediction model because it showed the best predictive performance (precision = 44%, recall = 25%, F1 = 31.88%). The GBDT model performed better than Global Trigger Tools (GTTs) for ADE prediction (precision: 44% vs. 13.3%). In addition, multiple risk factors were identified via GBDT, such as the number of trigger true (TT) (+), number of doses, BMI, number of drugs, number of admissions, height, length of hospital stay, weight, age, and number of diagnoses. The direction of each risk factor's influence on ADEs was displayed through Shapley Additive exPlanations (SHAP). This study provides a novel method to accurately predict adverse drug events in Chinese pediatric inpatients with the associated risk factors, which may be applicable in clinical practice in the future.
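As a rough illustration of the pipeline described above, the sketch below fits scikit-learn's GradientBoostingClassifier (a stand-in for the study's GBDT implementation) and scores precision and recall on a held-out split. The cohort size matches the study, but the risk-factor matrix and outcome are synthetic, not the study's data.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the cohort: 1,746 patients, 10 invented risk
# factors; only the first two actually influence the outcome here.
rng = np.random.default_rng(0)
X = rng.normal(size=(1746, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1746) > 1.8).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)
gbdt = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
pred = gbdt.predict(X_te)
precision = precision_score(y_te, pred, zero_division=0)
recall = recall_score(y_te, pred, zero_division=0)
print(round(precision, 2), round(recall, 2))
```

The SHAP step in the study (showing each factor's direction of influence) would typically be run on the fitted tree model afterwards, e.g. with the shap package's TreeExplainer.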

2020 ◽  
Vol 2020 ◽  
pp. 1-12
Author(s):  
Mingyue Xue ◽  
Yinxia Su ◽  
Chen Li ◽  
Shuxia Wang ◽  
Hua Yao

Background. An estimated 425 million people globally have diabetes, accounting for 12% of the world’s health expenditures, and the number continues to grow, placing a huge burden on the healthcare system, especially in remote, underserved areas. Methods. A total of 584,168 adult subjects who had participated in the national physical examination were enrolled in this study. The risk factors for type II diabetes mellitus (T2DM) were identified by p values and odds ratios, using logistic regression (LR) based on variables from physical measurements and a questionnaire. Combined with the risk factors selected by LR, we used a decision tree, a random forest, AdaBoost with a decision tree (AdaBoost), and an extreme gradient boosting decision tree (XGBoost) to identify individuals with T2DM, compared the performance of the four machine learning classifiers, and used the best-performing classifier to output the variable importance scores for T2DM. Results. The results indicated that XGBoost had the best performance (accuracy=0.906, precision=0.910, recall=0.902, F1=0.906, and AUC=0.968). The variable importance scores in XGBoost showed that BMI was the most significant feature, followed by age, waist circumference, systolic pressure, ethnicity, smoking amount, fatty liver, hypertension, physical activity, drinking status, dietary ratio (meat to vegetables), drink amount, smoking status, and diet habit (oil loving). Conclusions. We proposed a classifier based on LR-XGBoost which uses fourteen easily obtained, noninvasive patient variables as predictors to identify potential incidents of T2DM. The classifier can accurately screen for diabetes risk at an early phase, and the variable importance scores offer clues for preventing diabetes occurrence.
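The two-stage LR-XGBoost design can be sketched as follows: rank variables by how far their logistic-regression odds ratios depart from 1.0, then train a boosted-tree classifier on the selected subset. This is an illustrative approximation on synthetic data, with scikit-learn's GradientBoostingClassifier standing in for XGBoost.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in: 2,000 subjects, 20 candidate variables, two of
# which (columns 0 and 3) actually drive the outcome.
rng = np.random.default_rng(1)
X = rng.normal(size=(2000, 20))
y = (X[:, 0] - X[:, 3] + rng.normal(size=2000) > 0).astype(int)

# Stage 1: rank variables by |log odds ratio| from logistic regression.
lr = LogisticRegression(max_iter=1000).fit(X, y)
odds_ratios = np.exp(lr.coef_.ravel())
selected = np.argsort(np.abs(np.log(odds_ratios)))[::-1][:5]

# Stage 2: train a boosted-tree classifier on the selected variables only.
clf = GradientBoostingClassifier(random_state=0).fit(X[:, selected], y)
acc = clf.score(X[:, selected], y)
print(sorted(selected.tolist()[:2]), round(acc, 3))
```

In the study the selection step used p values as well as odds ratios; the magnitude-only ranking here is a simplification.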


2021 ◽  
Author(s):  
Anmin Hu ◽  
Hui-Ping Li ◽  
Zhen Li ◽  
Zhongjun Zhang ◽  
Xiong-Xiong Zhong

Abstract Purpose: The aim of this study was to use machine learning to construct a model for the analysis of risk factors and prediction of delirium among ICU patients. Methods: We developed a set of real-world data to enable the comparison of the reliability and accuracy of delirium prediction models built from the MIMIC-III database, the MIMIC-IV database, and the eICU Collaborative Research Database. Significance tests, correlation analysis, and factor analysis were used to individually screen 80 potential risk factors. The predictive algorithms were run using the following models: logistic regression, naive Bayes, K-nearest neighbors, support vector machine, random forest, and eXtreme Gradient Boosting. The conventional E-PRE-DELIRIC model and eighteen machine learning models, including all-factor (AF) models with all potential variables, characteristic-variable (CV) models with principal component factors, and rapid predictive (RP) models without laboratory test results, were used to construct the risk prediction model for delirium. The performance of these machine learning models was measured by the area under the receiver operating characteristic curve (AUC) under tenfold cross-validation. Variable importance measures (VIMs) and SHAP, algorithms for feature interpretation and sample-level prediction interpretation of black-box machine learning models, were implemented. Results: A total of 78,365 patients were enrolled in this study, 22,159 of whom (28.28%) had positive delirium records. The E-PRE-DELIRIC model (AUC, 0.77), AF models (AUC, 0.77-0.93), CV models (AUC, 0.77-0.88), and RP models (AUC, 0.75-0.87) had discriminatory value. The random forest CV model found that the top five factors accounting for the weight of delirium were length of ICU stay, verbal response score, APACHE-III score, urine volume, and hemoglobin.
The SHAP values in the eXtreme Gradient Boosting CV model showed that the top three features negatively correlated with outcomes were verbal response score, urine volume, and hemoglobin; the top three features positively correlated with outcomes were length of ICU stay, APACHE-III score, and alanine transaminase. Conclusion: Even with a small number of variables, machine learning has a good ability to predict delirium in critically ill patients. Characteristic variables provide direction for early intervention to reduce the risk of delirium.
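The tenfold cross-validated AUC comparison used for these models can be sketched generically. The data below are synthetic with roughly the cohort's 28% positive rate, and only three of the listed algorithms are shown.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

# Synthetic cohort with roughly the study's 28% delirium prevalence.
X, y = make_classification(n_samples=1000, n_features=15,
                           weights=[0.72, 0.28], random_state=0)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
aucs = {}
for name, est in [("logistic regression", LogisticRegression(max_iter=1000)),
                  ("naive Bayes", GaussianNB()),
                  ("k-nearest neighbors", KNeighborsClassifier())]:
    aucs[name] = cross_val_score(est, X, y, cv=cv, scoring="roc_auc").mean()
print({k: round(v, 3) for k, v in aucs.items()})
```

Stratified folds keep the class ratio stable across splits, which matters when one class is only ~28% of the data.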


10.2196/27344 ◽  
2021 ◽  
Author(s):  
Sang Min Nam ◽  
Thomas A Peterson ◽  
Kyoung Yul Seo ◽  
Hyun Wook Han ◽  
Jee In Kang

BACKGROUND In epidemiological studies, finding the best subset of factors is challenging when the number of explanatory variables is large. OBJECTIVE Our study had two aims. First, we aimed to identify essential depression-associated factors using the extreme gradient boosting (XGBoost) machine learning algorithm from big survey data (the Korea National Health and Nutrition Examination Survey, 2012-2016). Second, we aimed to achieve a comprehensive understanding of multifactorial features in depression using network analysis. METHODS An XGBoost model was trained and tested to classify “current depression” and “no lifetime depression” for a data set of 120 variables for 12,596 cases. The optimal XGBoost hyperparameters were set by an automated machine learning tool (TPOT), and a high-performance sparse model was obtained by feature selection using the feature importance value of XGBoost. We performed statistical tests on the model and nonmodel factors using survey-weighted multiple logistic regression and drew a correlation network among factors. We also adopted statistical tests for the confounder or interaction effect of selected risk factors when it was suspected on the network. RESULTS The XGBoost-derived depression model consisted of 18 factors with an area under the weighted receiver operating characteristic curve of 0.86. Two nonmodel factors could be found using the model factors, and the factors were classified into direct (P<.05) and indirect (P≥.05), according to the statistical significance of the association with depression. Perceived stress and asthma were the most remarkable risk factors, and urine specific gravity was a novel protective factor. The depression-factor network showed clusters of socioeconomic status and quality of life factors and suggested that educational level and sex might be predisposing factors.
Indirect factors (eg, diabetes, hypercholesterolemia, and smoking) were involved in confounding or interaction effects of direct factors. Triglyceride level was a confounder of hypercholesterolemia and diabetes, smoking had a significant risk in females, and weight gain was associated with depression involving diabetes. CONCLUSIONS XGBoost and network analysis were useful to discover depression-related factors and their relationships and can be applied to epidemiological studies using big survey data.
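A correlation network among factors, as used in the network analysis above, can be sketched by thresholding a pairwise correlation matrix and keeping the surviving pairs as edges. The factor names and the correlation structure below are hypothetical.

```python
import numpy as np

# Hypothetical standardized survey factors; one correlated pair is induced.
rng = np.random.default_rng(2)
factors = ["stress", "asthma", "education", "sex", "triglyceride"]
X = rng.normal(size=(500, 5))
X[:, 4] += 0.6 * X[:, 0]  # make "triglyceride" co-vary with "stress"

corr = np.corrcoef(X, rowvar=False)
# keep an edge wherever |r| exceeds a chosen threshold
edges = [(factors[i], factors[j], round(corr[i, j], 2))
         for i in range(len(factors)) for j in range(i + 1, len(factors))
         if abs(corr[i, j]) > 0.3]
print(edges)
```

In practice the edge list would be drawn as a graph (e.g. with networkx), and suspiciously connected triples would prompt the confounder/interaction tests described above.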


10.2196/23147 ◽  
2021 ◽  
Vol 9 (2) ◽  
pp. e23147
Author(s):  
Yong-Yeon Jo ◽  
JaiHong Han ◽  
Hyun Woo Park ◽  
Hyojung Jung ◽  
Jae Dong Lee ◽  
...  

Background Postoperative length of stay is a key indicator in the management of medical resources and an indirect predictor of the incidence of surgical complications and the degree of recovery of the patient after cancer surgery. Recently, machine learning has been used to predict complex medical outcomes, such as prolonged length of hospital stay, using extensive medical information. Objective The objective of this study was to develop a prediction model for prolonged length of stay after cancer surgery using a machine learning approach. Methods In our retrospective study, electronic health records (EHRs) from 42,751 patients who underwent primary surgery for 17 types of cancer between January 1, 2000, and December 31, 2017, were sourced from a single cancer center. The EHRs included numerous variables such as surgical factors, cancer factors, underlying diseases, functional laboratory assessments, general assessments, medications, and social factors. To predict prolonged length of stay after cancer surgery, we employed extreme gradient boosting classifier, multilayer perceptron, and logistic regression models. Prolonged postoperative length of stay for cancer was defined as bed-days of the group of patients who accounted for the top 50% of the distribution of bed-days by cancer type. Results In the prediction of prolonged length of stay after cancer surgery, extreme gradient boosting classifier models demonstrated excellent performance for kidney and bladder cancer surgeries (area under the receiver operating characteristic curve [AUC] >0.85). A moderate performance (AUC 0.70-0.85) was observed for stomach, breast, colon, thyroid, prostate, cervix uteri, corpus uteri, and oral cancers. For stomach, breast, colon, thyroid, and lung cancers, with more than 4000 cases each, the extreme gradient boosting classifier model showed slightly better performance than the logistic regression model, although the logistic regression model also performed adequately. 
We identified risk variables for the prediction of prolonged postoperative length of stay for each type of cancer, and the importance of the variables differed depending on the cancer type. After we added operative time to the models trained on preoperative factors, the models generally outperformed the corresponding models using only preoperative variables. Conclusions A machine learning approach using EHRs may improve the prediction of prolonged length of hospital stay after primary cancer surgery. This algorithm may help to provide a more effective allocation of medical resources in cancer surgery.
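The outcome definition above (prolonged stay = the top 50% of the bed-day distribution within each cancer type) amounts to a per-group median split, which can be sketched with pandas on toy records:

```python
import pandas as pd

# Toy bed-day records for two cancer types.
df = pd.DataFrame({
    "cancer_type": ["stomach"] * 4 + ["thyroid"] * 4,
    "bed_days": [5, 7, 9, 12, 2, 3, 4, 6],
})

# Prolonged stay = bed-days above the median within each cancer type,
# i.e. the top 50% of the per-type distribution.
median = df.groupby("cancer_type")["bed_days"].transform("median")
df["prolonged"] = (df["bed_days"] > median).astype(int)
print(df["prolonged"].tolist())  # → [0, 0, 1, 1, 0, 0, 1, 1]
```

Splitting per cancer type rather than globally keeps the label meaningful across surgeries with very different typical stays.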



2020 ◽  
Author(s):  
Si-Qiao Liang ◽  
Jian-Xiong Long ◽  
Jingmin Deng ◽  
Xuan Wei ◽  
Mei-Ling Yang ◽  
...  

Abstract Asthma is a serious immune-mediated respiratory airway disease. Its pathological processes involve genetics and the environment, but they remain incompletely understood. To understand the risk factors of asthma, we combined genome-wide association study (GWAS) risk loci and clinical data to predict asthma using machine-learning approaches. A case–control study with 123 asthma patients and 100 healthy controls was conducted in the Zhuang population of Guangxi. GWAS risk loci were detected using polymerase chain reaction, and clinical data were collected. Machine-learning approaches (e.g., extreme gradient boosting [XGBoost], decision tree, support vector machine, and random forest algorithms) were used to identify the major factors that contribute to asthma. A total of 14 GWAS risk loci with clinical data were analyzed on the basis of 10 repetitions of 10-fold cross-validation for all machine-learning models. Using GWAS risk loci or clinical data alone, the best performances were area under the curve (AUC) values of 64.3% and 71.4%, respectively. Combining GWAS risk loci and clinical data, XGBoost established the best model with an AUC of 79.7%, indicating that the combination of genetics and clinical data improves performance. We then ranked the importance of features and found that the top six risk factors for predicting asthma were rs3117098, rs7775228, family history, rs2305480, rs4833095, and body mass index. Asthma-prediction models based on GWAS risk loci and clinical data can accurately predict asthma and thus provide insights into its pathogenesis. Further research is required to evaluate more genetic markers and clinical data and to predict asthma risk.
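Combining genotype-coded GWAS loci with clinical variables and evaluating with 10 repetitions of 10-fold cross-validation can be sketched as follows. The data are synthetic (matching only the study's sample size and locus count), and scikit-learn's GradientBoostingClassifier stands in for XGBoost.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

rng = np.random.default_rng(3)
n = 223  # 123 cases + 100 controls, as in the study
snps = rng.integers(0, 3, size=(n, 14))  # 14 risk loci coded 0/1/2 by allele count
clinical = rng.normal(size=(n, 3))       # invented clinical variables (e.g. BMI)
y = (snps[:, 0] + clinical[:, 0] + rng.normal(size=n) > 1.5).astype(int)

# Concatenate genetic and clinical features into one design matrix.
X = np.hstack([snps, clinical])
cv = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=0)
auc = cross_val_score(GradientBoostingClassifier(random_state=0), X, y,
                      cv=cv, scoring="roc_auc").mean()
print(round(auc, 3))
```

With only ~223 samples, repeating the 10-fold split 10 times (as the study did) reduces the variance of the AUC estimate.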


Author(s):  
Qi Feng ◽  
Walther Maier ◽  
Thomas Stehle ◽  
Hans-Christian Möhring

Abstract Fixtures are an important element of the manufacturing system, as they ensure productive and accurate machining of differently shaped workpieces. In fixture design and the layout of fixture elements, a high static and dynamic stiffness of fixtures is therefore required to ensure the defined position and orientation of workpieces under process loads, e.g. cutting forces. Nowadays, with the increase in computing performance and the development of new algorithms, machine learning (ML) offers an appropriate way to use regression methods for creating realistic, fast, and reliable equivalent ML models in place of simulations based on the finite element method (FEM). This research work introduces a novel method that allows the optimization of clamping concepts and fixture design by means of ML, in order to reduce manufacturing errors and to obtain increased fixture stiffness and machining accuracy. This paper describes the preparation of a dataset for training ML models, the systematic selection of the most promising regression algorithm based on relevant criteria, the implementation of the chosen algorithm, Extreme Gradient Boosting (XGBoost), alongside other comparable algorithms, the analysis of their regression results, and the validation of the optimization for a selected clamping concept.
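The surrogate-modeling idea (regress an ML model on FEM results so new fixture layouts can be scored without rerunning the simulation) can be sketched as below. The layout parameters and the displacement function standing in for FEM output are invented for illustration, and scikit-learn's GradientBoostingRegressor stands in for XGBoost.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
# Hypothetical fixture layout parameters: clamp positions and forces.
X = rng.uniform(0, 1, size=(800, 4))
# Smooth analytic stand-in for FEM-computed workpiece displacement.
y = 0.5 * X[:, 0] ** 2 + 0.3 * np.sin(3 * X[:, 1]) + 0.1 * X[:, 2] * X[:, 3]

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
surrogate = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
r2 = r2_score(y_te, surrogate.predict(X_te))
print(round(r2, 3))
```

Once trained, the surrogate evaluates in microseconds, so it can sit inside an optimization loop over clamping layouts where each FEM run would take minutes.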


2019 ◽  
Author(s):  
Kasper Van Mens ◽  
Joran Lokkerbol ◽  
Richard Janssen ◽  
Robert de Lange ◽  
Bea Tiemens

BACKGROUND It remains a challenge to predict which treatment will work for which patient in mental healthcare. OBJECTIVE In this study we compare machine learning algorithms that predict, during treatment, which patients will not benefit from brief mental health treatment, and we present trade-offs that must be considered before an algorithm can be used in clinical practice. METHODS Using an anonymized dataset containing routine outcome monitoring data from a mental healthcare organization in the Netherlands (n = 2,655), we applied three machine learning algorithms to predict treatment outcome. The algorithms were internally validated with cross-validation on a training sample (n = 1,860) and externally validated on an unseen test sample (n = 795). RESULTS The performance of the three algorithms did not significantly differ on the test set. With a default classification cut-off at 0.5 predicted probability, the extreme gradient boosting algorithm showed the highest positive predictive value (PPV) of 0.71 (0.61-0.77), with a sensitivity of 0.35 (0.29-0.41) and an area under the curve of 0.78. A trade-off can be made between PPV and sensitivity by choosing different cut-off probabilities. With a cut-off at 0.63, the PPV increased to 0.87 and the sensitivity dropped to 0.17. With a cut-off at 0.38, the PPV decreased to 0.61 and the sensitivity increased to 0.57. CONCLUSIONS Machine learning can be used to predict treatment outcomes based on routine monitoring data. This allows practitioners to choose their own trade-off between being selective and more certain versus inclusive and less certain.
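The cut-off trade-off works by thresholding predicted probabilities at different points and recomputing PPV and sensitivity at each. The sketch below uses the study's sample sizes but invented features, so the exact numbers will differ; only the direction of the trade-off carries over.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data with the study's sample sizes (n=2,655; test n=795);
# the features are invented, not the routine outcome monitoring variables.
X, y = make_classification(n_samples=2655, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=795, random_state=0)
proba = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

results = {}
for cutoff in (0.38, 0.50, 0.63):
    pred = (proba >= cutoff).astype(int)
    tp = int(np.sum((pred == 1) & (y_te == 1)))
    ppv = tp / max(int(pred.sum()), 1)   # precision among flagged patients
    sens = tp / int((y_te == 1).sum())   # share of true cases caught
    results[cutoff] = (round(ppv, 2), round(sens, 2))
print(results)
```

Raising the cut-off flags fewer, more certain patients (higher PPV, lower sensitivity); lowering it flags more patients at the cost of more false positives.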


2021 ◽  
Vol 13 (5) ◽  
pp. 1021
Author(s):  
Hu Ding ◽  
Jiaming Na ◽  
Shangjing Jiang ◽  
Jie Zhu ◽  
Kai Liu ◽  
...  

Artificial terraces are of great importance for agricultural production and soil and water conservation. Automatic high-accuracy mapping of artificial terraces is the basis of monitoring and related studies. Previous research achieved artificial terrace mapping based on high-resolution digital elevation models (DEMs) or imagery. Because contextual information is important for terrace mapping, object-based image analysis (OBIA) combined with machine learning (ML) technologies is widely used. However, the selection of an appropriate classifier is of great importance for the terrace mapping task. In this study, the performance of an integrated framework using OBIA and ML for terrace mapping was tested. A catchment, Zhifanggou, in the Loess Plateau, China, was used as the study area. First, optimized image segmentation was conducted. Then, features from the DEMs and imagery were extracted, and the correlations between the features were analyzed and ranked for classification. Finally, three commonly used ML classifiers, namely, extreme gradient boosting (XGBoost), random forest (RF), and k-nearest neighbor (KNN), were used for terrace mapping. Comparison with the ground truth, as delineated by field survey, indicated that random forest performed best, with a 95.60% overall accuracy (followed by 94.16% and 92.33% for XGBoost and KNN, respectively). The influence of class imbalance and feature selection is discussed. This work provides a credible framework for mapping artificial terraces.
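Comparing the three classifiers on a common train/test split can be sketched as follows, with synthetic per-object features in place of the real terrain and spectral attributes, and scikit-learn's GradientBoostingClassifier standing in for XGBoost.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for per-object terrain and spectral features
# extracted after image segmentation.
X, y = make_classification(n_samples=1500, n_features=12, n_informative=6,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

accs = {}
for name, clf in [("RF", RandomForestClassifier(random_state=0)),
                  ("GBDT", GradientBoostingClassifier(random_state=0)),
                  ("KNN", KNeighborsClassifier())]:
    accs[name] = accuracy_score(y_te, clf.fit(X_tr, y_tr).predict(X_te))
print({k: round(v, 3) for k, v in accs.items()})
```

Overall accuracy is used here to mirror the paper's headline metric; with imbalanced classes, per-class metrics would be the safer comparison.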


2021 ◽  
Vol 13 (6) ◽  
pp. 1147
Author(s):  
Xiangqian Li ◽  
Wenping Yuan ◽  
Wenjie Dong

To forecast the terrestrial carbon cycle and monitor food security, vegetation growth must be accurately predicted; however, current process-based ecosystem and crop-growth models are limited in their effectiveness. This study developed a machine learning model using the extreme gradient boosting method to predict vegetation growth throughout the growing season in China from 2001 to 2018. The model used satellite-derived vegetation data for the first month of each growing season, CO2 concentration, and several meteorological factors as data sources for the explanatory variables. Results showed that the model could reproduce the spatiotemporal distribution of vegetation growth as represented by the satellite-derived normalized difference vegetation index (NDVI). The predictive error for the growing season NDVI was less than 5% for more than 98% of vegetated areas in China; the model represented seasonal variations in NDVI well. The coefficient of determination (R2) between the monthly observed and predicted NDVI was 0.83, and more than 69% of vegetated areas had an R2 > 0.8. The effectiveness of the model was examined for a severe drought year (2009), and results showed that the model could reproduce the spatiotemporal distribution of NDVI even under extreme conditions. This model provides an alternative method for predicting vegetation growth and has great potential for monitoring vegetation dynamics and crop growth.
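The "predictive error less than 5%" result is a per-pixel relative error aggregated over vegetated areas. A sketch of how such a share would be computed (the NDVI fields here are simulated, not model output):

```python
import numpy as np

rng = np.random.default_rng(5)
# Simulated growing-season NDVI for 10,000 pixels: an "observed" field
# and a "predicted" field with small multiplicative error.
ndvi_obs = rng.uniform(0.2, 0.9, size=10_000)
ndvi_pred = ndvi_obs * (1 + rng.normal(scale=0.02, size=10_000))

# Share of pixels whose relative prediction error is below 5%.
rel_err = np.abs(ndvi_pred - ndvi_obs) / ndvi_obs
share_within_5pct = float((rel_err < 0.05).mean())
print(round(share_within_5pct, 3))
```

Relative (rather than absolute) error keeps sparsely vegetated, low-NDVI pixels from being trivially "accurate".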

