Identification of Potential Type II Diabetes in a Large-Scale Chinese Population Using a Systematic Machine Learning Framework

2020, Vol 2020, pp. 1-12
Author(s): Mingyue Xue, Yinxia Su, Chen Li, Shuxia Wang, Hua Yao

Background. An estimated 425 million people globally have diabetes, accounting for 12% of the world's health expenditures, and the number continues to grow, placing a huge burden on healthcare systems, especially in remote, underserved areas. Methods. A total of 584,168 adult subjects who had participated in the national physical examination were enrolled in this study. Risk factors for type II diabetes mellitus (T2DM) were identified by p values and odds ratios, using logistic regression (LR) on physical-measurement and questionnaire variables. With the risk factors selected by LR, we used a decision tree, a random forest, AdaBoost with a decision tree (AdaBoost), and an extreme gradient boosting decision tree (XGBoost) to identify individuals with T2DM, compared the performance of the four machine learning classifiers, and used the best-performing classifier to output variable importance scores for T2DM. Results. XGBoost had the best performance (accuracy=0.906, precision=0.910, recall=0.902, F1=0.906, and AUC=0.968). Its variable importance scores showed that BMI was the most significant feature, followed by age, waist circumference, systolic pressure, ethnicity, smoking amount, fatty liver, hypertension, physical activity, drinking status, dietary ratio (meat to vegetables), drink amount, smoking status, and diet habit (oil loving). Conclusions. We proposed an LR-XGBoost classifier that uses fourteen easily obtained, noninvasive patient variables as predictors to identify potential incident T2DM. The classifier can accurately screen for diabetes risk in the early phase, and the variable importance scores offer guidance for preventing the occurrence of diabetes.
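The two-stage LR-then-boosting pipeline the abstract describes can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' code: scikit-learn's GradientBoostingClassifier stands in for XGBoost, and the odds-ratio screening threshold is an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the physical-examination data.
X, y = make_classification(n_samples=2000, n_features=20, n_informative=8,
                           random_state=0)

# Stage 1: screen risk factors with logistic regression; keep features whose
# odds ratio exp(coef) deviates from 1 (a simplified proxy for p-value/OR screening).
lr = LogisticRegression(max_iter=1000).fit(X, y)
odds_ratios = np.exp(lr.coef_[0])
selected = np.where(np.abs(np.log(odds_ratios)) > 0.1)[0]

# Stage 2: fit a boosted-tree classifier on the selected risk factors
# (GradientBoostingClassifier stands in for XGBoost here).
X_tr, X_te, y_tr, y_te = train_test_split(X[:, selected], y, test_size=0.2,
                                          random_state=0)
clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
importance = clf.feature_importances_  # analogous to the paper's importance scores
```

In the paper the same two-stage idea yields the fourteen-variable classifier; here the feature count and threshold are arbitrary.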

2020
Author(s): Xueyan Li, Genshan Ma, Xiaobo Qian, Yamou Wu, Xiaochen Huang, ...

Abstract Background: We aimed to assess the performance of machine learning algorithms for predicting risk factors of postoperative ileus (POI) in patients who underwent laparoscopic colorectal surgery for malignant lesions. Methods: We conducted a retrospective observational study of 637 patients at Suzhou Hospital of Nanjing Medical University. Four machine learning algorithms (logistic regression, decision tree, random forest, gradient boosting decision tree) were used to predict risk factors of POI. The cases were randomly divided into training and testing data sets at a ratio of 8:2. The performance of each model was evaluated by the area under the receiver operating characteristic curve (AUC), precision, recall, and F1-score. Results: The morbidity of POI in this study was 19.15% (122/637). The gradient boosting decision tree reached the highest AUC (0.76) and was the best model for POI risk prediction. In addition, its importance matrix showed that the five most important variables were time to first passage of flatus, opioids during POD3, duration of surgery, height, and weight. Conclusions: The gradient boosting decision tree was the optimal model for predicting the risk of POI in patients who underwent laparoscopic colorectal surgery for malignant lesions, and the results of our study could inform clinical guidelines on POI risk prediction.
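The evaluation protocol described here (an 8:2 random split and AUC/precision/recall/F1 on the held-out fifth) is standard and can be sketched as below. The cohort is synthetic, with the class balance chosen to mirror the reported ~19% POI morbidity; all numbers are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced cohort (~19% positives, mirroring the reported POI rate).
X, y = make_classification(n_samples=637, n_features=10, weights=[0.81],
                           random_state=1)

# 8:2 train/test split, stratified to preserve the outcome rate.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y,
                                          random_state=1)

gbdt = GradientBoostingClassifier(random_state=1).fit(X_tr, y_tr)
proba = gbdt.predict_proba(X_te)[:, 1]
pred = gbdt.predict(X_te)
metrics = {
    "auc": roc_auc_score(y_te, proba),
    "precision": precision_score(y_te, pred, zero_division=0),
    "recall": recall_score(y_te, pred),
    "f1": f1_score(y_te, pred),
}
```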


2021, Vol 10 (10), pp. 680
Author(s): Annan Yang, Chunmei Wang, Guowei Pang, Yongqing Long, Lei Wang, ...

Gully erosion is the most severe type of water erosion and a major land degradation process. The efficiency and interpretability of gully erosion susceptibility mapping (GESM) remain a challenge, especially in complex terrain. In this study, a WoE-MLC model, which combines machine learning classification (MLC) algorithms with the statistical weight of evidence (WoE) model, was applied to this problem in the Loess Plateau. The three machine learning (ML) algorithms used were random forest (RF), gradient boosted decision trees (GBDT), and extreme gradient boosting (XGBoost). The results showed that: (1) GESM was well predicted by both the standalone machine learning models and the WoE-MLC models, with area under the curve (AUC) values above 0.92 for both, and the latter were more computationally efficient and interpretable; (2) XGBoost was more efficient in GESM than the other two algorithms, with the strongest generalization ability and the best performance in avoiding overfitting (averaged AUC = 0.947), followed by RF (averaged AUC = 0.944) and GBDT (averaged AUC = 0.938); and (3) slope gradient, land use, and altitude were the main factors for GESM. This study may provide a workable method for large-scale gully erosion susceptibility mapping.
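The statistical weight of evidence (WoE) component can be computed directly: for each class of a conditioning factor, WoE is the log ratio of the class's share of event (gully) cells to its share of non-event cells. A minimal sketch on toy data, with hypothetical class names:

```python
import numpy as np

def weight_of_evidence(factor_class, event):
    """Per-class WoE: ln((share of event cells) / (share of non-event cells))."""
    ev = event.astype(bool)
    woe = {}
    for c in np.unique(factor_class):
        in_c = factor_class == c
        p_event = in_c[ev].sum() / ev.sum()        # class share among gully cells
        p_nonevent = in_c[~ev].sum() / (~ev).sum()  # class share among non-gully cells
        woe[c] = np.log(p_event / p_nonevent)
    return woe

# Toy example: one factor with two classes, 4 gully cells and 4 non-gully cells.
factor = np.array(["steep", "steep", "steep", "flat", "steep", "flat", "flat", "flat"])
event = np.array([1, 1, 1, 1, 0, 0, 0, 0])
woe = weight_of_evidence(factor, event)
# "steep": ln((3/4)/(1/4)) = ln 3; "flat": ln((1/4)/(3/4)) = -ln 3
```

In a WoE-MLC model, such per-class WoE values replace raw categorical factors as inputs to the ML classifiers.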


2021
Author(s): Anmin Hu, Hui-Ping Li, Zhen Li, Zhongjun Zhang, Xiong-Xiong Zhong

Abstract Purpose: The aim of this study was to use machine learning to analyze risk factors and construct a model to predict delirium among ICU patients. Methods: We assembled a set of real-world data to compare the reliability and accuracy of delirium prediction models built from the MIMIC-III database, the MIMIC-IV database, and the eICU Collaborative Research Database. Significance tests, correlation analysis, and factor analysis were used to individually screen 80 potential risk factors. The predictive algorithms were logistic regression, naive Bayes, K-nearest neighbors, support vector machine, random forest, and eXtreme Gradient Boosting. The conventional E-PRE-DELIRIC model and eighteen machine learning models, comprising all-factor (AF) models with all potential variables, characteristic variable (CV) models with principal component factors, and rapid predictive (RP) models without laboratory test results, were used to construct the delirium risk prediction model. Performance was measured by the area under the receiver operating characteristic curve (AUC) under tenfold cross-validation. Variable importance measures (VIMs) and SHAP, feature-level and sample-level interpretation algorithms for machine learning black-box models, were implemented. Results: A total of 78,365 patients were enrolled in this study, 22,159 of whom (28.28%) had positive delirium records. The E-PRE-DELIRIC model (AUC, 0.77), AF models (AUC, 0.77-0.93), CV models (AUC, 0.77-0.88), and RP models (AUC, 0.75-0.87) had discriminatory value. In the random forest CV model, the top five factors by weight for delirium were length of ICU stay, verbal response score, APACHE-III score, urine volume, and hemoglobin.
The SHAP values in the eXtreme Gradient Boosting CV model showed that the top three features negatively correlated with outcomes were verbal response score, urine volume, and hemoglobin; the top three features positively correlated with outcomes were length of ICU stay, APACHE-III score, and alanine transaminase. Conclusion: Even with a small number of variables, machine learning can predict delirium in critically ill patients well. Characteristic variables provide direction for early intervention to reduce the risk of delirium.
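The tenfold cross-validated AUC used to score these models is a standard procedure. A minimal sketch on synthetic data, with the class balance chosen to mirror the reported ~28% delirium prevalence (random forest shown; any of the six listed algorithms could be substituted):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the ICU cohort (~28% delirium-positive).
X, y = make_classification(n_samples=1000, n_features=15, weights=[0.72],
                           random_state=2)

# Tenfold cross-validation scored by ROC AUC.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=2)
aucs = cross_val_score(RandomForestClassifier(random_state=2), X, y,
                       cv=cv, scoring="roc_auc")
mean_auc = aucs.mean()
```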


2021, Vol 11
Author(s): Yanjie Zhao, Rong Chen, Ting Zhang, Chaoyue Chen, Muhetaer Muhelisa, ...

Background: Differential diagnosis between benign and malignant breast lesions is of crucial importance for follow-up treatment. Recent developments in texture analysis and machine learning may offer a new solution to this problem. Method: This study enrolled 265 patients (benign:malignant breast lesions = 71:194) diagnosed in our hospital who received magnetic resonance imaging between January 2014 and August 2017. Patients were randomly divided into training and validation groups (4:1), and two radiologists extracted texture features from the contrast-enhanced T1-weighted images. We applied five feature selection methods, namely distance correlation, gradient boosting decision tree (GBDT), least absolute shrinkage and selection operator (LASSO), random forest (RF), and eXtreme gradient boosting (XGBoost), and built five independent classification models based on the linear discriminant analysis (LDA) algorithm. Results: All five models showed promise in discriminating malignant from benign breast lesions, with areas under the receiver operating characteristic (ROC) curve (AUCs) above 0.830 in both the training and validation groups. The model with the best discriminating ability was LDA + GBDT: its sensitivity, specificity, AUC, and accuracy in the training group were 0.814, 0.883, 0.922, and 0.868, respectively. LDA + RF also showed promising results, with an AUC of 0.906 in the training group. Conclusion: The evidence of this study, while preliminary, suggests that combining MRI texture analysis with the LDA algorithm can discriminate benign from malignant breast lesions. Further multicenter research would help validate these results.
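The feature-selection-then-LDA design can be sketched as a pipeline. This is an illustrative sketch on synthetic "texture features", showing one of the five variants (a LASSO-style L1-penalized selector feeding LDA); the actual features come from contrast-enhanced T1-weighted images.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for radiomic texture features (265 cases, as in the study).
X, y = make_classification(n_samples=265, n_features=30, n_informative=6,
                           random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y,
                                          random_state=3)

# L1-penalized (LASSO-style) selection feeding an LDA classifier.
model = make_pipeline(
    StandardScaler(),
    SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear", C=0.5)),
    LinearDiscriminantAnalysis(),
)
model.fit(X_tr, y_tr)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```

Swapping the selector (distance correlation, GBDT importance, RF importance, XGBoost importance) while keeping the LDA stage gives the study's five model variants.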


2021
Author(s): Sang Min Nam, Thomas A Peterson, Kyoung Yul Seo, Hyun Wook Han, Jee In Kang

BACKGROUND In epidemiological studies, finding the best subset of factors is challenging when the number of explanatory variables is large. OBJECTIVE Our study had two aims. First, we aimed to identify essential depression-associated factors using the extreme gradient boosting (XGBoost) machine learning algorithm on big survey data (the Korea National Health and Nutrition Examination Survey, 2012-2016). Second, we aimed to achieve a comprehensive understanding of the multifactorial features of depression using network analysis. METHODS An XGBoost model was trained and tested to classify "current depression" versus "no lifetime depression" on a data set of 120 variables for 12,596 cases. The optimal XGBoost hyperparameters were set by an automated machine learning tool (TPOT), and a high-performance sparse model was obtained by feature selection using XGBoost feature importance values. We performed statistical tests on the model and nonmodel factors using survey-weighted multiple logistic regression and drew a correlation network among factors. We also tested for confounding or interaction effects of selected risk factors when these were suspected from the network. RESULTS The XGBoost-derived depression model consisted of 18 factors with an area under the weighted receiver operating characteristic curve of 0.86. Two nonmodel factors were found using the model factors, and the factors were classified as direct (P<.05) or indirect (P≥.05) according to the statistical significance of their association with depression. Perceived stress and asthma were the most notable risk factors, and urine specific gravity was a novel protective factor. The depression-factor network showed clusters of socioeconomic status and quality-of-life factors and suggested that educational level and sex might be predisposing factors.
Indirect factors (eg, diabetes, hypercholesterolemia, and smoking) were involved in confounding or interaction effects of direct factors. Triglyceride level was a confounder of hypercholesterolemia and diabetes, smoking carried a significant risk in females, and weight gain was associated with depression involving diabetes. CONCLUSIONS XGBoost and network analysis were useful for discovering depression-related factors and their relationships and can be applied to epidemiological studies using big survey data.
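A correlation network of the kind described here treats factors as nodes and connects pairs whose absolute correlation exceeds a threshold. A minimal sketch on synthetic data (the factor names and the 0.5 threshold are illustrative assumptions):

```python
import numpy as np

# Synthetic factors: "stress" and "quality_of_life" are strongly related,
# "income" is independent of both.
rng = np.random.default_rng(0)
n = 500
stress = rng.normal(size=n)
quality_of_life = -0.8 * stress + rng.normal(scale=0.3, size=n)
income = rng.normal(size=n)

factors = np.column_stack([stress, quality_of_life, income])
names = ["stress", "quality_of_life", "income"]
corr = np.corrcoef(factors, rowvar=False)

# Edge between two factors when |Pearson r| exceeds the threshold.
threshold = 0.5
edges = [(names[i], names[j])
         for i in range(len(names)) for j in range(i + 1, len(names))
         if abs(corr[i, j]) > threshold]
```

Clusters in such a network (eg, the socioeconomic and quality-of-life clusters the abstract reports) emerge as densely connected groups of nodes.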


2021, Vol 12 (2), pp. 28-55
Author(s): Fabiano Rodrigues, Francisco Aparecido Rodrigues, Thelma Valéria Rocha Rodrigues

This study analyzes results obtained with machine learning models for predicting startup success. As a proxy for success, the investor's perspective is adopted, in which acquisition of the startup or an IPO (Initial Public Offering) are ways of recovering the investment. The literature review covers startups and financing vehicles, previous studies on predicting startup success with machine learning models, and trade-offs among machine learning techniques. In the empirical part, a quantitative study was conducted based on secondary data from the American platform Crunchbase, covering startups from 171 countries. The research design filtered for startups founded between June 2010 and June 2015, with a prediction window from June 2015 to June 2020 for forecasting their success. After data preprocessing, the sample comprised 18,571 startups. Six binary classification models were used for the prediction: logistic regression, decision tree, random forest, extreme gradient boosting, support vector machine, and neural network. In the end, the random forest and extreme gradient boosting models showed the best classification performance. This article, spanning machine learning and startups, contributes to hybrid research areas by merging the fields of management and data science. It also offers investors a tool for the initial mapping of startups when searching for targets with a higher probability of success.
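The model-comparison exercise described above can be sketched as a loop over candidate classifiers scored on a held-out set. The data here are synthetic (the Crunchbase features are not public in this form), a minority class stands in for "successful" startups, and GradientBoostingClassifier stands in for extreme gradient boosting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in: success (acquisition/IPO) is the rare positive class.
X, y = make_classification(n_samples=3000, n_features=20, n_informative=8,
                           weights=[0.9], random_state=4)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, stratify=y,
                                          random_state=4)

# Three of the six model families; the rest would slot in the same way.
models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=4),
    "gbdt": GradientBoostingClassifier(random_state=4),  # stand-in for XGBoost
}
scores = {name: roc_auc_score(y_te, m.fit(X_tr, y_tr).predict_proba(X_te)[:, 1])
          for name, m in models.items()}
```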


2020, Vol 2020, pp. 1-14
Author(s): Ke Zhou, Hailei Liu, Xiaobo Deng, Hao Wang, Shenglan Zhang

Six machine-learning approaches, including multivariate linear regression (MLR), gradient boosting decision tree, k-nearest neighbors, random forest, extreme gradient boosting (XGB), and deep neural network (DNN), were compared for near-surface air-temperature (Tair) estimation from observations by the new-generation Chinese geostationary meteorological satellite Fengyun-4A (FY-4A). The brightness temperatures in split-window channels from the Advanced Geostationary Radiation Imager (AGRI) of FY-4A and numerical weather prediction data from the global forecast system were used as predictor variables for Tair estimation. The performance of each model and the temporal and spatial distributions of the estimated Tair errors were analyzed. The results showed that the XGB model had the best overall performance, with an R2 of 0.902, a bias of −0.087°C, and a root-mean-square error of 1.946°C. The spatial variation of the Tair error was less pronounced for the XGB method than for the other methods. The XGB model can provide stable, high-precision Tair estimates over China at large scale and can serve as a reference for Tair estimation with machine-learning models.
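The three headline metrics (R2, bias, RMSE) have simple closed forms that can be computed directly; a sketch with a hand-checkable toy case:

```python
import numpy as np

def evaluation_metrics(obs, pred):
    """R^2, mean bias (pred - obs), and RMSE, as used for Tair evaluation."""
    obs, pred = np.asarray(obs, float), np.asarray(pred, float)
    resid = pred - obs
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((obs - obs.mean()) ** 2)
    return {
        "r2": 1 - ss_res / ss_tot,
        "bias": resid.mean(),
        "rmse": np.sqrt(np.mean(resid ** 2)),
    }

# Toy check: predictions uniformly 1 degree too warm.
m = evaluation_metrics([10, 12, 14, 16], [11, 13, 15, 17])
# bias = 1.0, rmse = 1.0, r2 = 1 - 4/20 = 0.8
```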


2021, Vol 12
Author(s): Ze Yu, Huanhuan Ji, Jianwen Xiao, Ping Wei, Lin Song, ...

The aim of this study was to apply machine learning methods to deeply explore the risk factors associated with adverse drug events (ADEs) and predict the occurrence of ADEs in Chinese pediatric inpatients. Data of 1,746 patients aged between 28 days and 18 years (mean age = 3.84 years) were included in the study from January 1, 2013, to December 31, 2015, in the Children’s Hospital of Chongqing Medical University. There were 247 cases of ADE occurrence, of which the most common drugs inducing ADEs were antibacterials. Seven algorithms, including eXtreme Gradient Boosting (XGBoost), CatBoost, AdaBoost, LightGBM, Random Forest (RF), Gradient Boosting Decision Tree (GBDT), and TPOT, were used to select the important risk factors, and GBDT was chosen to establish the prediction model with the best predicting abilities (precision = 44%, recall = 25%, F1 = 31.88%). The GBDT model has better performance than Global Trigger Tools (GTTs) for ADE prediction (precision 44 vs. 13.3%). In addition, multiple risk factors were identified via GBDT, such as the number of trigger true (TT) (+), number of doses, BMI, number of drugs, number of admission, height, length of hospital stay, weight, age, and number of diagnoses. The influencing directions of the risk factors on ADEs were displayed through Shapley Additive exPlanations (SHAP). This study provides a novel method to accurately predict adverse drug events in Chinese pediatric inpatients with the associated risk factors, which may be applicable in clinical practice in the future.
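The reported F1 of 31.88% follows directly from the precision and recall as their harmonic mean, which a one-line check confirms:

```python
def f1_from_precision_recall(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Reproduce the abstract's figures: precision 44%, recall 25% -> F1 = 31.88%.
f1 = f1_from_precision_recall(0.44, 0.25)
```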


Author(s): Chen Wang, Lin Liu, Chengcheng Xu, Weitao Lv

The objective of this paper is to predict the future driving risk of crash-involved drivers in Kunshan, China. A systematic machine learning framework is proposed to address three critical technical issues: (1) defining driving risk; (2) developing risky driving factors; and (3) developing a reliable and explicable machine learning model. High-risk (HR) and low-risk (LR) drivers were defined under five different scenarios. A number of features were extracted from seven years of crash/violation records. Drivers' crash/violation information from the prior two years was used to predict their driving risk in the subsequent two years. Using a one-year rolling time window, prediction models were developed for four consecutive time periods: 2013–2014, 2014–2015, 2015–2016, and 2016–2017. Four tree-based ensemble learning techniques were attempted: random forest (RF), AdaBoost with decision tree, gradient boosting decision tree (GBDT), and extreme gradient boosting decision tree (XGBoost). A temporal transferability test and a follow-up study were used to validate the trained models. The best scenario for defining driving risk was multidimensional, encompassing crash recurrence, severity, and fault commitment. GBDT appeared to be the best model choice across all time periods, with an acceptable average precision (AP) of 0.68 on the most recent dataset (2016–2017). Seven of the nine top features were related to risky driving behaviors, which presented non-linear relationships with driving risk. Model transferability held within relatively short time intervals (1–2 years). Appropriate risk definitions, rich violation/crash features, and advanced machine learning techniques should be considered for the risk prediction task. The proposed machine learning approach is promising and could enable safety interventions to be launched more effectively.
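The evaluation pattern here (fit on one period, score the next by average precision, which suits the imbalanced HR/LR split) can be sketched as follows. The two "periods" are synthetic stand-ins for consecutive two-year windows, and GradientBoostingClassifier stands in for GBDT.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

# Synthetic drivers; high-risk is the rare positive class (~20%).
X, y = make_classification(n_samples=2000, n_features=12, weights=[0.8],
                           random_state=5)

# Train on the "earlier" period, evaluate on the "later" one.
X_prev, X_next, y_prev, y_next = train_test_split(X, y, test_size=0.5,
                                                  stratify=y, random_state=5)
model = GradientBoostingClassifier(random_state=5).fit(X_prev, y_prev)
ap = average_precision_score(y_next, model.predict_proba(X_next)[:, 1])
```

Rolling the window forward a year and refitting reproduces the paper's four consecutive evaluation periods.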


Author(s): Parthiban Loganathan, Amit Baburao Mahindrakar

Abstract An intercomparison of streamflow simulation and discharge prediction using several well-known machine learning techniques was performed. A daily streamflow discharge model was developed for 35 observation stations in the large Cauvery river basin. Various hydrological indices were calculated for observed and predicted discharges to compare and evaluate how well local hydrological conditions were replicated. The variance and bias of the proposed extreme gradient boosting decision tree model were less than 15%, lower than those of the other machine learning techniques considered in this study. The model's Nash–Sutcliffe efficiency and coefficient of determination are above 0.7 for both the training and testing phases, demonstrating its effectiveness. The comparison of monthly observed and model-predicted discharges during the validation period illustrates the model's ability to represent peaks and falls in high-, medium-, and low-flow zones. The assessment and comparison of hydrological indices between observed and predicted discharges illustrate the model's ability to represent baseflow, high-spell, and low-spell statistics. Simulating streamflow and predicting discharge are essential for water resource planning and management, especially in large river basins. The proposed machine learning technique demonstrates a significant improvement in model efficiency by reducing variance and bias, which in turn improves the replication of local-scale hydrology.
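The Nash–Sutcliffe efficiency cited above is one minus the ratio of squared simulation error to the variance of the observations about their mean; a short sketch with a hand-checkable toy case:

```python
import numpy as np

def nash_sutcliffe(obs, sim):
    """NSE = 1 - sum((obs - sim)^2) / sum((obs - mean(obs))^2); 1 is a perfect fit."""
    obs, sim = np.asarray(obs, float), np.asarray(sim, float)
    return 1 - np.sum((obs - sim) ** 2) / np.sum((obs - obs.mean()) ** 2)

# Toy check: a simulation off by one unit everywhere.
nse = nash_sutcliffe([2, 4, 6, 8], [3, 5, 7, 9])
# ss_res = 4, ss_tot = 20, so NSE = 0.8
```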

