Machine Learning for Predicting the 3-Year Risk of Incident Diabetes in Chinese Adults

Purpose: We aimed to establish and validate a risk assessment system that combines demographic and clinical variables to predict the 3-year risk of incident diabetes in Chinese adults. Methods: A 3-year cohort study was performed on 15,928 Chinese adults without diabetes at baseline. All participants were randomly divided into a training set (n = 7,940) and a validation set (n = 7,988). XGBoost, a machine learning technique, was used to select the most important variables from the candidate variables. We then established a stepwise model based on the predictors chosen by the XGBoost model. The area under the receiver operating characteristic curve (AUC), decision curve analysis and calibration analysis were used to assess the discrimination, clinical use and calibration of the model, respectively. External validation was performed on a cohort of 11,113 Japanese participants. Results: In the training and validation sets, 148 and 145 incident diabetes cases occurred, respectively. XGBoost selected the 10 most important variables from 15 candidate variables. Fasting plasma glucose (FPG), body mass index (BMI) and age were the top 3 variables. We then established a stepwise model and a prediction nomogram. The AUCs of the stepwise model were 0.933 and 0.910 in the training and validation sets, respectively. The Hosmer-Lemeshow test showed no statistically significant difference between the predicted and observed diabetes risk (p = 0.068 for the training set, p = 0.165 for the validation set). Decision curve analysis supported the clinical use of the stepwise model across a wide range of threshold probabilities. There were almost no interactions between the predictors (most P-values for interaction >0.05). Furthermore, the AUC for the external validation set was 0.830, and the Hosmer-Lemeshow test for the external validation set showed no statistically significant difference between the predicted and observed diabetes risk (P = 0.824). Conclusion: We established and validated a risk assessment system for characterizing the 3-year risk of incident diabetes.
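
The AUC reported in this abstract can be read as the probability that a randomly chosen participant who developed diabetes receives a higher predicted risk than a randomly chosen participant who did not. The following is a minimal sketch in plain Python under that interpretation; the `auc` function and the toy risks and outcomes are illustrative assumptions, not the authors' code or data.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form).

    scores: predicted risks; labels: 1 = incident diabetes, 0 = no diabetes.
    A tie between a case and a non-case counts as half a concordant pair.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    concordant = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                concordant += 1.0
            elif p == n:
                concordant += 0.5
    return concordant / (len(pos) * len(neg))


# Hypothetical predicted 3-year risks for six participants (toy data).
risks = [0.02, 0.10, 0.35, 0.05, 0.60, 0.35]
outcomes = [0, 0, 1, 0, 1, 0]
toy_auc = auc(risks, outcomes)
print(f"AUC on toy data: {toy_auc:.4f}")
```

The pairwise form is quadratic in the number of participants, which is fine for illustration; for cohorts of the size described here a rank-based implementation would be the more practical choice.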

Association of hypertension and incident diabetes in Chinese adults: a retrospective cohort study using propensity-score matching

10.21203/rs.3.rs-23730/v1 ◽

2020 ◽

Author(s):

Yang Wu ◽

Haofei Hu ◽

Jinlin Cai ◽

Runtian Chen ◽

Xin Zuo ◽

...

Keyword(s):

Cohort Study ◽

Propensity Score ◽

Propensity Score Matching ◽

Diabetes Risk ◽

Incident Diabetes ◽

Chinese Adults ◽

Hypertensive Patients ◽

Hypertensive Group ◽

High Propensity ◽

Doubly Robust Estimation

Abstract Background Previous studies have revealed that hypertension is one of the major risk factors for incident diabetes. However, reliable quantification of the relationship between hypertension and diabetes risk is limited, especially in Chinese people. We aimed to investigate the association between hypertension and the risk of incident diabetes in a large cohort of Chinese adults. Methods This was a retrospective propensity score-matched cohort study. We enrolled 211,809 Chinese adults without diabetes at baseline between 2010 and 2016. The target independent variable was hypertension at baseline, and the dependent variable was incident diabetes during follow-up. One-to-one propensity score matching using a non-parsimonious multivariable logistic regression was conducted to balance the confounders between 28,946 hypertensive patients and 28,946 non-hypertensive participants. The doubly robust estimation method was used to investigate the association between hypertension and incident diabetes. Results After propensity-score matching, the cumulative incidence of diabetes was 1627.690 per 100,000 person-years among hypertensive participants and 1414.422 per 100,000 person-years among non-hypertensive participants. In the propensity score-matched cohort, compared to the non-hypertensive participants, the risk of incident diabetes increased by 14.0% among hypertensive subjects (HR = 1.140, 95% confidence interval (CI): 1.058–1.229, P = 0.00063). After adjusting for the demographic and clinical covariates, diabetes risk increased by 13.1% in the hypertensive group (HR = 1.131, 95% CI: 1.049–1.220, P = 0.00143). After adjusting for the propensity score, diabetes risk increased by 15.4% among hypertensive subjects (HR = 1.154, 95% CI: 1.070–1.244, P = 0.00019). In the subgroup analysis, compared to non-hypertensive participants with a low propensity score, the risk of incident diabetes increased by 2.6 times among hypertensive patients with a high propensity score (HR = 3.610, 95% CI: 2.604–5.005, P < 0.00001). In the sensitivity analysis, the risk of diabetes in the hypertensive group increased by 11.7% in the original cohort (HR = 1.117, 95% CI: 1.044–1.196, P = 0.00134) and by 19.9% in the weighted cohort (HR = 1.199, 95% CI: 1.149–1.250, P < 0.00001). Conclusion Hypertension was associated with a 13.1% increase in the risk of developing diabetes in Chinese adults. Additionally, compared to non-hypertensive participants with a low propensity score, the risk of incident diabetes increased by 2.6 times among hypertensive patients with a high propensity score.
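
The percentage increases quoted in this abstract follow directly from the hazard ratios: a hazard ratio HR corresponds to a relative increase of (HR − 1) × 100%. A small worked sketch in Python, using only the hazard ratios stated in the abstract (the helper name `percent_increase` is an illustrative assumption):

```python
def percent_increase(hazard_ratio):
    """Relative increase in risk, in percent, implied by a hazard ratio."""
    return (hazard_ratio - 1.0) * 100.0


# Hazard ratios as reported in the abstract.
reported = {
    "matched cohort, unadjusted": 1.140,
    "adjusted for covariates": 1.131,
    "adjusted for propensity score": 1.154,
    "high vs. low propensity score subgroup": 3.610,
}

for label, hr in reported.items():
    print(f"{label}: HR = {hr:.3f} -> {percent_increase(hr):.1f}% increase")
```

On this reading, the subgroup hazard ratio of 3.610 is the same statement as "increased by 2.6 times": the risk is about 3.6 times that of the reference group, an increase of roughly 261%.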

Retrained Classification of Tyrosinase Inhibitors and “In Silico” Potency Estimation by Using Atom-Type Linear Indices

Methodologies and Applications for Chemoinformatics and Chemical Engineering ◽

10.4018/978-1-4666-4010-8.ch021 ◽

2013 ◽

pp. 322-427

Keyword(s):

External Validation ◽

Correlation Coefficients ◽

Classification Models ◽

Training Set ◽

Linear Discriminant ◽

OECD Principles ◽

QSAR Models ◽

Validation Set ◽

Global Accuracy

In this paper, the authors present an effort to increase the applicability domain (AD) by retraining models using a database of 701 highly dissimilar molecules with anti-tyrosinase activity and 728 drugs with other uses. Atom-based linear indices and best-subset linear discriminant analysis (LDA) were used to develop individual classification models. Eighteen individual classification-based QSAR models for tyrosinase inhibitory activity were obtained, with global accuracy ranging from 88.15% to 91.60% in the training set and Matthews correlation coefficients (C) ranging from 0.76 to 0.82. In the external validation set, global classification accuracy was above 85.99% and C was above 0.72. All individual models were validated and fulfilled the OECD principles. A brief analysis of the AD for the training set of 478 compounds and the new active compounds included in the retraining was carried out. Various assembled multiclassifier systems, each containing the eighteen models and using different selection criteria, were obtained, which makes it possible to select the best strategy for a particular problem. The assembled multiclassifier systems were also used to estimate the potency of the compounds identified as active. Eighteen potency models validated according to the OECD principles were used.
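
Global accuracy and the Matthews correlation coefficient (C) reported in this abstract are both computed from the four cells of a binary confusion matrix. The sketch below shows the standard formulas in plain Python; the counts are hypothetical and chosen only for illustration, since the abstract does not report confusion matrices.

```python
import math


def matthews_corrcoef(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # common convention when any marginal total is zero
    return (tp * tn - fp * fn) / denom


def global_accuracy(tp, tn, fp, fn):
    """Fraction of all compounds classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts: 100 active and 100 inactive compounds.
tp, tn, fp, fn = 90, 85, 15, 10
acc = global_accuracy(tp, tn, fp, fn)
mcc = matthews_corrcoef(tp, tn, fp, fn)
print(f"accuracy = {acc:.3f}, C = {mcc:.3f}")
```

Unlike accuracy, C ranges from -1 to 1 and stays near 0 for a classifier that ignores the inputs, which is one reason it is often reported alongside accuracy for QSAR classification models.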

A prediction nomogram for the 3-year risk of incident diabetes among Chinese adults

Scientific Reports ◽

10.1038/s41598-020-78716-1 ◽

2020 ◽

Vol 10 (1) ◽

Author(s):

Yang Wu ◽

Haofei Hu ◽

Jinlin Cai ◽

Runtian Chen ◽

Xin Zuo ◽

...

Keyword(s):

High Risk ◽

Targeted Delivery ◽

Operating Characteristic ◽

Validation Cohort ◽

External Validation ◽

Incident Diabetes ◽

Chinese Adults ◽

Decision Curve Analysis ◽

Diabetes Prediction ◽

Selection Operator

Abstract Identifying individuals at high risk for incident diabetes could help achieve targeted delivery of interventional programs. We aimed to develop a personalized diabetes prediction nomogram for the 3-year risk of diabetes among Chinese adults. This retrospective cohort study included 32,312 participants without diabetes at baseline. All participants were randomly stratified into a training cohort (n = 16,219) and a validation cohort (n = 16,093). The least absolute shrinkage and selection operator (LASSO) model was used to construct a nomogram and derive a formula for diabetes probability. Receiver operating characteristic (ROC) curve analysis and decision curve analysis were performed with 500 bootstrap resamples to assess the nomogram's discrimination and clinical use, respectively. In the training and validation cohorts, 155 and 141 participants developed diabetes, respectively. The area under the curve (AUC) of the nomogram was 0.9125 (95% CI, 0.8887–0.9364) and 0.9030 (95% CI, 0.8747–0.9313) for the training and validation cohorts, respectively. In external validation with 12,545 Japanese participants, the AUC was 0.8488 (95% CI, 0.8126–0.8850). The internal and external validation showed that our nomogram had excellent prediction performance. In conclusion, we developed and validated a personalized prediction nomogram for the 3-year risk of incident diabetes among Chinese adults, identifying individuals at high risk of developing diabetes.
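
For orientation, the least absolute shrinkage and selection operator applied to a binary outcome is usually written as a penalized logistic regression. The form below is the textbook objective, not a formula taken from the paper; the notation (n participants, predictor vector x_i, outcome y_i in {0, 1}, p candidate predictors, penalty weight λ) is assumed here.

```latex
\hat{\beta}_0, \hat{\beta} = \arg\min_{\beta_0,\, \beta}
  \left\{ -\frac{1}{n} \sum_{i=1}^{n}
    \left[ y_i \left(\beta_0 + x_i^{\top}\beta\right)
      - \log\!\left(1 + e^{\beta_0 + x_i^{\top}\beta}\right) \right]
    + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \right\},
\qquad
\hat{P}(y_i = 1 \mid x_i) = \frac{1}{1 + e^{-(\hat{\beta}_0 + x_i^{\top}\hat{\beta})}}
```

The absolute-value penalty shrinks some coefficients exactly to zero, which is what makes the method usable for predictor selection; the second expression is the generic shape that a fitted "formula for diabetes probability" would take.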

Machine Learning Classification of Head Impact Sensor Data

Volume 3: Biomedical and Biotechnology Engineering ◽

10.1115/imece2019-12173 ◽

2019 ◽

Author(s):

Tyler F. Rooks ◽

Andrea S. Dargie ◽

Valeta Carol Chancey

Keyword(s):

Machine Learning ◽

Decision Tree ◽

External Validation ◽

Classification Algorithm ◽

Sensor Data ◽

Environmental Sensors ◽

Head Acceleration ◽

Machine Learning Classification ◽

Environmental Sensor ◽

Validation Set

Abstract A shortcoming of using environmental sensors for the surveillance of potentially concussive events is substantial uncertainty regarding whether the event was caused by head acceleration (“head impacts”) or sensor motion (with no head acceleration). The goal of the present study is to develop a machine learning model to classify environmental sensor data obtained in the field and evaluate the performance of the model against the performance of the proprietary classification algorithm used by the environmental sensor. Data were collected from Soldiers attending sparring sessions conducted under a U.S. Army Combatives School course. Data from one sparring session were used to train a decision tree classification algorithm to identify good and bad signals. Data from the remaining sparring sessions were kept as an external validation set. The performance of the proprietary algorithm used by the sensor was also compared to the trained algorithm performance. The trained decision tree was able to correctly classify 95% of events for internal cross-validation and 88% of events for the external validation set. Comparatively, the proprietary algorithm was only able to correctly classify 61% of the events. In general, the trained algorithm was better able to predict when a signal was good or bad compared to the proprietary algorithm. The present study shows it is possible to train a decision tree algorithm using environmental sensor data collected in the field.

Use of serum biomarkers in staging of canine hepatic fibrosis

Journal of Veterinary Diagnostic Investigation ◽

10.1177/1040638719866881 ◽

2019 ◽

Vol 31 (5) ◽

pp. 665-673 ◽

Cited By ~ 1

Author(s):

Maud Menard ◽

Alexis Lecoindre ◽

Jean-Luc Cadoré ◽

Michèle Chevallier ◽

Aurélie Pagnon ◽

...

Keyword(s):

Liver Biopsy ◽

Hepatic Fibrosis ◽

External Validation ◽

Model Performance ◽

Area Under The Curve ◽

Gamma Glutamyl Transferase ◽

Training Set ◽

Internal Validation ◽

Validation Set ◽

Glutamyl Transferase

Accurate staging of hepatic fibrosis (HF) is important for treatment and prognosis of canine chronic hepatitis. HF scores are used in human medicine to indirectly stage and monitor HF, decreasing the need for liver biopsy. We developed a canine HF score to screen for moderate or greater HF. We included 96 dogs in our study, including 5 healthy dogs. A liver biopsy for histologic examination and a biochemistry profile were performed on all dogs. The dogs were randomly split into a training set of 58 dogs and a validation set of 38 dogs. A HF score that included alanine aminotransferase, alkaline phosphatase, total bilirubin, potassium, and gamma-glutamyl transferase was developed in the training set. Model performance was confirmed using the internal validation set, and was similar to the performance in the training set. The overall sensitivity and specificity for the study group were 80% and 70% respectively, with an area under the curve of 0.80 (0.71–0.90). This HF score could be used for indirect diagnosis of canine HF when biochemistry panels are performed on the Konelab 30i (Thermo Scientific), using reagents as in our study. External validation is required to determine if the score is sufficiently robust to utilize biochemical results measured in other laboratories with different instruments and methodologies.

Multiclass Classifier for P-Glycoprotein Substrates, Inhibitors, and Non-Active Compounds

Molecules ◽

10.3390/molecules24102006 ◽

2019 ◽

Vol 24 (10) ◽

pp. 2006 ◽

Cited By ~ 1

Author(s):

Liadys Mora Lagares ◽

Nikola Minovski ◽

Marjana Novič

Keyword(s):

In Silico ◽

Transmembrane Protein ◽

External Validation ◽

Assessment Process ◽

Classification Model ◽

Training Set ◽

Test Set ◽

Active Compounds ◽

P Glycoprotein ◽

Validation Set

P-glycoprotein (P-gp) is a transmembrane protein that actively transports a wide variety of chemically diverse compounds out of the cell. It is highly associated with the ADMET (absorption, distribution, metabolism, excretion and toxicity) properties of drugs/drug candidates and contributes to decreasing toxicity by eliminating compounds from cells, thereby preventing intracellular accumulation. Therefore, in the drug discovery and toxicological assessment process it is advisable to pay attention to whether a compound under development could be transported by P-gp or not. In this study, an in silico multiclass classification model capable of predicting the probability of a compound to interact with P-gp was developed using a counter-propagation artificial neural network (CP ANN) based on a set of 2D molecular descriptors, as well as an extensive dataset of 2512 compounds (1178 P-gp inhibitors, 477 P-gp substrates and 857 P-gp non-active compounds). The model provided a good classification performance, producing non error rate (NER) values of 0.93 for the training set and 0.85 for the test set, while the average precision (AvPr) was 0.93 for the training set and 0.87 for the test set. An external validation set of 385 compounds was used to challenge the model’s performance. On the external validation set the NER and AvPr values were 0.70 for both indices. We believe that this in silico classifier could be effectively used as a reliable virtual screening tool for identifying potential P-gp ligands.
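
The non error rate (NER) quoted in this abstract is, in the chemometrics literature, commonly defined as the mean of the per-class sensitivities, so it is not inflated by the largest class. The sketch below assumes that definition (the abstract itself does not spell it out) and uses a hypothetical three-class confusion matrix.

```python
def non_error_rate(confusion):
    """Mean of per-class sensitivities for a square confusion matrix.

    confusion[i][j] = number of class-i compounds predicted as class j.
    Assumes NER is defined as the average class sensitivity.
    """
    sensitivities = []
    for i, row in enumerate(confusion):
        total = sum(row)
        if total == 0:
            raise ValueError(f"class {i} has no samples")
        sensitivities.append(row[i] / total)
    return sum(sensitivities) / len(sensitivities)


# Hypothetical counts; rows and columns: inhibitors, substrates, non-active.
toy_confusion = [
    [80, 10, 10],
    [15, 30, 5],
    [5, 5, 40],
]
ner = non_error_rate(toy_confusion)
print(f"NER on toy data: {ner:.3f}")
```

In this toy matrix the overall accuracy is 0.75 while the NER is lower, because the smallest class (substrates) is the one classified worst.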

A Machine Learning-Based Prediction Platform for P-Glycoprotein Modulators and Its Validation by Molecular Docking

Cells ◽

10.3390/cells8101286 ◽

2019 ◽

Vol 8 (10) ◽

pp. 1286 ◽

Cited By ~ 1

Author(s):

Onat Kadioglu ◽

Thomas Efferth

Keyword(s):

Machine Learning ◽

Molecular Docking ◽

Learning Strategies ◽

High Performance ◽

External Validation ◽

Major Drawback ◽

Chemotherapy Drugs ◽

P Glycoprotein ◽

Validation Set ◽

Leave One Out

P-glycoprotein (P-gp) is an important determinant of multidrug resistance (MDR) because its overexpression is associated with increased efflux of various established chemotherapy drugs in many clinically resistant and refractory tumors. This leads to insufficient therapeutic targeting of tumor populations, representing a major drawback of cancer chemotherapy. Therefore, P-gp is a target for pharmacological inhibitors to overcome MDR. In the present study, we utilized machine learning strategies to establish a model for P-gp modulators to predict whether a given compound would behave as a substrate or an inhibitor of P-gp. Random forest feature selection algorithm-based leave-one-out random sampling was used. Testing the model with an external validation set revealed high performance scores. A P-gp modulator list of compounds from the ChEMBL database was used to test the performance, and predictions from both the substrate and inhibitor classes were selected for the last step of validation with molecular docking. Predicted substrates revealed docking poses similar to that of doxorubicin, and predicted inhibitors revealed docking poses similar to that of the known P-gp inhibitor elacridar, implying the validity of the predictions. We conclude that the machine-learning approach introduced in this investigation may serve as a tool for the rapid detection of P-gp substrates and inhibitors in large chemical libraries.

Serum microRNA signatures for the detection of pancreatobiliary cancer.

Journal of Clinical Oncology ◽

10.1200/jco.2019.37.15_suppl.e15718 ◽

2019 ◽

Vol 37 (15_suppl) ◽

pp. e15718-e15718

Author(s):

Shuichi Mitsunaga ◽

Shogo Nomura ◽

Kazuo Hara ◽

Yukiko Takayama ◽

Makoto Ueno ◽

...

Keyword(s):

Validation Cohort ◽

External Validation ◽

Diagnostic Value ◽

Healthy Controls ◽

miRNA Signature ◽

Training Set ◽

Linear Discriminant ◽

Serum miRNA ◽

Validation Set ◽

Sensitivity Specificity

e15718 Background: The diagnostic value of serum microRNAs (miRNAs) on a highly sensitive microarray for pancreatobiliary cancer (PBca) has been demonstrated. This study attempted to build and validate a signature composed of multiple serum miRNA markers for discriminating PBca from healthy controls. Methods: A multicenter prospective study on the diagnostic performance of serum miRNAs was conducted. Patients (pts) with treatment-naïve PBca and healthy participants aged ≥60 years were enrolled. Clinical data and sera were collected. The target population was randomly divided into a training or validation cohort with an allocation ratio of 2:1. Twenty-nine serum miRNA markers from the microarray data were analyzed. For each combination of the markers, Fisher’s linear discriminant analysis was performed, and the resulting sensitivity, specificity and AUC of the ROC curve for discriminating PBca from healthy controls were calculated. Marker combinations with a sensitivity/specificity (SN/SP) of ≥80%/90% and a high AUC in comparison with the AUC of CA19-9 were defined as diagnostic miRNA signatures and were selected in the training cohort. Next, the signatures that showed good reproducibility in the validation cohort were identified. As an independent external cohort, PBca pts and healthy participants with pooled frozen sera were enrolled, and the identified miRNA signatures were further validated. Results: A total of 546 participants (80 healthy and 223 PBca in the training set, 40 healthy and 104 PBca in the validation set, 49 healthy and 50 PBca in the external validation set) were analyzed in this study. Four serum miRNA combinations were identified as diagnostic miRNA signatures. In the training set, four miRNA signatures, consisting of 10 miRNAs, were developed. For the best-performing miRNA signature, the SN/SP and AUC were 84/90% and 0.95 (CA19-9: 73/95% and 0.88) in the validation cohort and 84/90% and 0.93 (CA19-9: 80/94% and 0.87) in the external validation cohort. Conclusions: The diagnostic serum miRNA signatures for PBca were identified in this study.
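
Fisher's linear discriminant analysis, as named in the Methods, projects each sample onto the direction that best separates the two class means relative to the within-class scatter. The following is a minimal two-marker sketch in plain Python; the marker values are hypothetical, the function names are illustrative, and the midpoint threshold assumes equal class priors. It is not the study's implementation, which evaluated combinations of 29 markers.

```python
def mean_vector(rows):
    """Mean of a list of 2-dimensional samples."""
    n = len(rows)
    return [sum(r[k] for r in rows) / n for k in range(2)]


def scatter_matrix(rows, mean):
    """2x2 within-class scatter: sum of outer products of centered rows."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for r in rows:
        d = [r[0] - mean[0], r[1] - mean[1]]
        for a in range(2):
            for b in range(2):
                s[a][b] += d[a] * d[b]
    return s


def fit_fisher_lda(class0, class1):
    """Return (w, threshold); a sample x is called class 1 if w.x > threshold."""
    m0, m1 = mean_vector(class0), mean_vector(class1)
    s0, s1 = scatter_matrix(class0, m0), scatter_matrix(class1, m1)
    sw = [[s0[a][b] + s1[a][b] for b in range(2)] for a in range(2)]
    det = sw[0][0] * sw[1][1] - sw[0][1] * sw[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    inv = [[sw[1][1] / det, -sw[0][1] / det],
           [-sw[1][0] / det, sw[0][0] / det]]
    diff = [m1[0] - m0[0], m1[1] - m0[1]]
    # Fisher direction: w = Sw^{-1} (m1 - m0)
    w = [inv[0][0] * diff[0] + inv[0][1] * diff[1],
         inv[1][0] * diff[0] + inv[1][1] * diff[1]]
    # Midpoint of the projected class means (assumes equal priors).
    threshold = 0.5 * sum(w[k] * (m0[k] + m1[k]) for k in range(2))
    return w, threshold


def predict(w, threshold, x):
    return 1 if w[0] * x[0] + w[1] * x[1] > threshold else 0


# Hypothetical expression levels of two markers (toy data).
healthy = [(1.0, 2.0), (1.5, 1.8), (1.2, 2.4), (0.8, 2.1)]
cancer = [(3.0, 0.9), (3.4, 1.2), (2.8, 0.6), (3.2, 1.1)]
w, threshold = fit_fisher_lda(healthy, cancer)
```

Sensitivity and specificity for a marker combination then follow from applying `predict` to the cancer and healthy samples of a held-out cohort, and sweeping `threshold` over the projected scores traces the ROC curve.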

CT Radiomics and Machine-Learning Models for Predicting Tumor-Stroma Ratio in Patients With Pancreatic Ductal Adenocarcinoma

Frontiers in Oncology ◽

10.3389/fonc.2021.707288 ◽

2021 ◽

Vol 11 ◽

Author(s):

Yinghao Meng ◽

Hao Zhang ◽

Qi Li ◽

Fang Liu ◽

Xu Fang ◽

...

Keyword(s):

Machine Learning ◽

Pancreatic Ductal Adenocarcinoma ◽

Predictive Value ◽

Tumor Stroma ◽

Ductal Adenocarcinoma ◽

Gradient Boosting ◽

Rank Test ◽

Training Set ◽

Extreme Gradient Boosting ◽

Validation Set

Purpose To develop and validate a machine learning classifier based on multidetector computed tomography (MDCT), for the preoperative prediction of tumor–stroma ratio (TSR) expression in patients with pancreatic ductal adenocarcinoma (PDAC). Materials and Methods In this retrospective study, 227 patients with PDAC underwent an MDCT scan and surgical resection. We quantified the TSR by using hematoxylin and eosin staining and extracted 1409 arterial and portal venous phase radiomics features for each patient, respectively. Moreover, we used the least absolute shrinkage and selection operator logistic regression algorithm to reduce the features. The extreme gradient boosting (XGBoost) was developed using a training set consisting of 167 consecutive patients, admitted between December 2016 and December 2017. The model was validated in 60 consecutive patients, admitted between January 2018 and April 2018. We determined the XGBoost classifier performance based on its discriminative ability, calibration, and clinical utility. Results We observed low and high TSR in 91 (40.09%) and 136 (59.91%) patients, respectively. A log-rank test revealed significantly longer survival for patients in the TSR-low group than those in the TSR-high group. The prediction model revealed good discrimination in the training (area under the curve [AUC] = 0.93) and moderate discrimination in the validation set (AUC = 0.63). While the sensitivity, specificity, accuracy, positive predictive value, and negative predictive value for the training set were 94.06%, 81.82%, 0.89, 0.89, and 0.90, respectively, those for the validation set were 85.71%, 48.00%, 0.70, 0.70, and 0.71, respectively. Conclusions The CT radiomics-based XGBoost classifier provides a potentially valuable noninvasive tool to predict TSR in patients with PDAC and optimize risk stratification.
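
The five validation-set figures in this abstract all derive from one 2x2 confusion matrix. The sketch below recomputes them in Python. The cell counts are an inference, not data from the abstract: they are the integers that reproduce the reported percentages for the 60 validation patients, with TSR-high assumed to be the positive class.

```python
def diagnostic_metrics(tp, fn, tn, fp):
    """Standard binary diagnostic metrics from confusion-matrix counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / (tp + fn + tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


# Back-calculated counts (assumption): 35 positives, 25 negatives, n = 60.
validation = diagnostic_metrics(tp=30, fn=5, tn=12, fp=13)
for name, value in validation.items():
    print(f"{name}: {value:.4f}")
```

With these counts, 13 of the 25 negative cases are misclassified, which is what the reported specificity of 48.00% means in absolute terms.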

Establishment and Validation of a Clinically Predictive Nomogram Model for Thyroid Carcinoma Patients

10.21203/rs.3.rs-123528/v1 ◽

2020 ◽

Author(s):

Ruyi Zhang ◽

Mei Xu ◽

Xiangxiang Liu ◽

Miao Wang ◽

Qiang Jia ◽

...

Keyword(s):

Thyroid Carcinoma ◽

External Validation ◽

Cancer Staging ◽

Curve Analysis ◽

Calibration Plot ◽

Net Benefit ◽

Training Set ◽

Decision Curve Analysis ◽

Validation Set ◽

Good Calibration

Abstract Objectives To develop a clinically predictive nomogram model that can maximize patients’ net benefit when predicting the prognosis of patients with thyroid carcinoma, based on the 8th edition of the AJCC Cancer Staging Manual. Methods We selected 134,962 thyroid carcinoma patients diagnosed between 2004 and 2015 from the SEER database, with details of the 8th edition of the AJCC Cancer Staging Manual, and randomly separated those patients into two datasets. The first dataset, the training set, accounted for 80% (94,474 cases) and was used to build the nomogram model; the second dataset, the validation set, accounted for 20% (40,488 cases) and was used for external validation. We then evaluated the model's clinical utility by decision curve analysis (DCA) and its accuracy by calculating the AUC and C-index and by a calibration plot. Results Decision curve analysis showed that the final prediction model could maximize patients’ net benefit. In the training set and validation set, Harrell’s concordance indexes were 0.9450 and 0.9421, respectively. Sensitivity and specificity at the three predicted time points (12 months, 36 months and 60 months) were above 0.80 in both datasets, except for the sensitivity at the 60-month time point in the validation set, which was 0.7662. The AUCs at the three predicted time points were 0.9562, 0.9273 and 0.9009, respectively, for the training set, and 0.9645, 0.9329 and 0.8894, respectively, for the validation set. The calibration plot also showed that the nomogram model was well calibrated. Conclusion The final nomogram model provided both excellent accuracy and clinical utility and should be able to predict patients’ survival probability visually and accurately.
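
Harrell's concordance index, reported in this abstract, generalizes the AUC to censored survival data: among pairs of patients whose ordering in time is known, it is the fraction in which the patient who died earlier was assigned the higher risk. A minimal sketch in plain Python, with hypothetical follow-up data (the function name and toy values are illustrative assumptions; pairs with tied times are simply skipped here):

```python
def harrell_c_index(times, events, risk_scores):
    """Harrell's concordance index for right-censored survival data.

    times: follow-up times; events: 1 = death observed, 0 = censored;
    risk_scores: higher score = higher predicted risk (shorter survival).
    A pair (i, j) is comparable when subject i had an observed event
    and a strictly shorter follow-up time than subject j.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if events[i] != 1:
            continue
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical follow-up (months), event indicators and predicted risks.
times = [5, 12, 20, 30, 36]
events = [1, 1, 0, 1, 0]
risks = [0.9, 0.4, 0.6, 0.3, 0.1]
c_index = harrell_c_index(times, events, risks)
print(f"C-index on toy data: {c_index:.3f}")
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so the reported indexes of about 0.94 indicate that the nomogram ranks most comparable patient pairs correctly.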
