Prediction of Clinical Risk Factors of Diabetes Using Multiple Machine Learning Techniques Resolving Class Imbalance

Background: Heterogeneity in disease populations complicates discovery of risk factors. To identify risk factors for subpopulations of diseases, we need analytical methods that can deal with unidentified disease subgroups.

Objectives: Inspired by successful approaches from the Big Data field, we developed a high-throughput approach to identify subpopulations within patients with heterogeneous, complex diseases using the wealth of information available in Electronic Medical Records (EMRs).

Methods: We extracted longitudinal healthcare-interaction records, coded by 1,853 PheCodes [1], of the 64,819 patients in Boston's Partners Biobank. Through dimensionality reduction using t-SNE [2] we created a 2D embedding of 32,424 of these patients (set A). We then identified distinct clusters post-t-SNE using DBSCAN [3] and visualized the relative importance of individual PheCodes within them using specialized spectrographs. We replicated this procedure in the remaining 32,395 records (set B).

Results: Summary statistics of both sets were comparable (Table 1).

Table 1. Summary statistics of the total Partners Biobank dataset and the 2 partitions.

                      Set A         Set B         Total
Entries               12,200,311    12,177,131    24,377,442
Patients              32,424        32,395        64,819
Patient years         369,546.33    368,597.92    738,144.2
Unique ICD codes      25,056        24,953        26,305
Unique PheCodes       1,851         1,853         1,853

We found 284 clusters in set A and 295 in set B, of which 63.4% from set A could be mapped to a cluster in set B with a median (range) correlation of 0.24 (0.03 – 0.58). Clusters represented similar yet distinct clinical phenotypes; e.g. patients diagnosed with “other headache syndrome” were separated into four distinct clusters characterized by migraines, neurofibromatosis, epilepsy or brain cancer, all resulting in patients presenting with headaches (Fig. 1 & 2). Though EMR databases tend to be noisy, our method was also able to differentiate misclassification from true cases; SLE patients with RA codes clustered separately from true RA cases.

Figure 1. Two-dimensional representation of Set A generated using dimensionality reduction (t-SNE) and clustering (DBSCAN).

Figure 2. Phenotype Spectrographs (PheSpecs) of four clusters characterized by “Other headache syndromes”, driven by codes relating to migraine, epilepsy, neurofibromatosis or brain cancer.

Conclusion: We have shown that EMR data can be used to identify and visualize latent structure in patient categorizations, using an approach based on dimension reduction and clustering machine learning techniques. Our method can identify misclassified patients as well as separate patients with similar problems into subsets with different associated medical problems. Our approach adds a new and powerful tool to aid in the discovery of novel risk factors in complex, heterogeneous diseases.

References:
[1] Denny, J.C. et al. Bioinformatics (2010)
[2] van der Maaten et al. Journal of Machine Learning Research (2008)
[3] Ester, M. et al. Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (1996)

Disclosure of Interests: Marc Maurits: None declared, Thomas Huizinga Grant/research support from: Ablynx, Bristol-Myers Squibb, Roche, Sanofi, Consultant of: Ablynx, Bristol-Myers Squibb, Roche, Sanofi, Marcel Reinders: None declared, Soumya Raychaudhuri: None declared, Elizabeth Karlson: None declared, Erik van den Akker: None declared, Rachel Knevel: None declared
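The clustering step described in this abstract (density-based clustering of points in a 2D embedding) can be illustrated with a minimal sketch. This is not the authors' pipeline: the t-SNE embedding is replaced by a handful of hand-made 2D coordinates, and the `eps` and `min_samples` values are hypothetical choices for the toy data. The DBSCAN logic itself follows the standard formulation (core points, border points, noise).

```python
from math import dist

def dbscan(points, eps, min_samples):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    UNVISITED, NOISE = None, -1
    labels = [UNVISITED] * len(points)

    def neighbours(i):
        # Brute-force radius query; the point itself is included.
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point of the current cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point, keep expanding
    return labels

# Hypothetical stand-in for a 2D embedding: two dense groups and one outlier.
embedding = [(0, 0), (0, 1), (1, 0), (1, 1),
             (10, 10), (10, 11), (11, 10), (11, 11),
             (5, 50)]
labels = dbscan(embedding, eps=1.5, min_samples=3)
print(labels)
```

On real embeddings a spatial index would replace the brute-force neighbour search, and the number of clusters found depends strongly on `eps` and `min_samples`.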

Machine learning analysis of multispectral imaging and clinical risk factors to predict amputation wound healing

Journal of Vascular Surgery ◽

10.1016/j.jvs.2021.06.478 ◽

2021 ◽

Author(s):

John J. Squiers ◽

Jeffrey E. Thatcher ◽

David Bastawros ◽

Andrew J. Applewhite ◽

Ronald D. Baxter ◽

...

Keyword(s):

Machine Learning ◽

Risk Factors ◽

Wound Healing ◽

Multispectral Imaging ◽

Clinical Risk Factors ◽

Clinical Risk ◽

Learning Analysis

Class Imbalance Issue in Software Defect Prediction Models by various Machine Learning Techniques: An Empirical Study

10.1109/icscc51209.2021.9528170 ◽

2021 ◽

Author(s):

Sushant Kumar Pandey ◽

Anil Kumar Tripathi

Keyword(s):

Machine Learning ◽

Empirical Study ◽

Prediction Models ◽

Class Imbalance ◽

Machine Learning Techniques ◽

Defect Prediction ◽

Software Defect Prediction ◽

Software Defect ◽

Learning Techniques ◽

Defect Prediction Models

Survival prediction models since liver transplantation - comparisons between Cox models and machine learning techniques

BMC Medical Research Methodology ◽

10.1186/s12874-020-01153-1 ◽

2020 ◽

Vol 20 (1) ◽

Author(s):

Georgios Kantidakis ◽

Hein Putter ◽

Carlo Lancia ◽

Jacob de Boer ◽

Andries E. Braat ◽

...

Keyword(s):

Machine Learning ◽

Risk Factors ◽

Neural Networks ◽

Liver Transplantation ◽

Prediction Models ◽

Machine Learning Techniques ◽

Brier Score ◽

Cox Models ◽

Learning Techniques ◽

Random Survival Forest

Background: Predicting survival of recipients after liver transplantation is regarded as one of the most important challenges in contemporary medicine. Hence, improving on current prediction models is of great interest. Nowadays, there is a strong discussion in the medical field about machine learning (ML) and whether it has greater potential than traditional regression models when dealing with complex data. Criticism of ML relates to unsuitable performance measures and a lack of interpretability, which is important for clinicians. Methods: In this paper, ML techniques such as random forests and neural networks are applied to large data of 62294 patients from the United States with 97 predictors, selected on clinical/statistical grounds out of more than 600, to predict survival from transplantation. Of particular interest is also the identification of potential risk factors. A comparison is performed between 3 different Cox models (with all variables, backward selection and LASSO) and 3 machine learning techniques: a random survival forest and 2 partial logistic artificial neural networks (PLANNs). For PLANNs, novel extensions to their original specification are tested. Emphasis is placed on the advantages and pitfalls of each method and on the interpretability of the ML techniques. Results: Well-established predictive measures are employed from the survival field (C-index, Brier score and Integrated Brier Score) and the strongest prognostic factors are identified for each model. The clinical endpoint is overall graft-survival, defined as the time between transplantation and the date of graft-failure or death. The random survival forest shows slightly better predictive performance than Cox models based on the C-index. Neural networks show better performance than both Cox models and random survival forest based on the Integrated Brier Score at 10 years.
Conclusion: In this work, it is shown that machine learning techniques can be a useful tool for both prediction and interpretation in the survival context. Of the ML techniques examined here, PLANN with 1 hidden layer predicts survival probabilities most accurately, being as calibrated as the Cox model with all variables. Trial registration: Retrospective data were provided by the Scientific Registry of Transplant Recipients under Data Use Agreement number 9477 for analysis of risk factors after liver transplantation.
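The two families of measures named in this abstract, the C-index (discrimination) and the Brier score (accuracy of predicted survival probabilities), can be sketched on toy data as follows. All values below are invented for illustration. The Brier score here is a simplified version that drops subjects censored before the horizon; proper implementations typically reweight for censoring, and the Integrated Brier Score averages over many horizons.

```python
def concordance_index(times, events, risk):
    """Harrell's C: share of comparable pairs that the risk score orders correctly."""
    concordant = comparable = 0.0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when i has an observed event before j's time.
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risk[i] > risk[j]:
                    concordant += 1
                elif risk[i] == risk[j]:
                    concordant += 0.5
    return concordant / comparable

def brier_score(times, events, surv_prob, horizon):
    """Mean squared error of predicted survival probability at `horizon`.

    Simplification: subjects censored before the horizon are skipped
    instead of being handled by censoring weights.
    """
    errors = []
    for t, e, p in zip(times, events, surv_prob):
        if t > horizon:
            errors.append((1.0 - p) ** 2)  # known to survive past the horizon
        elif e:
            errors.append((0.0 - p) ** 2)  # event at or before the horizon
    return sum(errors) / len(errors)

# Hypothetical follow-up data: time in years, 1 = graft failure or death observed.
times = [2, 4, 6, 8, 10]
events = [1, 1, 0, 1, 0]
risk = [0.9, 0.4, 0.5, 0.6, 0.1]         # higher = predicted to fail sooner
surv_at_5 = [0.2, 0.6, 0.7, 0.8, 0.9]    # predicted P(survive beyond 5 years)

c_index = concordance_index(times, events, risk)
brier_5y = brier_score(times, events, surv_at_5, horizon=5)
print(c_index, brier_5y)
```

A C-index of 0.5 corresponds to random ordering and 1.0 to perfect ordering, while a lower Brier score indicates better calibrated and sharper probability predictions.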

Machine Learning Based on a Multiparametric and Multiregional Radiomics Signature Predicts Radiotherapeutic Response in Patients with Glioblastoma

Behavioural Neurology ◽

10.1155/2020/1712604 ◽

2020 ◽

Vol 2020 ◽

pp. 1-12

Author(s):

Zi-Qi Pan ◽

Shu-Jun Zhang ◽

Xiang-Lian Wang ◽

Yu-Xin Jiao ◽

Jian-Jian Qiu

Keyword(s):

Machine Learning ◽

Risk Factors ◽

Cox Regression ◽

Clinical Risk Factors ◽

Cox Regression Analysis ◽

Clinical Risk ◽

Independent Test ◽

Independent Test Dataset ◽

Multiple Regions ◽

Radiomics Signature

Background and Objective. Although radiotherapy has become one of the main treatment methods for cancer, there is no noninvasive method to predict the radiotherapeutic response of individual glioblastoma (GBM) patients before surgery. The purpose of this study is to develop and validate a machine learning-based radiomics signature to predict the radiotherapeutic response of GBM patients. Methods. The MRI images, genetic data, and clinical data of 152 patients with GBM were analyzed. Of these, 122 patients came from the TCIA dataset (training set: n = 82; validation set: n = 40), and 30 patients from local hospitals were used as an independent test dataset. Radiomics features were extracted from multiple regions of multiparameter MRI. Kaplan-Meier survival analysis was used to verify the ability of the imaging signature to predict the response of GBM patients to radiotherapy before an operation. Multivariate Cox regression including the radiomics signature and preoperative clinical risk factors was used to further improve the ability to predict the overall survival (OS) of individual GBM patients, which was presented in the form of a nomogram. Results. The radiomics signature was built from eight selected features. The C-index of the radiomics signature in the TCIA and independent test cohorts was 0.703 (P < 0.001) and 0.757 (P = 0.001), respectively. Multivariate Cox regression analysis confirmed that the radiomics signature (HR: 0.290, P < 0.001), age (HR: 1.023, P = 0.01), and KPS (HR: 0.968, P < 0.001) were independent risk factors for OS in GBM patients before surgery. When the radiomics signature and preoperative clinical risk factors were combined, the radiomics nomogram further improved the performance of OS prediction in individual patients (C-index = 0.764 and 0.758 in the TCIA and test cohorts, respectively). Conclusion. This study developed a radiomics signature that can predict the response of individual GBM patients to radiotherapy and may be a new supplement for precise GBM radiotherapy.
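Hazard ratios from a Cox model multiply: because the log-hazard is linear in each covariate, a change of k units scales the hazard by HR to the power k. The sketch below applies that arithmetic to the reported values. It assumes, which the abstract does not state, that the age HR is per year and the KPS HR is per KPS point; if the covariates were coded differently, the numbers would not apply.

```python
import math

def scaled_hazard_ratio(hr_per_unit, units):
    """Hazard ratio for a change of `units` in a covariate, given its per-unit HR."""
    return math.exp(units * math.log(hr_per_unit))

# Assumption: reported HRs are per 1 year of age and per 1 KPS point.
age_hr_10y = scaled_hazard_ratio(1.023, 10)    # ten additional years of age
kps_hr_10pt = scaled_hazard_ratio(0.968, 10)   # ten additional KPS points

print(round(age_hr_10y, 3), round(kps_hr_10pt, 3))
```

Under that assumption, a per-unit HR close to 1 can still correspond to a sizeable effect across a clinically meaningful difference in the covariate.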

Malicious web domain identification using online credibility and performance data by considering the class imbalance issue

Industrial Management & Data Systems ◽

10.1108/imds-02-2018-0072 ◽

2019 ◽

Vol 119 (3) ◽

pp. 676-696 ◽

Cited By ~ 5

Author(s):

Zhongyi Hu ◽

Raymond Chiong ◽

Ilung Pranata ◽

Yukun Bao ◽

Yuqing Lin

Keyword(s):

Machine Learning ◽

Class Imbalance ◽

Performance Data ◽

Machine Learning Techniques ◽

Data Sets ◽

Real World Data ◽

Content Type ◽

Domain Identification ◽

Learning Techniques ◽

And Performance

Purpose: Malicious web domain identification is of significant importance to the security protection of internet users. With online credibility and performance data, the purpose of this paper is to investigate the use of machine learning techniques for malicious web domain identification by considering the class imbalance issue (i.e. there are more benign web domains than malicious ones). Design/methodology/approach: The authors propose an integrated resampling approach to handle class imbalance by combining the synthetic minority oversampling technique (SMOTE) and particle swarm optimisation (PSO), a population-based meta-heuristic algorithm. The authors use SMOTE for oversampling and PSO for undersampling. Findings: By applying eight well-known machine learning classifiers, the proposed integrated resampling approach is comprehensively examined using several imbalanced web domain data sets with different imbalance ratios. Compared to five other well-known resampling approaches, experimental results confirm that the proposed approach is highly effective. Practical implications: This study not only inspires the practical use of online credibility and performance data for identifying malicious web domains but also provides an effective resampling approach for handling the class imbalance issue in the area of malicious web domain identification. Originality/value: Online credibility and performance data are applied to build malicious web domain identification models using machine learning techniques. An integrated resampling approach is proposed to address the class imbalance issue. The performance of the proposed approach is confirmed based on real-world data sets with different imbalance ratios.
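The idea of combining oversampling of the minority class with undersampling of the majority class can be sketched as below. The oversampling half follows the basic SMOTE recipe (interpolating between a minority sample and one of its nearest minority neighbours). The undersampling half is deliberately not the paper's method: PSO-guided selection is swapped for plain random undersampling to keep the sketch short. The data points, `k`, and the target class sizes are all hypothetical.

```python
import random
from math import dist

def smote(minority, n_new, k=2, rng=None):
    """Create `n_new` synthetic minority points by neighbour interpolation."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest minority neighbours of x, excluding x itself.
        nearest = sorted((p for p in minority if p is not x),
                         key=lambda p: dist(x, p))[:k]
        neighbour = rng.choice(nearest)
        gap = rng.random()  # position along the segment from x to neighbour
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, neighbour)))
    return synthetic

def random_undersample(majority, n_keep, rng=None):
    """Stand-in for PSO-based undersampling: keep a random subset."""
    rng = rng or random.Random(0)
    return rng.sample(majority, n_keep)

# Hypothetical 2-feature data: 12 benign (majority) and 3 malicious (minority).
benign = [(float(i), float(i % 4)) for i in range(12)]
malicious = [(20.0, 20.0), (21.0, 22.0), (23.0, 21.0)]

rng = random.Random(42)
synthetic = smote(malicious, n_new=3, k=2, rng=rng)
malicious_balanced = malicious + synthetic
benign_balanced = random_undersample(benign, n_keep=6, rng=rng)
print(len(benign_balanced), len(malicious_balanced))
```

Resampling of this kind is normally applied to the training split only, so that evaluation still reflects the original class ratio.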

Classification of Neurodegenerative Disorders Based on Major Risk Factors Employing Machine Learning Techniques

International Journal of Engineering and Technology ◽

10.7763/ijet.2010.v2.146 ◽

2010 ◽

Vol 2 (4) ◽

pp. 350-355 ◽

Cited By ~ 5

Author(s):

Sandhya Joshi ◽

P. Deepa Shenoy ◽

Vibhudendra Simha G.G. ◽

Venugopal K. R ◽

L.M. Patnaik

Keyword(s):

Machine Learning ◽

Risk Factors ◽

Neurodegenerative Disorders ◽

Machine Learning Techniques ◽

Learning Techniques

Application of Machine Learning Techniques to Identify Data Reliability and Factors Affecting Outcome After Stroke Using Electronic Administrative Records

Frontiers in Neurology ◽

10.3389/fneur.2021.670379 ◽

2021 ◽

Vol 12 ◽

Author(s):

Santu Rana ◽

Wei Luo ◽

Truyen Tran ◽

Svetha Venkatesh ◽

Paul Talman ◽

...

Keyword(s):

Machine Learning ◽

Risk Factors ◽

Ischemic Stroke ◽

Machine Learning Techniques ◽

Discharge Destination ◽

Clinical Factors ◽

Administrative Records ◽

Factors Associated ◽

Learning Techniques ◽

Discharge Outcomes

Aim: To use available electronic administrative records to identify data reliability, predict discharge destination, and identify risk factors associated with specific outcomes following hospital admission with stroke, compared to stroke specific clinical factors, using machine learning techniques. Method: The study included 2,531 patients having at least one admission with a confirmed diagnosis of stroke, collected from a regional hospital in Australia within 2009–2013. Using machine learning (penalized regression with Lasso) techniques, patients having their index admission between June 2009 and July 2012 were used to derive predictive models, and patients having their index admission between July 2012 and June 2013 were used for validation. Three different stroke types [intracerebral hemorrhage (ICH), ischemic stroke, transient ischemic attack (TIA)] were considered and five different comparison outcome settings were considered. Our electronic administrative record based predictive model was compared with a predictive model composed of “baseline” clinical features, more specific for stroke, such as age, gender, smoking habits, co-morbidities (high cholesterol, hypertension, atrial fibrillation, and ischemic heart disease), types of imaging done (CT scan, MRI, etc.), and occurrence of in-hospital pneumonia. Risk factors associated with likelihood of negative outcomes were identified. Results: The data was highly reliable at predicting discharge to rehabilitation and all other outcomes vs. death for ICH (AUC 0.85 and 0.825, respectively), all discharge outcomes except home vs. rehabilitation for ischemic stroke, and discharge home vs. others and home vs. rehabilitation for TIA (AUC 0.948 and 0.873, respectively). Electronic health record data appeared to provide improved prediction of outcomes over stroke specific clinical factors from the machine learning models.
Common risk factors associated with a negative impact on expected outcomes appeared clinically intuitive, and included older age groups, prior ventilatory support, urinary incontinence, need for imaging, and need for allied health input. Conclusion: Electronic administrative records from this cohort produced reliable outcome prediction and identified clinically appropriate factors negatively impacting most outcome variables following hospital admission with stroke. This presents a means of future identification of modifiable factors associated with patient discharge destination. This may potentially aid in patient selection for certain interventions and aid in better patient and clinician education regarding expected discharge outcomes.
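The evaluation design in this abstract (derive models on earlier index admissions, validate on later ones, report AUC) can be sketched as follows. The records, field names (`index_admission`, `died`, `score`), and the exact cutoff day are hypothetical; the abstract only gives the months. The `score` field stands in for the output of a fitted model, which is not reproduced here.

```python
from datetime import date

def temporal_split(records, cutoff):
    """Split by index admission date: earlier for derivation, later for validation."""
    derivation = [r for r in records if r["index_admission"] < cutoff]
    validation = [r for r in records if r["index_admission"] >= cutoff]
    return derivation, validation

def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Hypothetical patients; `score` is a placeholder for a model's predicted risk.
records = [
    {"index_admission": date(2010, 3, 14), "died": 0, "score": 0.2},
    {"index_admission": date(2011, 11, 2), "died": 1, "score": 0.7},
    {"index_admission": date(2012, 8, 20), "died": 1, "score": 0.9},
    {"index_admission": date(2012, 10, 5), "died": 0, "score": 0.3},
    {"index_admission": date(2013, 1, 17), "died": 1, "score": 0.4},
    {"index_admission": date(2013, 5, 30), "died": 0, "score": 0.6},
]

# Assumed cutoff day within July 2012.
derivation, validation = temporal_split(records, cutoff=date(2012, 7, 1))
validation_auc = auc([r["died"] for r in validation],
                     [r["score"] for r in validation])
print(len(derivation), len(validation), validation_auc)
```

Splitting by time rather than at random gives a validation set that mimics applying the model to future admissions.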

Uncovering clinical risk factors and prediction of severe COVID-19: A machine learning approach based on UK Biobank data

10.1101/2020.09.18.20197319 ◽

2020 ◽

Author(s):

Kenneth C.Y. WONG ◽

Hon-Cheong So

Keyword(s):

Machine Learning ◽

Risk Factors ◽

Renal Function ◽

Population Level ◽

Clinical Risk Factors ◽

Health Concern ◽

Targeted Prevention ◽

UK Biobank ◽

Clinical Risk ◽

Glutamyl Transferase

Background: COVID-19 is a major public health concern. Given the extent of the pandemic, it is urgent to identify risk factors associated with severe disease. Accurate prediction of those at risk of developing severe infections is also important clinically. Methods: Based on UK Biobank (UKBB) data, we built machine learning (ML) models to predict the risk of developing severe or fatal infections, and to evaluate the major risk factors involved. We first restricted the analysis to infected subjects, then performed analysis at a population level, considering those with no known infections as controls. Hospitalization was used as a proxy for severity. In total, 93 clinical variables (collected prior to the COVID-19 outbreak) covering demographic variables, comorbidities, blood measurements (e.g. hematological/liver and renal function/metabolic parameters etc.), anthropometric measures and other risk factors (e.g. smoking/drinking habits) were included as predictors. XGBoost (gradient boosted trees) was used for prediction and predictive performance was assessed by cross-validation. Variable importance was quantified by Shapley values and accuracy gain. Shapley dependency and interaction plots were used to evaluate the pattern of relationship between risk factors and outcomes. Results: A total of 1191 severe and 358 fatal cases were identified. For the analysis among infected individuals (N=1747), our prediction model achieved AUCs of 0.668 and 0.712 for severe and fatal infections respectively. Since only pre-diagnostic clinical data were available, the main objective of this analysis was to identify baseline risk factors. The top five contributing factors for severity were age, waist-hip ratio (WHR), HbA1c, number of drugs taken (cnt_tx) and gamma-glutamyl transferase levels. For prediction of mortality, the top features were age, systolic blood pressure, waist circumference (WC), urea and WHR.
In subsequent analyses involving the whole UKBB population (N for controls=489987), the corresponding AUCs for severity and fatality were 0.669 and 0.749. The same top five risk factors were identified for both outcomes, namely age, cnt_tx, WC, WHR and cystatin C. We also uncovered other features of potential relevance, including testosterone, IGF-1 levels, red cell distribution width (RDW) and lymphocyte percentage. Conclusions: We identified a number of baseline clinical risk factors for severe/fatal infection by an ML approach. For example, age, central obesity, impaired renal function, multi-comorbidities and cardiometabolic abnormalities may predispose to poorer outcomes. The presented prediction models may be useful at a population level to help identify those susceptible to developing severe/fatal infections, hence facilitating targeted prevention strategies. Further replications in independent cohorts are required to verify our findings.
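Shapley values attribute a prediction to individual features by averaging each feature's marginal contribution over all orderings in which features could be added. The sketch below computes them exactly by enumeration for a tiny, invented risk function. It is not how tree-based tools compute them in practice (those use faster, model-specific algorithms), and the feature effects in `toy_risk` are hypothetical numbers chosen for illustration, with feature names borrowed from the abstract.

```python
from itertools import permutations

def shapley_values(features, value):
    """Exact Shapley values: average marginal contribution over all feature orderings."""
    totals = {f: 0.0 for f in features}
    orderings = list(permutations(features))
    for order in orderings:
        present = set()
        for f in order:
            before = value(frozenset(present))
            present.add(f)
            totals[f] += value(frozenset(present)) - before
    return {f: total / len(orderings) for f, total in totals.items()}

def toy_risk(present):
    """Hypothetical risk score given which features are 'switched on'."""
    score = 0.0
    if "age" in present:
        score += 0.30
    if "whr" in present:
        score += 0.10
    if "hba1c" in present:
        score += 0.05
    if {"age", "whr"} <= present:
        score += 0.06  # invented age x WHR interaction
    return score

phi = shapley_values(["age", "whr", "hba1c"], toy_risk)
print(phi)
```

Two properties are visible in the result: the interaction term is split equally between the two interacting features, and the attributions sum to the difference between the full-model score and the empty-model score.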
