Uncovering Clinical Risk Factors and Predicting Severe COVID-19 Cases Using UK Biobank Data: Machine Learning Approach (Preprint)

2021 ◽  
Author(s):  
Kenneth Chi-Yin Wong ◽  
Yong Xiang ◽  
Liangying Yin ◽  
Hon-Cheong So

BACKGROUND COVID-19 is a major public health concern. Given the extent of the pandemic, it is urgent to identify risk factors associated with disease severity. More accurate prediction of those at risk of developing severe infections is of high clinical importance. OBJECTIVE Based on the UK Biobank (UKBB), we aimed to build machine learning models to predict the risk of developing severe or fatal infections, and uncover major risk factors involved. METHODS We first restricted the analysis to infected individuals (n=7846), then performed analysis at a population level, considering those with no known infection as controls (n=465,728 controls). Hospitalization was used as a proxy for severity. A total of 97 clinical variables (collected prior to the COVID-19 outbreak) covering demographic variables, comorbidities, blood measurements (eg, hematological/liver/renal function/metabolic parameters), anthropometric measures, and other risk factors (eg, smoking/drinking) were included as predictors. We also constructed a simplified (lite) prediction model using 27 covariates that can be more easily obtained (demographic and comorbidity data). XGBoost (gradient-boosted trees) was used for prediction and predictive performance was assessed by cross-validation. Variable importance was quantified by Shapley values (ShapVal), permutation importance (PermImp), and accuracy gain. Shapley dependency and interaction plots were used to evaluate the pattern of relationships between risk factors and outcomes. RESULTS A total of 2386 severe and 477 fatal cases were identified. For analyses within infected individuals (n=7846), our prediction model achieved an area under the receiver operating characteristic curve (AUC-ROC) of 0.723 (95% CI 0.711-0.736) and 0.814 (95% CI 0.791-0.838) for severe and fatal infections, respectively. The top 5 contributing factors (sorted by ShapVal) for severity were age, number of drugs taken (cnt_tx), cystatin C (reflecting renal function), waist-to-hip ratio (WHR), and Townsend deprivation index (TDI). For mortality, the top features were age, testosterone, cnt_tx, waist circumference (WC), and red cell distribution width. For analyses involving the whole UKBB population, AUCs for severity and fatality were 0.696 (95% CI 0.684-0.708) and 0.825 (95% CI 0.802-0.848), respectively. The same top 5 risk factors were identified for both outcomes, namely, age, cnt_tx, WC, WHR, and TDI. Apart from the above, age, cystatin C, TDI, and cnt_tx were among the top 10 across all 4 analyses. Other diseases top ranked by ShapVal or PermImp were type 2 diabetes mellitus (T2DM), coronary artery disease, atrial fibrillation, and dementia, among others. For the "lite" models, predictive performance was broadly similar, with estimated AUCs of 0.716, 0.818, 0.696, and 0.830, respectively. The top ranked variables were similar to those above, including age, cnt_tx, WC, sex (male), and T2DM. CONCLUSIONS We identified numerous baseline clinical risk factors for severe/fatal infection by XGBoost. For example, age, central obesity, impaired renal function, multiple comorbidities, and cardiometabolic abnormalities may predispose to poorer outcomes. The prediction models may be useful at a population level to identify those susceptible to developing severe/fatal infections, facilitating targeted prevention strategies. A risk-prediction tool is also available online. Further replications in independent cohorts are required to verify our findings.
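The AUC figures reported above can be read as the probability that a randomly chosen case receives a higher predicted risk than a randomly chosen non-case. The following is a minimal sketch of that definition in plain Python, not the authors' pipeline; the function name `auc_roc` and the toy labels and scores are hypothetical.

```python
def auc_roc(labels, scores):
    """AUC as the fraction of (case, non-case) pairs ranked correctly.

    Ties in the predicted score count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one case and one non-case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = severe infection, 0 = not severe.
labels = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.4, 0.5, 0.3, 0.2, 0.1]
toy_auc = auc_roc(labels, scores)
print(round(toy_auc, 3))
```

In a cross-validated setting, the same quantity would be computed on each held-out fold using predictions from a model fitted on the remaining folds.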

10.2196/29544 ◽  
2021 ◽  
Vol 7 (9) ◽  
pp. e29544
Author(s):  
Kenneth Chi-Yin Wong ◽  
Yong Xiang ◽  
Liangying Yin ◽  
Hon-Cheong So



2020 ◽  
Author(s):  
Kenneth C.Y. WONG ◽  
Hon-Cheong So

Background: COVID-19 is a major public health concern. Given the extent of the pandemic, it is urgent to identify risk factors associated with severe disease. Accurate prediction of those at risk of developing severe infections is also important clinically. Methods: Based on UK Biobank (UKBB) data, we built machine learning (ML) models to predict the risk of developing severe or fatal infections, and to evaluate the major risk factors involved. We first restricted the analysis to infected subjects, then performed analysis at a population level, considering those with no known infections as controls. Hospitalization was used as a proxy for severity. A total of 93 clinical variables (collected prior to the COVID-19 outbreak) covering demographic variables, comorbidities, blood measurements (e.g. hematological/liver and renal function/metabolic parameters), anthropometric measures and other risk factors (e.g. smoking/drinking habits) were included as predictors. XGBoost (gradient boosted trees) was used for prediction and predictive performance was assessed by cross-validation. Variable importance was quantified by Shapley values and accuracy gain. Shapley dependency and interaction plots were used to evaluate the pattern of relationship between risk factors and outcomes. Results: A total of 1191 severe and 358 fatal cases were identified. For the analysis among infected individuals (N=1747), our prediction model achieved AUCs of 0.668 and 0.712 for severe and fatal infections respectively. Since only pre-diagnostic clinical data were available, the main objective of this analysis was to identify baseline risk factors. The top five contributing factors for severity were age, waist-hip ratio (WHR), HbA1c, number of drugs taken (cnt_tx) and gamma-glutamyl transferase levels. For prediction of mortality, the top features were age, systolic blood pressure, waist circumference (WC), urea and WHR. In subsequent analyses involving the whole UKBB population (N for controls=489987), the corresponding AUCs for severity and fatality were 0.669 and 0.749. The same top five risk factors were identified for both outcomes, namely age, cnt_tx, WC, WHR and cystatin C. We also uncovered other features of potential relevance, including testosterone, IGF-1 levels, red cell distribution width (RDW) and lymphocyte percentage. Conclusions: We identified a number of baseline clinical risk factors for severe/fatal infection by an ML approach. For example, age, central obesity, impaired renal function, multi-comorbidities and cardiometabolic abnormalities may predispose to poorer outcomes. The presented prediction models may be useful at a population level to help identify those susceptible to developing severe/fatal infections, hence facilitating targeted prevention strategies. Further replications in independent cohorts are required to verify our findings.


Author(s):  
John J. Squiers ◽  
Jeffrey E. Thatcher ◽  
David Bastawros ◽  
Andrew J. Applewhite ◽  
Ronald D. Baxter ◽  
...  

2020 ◽  
Vol 2020 ◽  
pp. 1-12
Author(s):  
Zi-Qi Pan ◽  
Shu-Jun Zhang ◽  
Xiang-Lian Wang ◽  
Yu-Xin Jiao ◽  
Jian-Jian Qiu

Background and Objective. Although radiotherapy has become one of the main treatment methods for cancer, there is no noninvasive method to predict the radiotherapeutic response of individual glioblastoma (GBM) patients before surgery. The purpose of this study is to develop and validate a machine learning-based radiomics signature to predict the radiotherapeutic response of GBM patients. Methods. The MRI images, genetic data, and clinical data of 152 patients with GBM were analyzed. Of these, 122 patients from the TCIA dataset formed the training set (n = 82) and validation set (n = 40), and 30 patients from local hospitals were used as an independent test dataset. Radiomics features were extracted from multiple regions of multiparameter MRI. Kaplan-Meier survival analysis was used to verify the ability of the imaging signature to predict the response of GBM patients to radiotherapy before surgery. Multivariate Cox regression including the radiomics signature and preoperative clinical risk factors was used to further improve the ability to predict the overall survival (OS) of individual GBM patients, which was presented in the form of a nomogram. Results. The radiomics signature was built from eight selected features. The C-index of the radiomics signature in the TCIA and independent test cohorts was 0.703 (P < 0.001) and 0.757 (P = 0.001), respectively. Multivariate Cox regression analysis confirmed that the radiomics signature (HR: 0.290, P < 0.001), age (HR: 1.023, P = 0.01), and KPS (HR: 0.968, P < 0.001) were independent risk factors for OS in GBM patients before surgery. When the radiomics signature and preoperative clinical risk factors were combined, the radiomics nomogram further improved the performance of OS prediction in individual patients (C-index = 0.764 and 0.758 in the TCIA and test cohorts, respectively). Conclusion. This study developed a radiomics signature that can predict the response of individual GBM patients to radiotherapy and may be a new supplement for precise GBM radiotherapy.
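The C-index quoted above measures how often a model ranks patients' survival correctly. A minimal sketch of Harrell's concordance index in plain Python follows; it is an illustration under stated assumptions rather than the authors' code. The function name `concordance_index` and the toy follow-up data are hypothetical, and `risks` is assumed to be oriented so that a higher value means shorter expected survival.

```python
def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs ordered correctly by risk.

    A pair (i, j) is comparable when patient i had the event (events[i] == 1)
    strictly before patient j's last follow-up time. It is concordant when
    patient i also has the higher predicted risk; ties in risk count as half.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical toy cohort: months of follow-up, event flag (1 = death,
# 0 = censored), and a predicted risk score.
times = [5, 8, 12, 20, 25]
events = [1, 1, 0, 1, 0]
risks = [0.9, 0.4, 0.6, 0.3, 0.1]
c_index = concordance_index(times, events, risks)
print(c_index)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale on which the reported 0.703 to 0.764 should be read.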


2019 ◽  
Vol 9 (1) ◽  
Author(s):  
Li-Na Liao ◽  
Tsai-Chung Li ◽  
Chia-Ing Li ◽  
Chiu-Shong Liu ◽  
Wen-Yuan Lin ◽  
...  

We evaluated whether genetic information could improve risk prediction of diabetic nephropathy (DN) when susceptibility variants were added to a risk prediction model with conventional risk factors in Han Chinese type 2 diabetes patients. A total of 995 (including 246 DN cases) and 519 (including 179 DN cases) type 2 diabetes patients were included in the derivation and validation sets, respectively. A genetic risk score (GRS) was constructed with DN susceptibility variants based on findings of our previous genome-wide association study. In the derivation set, the areas under the receiver operating characteristic curve (AUROC) (95% CI) for the model with clinical risk factors only, the model with GRS only, and the model with clinical risk factors and GRS were 0.75 (0.72–0.78), 0.64 (0.60–0.68), and 0.78 (0.75–0.81), respectively. In the external validation sample, the AUROC for the model combining conventional risk factors and GRS was 0.70 (0.65–0.74). Additionally, the net reclassification improvement was 9.98% (P = 0.001) when the GRS was added to the prediction model with a set of clinical risk factors. This prediction model confirmed the importance of combining the GRS with clinical factors in predicting the risk of DN and enhanced the identification of high-risk individuals for appropriate management and intervention.
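A genetic risk score is commonly formed by summing a patient's risk-allele counts across variants, optionally weighting each variant by an effect size. The abstract does not say which form was used, so the sketch below shows both as an assumption; the function name, variant identifiers, and weights are all hypothetical.

```python
def genetic_risk_score(allele_counts, weights=None):
    """Sum of risk-allele counts (0, 1, or 2 per variant), optionally weighted.

    allele_counts: dict mapping variant ID -> number of risk alleles carried.
    weights: optional dict mapping variant ID -> per-allele weight
             (for example, a log odds ratio); defaults to 1.0 per variant.
    """
    if weights is None:
        weights = {variant: 1.0 for variant in allele_counts}
    return sum(weights[variant] * count for variant, count in allele_counts.items())


# Hypothetical patient genotype and hypothetical per-allele weights.
patient = {"variant_A": 2, "variant_B": 0, "variant_C": 1}
example_weights = {"variant_A": 0.2, "variant_B": 0.5, "variant_C": 0.1}

unweighted_grs = genetic_risk_score(patient)
weighted_grs = genetic_risk_score(patient, example_weights)
print(unweighted_grs, weighted_grs)
```

The resulting score would then enter the prediction model as one additional covariate alongside the clinical risk factors.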


Author(s):  
Shaan Khurshid ◽  
Samuel Friedman ◽  
Christopher Reeder ◽  
Paolo Di Achille ◽  
Nathaniel Diamant ◽  
...  

Background: Artificial intelligence (AI)-enabled analysis of 12-lead electrocardiograms (ECGs) may facilitate efficient estimation of incident atrial fibrillation (AF) risk. However, it remains unclear whether AI provides meaningful and generalizable improvement in predictive accuracy beyond clinical risk factors for AF. Methods: We trained a convolutional neural network ("ECG-AI") to infer 5-year incident AF risk using 12-lead ECGs in patients receiving longitudinal primary care at Massachusetts General Hospital (MGH). We then fit three Cox proportional hazards models, each composed of: a) ECG-AI 5-year AF probability, b) the Cohorts for Heart and Aging in Genomic Epidemiology AF (CHARGE-AF) clinical risk score, and c) terms for both ECG-AI and CHARGE-AF ("CH-AI"). We assessed model performance by calculating discrimination (area under the receiver operating characteristic curve, AUROC) and calibration in an internal test set and two external test sets (Brigham and Women's Hospital [BWH] and UK Biobank). Models were recalibrated to estimate 2-year AF risk in the UK Biobank given limited available follow-up. We used saliency mapping to identify ECG features most influential on ECG-AI risk predictions and assessed correlation between ECG-AI and CHARGE-AF linear predictors. Results: The training set comprised 45,770 individuals (age 55±17 years, 53% women, 2,171 AF events), and the test sets comprised 83,162 individuals (age 59±13 years, 56% women, 2,424 AF events). AUROC was comparable using CHARGE-AF (MGH 0.802, 95% CI 0.767-0.836; BWH 0.752, 95% CI 0.741-0.763; UK Biobank 0.732, 95% CI 0.704-0.759) and ECG-AI (MGH 0.823, 95% CI 0.790-0.856; BWH 0.747, 95% CI 0.736-0.759; UK Biobank 0.705, 95% CI 0.673-0.737). AUROC was highest using CH-AI (MGH 0.838, 95% CI 0.807-0.869; BWH 0.777, 95% CI 0.766-0.788; UK Biobank 0.746, 95% CI 0.716-0.776). Calibration error was low using ECG-AI (MGH 0.0212; BWH 0.0129; UK Biobank 0.0035) and CH-AI (MGH 0.012; BWH 0.0108; UK Biobank 0.0001). In saliency analyses, the ECG P-wave had the greatest influence on AI model predictions. ECG-AI and CHARGE-AF linear predictors were correlated (Pearson r MGH 0.61, BWH 0.66, UK Biobank 0.41). Conclusions: AI-based analysis of 12-lead ECGs has similar predictive utility to a clinical risk factor model for incident AF and both approaches are complementary. ECG-AI may enable efficient quantification of future AF risk.


PLoS ONE ◽  
2021 ◽  
Vol 16 (2) ◽  
pp. e0247205 ◽  
Author(s):  
Gillian S. Dite ◽  
Nicholas M. Murphy ◽  
Richard Allman

Up to 30% of people who test positive for SARS-CoV-2 will develop severe COVID-19 and require hospitalisation. Age, gender, and comorbidities are known to be risk factors for severe COVID-19 but are generally considered independently without accurate knowledge of the magnitude of their effect on risk, potentially resulting in incorrect risk estimation. There is an urgent need for accurate prediction of the risk of severe COVID-19 for use in workplaces and healthcare settings, and for individual risk management. Clinical risk factors and a panel of 64 single-nucleotide polymorphisms (SNPs) were identified from published data. We used logistic regression to develop a model for severe COVID-19 in 1,582 UK Biobank participants aged 50 years and over who tested positive for the SARS-CoV-2 virus: 1,018 with severe disease and 564 without severe disease. Model discrimination was assessed using the area under the receiver operating characteristic curve (AUC). A model incorporating the SNP score and clinical risk factors (AUC = 0.786; 95% confidence interval = 0.763 to 0.808) had 111% better discrimination of disease severity than a model with just age and gender (AUC = 0.635; 95% confidence interval = 0.607 to 0.662). The effects of age and gender are attenuated by the other risk factors, suggesting that it is those risk factors, not age and gender, that confer risk of severe disease. In the whole UK Biobank, most are at low or only slightly elevated risk, but one-third are at two-fold or more increased risk. We have developed a model that enables accurate prediction of severe COVID-19. Continuing to rely on age and gender alone (or only clinical factors) to determine risk of severe COVID-19 will unnecessarily classify healthy older people as being at high risk and will fail to accurately quantify the increased risk for younger people with comorbidities.
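The abstract does not state how the 111% figure was derived. One reading that approximately reproduces it, offered here as an assumption, is to compare each model's discrimination above chance (AUC minus 0.5) rather than the raw AUCs. The function name below is hypothetical.

```python
def relative_gain_over_chance(auc_new, auc_old, chance=0.5):
    """Relative improvement in discrimination measured above the chance level."""
    return (auc_new - chance) / (auc_old - chance) - 1.0


# Rounded AUCs as reported in the abstract.
gain = relative_gain_over_chance(0.786, 0.635)
raw_ratio_gain = 0.786 / 0.635 - 1.0

print(f"gain above chance: {gain:.1%}")
print(f"gain in raw AUC:   {raw_ratio_gain:.1%}")
```

With the rounded AUCs this gives roughly 112% above chance, close to the reported 111% (the small gap may reflect rounding of the published AUCs), whereas the ratio of raw AUCs yields only about 24%.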


Circulation ◽  
2020 ◽  
Vol 142 (Suppl_3) ◽  
Author(s):  
Jack W Osullivan ◽  
Anna Shcherbina ◽  
Johanne M Justesen ◽  
Mintu Turakhia ◽  
Marco V Perez ◽  
...  

Introduction: Atrial fibrillation (AF) is associated with a five-fold increased risk of ischemic stroke. A portion of this risk is heritable; however, current risk stratification tools (CHA2DS2-VASc) do not include family history or genetic risk. Hypothesis: A polygenic risk score (PRS) is predictive of ischemic stroke in patients with AF, both independently and when integrated with clinical risk factors. Methods: Using data from the largest available GWAS in Europeans, we combined over half a million genetic variants to construct a PRS to predict ischemic stroke in patients with AF. We externally validated this PRS in independent data from the UK Biobank, both independently and integrated with clinical risk factors. Results: The tool integrating the PRS with clinical risk factors had the greatest predictive ability. Compared with the currently recommended risk tool (CHA2DS2-VASc), the integrated tool significantly improved net reclassification (NRI: 2.3% (95% CI: 1.3% to 3.0%)) and fit (χ² P=0.002). Independently, PRS was a significant predictor of ischemic stroke in patients with AF prospectively (Hazard Ratio: 1.13 per 1 SD (95% CI: 1.04 to 1.21)). Lastly, polygenic risk scores were uncorrelated with clinical risk factors (Pearson's correlation coefficient: -0.018). Conclusions: In patients with AF, there appears to be a significant association between PRS and risk of ischemic stroke. The greatest predictive ability was found with the integration of PRS and clinical risk factors; however, the prediction of stroke remains challenging.
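A hazard ratio expressed "per 1 SD" scales multiplicatively if the PRS enters the Cox model as a single linear term, which is an assumption here since the abstract does not describe the model form. Under that assumption, the sketch below shows what the reported point estimate would imply for larger differences in PRS; the function name is hypothetical.

```python
def hazard_ratio_for_sd_difference(hr_per_sd, sd_difference):
    """Hazard ratio implied by a log-linear Cox term for a k-SD difference."""
    return hr_per_sd ** sd_difference


# Reported point estimate: HR 1.13 per 1 SD of the PRS.
hr_two_sd = hazard_ratio_for_sd_difference(1.13, 2)
print(round(hr_two_sd, 4))
```

This is arithmetic on the point estimate only and ignores the uncertainty captured by the reported confidence interval.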

