Risk prediction in multicentre studies when there is confounding by cluster or informative cluster size

Abstract Background Clustered data arise in research when patients are clustered within larger units. Generalised Estimating Equations (GEE) and Generalised Linear Mixed Models (GLMM) can be used to provide marginal and cluster-specific inference and predictions, respectively. Methods Confounding by Cluster (CBC) and Informative Cluster Size (ICS) are two complications that may arise when modelling clustered data. CBC can arise when the distribution of a predictor variable (termed ‘exposure’) varies between clusters, causing confounding of the exposure-outcome relationship. ICS means that the cluster size, conditional on covariates, is not independent of the outcome. In both situations, standard GEE and GLMM may provide biased or misleading inference, and modifications have been proposed. However, both CBC and ICS are routinely overlooked in the context of risk prediction, and their impact on the predictive ability of the models has been little explored. We study the effect of CBC and ICS on the predictive ability of risk models for binary outcomes when GEE and GLMM are used. We examine whether two simple approaches to handling CBC and ICS, which involve adjusting for the cluster mean of the exposure and the cluster size, respectively, can improve the accuracy of predictions. Results Both CBC and ICS can be viewed as violations of the assumptions of the standard GLMM: the random effects are correlated with the exposure for CBC and with the cluster size for ICS. Based on these principles, we simulated data subject to CBC/ICS. The simulation studies suggested that the predictive ability of models derived using standard GLMM and GEE, ignoring CBC/ICS, was affected. Marginal predictions were found to be mis-calibrated. Adjusting for the cluster mean of the exposure or the cluster size improved calibration, discrimination and the overall predictive accuracy of marginal predictions by explaining part of the between-cluster variability. The presence of CBC/ICS did not affect the accuracy of conditional predictions. We illustrate these concepts using real data from a multicentre study with potential CBC. Conclusion Ignoring CBC and ICS when developing prediction models for clustered data can affect the accuracy of marginal predictions. Adjusting for the cluster mean of the exposure or the cluster size can improve the predictive accuracy of marginal predictions.
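
The two adjustments named in this abstract, adding the cluster mean of the exposure and the cluster size as extra covariates, can be sketched in a few lines. This is an illustration only and not the authors' code: the function name `add_cluster_covariates` and the field names are hypothetical, and fitting the GEE or GLMM on the augmented records is left out.

```python
from collections import defaultdict


def add_cluster_covariates(rows):
    """Attach cluster-level covariates to patient-level records.

    rows: list of (cluster_id, exposure) tuples, one per patient.
    Returns one dict per patient, in input order, holding the original
    exposure plus the cluster mean of the exposure and the cluster size.
    (Hypothetical helper; names are assumptions, not from the paper.)
    """
    totals = defaultdict(float)
    sizes = defaultdict(int)
    # First pass: accumulate the exposure total and patient count per cluster.
    for cluster_id, exposure in rows:
        totals[cluster_id] += exposure
        sizes[cluster_id] += 1

    # Second pass: copy the cluster-level summaries onto every patient record.
    augmented = []
    for cluster_id, exposure in rows:
        augmented.append({
            "cluster": cluster_id,
            "exposure": exposure,
            "cluster_mean_exposure": totals[cluster_id] / sizes[cluster_id],
            "cluster_size": sizes[cluster_id],
        })
    return augmented


if __name__ == "__main__":
    patients = [("A", 1.0), ("A", 3.0), ("B", 10.0)]
    for record in add_cluster_covariates(patients):
        print(record)
```

The augmented records would then be passed to whichever marginal or mixed model is being fitted, with `cluster_mean_exposure` targeting CBC and `cluster_size` targeting ICS.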

Clinical risk prediction models and informative cluster size: Assessing the performance of a suicide risk prediction algorithm

Biometrical Journal ◽

10.1002/bimj.202000199 ◽

2021 ◽

Author(s):

Rebecca Yates Coley ◽

Rod L. Walker ◽

Maricela Cruz ◽

Gregory E. Simon ◽

Susan M. Shortreed

Keyword(s):

Risk Prediction ◽

Cluster Size ◽

Suicide Risk ◽

Prediction Models ◽

Prediction Algorithm ◽

Risk Prediction Models ◽

Clinical Risk ◽

Informative Cluster Size

A review of literature on risk prediction tools for hospital readmissions in older adults

Journal of Health Organization and Management ◽

10.1108/jhom-11-2020-0450 ◽

2022 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Michelle Louise Gatt ◽

Maria Cassar ◽

Sandra C. Buttigieg

Keyword(s):

Risk Prediction ◽

Prediction Models ◽

Hospital Readmissions ◽

Predictive Ability ◽

Chronic Obstructive ◽

Risk Category ◽

Extensive Literature ◽

Content Type ◽

Prediction Tools ◽

Readmission Risk

Purpose The purpose of this paper is to identify and analyse the readmission risk prediction tools reported in the literature and their benefits for healthcare organisations and management. Design/methodology/approach Readmission risk prediction is a growing topic of interest, with the aim of identifying patients, in particular those suffering from chronic diseases such as congestive heart failure, chronic obstructive pulmonary disease and diabetes, who are at risk of readmission. Several models have been developed with different levels of predictive ability. A structured and extensive literature search of several databases was conducted using the Preferred Reporting Items for Systematic Reviews and Meta-analysis strategy, and this yielded a total of 48,984 records. Findings Forty-three articles were selected for full-text and extensive review after following the screening process and according to the eligibility criteria. About 34 unique readmission risk prediction models were identified, whose predictive ability ranged from poor to good (c statistic 0.5–0.86). Readmission rates ranged between 3.1 and 74.1% depending on the risk category. This review shows that readmission risk prediction is a complex process and is still relatively new as a concept and poorly understood. It confirms that readmission prediction models hold significant accuracy at identifying patients at higher risk for such an event within a specific context. Research limitations/implications Since most prediction models were developed for specific populations, conditions or hospital settings, the generalisability and transferability of the predictions across wider or other contexts may be difficult to achieve. Therefore, the value of prediction models remains limited to hospital management. Future research is indicated in this regard. Originality/value This review is the first to cover readmission risk prediction tools that have been published in the literature since 2011, thereby providing an assessment of the relevance of this crucial KPI to health organisations and managers.

Abstract W MP37: Novel Prognostic Scores for Early Prediction of Outcome Following Aneurysmal Subarachnoid Hemorrhage

Stroke ◽

10.1161/str.46.suppl_1.wmp37 ◽

2015 ◽

Vol 46 (suppl_1) ◽

Author(s):

Blessing Jaja ◽

Hester Lingsma ◽

Ewout Steyerberg ◽

R. Loch Macdonald ◽

Keyword(s):

Subarachnoid Hemorrhage ◽

Cross Validation ◽

Prediction Models ◽

Aneurysmal Subarachnoid Hemorrhage ◽

Model Performance ◽

Predictor Variable ◽

Predictive Ability ◽

Prognostic Scores ◽

Operating Characteristics ◽

Fisher Grade

Background: Aneurysmal subarachnoid hemorrhage (SAH) is a cerebrovascular emergency. Currently, clinicians have limited tools to estimate outcomes early after hospitalization. We aimed to develop novel prognostic scores using large cohorts of patients reflecting experience from different settings. Methods: Logistic regression analysis was used to develop prediction models for mortality and unfavorable outcomes according to the 3-month Glasgow outcome score after SAH, based on readily obtained parameters at hospital admission. The development cohort was derived from 10 prospective studies involving 10936 patients in the Subarachnoid Hemorrhage International Trialists (SAHIT) repository. Model performance was assessed by bootstrap internal validation and by cross validation by omission of each of the 10 studies, using the R2 statistic, the area under the receiver operating characteristics curve (AUC), and calibration plots. Prognostic scores were developed from the regression coefficients. Results: The predictor variable with the strongest prognostic strength was neurologic status (partial R2 = 12.03%), followed by age (1.91%), treatment modality (1.25%), Fisher grade of CT clot burden (0.65%), history of hypertension (0.37%), aneurysm size (0.12%) and aneurysm location (0.06%). These predictors were combined to develop 3 sets of hierarchical scores based on the coefficients of the regression models. The AUC at bootstrap validation was 0.79-0.80, and at cross validation was 0.64-0.85. Calibration plots demonstrated satisfactory agreement between predicted and observed probabilities of the outcomes. Conclusions: The novel prognostic scores have good predictive ability and potential for broad application, as they have been developed from prospective cohorts reflecting experience from different centers globally.
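
The "cross validation by omission of each of the 10 studies" described above amounts to a leave-one-study-out split. A minimal sketch of that splitting step follows, assuming each patient record carries a study identifier. The function name `leave_one_study_out` is hypothetical, and model fitting and scoring on each split are deliberately omitted.

```python
def leave_one_study_out(study_ids):
    """Yield (held_out_study, train_indices, test_indices) per study.

    study_ids: list giving the source study of each patient record.
    Each distinct study is held out once; all other records form the
    training set for that round. (Hypothetical helper for illustration.)
    """
    for held_out in sorted(set(study_ids)):
        train = [i for i, s in enumerate(study_ids) if s != held_out]
        test = [i for i, s in enumerate(study_ids) if s == held_out]
        yield held_out, train, test


if __name__ == "__main__":
    ids = ["s1", "s2", "s1", "s3"]
    for study, train_idx, test_idx in leave_one_study_out(ids):
        print(study, train_idx, test_idx)
```

Holding out whole studies rather than random patients is what lets the cross-validated AUC reflect performance in a setting the model has not seen.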

Financial Compass for Slovak Enterprises: Modeling Economic Stability of Agricultural Entities

Journal of Risk and Financial Management ◽

10.3390/jrfm13050092 ◽

2020 ◽

Vol 13 (5) ◽

pp. 92

Author(s):

Katarina Valaskova ◽

Pavol Durana ◽

Peter Adamko ◽

Jaroslav Jaros

Keyword(s):

Prediction Models ◽

Predictive Accuracy ◽

Characteristic Curve ◽

Confusion Matrix ◽

Predictive Ability ◽

Early Warning Systems ◽

Emerging Countries ◽

Bankruptcy Prediction ◽

Financial Health ◽

Prediction Ability

The risk of corporate financial distress negatively affects the operation of the enterprise itself and can change the financial performance of all other partners that come into close or wider contact. To identify these risks, business entities use early warning systems, prediction models, which help identify the level of corporate financial health. Despite the fact that the relevant financial analyses and financial health predictions are crucial to mitigate or eliminate the potential risks of bankruptcy, the modeling of financial health in emerging countries is mostly based on models which were developed in different economic sectors and countries. However, several prediction models have been introduced in emerging countries (also in Slovakia) in the last few years. Thus, the main purpose of the paper is to verify the predictive ability of the bankruptcy models formed in conditions of the Slovak economy in the sector of agriculture. To compare their predictive accuracy the confusion matrix (cross tables) and the receiver operating characteristic curve are used, which allow more detailed analysis than the mere proportion of correct classifications (predictive accuracy). The results indicate that the models developed in the specific economic sector highly outperform the prediction ability of other models either developed in the same country or abroad, usage of which is then questionable considering the issue of prediction accuracy. The research findings confirm that the highest predictive ability of the bankruptcy prediction models is achieved provided that they are used in the same economic conditions and industrial sector in which they were primarily developed.
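
The two comparison tools named in this abstract, the confusion matrix and the receiver operating characteristic curve, can be illustrated with a short sketch. It assumes binary labels (1 = distressed, 0 = healthy) and a model score where higher means higher predicted risk. The function names and the 0.5 threshold are illustrative assumptions, not taken from the paper, and the area under the ROC curve is computed through its pairwise-ranking interpretation.

```python
def confusion_matrix(labels, scores, threshold=0.5):
    """Count true/false positives/negatives at a given score threshold."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for y, s in zip(labels, scores):
        predicted = 1 if s >= threshold else 0
        if predicted == 1 and y == 1:
            counts["tp"] += 1
        elif predicted == 1 and y == 0:
            counts["fp"] += 1
        elif predicted == 0 and y == 0:
            counts["tn"] += 1
        else:
            counts["fn"] += 1
    return counts


def auc(labels, scores):
    """Area under the ROC curve as the share of (positive, negative)
    pairs in which the positive case has the higher score; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    y = [1, 0, 1, 0]
    s = [0.9, 0.8, 0.4, 0.1]
    print(confusion_matrix(y, s))
    print(auc(y, s))
```

In this toy input the proportion of correct classifications at the chosen threshold is only 0.5 while the AUC is 0.75, which shows why a threshold-free summary can give a fuller picture than accuracy alone.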

Review of methods for handling confounding by cluster and informative cluster size in clustered data

Statistics in Medicine ◽

10.1002/sim.6277 ◽

2014 ◽

Vol 33 (30) ◽

pp. 5371-5387 ◽

Cited By ~ 29

Author(s):

Shaun Seaman ◽

Menelaos Pavlou ◽

Andrew Copas

Keyword(s):

Cluster Size ◽

Clustered Data ◽

Informative Cluster Size

Performance of Atrial Fibrillation Risk Prediction Models in Over Four Million Individuals

Circulation Arrhythmia and Electrophysiology ◽

10.1161/circep.120.008997 ◽

2020 ◽

Author(s):

Shaan Khurshid ◽

Uri Kartoun ◽

Jeffrey M. Ashburner ◽

Ludovic Trinquart ◽

Anthony Philippakis ◽

...

Keyword(s):

Heart Failure ◽

Atrial Fibrillation ◽

Risk Prediction ◽

Proportional Hazards ◽

Prediction Models ◽

Predictive Accuracy ◽

Health Management ◽

Cox Proportional Hazards ◽

Risk Scores ◽

Office Visits

Background - Atrial fibrillation (AF) is associated with increased risks of stroke and heart failure. Electronic health record (EHR) based AF risk prediction may facilitate efficient deployment of interventions to diagnose or prevent AF altogether. Methods - We externally validated an EHR atrial fibrillation (EHR-AF) score in IBM Explorys Life Sciences, a multi-institutional dataset containing statistically de-identified EHR data for over 21 million individuals ("Explorys Dataset"). We included individuals with complete AF risk data, ≥2 office visits within two years, and no prevalent AF. We compared EHR-AF to existing scores including CHARGE-AF, C2HEST, and CHA2DS2-VASc. We assessed association between AF risk scores and 5-year incident AF, stroke, and heart failure using Cox proportional hazards modeling, 5-year AF discrimination using c-indices, and calibration of predicted AF risk to observed AF incidence. Results - Of 21,825,853 individuals in the Explorys Dataset, 4,508,180 comprised the analysis (age 62.5, 56.3% female). AF risk scores were strongly associated with 5-year incident AF (hazard ratio [HR] per standard deviation [SD] increase 1.85 using CHA2DS2-VASc to 2.88 using EHR-AF), stroke (1.61 using C2HEST to 1.92 using CHARGE-AF), and heart failure (1.91 using CHA2DS2-VASc to 2.58 using EHR-AF). EHR-AF (c-index 0.808 [95%CI 0.807-0.809]) demonstrated favorable AF discrimination compared to CHARGE-AF (0.806 [0.805-0.807]), C2HEST (0.683 [0.682-0.684]), and CHA2DS2-VASc (0.720 [0.719-0.722]). Of the scores, EHR-AF demonstrated the best calibration to incident AF (calibration slope 1.002 [0.997-1.007]). In subgroup analyses, AF discrimination using EHR-AF was lower in individuals with stroke (c-index 0.696 [0.692-0.700]) and heart failure (0.621 [0.617-0.625]). Conclusions - EHR-AF demonstrates predictive accuracy for incident AF using readily ascertained EHR data. AF risk is associated with incident stroke and heart failure. Use of such risk scores may facilitate decision-support and population health management efforts focused on minimizing AF-related morbidity.

Abstract 032: Is Bigger Data Better? Predicting Readmissions in Acute Myocardial Infarction on Admission versus Discharge With Electronic Health Record Data

Circulation Cardiovascular Quality and Outcomes ◽

10.1161/circoutcomes.10.suppl_3.032 ◽

2017 ◽

Vol 10 (suppl_3) ◽

Author(s):

Oanh K Nguyen ◽

Anil N Makam ◽

Christopher Clark ◽

Song Zhang ◽

Sandeep R Das ◽

...

Keyword(s):

Myocardial Infarction ◽

Acute Myocardial Infarction ◽

Electronic Health Record ◽

Risk Prediction ◽

Prediction Models ◽

Predictive Ability ◽

Health Record ◽

Electronic Health Record Data ◽

Electronic Health ◽

Using Data

Background: Readmissions after hospitalization for acute myocardial infarction (AMI) are common, but the few available risk prediction models have poor predictive ability. Including more data from hospitalization may improve risk prediction. Objectives: To assess if an AMI-specific electronic health record (EHR) readmission risk prediction model derived and validated from data through the entire hospital course (‘full stay’ model) outperforms a model using data available only from the first day of hospitalization (‘first day’ model). Methods: EHR data from AMI hospitalizations from 6 diverse hospitals in north Texas from 2009-2010 were used to derive a model predicting all-cause non-elective 30-day readmissions which was then validated using five-fold cross-validation. Results: Of 826 consecutive index AMI admissions, 13% were followed by a 30-day readmission. History of diabetes (AOR 2.41, 95% CI 1.37-4.24), SBP <100 mmHg on admission (AOR 2.18, 95% CI 1.68-2.82), elevated Cr (≥2 mg/dL) on admission (AOR 2.56, 95% CI 2.52-6.08), elevated BNP on admission (AOR 6.36, 95% CI 1.65-24.47) and lack of PCI within 24 hours of admission (AOR 1.31, 95% CI 1.02-1.69) were significant predictors of readmission. Our ‘first-day’ AMI readmissions model based on these predictors had good discrimination (Table). Adding three other variables from the hospital course - use of IV diuretics (AOR 1.58, 95% CI 1.07-2.31), anemia (hematocrit ≤ 33%) on discharge (AOR 2.04, 95% CI 1.20-3.46), and discharge to post-acute care (AOR 1.50, 95% CI 0.90-2.50) - improved discrimination of the ‘full stay’ AMI model but only modestly improved net reclassification and calibration. Conclusions: A ‘full-stay’ AMI-specific EHR readmission model modestly outperformed a ‘first-day’ EHR model, a multi-condition EHR model, and the CMS AMI model. Surprisingly, incorporating more hospitalization data improved discrimination of the full-stay AMI model but did not meaningfully improve reclassification compared to the first-day model. Readmissions in AMI may be accurately predicted on the first day of hospitalization; waiting until later in hospitalization does not markedly improve risk prediction.

Machine learning techniques for personalized breast cancer risk prediction: comparison with the BCRAT and BOADICEA models

Breast Cancer Research ◽

10.1186/s13058-019-1158-4 ◽

2019 ◽

Vol 21 (1) ◽

Cited By ~ 13

Author(s):

Chang Ming ◽

Valeria Viassolo ◽

Nicole Probst-Hensch ◽

Pierre O. Chappuis ◽

Ivo D. Dinov ◽

...

Keyword(s):

Breast Cancer ◽

Breast Cancer Risk ◽

Cancer Risk ◽

Risk Prediction ◽

Prediction Models ◽

Predictive Accuracy ◽

Population Based ◽

Breast Cancer Patients ◽

Adaptive Boosting ◽

Discriminatory Accuracy

Abstract Background Comprehensive breast cancer risk prediction models enable identifying and targeting women at high-risk, while reducing interventions in those at low-risk. Breast cancer risk prediction models used in clinical practice have low discriminatory accuracy (0.53–0.64). Machine learning (ML) offers an alternative approach to standard prediction modeling that may address current limitations and improve accuracy of those tools. The purpose of this study was to compare the discriminatory accuracy of ML-based estimates against a pair of established methods—the Breast Cancer Risk Assessment Tool (BCRAT) and Breast and Ovarian Analysis of Disease Incidence and Carrier Estimation Algorithm (BOADICEA) models. Methods We quantified and compared the performance of eight different ML methods to the performance of BCRAT and BOADICEA using eight simulated datasets and two retrospective samples: a random population-based sample of U.S. breast cancer patients and their cancer-free female relatives (N = 1143), and a clinical sample of Swiss breast cancer patients and cancer-free women seeking genetic evaluation and/or testing (N = 2481). Results Predictive accuracy (AU-ROC curve) reached 88.28% using ML-Adaptive Boosting and 88.89% using ML-random forest versus 62.40% with BCRAT for the U.S. population-based sample. Predictive accuracy reached 90.17% using ML-adaptive boosting and 89.32% using ML-Markov chain Monte Carlo generalized linear mixed model versus 59.31% with BOADICEA for the Swiss clinic-based sample. Conclusions There was a striking improvement in the accuracy of classification of women with and without breast cancer achieved with ML algorithms compared to the state-of-the-art model-based approaches. High-accuracy prediction techniques are important in personalized medicine because they facilitate stratification of prevention strategies and individualized clinical management.

Short-term predictive ability of selected cardiovascular risk prediction models in a rural Bangladeshi population: a case-cohort study

BMC Cardiovascular Disorders ◽

10.1186/s12872-016-0279-2 ◽

2016 ◽

Vol 16 (1) ◽

Cited By ~ 5

Author(s):

Kaniz Fatema ◽

Bayzidur Rahman ◽

Nicholas Arnold Zwar ◽

Abul Hasnat Milton ◽

Liaquat Ali

Keyword(s):

Cohort Study ◽

Cardiovascular Risk ◽

Risk Prediction ◽

Prediction Models ◽

Predictive Ability ◽

Short Term ◽

Risk Prediction Models ◽

Cardiovascular Risk Prediction ◽

Bangladeshi Population

BWGS: a R package for genomic selection and its application to a wheat breeding programme

10.1101/763037 ◽

2019 ◽

Author(s):

Gilles Charmet ◽

Louis Gautier Tran ◽

Jérôme Auzanneau ◽

Renaud Rincent ◽

Sophie Bouchet

Keyword(s):

Missing Data ◽

Genomic Selection ◽

Prediction Models ◽

Predictive Accuracy ◽

Predictive Ability ◽

Breeding Programme ◽

Training Set ◽

Desktop Computer ◽

Marker Selection ◽

Breeding Programmes

Abstract We developed an integrated R library called BWGS to enable easy computation of Genomic Estimates of Breeding values (GEBV) for genomic selection. BWGS relies on existing R-libraries, all freely available from CRAN servers. The two main functions make it possible to run 1) replicated random cross validations within a training set of genotyped and phenotyped lines and 2) GEBV prediction for a set of genotyped-only lines. Options are available for 1) missing data imputation, 2) marker and training set selection and 3) genomic prediction with 15 different methods, either parametric or semi-parametric. The usefulness and efficiency of BWGS are illustrated using a population of wheat lines from a real breeding programme. Adjusted yield data from historical trials (highly unbalanced design) were used for testing the options of BWGS. On the whole, 760 candidate lines with adjusted phenotypes and genotypes for 47 839 robust SNP were used. With a simple desktop computer, we obtained results which compared with previously published results on wheat genomic selection. As predicted by the theory, the factors that most influence predictive ability, for a given trait of moderate heritability, are the size of the training population and a minimum number of markers for capturing every QTL information. Missing data up to 40%, if randomly distributed, do not degrade predictive ability once imputed, and up to 80% randomly distributed missing data are still acceptable once imputed with the Expectation-Maximization method of package rrBLUP. It is worth noting that selecting the markers most associated with the trait does improve predictive ability, compared with the whole set of markers, but only when marker selection is made on the whole population. When marker selection is made only on the sampled training set, this advantage nearly disappears, since it was clearly due to overfitting. Few differences are observed between the 15 prediction models with this dataset. Although non-parametric methods that are supposed to capture non-additive effects have slightly better predictive accuracy, differences remain small. Finally, the GEBV from the 15 prediction models are all highly correlated with each other. These results are encouraging for an efficient use of genomic selection in applied breeding programmes, and BWGS is a simple and powerful toolbox to apply in breeding programmes or training activities.
