Sample size for binary logistic prediction models: Beyond events per variable criteria

2018 ◽  
Vol 28 (8) ◽  
pp. 2455-2474 ◽  
Author(s):  
Maarten van Smeden ◽  
Karel GM Moons ◽  
Joris AH de Groot ◽  
Gary S Collins ◽  
Douglas G Altman ◽  
...  

Binary logistic regression is one of the most frequently applied statistical approaches for developing clinical prediction models. Developers of such models often rely on an Events Per Variable criterion (EPV), notably EPV ≥10, to determine the minimal sample size required and the maximum number of candidate predictors that can be examined. We present an extensive simulation study in which we studied the influence of EPV, events fraction, number of candidate predictors, the correlations and distributions of candidate predictor variables, area under the ROC curve, and predictor effects on out-of-sample predictive performance of prediction models. The out-of-sample performance (calibration, discrimination and probability prediction error) of developed prediction models was studied before and after regression shrinkage and variable selection. The results indicate that EPV does not have a strong relation with metrics of predictive performance and is not an appropriate criterion for (binary) prediction model development studies. We show that out-of-sample predictive performance is better approximated by considering the number of predictors, the total sample size and the events fraction. We propose that the development of new sample size criteria for prediction models should be based on these three parameters, and provide suggestions for improving sample size determination.
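To make the quantities concrete, here is a minimal Python sketch (not the authors' simulation code; all data and settings are synthetic and illustrative) of how EPV is computed for a development set and how the out-of-sample calibration slope of a fitted logistic model is estimated on a large validation sample.

```python
# Illustrative sketch: EPV and out-of-sample calibration slope.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

def simulate(n, p, beta0=-1.5, beta=0.5):
    X = rng.normal(size=(n, p))
    lin = beta0 + beta * X.sum(axis=1)
    y = rng.binomial(1, 1 / (1 + np.exp(-lin)))
    return X, y

n_dev, n_val, p = 500, 10_000, 8
X_dev, y_dev = simulate(n_dev, p)
X_val, y_val = simulate(n_val, p)

epv = y_dev.sum() / p  # events per candidate predictor
fit = sm.Logit(y_dev, sm.add_constant(X_dev)).fit(disp=0)

# Out-of-sample calibration slope: regress the outcome on the linear predictor.
lp_val = sm.add_constant(X_val) @ fit.params
slope = sm.Logit(y_val, sm.add_constant(lp_val)).fit(disp=0).params[1]
print(f"EPV = {epv:.1f}, out-of-sample calibration slope = {slope:.2f}")
```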

2021 ◽  
pp. 096228022110463
Author(s):  
Glen P Martin ◽  
Richard D Riley ◽  
Gary S Collins ◽  
Matthew Sperrin

Recent minimum sample size formulae (Riley et al.) for developing clinical prediction models help ensure that development datasets are of sufficient size to minimise overfitting. While these criteria are known to avoid excessive overfitting on average, the extent of variability in overfitting at recommended sample sizes is unknown. We investigated this through a simulation study and an empirical example, developing logistic regression clinical prediction models using unpenalised maximum likelihood estimation and various post-estimation shrinkage or penalisation methods. While the mean calibration slope was close to the ideal value of one for all methods, penalisation further reduced the level of overfitting, on average, compared to unpenalised methods. This came at the cost of higher variability in predictive performance for penalisation methods in external data. We recommend that penalisation methods be used in data that meet, or surpass, minimum sample size requirements to further mitigate overfitting, and that the variability in predictive performance and any tuning parameters always be examined as part of the model development process, since this provides additional information over average (optimism-adjusted) performance alone. Lower variability would give reassurance that the developed clinical prediction model will perform well in new individuals from the same population as was used for model development.
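As a companion to the abstract, here is a hedged sketch of the expected-shrinkage criterion from the Riley et al. formulae (one of several criteria in that paper; the pmsampsize package implements the full set). The input values below are illustrative assumptions, not values from either study.

```python
# Sketch of the Riley et al. expected-shrinkage criterion for binary outcomes.
import math

def riley_n_shrinkage(p, r2_cs, shrinkage=0.9):
    """Minimum n so the expected uniform shrinkage factor is >= `shrinkage`.

    p         : number of candidate predictor parameters
    r2_cs     : anticipated Cox-Snell R-squared of the model
    shrinkage : target expected shrinkage factor (0.9 is the usual default)
    """
    return p / ((shrinkage - 1) * math.log(1 - r2_cs / shrinkage))

# e.g. 12 parameters and an anticipated Cox-Snell R^2 of 0.15:
n = riley_n_shrinkage(p=12, r2_cs=0.15)
print(math.ceil(n))  # minimum development sample size under this criterion
```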


2021 ◽  
Author(s):  
Sebastian Johannes Fritsch ◽  
Konstantin Sharafutdinov ◽  
Moein Einollahzadeh Samadi ◽  
Gernot Marx ◽  
Andreas Schuppert ◽  
...  

BACKGROUND During the course of the COVID-19 pandemic, a variety of machine learning models were developed to predict different aspects of the disease, such as long-term course, organ dysfunction or ICU mortality. The number of training datasets used has increased significantly over time. However, these data now come from different waves of the pandemic, which did not always involve the same therapeutic approaches, and outcomes changed between waves. The impact of these changes on model development has not yet been studied. OBJECTIVE The aim of this investigation was to examine the predictive performance of several models trained on data from one wave when predicting the other wave's data, and the impact of pooling these datasets. Finally, a method for comparing different datasets for heterogeneity is introduced. METHODS We used two datasets, from wave one and wave two, to develop several models predicting patient mortality. Four classification algorithms were used: logistic regression (LR), support vector machine (SVM), random forest classifier (RF) and AdaBoost classifier (ADA). We also performed mutual prediction, applying each model to the data of the wave not used for its training. We then compared model performance when a pooled dataset from the two waves was used. The populations from the different waves were checked for heterogeneity using a convex hull analysis. RESULTS 63 patients from wave one (03-06/2020) and 54 from wave two (08/2020-01/2021) were evaluated. For both waves separately, we found models reaching sufficient accuracies, up to 0.79 AUROC (95%-CI 0.76-0.81) for SVM on the first wave and up to 0.88 AUROC (95%-CI 0.86-0.89) for RF on the second wave. After pooling the data, the AUROC decreased markedly. In the mutual prediction, models trained on the second wave's data, when applied to the first wave's data, predicted non-survivors well but classified survivors insufficiently. The opposite setup (training: first wave, test: second wave) showed the inverse behaviour, with models correctly classifying survivors and incorrectly predicting non-survivors. The convex hull analysis for the first- and second-wave populations showed a more inhomogeneous distribution of the underlying data when compared to randomly selected sets of patients of the same size. CONCLUSIONS Our work demonstrates that a larger dataset is not a universal solution to all machine learning problems in clinical settings. Rather, it shows that inhomogeneous data used to develop models can lead to serious problems. With the convex hull analysis, we offer a solution to this problem. The outcome of such an analysis can raise concerns that pooling different datasets would introduce inhomogeneous patterns preventing better predictive performance.
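The paper's exact convex hull heterogeneity procedure is not reproduced here, but the following sketch shows a plausible building block under simple assumptions: testing, via a Delaunay triangulation, how many patients of one wave fall inside the convex hull of the other wave's feature space. The data are synthetic.

```python
# Sketch of a convex-hull overlap check between two patient cohorts.
import numpy as np
from scipy.spatial import Delaunay

rng = np.random.default_rng(0)
wave1 = rng.normal(0.0, 1.0, size=(63, 3))   # e.g. 63 patients, 3 features
wave2 = rng.normal(0.5, 1.2, size=(54, 3))   # shifted second-wave cohort

hull = Delaunay(wave1)                        # triangulate the wave-1 point cloud
inside = hull.find_simplex(wave2) >= 0        # True where a wave-2 point is inside

print(f"{inside.mean():.0%} of wave-2 patients fall inside the wave-1 hull")
# Comparing this fraction against random splits of the pooled data gives a
# simple baseline for judging whether the two waves are inhomogeneous.
```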


2021 ◽  
Vol 36 (Supplement_1) ◽  
Author(s):  
A Youssef

Abstract Study question Which models exist that predict pregnancy outcome in couples with unexplained RPL, and what is the performance of the most widely used model? Summary answer We identified seven prediction models; none followed the recommended prediction model development steps. Moreover, the most used model showed poor predictive performance. What is known already RPL remains unexplained in 50–75% of couples. For these couples, there is no effective treatment option and clinical management rests on supportive care. An essential part of supportive care consists of counselling on the prognosis of subsequent pregnancies. Multiple prediction models exist; however, the quality and validity of these models vary. In addition, the prediction model developed by Brigham et al. is the most widely used model, but it has never been externally validated. Study design, size, duration We performed a systematic review to identify prediction models for pregnancy outcome after unexplained RPL. In addition, we performed an external validation of the Brigham model in a retrospective cohort consisting of 668 couples with unexplained RPL who visited our RPL clinic between 2004 and 2019. Participants/materials, setting, methods A systematic search was performed in December 2020 in Pubmed, Embase, Web of Science and the Cochrane library to identify relevant studies. Eligible studies were selected and assessed according to the TRIPOD guidelines, covering model performance and validation statements. The performance of the Brigham model in predicting live birth was evaluated through calibration and discrimination, in which the observed pregnancy rates were compared to the predicted pregnancy rates. Main results and the role of chance Seven models were compared and assessed according to the TRIPOD statement. This resulted in two studies of low, three of moderate and two of above-average reporting quality. These studies did not follow the recommended steps for model development and did not calculate a sample size. Furthermore, the predictive performance of none of these models was internally or externally validated. We performed an external validation of the Brigham model. Calibration showed that the model overestimated outcomes and made too extreme predictions, with a calibration intercept of –0.52 (95% CI –0.68 to –0.36) and a calibration slope of 0.39 (95% CI 0.07 to 0.71). The discriminative ability of the model was very low, with a concordance statistic of 0.55 (95% CI 0.50 to 0.59). Limitations, reasons for caution Not all studies explicitly labelled their models as prediction models, so models may have been missed in the selection process. The external validation cohort used a retrospective design, in which only the first pregnancy after intake was registered. Follow-up time was not limited, which is important in counselling unexplained RPL couples. Wider implications of the findings: Currently, there are no suitable models that predict pregnancy outcome after RPL. Moreover, we need a model with several variables, such that prognosis is individualized, including factors from both the female and the male partner to enable a couple-specific prognosis. Trial registration number Not applicable
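For reference, the external-validation metrics reported here (calibration intercept, calibration slope, concordance statistic) can be computed as in the following sketch. The data are synthetic stand-ins, not the study cohort.

```python
# Sketch: calibration intercept, calibration slope, and c-statistic.
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(7)
p_hat = rng.uniform(0.3, 0.9, size=668)           # model-predicted probabilities
y = rng.binomial(1, np.clip(p_hat - 0.15, 0, 1))  # outcomes worse than predicted

logit_p = np.log(p_hat / (1 - p_hat))

# Calibration slope: logistic regression of the outcome on the linear predictor.
slope_fit = sm.Logit(y, sm.add_constant(logit_p)).fit(disp=0)
print("calibration slope:", round(slope_fit.params[1], 2))

# Calibration intercept (calibration-in-the-large): refit with the linear
# predictor as a fixed offset so only the intercept is estimated.
int_fit = sm.GLM(y, np.ones((len(y), 1)), offset=logit_p,
                 family=sm.families.Binomial()).fit()
print("calibration intercept:", round(int_fit.params[0], 2))

# Discrimination: concordance statistic (identical to the AUROC).
print("c-statistic:", round(roc_auc_score(y, p_hat), 2))
```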


2021 ◽  
Vol 29 (1) ◽  
Author(s):  
Hezlin Aryani Abd Rahman ◽  
Yap Bee Wah ◽  
Ong Seng Huat

Logistic regression is often used for the classification of a binary categorical dependent variable using various types of covariates (continuous or categorical). Imbalanced data, which occur when the number of cases in one category of the binary dependent variable is very much smaller than in the other, bias the parameter estimates and degrade the classification performance of the logistic regression model. This simulation study investigates the effect of imbalanced data, measured by the imbalance ratio, on the parameter estimate of a binary logistic regression with a categorical covariate. Datasets were simulated with the imbalance ratio (IR) controlled at different percentages, from 1% to 50%, and for various sample sizes. The simulated datasets were then modeled using binary logistic regression, and the bias in the estimates was measured using the Mean Square Error (MSE). The simulation results provided evidence that the effect of the imbalance ratio on the parameter estimate of the covariate decreased as sample size increased. The bias of the estimates depended on sample size: for sample sizes of 100, 500, 1000–2000 and 2500–3500, the estimates were biased for IR below 30%, 10%, 5% and 2%, respectively. Results also showed that the parameter estimates were biased at an IR of 1% for all sample sizes. An application using a real dataset supported the simulation results.
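A hedged sketch of this kind of simulation follows: it estimates the MSE of the logistic coefficient for a binary covariate across replications, varying the imbalance ratio at a fixed sample size. All settings (true coefficient, replication count) are illustrative, not the study's design.

```python
# Sketch: MSE of a logistic coefficient under varying imbalance ratios.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(42)
TRUE_BETA = 1.0

def mse_of_estimate(n, event_frac, reps=200):
    errs = []
    for _ in range(reps):
        x = rng.binomial(1, 0.5, size=n)            # binary covariate
        # choose the intercept so the event fraction is roughly event_frac
        b0 = np.log(event_frac / (1 - event_frac)) - TRUE_BETA * 0.5
        p = 1 / (1 + np.exp(-(b0 + TRUE_BETA * x)))
        y = rng.binomial(1, p)
        if y.min() == y.max() or len(np.unique(x[y == 1])) < 2:
            continue  # skip non-identifiable replications
        try:
            fit = sm.Logit(y, sm.add_constant(x)).fit(disp=0)
        except Exception:
            continue  # perfect separation can still make the fit fail
        errs.append((fit.params[1] - TRUE_BETA) ** 2)
    return float(np.mean(errs)) if errs else float("nan")

for ir in (0.01, 0.05, 0.10, 0.30):
    print(f"IR={ir:.0%}, n=500: MSE = {mse_of_estimate(500, ir):.3f}")
```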


2020 ◽  
Author(s):  
Evangelia Christodoulou ◽  
Maarten van Smeden ◽  
Michael Edlinger ◽  
Dirk Timmerman ◽  
Maria Wanitschek ◽  
...  

Abstract Background: We suggest an adaptive sample size calculation method for developing clinical prediction models, in which model performance is monitored sequentially as new data come in. Methods: We illustrate the approach using data for the diagnosis of ovarian cancer (n=5914, 33% event fraction) and obstructive coronary artery disease (CAD; n=4888, 44% event fraction). We used logistic regression to develop a prediction model consisting only of a-priori selected predictors and assumed linear relations for continuous predictors. We mimicked prospective patient recruitment by developing the model on 100 randomly selected patients, and we used bootstrapping to internally validate the model. We sequentially added 50 random new patients until we reached a sample size of 3000, and re-estimated model performance at each step. We examined the required sample size for satisfying the following stopping rule: obtaining a calibration slope ≥0.9 and optimism in the c-statistic (ΔAUC) ≤0.02 at two consecutive sample sizes. This procedure was repeated 500 times. We also investigated the impact of alternative modeling strategies: modeling nonlinear relations for continuous predictors, and applying Firth’s bias correction. Results: Better discrimination was achieved in the ovarian cancer data (c-statistic 0.9 with 7 predictors) than in the CAD data (c-statistic 0.7 with 11 predictors). Adequate calibration and limited optimism in discrimination were achieved after a median of 450 patients (interquartile range 450-500) for the ovarian cancer data (22 events per parameter (EPP), 20-24), and 750 patients (700-800) for the CAD data (30 EPP, 28-33). A stricter criterion, requiring ΔAUC ≤0.01, was met with a median of 500 (23 EPP) and 1350 (54 EPP) patients, respectively. These sample sizes were much higher than the well-known 10 EPP rule of thumb and slightly higher than a recently published fixed sample size calculation method by Riley et al. Higher sample sizes were required when nonlinear relationships were modeled, and lower sample sizes when Firth’s correction was used. Conclusions: Adaptive sample size determination can be a useful supplement to a priori sample size calculations, because it allows the sample size to be further tailored to the specific prediction modeling context in a dynamic fashion.
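A condensed sketch of the sequential procedure follows, on synthetic data: patients accrue in steps of 50, bootstrap internal validation estimates the calibration slope and c-statistic optimism, and recruitment stops once slope ≥0.9 and ΔAUC ≤0.02 hold at two consecutive sample sizes. This simplifies the authors' setup (e.g. fewer bootstrap replicates, no real recruitment) and is not their code.

```python
# Sketch of an adaptive stopping rule with bootstrap internal validation.
import numpy as np
import statsmodels.api as sm
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(3)

def draw(n, p=7):
    X = rng.normal(size=(n, p))
    y = rng.binomial(1, 1 / (1 + np.exp(-(-1.0 + 0.6 * X[:, :3].sum(1)))))
    return X, y

def validate(X, y, n_boot=50):
    """Bootstrap estimates of the calibration slope and c-statistic optimism."""
    Xc = sm.add_constant(X)
    slopes, optimism = [], []
    for _ in range(n_boot):
        idx = rng.integers(0, len(y), len(y))
        if y[idx].min() == y[idx].max():
            continue
        try:
            b = sm.Logit(y[idx], Xc[idx]).fit(disp=0).params
        except Exception:
            continue  # e.g. separation in a small bootstrap sample
        lp = Xc @ b                     # bootstrap model applied to original data
        slopes.append(sm.Logit(y, sm.add_constant(lp)).fit(disp=0).params[1])
        optimism.append(roc_auc_score(y[idx], Xc[idx] @ b) - roc_auc_score(y, lp))
    return np.mean(slopes), np.mean(optimism)

X_all, y_all = draw(3000)               # stand-in for the full recruitment pool
ok_streak, n = 0, 100
while n <= 3000 and ok_streak < 2:      # stop after two consecutive passes
    slope, d_auc = validate(X_all[:n], y_all[:n])
    ok_streak = ok_streak + 1 if (slope >= 0.9 and d_auc <= 0.02) else 0
    n += 50
print("recruitment stopped at n =", n - 50)
```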


2020 ◽  
Vol 29 (11) ◽  
pp. 3166-3178 ◽  
Author(s):  
Ben Van Calster ◽  
Maarten van Smeden ◽  
Bavo De Cock ◽  
Ewout W Steyerberg

When developing risk prediction models on datasets with limited sample size, shrinkage methods are recommended. Earlier studies showed that shrinkage results in better predictive performance on average. This simulation study aimed to investigate the variability of regression shrinkage on predictive performance for a binary outcome. We compared standard maximum likelihood with the following shrinkage methods: uniform shrinkage (likelihood-based and bootstrap-based), penalized maximum likelihood (ridge) methods, LASSO logistic regression, adaptive LASSO, and Firth’s correction. In the simulation study, we varied the number of predictors and their strength, the correlation between predictors, the event rate of the outcome, and the events per variable. In terms of results, we focused on the calibration slope. The slope indicates whether risk predictions are too extreme (slope < 1) or not extreme enough (slope > 1). The results can be summarized into three main findings. First, shrinkage improved calibration slopes on average. Second, the between-sample variability of calibration slopes was often increased relative to maximum likelihood. In contrast to other shrinkage approaches, Firth’s correction had a small shrinkage effect but showed low variability. Third, the correlation between the estimated shrinkage and the optimal shrinkage to remove overfitting was typically negative, with Firth’s correction as the exception. We conclude that, despite improved performance on average, shrinkage often worked poorly in individual datasets, in particular when it was most needed. The results imply that shrinkage methods do not solve problems associated with small sample size or low number of events per variable.
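To illustrate the central contrast, here is a brief sketch comparing maximum likelihood with ridge logistic regression on the mean and between-sample variability of the out-of-sample calibration slope. The penalty strength and data-generating settings are arbitrary assumptions, and sklearn ≥1.2 is assumed for `penalty=None`.

```python
# Sketch: between-sample variability of the calibration slope, ML vs. ridge.
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(11)

def draw(n, p=10, b=0.4):
    X = rng.normal(size=(n, p))
    y = rng.binomial(1, 1 / (1 + np.exp(-(-1.2 + b * X.sum(1)))))
    return X, y

X_val, y_val = draw(20_000)                   # large validation set

def cal_slope(lp):
    return sm.Logit(y_val, sm.add_constant(lp)).fit(disp=0).params[1]

ml_slopes, ridge_slopes = [], []
for _ in range(100):                          # 100 small development samples
    X, y = draw(150)
    ml = LogisticRegression(penalty=None, max_iter=1000).fit(X, y)
    ridge = LogisticRegression(penalty="l2", C=0.1, max_iter=1000).fit(X, y)
    ml_slopes.append(cal_slope(X_val @ ml.coef_[0] + ml.intercept_[0]))
    ridge_slopes.append(cal_slope(X_val @ ridge.coef_[0] + ridge.intercept_[0]))

for name, s in (("ML", ml_slopes), ("ridge", ridge_slopes)):
    print(f"{name}: mean slope {np.mean(s):.2f}, SD {np.std(s):.2f}")
```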


2018 ◽  
Vol 5 (suppl_1) ◽  
pp. S351-S352
Author(s):  
Thomas P Lodise Jr. ◽  
Nicole G Bonine ◽  
J Michael Ye ◽  
Henry J Folse ◽  
Patrick Gillard

Abstract Background Identification of infections caused by antimicrobial-resistant microorganisms is critical to administration of early appropriate antibiotic therapy. We developed a clinical bedside tool to estimate the probability of carbapenem-resistant Enterobacteriaceae (CRE), extended spectrum β-lactamase-producing Enterobacteriaceae (ESBL), and multidrug-resistant Pseudomonas aeruginosa (MDRP) among hospitalized adult patients with Gram-negative infections. Methods A retrospective observational study of the Premier Hospital Database (PHD) was conducted. The study included adult hospitalized patients with complicated urinary tract infection (cUTI), complicated intraabdominal infection (cIAI), bloodstream infections (BSI), or hospital-acquired/ventilator-associated pneumonia (HAP/VAP) with a culture-confirmed Gram-negative infection in PHD from 2011 to 2015. Model development steps are shown in Figure 1. The study population was split into training and test cohorts. Prediction models were developed using logistic regression in the training cohort (Figure 1). For each resistant phenotype (CRE, ESBL, and MDRP), a separate model was developed for community-acquired (index culture ≤3 days of admission) and hospital-acquired (index culture >3 days of admission) infections (six models in total). The predictive performance of the models was assessed in the training and test cohorts. Models were converted to a single user-friendly interface for use at the bedside. Results The most important predictors of antibiotic-resistant Gram-negative bacterial infection were prior number of antibiotics, infection site, prior infection in the last 3 months, hospital prevalence of each resistant pathogen (CRE, ESBL, and MDRP), and age (Figure 2). The predictive performance was highly acceptable for all six models (Figure 3). Conclusion We developed a clinical prediction tool to estimate the probability of CRE, ESBL, and MDRP among hospitalized adult patients with community- and hospital-acquired Gram-negative infections. Our predictive model has been implemented as a user-friendly bedside tool for use by clinicians to predict the probability of resistant infections in individual patients, to guide early appropriate therapy. Disclosures T. P. Lodise Jr., Motif BioSciences: Board Member, Consulting fee. N. G. Bonine, Allergan: Employee, Salary. J. M. Ye, Allergan: Employee, Salary. H. J. Folse, Evidera: Employee, Salary. P. Gillard, Allergan: Employee, Salary.
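The published models and coefficients are not reproduced here; the sketch below only illustrates, with invented (hypothetical) coefficient names and values, how a fitted logistic model is turned into a bedside probability calculator.

```python
# Hypothetical sketch: from logistic coefficients to a bedside risk probability.
import math

# (hypothetical) log-odds coefficients for one resistant phenotype
INTERCEPT = -4.0
COEFS = {
    "prior_antibiotics_count": 0.35,   # per prior antibiotic
    "prior_infection_3mo": 0.80,       # 1 if infection in the last 3 months
    "hospital_prevalence_pct": 0.06,   # per percentage point of CRE prevalence
    "age_per_decade": 0.10,
}

def predicted_probability(patient: dict) -> float:
    """Convert patient covariates to a risk probability via the logistic link."""
    log_odds = INTERCEPT + sum(COEFS[k] * patient[k] for k in COEFS)
    return 1 / (1 + math.exp(-log_odds))

risk = predicted_probability({
    "prior_antibiotics_count": 3,
    "prior_infection_3mo": 1,
    "hospital_prevalence_pct": 5,
    "age_per_decade": 7,
})
print(f"estimated probability of resistant infection: {risk:.1%}")
```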


2021 ◽  
Author(s):  
Leighann Ashlock ◽  
Peter D. Soyster ◽  
Aaron Jason Fisher

The specific factors driving alcohol-related behavior and cognition likely vary from person to person. Many theories suggest emotions are pertinent to alcohol use. Emotions and how they change over time may provide an opportunity for more precise prediction of alcohol consumption. The present study applied statistical classification methods to idiographic time series of emotions and emotion dynamics, collected via ecological momentary assessment (EMA), in order to identify person-specific and between-subjects predictors of future drinking-relevant behavior, affect, and cognition (N = 33). Participants were sent eight mobile phone surveys per day for 15 days. Each survey assessed the number of drinks consumed since the previous survey, as well as emotions, alcohol craving, and the desire to drink. Each participant’s EMA data were prepared for analysis separately. To estimate emotion dynamics, we utilized the Generalized Local Linear Approximation. The data collected from each individual were split into training and testing sets for out-of-sample, person-specific validation. Elastic net regularization was used to select a subset of emotion and emotion dynamic variables to be used in models that predicted alcohol consumption, craving, or wanting to drink roughly two hours in the future. To compare predictive performance, we tested both person-specific and between-subjects prediction models. Averaging across participants, out-of-sample predictions of future drinking using idiographic models were 69% accurate. For craving, the mean out-of-sample R² value was .13; for wanting to drink, it was .16. Idiographic prediction models exceeded nomothetic models in prediction accuracy. Using person-specific emotions and emotion dynamics can help predict future drinking behaviors.
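The study's code is not reproduced here; the following is a minimal sketch of Generalized Local Linear Approximation (GLLA), the derivative-estimation step named above, with an illustrative embedding dimension and synthetic ratings.

```python
# Sketch of GLLA: estimating derivatives of an EMA time series.
import math
import numpy as np

def glla(x, embed=5, tau=1, dt=1.0, order=2):
    """Estimate derivatives of a time series via GLLA.

    Returns one row per embedded window and one column per derivative
    order (0th = smoothed level, 1st = velocity, 2nd = acceleration, ...).
    """
    n = len(x) - (embed - 1) * tau
    # time-delay embedding matrix: each row is a window of the series
    E = np.column_stack([x[i * tau : i * tau + n] for i in range(embed)])
    # loading matrix: polynomial in time offsets, scaled by factorials
    offsets = (np.arange(embed) - (embed - 1) / 2) * tau * dt
    L = np.column_stack([offsets ** k / math.factorial(k)
                         for k in range(order + 1)])
    W = L @ np.linalg.inv(L.T @ L)   # least-squares weights: windows -> derivatives
    return E @ W

# e.g. repeated negative-affect ratings; column 1 is the rate of change
ratings = np.array([3, 4, 6, 7, 7, 5, 4, 4, 6, 8, 9, 7], dtype=float)
derivs = glla(ratings)
print(derivs[:, 1])                  # first-derivative ("velocity") estimates
```

The level and derivative columns produced this way are the kind of features that can then be fed into an elastic net model per participant.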


2016 ◽  
Vol 27 (2) ◽  
pp. 507-520 ◽  
Author(s):  
Ying Wu ◽  
Yiqun Li ◽  
Yan Hou ◽  
Kang Li ◽  
Xiaohua Zhou

Study planning is particularly complex for survival trials because it usually involves an accrual period and a continued observation period after accrual closure. The three-arm clinical trial design, which includes a test treatment, an active reference, and a placebo control, is the gold standard design for the assessment of non-inferiority. The existing statistical methods of calculating minimal sample size for non-inferiority trials with three-arm design and survival-type endpoints cannot take into consideration the accrual rate of patients to the trial, the length of accrual period, the length of continued observation period after accrual closure, and unbalanced allocation of the total sample size. The purpose of this paper is to develop a statistical method, which allows for all these sources of variability for planning non-inferiority trials with the gold standard design for censored, exponentially distributed time-to-event data. The proposed method is based on the assumption of exponentially distributed failure times and a non-inferiority test formulated in terms of the retention of effect hypotheses. It can be used to calculate the duration of accrual required to assure a desired power for non-inferiority trials with active and placebo control. We illustrate the use of the method by considering a randomized, active- and placebo-controlled trial in depression associated with Parkinson’s disease. We then explore the validity of the proposed method by simulation studies. An R-language program for the implementation of the proposed algorithm is provided as supplementary material.
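The paper's full retention-of-effect method is beyond a short example, but the sketch below shows a standard building block such designs rely on: under uniform accrual over `a` years, `f` additional years of follow-up after accrual closure, and exponential failure times with rate λ, the probability that a patient's event is observed by study end is P = 1 − (e^(−λf) − e^(−λ(a+f)))/(λa). The values used are illustrative.

```python
# Sketch: probability of observing an event under uniform accrual and
# exponential failure times.
import math

def prob_event(lam: float, a: float, f: float) -> float:
    """Probability of observing an event by study end (uniform accrual)."""
    return 1 - (math.exp(-lam * f) - math.exp(-lam * (a + f))) / (lam * a)

# e.g. hazard 0.3/year, 2-year accrual, 1 additional year of follow-up:
p = prob_event(lam=0.3, a=2.0, f=1.0)
print(f"P(event observed) = {p:.3f}")
# Dividing a required number of events by this probability converts an
# events-based sample size into the number of patients to accrue.
```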


Author(s):  
ZHENMIN CHEN ◽  
FANG ZHAO

Survey analysis methods are widely used in many areas such as social studies, marketing research, economics, public health, clinical trials and transportation data analysis. Minimum sample size determination is always needed before a survey is conducted, to avoid excessive cost. Some statistical methods can be found in the literature for finding the minimum required sample size. This paper proposes a method for finding the minimum total sample size needed for a survey when the population is divided into cells. The proposed method can be used for both the infinite population case and the finite population case. A computer program is needed to carry out the sample size calculation; the authors used SAS/IML, the Interactive Matrix Language (IML) procedure of the Statistical Analysis System (SAS) software.
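The paper's algorithm is not reproduced here; as a point of comparison, the sketch below implements the textbook per-cell calculation for estimating a proportion, with an optional finite population correction. All inputs are illustrative.

```python
# Sketch: minimum sample size per cell for estimating a proportion.
import math

def min_cell_size(p=0.5, margin=0.05, z=1.96, N=None):
    """Minimum n to estimate a cell proportion within `margin` at ~95% level.

    p      : anticipated proportion (0.5 is the conservative worst case)
    margin : desired half-width of the confidence interval
    z      : normal quantile for the confidence level
    N      : cell population size; None means an infinite population
    """
    n = (z ** 2) * p * (1 - p) / margin ** 2
    if N is not None:                     # finite population correction
        n = n / (1 + (n - 1) / N)
    return math.ceil(n)

print(min_cell_size())                    # infinite population: 385
print(min_cell_size(N=2000))              # finite cell of 2000: smaller n
# Summing the per-cell sizes gives a simple upper bound on the total sample.
```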

