Sample Size Analysis for Machine Learning Clinical Validation Studies
ABSTRACTOBJECTIVEBefore integrating new machine learning (ML) into clinical practice, algorithms must undergo validation. Validation studies require sample size estimates. Unlike hypothesis testing studies seeking a p-value, the goal of validating predictive models is obtaining estimates of model performance. Our aim was to provide a standardized, data distribution- and model-agnostic approach to sample size calculations for validation studies of predictive ML models.MATERIALS AND METHODSSample Size Analysis for Machine Learning (SSAML) was tested in three previously published models: brain age to predict mortality (Cox Proportional Hazard), COVID hospitalization risk prediction (ordinal regression), and seizure risk forecasting (deep learning). The SSAML steps are: 1) Specify performance metrics for model discrimination and calibration. For discrimination, we use area under the receiver operating curve (AUC) for classification and Harrell’s C-statistic for survival models. For calibration, we employ calibration slope and calibration-in-the-large. 2) Specify the required precision and accuracy (≤0.5 normalized confidence interval width and ±5% accuracy). 3) Specify the required coverage probability (95%). 4) For increasing sample sizes, calculate the expected precision and bias that is achievable. 5) Choose the minimum sample size that meets all requirements.RESULTSMinimum sample sizes were obtained in each dataset using standardized criteria.DISCUSSIONSSAML provides a formal expectation of precision and accuracy at a desired confidence level.CONCLUSIONSSAML is open-source and agnostics to data type and ML model. It can be used for clinical validation studies of ML models.