Sample Size Analysis for Machine Learning Clinical Validation Studies

Author(s):  
Daniel M. Goldenholz ◽  
Haoqi Sun ◽  
Wolfgang Ganglberger ◽  
M. Brandon Westover

OBJECTIVE: Before integrating new machine learning (ML) into clinical practice, algorithms must undergo validation. Validation studies require sample size estimates. Unlike hypothesis-testing studies seeking a p-value, the goal of validating predictive models is to obtain estimates of model performance. Our aim was to provide a standardized, data distribution- and model-agnostic approach to sample size calculations for validation studies of predictive ML models.
MATERIALS AND METHODS: Sample Size Analysis for Machine Learning (SSAML) was tested in three previously published models: brain age to predict mortality (Cox proportional hazards), COVID hospitalization risk prediction (ordinal regression), and seizure risk forecasting (deep learning). The SSAML steps are: 1) Specify performance metrics for model discrimination and calibration. For discrimination, we use the area under the receiver operating characteristic curve (AUC) for classification and Harrell’s C-statistic for survival models. For calibration, we employ calibration slope and calibration-in-the-large. 2) Specify the required precision and accuracy (≤0.5 normalized confidence interval width and ±5% accuracy). 3) Specify the required coverage probability (95%). 4) For increasing sample sizes, calculate the expected precision and bias that are achievable. 5) Choose the minimum sample size that meets all requirements.
RESULTS: Minimum sample sizes were obtained in each dataset using standardized criteria.
DISCUSSION: SSAML provides a formal expectation of precision and accuracy at a desired confidence level.
CONCLUSIONS: SSAML is open-source and agnostic to data type and ML model. It can be used for clinical validation studies of ML models.
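To make the SSAML criteria concrete, here is a minimal, hypothetical sketch (not the authors' released code): a trained binary classifier is evaluated on resampled validation sets of increasing size, and the smallest size that meets the precision (≤0.5 normalized confidence interval width), accuracy (±5%), and coverage (95%) requirements is reported. The classifier, synthetic data, and bootstrap settings are placeholders.

```python
# Hypothetical SSAML-style search: assumes a trained binary classifier and a
# large held-out pool from which candidate validation sets are resampled.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=6000, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X[:2000], y[:2000])
X_pool, y_pool = X[2000:], y[2000:]
score_pool = model.predict_proba(X_pool)[:, 1]
auc_ref = roc_auc_score(y_pool, score_pool)         # reference performance on the full pool

def meets_requirements(n, n_rep=100, n_boot=100):
    """Simulate validation studies of size n and check precision, bias, and coverage."""
    widths, aucs, covered = [], [], 0
    for _ in range(n_rep):
        idx = rng.choice(len(y_pool), size=n, replace=True)
        aucs.append(roc_auc_score(y_pool[idx], score_pool[idx]))
        # bootstrap CI that a single validation study of size n would report
        boot = [roc_auc_score(y_pool[b], score_pool[b])
                for b in (rng.choice(idx, size=n) for _ in range(n_boot))]
        lo, hi = np.percentile(boot, [2.5, 97.5])
        widths.append((hi - lo) / auc_ref)              # normalized CI width
        covered += lo <= auc_ref <= hi
    bias = abs(np.mean(aucs) - auc_ref) / auc_ref
    return np.mean(widths) <= 0.5 and bias <= 0.05 and covered / n_rep >= 0.95

for n in (50, 100, 200, 400, 800):                      # candidate validation sample sizes
    if meets_requirements(n):
        print(f"smallest candidate meeting all criteria: n = {n}")
        break
else:
    print("no candidate met all criteria; extend the candidate range")
```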

2021 ◽  
Vol 13 (3) ◽  
pp. 368
Author(s):  
Christopher A. Ramezan ◽  
Timothy A. Warner ◽  
Aaron E. Maxwell ◽  
Bradley S. Price

The size of the training data set is a major determinant of classification accuracy. Nevertheless, the collection of a large training data set for supervised classifiers can be a challenge, especially for studies covering a large area, which may be typical of many real-world applied projects. This work investigates how variations in training set size, ranging from a large sample size (n = 10,000) to a very small sample size (n = 40), affect the performance of six supervised machine-learning algorithms applied to classify large-area high-spatial-resolution (HR) (1–5 m) remotely sensed data within the context of a geographic object-based image analysis (GEOBIA) approach. GEOBIA, in which adjacent similar pixels are grouped into image-objects that form the unit of the classification, offers the potential benefit of allowing multiple additional variables, such as measures of object geometry and texture, thus increasing the dimensionality of the classification input data. The six supervised machine-learning algorithms are support vector machines (SVM), random forests (RF), k-nearest neighbors (k-NN), single-layer perceptron neural networks (NEU), learning vector quantization (LVQ), and gradient-boosted trees (GBM). RF, the algorithm with the highest overall accuracy, was notable for its negligible decrease in overall accuracy, 1.0%, when training sample size decreased from 10,000 to 315 samples. GBM provided similar overall accuracy to RF; however, the algorithm was very expensive in terms of training time and computational resources, especially with large training sets. In contrast to RF and GBM, NEU and SVM were particularly sensitive to decreasing sample size, with NEU classifications generally producing overall accuracies that were on average slightly higher than SVM classifications for larger sample sizes, but lower than SVM for the smallest sample sizes. NEU, however, required a longer processing time. The k-NN classifier saw less of a drop in overall accuracy than NEU and SVM as training set size decreased; however, the overall accuracies of k-NN were typically lower than those of the RF, NEU, and SVM classifiers. LVQ generally had the lowest overall accuracy of all six methods, but was relatively insensitive to sample size, down to the smallest sample sizes. Overall, due to its relatively high accuracy with small training sample sets, minimal variation in overall accuracy between very large and small sample sets, and relatively short processing time, RF was a good classifier for large-area land-cover classifications of HR remotely sensed data, especially when training data are scarce. However, as the performance of different supervised classifiers varies in response to training set size, investigating multiple classification algorithms is recommended to achieve optimal accuracy for a project.
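The core experimental loop, retraining each classifier on progressively smaller training samples and scoring overall accuracy on a fixed test set, can be sketched as below. This is only an illustrative stand-in: the data are synthetic tabular features rather than GEOBIA image-object attributes, and only RF and SVM are shown.

```python
# Sketch of a training-set-size experiment on synthetic data (assumed setup).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=12000, n_features=20, n_informative=10,
                           n_classes=5, n_clusters_per_class=1, random_state=1)
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=2000, stratify=y, random_state=1)

classifiers = {"RF": RandomForestClassifier(n_estimators=100, random_state=1),
               "SVM": SVC(kernel="rbf", gamma="scale", random_state=1)}

for n in (10000, 2500, 630, 315, 80, 40):               # shrinking training-set sizes
    accs = {}
    for name, clf in classifiers.items():
        clf.fit(X_train_full[:n], y_train_full[:n])
        accs[name] = accuracy_score(y_test, clf.predict(X_test))
    print(n, {name: round(a, 3) for name, a in accs.items()})
```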


Sensors ◽  
2018 ◽  
Vol 18 (11) ◽  
pp. 3777 ◽  
Author(s):  
Ataollah Shirzadi ◽  
Karim Soliamani ◽  
Mahmood Habibnejhad ◽  
Ataollah Kavian ◽  
Kamran Chapi ◽  
...  

The main objective of this research was to introduce novel machine learning ensembles of the alternating decision tree (ADTree) algorithm with the multiboost (MB), bagging (BA), rotation forest (RF) and random subspace (RS) ensemble algorithms, under two scenarios of different sample sizes and raster resolutions, for spatial prediction of shallow landslides around Bijar City, Kurdistan Province, Iran. The modeling process was evaluated using several statistical measures and the area under the receiver operating characteristic curve (AUROC). Results show that the RS model obtained high goodness-of-fit and prediction accuracy for the 60%/40% and 70%/30% sample sizes at a raster resolution of 10 m, while the MB model did so for the 80%/20% and 90%/10% sample sizes at a raster resolution of 20 m. The RS-ADTree and MB-ADTree ensemble models outperformed the ADTree model in both scenarios. Overall, MB-ADTree with a sample size of 80%/20% and a resolution of 20 m (area under the curve (AUC) = 0.942) and with a sample size of 60%/40% and a resolution of 10 m (AUC = 0.845) had the highest and lowest prediction accuracy, respectively. The findings confirm that the newly proposed models are very promising alternative tools to assist planners and decision makers in the task of managing landslide-prone areas.
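The evaluation pattern, training a tree ensemble on each train/test ratio and scoring it with AUROC, can be sketched as follows. scikit-learn has no ADTree implementation, so an ordinary decision tree inside a bagging ensemble stands in for the BA-ADTree combination, and synthetic features replace the landslide conditioning factors.

```python
# Illustrative stand-in: bagged decision trees scored by AUROC over the
# study's four train/test ratios; not the authors' ADTree-based models.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=12, random_state=2)

for train_frac in (0.6, 0.7, 0.8, 0.9):                  # 60/40 ... 90/10 splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=train_frac, stratify=y, random_state=2)
    ensemble = BaggingClassifier(DecisionTreeClassifier(max_depth=4, random_state=2),
                                 n_estimators=50, random_state=2).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, ensemble.predict_proba(X_te)[:, 1])
    print(f"train {round(train_frac * 100)}% / test {round((1 - train_frac) * 100)}%: "
          f"AUROC = {auc:.3f}")
```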


2020 ◽  
Vol 14 (1) ◽  
pp. 5
Author(s):  
Adam Adli ◽  
Pascal Tyrrell

Introduction: Advances in computers have allowed for the practical application of increasingly advanced machine learning models to aid healthcare providers with diagnosis and inspection of medical images. Often, a lack of training data and computation time can be a limiting factor in the development of an accurate machine learning model in the domain of medical imaging. As a possible solution, this study investigated whether L2 regularization moderates the overfitting that occurs as a result of small training sample sizes.
Methods: This study employed transfer learning experiments on a dental x-ray binary classification model to explore L2 regularization with respect to training sample size in five common convolutional neural network architectures. Model testing performance was investigated, and technical implementation details, including computation times, hardware considerations, performance factors, and practical feasibility, were described.
Results: The experimental results showed a trend that smaller training sample sizes benefitted more from regularization than larger training sample sizes. Further, the results showed that applying L2 regularization did not add significant computational overhead and that the extra rounds of training required by L2 regularization were feasible when training sample sizes were relatively small.
Conclusion: Overall, this study found that there is a window of opportunity in which the benefits of employing regularization can be most cost-effective relative to training sample size. It is recommended that training sample size be carefully considered when forming expectations of achievable generalizability improvements that result from investing computational resources into model regularization.
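In PyTorch-style training, L2 regularization is commonly applied through the optimizer's weight_decay term; the hypothetical sketch below (not the study's code) contrasts an unregularized and an L2-regularized fine-tuning setup, with random tensors standing in for a small dental x-ray batch and placeholder hyperparameters.

```python
# Hedged sketch of L2-regularized transfer learning; all settings are assumptions.
import torch
import torch.nn as nn
from torchvision import models

def build_model(weight_decay=0.0, lr=1e-4):
    # weights="IMAGENET1K_V1" would load a pretrained backbone for real transfer
    # learning; None keeps this sketch runnable offline.
    model = models.resnet18(weights=None)
    model.fc = nn.Linear(model.fc.in_features, 2)            # binary classification head
    optimizer = torch.optim.Adam(model.parameters(), lr=lr,
                                 weight_decay=weight_decay)   # L2 penalty on the weights
    return model, optimizer

criterion = nn.CrossEntropyLoss()

def train_step(model, optimizer, images, labels):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# one illustrative step each for an unregularized and an L2-regularized model
images, labels = torch.randn(8, 3, 224, 224), torch.randint(0, 2, (8,))
for wd in (0.0, 1e-4):
    model, optimizer = build_model(weight_decay=wd)
    print(f"weight_decay={wd}: loss={train_step(model, optimizer, images, labels):.3f}")
```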


Author(s):  
Zahra Hosseini ◽  
Hooman Latifi ◽  
Hamed Naghavi ◽  
Siavash Bakhtiarvand Bakhtiari ◽  
Fabian Ewald Fassnacht

Regular biomass estimations for natural and plantation forests are important to support sustainable forestry and to calculate carbon-related statistics. The application of remote sensing data to estimate forest biomass has been amply demonstrated, but there is still room for increasing the efficiency of current approaches. Here, we investigated the influence of field plot size and sample size on the accuracy of random forest models trained with information derived from Pléiades very high resolution (VHR) stereo images applied to plantation forests in an arid environment. We collected field data at 311 locations with three different plot sizes (100, 300 and 500 m2). In two experiments, we demonstrate how plot and sample sizes influence the accuracy of biomass estimation models. In the first experiment, we compared model accuracies obtained with varying plot sizes but a constant number of samples. In the second experiment, we fixed the total area to be sampled to account for the additional effort of collecting large field plots. Our results for the first experiment show that the model performance metrics Spearman’s r, RMSErel and RMSEnor improve from 0.61, 0.70 and 0.36 at a sample size of 24 to 0.79, 0.51 and 0.15 at a sample size of 192, respectively. In the second experiment, the highest accuracies were obtained with a plot size of 100 m2 (most samples), with Spearman’s r = 0.77, RMSErel = 0.59 and RMSEnor = 0.15. Results from a type-II analysis of variance suggest that the most important factor explaining model performance for our biomass models is sample size. Our results suggest no clear advantage for any plot size in reaching accurate biomass estimates using VHR stereo imagery in plantations. This is an important finding, which partly contradicts the suggestions of earlier studies but requires validation for other forest types and remote sensing data types (e.g. LiDAR).
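The headline metrics (Spearman's r and a relative RMSE) for models trained on subsamples of increasing size can be sketched as below; the synthetic regression data are placeholders for the Pléiades-derived predictors and field-measured biomass, and normalizing RMSE by the mean observed value is one plausible definition of RMSErel assumed here.

```python
# Illustrative sample-size sweep for a random forest biomass regressor (assumed setup).
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=8, noise=20.0, random_state=3)
y = y - y.min() + 10.0                                   # keep the "biomass" target positive
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=100, random_state=3)

for n in (24, 48, 96, 192):                              # sample sizes used in the study
    rf = RandomForestRegressor(n_estimators=300, random_state=3)
    rf.fit(X_pool[:n], y_pool[:n])
    pred = rf.predict(X_test)
    rmse = np.sqrt(np.mean((pred - y_test) ** 2))
    rmse_rel = rmse / np.mean(y_test)                    # RMSE relative to the observed mean
    r, _ = spearmanr(y_test, pred)
    print(f"n={n:>3}  Spearman r={r:.2f}  relative RMSE={rmse_rel:.2f}")
```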


2019 ◽  
Author(s):  
Marc-Andre Schulz ◽  
B.T. Thomas Yeo ◽  
Joshua T. Vogelstein ◽  
Janaina Mourao-Miranada ◽  
Jakob N. Kather ◽  
...  

In recent years, deep learning has unlocked unprecedented success in various domains, especially in image, text, and speech processing. These breakthroughs may hold promise for neuroscience, especially for brain-imaging investigators who are beginning to analyze thousands of participants. However, deep learning is only beneficial if the data have nonlinear relationships and if these are exploitable at currently available sample sizes. We systematically profiled the performance of deep models, kernel models, and linear models as a function of sample size on UK Biobank brain images against established machine learning references. On MNIST and Zalando Fashion, prediction accuracy consistently improved when escalating from linear models to shallow nonlinear models, and improved further when switching to deep nonlinear models. The more observations were available for model training, the greater the performance gain we saw. In contrast, using structural or functional brain scans, simple linear models performed on par with more complex, highly parameterized models in age/sex prediction across increasing sample sizes. In fact, linear models kept improving as the sample size approached ∼10,000 participants. Our results indicate that the increase in performance of linear models with additional data does not saturate at the limit of current feasibility. Yet, nonlinearities of common brain scans remain largely inaccessible to both kernel and deep learning methods at any examined scale.
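The learning-curve profiling can be sketched in a few lines: fit a linear model, a kernel model, and a small nonlinear network on growing training samples and score them on a fixed test set. The digits dataset below is only a tiny stand-in for MNIST/Fashion or brain-imaging features.

```python
# Sketch of a linear vs. kernel vs. nonlinear learning-curve comparison (assumed setup).
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=500, stratify=y, random_state=4)

models = {"linear": LogisticRegression(max_iter=2000),
          "kernel": SVC(kernel="rbf", gamma="scale"),
          "nonlinear": MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=1000,
                                     random_state=4)}

for n in (100, 300, 600, len(y_tr)):                     # growing training sample sizes
    scores = {name: m.fit(X_tr[:n], y_tr[:n]).score(X_te, y_te)
              for name, m in models.items()}
    print(n, {name: round(s, 3) for name, s in scores.items()})
```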


2021 ◽  
Author(s):  
Mihai Alexandru Constantin ◽  
Noémi Katalin Schuurman ◽  
Jeroen Vermunt

We introduce a general method for sample size computations in the context of cross-sectional network models. The method takes the form of an automated Monte Carlo algorithm, designed to find an optimal sample size while iteratively concentrating the computations on the sample sizes that seem most relevant. The method requires three inputs: 1) a hypothesized network structure or desired characteristics of that structure, 2) an estimation performance measure and its corresponding target value (e.g., a sensitivity of 0.6), and 3) a statistic and its corresponding target value that determine how the target value for the performance measure should be reached (e.g., reaching a sensitivity of 0.6 with a probability of 0.8). The method consists of a Monte Carlo simulation step for computing the performance measure and the statistic for several sample sizes selected from an initial candidate sample size range, a curve-fitting step for interpolating the statistic across the entire candidate range, and a stratified bootstrapping step to quantify the uncertainty around the recommendation provided. We evaluated the performance of the method for the Gaussian Graphical Model, but it can easily be extended to other models. It displayed good performance, with the sample size recommendations being, on average, at most 1.14 sample sizes away from the truth, with a highest standard deviation of 26.25 sample sizes. The method is implemented in the form of an R package called powerly, available on GitHub and CRAN.
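The skeleton of the procedure, a Monte Carlo step over candidate sample sizes followed by a curve-fitting step and a thresholded recommendation, can be sketched as below. The reference implementation is the R package powerly; in this sketch the performance statistic is deliberately simplified to the probability of detecting a single true correlation, a placeholder for the sensitivity of a Gaussian Graphical Model estimator, and the bootstrap step for uncertainty quantification is omitted.

```python
# Simplified Monte Carlo sample-size search (placeholder performance measure).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

def simulate_statistic(n, n_rep=200, rho=0.3):
    """Monte Carlo step: probability that the target performance is reached at size n."""
    hits = 0
    for _ in range(n_rep):
        x = rng.normal(size=n)
        y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)
        r, p = stats.pearsonr(x, y)
        hits += p < 0.05                                  # "edge detected"
    return hits / n_rep

candidates = np.array([20, 40, 60, 80, 100, 140, 180])    # candidate sample size range
probs = np.array([simulate_statistic(n) for n in candidates])

# Curve-fitting step: interpolate the statistic across the whole candidate range.
grid = np.arange(candidates.min(), candidates.max() + 1)
fitted = np.clip(np.polyval(np.polyfit(candidates, probs, deg=3), grid), 0, 1)

# Recommendation: smallest size whose fitted statistic reaches the target (here 0.8).
target = 0.8
reachable = fitted >= target
recommended = int(grid[np.argmax(reachable)]) if reachable.any() else None
print("recommended sample size:", recommended)
```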


2020 ◽  
Vol 12 (4) ◽  
pp. 1525 ◽  
Author(s):  
Feifei Yang ◽  
David W. Wanik ◽  
Diego Cerrai ◽  
Md Abul Ehsan Bhuiyan ◽  
Emmanouil N. Anagnostou

A growing number of electricity utilities use machine learning-based outage prediction models (OPMs) to predict the impact of storms on their networks for sustainable management. The accuracy of OPM predictions is sensitive to the sample size and event severity representativeness of the training dataset, the extent of which has not yet been quantified. This study devised a randomized, out-of-sample validation experiment to quantify an OPM’s prediction uncertainty under different training sample sizes and levels of event severity representativeness. The study showed random error decreasing by more than 100% for sample sizes ranging from 10 to 80 extratropical events, and by 32% for sample sizes from 10 to 40 thunderstorms. This study quantified the minimum sample size needed for the OPM to attain acceptable prediction performance. The results demonstrated that conditioning the training of the OPM on a subset of events representative of the predicted event’s severity reduced the underestimation bias exhibited in high-impact events and the overestimation bias in low-impact ones. We used cross entropy (CE) to quantify the relatedness of the weather variable distributions between the training dataset and the forecasted event.
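The cross-entropy check described in the final sentence can be sketched with binned histograms of a single weather variable; the gamma-distributed "gust" samples below are synthetic placeholders for the training events and the forecasted event.

```python
# Hypothetical cross-entropy comparison of weather-variable distributions.
import numpy as np

rng = np.random.default_rng(6)
train_gusts = rng.gamma(shape=4.0, scale=5.0, size=2000)     # pooled training events
event_gusts = rng.gamma(shape=6.0, scale=5.0, size=300)      # forecasted event

bins = np.histogram_bin_edges(np.concatenate([train_gusts, event_gusts]), bins=30)
p, _ = np.histogram(event_gusts, bins=bins)
q, _ = np.histogram(train_gusts, bins=bins)
p = p.astype(float) + 1e-9                                   # smooth empty bins
q = q.astype(float) + 1e-9
p, q = p / p.sum(), q / q.sum()

cross_entropy = -np.sum(p * np.log(q))                       # H(p, q); lower means more related
print(f"cross entropy between event and training distributions: {cross_entropy:.3f}")
```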


Author(s):  
Mohammad Z. Bashar ◽  
Cristina Torres-Machi

Significant research efforts have documented the capabilities of machine learning (ML) algorithms to model pavement performance. Several challenges, however, limit the implementation of ML by practitioners and transportation agencies. One of these challenges is related to the high variability in the performance of ML models as reported by different studies and the lack of quantitative evidence supporting the true effectiveness of these techniques. The objective of this paper is twofold: to assess the overall performance of traditional and ML techniques used to predict pavement condition, and to provide guidance on the optimal architecture and minimum sample size required to develop these models. This paper analyzes three ML algorithms commonly used to predict the International Roughness Index (IRI)—Artificial Neural Network (ANN), Random Forest (RF), and Support Vector Machine (SVM)—and compares their performance to traditional techniques. An inverse variance heterogeneity based meta-analysis is performed on 20 studies conducted between 2001 and 2020. The results indicate that ML algorithms capture on average 15.6% more variability than traditional techniques. RF is the most accurate technique, with an overall performance value of 0.995. ANN is also identified as a highly effective technique that has been widely used and provides accurate predictions with both small and large sample sizes. For ANN algorithms, a single hidden layer with nodes equal to 0.3–2 times the number of input features is found to be sufficient in predicting pavement deterioration. A minimum sample size equal to 50 times the number of input variables is recommended for modeling pavement deterioration using ML.
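The two sizing rules reported above reduce to simple arithmetic; a short sketch with a placeholder feature count:

```python
# Rule-of-thumb sizing for an ANN pavement-deterioration model (hypothetical inputs).
n_features = 12                                    # e.g., age, traffic, climate, structure
hidden_nodes = (round(0.3 * n_features), 2 * n_features)   # 0.3-2 times the input count
min_sample_size = 50 * n_features                  # at least 50 observations per input

print(f"single hidden layer with {hidden_nodes[0]}-{hidden_nodes[1]} nodes")
print(f"minimum recommended sample size: {min_sample_size} observations")
```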

