Sample Size Analysis for Machine Learning Clinical Validation Studies

Author(s):  
Daniel M. Goldenholz ◽  
Haoqi Sun ◽  
Wolfgang Ganglberger ◽  
M. Brandon Westover

OBJECTIVE: Before integrating new machine learning (ML) into clinical practice, algorithms must undergo validation. Validation studies require sample size estimates. Unlike hypothesis-testing studies seeking a p-value, the goal of validating predictive models is to obtain estimates of model performance. Our aim was to provide a standardized, data distribution- and model-agnostic approach to sample size calculations for validation studies of predictive ML models.
MATERIALS AND METHODS: Sample Size Analysis for Machine Learning (SSAML) was tested in three previously published models: brain age to predict mortality (Cox proportional hazards), COVID hospitalization risk prediction (ordinal regression), and seizure risk forecasting (deep learning). The SSAML steps are: 1) Specify performance metrics for model discrimination and calibration. For discrimination, we use the area under the receiver operating characteristic curve (AUC) for classification and Harrell’s C-statistic for survival models. For calibration, we employ calibration slope and calibration-in-the-large. 2) Specify the required precision and accuracy (≤0.5 normalized confidence interval width and ±5% accuracy). 3) Specify the required coverage probability (95%). 4) For increasing sample sizes, calculate the expected precision and bias that are achievable. 5) Choose the minimum sample size that meets all requirements.
RESULTS: Minimum sample sizes were obtained in each dataset using standardized criteria.
DISCUSSION: SSAML provides a formal expectation of precision and accuracy at a desired confidence level.
CONCLUSIONS: SSAML is open-source and agnostic to data type and ML model. It can be used for clinical validation studies of ML models.
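To make the SSAML criteria concrete, here is a minimal, hypothetical sketch (not the authors' released code): a trained binary classifier is evaluated on resampled validation sets of increasing size, and the smallest size that meets the precision (≤0.5 normalized confidence interval width), accuracy (±5%), and coverage (95%) requirements is reported. The classifier, synthetic data, and bootstrap settings are placeholders.

```python
# Hypothetical SSAML-style search: assumes a trained binary classifier and a
# large held-out pool from which candidate validation sets are resampled.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=6000, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X[:2000], y[:2000])
X_pool, y_pool = X[2000:], y[2000:]
score_pool = model.predict_proba(X_pool)[:, 1]
auc_ref = roc_auc_score(y_pool, score_pool)         # reference performance on the full pool

def meets_requirements(n, n_rep=100, n_boot=100):
    """Simulate validation studies of size n and check precision, bias, and coverage."""
    widths, aucs, covered = [], [], 0
    for _ in range(n_rep):
        idx = rng.choice(len(y_pool), size=n, replace=True)
        aucs.append(roc_auc_score(y_pool[idx], score_pool[idx]))
        # bootstrap CI that a single validation study of size n would report
        boot = [roc_auc_score(y_pool[b], score_pool[b])
                for b in (rng.choice(idx, size=n) for _ in range(n_boot))]
        lo, hi = np.percentile(boot, [2.5, 97.5])
        widths.append((hi - lo) / auc_ref)              # normalized CI width
        covered += lo <= auc_ref <= hi
    bias = abs(np.mean(aucs) - auc_ref) / auc_ref
    return np.mean(widths) <= 0.5 and bias <= 0.05 and covered / n_rep >= 0.95

for n in (50, 100, 200, 400, 800):                      # candidate validation sample sizes
    if meets_requirements(n):
        print(f"smallest candidate meeting all criteria: n = {n}")
        break
else:
    print("no candidate met all criteria; extend the candidate range")
```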

2021 ◽  
Vol 13 (3) ◽  
pp. 368
Author(s):  
Christopher A. Ramezan ◽  
Timothy A. Warner ◽  
Aaron E. Maxwell ◽  
Bradley S. Price

The size of the training data set is a major determinant of classification accuracy. Nevertheless, the collection of a large training data set for supervised classifiers can be a challenge, especially for studies covering a large area, which may be typical of many real-world applied projects. This work investigates how variations in training set size, ranging from a large sample size (n = 10,000) to a very small sample size (n = 40), affect the performance of six supervised machine-learning algorithms applied to classify large-area high-spatial-resolution (HR) (1–5 m) remotely sensed data within the context of a geographic object-based image analysis (GEOBIA) approach. GEOBIA, in which adjacent similar pixels are grouped into image-objects that form the unit of the classification, offers the potential benefit of allowing multiple additional variables, such as measures of object geometry and texture, thus increasing the dimensionality of the classification input data. The six supervised machine-learning algorithms are support vector machines (SVM), random forests (RF), k-nearest neighbors (k-NN), single-layer perceptron neural networks (NEU), learning vector quantization (LVQ), and gradient-boosted trees (GBM). RF, the algorithm with the highest overall accuracy, was notable for its negligible decrease in overall accuracy, 1.0%, when training sample size decreased from 10,000 to 315 samples. GBM provided similar overall accuracy to RF; however, the algorithm was very expensive in terms of training time and computational resources, especially with large training sets. In contrast to RF and GBM, NEU and SVM were particularly sensitive to decreasing sample size, with NEU classifications generally producing overall accuracies that were on average slightly higher than SVM classifications for larger sample sizes, but lower than SVM for the smallest sample sizes. NEU, however, required a longer processing time. The k-NN classifier saw less of a drop in overall accuracy than NEU and SVM as training set size decreased; however, the overall accuracies of k-NN were typically lower than those of the RF, NEU, and SVM classifiers. LVQ generally had the lowest overall accuracy of all six methods, but was relatively insensitive to sample size, down to the smallest sample sizes. Overall, due to its relatively high accuracy with small training sample sets, minimal variation in overall accuracy between very large and small sample sets, and relatively short processing time, RF was a good classifier for large-area land-cover classifications of HR remotely sensed data, especially when training data are scarce. However, as the performance of different supervised classifiers varies in response to training set size, investigating multiple classification algorithms is recommended to achieve optimal accuracy for a project.
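The core experimental loop, retraining each classifier on progressively smaller training samples and scoring overall accuracy on a fixed test set, can be sketched as below. This is only an illustrative stand-in: the data are synthetic tabular features rather than GEOBIA image-object attributes, and only RF and SVM are shown.

```python
# Sketch of a training-set-size experiment on synthetic data (assumed setup).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=12000, n_features=20, n_informative=10,
                           n_classes=5, n_clusters_per_class=1, random_state=1)
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=2000, stratify=y, random_state=1)

classifiers = {"RF": RandomForestClassifier(n_estimators=100, random_state=1),
               "SVM": SVC(kernel="rbf", gamma="scale", random_state=1)}

for n in (10000, 2500, 630, 315, 80, 40):               # shrinking training-set sizes
    accs = {}
    for name, clf in classifiers.items():
        clf.fit(X_train_full[:n], y_train_full[:n])
        accs[name] = accuracy_score(y_test, clf.predict(X_test))
    print(n, {name: round(a, 3) for name, a in accs.items()})
```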


Sensors ◽  
2018 ◽  
Vol 18 (11) ◽  
pp. 3777 ◽  
Author(s):  
Ataollah Shirzadi ◽  
Karim Soliamani ◽  
Mahmood Habibnejhad ◽  
Ataollah Kavian ◽  
Kamran Chapi ◽  
...  

The main objective of this research was to introduce novel machine learning ensembles of the alternating decision tree (ADTree) algorithm with the multiboost (MB), bagging (BA), rotation forest (RF) and random subspace (RS) ensemble algorithms, under two scenarios of different sample sizes and raster resolutions, for spatial prediction of shallow landslides around Bijar City, Kurdistan Province, Iran. The modeling process was evaluated using several statistical measures and the area under the receiver operating characteristic curve (AUROC). Results show that the RS model obtained high goodness-of-fit and prediction accuracy for the 60%/40% and 70%/30% sample sizes at a raster resolution of 10 m, while the MB model did so for the 80%/20% and 90%/10% sample sizes at a raster resolution of 20 m. The RS-ADTree and MB-ADTree ensemble models outperformed the ADTree model in both scenarios. Overall, MB-ADTree with a sample size of 80%/20% and a resolution of 20 m (area under the curve (AUC) = 0.942) and with a sample size of 60%/40% and a resolution of 10 m (AUC = 0.845) had the highest and lowest prediction accuracy, respectively. The findings confirm that the newly proposed models are very promising alternative tools to assist planners and decision makers in the task of managing landslide-prone areas.
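The evaluation pattern, training a tree ensemble on each train/test ratio and scoring it with AUROC, can be sketched as follows. scikit-learn has no ADTree implementation, so an ordinary decision tree inside a bagging ensemble stands in for the BA-ADTree combination, and synthetic features replace the landslide conditioning factors.

```python
# Illustrative stand-in: bagged decision trees scored by AUROC over the
# study's four train/test ratios; not the authors' ADTree-based models.
from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=12, random_state=2)

for train_frac in (0.6, 0.7, 0.8, 0.9):                  # 60/40 ... 90/10 splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, train_size=train_frac, stratify=y, random_state=2)
    ensemble = BaggingClassifier(DecisionTreeClassifier(max_depth=4, random_state=2),
                                 n_estimators=50, random_state=2).fit(X_tr, y_tr)
    auc = roc_auc_score(y_te, ensemble.predict_proba(X_te)[:, 1])
    print(f"train {round(train_frac * 100)}% / test {round((1 - train_frac) * 100)}%: "
          f"AUROC = {auc:.3f}")
```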


2020 ◽  
Vol 14 (1) ◽  
pp. 5
Author(s):  
Adam Adli ◽  
Pascal Tyrrell

Introduction: Advances in computers have allowed for the practical application of increasingly advanced machine learning models to aid healthcare providers with diagnosis and inspection of medical images. Often, a lack of training data and computation time can be a limiting factor in the development of an accurate machine learning model in the domain of medical imaging. As a possible solution, this study investigated whether L2 regularization moderates the overfitting that occurs as a result of small training sample sizes.
Methods: This study employed transfer learning experiments on a dental x-ray binary classification model to explore L2 regularization with respect to training sample size in five common convolutional neural network architectures. Model testing performance was investigated, and technical implementation details, including computation times, hardware considerations, performance factors, and practical feasibility, were described.
Results: The experimental results showed a trend that smaller training sample sizes benefitted more from regularization than larger training sample sizes. Further, the results showed that applying L2 regularization did not add significant computational overhead and that the extra rounds of training required by L2 regularization were feasible when training sample sizes were relatively small.
Conclusion: Overall, this study found that there is a window of opportunity in which the benefits of employing regularization can be most cost-effective relative to training sample size. It is recommended that training sample size be carefully considered when forming expectations of achievable generalizability improvements that result from investing computational resources into model regularization.
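In PyTorch-style training, L2 regularization is commonly applied through the optimizer's weight_decay term; the hypothetical sketch below (not the study's code) contrasts an unregularized and an L2-regularized fine-tuning setup, with random tensors standing in for a small dental x-ray batch and placeholder hyperparameters.

```python
# Hedged sketch of L2-regularized transfer learning; all settings are assumptions.
import torch
import torch.nn as nn
from torchvision import models

def build_model(weight_decay=0.0, lr=1e-4):
    # weights="IMAGENET1K_V1" would load a pretrained backbone for real transfer
    # learning; None keeps this sketch runnable offline.
    model = models.resnet18(weights=None)
    model.fc = nn.Linear(model.fc.in_features, 2)            # binary classification head
    optimizer = torch.optim.Adam(model.parameters(), lr=lr,
                                 weight_decay=weight_decay)   # L2 penalty on the weights
    return model, optimizer

criterion = nn.CrossEntropyLoss()

def train_step(model, optimizer, images, labels):
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
    return loss.item()

# one illustrative step each for an unregularized and an L2-regularized model
images, labels = torch.randn(8, 3, 224, 224), torch.randint(0, 2, (8,))
for wd in (0.0, 1e-4):
    model, optimizer = build_model(weight_decay=wd)
    print(f"weight_decay={wd}: loss={train_step(model, optimizer, images, labels):.3f}")
```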


Author(s):  
Zahra Hosseini ◽  
Hooman Latifi ◽  
Hamed Naghavi ◽  
Siavash Bakhtiarvand Bakhtiari ◽  
Fabian Ewald Fassnacht

Regular biomass estimations for natural and plantation forests are important to support sustainable forestry and to calculate carbon-related statistics. The application of remote sensing data to estimate forest biomass has been amply demonstrated, but there is still room for increasing the efficiency of current approaches. Here, we investigated the influence of field plot size and sample size on the accuracy of random forest models trained with information derived from Pléiades very high resolution (VHR) stereo images applied to plantation forests in an arid environment. We collected field data at 311 locations with three different plot sizes (100, 300 and 500 m2). In two experiments, we demonstrate how plot and sample sizes influence the accuracy of biomass estimation models. In the first experiment, we compared model accuracies obtained with varying plot sizes but a constant number of samples. In the second experiment, we fixed the total area to be sampled to account for the additional effort of collecting large field plots. Our results for the first experiment show that the model performance metrics Spearman’s r, RMSErel and RMSEnor improve from 0.61, 0.70 and 0.36 at a sample size of 24 to 0.79, 0.51 and 0.15 at a sample size of 192, respectively. In the second experiment, the highest accuracies were obtained with a plot size of 100 m2 (most samples), with Spearman’s r = 0.77, RMSErel = 0.59 and RMSEnor = 0.15. Results from a type-II analysis of variance suggest that the most important factor explaining model performance for our biomass models is sample size. Our results suggest no clear advantage for any plot size in reaching accurate biomass estimates using VHR stereo imagery in plantations. This is an important finding, which partly contradicts the suggestions of earlier studies but requires validation for other forest types and remote sensing data types (e.g. LiDAR).
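The headline metrics (Spearman's r and a relative RMSE) for models trained on subsamples of increasing size can be sketched as below; the synthetic regression data are placeholders for the Pléiades-derived predictors and field-measured biomass, and normalizing RMSE by the mean observed value is one plausible definition of RMSErel assumed here.

```python
# Illustrative sample-size sweep for a random forest biomass regressor (assumed setup).
import numpy as np
from scipy.stats import spearmanr
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=8, noise=20.0, random_state=3)
y = y - y.min() + 10.0                                   # keep the "biomass" target positive
X_pool, X_test, y_pool, y_test = train_test_split(X, y, test_size=100, random_state=3)

for n in (24, 48, 96, 192):                              # sample sizes used in the study
    rf = RandomForestRegressor(n_estimators=300, random_state=3)
    rf.fit(X_pool[:n], y_pool[:n])
    pred = rf.predict(X_test)
    rmse = np.sqrt(np.mean((pred - y_test) ** 2))
    rmse_rel = rmse / np.mean(y_test)                    # RMSE relative to the observed mean
    r, _ = spearmanr(y_test, pred)
    print(f"n={n:>3}  Spearman r={r:.2f}  relative RMSE={rmse_rel:.2f}")
```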


2019 ◽  
Author(s):  
Marc-Andre Schulz ◽  
B.T. Thomas Yeo ◽  
Joshua T. Vogelstein ◽  
Janaina Mourao-Miranada ◽  
Jakob N. Kather ◽  
...  

In recent years, deep learning has unlocked unprecedented success in various domains, especially in image, text, and speech processing. These breakthroughs may hold promise for neuroscience, especially for brain-imaging investigators who are beginning to analyze thousands of participants. However, deep learning is only beneficial if the data have nonlinear relationships and if these are exploitable at currently available sample sizes. We systematically profiled the performance of deep models, kernel models, and linear models as a function of sample size on UK Biobank brain images against established machine learning references. On MNIST and Zalando Fashion, prediction accuracy consistently improved when escalating from linear models to shallow nonlinear models, and improved further when switching to deep nonlinear models. The more observations were available for model training, the greater the performance gain we saw. In contrast, using structural or functional brain scans, simple linear models performed on par with more complex, highly parameterized models in age/sex prediction across increasing sample sizes. In fact, linear models kept improving as the sample size approached ∼10,000 participants. Our results indicate that the increase in performance of linear models with additional data does not saturate at the limit of current feasibility. Yet, nonlinearities of common brain scans remain largely inaccessible to both kernel and deep learning methods at any examined scale.
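The learning-curve profiling can be sketched in a few lines: fit a linear model, a kernel model, and a small nonlinear network on growing training samples and score them on a fixed test set. The digits dataset below is only a tiny stand-in for MNIST/Fashion or brain-imaging features.

```python
# Sketch of a linear vs. kernel vs. nonlinear learning-curve comparison (assumed setup).
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=500, stratify=y, random_state=4)

models = {"linear": LogisticRegression(max_iter=2000),
          "kernel": SVC(kernel="rbf", gamma="scale"),
          "nonlinear": MLPClassifier(hidden_layer_sizes=(64, 64), max_iter=1000,
                                     random_state=4)}

for n in (100, 300, 600, len(y_tr)):                     # growing training sample sizes
    scores = {name: m.fit(X_tr[:n], y_tr[:n]).score(X_te, y_te)
              for name, m in models.items()}
    print(n, {name: round(s, 3) for name, s in scores.items()})
```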


2021 ◽  
Author(s):  
Mihai Alexandru Constantin ◽  
Noémi Katalin Schuurman ◽  
Jeroen Vermunt

We introduce a general method for sample size computations in the context of cross-sectional network models. The method takes the form of an automated Monte Carlo algorithm, designed to find an optimal sample size while iteratively concentrating the computations on the sample sizes that seem most relevant. The method requires three inputs: 1) a hypothesized network structure or desired characteristics of that structure, 2) an estimation performance measure and its corresponding target value (e.g., a sensitivity of 0.6), and 3) a statistic and its corresponding target value that determine how the target value for the performance measure should be reached (e.g., reaching a sensitivity of 0.6 with a probability of 0.8). The method consists of a Monte Carlo simulation step for computing the performance measure and the statistic for several sample sizes selected from an initial candidate sample size range, a curve-fitting step for interpolating the statistic across the entire candidate range, and a stratified bootstrapping step to quantify the uncertainty around the recommendation provided. We evaluated the performance of the method for the Gaussian Graphical Model, but it can easily be extended to other models. It displayed good performance, with the sample size recommendations being, on average, at most 1.14 sample sizes away from the truth, with a highest standard deviation of 26.25 sample sizes. The method is implemented in the form of an R package called powerly, available on GitHub and CRAN.
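The skeleton of the procedure, a Monte Carlo step over candidate sample sizes followed by a curve-fitting step and a thresholded recommendation, can be sketched as below. The reference implementation is the R package powerly; in this sketch the performance statistic is deliberately simplified to the probability of detecting a single true correlation, a placeholder for the sensitivity of a Gaussian Graphical Model estimator, and the bootstrap step for uncertainty quantification is omitted.

```python
# Simplified Monte Carlo sample-size search (placeholder performance measure).
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)

def simulate_statistic(n, n_rep=200, rho=0.3):
    """Monte Carlo step: probability that the target performance is reached at size n."""
    hits = 0
    for _ in range(n_rep):
        x = rng.normal(size=n)
        y = rho * x + np.sqrt(1 - rho**2) * rng.normal(size=n)
        r, p = stats.pearsonr(x, y)
        hits += p < 0.05                                  # "edge detected"
    return hits / n_rep

candidates = np.array([20, 40, 60, 80, 100, 140, 180])    # candidate sample size range
probs = np.array([simulate_statistic(n) for n in candidates])

# Curve-fitting step: interpolate the statistic across the whole candidate range.
grid = np.arange(candidates.min(), candidates.max() + 1)
fitted = np.clip(np.polyval(np.polyfit(candidates, probs, deg=3), grid), 0, 1)

# Recommendation: smallest size whose fitted statistic reaches the target (here 0.8).
target = 0.8
reachable = fitted >= target
recommended = int(grid[np.argmax(reachable)]) if reachable.any() else None
print("recommended sample size:", recommended)
```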


2020 ◽  
Vol 12 (4) ◽  
pp. 1525 ◽  
Author(s):  
Feifei Yang ◽  
David W. Wanik ◽  
Diego Cerrai ◽  
Md Abul Ehsan Bhuiyan ◽  
Emmanouil N. Anagnostou

A growing number of electricity utilities use machine learning-based outage prediction models (OPMs) to predict the impact of storms on their networks for sustainable management. The accuracy of OPM predictions is sensitive to the sample size and event severity representativeness of the training dataset, the extent of which has not yet been quantified. This study devised a randomized, out-of-sample validation experiment to quantify an OPM’s prediction uncertainty under different training sample sizes and levels of event severity representativeness. The study showed random error decreasing by more than 100% for sample sizes ranging from 10 to 80 extratropical events, and by 32% for sample sizes from 10 to 40 thunderstorms. This study quantified the minimum sample size needed for the OPM to attain acceptable prediction performance. The results demonstrated that conditioning the training of the OPM on a subset of events representative of the predicted event’s severity reduced the underestimation bias exhibited in high-impact events and the overestimation bias in low-impact ones. We used cross entropy (CE) to quantify the relatedness of the weather variable distributions between the training dataset and the forecasted event.
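The cross-entropy check described in the final sentence can be sketched with binned histograms of a single weather variable; the gamma-distributed "gust" samples below are synthetic placeholders for the training events and the forecasted event.

```python
# Hypothetical cross-entropy comparison of weather-variable distributions.
import numpy as np

rng = np.random.default_rng(6)
train_gusts = rng.gamma(shape=4.0, scale=5.0, size=2000)     # pooled training events
event_gusts = rng.gamma(shape=6.0, scale=5.0, size=300)      # forecasted event

bins = np.histogram_bin_edges(np.concatenate([train_gusts, event_gusts]), bins=30)
p, _ = np.histogram(event_gusts, bins=bins)
q, _ = np.histogram(train_gusts, bins=bins)
p = p.astype(float) + 1e-9                                   # smooth empty bins
q = q.astype(float) + 1e-9
p, q = p / p.sum(), q / q.sum()

cross_entropy = -np.sum(p * np.log(q))                       # H(p, q); lower means more related
print(f"cross entropy between event and training distributions: {cross_entropy:.3f}")
```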


Author(s):  
Mohammad Z. Bashar ◽  
Cristina Torres-Machi

Significant research efforts have documented the capabilities of machine learning (ML) algorithms to model pavement performance. Several challenges, however, limit the implementation of ML by practitioners and transportation agencies. One of these challenges is related to the high variability in the performance of ML models as reported by different studies and the lack of quantitative evidence supporting the true effectiveness of these techniques. The objective of this paper is twofold: to assess the overall performance of traditional and ML techniques used to predict pavement condition, and to provide guidance on the optimal architecture and minimum sample size required to develop these models. This paper analyzes three ML algorithms commonly used to predict the International Roughness Index (IRI)—Artificial Neural Network (ANN), Random Forest (RF), and Support Vector Machine (SVM)—and compares their performance to traditional techniques. An inverse variance heterogeneity based meta-analysis is performed on 20 studies conducted between 2001 and 2020. The results indicate that ML algorithms capture on average 15.6% more variability than traditional techniques. RF is the most accurate technique, with an overall performance value of 0.995. ANN is also identified as a highly effective technique that has been widely used and provides accurate predictions with both small and large sample sizes. For ANN algorithms, a single hidden layer with nodes equal to 0.3–2 times the number of input features is found to be sufficient in predicting pavement deterioration. A minimum sample size equal to 50 times the number of input variables is recommended for modeling pavement deterioration using ML.
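The two sizing rules reported above reduce to simple arithmetic; a short sketch with a placeholder feature count:

```python
# Rule-of-thumb sizing for an ANN pavement-deterioration model (hypothetical inputs).
n_features = 12                                    # e.g., age, traffic, climate, structure
hidden_nodes = (round(0.3 * n_features), 2 * n_features)   # 0.3-2 times the input count
min_sample_size = 50 * n_features                  # at least 50 observations per input

print(f"single hidden layer with {hidden_nodes[0]}-{hidden_nodes[1]} nodes")
print(f"minimum recommended sample size: {min_sample_size} observations")
```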

