Statistical Tests for Cross-Validation of Kriging Models

Author(s):  
Jack P. C. Kleijnen ◽  
Wim C. M. van Beers

Kriging or Gaussian process models are popular metamodels (surrogate models or emulators) of simulation models; these metamodels give predictors for input combinations that are not simulated. To validate these metamodels for computationally expensive simulation models, analysts often apply computationally efficient cross-validation. In this paper, we derive new statistical tests for so-called leave-one-out cross-validation. Graphically, we present these tests as scatterplots augmented with confidence intervals that use the estimated variances of the Kriging predictors. To estimate the true variances of these predictors, we might use bootstrapping. Like other statistical tests, our tests, with or without bootstrapping, have type I and type II error probabilities; to estimate these probabilities, we use Monte Carlo experiments. We also use such experiments to investigate statistical convergence. To illustrate the application of our tests, we use (i) an example with two inputs and (ii) the popular borehole example with eight inputs. Summary of Contribution: Simulation models are very popular in operations research (OR) and are also known as computer simulations or computer experiments. A popular topic is the design and analysis of computer experiments. This paper focuses on Kriging methods and cross-validation methods applied to simulation models; these methods and models are often applied in OR. More specifically, the paper provides the following: (1) the basic variant of a new statistical test for leave-one-out cross-validation; (2) a bootstrap method for the estimation of the true variance of the Kriging predictor; and (3) Monte Carlo experiments for the evaluation of the consistency of the Kriging predictor, the convergence of the Studentized prediction error to the standard normal variable, and the convergence of the expected experimentwise type I error rate to the prespecified nominal value. The new statistical test is illustrated through examples, including the popular borehole model.
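
The core quantity behind such a test is the Studentized leave-one-out prediction error: the difference between the simulated output and the Kriging prediction at the left-out point, divided by the predictor's estimated standard deviation. The sketch below illustrates this computation with scikit-learn's Gaussian process regressor on synthetic data; it is only a minimal illustration of the idea, not the authors' implementation (the kernel, the toy test function, and the size of the design are all placeholders).

```python
# Minimal sketch of leave-one-out cross-validation for a Kriging (Gaussian
# process) metamodel with Studentized prediction errors.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF, ConstantKernel

rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(20, 2))           # 20 simulated input combinations
y = np.sin(6 * X[:, 0]) + X[:, 1] ** 2            # stand-in for simulation outputs

kernel = ConstantKernel(1.0) * RBF(length_scale=[0.3, 0.3])
studentized = np.empty(len(y))
for i in range(len(y)):
    keep = np.delete(np.arange(len(y)), i)        # leave observation i out
    gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
    gp.fit(X[keep], y[keep])
    mean_i, std_i = gp.predict(X[i:i + 1], return_std=True)
    studentized[i] = (y[i] - mean_i[0]) / std_i[0]  # prediction error / estimated std

# Values far outside roughly +/-2 flag input combinations where the Kriging
# predictor and its estimated variance are not consistent with the simulation.
print(np.round(studentized, 2))
```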

Mathematics ◽  
2021 ◽  
Vol 9 (8) ◽  
pp. 817
Author(s):  
Fernando López ◽  
Mariano Matilla-García ◽  
Jesús Mur ◽  
Manuel Ruiz Marín

A novel general method for constructing nonparametric hypothesis tests based on the field of symbolic analysis is introduced in this paper. Several existing tests based on symbolic entropy that have been used for testing central hypotheses in several branches of science (particularly in economics and statistics) are particular cases of this general approach. This family of symbolic tests relies on few assumptions, which increases the general applicability of any symbolic-based test. Additionally, as a theoretical application of this method, we construct and put forward four new statistics to test the null hypothesis of spatiotemporal independence. There are very few tests in the specialized literature in this regard. The new tests were evaluated by means of several Monte Carlo experiments. The results highlight the outstanding performance of the proposed tests.
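
As a rough illustration of the symbolic approach, the sketch below symbolizes a univariate series by ordinal (permutation) patterns, computes the symbolic entropy, and obtains a Monte Carlo p-value by re-symbolizing shuffled copies of the series. This is only a toy serial-independence check under assumed settings (embedding dimension 3, 500 shuffles); it is not the spatiotemporal statistics proposed in the paper.

```python
# Toy symbolic independence check: ordinal-pattern symbolization, symbolic
# entropy, and a Monte Carlo p-value from shuffled series.
import itertools
import numpy as np

def symbolic_entropy(x, m=3):
    """Shannon entropy of the ordinal-pattern (symbol) distribution of x."""
    patterns = list(itertools.permutations(range(m)))
    counts = dict.fromkeys(patterns, 0)
    for i in range(len(x) - m + 1):
        counts[tuple(np.argsort(x[i:i + m]))] += 1
    p = np.array(list(counts.values()), dtype=float)
    p = p[p > 0] / p.sum()
    return -(p * np.log(p)).sum()

rng = np.random.default_rng(1)
x = rng.normal(size=500)
h_obs = symbolic_entropy(x)
# Under independence the symbol distribution is close to uniform (maximum
# entropy), so a low observed entropy signals dependence. Shuffling the series
# approximates the null distribution of the entropy.
h_null = np.array([symbolic_entropy(rng.permutation(x)) for _ in range(500)])
p_value = np.mean(h_null <= h_obs)
print(round(p_value, 3))
```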


2019 ◽  
Vol 11 (2) ◽  
pp. 185 ◽  
Author(s):  
Christopher A. Ramezan ◽  
Timothy A. Warner ◽  
Aaron E. Maxwell

High spatial resolution (1–5 m) remotely sensed datasets are increasingly being used to map land covers over large geographic areas using supervised machine learning algorithms. Although many studies have compared machine learning classification methods, sample selection methods for acquiring training and validation data for machine learning, and cross-validation techniques for tuning classifier parameters are rarely investigated, particularly on large, high spatial resolution datasets. This work, therefore, examines four sample selection methods—simple random, proportional stratified random, disproportional stratified random, and deliberative sampling—as well as three cross-validation tuning approaches—k-fold, leave-one-out, and Monte Carlo methods. In addition, the effect on accuracy of localizing sample selection to a small geographic subset of the entire area, an approach that is sometimes used to reduce the costs associated with training data collection, is investigated. These methods are investigated in the context of support vector machine (SVM) classification and geographic object-based image analysis (GEOBIA), using high spatial resolution National Agricultural Imagery Program (NAIP) orthoimagery and LIDAR-derived rasters, covering a 2,609 km² regional-scale area in northeastern West Virginia, USA. Stratified statistical-based sampling methods were found to generate the highest classification accuracy. Using a small number of training samples collected from only a subset of the study area provided a similar level of overall accuracy to a sample of equivalent size collected in a dispersed manner across the entire regional-scale dataset. There were minimal differences in accuracy between the different cross-validation tuning methods. The processing times for Monte Carlo and leave-one-out cross-validation were high, especially with large training sets. For this reason, k-fold cross-validation appears to be a good choice. Classifiers trained with samples collected deliberately (i.e., not randomly) were less accurate than classifiers trained from statistical-based samples. This may be due to the high positive spatial autocorrelation in the deliberative training set. Thus, if possible, samples for training should be selected randomly; deliberative samples should be avoided.
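
For concreteness, a minimal sketch of the k-fold cross-validation tuning that the study identifies as a good cost/accuracy compromise for SVM classification is given below. The feature matrix, labels, and parameter grid are placeholders standing in for the NAIP/LIDAR-derived object features and land-cover classes.

```python
# k-fold cross-validation tuning of an RBF SVM with scikit-learn.
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 8))                    # placeholder object features
y = rng.integers(0, 4, size=300)                 # placeholder land-cover classes

param_grid = {"svc__C": [1, 10, 100], "svc__gamma": [0.01, 0.1, 1.0]}
search = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    param_grid,
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=0),
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```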


Bragantia ◽  
2014 ◽  
Vol 73 (2) ◽  
pp. 192-202 ◽  
Author(s):  
Gabriel Constantino Blain

Several studies have applied the Kolmogorov-Smirnov test (KS) to verify whether a particular parametric distribution can be used to assess the probability of occurrence of a given agrometeorological variable. However, when this test is applied to the same data sample from which the distribution parameters have been estimated, it leads to a high probability of failure to reject a false null hypothesis. Although the Lilliefors test was proposed to remedy this drawback, several studies still use the KS test even when the requirement of independence between the data and the estimated parameters is not met. Aiming to stimulate the use of the Lilliefors test, we revisited the critical values of the Lilliefors test for both the gamma (gam) and normal distributions, provided easy-to-use procedures capable of calculating the Lilliefors test, and evaluated the performance of these two tests in correctly accepting a hypothesized distribution. The Lilliefors test was calculated by using critical values previously presented in the scientific literature (KSLcrit) and those obtained from the procedures proposed in this study (NKSLcrit). Through Monte Carlo simulations we demonstrated that the frequency of occurrence of Type I (II) errors associated with the KSLcrit may be unacceptably low (high). By using the NKSLcrit we were able to meet the significance level in all Monte Carlo experiments. The NKSLcrit also led to the lowest rate of Type II errors. Finally, we also provided polynomial equations that eliminate the need to perform statistical simulations to calculate the Lilliefors test for both the gam and normal distributions.
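
A sketch of how Lilliefors-type critical values can be approximated by Monte Carlo simulation is shown below for the normal distribution (the gamma case is analogous). The key point is that the Kolmogorov-Smirnov statistic is computed against a distribution whose parameters are re-estimated from each simulated sample, which is exactly the situation that invalidates the classical KS critical values. The sample size, significance level, and number of simulations are illustrative choices, not the values used in the study.

```python
# Monte Carlo approximation of a Lilliefors critical value for the normal case.
import numpy as np
from scipy import stats

def lilliefors_critical_value(n, alpha=0.05, n_sim=5000, seed=0):
    rng = np.random.default_rng(seed)
    d = np.empty(n_sim)
    for i in range(n_sim):
        x = rng.normal(size=n)
        # Parameters estimated from the same sample, as in the Lilliefors setting.
        d[i] = stats.kstest(x, "norm", args=(x.mean(), x.std(ddof=1))).statistic
    return np.quantile(d, 1.0 - alpha)

print(round(lilliefors_critical_value(30), 3))   # roughly 0.16 for n = 30
```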


1998 ◽  
Vol 10 (7) ◽  
pp. 1895-1923 ◽  
Author(s):  
Thomas G. Dietterich

This article reviews five approximate statistical tests for determining whether one learning algorithm outperforms another on a particular learning task. These tests are compared experimentally to determine their probability of incorrectly detecting a difference when no difference exists (type I error). Two widely used statistical tests are shown to have high probability of type I error in certain situations and should never be used: a test for the difference of two proportions and a paired-differences t test based on taking several random train-test splits. A third test, a paired-differences t test based on 10-fold cross-validation, exhibits somewhat elevated probability of type I error. A fourth test, McNemar's test, is shown to have low type I error. The fifth test is a new test, 5 × 2 cv, based on five iterations of twofold cross-validation. Experiments show that this test also has acceptable type I error. The article also measures the power (ability to detect algorithm differences when they do exist) of these tests. The cross-validated t test is the most powerful. The 5 × 2 cv test is shown to be slightly more powerful than McNemar's test. The choice of the best test is determined by the computational cost of running the learning algorithm. For algorithms that can be executed only once, McNemar's test is the only test with acceptable type I error. For algorithms that can be executed 10 times, the 5 × 2 cv test is recommended, because it is slightly more powerful and because it directly measures variation due to the choice of training set.
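
A hedged sketch of the 5 × 2 cv paired t test follows: five replications of twofold cross-validation, the difference in error rates of the two classifiers recorded on each fold, and a statistic that divides the first-fold difference of the first replication by the pooled per-replication variance and is referred to a t distribution with 5 degrees of freedom. The two classifiers and the synthetic data are placeholders.

```python
# 5 x 2 cv paired t test comparing two classifiers on synthetic data.
import numpy as np
from scipy import stats
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=400, random_state=0)
clf_a = LogisticRegression(max_iter=1000)
clf_b = DecisionTreeClassifier(random_state=0)

p = np.empty((5, 2))                       # error-rate differences, 5 reps x 2 folds
for r in range(5):
    skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=r)
    for j, (train, test) in enumerate(skf.split(X, y)):
        err_a = 1.0 - clf_a.fit(X[train], y[train]).score(X[test], y[test])
        err_b = 1.0 - clf_b.fit(X[train], y[train]).score(X[test], y[test])
        p[r, j] = err_a - err_b

s2 = ((p - p.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)  # per-replication variance
t_stat = p[0, 0] / np.sqrt(s2.mean())
p_value = 2 * stats.t.sf(abs(t_stat), df=5)
print(round(t_stat, 3), round(p_value, 3))
```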


2015 ◽  
Vol 138 (1) ◽  
Author(s):  
Haitao Liu ◽  
Shengli Xu ◽  
Ying Ma ◽  
Xudong Chen ◽  
Xiaofang Wang

Computer simulations have been increasingly used to study physical problems in various fields. To reduce the computational burden, cheap-to-run metamodels, constructed from a finite number of experiment points in the design space using the design of computer experiments (DOE), are employed to replace the costly simulation models. A key issue related to DOE is designing sequential computer experiments to achieve an accurate metamodel with as few points as possible. This article investigates the performance of current Bayesian sampling approaches and proposes an adaptive maximum entropy (AME) approach. In the proposed approach, the leave-one-out (LOO) cross-validation error provides an easy estimate of the error information, the local space-filling exploration strategy avoids the clustering problem, and the search pattern from global to local improves the sampling efficiency. A comparison study of six examples with different types of initial points demonstrated that the AME approach is very promising for global metamodeling.
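
The sketch below only illustrates the general structure of such an adaptive sampling criterion under assumed choices: candidate points are scored by a leave-one-out error estimate of the current surrogate (exploitation) multiplied by the distance to the existing design (space-filling exploration). The actual AME criterion and global-to-local search pattern in the paper differ.

```python
# Illustrative adaptive sampling step: LOO error of the current surrogate
# combined with a distance-based space-filling term.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def loo_errors(X, y):
    errs = np.empty(len(y))
    for i in range(len(y)):
        keep = np.delete(np.arange(len(y)), i)
        gp = GaussianProcessRegressor(normalize_y=True).fit(X[keep], y[keep])
        errs[i] = abs(y[i] - gp.predict(X[i:i + 1])[0])
    return errs

rng = np.random.default_rng(0)
X = rng.uniform(size=(12, 2))                    # current design
y = np.sin(4 * X[:, 0]) * X[:, 1]                # toy simulation outputs
candidates = rng.uniform(size=(200, 2))          # candidate new points

errs = loo_errors(X, y)
sq_dists = ((candidates[:, None, :] - X[None, :, :]) ** 2).sum(-1)
exploit = errs[np.argmin(sq_dists, axis=1)]      # LOO error at nearest design point
explore = np.sqrt(sq_dists).min(axis=1)          # distance to the existing design
next_point = candidates[np.argmax(exploit * explore)]
print(next_point)
```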


Author(s):  
Younus Hazim Al-Taweel ◽  
Najlaa Sadeek

Kriging is a statistical approach for analyzing computer experiments. Kriging models can be used as fast-running surrogate models for computationally expensive computer codes, and they can be built using different methods, such as the maximum likelihood estimation method and the leave-one-out cross-validation method. The objective of this paper is to evaluate and compare these methods for building Kriging models. This evaluation and comparison is carried out using measures that test the assumptions used in building Kriging models. We apply Kriging models built with the two methods to a real high-dimensional computer code and demonstrate our evaluation and comparison through these measures.


1999 ◽  
Vol 11 (4) ◽  
pp. 863-870 ◽  
Author(s):  
Isabelle Rivals ◽  
Léon Personnaz

In response to Zhu and Rohwer (1996), a recent communication (Goutte, 1997) established that leave-one-out cross-validation is not subject to the "no-free-lunch" criticism. Despite this optimistic conclusion, we show here that cross-validation performs very poorly for the selection of linear models compared with classical statistical tests. We conclude that statistical tests are preferable to cross-validation for linear as well as for nonlinear model selection.
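
For linear models, leave-one-out cross-validation can be computed exactly with the hat-matrix shortcut e_i / (1 - h_ii) (the PRESS statistic), which makes the comparison with classical tests easy to reproduce. The sketch below contrasts PRESS-based selection with a partial F test on toy nested models; it is an illustration of the setting, not the authors' experiments.

```python
# Nested linear-model selection: exact LOO via PRESS vs. a partial F test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 60
x = rng.uniform(-1, 1, size=n)
y = 1.0 + 2.0 * x + rng.normal(scale=0.5, size=n)   # the quadratic term is spurious

def press(X, y):
    H = X @ np.linalg.solve(X.T @ X, X.T)           # hat matrix
    e = y - H @ y
    return np.sum((e / (1.0 - np.diag(H))) ** 2)    # exact LOO sum of squares

X_small = np.column_stack([np.ones(n), x])
X_big = np.column_stack([np.ones(n), x, x ** 2])

# (i) cross-validation choice: the model with the smaller PRESS wins.
print("PRESS:", round(press(X_small, y), 2), round(press(X_big, y), 2))

# (ii) classical partial F test for the extra regressor.
rss_small = np.sum((y - X_small @ np.linalg.lstsq(X_small, y, rcond=None)[0]) ** 2)
rss_big = np.sum((y - X_big @ np.linalg.lstsq(X_big, y, rcond=None)[0]) ** 2)
F = (rss_small - rss_big) / (rss_big / (n - X_big.shape[1]))
print("F p-value:", round(stats.f.sf(F, 1, n - X_big.shape[1]), 3))
```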


Sensors ◽  
2021 ◽  
Vol 21 (6) ◽  
pp. 1932
Author(s):  
Julian Caicedo-Acosta ◽  
German A. Castaño ◽  
Carlos Acosta-Medina ◽  
Andres Alvarez-Meza ◽  
German Castellanos-Dominguez

Motor imagery (MI) induces recovery and neuroplasticity in neurophysical regulation. However, a non-negligible portion of users presents insufficient coordination skills for sensorimotor cortex control. Assessments of the relationship between wakefulness and task states are conducted to foster neurophysiological and mechanistic interpretation in MI-related applications. Thus, to understand the organization of information processing, measures of functional connectivity are used. In addition, neural network regression models are becoming popular because they reduce the need to extract features manually. However, predicting the neurophysiological inefficiency of MI practice raises several problems, such as enhancing network regression performance in the presence of overfitting risk. Here, to increase prediction performance, we develop a deep network regression model that includes three procedures: leave-one-out cross-validation combined with Monte Carlo dropout layers, subject clustering of MI inefficiency, and transfer learning between neighboring runs. Validation is performed using functional connectivity predictors extracted from two electroencephalographic databases acquired under conditions close to real MI applications (150 users), resulting in a high prediction of pretraining desynchronization and initial training synchronization with adequate physiological interpretability.
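
A minimal sketch of the Monte Carlo dropout ingredient is given below: dropout is kept active at prediction time and several stochastic forward passes are averaged, yielding both a point prediction and an uncertainty estimate. The architecture, layer sizes, and dropout rate are placeholders, not the network described in the paper.

```python
# Monte Carlo dropout at inference time with a tiny PyTorch regressor.
import torch
import torch.nn as nn

class MCDropoutRegressor(nn.Module):
    def __init__(self, n_features, p=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(), nn.Dropout(p),
            nn.Linear(64, 1),
        )

    def forward(self, x):
        return self.net(x)

model = MCDropoutRegressor(n_features=32)
x = torch.randn(8, 32)                 # e.g. functional-connectivity predictors

model.train()                          # keep dropout stochastic at prediction time
with torch.no_grad():
    samples = torch.stack([model(x) for _ in range(50)])   # 50 stochastic passes
pred_mean, pred_std = samples.mean(dim=0), samples.std(dim=0)
print(pred_mean.shape, pred_std.shape)
```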


2021 ◽  
Vol 11 (1) ◽  
Author(s):  
A. Wong ◽  
Z. Q. Lin ◽  
L. Wang ◽  
A. G. Chung ◽  
B. Shen ◽  
...  

A critical step in effective care and treatment planning for severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the cause of the coronavirus disease 2019 (COVID-19) pandemic, is the assessment of the severity of disease progression. Chest x-rays (CXRs) are often used to assess SARS-CoV-2 severity, with two important assessment metrics being extent of lung involvement and degree of opacity. In this proof-of-concept study, we assess the feasibility of computer-aided scoring of CXRs of SARS-CoV-2 lung disease severity using a deep learning system. Data consisted of 396 CXRs from SARS-CoV-2 positive patient cases. Geographic extent and opacity extent were scored by two board-certified expert chest radiologists (with 20+ years of experience) and a 2nd-year radiology resident. The deep neural networks used in this study, which we name COVID-Net S, are based on a COVID-Net network architecture. 100 versions of the network were independently learned (50 to perform geographic extent scoring and 50 to perform opacity extent scoring) using random subsets of CXRs from the study, and we evaluated the networks using stratified Monte Carlo cross-validation experiments. The COVID-Net S deep neural networks yielded R² of 0.664 ± 0.032 and 0.635 ± 0.044 between predicted scores and radiologist scores for geographic extent and opacity extent, respectively, in stratified Monte Carlo cross-validation experiments. The best-performing COVID-Net S networks achieved R² of 0.739 and 0.741 between predicted scores and radiologist scores for geographic extent and opacity extent, respectively. The results are promising and suggest that the use of deep neural networks on CXRs could be an effective tool for computer-aided assessment of SARS-CoV-2 lung disease severity, although additional studies are needed before adoption for routine clinical use.
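
The evaluation protocol can be pictured as repeated stratified random splits with R² computed between predicted and reference scores on each held-out split, as in the hedged sketch below. A generic regressor and random features stand in for COVID-Net S and the CXR data; the sizes and number of splits are assumptions.

```python
# Stratified Monte Carlo cross-validation loop reporting R^2 per split.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import StratifiedShuffleSplit

rng = np.random.default_rng(0)
X = rng.normal(size=(396, 16))                 # placeholder CXR features
scores = rng.integers(0, 9, size=396)          # placeholder radiologist scores (0-8)

splitter = StratifiedShuffleSplit(n_splits=50, test_size=0.25, random_state=0)
r2 = []
for train, test in splitter.split(X, scores):  # splits stratified on the score
    model = RandomForestRegressor(random_state=0).fit(X[train], scores[train])
    r2.append(r2_score(scores[test], model.predict(X[test])))
print(f"{np.mean(r2):.3f} +/- {np.std(r2):.3f}")
```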

