Resampling Plans and the Estimation of Prediction Error

This article was prepared for the Special Issue on Resampling methods for statistical inference of the 2020s. Modern algorithms such as random forests and deep learning are automatic machines for producing prediction rules from training data. Resampling plans have been the key technology for evaluating a rule’s prediction accuracy. After a careful description of the measurement of prediction error, the article discusses the advantages and disadvantages of the principal methods: cross-validation, the nonparametric bootstrap, covariance penalties (Mallows’ Cp and the Akaike Information Criterion), and conformal inference. The emphasis is on a broad overview of a large subject, featuring examples, simulations, and a minimum of technical detail.
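As a rough illustration of the first method listed above, the following sketch compares the apparent (in-sample) error of a least-squares line with its K-fold cross-validation estimate of prediction error. The simulated data, the squared-error loss, and the helper names `fit_line`, `mse`, and `kfold_cv_error` are assumptions made for this example and are not taken from the article.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def mse(model, xs, ys):
    """Mean squared error of a fitted line on the given points."""
    a, b = model
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def kfold_cv_error(xs, ys, k=5, seed=0):
    """K-fold cross-validation estimate of mean squared prediction error."""
    idx = list(range(len(xs)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # k disjoint folds covering all points
    total = 0.0
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        a, b = fit_line([xs[i] for i in train], [ys[i] for i in train])
        # Score only the points the model did not see during fitting.
        total += sum((ys[i] - (a + b * xs[i])) ** 2 for i in held_out)
    return total / len(xs)

# Hypothetical simulated data: a noisy straight line.
rng = random.Random(1)
xs = [rng.uniform(0, 10) for _ in range(40)]
ys = [2.0 + 0.5 * x + rng.gauss(0, 1) for x in xs]

apparent = mse(fit_line(xs, ys), xs, ys)
cv = kfold_cv_error(xs, ys, k=5)
print(f"apparent error: {apparent:.3f}, 5-fold CV error: {cv:.3f}")
```

For least squares, the cross-validation estimate is never smaller than the apparent error, which reflects the optimism of evaluating a rule on its own training data.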

Parsimonious Predictive Mortality Modeling by Regularization and Cross-Validation with and without Covid-Type Effect

Risks ◽ 10.3390/risks9010005 ◽ 2020 ◽ Vol 9 (1) ◽ pp. 5

Author(s): Karim Barigou ◽ Stéphane Loisel ◽ Yahia Salhi

Keyword(s): Life Insurance ◽ Cross Validation ◽ Cohort Effect ◽ Information Criterion ◽ Mortality Forecasting ◽ Hand Model ◽ Regularization Techniques ◽ Bayes Information Criterion ◽ The One ◽ Regularized Model

Predicting the evolution of mortality rates plays a central role for life insurance and pension funds. Standard single-population models typically suffer from two major drawbacks: on the one hand, they use a large number of parameters compared to the sample size and, on the other hand, model choice is still often based on in-sample criteria, such as the Bayes information criterion (BIC), and therefore not on the ability to predict. In this paper, we develop a model based on a decomposition of the mortality surface into a polynomial basis. Then, we show how regularization techniques and cross-validation can be used to obtain a parsimonious and coherent predictive model for mortality forecasting. We analyze how COVID-19-type effects can affect predictions in our approach and in the classical one. In particular, death rate forecasts tend to be more robust compared to models with a cohort effect, and the regularized model outperforms the so-called P-spline model in terms of prediction and stability.

Nonparametric bootstrap inference for the targeted highly adaptive least absolute shrinkage and selection operator (LASSO) estimator

The International Journal of Biostatistics ◽ 10.1515/ijb-2017-0070 ◽ 2020 ◽ Vol 0 (0)

Author(s): Weixin Cai ◽ Mark van der Laan

Keyword(s): Confidence Intervals ◽ Nonparametric Estimation ◽ Cross Validation ◽ Data Distribution ◽ Average Treatment Effect ◽ Finite Sample ◽ Nonparametric Bootstrap ◽ A Value ◽ Variation Norm ◽ Selection Operator

The Highly-Adaptive least absolute shrinkage and selection operator (LASSO) Targeted Minimum Loss Estimator (HAL-TMLE) is an efficient plug-in estimator of a pathwise differentiable parameter in a statistical model that at minimum (and possibly only) assumes that the sectional variation norm of the true nuisance functions (i.e., the relevant part of the data distribution) is finite. It relies on an initial estimator (HAL-MLE) of the nuisance functions, obtained by minimizing the empirical risk over the parameter space under the constraint that the sectional variation norm of the candidate functions is bounded by a constant, where this constant can be selected with cross-validation. In this article we establish that the nonparametric bootstrap for the HAL-TMLE, fixing the value of the sectional variation norm at a value larger than or equal to the cross-validation selector, provides a consistent method for estimating the normal limit distribution of the HAL-TMLE. In order to optimize the finite sample coverage of the nonparametric bootstrap confidence intervals, we propose a selection method for this sectional variation norm that is based on running the nonparametric bootstrap for all values of the sectional variation norm larger than the one selected by cross-validation, and subsequently determining a value at which the width of the resulting confidence intervals reaches a plateau. We demonstrate our method for 1) nonparametric estimation of the average treatment effect when observing a covariate vector, binary treatment, and outcome, and for 2) nonparametric estimation of the integral of the square of the multivariate density of the data distribution. In addition, we present simulation results for these two examples demonstrating the excellent finite sample coverage of bootstrap-based confidence intervals.
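The nonparametric bootstrap referred to above resamples the observed data with replacement and recomputes the estimator on each resample. The sketch below shows only that generic mechanism, using a percentile interval for a sample mean; it does not implement the HAL-TMLE or the plateau-based selection of the variation norm, and the function name `bootstrap_percentile_ci`, the simulated sample, and the percentile construction are assumptions made for illustration.

```python
import random

def bootstrap_percentile_ci(data, statistic, n_boot=2000, alpha=0.05, seed=0):
    """Percentile confidence interval from nonparametric bootstrap replicates."""
    rng = random.Random(seed)
    n = len(data)
    # Each replicate: draw n observations with replacement, recompute the statistic.
    reps = sorted(
        statistic([data[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_boot)
    )
    lo = reps[int((alpha / 2) * n_boot)]
    hi = reps[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return lo, hi

def mean(values):
    return sum(values) / len(values)

# Hypothetical sample; any estimator taking a list of observations could replace `mean`.
rng = random.Random(42)
sample = [rng.gauss(10.0, 2.0) for _ in range(200)]
lo, hi = bootstrap_percentile_ci(sample, mean)
print(f"95% bootstrap interval for the mean: ({lo:.3f}, {hi:.3f})")
```

Repeating such a computation over a grid of tuning values and comparing interval widths is one way to picture the plateau search described in the abstract, although the details there are specific to the authors' estimator.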

Special issue on methodologies of training data processing professionals and advanced end-users

Education and Computing ◽ 10.1016/s0167-9287(05)80039-1 ◽ 1990 ◽ Vol 6 (1-2) ◽ pp. 1-2

Author(s): Ben Zion Barta ◽ Lauri Fontell ◽ Patrick Raymont

Keyword(s): Data Processing ◽ End Users ◽ Training Data ◽ Special Issue

Flight crew diagnostic using aviation simulator training data

Experimental Psychology (Russia) ◽ 10.17759/exppsy.2016090310 ◽ 2016 ◽ Vol 9 (3) ◽ pp. 118-137

Author(s): L.S. Kuravsky ◽ P.A. Marmalyuk ◽ G.A. Yuryev ◽ O.B. Belyaeva ◽ O.Yu. Prokopieva

Keyword(s): Continuous Time ◽ Goodness Of Fit ◽ Training Data ◽ Simulator Training ◽ Model Parameters ◽ Flight Crew ◽ Advantages And Disadvantages ◽ Goodness Of Fit Tests ◽ Flight Simulators ◽ Discrete States

This paper describes a new concept of flight crew assessment based on flight simulator training results. It represents pilot gaze movement using continuous-time Markov processes with discrete states. The paper considers both the procedure for identifying model parameters, supported by goodness-of-fit tests, and the classifier-building technique, which makes it possible to estimate the degree of correspondence between the observed gaze motion distribution and reference distributions identified for different diagnosed groups. The final assessment criterion is formed from integrated diagnostic parameters, which are determined by the parameters of the identified models. The article provides a description of the experiment, illustrations, and results of studies aimed at assessing the reliability of the developed models and criteria, as well as conclusions about the applicability of the approach, its advantages and disadvantages.

Special Issue on “Augmented Reality, Virtual Reality & Semantic 3D Reconstruction”

Applied Sciences ◽ 10.3390/app11188590 ◽ 2021 ◽ Vol 11 (18) ◽ pp. 8590

Author(s): Zhihan Lv ◽ Jing-Yan Wang ◽ Neeraj Kumar ◽ Jaime Lloret

Keyword(s): Virtual Reality ◽ Augmented Reality ◽ 3D Reconstruction ◽ Paradigm Shift ◽ Special Issue ◽ Key Technology ◽ Viable Solution ◽ The Way ◽ Critical Needs

Augmented Reality is a key technology that will facilitate a major paradigm shift in the way users interact with data and has only just recently been recognized as a viable solution for solving many critical needs [...]

PREDICTION AND ANALYSIS OF GEOMECHANICAL PROPERTIES OF JIMUSAER SHALE USING A MACHINE LEARNING APPROACH

10.30632/spwla-2021-0089 ◽ 2021

Author(s): Lianteng Song ◽ Zhonghua Liu ◽ Chaoliu Li ◽ Congqian Ning ◽ ...

Keyword(s): Machine Learning ◽ Cross Validation ◽ Gamma Ray ◽ Short Term Memory ◽ Machine Learning Algorithms ◽ Training Data ◽ Sequential Data ◽ Log Data ◽ Geomechanical Properties ◽ Single Well

Geomechanical properties are essential for safe drilling, successful completion, and exploration of both conventional and unconventional reservoirs, e.g. deep shale gas and shale oil. Typically, these properties can be calculated from sonic logs. However, in shale reservoirs, it is time-consuming and challenging to obtain reliable logging data due to borehole complexity and lack of information, which often results in log deficiency and a high recovery cost for incomplete datasets. In this work, we propose using the bidirectional long short-term memory (BiLSTM) network, a supervised neural network algorithm widely used for prediction from sequential data, to estimate geomechanical parameters. Prediction from log data can be conducted in two ways: 1) single-well prediction, in which the log data from a single well are divided into training data and testing data for cross-validation; and 2) cross-well prediction, in which a group of wells from the same geographical region is divided into a training set and a testing set, also for cross-validation. The logs used in this work were collected from 11 wells in the Jimusaer Shale and include gamma ray, bulk density, resistivity, etc. We employed five different machine learning algorithms for comparison, among which BiLSTM showed the best performance, with an R-squared of more than 90% and an RMSE of less than 10. The predicted results can be used directly to calculate geomechanical properties, whose accuracy is also improved compared with conventional methods.
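The two data-splitting schemes described above differ in what is held out: part of one well, or entire wells. The sketch below illustrates that distinction only. The function names, the contiguous depth-ordered holdout for the single-well case, the number of test wells, and the placeholder samples are all assumptions for illustration; the abstract does not specify how the authors performed their splits.

```python
import random

def single_well_split(samples, test_fraction=0.2):
    """Split one well's depth-ordered samples; the deepest part becomes the test set (assumed)."""
    n_test = round(len(samples) * test_fraction)
    cut = len(samples) - n_test
    return samples[:cut], samples[cut:]

def cross_well_split(samples_by_well, n_test_wells=2, seed=0):
    """Hold out whole wells so that no well contributes to both training and testing."""
    wells = sorted(samples_by_well)
    random.Random(seed).shuffle(wells)
    test_wells = set(wells[:n_test_wells])
    train = [(w, s) for w in wells if w not in test_wells for s in samples_by_well[w]]
    test = [(w, s) for w in wells if w in test_wells for s in samples_by_well[w]]
    return train, test

# Hypothetical placeholder data: 11 wells, each with five (depth, log_value) samples.
wells = {f"well_{i:02d}": [(depth, 0.0) for depth in range(5)] for i in range(1, 12)}

train, test = cross_well_split(wells, n_test_wells=2)
shallow, deep = single_well_split(wells["well_01"], test_fraction=0.2)
print(len(train), len(test), len(shallow), len(deep))
```

Holding out whole wells tests whether a model generalizes to an unseen borehole, which is a harder and usually more realistic target than predicting a withheld interval of a well already seen in training.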
