A Bayesian Nonparametric Learning Approach to Ensemble Models Using the Proper Bayesian Bootstrap

Algorithms ◽  
2021 ◽  
Vol 14 (1) ◽  
pp. 11
Author(s):  
Marta Galvani ◽  
Chiara Bardelli ◽  
Silvia Figini ◽  
Pietro Muliere

Bootstrap resampling techniques, introduced by Efron and Rubin, can be presented in a general Bayesian framework, approximating the distribution of a statistical functional ϕ(F), where F is a random distribution function. Efron’s and Rubin’s bootstrap procedures can be extended by introducing an informative prior through the Proper Bayesian bootstrap. In this paper, different bootstrap techniques are used and compared in predictive classification and regression models based on ensemble approaches, i.e., bagging models involving decision trees. The Proper Bayesian bootstrap, proposed by Muliere and Secchi, is used to sample the posterior distribution over trees, introducing prior distributions on the covariates and the target variable. The results are compared with those of other competitive procedures employing different bootstrap techniques. The empirical analysis reports results obtained on simulated and real data.
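
The mechanics of the Proper Bayesian bootstrap are easy to sketch. Below is a minimal illustration in Python, assuming the prior guess F0 is available as a sampler over (X, y) pairs; the names `proper_bayesian_bagging`, `prior_sampler`, `k` (prior mass) and `m` (number of prior draws) are illustrative, not the authors' code.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)

def proper_bayesian_bagging(X, y, prior_sampler, n_trees=100, k=10.0, m=50):
    """Bag trees on posterior draws of a Dirichlet process (sketch).

    Each draw pools m synthetic points from the prior guess F0 (total
    prior mass k, split evenly) with the n observed points (unit mass
    each), then resamples according to Dirichlet weights.
    """
    n = len(y)
    trees = []
    for _ in range(n_trees):
        Xp, yp = prior_sampler(m)                 # synthetic draws from F0
        X_all = np.vstack([X, Xp])
        y_all = np.concatenate([y, yp])
        alpha = np.concatenate([np.ones(n), np.full(m, k / m)])
        w = rng.dirichlet(alpha)                  # posterior DP weights
        idx = rng.choice(n + m, size=n + m, p=w)  # weighted resample
        trees.append(DecisionTreeRegressor().fit(X_all[idx], y_all[idx]))
    return trees

def predict(trees, X):
    return np.mean([t.predict(X) for t in trees], axis=0)
```

Taking k → 0 removes the influence of the prior and recovers Rubin's Bayesian bootstrap on the data alone.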


Sensors ◽  
2021 ◽  
Vol 21 (6) ◽  
pp. 1962
Author(s):  
Enrico Buratto ◽  
Adriano Simonetto ◽  
Gianluca Agresti ◽  
Henrik Schäfer ◽  
Pietro Zanuttigh

In this work, we propose a novel approach for correcting multi-path interference (MPI) in Time-of-Flight (ToF) cameras by estimating the direct and global components of the incoming light. MPI is an error source linked to multiple reflections of light inside a scene; each sensor pixel receives information coming from different light paths, which generally leads to an overestimation of the depth. We introduce a novel deep learning approach that estimates the structure of the time-dependent scene impulse response and from it recovers a depth image with a reduced amount of MPI. The model consists of two main blocks: a predictive model that learns a compact encoded representation of the backscattering vector from the noisy input data, and a fixed backscattering model that translates the encoded representation into the high-dimensional light response. Experimental results on real data show the effectiveness of the proposed approach, which achieves state-of-the-art performance.
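
A minimal PyTorch sketch of the two-block design described above; the channel count (e.g. multi-frequency amplitude/phase inputs), code dimension and the frozen random basis standing in for the physics-based backscattering model are assumptions for illustration, not the authors' network.

```python
import torch
import torch.nn as nn

class MPIDenoiser(nn.Module):
    """Learned encoder -> compact code; fixed decoder -> full light response."""

    def __init__(self, in_ch=9, code_dim=6, n_bins=2000):
        super().__init__()
        self.encoder = nn.Sequential(             # learned predictive model
            nn.Conv2d(in_ch, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, code_dim, 3, padding=1),
        )
        # Fixed (non-trainable) dictionary expanding the code into a
        # time-domain backscattering vector; a random basis stands in
        # for the physics-based model here.
        self.register_buffer("basis", torch.randn(code_dim, n_bins))

    def forward(self, x):
        code = self.encoder(x)                    # (B, code_dim, H, W)
        # Per-pixel time response; a depth map with reduced MPI can be
        # read off its direct (earliest) peak downstream.
        return torch.einsum("bchw,cn->bnhw", code, self.basis)
```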


Author(s):  
Moritz Berger ◽  
Gerhard Tutz

A flexible semiparametric class of models is introduced that offers an alternative to classical regression models for count data, such as the Poisson and negative binomial models, as well as to more general models that account for excess zeros and are likewise based on fixed distributional assumptions. The model lets the data themselves determine the distribution of the response variable but, in its basic form, uses a parametric term that specifies the effect of explanatory variables. In addition, an extended version is considered in which the effects of covariates are specified nonparametrically. The proposed model and traditional models are compared in simulations and in several real data applications from the area of health and social science.
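
The proposed semiparametric model is not reproduced here, but the parametric baselines it is compared against are standard; a hedged sketch with simulated overdispersed counts, using statsmodels:

```python
import numpy as np
import statsmodels.api as sm

# Simulated overdispersed counts with one covariate.
rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
mu = np.exp(0.3 + 0.5 * x)
y = rng.negative_binomial(n=2, p=2 / (2 + mu))   # mean mu, overdispersed

X = sm.add_constant(x)
poisson_fit = sm.GLM(y, X, family=sm.families.Poisson()).fit()
negbin_fit = sm.GLM(y, X, family=sm.families.NegativeBinomial(alpha=0.5)).fit()

# The negative binomial should fit the overdispersion better; the paper's
# model instead lets the data determine the response distribution.
print(poisson_fit.aic, negbin_fit.aic)
```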


2021 ◽  
Author(s):  
Rosa F Ropero ◽  
M Julia Flores ◽  
Rafael Rumí

Environmental data often present missing values or a lack of information that makes modelling tasks difficult. Under the framework of the SAICMA Research Project, a flood risk management system is modelled for an Andalusian Mediterranean catchment using information from the Andalusian Hydrological System. Hourly data were collected from October 2011 to September 2020 and present two issues:

- In the Guadarranque River, the dam level variable has no data from May to August 2020, probably because of sensor damage.
- No information about river level is collected in the lower part of the Guadiaro River, which makes it difficult to estimate flood risk in the coastal area.

To avoid removing the dam variable from the entire model (or removing those missing months), or even rejecting the modelling of one river system, this abstract provides modelling solutions based on Bayesian networks (BNs) that overcome this limitation.

Guadarranque River. Missing values.

The dataset contains 75,687 observations of 6 continuous variables. BN regression models based on fixed structures (Naïve Bayes, NB, and Tree Augmented Naïve Bayes, TAN) were learnt on the complete dataset (until September 2019) with the aim of predicting the dam level variable as accurately as possible. A scenario was run with data from October 2019 to March 2020, and the prediction for the target variable was compared with the real data. Both NB (RMSE: 6.29) and TAN (RMSE: 5.74) are able to predict the behaviour of the target variable.

In addition, a BN based on expert structural learning was learnt from the real data and from both datasets with values imputed by NB and TAN. Models learnt from imputed data (NB: 3.33; TAN: 3.07) improve on the error rate of the model learnt from the real data alone (4.26); see the sketch after this abstract.

Guadiaro River. Lack of information.

The dataset contains 73,636 observations of 14 continuous variables. Since the rainfall variables present a high percentage of zero values (over 94%), they were discretised by the equal-frequency method with 4 intervals. The aim is to predict flooding risk in the coastal area, but no data are collected from that area, so an unsupervised classification based on hybrid BNs was performed: the target variable assigns all observations to a set of homogeneous groups and gives, for each observation, the probability of belonging to each group. Three groups emerge:

- Group 0, "Normal situation": rainfall values equal to 0 and a very low mean river level.
- Group 1, "Storm situation": mean rainfall values over 0.3 mm, and all river level variables double their means with respect to Group 0.
- Group 2, "Extreme situation": both rainfall and river level means take the highest values, far from the two previous groups.

Although validation shows this methodology is able to identify extreme events, further work is needed. Data from the autumn-winter season (October 2020 to March 2021) will be used; by including this new information, it will be possible to check whether the latest extreme events (the flooding event during December and storm Filomena during January) are identified.
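
As a rough illustration of the fixed-structure BN regression used for the imputation above, here is a tiny Gaussian Naive-Bayes-style regressor written from scratch in Python; the quantile binning of the target and the bin count are assumptions for illustration, not the authors' implementation.

```python
import numpy as np

class NaiveBayesRegressor:
    """Discretise the target into quantile bins, model each feature as
    Gaussian within a bin (the NB independence assumption), and predict
    the posterior-weighted bin mean. Assumes enough observations per bin."""

    def fit(self, X, y, n_bins=20):
        edges = np.quantile(y, np.linspace(0, 1, n_bins + 1))
        bins = np.clip(np.digitize(y, edges[1:-1]), 0, n_bins - 1)
        self.prior = np.bincount(bins, minlength=n_bins) / len(y)
        self.mu = np.array([X[bins == b].mean(axis=0) for b in range(n_bins)])
        self.sd = np.array([X[bins == b].std(axis=0) + 1e-6 for b in range(n_bins)])
        self.y_mean = np.array([y[bins == b].mean() for b in range(n_bins)])
        return self

    def predict(self, X):
        # log p(bin | x) ∝ log p(bin) + Σ_j log N(x_j; mu_bj, sd_bj)
        ll = (-0.5 * (((X[:, None, :] - self.mu) / self.sd) ** 2).sum(-1)
              - np.log(self.sd).sum(-1) + np.log(self.prior + 1e-12))
        post = np.exp(ll - ll.max(axis=1, keepdims=True))
        post /= post.sum(axis=1, keepdims=True)
        return post @ self.y_mean   # e.g. imputed dam level values
```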


2016 ◽  
Vol 41 (2) ◽  
Author(s):  
Omar Eidous ◽  
M.K. Shakhatreh

A double kernel method is proposed as an alternative to the classical kernel method for estimating population abundance from line transect sampling. The proposed method applies the kernel estimator twice, producing a kernel-type estimator that improves on the performance of the classical kernel estimator. The feasibility of using bootstrap techniques to estimate the bias and variance of the proposed estimator is also addressed. Some numerical examples based on simulated and real data are presented. The results show that the proposed estimator outperforms the existing classical kernel estimator in most of the cases considered.
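
The double kernel refinement itself is not reproduced here, but the classical single-kernel estimator it improves on, and the bootstrap step mentioned in the abstract, can be sketched in a few lines; `x` holds perpendicular detection distances and `L` the total transect length, a standard line transect setup assumed for illustration.

```python
import numpy as np

def fhat0(x, h):
    """Classical Gaussian-kernel estimate of the detection density at
    distance zero, with reflection at the boundary."""
    u = x / h
    return 2.0 * np.mean(np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)) / h

def abundance(x, L, h):
    """Line transect abundance (density) estimate D = n * f(0) / (2L)."""
    return len(x) * fhat0(x, h) / (2.0 * L)

def bootstrap_se(x, L, h, B=1000, seed=0):
    """Bootstrap standard error of the abundance estimator."""
    rng = np.random.default_rng(seed)
    reps = [abundance(rng.choice(x, size=len(x), replace=True), L, h)
            for _ in range(B)]
    return np.std(reps)
```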


Risks ◽  
2020 ◽  
Vol 8 (2) ◽  
pp. 33
Author(s):  
Łukasz Delong ◽  
Mario V. Wüthrich

The goal of this paper is to develop regression models and postulate distributions which can be used in practice to describe the joint development process of individual claim payments and claim incurred. We apply neural networks to estimate our regression models. As regressors, we use the whole claim history of incremental payments and claim incurred, as well as any relevant feature information which is available to describe individual claims and their development characteristics. Our models are calibrated and tested on a real data set, and the results are benchmarked with the Chain-Ladder method. Our analysis focuses on the development of the so-called Reported But Not Settled (RBNS) claims. We show the benefits of using a deep neural network and the whole claim history in our prediction problem.
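
A minimal PyTorch sketch of a feed-forward network of the kind described; the history length, feature dimension and layer sizes are illustrative assumptions, not the paper's calibrated architecture.

```python
import torch
import torch.nn as nn

class RBNSNet(nn.Module):
    """Predict the next-period incremental payment of an open claim from
    its observed history (payments and claim incurred) plus static
    claim features."""

    def __init__(self, history_dim=24, feature_dim=10):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(history_dim + feature_dim, 64), nn.Tanh(),
            nn.Linear(64, 32), nn.Tanh(),
            nn.Linear(32, 1),
        )

    def forward(self, history, features):
        return self.net(torch.cat([history, features], dim=-1))

# Training would minimise the MSE of predicted vs observed incremental
# payments; aggregate RBNS reserves can then be benchmarked against
# Chain-Ladder, as in the paper.
```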


2019 ◽  
Vol 11 (01n02) ◽  
pp. 1950003
Author(s):  
Fábio Prataviera ◽  
Gauss M. Cordeiro ◽  
Edwin M. M. Ortega ◽  
Adriano K. Suzuki

In several applications, the distribution of the data is frequently unimodal, asymmetric or bimodal. The regression models commonly used for data with real support are the normal, skew normal, beta normal and gamma normal, among others. We define a new regression model based on the odd log-logistic geometric normal distribution for modeling asymmetric or bimodal data with support in ℝ, which generalizes some known regression models, including the widely used heteroscedastic linear regression. We adopt the maximum likelihood method for estimating the model parameters and define diagnostic measures to detect influential observations. For some parameter settings, sample sizes and different systematic structures, various simulations are performed to verify the adequacy of the estimators of the model parameters. The empirical distribution of the quantile residuals is investigated and compared with the standard normal distribution. We demonstrate empirically the usefulness of the proposed models by means of three applications to real data.
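
The OLLG-normal density is not reproduced here; instead, a hedged sketch of fitting the heteroscedastic linear regression that the model generalizes, using the same maximum likelihood recipe (numerical optimisation of the negative log-likelihood):

```python
import numpy as np
from scipy.optimize import minimize

def negloglik(theta, X, Z, y):
    """Normal model: mean X @ beta, log-scale Z @ gamma. The OLLG-normal
    model would swap in its own density; the fitting recipe is the same."""
    p = X.shape[1]
    beta, gamma = theta[:p], theta[p:]
    mu, sigma = X @ beta, np.exp(Z @ gamma)
    return 0.5 * np.sum(((y - mu) / sigma) ** 2) + np.sum(np.log(sigma))

rng = np.random.default_rng(2)
n = 300
X = np.column_stack([np.ones(n), rng.normal(size=n)])   # mean covariates
Z = X.copy()                                            # dispersion covariates
y = X @ np.array([1.0, 2.0]) + np.exp(Z @ np.array([-0.5, 0.3])) * rng.normal(size=n)

fit = minimize(negloglik, np.zeros(4), args=(X, Z, y), method="BFGS")
print(fit.x)   # estimates of (beta0, beta1, gamma0, gamma1)
```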


2019 ◽  
Vol 31 (8) ◽  
pp. 1718-1750
Author(s):  
Kota Matsui ◽  
Wataru Kumagai ◽  
Kenta Kanamori ◽  
Mitsuaki Nishikimi ◽  
Takafumi Kanamori

In this letter, we propose a variable selection method for general nonparametric kernel-based estimation. The proposed method consists of two-stage estimation: (1) construct a consistent estimator of the target function, and (2) approximate the estimator using a few variables by ℓ1-type penalized estimation. The proposed method can be applied to various kinds of kernel nonparametric estimation, such as kernel ridge regression, kernel-based density estimation and density-ratio estimation. We prove that the proposed method has the property of variable selection consistency when the power series kernel is used; the power series kernel is a class of kernels containing the polynomial and exponential kernels. This result can be regarded as an extension of the variable selection consistency of the nonnegative garrote (NNG), a special case of the adaptive Lasso, to kernel-based estimators. Several experiments, including simulation studies and real data applications, show the effectiveness of the proposed method.
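
A simplified scikit-learn sketch of the two-stage idea; stage 2 below uses a plain linear Lasso as a stand-in for the paper's penalised kernel expansion, so this is an approximation of the scheme rather than a faithful implementation.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import Lasso

rng = np.random.default_rng(3)
n, d = 200, 10
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.normal(size=n)  # only x0, x1 matter

# Stage 1: consistent nonparametric estimate of the target function
# (polynomial kernel, one of the power series kernels the paper covers).
f_hat = KernelRidge(kernel="poly", degree=3, alpha=1.0).fit(X, y).predict(X)

# Stage 2: approximate the stage-1 estimate with an l1-penalised model;
# nonzero coefficients flag the selected variables.
selector = Lasso(alpha=0.05).fit(X, f_hat)
print(np.nonzero(selector.coef_)[0])   # should flag variables 0 and 1
```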


2017 ◽  
Vol 28 (4) ◽  
pp. 1105-1125 ◽  
Author(s):  
Chuen Seng Tan ◽  
Nathalie C Støer ◽  
Ying Chen ◽  
Marielle Andersson ◽  
Yilin Ning ◽  
...  

The control of confounding is an area of extensive epidemiological research, especially in the field of causal inference for observational studies. Matched cohort and case-control study designs are commonly implemented to control for confounding effects without specifying the functional form of the relationship between the outcome and confounders. This paper extends the commonly used regression models in matched designs for binary and survival outcomes (i.e. conditional logistic and stratified Cox proportional hazards) to studies of continuous outcomes through a novel interpretation and application of logit-based regression models from the econometrics and marketing research literature. We compare the performance of the maximum likelihood estimators using simulated data and propose a heuristic argument for obtaining the residuals for model diagnostics. We illustrate our proposed approach with two real data applications. Our simulation studies demonstrate that our stratification approach is robust to model misspecification and that the distribution of the estimated residuals provides a useful diagnostic when the strata are of moderate size. In our applications to real data, we demonstrate that parity and menopausal status are associated with percent mammographic density, and that the mean level and variability of inpatient blood glucose readings vary between medical and surgical wards within a national tertiary hospital. Our work highlights how the same class of regression models, available in most statistical software, can be used to adjust for confounding in the study of binary, time-to-event and continuous outcomes.
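
The paper's logit-based estimator is not sketched here; the snippet below instead illustrates the underlying idea, removing stratum-level confounding without modelling it, via a standard within-stratum (fixed-effects) baseline on simulated matched data. This is a point of comparison, not the authors' method.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated matched sets: outcome y, exposure x, matched-set id 'stratum'.
rng = np.random.default_rng(4)
df = pd.DataFrame({"stratum": np.repeat(np.arange(100), 3),
                   "x": rng.normal(size=300)})
confounding = rng.normal(size=100)[df["stratum"]]   # shared within a set
df["y"] = 1.5 * df["x"] + confounding + rng.normal(size=300)

# Within-stratum demeaning removes the matched-set effects, mirroring
# how conditional logistic / stratified Cox condition them out.
demeaned = df.groupby("stratum")[["x", "y"]].transform(lambda s: s - s.mean())
fit = smf.ols("y ~ x - 1", data=demeaned).fit()
print(fit.params)   # close to the true effect 1.5
```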


2016 ◽  
Vol 857 ◽  
pp. 195-199 ◽  
Author(s):  
Nivea Thomas ◽  
Anu V. Thomas

Construction investments are sensitive to time and cost overruns; delay and cost escalation are two threats to project success. The objective is to develop a model that predicts project cost and duration from historical data on similar projects. Statistical regression models are developed using real data from building projects. The methodology comprises three steps: (a) data collection, (b) statistical analysis using the Statistical Package for the Social Sciences (SPSS) software developed by IBM Corporation, and (c) interpretation of results. Real cost and duration data for 51 building projects were collected from Noel Builders, Kakkanad, Ernakulam. Regression analysis estimates the relationships among variables, with the focus on the relationship between a dependent variable and one or more independent variables; here it is used to predict project cost and duration. The developed models are validated using a split-sample approach. The model outputs can be used by project managers in the planning phase to validate the scheduled critical path time and project budget.
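
A hedged sketch of the develop-and-validate workflow; the study used SPSS, scikit-learn is used here instead, and the data below are simulated stand-ins since the 51 projects are not public.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.model_selection import train_test_split

# Illustrative predictors (built-up area, floors) against final cost.
rng = np.random.default_rng(5)
area = rng.uniform(500, 5000, size=51)
floors = rng.integers(1, 12, size=51)
cost = 12_000 * area + 150_000 * floors + rng.normal(0, 2e6, size=51)

X = np.column_stack([area, floors])
X_tr, X_te, y_tr, y_te = train_test_split(X, cost, test_size=0.25, random_state=0)

model = LinearRegression().fit(X_tr, y_tr)   # develop on one split
print(mean_absolute_percentage_error(y_te, model.predict(X_te)))  # validate on the other
```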

