Big Data for Finite Population Inference: Applying Quasi-Random Approaches to Naturalistic Driving Data Using Bayesian Additive Regression Trees

Abstract: Big Data are a “big challenge” for finite population inference. When researchers lack control over the data-generating process and no random selection mechanism is known, estimates may be biased. Further, larger sample sizes increase the relative contribution of selection bias to squared or absolute error. One approach to mitigate this issue is to treat Big Data as a random sample and estimate the pseudo-inclusion probabilities through a benchmark survey with a set of relevant auxiliary variables common to the Big Data. Since the true propensity model is usually unknown, and Big Data tend to be poor in the variables that fully govern the selection mechanism, the use of flexible non-parametric models seems to be essential. Traditionally, a weighted logistic model is recommended to account for the sampling weights in the benchmark survey when estimating the propensity scores. However, handling weights is a hurdle when seeking a broader range of predictive methods. To further protect against model misspecification, we propose an alternative pseudo-weighting approach that allows us to fit more flexible modern predictive tools such as Bayesian Additive Regression Trees (BART), which automatically detect non-linear associations as well as high-order interactions. In addition, the posterior predictive distribution generated by BART makes it easier to quantify the uncertainty due to pseudo-weighting. Our simulation findings reveal a further reduction in bias with our approach compared with the conventional propensity adjustment method when the true model is unknown. Finally, we apply our method to the naturalistic driving data from the Safety Pilot Model Deployment, using the National Household Travel Survey as a benchmark.
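To make the conventional baseline mentioned in the abstract concrete, the following is a minimal sketch of propensity pseudo-weighting with a weighted logistic model. It is not the authors' BART-based proposal. Everything in it is an assumption made for illustration: the simulated population, the selection model `pi_big`, the simple random benchmark survey, the helper names `fit_weighted_logistic` and `simulate`, and the choice of the inverse fitted odds as the pseudo-weight (one common formulation when the weighted benchmark survey stands in for the population).

```python
import numpy as np


def fit_weighted_logistic(X, z, w, n_iter=50, tol=1e-10):
    """Weighted logistic regression of z on X (intercept added), by Newton-Raphson."""
    X1 = np.column_stack([np.ones(len(X)), X])
    beta = np.zeros(X1.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(X1 @ beta)))
        grad = X1.T @ (w * (z - p))                       # weighted score
        hess = (X1 * (w * p * (1.0 - p))[:, None]).T @ X1  # weighted information
        step = np.linalg.solve(hess, grad)
        beta += step
        if np.max(np.abs(step)) < tol:
            break
    return beta


def simulate(seed=0, N=20000, n_ref=1000):
    """Return (population mean, naive Big Data mean, pseudo-weighted mean)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=N)               # auxiliary variable seen in both sources
    y = 2.0 + x + rng.normal(size=N)     # outcome, observed only in the Big Data

    # Hypothetical non-random selection into the Big Data (assumption).
    pi_big = np.minimum(1.0, np.exp(-2.5 + 0.8 * x))
    in_big = rng.random(N) < pi_big

    # Hypothetical benchmark survey: simple random sample with weights N / n_ref.
    ref_idx = rng.choice(N, size=n_ref, replace=False)
    ref_w = np.full(n_ref, N / n_ref)

    # Stack both samples: z = 1 for Big Data units, z = 0 for survey units.
    n_big = int(in_big.sum())
    X = np.concatenate([x[in_big], x[ref_idx]])[:, None]
    z = np.concatenate([np.ones(n_big), np.zeros(n_ref)])
    w = np.concatenate([np.ones(n_big), ref_w])
    beta = fit_weighted_logistic(X, z, w)

    # Pseudo-weight = inverse of the fitted odds of being in the Big Data.
    p_big = 1.0 / (1.0 + np.exp(-(beta[0] + beta[1] * x[in_big])))
    pseudo_w = (1.0 - p_big) / p_big

    y_big = y[in_big]
    weighted = np.sum(pseudo_w * y_big) / np.sum(pseudo_w)
    return y.mean(), y_big.mean(), weighted


if __name__ == "__main__":
    print(simulate())
```

In this toy setting the selection odds are log-linear in `x` by construction, so the logistic model is correctly specified; the abstract's argument is that such knowledge is rarely available in practice, which motivates more flexible learners.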


Incorporating external data into the analysis of clinical trials via Bayesian additive regression trees

Statistics in Medicine ◽

10.1002/sim.9191 ◽

2021 ◽

Author(s):

Tianjian Zhou ◽

Yuan Ji

Keyword(s):

Clinical Trials ◽

Regression Trees ◽

External Data ◽

Additive Regression ◽

Bayesian Additive Regression Trees


BART: Bayesian additive regression trees

The Annals of Applied Statistics ◽

10.1214/09-aoas285 ◽

2010 ◽

Vol 4 (1) ◽

pp. 266-298 ◽

Cited By ~ 440

Author(s):

Hugh A. Chipman ◽

Edward I. George ◽

Robert E. McCulloch

Keyword(s):

Regression Trees ◽

Additive Regression ◽

Bayesian Additive Regression Trees


Variable Selection and Interaction Detection with Bayesian Additive Regression Trees

10.1201/9781003089018-17 ◽

2021 ◽

pp. 395-414

Author(s):

Carlos M. Carvalho ◽

Edward I. George ◽

P. Richard Hahn ◽

Robert E. McCulloch

Keyword(s):

Variable Selection ◽

Regression Trees ◽

Interaction Detection ◽

Additive Regression ◽

Bayesian Additive Regression Trees


Estimation of causal effects of multiple treatments in observational studies with a binary outcome

Statistical Methods in Medical Research ◽

10.1177/0962280220921909 ◽

2020 ◽

Vol 29 (11) ◽

pp. 3218-3234 ◽

Cited By ~ 7

Author(s):

Liangyuan Hu ◽

Chenyang Gu ◽

Michael Lopez ◽

Jiayi Ji ◽

Juan Wisnivesky

Keyword(s):

Maximum Likelihood ◽

Maximum Likelihood Estimator ◽

Regression Trees ◽

Likelihood Estimator ◽

Inverse Probability ◽

Common Support ◽

Targeted Maximum Likelihood ◽

Additive Regression ◽

Multiple Treatments ◽

Bayesian Additive Regression Trees

There is a dearth of robust methods to estimate the causal effects of multiple treatments when the outcome is binary. This paper uses two unique sets of simulations to propose and evaluate the use of Bayesian additive regression trees in such settings. First, we compare Bayesian additive regression trees to several approaches that have been proposed for continuous outcomes, including inverse probability of treatment weighting, targeted maximum likelihood estimator, vector matching, and regression adjustment. Results suggest that under conditions of non-linearity and non-additivity of both the treatment assignment and outcome generating mechanisms, Bayesian additive regression trees, targeted maximum likelihood estimator, and inverse probability of treatment weighting using generalized boosted models provide better bias reduction and smaller root mean squared error. Bayesian additive regression trees and targeted maximum likelihood estimator provide more consistent 95% confidence interval coverage and better large-sample convergence properties. Second, we supply Bayesian additive regression trees with a strategy to identify a common support region for retaining inferential units and for avoiding extrapolation over areas of the covariate space where common support does not exist. Bayesian additive regression trees retain more inferential units than the generalized propensity score-based strategy and show lower bias than the targeted maximum likelihood estimator or generalized boosted models in a variety of scenarios differing by the degree of covariate overlap. A case study examining the effects of three surgical approaches for non-small cell lung cancer demonstrates the methods.
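As a rough illustration of one comparator named in this abstract, inverse probability of treatment weighting with more than two treatments and a binary outcome, the sketch below may help. It is not the paper's BART procedure, and it does not use generalized boosted models. The data-generating values, the single discrete confounder, the function names `iptw_means` and `simulate`, and the use of within-stratum treatment frequencies as the generalized propensity score are all assumptions chosen to keep the example short.

```python
import numpy as np


def iptw_means(x, a, y, n_treat):
    """Normalized IPTW estimate of E[Y(t)] for each treatment t.

    The generalized propensity score P(A = a | X = x) is estimated by the
    observed treatment frequency inside each level of the discrete covariate x.
    """
    gps = np.empty(len(a))
    for level in np.unique(x):
        in_stratum = x == level
        for t in range(n_treat):
            cell = in_stratum & (a == t)
            gps[cell] = cell.sum() / in_stratum.sum()
    w = 1.0 / gps
    return np.array(
        [np.sum(w[a == t] * y[a == t]) / np.sum(w[a == t]) for t in range(n_treat)]
    )


def simulate(seed=0, n=50000):
    """Return (true means, naive means, IPTW means) for three treatments."""
    rng = np.random.default_rng(seed)
    p_x = np.array([0.5, 0.3, 0.2])         # distribution of the confounder
    assign = np.array([[0.6, 0.3, 0.1],     # P(A = t | X = 0)
                       [0.3, 0.4, 0.3],     # P(A = t | X = 1)
                       [0.1, 0.3, 0.6]])    # P(A = t | X = 2)
    base = np.array([0.2, 0.4, 0.6])        # baseline risk by confounder level
    effect = np.array([0.0, 0.1, 0.2])      # additive risk shift by treatment

    x = rng.choice(3, size=n, p=p_x)
    # Draw each unit's treatment from its stratum-specific probabilities.
    cum = np.cumsum(assign[x], axis=1)
    a = np.minimum((rng.random(n)[:, None] > cum).sum(axis=1), 2)
    y = (rng.random(n) < base[x] + effect[a]).astype(float)

    truth = p_x @ base + effect             # E[Y(t)] averaged over x
    naive = np.array([y[a == t].mean() for t in range(3)])
    return truth, naive, iptw_means(x, a, y, 3)


if __name__ == "__main__":
    for name, values in zip(("truth", "naive", "iptw"), simulate()):
        print(name, np.round(values, 3))
```

With one discrete confounder, the weighted estimate coincides with direct standardization over the strata. The settings studied in the paper involve continuous, non-linear, and non-additive mechanisms, where the propensity model itself must be learned flexibly.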
