Data Integration by Combining Big Data and Survey Sample Data for Finite Population Inference

Author(s): Jae-Kwang Kim, Siu-Ming Tam
2020, Vol 8 (1), pp. 148-180

Author(s): Ali Rafei, Carol A. C. Flannagan, Michael R. Elliott

Abstract Big Data are a “big challenge” for finite population inference. Researchers' lack of control over the data-generating process, in the absence of a known random selection mechanism, may lead to biased estimates. Further, larger sample sizes increase the relative contribution of selection bias to squared or absolute error. One approach to mitigating this issue is to treat the Big Data as a random sample and estimate pseudo-inclusion probabilities through a benchmark survey that shares a set of relevant auxiliary variables with the Big Data. Since the true propensity model is usually unknown, and Big Data sources tend to lack the variables that fully govern the selection mechanism, the use of flexible non-parametric models seems essential. Traditionally, a weighted logistic model is recommended to account for the sampling weights of the benchmark survey when estimating the propensity scores. However, handling weights is a hurdle when seeking a broader range of predictive methods. To further protect against model misspecification, we propose an alternative pseudo-weighting approach that allows us to fit more flexible modern predictive tools such as Bayesian Additive Regression Trees (BART), which automatically detect non-linear associations as well as high-order interactions. In addition, the posterior predictive distribution generated by BART makes it easier to quantify the uncertainty due to pseudo-weighting. Our simulation findings reveal a further reduction in bias with our approach, compared with the conventional propensity-adjustment method, when the true model is unknown. Finally, we apply our method to the naturalistic driving data from the Safety Pilot Model Deployment, using the National Household Travel Survey as a benchmark.
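As a rough sketch of the pseudo-weighting step described above: BART itself is usually fit with R packages such as dbarts or BART, so the Python sketch below substitutes scikit-learn's gradient boosting as the flexible propensity learner. The data, variable names, and the odds-based weight formula are illustrative assumptions, not the authors' exact estimator.

# Sketch of the pseudo-weighting idea: estimate pseudo-inclusion
# propensities for a non-probability Big Data sample by contrasting it
# with a weighted benchmark survey, using a flexible tree ensemble as a
# stand-in for BART. All data and variable names are hypothetical.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Hypothetical auxiliary variables common to both sources.
n_big, n_svy = 5000, 1000
X_big = rng.normal(size=(n_big, 3))          # Big Data sample
X_svy = rng.normal(size=(n_svy, 3))          # benchmark survey sample
w_svy = rng.uniform(50, 150, size=n_svy)     # survey design weights

# Stack the two samples; Z = 1 marks Big Data membership.
X = np.vstack([X_big, X_svy])
Z = np.concatenate([np.ones(n_big), np.zeros(n_svy)])

# Weight survey units by their design weights so the survey side
# stands in for the finite population; Big Data units get weight 1.
fit_w = np.concatenate([np.ones(n_big), w_svy])

clf = GradientBoostingClassifier().fit(X, Z, sample_weight=fit_w)
p = clf.predict_proba(X_big)[:, 1]           # rough P(in Big Data | x)

# Pseudo-inclusion odds -> pseudo-weights for the Big Data units.
pseudo_w = (1.0 - p) / np.clip(p, 1e-6, None)

# A pseudo-weighted estimate of a population mean of some outcome y.
y_big = X_big[:, 0] + rng.normal(size=n_big)  # hypothetical outcome
y_hat = np.sum(pseudo_w * y_big) / np.sum(pseudo_w)
print("pseudo-weighted mean:", y_hat)

Because the survey units carry their design weights during fitting, the estimated membership probability can be read, roughly, as the odds of a population unit appearing in the Big Data, which is what the (1 - p)/p pseudo-weight inverts.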


Author(s): Ângela Alpoim, Tiago Guimarães, Filipe Portela, Manuel Filipe Santos

2021, Vol 2068 (1), pp. 012025
Author(s): Jian Zheng, Zhaoni Li, Jiang Li, Hongling Liu

Abstract It is difficult to detect anomalies in big data with traditional methods because big data are massive and disordered. Common methods divide the big data into several small samples and then analyze these samples; however, this increases the complexity of the segmentation algorithm, and it is difficult to control the risk introduced by the segmentation. To address this, this paper proposes a neural network approach based on the Vapnik risk model. First, the sample data are randomly divided into small data blocks. Then, a neural network learns these small data blocks. To reduce the risks arising during data segmentation, the Vapnik risk model is used to supervise the segmentation. Finally, the proposed method is validated on historical electricity price data from Mountain View, California. The results show that our method is effective.
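As a rough illustration of the block-wise scheme described above, the sketch below randomly splits a toy series into blocks, trains a small scikit-learn MLP on each block, and scores candidate block sizes by empirical risk plus a Vapnik-style confidence term. The block sizes, network architecture, and capacity proxy are illustrative assumptions rather than the paper's exact model.

# Sketch of block-wise learning guided by a Vapnik-style risk bound.
# The block sizes, the MLP architecture, and the crude capacity proxy
# below are illustrative assumptions, not the paper's exact model.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=2000)  # toy "price" series

def vapnik_bound(emp_risk, n, h, eta=0.05):
    # Vapnik's structural-risk-style confidence term added to the
    # empirical risk; h plays the role of model capacity (VC dimension).
    conf = np.sqrt((h * (np.log(2 * n / h) + 1) + np.log(4 / eta)) / n)
    return emp_risk + conf

best = None
for block_size in (100, 250, 500):
    idx = rng.permutation(len(X))
    blocks = np.array_split(idx, len(X) // block_size)
    risks = []
    for b in blocks:
        net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                           random_state=0).fit(X[b], y[b])
        risks.append(mean_squared_error(y[b], net.predict(X[b])))
    # Capacity proxy: number of network weights (a common rough stand-in).
    h = 16 * (X.shape[1] + 1) + 16 + 1
    bound = vapnik_bound(np.mean(risks), block_size, h)
    if best is None or bound < best[1]:
        best = (block_size, bound)

print("block size chosen by the risk bound:", best[0])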


Author(s):  
M. Asif Naeem ◽  
Gillian Dobbie ◽  
Gerald Weber

In order to make timely and effective decisions, businesses need the latest information from big data warehouse repositories. To keep these repositories up to date, real-time data integration is required. An important phase in real-time data integration is data transformation, where a stream of updates, which is huge in volume and infinite, is joined with large disk-based master data. Stream processing is an important concept in Big Data, since large volumes of data are often best processed immediately. A well-known algorithm called Mesh Join (MESHJOIN) was proposed to join stream data with disk-based master data using limited memory. MESHJOIN is a candidate for a resource-aware system setup. The problem the authors consider in this chapter is that MESHJOIN is not very selective: the performance of the algorithm is always inversely proportional to the size of the master data table, so resource consumption is suboptimal in some scenarios. They present an algorithm called Cache Join (CACHEJOIN), which performs asymptotically at least as well as MESHJOIN but performs better in realistic scenarios, particularly if parts of the master data are used with different frequencies. To quantify the performance differences, the authors compare both algorithms on a synthetic dataset with a known skewed distribution as well as on TPC-H and real-life datasets.
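To make the contrast concrete, here is a toy single-process Python sketch of the CACHEJOIN idea: hot master-data keys are served from an in-memory LRU cache, while misses stall until a MESHJOIN-style pass over one disk partition picks them up. The data, sizes, and scheduling are illustrative assumptions; the published algorithm's queue and buffer management is considerably more involved.

# Toy sketch of the CACHEJOIN idea: serve frequently hit master-data
# keys from an in-memory cache and fall back to MESHJOIN-style passes
# over disk partitions for the rest. All sizes are hypothetical.
from collections import OrderedDict, defaultdict
import random

MASTER = {k: "row-%d" % k for k in range(10_000)}    # stands in for disk
PARTITIONS = [range(i, i + 1_000) for i in range(0, 10_000, 1_000)]
CACHE_SIZE = 100

cache = OrderedDict()        # LRU cache of hot master-data rows
stalled = defaultdict(list)  # stream tuples waiting on a disk pass

def process(t):
    # Cache phase: join immediately on a hit, otherwise stall the tuple.
    key = t["key"]
    if key in cache:
        cache.move_to_end(key)
        return {**t, "master": cache[key]}
    stalled[key].append(t)
    return None

def disk_phase(partition):
    # MESHJOIN-style pass over one master-data partition: join every
    # stalled tuple whose key falls in it and promote those rows to cache.
    out = []
    for key in list(stalled):
        if key in partition:
            row = MASTER[key]                    # stands in for a disk read
            out.extend({**t, "master": row} for t in stalled.pop(key))
            cache[key] = row
            if len(cache) > CACHE_SIZE:
                cache.popitem(last=False)        # evict the coldest row
    return out

# Skewed stream: three hot keys dominate, which is where the cache pays off.
random.seed(0)
keys = [1, 2, 3] * 300 + list(range(10_000))
stream = [{"key": random.choice(keys)} for _ in range(2_000)]

joined, part_no = [], 0
for i, t in enumerate(stream):
    if (r := process(t)) is not None:
        joined.append(r)
    if i % 100 == 99:                            # interleave a disk pass
        joined.extend(disk_phase(PARTITIONS[part_no]))
        part_no = (part_no + 1) % len(PARTITIONS)
for part in PARTITIONS:                          # drain remaining stalls
    joined.extend(disk_phase(part))
print(len(joined), "tuples joined;", len(cache), "rows cached")

On a skewed stream like this one, most tuples hit the cache after the first few disk passes, which is exactly the regime where CACHEJOIN is reported to outperform MESHJOIN.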

