Data Integration by Combining Big Data and Survey Sample Data for Finite Population Inference

Author(s): Jae-Kwang Kim, Siu-Ming Tam
2020, Vol 8 (1), pp. 148-180

Author(s): Ali Rafei, Carol A. C. Flannagan, Michael R. Elliott

Abstract Big Data are a “big challenge” for finite population inference. Researchers' lack of control over the data-generating process, in the absence of a known random selection mechanism, may lead to biased estimates. Further, larger sample sizes increase the relative contribution of selection bias to squared or absolute error. One approach to mitigating this issue is to treat the Big Data as a random sample and estimate pseudo-inclusion probabilities through a benchmark survey that shares a set of relevant auxiliary variables with the Big Data. Since the true propensity model is usually unknown, and Big Data sources tend to lack the variables that fully govern the selection mechanism, the use of flexible non-parametric models seems essential. Traditionally, a weighted logistic model is recommended to account for the sampling weights of the benchmark survey when estimating the propensity scores. However, handling weights is a hurdle when seeking a broader range of predictive methods. To further protect against model misspecification, we propose an alternative pseudo-weighting approach that allows us to fit more flexible modern predictive tools such as Bayesian Additive Regression Trees (BART), which automatically detect non-linear associations as well as high-order interactions. In addition, the posterior predictive distribution generated by BART makes it easier to quantify the uncertainty due to pseudo-weighting. Our simulation findings reveal a further reduction in bias with our approach, compared with the conventional propensity-adjustment method, when the true model is unknown. Finally, we apply our method to the naturalistic driving data from the Safety Pilot Model Deployment, using the National Household Travel Survey as a benchmark.
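As a rough sketch of the pseudo-weighting step described above: BART itself is usually fit with R packages such as dbarts or BART, so the Python sketch below substitutes scikit-learn's gradient boosting as the flexible propensity learner. The data, variable names, and the odds-based weight formula are illustrative assumptions, not the authors' exact estimator.

# Sketch of the pseudo-weighting idea: estimate pseudo-inclusion
# propensities for a non-probability Big Data sample by contrasting it
# with a weighted benchmark survey, using a flexible tree ensemble as a
# stand-in for BART. All data and variable names are hypothetical.
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Hypothetical auxiliary variables common to both sources.
n_big, n_svy = 5000, 1000
X_big = rng.normal(size=(n_big, 3))          # Big Data sample
X_svy = rng.normal(size=(n_svy, 3))          # benchmark survey sample
w_svy = rng.uniform(50, 150, size=n_svy)     # survey design weights

# Stack the two samples; Z = 1 marks Big Data membership.
X = np.vstack([X_big, X_svy])
Z = np.concatenate([np.ones(n_big), np.zeros(n_svy)])

# Weight survey units by their design weights so the survey side
# stands in for the finite population; Big Data units get weight 1.
fit_w = np.concatenate([np.ones(n_big), w_svy])

clf = GradientBoostingClassifier().fit(X, Z, sample_weight=fit_w)
p = clf.predict_proba(X_big)[:, 1]           # rough P(in Big Data | x)

# Pseudo-inclusion odds -> pseudo-weights for the Big Data units.
pseudo_w = (1.0 - p) / np.clip(p, 1e-6, None)

# A pseudo-weighted estimate of a population mean of some outcome y.
y_big = X_big[:, 0] + rng.normal(size=n_big)  # hypothetical outcome
y_hat = np.sum(pseudo_w * y_big) / np.sum(pseudo_w)
print("pseudo-weighted mean:", y_hat)

Because the survey units carry their design weights during fitting, the estimated membership probability can be read, roughly, as the odds of a population unit appearing in the Big Data, which is what the (1 - p)/p pseudo-weight inverts.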


Author(s): Ângela Alpoim, Tiago Guimarães, Filipe Portela, Manuel Filipe Santos

2021, Vol 2068 (1), pp. 012025
Author(s): Jian Zheng, Zhaoni Li, Jiang Li, Hongling Liu

Abstract It is difficult to detect anomalies in big data with traditional methods because big data are massive and disordered. Common methods divide the big data into several small samples and then analyze these samples; however, this increases the complexity of the segmentation algorithm, and it is difficult to control the risk introduced by the segmentation. To address this, this paper proposes a neural network approach based on the Vapnik risk model. First, the sample data are randomly divided into small data blocks. Then, a neural network learns these small data blocks. To reduce the risks arising during data segmentation, the Vapnik risk model is used to supervise the segmentation. Finally, the proposed method is validated on historical electricity price data from Mountain View, California. The results show that our method is effective.
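As a rough illustration of the block-wise scheme described above, the sketch below randomly splits a toy series into blocks, trains a small scikit-learn MLP on each block, and scores candidate block sizes by empirical risk plus a Vapnik-style confidence term. The block sizes, network architecture, and capacity proxy are illustrative assumptions rather than the paper's exact model.

# Sketch of block-wise learning guided by a Vapnik-style risk bound.
# The block sizes, the MLP architecture, and the crude capacity proxy
# below are illustrative assumptions, not the paper's exact model.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 1))
y = np.sin(3 * X[:, 0]) + 0.1 * rng.normal(size=2000)  # toy "price" series

def vapnik_bound(emp_risk, n, h, eta=0.05):
    # Vapnik's structural-risk-style confidence term added to the
    # empirical risk; h plays the role of model capacity (VC dimension).
    conf = np.sqrt((h * (np.log(2 * n / h) + 1) + np.log(4 / eta)) / n)
    return emp_risk + conf

best = None
for block_size in (100, 250, 500):
    idx = rng.permutation(len(X))
    blocks = np.array_split(idx, len(X) // block_size)
    risks = []
    for b in blocks:
        net = MLPRegressor(hidden_layer_sizes=(16,), max_iter=2000,
                           random_state=0).fit(X[b], y[b])
        risks.append(mean_squared_error(y[b], net.predict(X[b])))
    # Capacity proxy: number of network weights (a common rough stand-in).
    h = 16 * (X.shape[1] + 1) + 16 + 1
    bound = vapnik_bound(np.mean(risks), block_size, h)
    if best is None or bound < best[1]:
        best = (block_size, bound)

print("block size chosen by the risk bound:", best[0])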


Author(s):  
M. Asif Naeem ◽  
Gillian Dobbie ◽  
Gerald Weber

In order to make timely and effective decisions, businesses need the latest information from big data warehouse repositories. To keep these repositories up to date, real-time data integration is required. An important phase in real-time data integration is data transformation, where a stream of updates, which is huge in volume and infinite, is joined with large disk-based master data. Stream processing is an important concept in Big Data, since large volumes of data are often best processed immediately. A well-known algorithm called Mesh Join (MESHJOIN) was proposed to join stream data with disk-based master data using limited memory. MESHJOIN is a candidate for a resource-aware system setup. The problem the authors consider in this chapter is that MESHJOIN is not very selective: the performance of the algorithm is always inversely proportional to the size of the master data table, so resource consumption is suboptimal in some scenarios. They present an algorithm called Cache Join (CACHEJOIN), which performs asymptotically at least as well as MESHJOIN but performs better in realistic scenarios, particularly if parts of the master data are used with different frequencies. To quantify the performance differences, the authors compare both algorithms on a synthetic dataset with a known skewed distribution as well as on TPC-H and real-life datasets.
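To make the contrast concrete, here is a toy single-process Python sketch of the CACHEJOIN idea: hot master-data keys are served from an in-memory LRU cache, while misses stall until a MESHJOIN-style pass over one disk partition picks them up. The data, sizes, and scheduling are illustrative assumptions; the published algorithm's queue and buffer management is considerably more involved.

# Toy sketch of the CACHEJOIN idea: serve frequently hit master-data
# keys from an in-memory cache and fall back to MESHJOIN-style passes
# over disk partitions for the rest. All sizes are hypothetical.
from collections import OrderedDict, defaultdict
import random

MASTER = {k: "row-%d" % k for k in range(10_000)}    # stands in for disk
PARTITIONS = [range(i, i + 1_000) for i in range(0, 10_000, 1_000)]
CACHE_SIZE = 100

cache = OrderedDict()        # LRU cache of hot master-data rows
stalled = defaultdict(list)  # stream tuples waiting on a disk pass

def process(t):
    # Cache phase: join immediately on a hit, otherwise stall the tuple.
    key = t["key"]
    if key in cache:
        cache.move_to_end(key)
        return {**t, "master": cache[key]}
    stalled[key].append(t)
    return None

def disk_phase(partition):
    # MESHJOIN-style pass over one master-data partition: join every
    # stalled tuple whose key falls in it and promote those rows to cache.
    out = []
    for key in list(stalled):
        if key in partition:
            row = MASTER[key]                    # stands in for a disk read
            out.extend({**t, "master": row} for t in stalled.pop(key))
            cache[key] = row
            if len(cache) > CACHE_SIZE:
                cache.popitem(last=False)        # evict the coldest row
    return out

# Skewed stream: three hot keys dominate, which is where the cache pays off.
random.seed(0)
keys = [1, 2, 3] * 300 + list(range(10_000))
stream = [{"key": random.choice(keys)} for _ in range(2_000)]

joined, part_no = [], 0
for i, t in enumerate(stream):
    if (r := process(t)) is not None:
        joined.append(r)
    if i % 100 == 99:                            # interleave a disk pass
        joined.extend(disk_phase(PARTITIONS[part_no]))
        part_no = (part_no + 1) % len(PARTITIONS)
for part in PARTITIONS:                          # drain remaining stalls
    joined.extend(disk_phase(part))
print(len(joined), "tuples joined;", len(cache), "rows cached")

On a skewed stream like this one, most tuples hit the cache after the first few disk passes, which is exactly the regime where CACHEJOIN is reported to outperform MESHJOIN.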

