Big data transfer optimization based on offline knowledge discovery and adaptive sampling

Author(s):  
Md S.Q. Zulkar Nine ◽  
Kemal Guner ◽  
Ziyun Huang ◽  
Xiangyu Wang ◽  
Jinhui Xu ◽  
...  
2020 ◽  
Vol 22 (2) ◽  
pp. 130-144
Author(s):  
Aiqin Hou ◽  
Chase Qishi Wu ◽  
Liudong Zuo ◽  
Xiaoyang Zhang ◽  
Tao Wang ◽  
...  

2018 ◽  
Vol 8 (11) ◽  
pp. 2216
Author(s):  
Jiahui Jin ◽  
Qi An ◽  
Wei Zhou ◽  
Jiakai Tang ◽  
Runqun Xiong

Network bandwidth is a scarce resource in big data environments, so data locality is a fundamental problem for data-parallel frameworks such as Hadoop and Spark. This problem is exacerbated in multicore server-based clusters, where multiple tasks running on the same server compete for the server’s network bandwidth. Existing approaches solve this problem by scheduling computational tasks near the input data while considering the server’s free time, data placements, and data transfer costs. However, such approaches usually set identical values for data transfer costs, even though a multicore server’s data transfer cost increases with the number of data-remote tasks. As a result, they fail to minimize data-processing time effectively. As a solution, we propose DynDL (Dynamic Data Locality), a novel data-locality-aware task-scheduling model that handles dynamic data transfer costs for multicore servers. DynDL offers greater flexibility than existing approaches by using a set of non-decreasing functions to evaluate dynamic data transfer costs. We also propose online and offline algorithms (based on DynDL) that minimize data-processing time and adaptively adjust data locality. Although the scheduling problem in DynDL is NP-complete, we prove that the offline algorithm runs in quadratic time and generates optimal results for DynDL’s specific uses. Using a series of simulations and real-world executions, we show that our algorithms reduce data-processing time by 30% compared with algorithms that do not consider dynamic data transfer costs. Moreover, they can adaptively adjust data localities based on the server’s free time, data placement, and network bandwidth, and schedule tens of thousands of tasks within seconds or even subseconds.
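The core idea above — that a server's data transfer cost should grow with the number of data-remote tasks it already hosts — can be illustrated with a minimal greedy scheduler. This is a hedged sketch, not the paper's DynDL algorithm: the cost function, the greedy policy, and all names (`transfer_cost`, `schedule`, `data_placement`) are illustrative assumptions.

```python
# Illustrative sketch (NOT the DynDL algorithm): a greedy scheduler that
# places each task either on a server holding its input data (data-local)
# or on the cheapest remote server, where a server's per-task transfer
# cost is a non-decreasing function of the data-remote tasks already
# assigned to it — the key modeling idea the abstract describes.

def transfer_cost(remote_tasks: int) -> float:
    """Hypothetical non-decreasing cost function: each additional
    data-remote task on a server raises the per-task transfer cost."""
    return 1.0 + 0.5 * remote_tasks

def schedule(tasks, data_placement, servers):
    """tasks: list of task ids; data_placement: task -> set of servers
    holding its input (assumed non-empty); servers: list of server ids.
    Returns task -> (server, transfer cost) under a simple greedy policy."""
    remote_count = {s: 0 for s in servers}   # data-remote tasks per server
    load = {s: 0 for s in servers}           # total tasks per server
    assignment = {}
    for t in tasks:
        local = data_placement[t]
        # Least-loaded server that already holds the task's data.
        best_local = min(local, key=lambda s: load[s])
        # Cheapest remote option given current remote-task counts.
        best_remote = min((s for s in servers if s not in local),
                          key=lambda s: transfer_cost(remote_count[s]),
                          default=None)
        if best_remote is None or load[best_local] <= load[best_remote]:
            assignment[t] = (best_local, 0.0)    # data-local, no transfer
            load[best_local] += 1
        else:
            cost = transfer_cost(remote_count[best_remote])
            assignment[t] = (best_remote, cost)  # data-remote, pay cost
            remote_count[best_remote] += 1
            load[best_remote] += 1
    return assignment
```

With all inputs placed on one server, the sketch spills overflow tasks to remote servers only while the accumulated remote transfer cost stays attractive, which is the trade-off a fixed (identical) transfer cost cannot express.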


2018 ◽  
Vol 50 (4) ◽  
pp. 329-343 ◽  
Author(s):  
Andi Wang ◽  
Xiaochen Xian ◽  
Fugee Tsung ◽  
Kaibo Liu

2018 ◽  
Vol 7 (2.19) ◽  
pp. 52
Author(s):  
J Vivek ◽  
Gandla Maharnisha ◽  
Gandla Roopesh Kumar ◽  
Ch Karun Sagar ◽  
R Arunraj

In this paper, context awareness is presented as a promising technology for providing health care services and a niche area of the big data paradigm. Knowledge Discovery from Data refers to a set of activities designed to refine and extract new knowledge from complex datasets. The proposed model facilitates parallel mining of frequent itemsets for an Ambient Assisted Living (AAL) system (i.e., a health care system) over big data residing in a cloud environment. We extend a knowledge discovery framework for processing and classifying the abnormal conditions of patients with fluctuations in Blood Pressure (BP) and Heart Rate (HR), and for storing these datasets (the big data) in the cloud so they can be accessed from anywhere when needed. New readings are then compared against this stored data, which helps to assess the patient's health condition.
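The comparison step described above — checking a new BP/HR reading against a patient's stored history to flag an abnormal condition — can be sketched with a simple statistical rule. This is an illustrative assumption, not the paper's classifier: the deviation rule, the `k = 2` threshold, and the sample values are all hypothetical.

```python
# Illustrative sketch (hypothetical rule, not the paper's method): flag a
# new reading as abnormal when it deviates more than k standard
# deviations from the patient's stored history, mirroring the idea of
# comparing new data against cloud-stored records.
from statistics import mean, stdev

def is_abnormal(history, new_value, k=2.0):
    """Return True if new_value lies more than k standard deviations
    from the mean of the patient's historical readings."""
    mu, sigma = mean(history), stdev(history)
    return abs(new_value - mu) > k * sigma

# Example: systolic BP history vs. two new readings
bp_history = [118, 122, 120, 119, 121, 120]
print(is_abnormal(bp_history, 121))  # within the patient's normal range
print(is_abnormal(bp_history, 160))  # flagged as abnormal
```

In a real AAL deployment the history would be fetched from cloud storage and the rule would be replaced by the framework's trained classifier; the sketch only shows the compare-against-history pattern.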

