Vertical Data Processing for Mining Big Data: A Predicate Tree Approach

10.29007/db8n ◽  
2019 ◽  
Author(s):  
Mohammad Hossain ◽  
Maninder Singh ◽  
Sameer Abufardeh

Time is a critical factor in processing very large volumes of data, a.k.a. "Big Data". Many existing data mining algorithms (supervised and unsupervised) become impractical because of the ubiquitous use of horizontal processing, i.e. row-by-row processing of stored data. Processing time for big data is further exacerbated by high dimensionality (number of features) and high cardinality (number of records). To address this processing-time issue, we propose a vertical approach based on predicate trees (pTrees). Our approach structures data into columns of bit slices, ranging from a few to hundreds, which are processed vertically, i.e. column by column. We tested and compared our vertical approach against the traditional (horizontal) approach using three basic arithmetic operations, namely addition, subtraction, and multiplication implemented with Boolean bit operations, over ten data sizes ranging from half a billion to five billion bits. The results are analyzed with respect to processing time and speed gain for both approaches. They show that our vertical approach outperformed the traditional approach for all three operations across all data sizes, yielding speed gains between 24% and 96%. We conclude that our approach, being in a data-mining-ready format, is best suited to operations involving complex computations in big data applications, where it achieves significant speed gains.
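The column-wise bit-slice arithmetic the abstract describes can be illustrated in a few lines. The sketch below is a toy, not the authors' pTree implementation (the function names are illustrative): each slice is a plain Python integer whose i-th bit holds one bit position of record i, so a single bitwise operation processes every record at once, and addition becomes a ripple-carry pass over the slices.

```python
# Illustrative sketch of vertical (bit-slice) addition; not the authors'
# pTree implementation. Each slice is an int whose bit i holds one bit
# position of record i, so one bitwise op touches all records at once.

def to_slices(values, width):
    """Pack a column of integers into `width` vertical bit slices."""
    return [sum(((v >> j) & 1) << i for i, v in enumerate(values))
            for j in range(width)]

def from_slices(slices, n_records):
    """Unpack vertical bit slices back into per-record integers."""
    return [sum(((s >> i) & 1) << j for j, s in enumerate(slices))
            for i in range(n_records)]

def vertical_add(a, b):
    """Ripple-carry addition over slices: one pass, column by column."""
    carry, out = 0, []
    for sa, sb in zip(a, b):
        out.append(sa ^ sb ^ carry)              # sum bit for all records
        carry = (sa & sb) | (carry & (sa ^ sb))  # carry bit for all records
    out.append(carry)  # final carry becomes the top slice
    return out

xs, ys = [3, 5, 7, 2], [1, 6, 4, 9]
slices = vertical_add(to_slices(xs, 4), to_slices(ys, 4))
print(from_slices(slices, 4))  # [4, 11, 11, 11]
```

The key point of the vertical layout is that the loop runs over the bit width (a handful of iterations), not over the record count, which is where the speed gain on billions of bits comes from.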

2018 ◽  
pp. 90-102
Author(s):  
Matheus Varela Ferreira ◽  
Francisco Assis da Silva ◽  
Leandro Luiz de Almeida ◽  
Danillo Roberto Pereira

With the increasing need to make decisions in the short term, industry (pharmaceutical, petrochemical, aeronautical, etc.) has been seeking new ways to reduce the time of the data mining process for obtaining knowledge. In recent years, many technological resources have been used to address this need; one example is CUDA, a platform that enables GeForce GPUs to be used in conjunction with CPUs for data processing, significantly reducing processing time. This work performs a comparative analysis of processing time between two versions of several data mining algorithms (Apriori, AprioriAll, Naïve Bayes and K-Means), one running on the CPU only and one running on the CPU in conjunction with the GPU through the CUDA platform. The experiments performed show that satisfactory results can be obtained using the CUDA platform.
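To make the CPU/GPU comparison concrete, the sketch below shows the candidate-counting pass of Apriori (one of the algorithms compared) in plain CPU-side Python; names and data are illustrative, not from the paper. The inner per-transaction loop is the part a CUDA version would distribute across GPU threads.

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(transactions, min_support):
    """Count candidate 2-itemsets (one Apriori pass). This per-transaction
    counting loop is the hotspot a CUDA port would parallelize."""
    counts = Counter()
    for t in transactions:
        for pair in combinations(sorted(set(t)), 2):
            counts[pair] += 1
    return {p: c for p, c in counts.items() if c >= min_support}

tx = [["milk", "bread"], ["milk", "eggs"], ["milk", "bread", "eggs"]]
print(frequent_pairs(tx, 2))  # {('bread', 'milk'): 2, ('eggs', 'milk'): 2}
```

Because each transaction is counted independently, the workload maps naturally onto many GPU threads, which is what makes these algorithms good candidates for CUDA acceleration.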


2018 ◽  
Vol 8 (11) ◽  
pp. 2216
Author(s):  
Jiahui Jin ◽  
Qi An ◽  
Wei Zhou ◽  
Jiakai Tang ◽  
Runqun Xiong

Network bandwidth is a scarce resource in big data environments, so data locality is a fundamental problem for data-parallel frameworks such as Hadoop and Spark. The problem is exacerbated in multicore server-based clusters, where multiple tasks running on the same server compete for the server’s network bandwidth. Existing approaches solve this problem by scheduling computational tasks near the input data, considering the server’s free time, data placements, and data transfer costs. However, such approaches usually set identical values for data transfer costs, even though a multicore server’s data transfer cost increases with the number of data-remote tasks; as a result, they minimize data-processing time ineffectively. As a solution, we propose DynDL (Dynamic Data Locality), a novel data-locality-aware task-scheduling model that handles dynamic data transfer costs for multicore servers. DynDL offers greater flexibility than existing approaches by using a set of non-decreasing functions to evaluate dynamic data transfer costs. We also propose online and offline algorithms (based on DynDL) that minimize data-processing time and adaptively adjust data locality. Although the DynDL scheduling problem is NP-complete (nondeterministic polynomial-complete), we prove that the offline algorithm runs in quadratic time and generates optimal results for DynDL’s specific uses. Using a series of simulations and real-world executions, we show that our algorithms improve data-processing time by 30% over algorithms that do not consider dynamic data transfer costs. Moreover, they can adaptively adjust data localities based on the server’s free time, data placement, and network bandwidth, and schedule tens of thousands of tasks within subseconds or seconds.
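The core idea of a dynamic transfer cost can be sketched with a small greedy placement loop. This is an illustrative toy, not the paper's DynDL algorithm: `transfer_cost(k)` stands in for the non-decreasing cost functions the abstract mentions, charging more for each additional data-remote task a server already hosts.

```python
def assign_tasks(tasks, servers, transfer_cost):
    """Greedy sketch: place each task where (queue length + dynamic
    transfer cost) is smallest. `transfer_cost(k)` must be non-decreasing
    in k, the number of data-remote tasks already on the server."""
    load = {s: 0 for s in servers}    # tasks queued per server
    remote = {s: 0 for s in servers}  # data-remote tasks per server
    plan = {}
    for task, local_server in tasks:
        def cost(s):
            penalty = 0 if s == local_server else transfer_cost(remote[s] + 1)
            return load[s] + penalty
        best = min(servers, key=cost)
        plan[task] = best
        load[best] += 1
        if best != local_server:
            remote[best] += 1
    return plan

# Cost grows with the remote-task count, so remote placements self-throttle.
tasks = [("t1", "A"), ("t2", "A"), ("t3", "B")]
print(assign_tasks(tasks, ["A", "B"], lambda k: 2 * k))
# {'t1': 'A', 't2': 'A', 't3': 'B'}
```

The contrast with a fixed transfer cost is the point: with a constant penalty, the scheduler keeps sending remote tasks to the same busy server, whereas a non-decreasing cost pushes later tasks elsewhere.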


Hadmérnök ◽  
2020 ◽  
Vol 15 (4) ◽  
pp. 141-158
Author(s):  
Eszter Katalin Bognár

In modern warfare, the most important innovation to date has been the utilisation of information as a weapon. The basis of successful military operations is the ability to correctly assess a situation based on credible collected information. In today’s military, the primary challenge is not the actual collection of data; it has become more important to extract relevant information from that data. This requirement cannot be successfully met without the necessary improvements in tools and techniques to support the acquisition and analysis of data. This study defines Big Data and its concept as applied to military reconnaissance, focusing on the processing of imagery and textual data, and brings to light modern data processing and analytics methods that enable effective processing.


2018 ◽  
Vol 2 (2) ◽  
pp. 164-176
Author(s):  
Zhiwen Pan ◽  
Wen Ji ◽  
Yiqiang Chen ◽  
Lianjun Dai ◽  
Jun Zhang

Purpose
Disability datasets contain information about disabled populations. By analyzing these datasets, professionals who work with disabled populations can better understand the inherent characteristics of those populations, so that working plans and policies that effectively help them can be made accordingly.
Design/methodology/approach
In this paper, the authors propose a big data management and analytic approach for disability datasets.
Findings
Using a set of data mining algorithms, the proposed approach provides the following services. The data management scheme improves the quality of disability data by estimating missing attribute values and detecting anomalous and low-quality data instances. The data mining scheme discovers useful patterns that reflect correlations, associations and interactions between disability data attributes. Experiments based on a real-world dataset are conducted to prove the effectiveness of the approach.
Originality/value
The proposed approach enables data-driven decision-making for professionals who work with disabled populations.
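The two data-management services named in the Findings (estimating missing attribute values and flagging anomalous instances) can be sketched minimally as mean imputation plus a z-score check. This is an illustrative toy under those assumptions, not the paper's actual scheme; the function name and threshold are our own.

```python
def impute_and_flag(column, z_cutoff=3.0):
    """Fill missing entries (None) with the column mean, then flag values
    more than z_cutoff standard deviations from the mean as anomalies."""
    known = [v for v in column if v is not None]
    mean = sum(known) / len(known)
    var = sum((v - mean) ** 2 for v in known) / len(known)
    std = var ** 0.5
    filled = [mean if v is None else v for v in column]
    flags = [std > 0 and abs(v - mean) > z_cutoff * std for v in filled]
    return filled, flags

ages = [34, None, 29, 31, 200]  # 200 looks like a data-entry error
filled, flags = impute_and_flag(ages, z_cutoff=1.5)
```

Here the missing age is replaced by the mean of the known values, and the implausible 200 is the only entry flagged, which is the kind of quality improvement the data management scheme aims at.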


2020 ◽  
Vol 1432 ◽  
pp. 012110
Author(s):  
Jesús Silva ◽  
Hugo Hernández Palma ◽  
William Niebles Núñez ◽  
David Ovallos-Gazabon ◽  
Noel Varela

2019 ◽  
Vol 16 (8) ◽  
pp. 3211-3215 ◽  
Author(s):  
S. Prince Mary ◽  
D. Usha Nandini ◽  
B. Ankayarkanni ◽  
R. Sathyabama Krishna

Integrating cloud and big data is a difficult and challenging task, and finding the number of resources needed to complete a job is equally difficult. Virtualization is therefore implemented; the processing involves three phases: map, shuffle and reduce. Many researchers have already applied heterogeneous MapReduce applications, using the least-work-left policy in distributed server systems. In this paper we discuss how virtualization is used for Hadoop jobs to achieve effective data processing, how the processing time of a job is determined, and how a balanced partition algorithm is used. The main objective is to implement virtualization on our local machines.
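The three phases named above can be modelled in a few lines of single-process Python. This word-count sketch is illustrative only; it is not the paper's Hadoop setup or its balanced partition algorithm.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Minimal single-process model of the three phases: map emits
    key-value pairs, shuffle groups them by key, reduce folds each
    group into one result."""
    # Map phase: every record yields zero or more (key, value) pairs
    pairs = [kv for r in records for kv in mapper(r)]
    # Shuffle phase: group values by key
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    # Reduce phase: fold each group
    return {k: reducer(k, vs) for k, vs in groups.items()}

lines = ["big data", "big jobs"]
counts = map_reduce(lines,
                    mapper=lambda line: [(w, 1) for w in line.split()],
                    reducer=lambda w, ones: sum(ones))
print(counts)  # {'big': 2, 'data': 1, 'jobs': 1}
```

In a real Hadoop deployment the map and reduce phases run as many parallel tasks and the shuffle moves data across the network, which is exactly where partitioning and virtualized resource allocation affect the job's processing time.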

