Vertical Data Processing for Mining Big Data: A Predicate Tree Approach

10.29007/db8n ◽  
2019 ◽  
Author(s):  
Mohammad Hossain ◽  
Maninder Singh ◽  
Sameer Abufardeh

Time is a critical factor in processing very large volumes of data, a.k.a. "Big Data". Many existing data mining algorithms (supervised and unsupervised) become impractical because of the ubiquitous use of horizontal processing, i.e. row-by-row processing of stored data. Processing time for big data is further exacerbated by high dimensionality (number of features) and high cardinality (number of records). To address this processing-time issue, we propose a vertical approach based on predicate trees (pTrees). Our approach structures data into columns of bit slices, ranging from a few to hundreds, which are processed vertically, i.e. column by column. We tested and compared our vertical approach against the traditional (horizontal) approach using three basic arithmetic operations, namely addition, subtraction, and multiplication implemented with Boolean bit operations, over ten data sizes ranging from half a billion to five billion bits. The results are analyzed with respect to processing time and speed gain for both approaches. They show that our vertical approach outperformed the traditional approach for all three operations across all data sizes, yielding speed gains between 24% and 96%. We conclude that our approach, being in a data-mining-ready format, is best suited to operations involving complex computations in big data applications, where it achieves significant speed gains.
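The column-wise bit-slice arithmetic the abstract describes can be illustrated in a few lines. The sketch below is a toy, not the authors' pTree implementation (the function names are illustrative): each slice is a plain Python integer whose i-th bit holds one bit position of record i, so a single bitwise operation processes every record at once, and addition becomes a ripple-carry pass over the slices.

```python
# Illustrative sketch of vertical (bit-slice) addition; not the authors'
# pTree implementation. Each slice is an int whose bit i holds one bit
# position of record i, so one bitwise op touches all records at once.

def to_slices(values, width):
    """Pack a column of integers into `width` vertical bit slices."""
    return [sum(((v >> j) & 1) << i for i, v in enumerate(values))
            for j in range(width)]

def from_slices(slices, n_records):
    """Unpack vertical bit slices back into per-record integers."""
    return [sum(((s >> i) & 1) << j for j, s in enumerate(slices))
            for i in range(n_records)]

def vertical_add(a, b):
    """Ripple-carry addition over slices: one pass, column by column."""
    carry, out = 0, []
    for sa, sb in zip(a, b):
        out.append(sa ^ sb ^ carry)              # sum bit for all records
        carry = (sa & sb) | (carry & (sa ^ sb))  # carry bit for all records
    out.append(carry)  # final carry becomes the top slice
    return out

xs, ys = [3, 5, 7, 2], [1, 6, 4, 9]
slices = vertical_add(to_slices(xs, 4), to_slices(ys, 4))
print(from_slices(slices, 4))  # [4, 11, 11, 11]
```

The key point of the vertical layout is that the loop runs over the bit width (a handful of iterations), not over the record count, which is where the speed gain on billions of bits comes from.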

2018 ◽  
pp. 90-102
Author(s):  
Matheus Varela Ferreira ◽  
Francisco Assis da Silva ◽  
Leandro Luiz de Almeida ◽  
Danillo Roberto Pereira

With the increasing need to make decisions in the short term, industry (pharmaceutical, petrochemical, aeronautical, etc.) has been seeking new ways to reduce the time of the data mining process for obtaining knowledge. In recent years, many technological resources have been used to address this need; one example is CUDA, a platform that enables GeForce GPUs to be used in conjunction with CPUs for data processing, significantly reducing processing time. This work performs a comparative analysis of processing time between two versions of several data mining algorithms (Apriori, AprioriAll, Naïve Bayes and K-Means), one running on the CPU only and one running on the CPU in conjunction with the GPU through the CUDA platform. The experiments performed show that satisfactory results can be obtained using the CUDA platform.
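To make the CPU/GPU comparison concrete, the sketch below shows the candidate-counting pass of Apriori (one of the algorithms compared) in plain CPU-side Python; names and data are illustrative, not from the paper. The inner per-transaction loop is the part a CUDA version would distribute across GPU threads.

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(transactions, min_support):
    """Count candidate 2-itemsets (one Apriori pass). This per-transaction
    counting loop is the hotspot a CUDA port would parallelize."""
    counts = Counter()
    for t in transactions:
        for pair in combinations(sorted(set(t)), 2):
            counts[pair] += 1
    return {p: c for p, c in counts.items() if c >= min_support}

tx = [["milk", "bread"], ["milk", "eggs"], ["milk", "bread", "eggs"]]
print(frequent_pairs(tx, 2))  # {('bread', 'milk'): 2, ('eggs', 'milk'): 2}
```

Because each transaction is counted independently, the workload maps naturally onto many GPU threads, which is what makes these algorithms good candidates for CUDA acceleration.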


2018 ◽  
Vol 8 (11) ◽  
pp. 2216
Author(s):  
Jiahui Jin ◽  
Qi An ◽  
Wei Zhou ◽  
Jiakai Tang ◽  
Runqun Xiong

Network bandwidth is a scarce resource in big data environments, so data locality is a fundamental problem for data-parallel frameworks such as Hadoop and Spark. The problem is exacerbated in multicore server-based clusters, where multiple tasks running on the same server compete for the server’s network bandwidth. Existing approaches solve this problem by scheduling computational tasks near the input data, considering the server’s free time, data placements, and data transfer costs. However, such approaches usually set identical values for data transfer costs, even though a multicore server’s data transfer cost increases with the number of data-remote tasks; as a result, they minimize data-processing time ineffectively. As a solution, we propose DynDL (Dynamic Data Locality), a novel data-locality-aware task-scheduling model that handles dynamic data transfer costs for multicore servers. DynDL offers greater flexibility than existing approaches by using a set of non-decreasing functions to evaluate dynamic data transfer costs. We also propose online and offline algorithms (based on DynDL) that minimize data-processing time and adaptively adjust data locality. Although the DynDL scheduling problem is NP-complete (nondeterministic polynomial-complete), we prove that the offline algorithm runs in quadratic time and generates optimal results for DynDL’s specific uses. Using a series of simulations and real-world executions, we show that our algorithms improve data-processing time by 30% over algorithms that do not consider dynamic data transfer costs. Moreover, they can adaptively adjust data localities based on the server’s free time, data placement, and network bandwidth, and schedule tens of thousands of tasks within subseconds or seconds.
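The core idea of a dynamic transfer cost can be sketched with a small greedy placement loop. This is an illustrative toy, not the paper's DynDL algorithm: `transfer_cost(k)` stands in for the non-decreasing cost functions the abstract mentions, charging more for each additional data-remote task a server already hosts.

```python
def assign_tasks(tasks, servers, transfer_cost):
    """Greedy sketch: place each task where (queue length + dynamic
    transfer cost) is smallest. `transfer_cost(k)` must be non-decreasing
    in k, the number of data-remote tasks already on the server."""
    load = {s: 0 for s in servers}    # tasks queued per server
    remote = {s: 0 for s in servers}  # data-remote tasks per server
    plan = {}
    for task, local_server in tasks:
        def cost(s):
            penalty = 0 if s == local_server else transfer_cost(remote[s] + 1)
            return load[s] + penalty
        best = min(servers, key=cost)
        plan[task] = best
        load[best] += 1
        if best != local_server:
            remote[best] += 1
    return plan

# Cost grows with the remote-task count, so remote placements self-throttle.
tasks = [("t1", "A"), ("t2", "A"), ("t3", "B")]
print(assign_tasks(tasks, ["A", "B"], lambda k: 2 * k))
# {'t1': 'A', 't2': 'A', 't3': 'B'}
```

The contrast with a fixed transfer cost is the point: with a constant penalty, the scheduler keeps sending remote tasks to the same busy server, whereas a non-decreasing cost pushes later tasks elsewhere.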


Hadmérnök ◽  
2020 ◽  
Vol 15 (4) ◽  
pp. 141-158
Author(s):  
Eszter Katalin Bognár

In modern warfare, the most important innovation to date has been the utilisation of information as a weapon. The basis of successful military operations is the ability to correctly assess a situation based on credible collected information. In today’s military, the primary challenge is not the actual collection of data; it has become more important to extract relevant information from that data. This requirement cannot be successfully met without the necessary improvements in tools and techniques to support the acquisition and analysis of data. This study defines Big Data and its concept as applied to military reconnaissance, focusing on the processing of imagery and textual data, and brings to light modern data processing and analytics methods that enable effective processing.


2018 ◽  
Vol 2 (2) ◽  
pp. 164-176
Author(s):  
Zhiwen Pan ◽  
Wen Ji ◽  
Yiqiang Chen ◽  
Lianjun Dai ◽  
Jun Zhang

Purpose
Disability datasets contain information about disabled populations. By analyzing these datasets, professionals who work with disabled populations can better understand the inherent characteristics of those populations, so that working plans and policies that effectively help them can be made accordingly.
Design/methodology/approach
In this paper, the authors propose a big data management and analytic approach for disability datasets.
Findings
Using a set of data mining algorithms, the proposed approach provides the following services. The data management scheme improves the quality of disability data by estimating missing attribute values and detecting anomalous and low-quality data instances. The data mining scheme discovers useful patterns that reflect correlations, associations and interactions between disability data attributes. Experiments based on a real-world dataset are conducted to prove the effectiveness of the approach.
Originality/value
The proposed approach enables data-driven decision-making for professionals who work with disabled populations.
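The two data-management services named in the Findings (estimating missing attribute values and flagging anomalous instances) can be sketched minimally as mean imputation plus a z-score check. This is an illustrative toy under those assumptions, not the paper's actual scheme; the function name and threshold are our own.

```python
def impute_and_flag(column, z_cutoff=3.0):
    """Fill missing entries (None) with the column mean, then flag values
    more than z_cutoff standard deviations from the mean as anomalies."""
    known = [v for v in column if v is not None]
    mean = sum(known) / len(known)
    var = sum((v - mean) ** 2 for v in known) / len(known)
    std = var ** 0.5
    filled = [mean if v is None else v for v in column]
    flags = [std > 0 and abs(v - mean) > z_cutoff * std for v in filled]
    return filled, flags

ages = [34, None, 29, 31, 200]  # 200 looks like a data-entry error
filled, flags = impute_and_flag(ages, z_cutoff=1.5)
```

Here the missing age is replaced by the mean of the known values, and the implausible 200 is the only entry flagged, which is the kind of quality improvement the data management scheme aims at.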


2020 ◽  
Vol 1432 ◽  
pp. 012110
Author(s):  
Jesús Silva ◽  
Hugo Hernández Palma ◽  
William Niebles Núñez ◽  
David Ovallos-Gazabon ◽  
Noel Varela

2019 ◽  
Vol 16 (8) ◽  
pp. 3211-3215 ◽  
Author(s):  
S. Prince Mary ◽  
D. Usha Nandini ◽  
B. Ankayarkanni ◽  
R. Sathyabama Krishna

Integrating cloud and big data is a difficult and challenging task, and finding the number of resources needed to complete a job is equally difficult. Virtualization is therefore implemented; the processing involves three phases: map, shuffle and reduce. Many researchers have already applied heterogeneous MapReduce applications, using the least-work-left policy in distributed server systems. In this paper we discuss how virtualization is used for Hadoop jobs to achieve effective data processing, how the processing time of a job is determined, and how a balanced partition algorithm is used. The main objective is to implement virtualization on our local machines.
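The three phases named above can be modelled in a few lines of single-process Python. This word-count sketch is illustrative only; it is not the paper's Hadoop setup or its balanced partition algorithm.

```python
from collections import defaultdict

def map_reduce(records, mapper, reducer):
    """Minimal single-process model of the three phases: map emits
    key-value pairs, shuffle groups them by key, reduce folds each
    group into one result."""
    # Map phase: every record yields zero or more (key, value) pairs
    pairs = [kv for r in records for kv in mapper(r)]
    # Shuffle phase: group values by key
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    # Reduce phase: fold each group
    return {k: reducer(k, vs) for k, vs in groups.items()}

lines = ["big data", "big jobs"]
counts = map_reduce(lines,
                    mapper=lambda line: [(w, 1) for w in line.split()],
                    reducer=lambda w, ones: sum(ones))
print(counts)  # {'big': 2, 'data': 1, 'jobs': 1}
```

In a real Hadoop deployment the map and reduce phases run as many parallel tasks and the shuffle moves data across the network, which is exactly where partitioning and virtualized resource allocation affect the job's processing time.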

