DISTRIBUTED PROCESSING OF LARGE VOLUMES OF TRANSACTIONAL DATA

Author(s):  
O. Dmytriieva ◽  
D. Nikulin

The work is devoted to the distributed processing of transactions in the analysis of large volumes of data for the purpose of mining association rules. Based on the well-known data mining algorithms for finding frequent itemsets, AIS and Apriori, possible parallelization variants were identified that avoid iterative scanning of the database and high memory consumption. The possibility of porting the computations to different platforms that support parallel data processing was investigated. The chosen computing platforms were MapReduce, a powerful framework for processing large, distributed datasets on a Hadoop cluster, and Apache Spark, a software tool for processing extremely large amounts of data. A comparative analysis of the performance of the considered methods was carried out, recommendations for the effective use of parallel computing platforms were obtained, and modifications of the association rule mining algorithms were proposed. The main tasks accomplished in the work are: a study of modern tools for distributed processing of structured and unstructured data; deployment of a test cluster in a cloud service; development of scripts to automate cluster deployment; modification of the distributed algorithms to adapt them to the required distributed computing frameworks; measurement of data processing performance in sequential and distributed modes using Hadoop MapReduce and Apache Spark; comparative analysis of the benchmark results; derivation and justification of the relationship between the amount of data processed and the time spent processing it; optimization of distributed association rule mining algorithms for processing large volumes of transactional data; and measurement of the performance of distributed processing with existing software tools. Keywords: distributed processing, transactional data, association rules, computing cluster, Hadoop, MapReduce, Apache Spark
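The frequent-itemset step of Apriori discussed above can be sketched in plain Python (a sequential, single-machine version of candidate generation and support counting; the distributed variants in the work partition the counting pass across a cluster, and this sketch omits the subset-pruning optimization):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return all itemsets whose support count is at least min_support."""
    transactions = [frozenset(t) for t in transactions]
    # Level 1: every distinct item is a candidate 1-itemset.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    frequent = {}
    while current:
        # Count support of each candidate with one pass over the data.
        counts = {c: sum(c <= t for t in transactions) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        # Join frequent k-itemsets pairwise to form (k+1)-candidates.
        keys = list(level)
        current = {a | b for a, b in combinations(keys, 2)
                   if len(a | b) == len(a) + 1}
    return frequent

data = [["bread", "milk"], ["bread", "butter"], ["bread", "milk", "butter"]]
freq = apriori(data, min_support=2)
print(freq[frozenset(["bread", "milk"])])  # 2
```

The iterative database scan mentioned in the abstract is visible here as the repeated pass over `transactions` on every level; the parallelized variants distribute exactly that counting step.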

2019 ◽  
Vol 6 (1) ◽  
Author(s):  
Mahdi Torabzadehkashi ◽  
Siavash Rezaei ◽  
Ali HeydariGorji ◽  
Hosein Bobarshad ◽  
Vladimir Alves ◽  
...  

Abstract In the era of big data applications, the demand for more sophisticated data centers and high-performance data processing mechanisms is increasing drastically. Data are originally stored in storage systems. To process data, application servers need to fetch them from storage devices, which imposes the cost of moving data through the system. This cost is directly related to the distance between the processing engines and the data, which is the key motivation for the emergence of distributed processing platforms such as Hadoop that move processing closer to the data. Computational storage devices (CSDs) push the "move process to data" paradigm to its ultimate boundaries by deploying embedded processing engines inside storage devices to process data. In this paper, we introduce Catalina, an efficient and flexible computational storage platform that provides a seamless environment to process data in-place. Catalina is the first CSD equipped with a dedicated application processor running a full-fledged operating system that provides filesystem-level data access for applications. Thus, a vast spectrum of applications can be ported to run on Catalina CSDs. Due to these unique features, to the best of our knowledge, Catalina is the only in-storage processing platform that can be seamlessly deployed in clusters to run distributed applications such as Hadoop MapReduce and HPC applications in-place, without any modifications to the underlying distributed processing framework. As a proof of concept, we build a fully functional Catalina prototype and a CSD-equipped platform using 16 Catalina CSDs to run Intel HiBench Hadoop and HPC benchmarks and investigate the benefits of deploying Catalina CSDs in distributed processing environments. The experimental results show up to a 2.2× improvement in performance and a 4.3× reduction in energy consumption for running Hadoop MapReduce benchmarks. Additionally, thanks to the Neon SIMD engines, the performance and energy efficiency of DFT algorithms are improved by up to 5.4× and 8.9×, respectively.
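The DFT workload accelerated by the Neon SIMD engines above can be illustrated with a minimal reference implementation (a naive O(N²) transform in plain Python; this is an illustrative sketch of the computation being offloaded, not the benchmark code used in the paper):

```python
import cmath

def dft(x):
    """Naive O(N^2) discrete Fourier transform of a sequence of samples."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

# A constant signal concentrates all spectral energy in bin 0.
spectrum = dft([1.0, 1.0, 1.0, 1.0])
print(abs(spectrum[0]))  # 4.0
```

Each output bin is an independent dot product over the input, which is why the inner loop maps so well onto SIMD lanes.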


Author(s):  
Yassir Samadi ◽  
Mostapha Zbakh ◽  
Amine Haouari

The size of the data used by enterprises has been growing at exponential rates over the last few years, and handling such huge data from various sources is a challenge for businesses. In addition, Big Data has become one of the major areas of research for cloud service providers, due to the large amount of data produced every day and the inefficiency of traditional algorithms and technologies in handling it. In order to resolve the aforementioned problems and to meet the increasing demand for high-speed, data-intensive computing, several solutions have been developed by researchers and developers. Among these solutions are cloud computing tools such as Hadoop MapReduce and Apache Spark, which work on the principles of parallel computing. This chapter focuses on how big data processing challenges can be handled by using cloud computing frameworks, and on the importance of cloud computing for businesses.
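The parallel-computing principle shared by Hadoop MapReduce and Spark can be sketched in a few lines of plain Python (a sequential simulation of the map, shuffle, and reduce phases for word counting; on a real cluster each phase runs distributed across many machines):

```python
from collections import defaultdict

def map_phase(documents):
    """Map: emit a (word, 1) pair for every word in every document."""
    for doc in documents:
        for word in doc.split():
            yield word, 1

def shuffle_phase(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: aggregate the grouped counts for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

docs = ["big data big cluster", "data processing"]
counts = reduce_phase(shuffle_phase(map_phase(docs)))
print(counts["big"], counts["data"])  # 2 2
```

Because the map and reduce functions are pure and per-key, the framework can split the input across nodes and merge results without any change to the user's code, which is the property both platforms exploit.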


2021 ◽  
Vol 12 (5) ◽  
pp. 246-259
Author(s):  
S. E. Popov ◽  
R. Yu. Zamaraev ◽  
N. I. Yukina ◽  
O. L. Giniyatullina ◽  
...  

The article presents a description of a software package for calculating displacement rates and detecting displacements of the earth's surface over areas of intensive coal mining. The package is built on the Docker Swarm microservice architecture integrated with Apache Spark, a system for massively parallel task execution, as a high-level tool for organizing container-based computations with orchestration of hardware resources. In the software package, a container serves as an element in the sequence of computation stages of the mathematical model of interferometric processing, presented as a managed service. The service itself is built on a microkernel of the specified operating system, with support for multitasking of process identifiers and network protocols. Containerization of the executor objects ensures the independence of computations both within a single pool of jobs and between different pools initialized in multi-user mode. The use of the YARN cluster resource management and job scheduling system made it possible to abstract all the computing resources of the cluster from the launch of specific jobs and to provide dispatching of distributed processing applications. The ability of the program code, based on the Sentinel-1 Toolbox, to store intermediate results of the procedures in the displacement-rate calculation schemes makes it possible to rerun calculations with various parameters, and parallelization reduces the calculation time in comparison with commercial software products. The combination of Docker Swarm and Apache Spark technologies in one software package made it possible to implement a high-performance computing system based on open-source software and the cross-platform programming languages Java and Python, using low-budget hardware blocks, including those made in Russia.


2019 ◽  
Vol 62 (11) ◽  
pp. 1671-1683 ◽  
Author(s):  
Roger Santos Ferreira ◽  
Denilson Alves Pereira

Abstract Sentiment analysis has been the main focus of plenty of research efforts, particularly justified by its commercial significance, both for consumers and businesses. Thus, many methods have been proposed so far, and the most prominent have been compared in terms of effectiveness. Nonetheless, the literature is deficient when it comes to assessing the efficiency of these methods for processing large volumes of data. In this study, we performed an experimental assessment of the efficiency of 22 methods in total, whose implementations were available. We also proposed and assessed an environment for distributed processing of sentiment analysis methods, using the Apache Spark platform, named BigFeel. In this environment, existing methods, designed to run in a non-distributed way, can be adapted to run in a distributed manner without altering their source code. The experimental results reveal that (i) few methods are efficient in their native form, (ii) the methods improve their efficiency after being integrated into BigFeel, (iii) some methods that were unable to process a large dataset became viable when deployed on a computer cluster and (iv) some methods can only handle small datasets, even in a distributed manner.
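The adaptation pattern described above (running an unmodified per-document method in a distributed manner) can be sketched in plain Python. Here `score` stands in for any black-box sentiment method with a hypothetical word lexicon, and explicit partitions stand in for Spark executors; this is an illustrative analogy, not the BigFeel implementation:

```python
def score(text):
    """Stand-in for an unmodified, black-box sentiment method
    (hypothetical lexicon; returns positive minus negative word count)."""
    positive = {"good", "great", "excellent"}
    negative = {"bad", "poor", "awful"}
    words = text.lower().split()
    return sum(w in positive for w in words) - sum(w in negative for w in words)

def partition(docs, n):
    """Split the corpus into n roughly equal partitions, as Spark would."""
    return [docs[i::n] for i in range(n)]

def map_partitions(parts, fn):
    """Apply the unmodified method to each partition; on a cluster each
    partition would be handled by a different executor."""
    return [[fn(doc) for doc in part] for part in parts]

docs = ["good great movie", "awful plot", "excellent but poor sound"]
results = map_partitions(partition(docs, 2), score)
flat = [s for part in results for s in part]
print(sorted(flat))  # [-1, 0, 2]
```

Because the method is applied independently to each document, it never needs to know it is running inside a partition, which is what allows adaptation without source-code changes.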


Author(s):  
Shweta Gumaste ◽  
Narayan D. G. ◽  
Sumedha Shinde ◽  
Amit K

Security is a critical concern for cloud service providers. Distributed denial of service (DDoS) attacks are the most frequent of all cloud security threats, and the consequences of the damage they cause are very serious. Thus, the design of an efficient DDoS detection system plays an important role in monitoring suspicious activity in the cloud. Real-time detection mechanisms that operate in cloud environments and rely on machine learning algorithms and distributed processing are an important research issue. In this work, we propose real-time detection of DDoS attacks using machine learning classifiers on a distributed processing platform. We evaluate the DDoS detection mechanism in an OpenStack-based cloud testbed using the Apache Spark framework, and compare classification performance on benchmark and real-time cloud datasets. Results of the experiments reveal that the random forest method offers the best accuracy among the compared classifiers. Furthermore, we demonstrate the effectiveness of the proposed distributed approach in terms of training and detection time.
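The feature-extraction step that feeds such classifiers can be sketched in plain Python (computing per-source request rates from a flow log and flagging sources above a threshold; the field names and threshold are illustrative assumptions, and in the paper's pipeline a trained random forest, not a fixed threshold, makes the decision):

```python
from collections import Counter

def request_rates(flows, window_seconds):
    """Count packets per source IP over a time window, as packets/second."""
    counts = Counter(flow["src_ip"] for flow in flows)
    return {ip: n / window_seconds for ip, n in counts.items()}

def flag_suspects(rates, threshold):
    """Flag sources whose request rate exceeds the threshold (a simple
    stand-in for the classifier's per-source decision)."""
    return {ip for ip, rate in rates.items() if rate > threshold}

# 120 packets in 60 s from one source vs. 3 from another.
flows = [{"src_ip": "10.0.0.5"}] * 120 + [{"src_ip": "10.0.0.9"}] * 3
rates = request_rates(flows, window_seconds=60)
print(flag_suspects(rates, threshold=1.0))  # {'10.0.0.5'}
```

Because rate features are computed per source, the counting step parallelizes naturally over flow partitions, which is what makes a Spark-based real-time pipeline feasible.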

