On the usability of Hadoop MapReduce, Apache Spark & Apache Flink for data science

Author(s):  
Bilal Akil ◽  
Ying Zhou ◽  
Uwe Röhm
Author(s):  
O. Dmytriieva ◽  
D. Nikulin

This work addresses distributed transaction processing in the analysis of large volumes of data for association rule mining. Based on the well-known data mining algorithms for frequent itemset discovery, AIS and Apriori, possible parallelization variants were identified that avoid iterative database scanning and high memory consumption. The possibility of porting the computations to different platforms that support parallel data processing was investigated. As computing platforms, MapReduce, a powerful framework for processing large, distributed datasets on a Hadoop cluster, and Apache Spark, a software tool for processing extremely large amounts of data, were chosen. A comparative performance analysis of the considered methods was carried out, recommendations for the effective use of parallel computing platforms were derived, and modifications of the association rule mining algorithms were proposed. The main tasks accomplished in this work were: a survey of modern tools for distributed processing of structured and unstructured data; deployment of a test cluster in a cloud service; development of scripts to automate cluster deployment; modification of the distributed algorithms to adapt them to the target distributed computing frameworks; measurement of data processing performance in sequential and distributed modes using Hadoop MapReduce and Apache Spark; comparative analysis of the benchmark results; derivation and justification of the relationship between the amount of data processed and the processing time; optimization of the distributed association rule mining algorithms for processing large volumes of transactional data; and measurement of the performance of distributed processing with existing software tools. Keywords: distributed processing, transactional data, association rules, computing cluster, Hadoop, MapReduce, Apache Spark
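The workflow described above maps naturally onto Spark's built-in frequent pattern mining. The following is a minimal sketch only: Spark MLlib ships FP-Growth rather than the AIS or Apriori variants modified in the paper, and the toy transactions and thresholds are illustrative assumptions, but it shows distributed association rule mining on one of the platforms studied.

# Minimal sketch: distributed association rule mining on Apache Spark.
# Note: MLlib implements FP-Growth, not the AIS/Apriori variants above;
# the transactions and thresholds are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.fpm import FPGrowth

spark = SparkSession.builder.appName("assoc-rules-sketch").getOrCreate()

# Toy transactional data; in practice this would be loaded from HDFS.
transactions = spark.createDataFrame(
    [(0, ["bread", "milk"]),
     (1, ["bread", "butter", "milk"]),
     (2, ["butter", "jam"])],
    ["id", "items"])

fp = FPGrowth(itemsCol="items", minSupport=0.5, minConfidence=0.6)
model = fp.fit(transactions)

model.freqItemsets.show()      # frequent itemsets with support counts
model.associationRules.show()  # rules with confidence and lift

spark.stop()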


2021 ◽  
Vol 12 ◽  
Author(s):  
Muhammad Usman Tariq ◽  
Muhammad Babar ◽  
Marc Poulin ◽  
Akmal Saeed Khattak ◽  
Mohammad Dahman Alshehri ◽  
...  

Intelligent big data analysis is an evolving pattern in the age of big data science and artificial intelligence (AI). Analysis of structured data has been very successful, but analyzing human behavior using social media data is challenging. Social media data comprises a vast, unstructured mix of data sources that can include likes, comments, tweets, shares, and views. Social media analytics has become a challenging task for companies, such as Dailymotion, that have billions of daily users and vast numbers of comments, likes, and views. Social media data is created in significant amounts and at a tremendous pace, and this very high volume must be stored, sorted, processed, and carefully studied to support decision making. This article proposes an architecture that uses a big data analytics mechanism to process huge social media datasets efficiently and logically. The proposed architecture is composed of three layers. The main objective of the project is to demonstrate Apache Spark's parallel processing and distributed framework technologies alongside other storage and processing mechanisms. Social media data generated by Dailymotion is used in this article to demonstrate the benefits of the architecture. The project utilized the Dailymotion application programming interface (API), which provides functions suitable for fetching and viewing information. An API key is generated to fetch public channel data in the form of text files. The Hive storage mechanism is utilized with Apache Spark for efficient data processing. The effectiveness of the proposed architecture is also highlighted.
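As a concrete illustration of the storage and processing layers described above, here is a minimal, hypothetical PySpark sketch; the file layout, schema, and table name are assumptions for illustration, not the paper's actual pipeline.

# Minimal sketch: loading API-fetched text files into Spark and persisting
# them as a Hive table for analysis. Paths, schema, and table name are
# illustrative assumptions.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("social-media-analytics-sketch")
         .enableHiveSupport()   # lets Spark read and write Hive tables
         .getOrCreate())

# Assume each line holds tab-separated channel statistics.
raw = spark.read.csv(
    "hdfs:///data/dailymotion/*.txt", sep="\t",
    schema="video_id STRING, title STRING, views LONG, likes LONG, comments LONG")

# Persist to Hive so downstream jobs can query the data with SQL.
raw.write.mode("overwrite").saveAsTable("dailymotion_channel_stats")

# Example query: the ten most-viewed videos.
spark.table("dailymotion_channel_stats").orderBy(F.desc("views")).limit(10).show()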


2021 ◽  
Vol 62 (1) ◽  
pp. 42-52
Author(s):  
Dung Mai Thi Nguyen ◽  
Thu Hoai Thi Vu ◽  

Spatial big data is large-scale and complex, and therefore cannot be collected, managed, and analyzed by traditional data analytics software in a timely manner. In many situations these platforms are restricted to vector data. However, the raster data generated by sensors on an enormous number of satellites now needs to be processed in parallel in a cluster environment. This article introduces a method for analyzing satellite image data using the RasterFrames library on the Apache Spark platform. The RasterFrames library processes raster data for Python, Scala, and SQL, bringing the power of Spark DataFrames to Earth Observation, cloud computing, and data science. In the experimental part, the NDVI and the change in its average value over a time series are calculated to demonstrate changes in the vegetation cover of Phu Tho province. These results serve as a reference data source for assessing weather, climate, and environmental changes in the study area during that period.
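For readers unfamiliar with RasterFrames, the following minimal sketch shows the shape of such an NDVI computation in PySpark. It assumes pyrasterframes is installed and that the red and near-infrared bands are available as separate rasters; the file paths and dates are placeholders, not the study's data.

# Minimal sketch: per-scene NDVI with RasterFrames on Spark.
# Paths and dates are placeholder assumptions.
from pyspark.sql import Row, functions as F
from pyrasterframes.utils import create_rf_spark_session
from pyrasterframes.rasterfunctions import rf_normalized_difference, rf_tile_mean

spark = create_rf_spark_session()

# Catalog: one row per acquisition date, with URIs of the red and NIR bands.
catalog = spark.createDataFrame([
    Row(date="2020-01-15", red="file:///data/B4_2020_01.tif",
        nir="file:///data/B5_2020_01.tif"),
    Row(date="2020-06-15", red="file:///data/B4_2020_06.tif",
        nir="file:///data/B5_2020_06.tif")])

df = spark.read.raster(catalog, catalog_col_names=["red", "nir"])

# NDVI = (NIR - RED) / (NIR + RED), computed tile by tile.
ndvi = df.withColumn("ndvi", rf_normalized_difference(F.col("nir"), F.col("red")))

# Mean NDVI per date, to track vegetation change over the time series.
ndvi.groupBy("date").agg(F.avg(rf_tile_mean(F.col("ndvi"))).alias("mean_ndvi")).show()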


Author(s):  
Yassir Samadi ◽  
Mostapha Zbakh ◽  
Amine Haouari

The size of the data used by enterprises has been growing at an exponential rate over the past few years, and handling such huge data from various sources is a challenge for businesses. In addition, Big Data has become one of the major areas of research for cloud service providers, due to the large amount of data produced every day and the inefficiency of traditional algorithms and technologies at handling it. To resolve these problems and to meet the increasing demand for high-speed, data-intensive computing, several solutions have been developed by researchers and developers. Among these solutions are cloud computing tools such as Hadoop MapReduce and Apache Spark, which work on the principles of parallel computing. This chapter focuses on how big data processing challenges can be handled using cloud computing frameworks, and on the importance of cloud computing for businesses.
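To ground the parallel computing principle these tools share, here is the canonical word count example as a minimal PySpark sketch; the input path is a placeholder assumption.

# Minimal sketch: word count, the classic example of the map/reduce style
# of parallelism behind both Hadoop MapReduce and Apache Spark.
# The input path is a placeholder assumption.
from operator import add
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()
sc = spark.sparkContext

counts = (sc.textFile("hdfs:///data/corpus/*.txt")  # partitioned across the cluster
          .flatMap(lambda line: line.split())       # map phase: emit words
          .map(lambda word: (word, 1))
          .reduceByKey(add))                        # reduce phase: sum per word

for word, n in counts.takeOrdered(10, key=lambda kv: -kv[1]):
    print(word, n)

spark.stop()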


2021 ◽  
Vol 19 (4) ◽  
pp. e49
Author(s):  
Anas Oujja ◽  
Mohamed Riduan Abid ◽  
Jaouad Boumhidi ◽  
Safae Bourhnane ◽  
Asmaa Mourhir ◽  
...  

Nowadays, genomic data constitutes one of the fastest-growing datasets in the world. By 2025, it is expected to become the fourth largest source of Big Data, thus mandating adequate high-performance computing (HPC) platforms for its processing. With the latest unprecedented and unpredictable mutations in severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the research community is in crucial need of ICT tools to process SARS-CoV-2 RNA data, e.g., by classifying it (i.e., clustering) and thus assisting in tracking virus mutations and predicting future ones. In this paper, we present an HPC-based SARS-CoV-2 RNA clustering tool. We adopt a data science approach, from data collection, through analysis, to visualization. In the analysis step, we show how our clustering approach leverages HPC and the longest common subsequence (LCS) algorithm. The approach uses the Hadoop MapReduce programming paradigm and adapts the LCS algorithm to efficiently compute the length of the LCS for each pair of SARS-CoV-2 RNA sequences, which are extracted from the U.S. National Center for Biotechnology Information (NCBI) Virus repository. The computed LCS lengths are used to measure the dissimilarities between RNA sequences in order to work out existing clusters. In addition, we present a comparative study of the LCS algorithm's performance under variable workloads and different numbers of Hadoop worker nodes.
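To make the core computation concrete, below is a minimal sketch of the pairwise LCS-length step written as a Hadoop Streaming mapper in Python. The input layout (one tab-separated pair of sequence IDs and sequences per line) and the script itself are illustrative assumptions, not the paper's actual implementation.

# Minimal sketch: LCS length per sequence pair, as a Hadoop Streaming mapper.
# The tab-separated input layout is an illustrative assumption.
import sys

def lcs_length(a: str, b: str) -> int:
    """Longest-common-subsequence length in O(len(a)*len(b)) time and
    O(len(b)) memory, keeping only two rows of the DP table."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if ca == cb
                        else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def main():
    # Each input line pairs two RNA sequences; emit the pair's LCS length,
    # which downstream steps convert into a dissimilarity for clustering.
    for line in sys.stdin:
        id_a, seq_a, id_b, seq_b = line.rstrip("\n").split("\t")
        print(f"{id_a},{id_b}\t{lcs_length(seq_a, seq_b)}")

if __name__ == "__main__":
    main()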


Author(s):  
Li Chen ◽  
Lala Aicha Coulibaly

Data science and big data analytics remain at the center of computer science and information technology. Students and researchers from outside computer science often find real data analytics difficult with programming languages such as Python and Scala, especially when they attempt to use Apache Spark in cloud computing environments (Spark Scala and PySpark). At the same time, students in information technology can find it difficult to deal with the mathematical background of data science algorithms. To overcome these difficulties, this chapter provides a practical guideline for different users in this area. The authors cover the main algorithms for data science and machine learning, including principal component analysis (PCA), support vector machines (SVM), k-means, k-nearest neighbors (kNN), regression, neural networks, and decision trees. A brief description of these algorithms is given, and related code is selected to fit both simple and real data sets. Some visualization methods, including 2D and 3D displays, are also presented in this chapter.
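As a taste of the guideline's style, here is a minimal PySpark sketch of one of the listed algorithms, k-means; the toy points and k=2 are illustrative assumptions.

# Minimal sketch: k-means in PySpark's ml library.
# The toy points and k=2 are illustrative assumptions.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.clustering import KMeans

spark = SparkSession.builder.appName("kmeans-sketch").getOrCreate()

df = spark.createDataFrame(
    [(1.0, 1.1), (0.9, 1.0), (8.0, 8.2), (8.1, 7.9)], ["x", "y"])

# MLlib estimators expect a single vector column of features.
features = VectorAssembler(inputCols=["x", "y"], outputCol="features").transform(df)

model = KMeans(k=2, seed=42).fit(features)
print(model.clusterCenters())     # the two learned centroids
model.transform(features).show()  # each point with its assigned cluster

spark.stop()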

