Using Apache Spark to collect analytic from the streaming data processing application logs

Abstract: Finding frequent itemsets in continuous streaming data is an important data mining task that is widely used in network monitoring, Internet of Things data analysis, and similar applications. In the era of big data, a distributed frequent itemset mining algorithm is needed to handle massive streaming data. Apache Spark is a unified analytics engine for large-scale data processing that has been used successfully in many data mining fields. In this paper, we propose SWEclat, a distributed algorithm for mining frequent itemsets over massive streaming data. The algorithm processes streaming data with a sliding window and stores the dataset in the window in a vertical data structure. It is implemented on Apache Spark and uses Spark RDDs to hold both the streaming data and the vertical-format dataset, so that the RDDs can be divided into partitions for distributed processing. Experimental results show that SWEclat achieves good speedup, parallel scalability, and load balancing.
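The two ideas named in the abstract, a sliding window and a vertical (item to transaction-id set) layout, can be illustrated with a small single-machine sketch. This is not the authors' SWEclat implementation and does not use Spark; the class name `SlidingWindowEclat`, its parameters, and the count-based window are assumptions made for illustration only. It applies the classic Eclat idea of intersecting transaction-id sets to the transactions currently in the window.

```python
from collections import defaultdict, deque


class SlidingWindowEclat:
    """Hypothetical single-machine sketch: Eclat-style mining over a sliding window."""

    def __init__(self, window_size, min_support):
        self.window_size = window_size    # number of transactions kept (assumed count-based)
        self.min_support = min_support    # absolute support threshold
        self.window = deque()             # (tid, items) pairs currently in the window
        self.tidsets = defaultdict(set)   # vertical format: item -> set of transaction ids
        self.next_tid = 0

    def add(self, transaction):
        """Append one transaction and evict the oldest one if the window overflows."""
        items = frozenset(transaction)
        tid = self.next_tid
        self.next_tid += 1
        self.window.append((tid, items))
        for item in items:
            self.tidsets[item].add(tid)
        if len(self.window) > self.window_size:
            old_tid, old_items = self.window.popleft()
            for item in old_items:
                self.tidsets[item].discard(old_tid)
                if not self.tidsets[item]:
                    del self.tidsets[item]

    def frequent_itemsets(self):
        """Return {itemset tuple: support} for the current window contents."""
        frequent_items = sorted(
            ((item, tids) for item, tids in self.tidsets.items()
             if len(tids) >= self.min_support),
            key=lambda pair: pair[0],
        )
        result = {}
        self._extend((), frequent_items, result)
        return result

    def _extend(self, prefix, candidates, result):
        """Depth-first Eclat step: grow the prefix by intersecting tidsets."""
        for i, (item, tids) in enumerate(candidates):
            itemset = prefix + (item,)
            result[itemset] = len(tids)
            next_candidates = []
            for other, other_tids in candidates[i + 1:]:
                common = tids & other_tids
                if len(common) >= self.min_support:
                    next_candidates.append((other, common))
            if next_candidates:
                self._extend(itemset, next_candidates, result)


if __name__ == "__main__":
    miner = SlidingWindowEclat(window_size=3, min_support=2)
    for t in (["a", "b"], ["a", "c"], ["a", "b", "c"], ["b", "c"]):
        miner.add(t)
    print(miner.frequent_itemsets())
```

Keeping the data in vertical form means that the support of an itemset is just the size of a set intersection, and evicting an old transaction only touches the tidsets of the items it contained. A distributed version would additionally have to decide how to partition the tidsets, which this sketch does not address.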


Towards automated Laue data processing: application to the choice of the optimal X-ray spectrum

Acta Crystallographica Section A Foundations of Crystallography ◽

10.1107/s0108767300026295 ◽

2000 ◽

Vol 56 (s1) ◽

pp. s296-s296 ◽

Cited By ~ 1

Author(s):

D. Bourgeois ◽

U. Wagner ◽

M. Wulff

Keyword(s):

Data Processing ◽

Processing Application ◽

X Ray


Some Research Issues of Harmful and Violent Content Filtering for Social Networks in the Context of Large-Scale and Streaming Data with Apache Spark

Recent Advances in Security, Privacy, and Trust for Internet of Things (IoT) and Cyber-Physical Systems (CPS) ◽

10.1201/9780429270567-11 ◽

2020 ◽

pp. 249-272

Author(s):

Phuc Do ◽

Phu Pham ◽

Trung Phan

Keyword(s):

Social Networks ◽

Large Scale ◽

Streaming Data ◽

Apache Spark ◽

Content Filtering ◽

Research Issues ◽

Violent Content


Heterogeneous data-processing optimization with CLARA’s adaptive workflow orchestrator

EPJ Web of Conferences ◽

10.1051/epjconf/202024505020 ◽

2020 ◽

Vol 245 ◽

pp. 05020

Author(s):

Vardan Gyurjyan ◽

Sebastian Mancilla

Keyword(s):

Data Processing ◽

Heterogeneous Systems ◽

Optimal Solution ◽

Heterogeneous Data ◽

Streaming Data ◽

Time Data ◽

Maximum Performance ◽

Adaptive Workflow ◽

Main Challenge ◽

The Right

The hardware landscape used in HEP and NP is changing from homogeneous multi-core systems towards heterogeneous systems with many different computing units, each with its own characteristics. To achieve maximum performance in data processing, the main challenge is to place the right computation on the right hardware. In this paper, we discuss workflow orchestration for CLAS12 charged-particle tracking that allows us to use both CPUs and GPUs to improve performance. The tracking algorithm was decomposed into micro-services that are deployed on CPU and GPU processing units, where the best features of both are combined to achieve maximum performance. In this heterogeneous environment, CLARA aims to match the requirements of each micro-service to the strengths of a CPU or a GPU architecture. A predefined assignment of a micro-service to a CPU or a GPU may not be optimal because of the streaming data-quantum size and the data-quantum transfer latency between CPU and GPU. The CLARA workflow orchestrator is therefore designed to assign micro-service execution to a CPU or a GPU dynamically, based on online benchmark results analyzed over a period of real-time data processing.
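The idea of choosing a device from online benchmark results can be sketched as follows. This is a hypothetical illustration, not CLARA's orchestrator or API: the class `AdaptiveDispatcher`, its parameters, and the policy (rolling mean of recent timings, with an occasional re-probe of a slower device) are all assumptions.

```python
from collections import deque


class AdaptiveDispatcher:
    """Hypothetical sketch: pick the device with the lowest recent mean latency."""

    def __init__(self, devices=("cpu", "gpu"), history=20, probe_every=10):
        # Rolling window of measured execution times per device (seconds).
        self.samples = {d: deque(maxlen=history) for d in devices}
        self.probe_every = probe_every   # how often a slower device is re-measured
        self.calls = 0

    def record(self, device, seconds):
        """Store one measured execution time for a device."""
        self.samples[device].append(seconds)

    def mean(self, device):
        """Mean of the recent timings for a device, or None if unmeasured."""
        s = self.samples[device]
        return sum(s) / len(s) if s else None

    def choose(self):
        """Return the device on which the next data quantum should run."""
        self.calls += 1
        # Benchmark every device at least once before comparing them.
        for device, s in self.samples.items():
            if not s:
                return device
        best = min(self.samples, key=self.mean)
        # Periodically re-probe a non-best device so its benchmark stays current.
        others = [d for d in self.samples if d != best]
        if others and self.calls % self.probe_every == 0:
            return others[0]
        return best


if __name__ == "__main__":
    dispatcher = AdaptiveDispatcher(history=3, probe_every=100)
    first = dispatcher.choose()
    dispatcher.record(first, 0.5)
    second = dispatcher.choose()
    dispatcher.record(second, 0.1)
    print(first, second, dispatcher.choose())
```

In use, the caller would time each micro-service execution (including any data transfer) and pass the result to `record`, so that a change in data-quantum size or transfer latency shifts later choices.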


A Study of Big Data Processing for Sentiments Analysis

10.4018/978-1-6684-3662-2.ch056 ◽

2022 ◽

pp. 1162-1191

Author(s):

Dinesh Chander ◽

Hari Singh ◽

Abhinav Kirti Gupta

Keyword(s):

Big Data ◽

Data Processing ◽

Data Analytics ◽

Important Application ◽

Streaming Data ◽

Test Case ◽

Big Data Processing ◽

Important Field ◽

Batch Data ◽

Different Sources

Data processing has become an important field in today's big-data-dominated world. Data is being generated at a tremendous pace from different sources. The nature of data has shifted from batch data to streaming data, and data processing methodologies have changed as a result. Traditional SQL is no longer capable of dealing with this big data. This chapter describes the nature of data and the tools, techniques, and technologies used to handle big data. It also describes the need to shift big data to the cloud and the challenges of big data processing in the cloud, the migration from data processing to data analytics, the tools used in data analytics, and the issues and challenges in data processing and analytics. The chapter then touches on an important application area of streaming data, sentiment analysis, and explores it through test case demonstrations and results.
