Using Apache Spark to collect analytics from streaming data processing application logs

Author(s):  
Golovanov Mikhail Evgenyevich ◽  
Bakulev Aleksandr Valerievich ◽  
Bakuleva Marina Alekseevna
2020 ◽  
Vol 76 (10) ◽  
pp. 7619-7634 ◽  
Author(s):  
Wen Xiao ◽  
Juan Hu

Abstract: Finding frequent itemsets in continuous streaming data is an important data mining task that is widely used in network monitoring, Internet of Things data analysis, and similar applications. In the era of big data, a distributed frequent itemset mining algorithm is needed to meet the demands of massive streaming data processing. Apache Spark is a unified analytics engine for massive data processing that has been used successfully in many data mining fields. In this paper, we propose SWEclat, a distributed algorithm for mining frequent itemsets over massive streaming data. The algorithm uses a sliding window to process the streaming data and a vertical data structure to store the dataset in the sliding window. It is implemented on Apache Spark and uses Spark RDDs to store both the streaming data and the dataset in vertical data format, so that these RDDs can be divided into partitions for distributed processing. Experimental results show that the SWEclat algorithm has good acceleration, parallel scalability, and load balancing.
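The two ideas named in this abstract, a sliding window over the stream and a vertical (item to transaction-id set) layout mined Eclat-style, can be illustrated with a small single-machine sketch. This is not the authors' SWEclat implementation and it does not use Spark: the function names, the count-based window, and the choice to re-mine each full window from scratch are all assumptions made for illustration.

```python
from collections import deque


def vertical_format(window):
    """Map each item to the set of transaction ids (tidset) that contain it."""
    tidsets = {}
    for tid, transaction in window:
        for item in transaction:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def eclat(prefix, items, min_support, out):
    """Depth-first Eclat: grow `prefix` by intersecting tidsets."""
    for i, (item, tids) in enumerate(items):
        if len(tids) < min_support:
            continue  # infrequent, so no superset can be frequent either
        itemset = prefix + (item,)
        out[itemset] = len(tids)
        # Candidate extensions: intersect with every later item's tidset.
        suffix = [(other, tids & other_tids) for other, other_tids in items[i + 1:]]
        eclat(itemset, suffix, min_support, out)


def mine_window(window, min_support):
    """Return {itemset: support} for the transactions currently in the window."""
    items = sorted(vertical_format(window).items(), key=lambda kv: kv[0])
    out = {}
    eclat((), items, min_support, out)
    return out


def sliding_window_mine(stream, window_size, min_support):
    """Yield the frequent itemsets of each full window (hypothetical helper)."""
    window = deque(maxlen=window_size)  # the oldest transaction drops out automatically
    for tid, transaction in enumerate(stream):
        window.append((tid, frozenset(transaction)))
        if len(window) == window_size:
            yield mine_window(window, min_support)


if __name__ == "__main__":
    stream = [["a", "b"], ["a", "c"], ["a", "b", "c"], ["b", "c"]]
    for result in sliding_window_mine(stream, window_size=3, min_support=2):
        print(result)
```

In a distributed setting, the per-item tidsets are the natural unit to partition, which is presumably why the abstract stresses storing the vertical format in RDDs; the sketch above leaves that part out.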


2020 ◽  
Vol 245 ◽  
pp. 05020
Author(s):  
Vardan Gyurjyan ◽  
Sebastian Mancilla

The hardware landscape used in HEP and NP is changing from homogeneous multi-core systems towards heterogeneous systems with many different computing units, each with its own characteristics. To achieve maximum data processing performance, the main challenge is to place the right computation on the right hardware. In this paper, we discuss the CLAS12 charged-particle tracking workflow orchestration, which allows us to use both CPU and GPU to improve performance. The tracking application algorithm was decomposed into micro-services that are deployed on CPU and GPU processing units, where the best features of both are combined to achieve maximum performance. In this heterogeneous environment, CLARA aims to match the requirements of each micro-service to the strengths of a CPU or a GPU architecture. A predefined execution of a micro-service on a CPU or a GPU may not be optimal, because of the streaming data-quantum size and the data-quantum transfer latency between CPU and GPU. The CLARA workflow orchestrator is therefore designed to assign micro-service execution to a CPU or a GPU dynamically, based on online benchmark results analyzed over a period of real-time data processing.
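The dynamic assignment described here can be pictured as a dispatcher that keeps recent timing samples per micro-service and device, then routes the next data quantum to whichever device has been faster. The sketch below is a hypothetical illustration only: `BenchmarkDispatcher`, its methods, and the rolling-mean policy are assumptions, not CLARA's API or its actual scheduling logic.

```python
from collections import defaultdict, deque


class BenchmarkDispatcher:
    """Hypothetical dispatcher that picks a device from recent timing samples."""

    def __init__(self, devices=("cpu", "gpu"), history=5):
        self.devices = devices
        # Keep only the last `history` samples per (service, device) pair, so
        # the decision follows current conditions rather than stale ones.
        self.samples = defaultdict(lambda: deque(maxlen=history))

    def record(self, service, device, seconds):
        """Store one measured run time; it should include data transfer time."""
        self.samples[(service, device)].append(seconds)

    def mean_latency(self, service, device):
        """Mean of the retained samples, or None if nothing was measured yet."""
        window = self.samples[(service, device)]
        return sum(window) / len(window) if window else None

    def choose(self, service):
        """Return the device the next data quantum should run on."""
        # Benchmark any device that has no measurements yet.
        for device in self.devices:
            if self.mean_latency(service, device) is None:
                return device
        return min(self.devices, key=lambda d: self.mean_latency(service, d))


if __name__ == "__main__":
    dispatcher = BenchmarkDispatcher(history=2)
    dispatcher.record("tracking", "cpu", 0.5)
    dispatcher.record("tracking", "gpu", 0.2)
    print(dispatcher.choose("tracking"))
```

Recording end-to-end time per data quantum, rather than kernel time alone, is what lets a policy like this account for the CPU-to-GPU transfer latency that the abstract identifies as a deciding factor.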


2022 ◽  
pp. 1162-1191
Author(s):  
Dinesh Chander ◽  
Hari Singh ◽  
Abhinav Kirti Gupta

Data processing has become an important field in today's big-data-dominated world. Data is being generated at a tremendous pace from many different sources. The nature of data has shifted from batch data to streaming data, and data processing methodologies have changed accordingly. Traditional SQL is no longer capable of dealing with this big data. This chapter describes the nature of data and the various tools, techniques, and technologies used to handle big data. It also describes the need to move big data to the cloud and the challenges of big data processing in the cloud, the migration from data processing to data analytics, the tools used in data analytics, and the issues and challenges in data processing and analytics. Finally, the chapter touches on an important application area of streaming data, sentiment analysis, and explores it through test case demonstrations and results.


2019 ◽  
Vol 158 (1) ◽  
pp. 37 ◽  
Author(s):  
Petar Zečević ◽  
Colin T. Slater ◽  
Mario Jurić ◽  
Andrew J. Connolly ◽  
Sven Lončarić ◽  
...  

2014 ◽  
Vol 91 (11) ◽  
pp. 2002-2004 ◽  
Author(s):  
Qiyue Li ◽  
Zhiwei Chen ◽  
Zhiping Yan ◽  
Cheng Wang ◽  
Zhong Chen
