s2p: Provenance Research for Stream Processing System

Qian Ye; Minyan Lu

doi:10.3390/app11125523

s2p: Provenance Research for Stream Processing System

Applied Sciences ◽

10.3390/app11125523 ◽

2021 ◽

Vol 11 (12) ◽

pp. 5523

Author(s):

Qian Ye ◽

Minyan Lu

Keyword(s):

Data Storage ◽

Stream Processing ◽

Processing System ◽

Coarse Grained ◽

Stream Data ◽

Related Data ◽

DSP System ◽

Provenance Research ◽

DSP Systems ◽

Abnormal Results

The main purpose of our provenance research for DSP (distributed stream processing) systems is to analyze abnormal results. Provenance for these systems is nontrivial because of the ephemerality of stream data and the instant data processing mode of modern DSP systems. Challenges include, but are not limited to, avoiding excessive runtime overhead, reducing provenance-related data storage, and providing provenance in an easy-to-use fashion. Without any prior knowledge about which kinds of data may finally lead to abnormal results, we have to track all transformations in detail, which potentially places a heavy burden on the system. This paper proposes s2p (Stream Process Provenance), which mainly consists of online provenance and offline provenance, to provide fine- and coarse-grained provenance at different levels of precision. We base our design of s2p on the fact that, for a mature online DSP system, abnormal results are rare, and results that require a detailed analysis are even rarer. We also consider state transition in our provenance explanation. We implement s2p on Apache Flink as s2p-flink and conduct three experiments to evaluate its scalability, efficiency, and overhead in terms of end-to-end cost, throughput, and space overhead. Our evaluation shows that s2p-flink incurs a 13% to 32% cost overhead, an 11% to 24% decline in throughput, and little additional space cost in the online provenance phase. Experiments also demonstrate that s2p-flink scales well. A case study is presented to demonstrate the feasibility of the whole s2p solution.
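To make concrete what "tracking all transformations in detail" involves, the following is a minimal sketch of annotation-based lineage propagation. It is not the s2p or s2p-flink implementation: the `Record` type and the `source`, `map_op`, `filter_op`, and `window_sum` helpers are hypothetical names chosen for illustration. Every record carries the ids of the source records it derives from, which hints at why fine-grained tracking on every record can become costly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    value: float
    lineage: frozenset  # ids of the source records this value derives from


def source(values):
    """Wrap raw values, tagging each with its own position as its lineage."""
    return [Record(v, frozenset({i})) for i, v in enumerate(values)]


def map_op(records, fn):
    """One-to-one transformation: lineage is carried over unchanged."""
    return [Record(fn(r.value), r.lineage) for r in records]


def filter_op(records, pred):
    """Dropped records simply vanish from downstream lineage."""
    return [r for r in records if pred(r.value)]


def window_sum(records, size):
    """Many-to-one transformation: output lineage is the union of the inputs."""
    out = []
    for start in range(0, len(records), size):
        window = records[start:start + size]
        lineage = frozenset().union(*(r.lineage for r in window))
        out.append(Record(sum(r.value for r in window), lineage))
    return out


if __name__ == "__main__":
    recs = source([1.0, 2.0, 3.0, 4.0, 5.0])
    scaled = map_op(recs, lambda v: v * 10)
    kept = filter_op(scaled, lambda v: v != 30)
    for result in window_sum(kept, 2):
        print(result.value, sorted(result.lineage))
```

Given an abnormal output, its `lineage` set names exactly the inputs to inspect. The price is extra metadata attached to every record, which is the kind of overhead the abstract describes as a motivation for separating online and offline provenance.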

Load adaptive and fault tolerant distributed stream processing system for explosive stream data

2016 18th International Conference on Advanced Communication Technology (ICACT) ◽

10.1109/icact.2016.7423612 ◽

2016 ◽

Author(s):

Myungcheol Lee ◽

Miyoung Lee ◽

Sung Jin Hur ◽

Ikkyun Kim

Keyword(s):

Fault Tolerant ◽

Stream Processing ◽

Processing System ◽

Stream Data ◽

Distributed Stream Processing

An efficient approach for low latency processing in stream data

PeerJ Computer Science ◽

10.7717/peerj-cs.426 ◽

2021 ◽

Vol 7 ◽

pp. e426

Author(s):

Nirav Bhatt ◽

Amit Thakkar

Keyword(s):

Stock Market ◽

Stream Processing ◽

Big Data Analytics ◽

Processing System ◽

Window Size ◽

Arrival Rate ◽

Low Latency ◽

Stream Data ◽

Distributed Environment ◽

Real World Application

Stream data is data that is generated continuously from different data sources and is ideally defined as data that has no discrete beginning or end. Processing stream data is a part of big data analytics that aims at querying the continuously arriving data and extracting meaningful information from the stream. Although such streams were earlier processed using batch analytics, there are now applications, such as the stock market, patient monitoring, and traffic analysis, in which it makes a drastic difference if the output is generated on the scale of hours or minutes. The primary goal of any real-time stream processing system is to process the stream data as soon as it arrives. Correspondingly, analytics of the stream data also needs to consider surrounding dependent data. For example, stock market analytics results are often useless if we do not consider the associated or dependent parameters that affect the result. In a real-world application, these dependent stream data usually arrive from a distributed environment. Hence, a stream processing system has to be designed that can deal with delays in the arrival of such data from distributed sources. We have designed a stream processing model that can deal with all the possible latency and provide an end-to-end low-latency system. We have performed stock market prediction by considering affecting parameters, such as USD, oil price, and gold price, with an equal arrival rate. We have calculated the Normalized Root Mean Square Error (NRMSE), which simplifies the comparison among models with different scales. A comparative analysis of the experiment presented in the report shows a significant improvement in the result when considering the affecting parameters. In this work, we have used a statistical approach to forecast the probability of latency in data arriving from distributed sources. Moreover, we have performed preprocessing of stream data to ensure at-least-once delivery semantics. To provide low latency in processing, we have also implemented exactly-once processing semantics. Extensive experiments have been performed with varying window sizes and data arrival rates. We have concluded that system latency can be reduced when the window size is equal to the data arrival rate.
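The NRMSE mentioned in the abstract can be sketched as follows. The abstract does not say which normalization the authors use, so this example assumes normalization by the range of the observed values (dividing by the mean or standard deviation are other common conventions). The function name `nrmse` and the sample values are illustrative only.

```python
import math


def nrmse(observed, predicted):
    """RMSE divided by the range of the observed values (assumed convention)."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    value_range = max(observed) - min(observed)
    if value_range == 0:
        raise ValueError("observed values have zero range")
    return math.sqrt(mse) / value_range


if __name__ == "__main__":
    observed = [1.0, 2.0, 3.0, 4.0, 5.0]
    predicted = [1.0, 2.0, 3.0, 4.0, 7.0]
    # The same series expressed in a unit 100 times larger.
    observed_scaled = [100 * v for v in observed]
    predicted_scaled = [100 * v for v in predicted]
    print(nrmse(observed, predicted))
    print(nrmse(observed_scaled, predicted_scaled))
```

Both calls print essentially the same number, because rescaling the data scales the RMSE and the range by the same factor. This is the property that makes the metric convenient for comparing models whose targets are on different scales.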

Load adaptive distributed stream processing system for explosive stream data

2015 17th International Conference on Advanced Communication Technology (ICACT) ◽

10.1109/icact.2015.7224896 ◽

2015 ◽

Cited By ~ 1

Author(s):

Myungcheol Lee ◽

Miyoung Lee ◽

Sung Jin Hur ◽

Ikkyun Kim

Keyword(s):

Stream Processing ◽

Processing System ◽

Stream Data ◽

Distributed Stream Processing

Load adaptive and fault tolerant distributed stream processing system for explosive stream data

2016 18th International Conference on Advanced Communication Technology (ICACT) ◽

10.1109/icact.2016.7423613 ◽

2016 ◽

Author(s):

Myungcheol Lee ◽

Miyoung Lee ◽

Sung Jin Hur ◽

Ikkyun Kim

Keyword(s):

Fault Tolerant ◽

Stream Processing ◽

Processing System ◽

Stream Data ◽

Distributed Stream Processing

Architecture of a stream processing system

Fundamentals of Stream Processing ◽

10.1017/cbo9781139058940.009 ◽

2014 ◽

pp. 203-217

Author(s):

Henrique Andrade ◽

Bugra Gedik ◽

Deepak Turaga

Keyword(s):

Stream Processing ◽

Processing System

Design of Cloud Data Storage and Processing System

2018 International Conference on Big Data and Artificial Intelligence (BDAI) ◽

10.1109/bdai.2018.8546667 ◽

2018 ◽

Author(s):

Baoke Zhou ◽

Wusheng Chou

Keyword(s):

Data Storage ◽

Processing System ◽

Cloud Data ◽

Cloud Data Storage

Special phase mask and related data format for page-based holographic data storage systems

Optical Review ◽

10.1007/s10043-009-0115-3 ◽

2009 ◽

Vol 16 (6) ◽

pp. 583-586 ◽

Cited By ~ 1

Author(s):

Frank Przygodda ◽

Joachim Knittel ◽

Oliver Malki ◽

Heiko Trautner ◽

Hartmut Richter

Keyword(s):

Data Storage ◽

Storage Systems ◽

Phase Mask ◽

Holographic Data Storage ◽

Data Format ◽

Related Data

Integrating fault-tolerance and elasticity in a distributed data stream processing system

Proceedings of the 26th International Conference on Scientific and Statistical Database Management - SSDBM '14 ◽

10.1145/2618243.2618288 ◽

2014 ◽

Cited By ~ 7

Author(s):

Kasper Grud Skat Madsen ◽

Philip Thyssen ◽

Yongluan Zhou

Keyword(s):

Fault Tolerance ◽

Data Stream ◽

Stream Processing ◽

Processing System ◽

Distributed Data ◽

Data Stream Processing

Efficient Sensor Stream Data Processing System to use Cache Technique for Ubiquitous Sensor Network Application Service

Journal of Computer Science ◽

10.3844/jcssp.2012.333.336 ◽

2012 ◽

Vol 8 (3) ◽

pp. 333-336 ◽

Cited By ~ 2

Author(s):

Keyword(s):

Data Processing ◽

Sensor Network ◽

Processing System ◽

Stream Data ◽

Data Processing System ◽

Application Service ◽

Network Application ◽

Stream Data Processing

A Containerized Approach for Allocating Distributed Stream Queries to Fog Nodes

10.36227/techrxiv.14151650.v1 ◽

2021 ◽

Author(s):

Hamed Hasibi ◽

Saeed Sedighian Kashi

Keyword(s):

Fog Computing ◽

Stream Processing ◽

Stream Data ◽

Process Data ◽

Stream Query Processing ◽

Tremendous Amount ◽

Stream Processing Engines ◽

IoT Devices ◽

Distributed Stream Processing

Fog computing brings cloud capabilities closer to Internet of Things (IoT) devices. IoT devices generate a tremendous amount of stream data that flows toward the cloud via hierarchical fog nodes. To process data streams, many Stream Processing Engines (SPEs) have been developed. Without the fog layer, stream query processing executes on the cloud, which forwards much traffic toward the cloud. When a hierarchical fog layer is available, a complex query can be divided into simple queries that run on fog nodes by using distributed stream processing. In this paper, we propose an approach to assign stream queries to fog nodes using container technology. We name this approach Stream Queries Placement in Fog (SQPF). Our goal is to minimize end-to-end delay to achieve a better quality of service. First, in the emulation step, we create Docker container instances from SPEs and evaluate their processing delay and throughput under different resource configurations and queries with varying input rates. Then, in the placement step, we assign queries among fog nodes by using a genetic algorithm. The practical approach used in SQPF achieves a near-optimal assignment based on the lowest application deadline in real scenarios, and the evaluation results support this.
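The placement step described in this abstract can be illustrated with a generic genetic algorithm. This is a minimal sketch, not SQPF itself: the cost model in `total_delay` (each node processes its queries sequentially, so a query waits for the node's whole load), the function `place_queries`, and all parameter values are assumptions made for the example. In the paper, delays come from measurements of containerized SPEs.

```python
import random


def total_delay(assignment, query_load, node_speed):
    """Hypothetical cost: a query's delay is its node's total load / speed."""
    load = [0.0] * len(node_speed)
    for query, node in enumerate(assignment):
        load[node] += query_load[query]
    return sum(load[node] / node_speed[node] for node in assignment)


def place_queries(query_load, node_speed, pop_size=30, generations=100,
                  mutation_rate=0.1, seed=0):
    """Search for a query-to-node assignment with low total delay."""
    rng = random.Random(seed)
    n_queries, n_nodes = len(query_load), len(node_speed)

    def cost(assignment):
        return total_delay(assignment, query_load, node_speed)

    # Each individual maps query index -> node index.
    population = [[rng.randrange(n_nodes) for _ in range(n_queries)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=cost)
        survivors = population[: pop_size // 2]  # keep the better half
        children = []
        while len(survivors) + len(children) < pop_size:
            parent_a, parent_b = rng.sample(survivors, 2)
            cut = rng.randrange(1, n_queries) if n_queries > 1 else 0
            child = parent_a[:cut] + parent_b[cut:]  # one-point crossover
            for i in range(n_queries):
                if rng.random() < mutation_rate:  # random reassignment
                    child[i] = rng.randrange(n_nodes)
            children.append(child)
        population = survivors + children
    return min(population, key=cost)


if __name__ == "__main__":
    loads = [4.0, 2.0, 1.0, 3.0, 5.0, 2.0]
    speeds = [1.0, 2.0, 4.0]
    best = place_queries(loads, speeds)
    print(best, total_delay(best, loads, speeds))
```

Keeping the better half of each generation means the best assignment found never gets worse between generations. A fixed `seed` makes runs repeatable, which helps when comparing parameter settings.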