Towards autoscaling of Apache Flink jobs

Abstract: Data stream processing has been gaining attention in the past decade. Apache Flink is an open-source distributed stream processing engine that is able to process a large amount of data in real time with low latency. Computations are distributed among a cluster of nodes. Currently, provisioning the appropriate amount of cloud resources must be done manually ahead of time. A dynamically varying workload may exceed the capacity of the cluster, or leave resources underutilized. In our paper, we describe an architecture that enables the automatic scaling of Flink jobs on Kubernetes based on custom metrics, and describe a simple scaling policy. We also measure the effects of state size and target parallelism on the duration of the scaling operation, which must be considered when designing an autoscaling policy, so that the Flink job respects a Service Level Agreement.
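The abstract does not spell out its scaling policy, so the following is only a minimal sketch of what a metric-driven policy might look like. It borrows the proportional rule documented for the Kubernetes Horizontal Pod Autoscaler (desired = ceil(current × metric / target)). The function name, the parallelism bounds, and the tolerance band are illustrative assumptions, not details taken from the paper.

```python
import math


def desired_parallelism(current: int, metric: float, target: float,
                        min_p: int = 1, max_p: int = 32,
                        tolerance: float = 0.1) -> int:
    """Proportional scaling rule in the style of the Kubernetes HPA.

    `metric` is the observed per-task value of a custom metric (for example
    records per second per task) and `target` is the value we want each task
    to handle. All names and defaults here are hypothetical.
    """
    if current < 1 or target <= 0:
        raise ValueError("current must be >= 1 and target must be > 0")
    ratio = metric / target
    # Skip rescaling when the metric is close to the target, because every
    # rescale of a stateful job costs time that grows with state size.
    if abs(ratio - 1.0) <= tolerance:
        return current
    # Scale proportionally, then clamp to the allowed range.
    return max(min_p, min(max_p, math.ceil(current * ratio)))
```

Because the duration of a scaling operation depends on state size and target parallelism, a real policy would probably also enforce a cool-down period between rescales; that is left out of the sketch.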

SLA-Based Adaptation Schemes in Distributed Stream Processing Engines

Applied Sciences, 2019, Vol 9 (6), pp. 1045. DOI: 10.3390/app9061045. Cited by: ~2.

Author(s): Muhammad Hanif, Eunsam Kim, Sumi Helal, Choonhwa Lee

Keyword(s): Distributed Processing, Service Level Agreement, Stream Processing, Big Data Analytics, Service Level, Streaming Data, Adaptive Watermarking, The Status, Distributed Stream Processing, Cloud Applications

With the upswing in the volume of data and online information and the growth of large-scale cloud applications, big data analytics has become mainstream in both industrial and academic research communities. This has prompted the emergence and development of real-time distributed stream processing frameworks, such as Flink, Storm, Spark, and Samza. These frameworks allow complex queries on streaming data to be distributed across multiple worker nodes in a cluster. A few of these stream processing frameworks provide fundamental support for controlling the latency and throughput of the system as well as the correctness of the results. However, none can adjust them on the fly at runtime. We present a well-informed and efficient adaptive watermarking and dynamic buffering timeout mechanism for distributed streaming frameworks. It is designed to increase the overall throughput of the system by making the watermarks adapt to the incoming workload stream and by scaling the buffering timeout dynamically for each task tracker on the fly, while maintaining the Service Level Agreement (SLA)-based end-to-end latency of the system. This work focuses on tuning the parameters of the system (such as window correctness, buffering timeout, and so on) based on the prediction of incoming workloads, and assesses whether a given workload will breach an SLA using output metrics including latency, throughput, and correctness of both intermediate and final results. We used Apache Flink as our testbed distributed processing engine for this work. However, the proposed mechanism can be applied to other streaming frameworks as well. Our results on the testbed model indicate that the proposed system outperforms the status quo of stream processing. With the inclusion of learning models such as naïve Bayes, multilayer perceptron (MLP), and sequential minimal optimization (SMO), the system shows further improvement in keeping the SLA intact as well as in quality of service (QoS).
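To make the idea of an adaptive watermark more concrete, here is a small, self-contained sketch. It is not the authors' mechanism, which relies on workload prediction and learning models. It only illustrates the general principle: a watermark is usually the largest event timestamp seen so far minus an out-of-orderness bound, and here that bound is taken from a high quantile of recently observed lateness instead of being fixed. The class name, window size, and quantile are assumptions made for illustration.

```python
from collections import deque
from typing import Optional


class AdaptiveWatermark:
    """Watermark generator whose lateness bound follows the observed stream."""

    def __init__(self, window: int = 100, quantile: float = 0.95) -> None:
        self.lateness = deque(maxlen=window)  # recent lateness samples
        self.quantile = quantile
        self.max_ts: Optional[int] = None     # largest event timestamp so far
        self.delay = 0                        # current out-of-orderness bound
        self.watermark: Optional[int] = None

    def observe(self, event_ts: int) -> int:
        """Record one event timestamp and return the current watermark."""
        if self.max_ts is None or event_ts > self.max_ts:
            self.max_ts = event_ts
        # Lateness is how far this event lags behind the newest one seen.
        self.lateness.append(self.max_ts - event_ts)
        ordered = sorted(self.lateness)
        idx = min(len(ordered) - 1, int(self.quantile * len(ordered)))
        self.delay = ordered[idx]
        candidate = self.max_ts - self.delay
        # A watermark must never move backwards.
        if self.watermark is None or candidate > self.watermark:
            self.watermark = candidate
        return self.watermark
```

A larger bound lets more late events into the correct window at the price of higher latency, which is the trade-off the abstract describes tuning against the SLA.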

Dragon: A Lightweight, High Performance Distributed Stream Processing Engine

2020 IEEE 40th International Conference on Distributed Computing Systems (ICDCS), 2020. DOI: 10.1109/icdcs47774.2020.00177.

Author(s): Aaron Harwood, Maria Rodriguez Read, Gayashan Niroshana Amarasinghe

Keyword(s): High Performance, Stream Processing, Distributed Stream Processing, Processing Engine

Modeling Data Stream Intensity in Distributed Stream Processing System

Computer Networks - Communications in Computer and Information Science, 2013, pp. 372-383. DOI: 10.1007/978-3-642-38865-1_38.

Author(s): Marcin Gorawski, Pawel Marks, Michal Gorawski

Keyword(s): Data Stream, Stream Processing, Processing System, Modeling Data, Distributed Stream Processing

A Distributed Stream Processing Middleware Framework for Real-Time Analysis of Heterogeneous Data on Big Data Platform: Case of Environmental Monitoring

Sensors, 2020, Vol 20 (11), pp. 3166. DOI: 10.3390/s20113166.

Author(s): Adeyinka Akanbi, Muthoni Masinde

Keyword(s): Big Data, Environmental Monitoring, Real Time, Stream Processing, Heterogeneous Data, Legacy Systems, Time Analysis, Real Time Analysis, Distributed Stream Processing, Processing Engine

In recent years, the application and wide adoption of Internet of Things (IoT)-based technologies have increased the proliferation of monitoring systems, which has in turn exponentially increased the amount of heterogeneous data generated. Processing and analysing the massive amount of data produced is cumbersome and is gradually moving from the classical 'batch' processing technique of extract, transform, load (ETL) to real-time processing. For instance, in the environmental monitoring and management domain, time-series data and historical datasets are crucial for prediction models. However, the environmental monitoring domain still utilises legacy systems, which complicates real-time analysis of the essential data and integration with big data platforms, and perpetuates reliance on batch processing. Herein, as a solution, a distributed stream processing middleware framework for real-time analysis of heterogeneous environmental monitoring and management data is presented and tested on a cluster using open source technologies in a big data environment. The system ingests datasets from legacy systems and sensor data from heterogeneous automated weather systems, irrespective of the data types, into Apache Kafka topics using Kafka Connect APIs for processing by the Kafka stream processing engine. The stream processing engine executes the predictive numerical models and algorithms represented in event processing (EP) languages for real-time analysis of the data streams. To prove the feasibility of the proposed framework, we implemented the system using a case study scenario of drought prediction and forecasting based on the Effective Drought Index (EDI) model. Firstly, we transform the predictive model into a form that can be executed by the streaming engine for real-time computing. Secondly, the model is applied to the ingested data streams and datasets to predict drought through persistent querying of the infinite streams to detect anomalies. In conclusion, a performance evaluation of the distributed stream processing middleware infrastructure is carried out to determine the real-time effectiveness of the framework.

Data stream prediction in distributed stream processing environment

International Conference on Automatic Control and Artificial Intelligence (ACAI 2012), 2012. DOI: 10.1049/cp.2012.1085.

Author(s): Jie Chen, Zhongzhi Luan, Yuanqiang Huang

Keyword(s): Data Stream, Stream Processing, Distributed Stream Processing

Efficient query processing on distributed stream processing engine

Proceedings of the 11th International Conference on Ubiquitous Information Management and Communication - IMCOM '17, 2017. DOI: 10.1145/3022227.3022255. Cited by: ~2.

Author(s): Manhui Han, Jonghem Youn, Sang-goo Lee

Keyword(s): Query Processing, Stream Processing, Efficient Query Processing, Distributed Stream Processing, Processing Engine

Cost-efficient enactment of stream processing topologies

PeerJ Computer Science, 2017, Vol 3, pp. e141. DOI: 10.7717/peerj-cs.141. Cited by: ~6.

Author(s): Christoph Hochreiner, Michael Vögler, Stefan Schulte, Schahram Dustdar

Keyword(s): Virtual Machines, Service Level Agreement, Stream Processing, Service Level, Resource Provisioning, Streaming Data, Software Systems, Continuous Increase, Data Volume, Cost Efficient

The continuous increase of unbounded streaming data poses several challenges to established data stream processing engines. One of the most important challenges is the cost-efficient enactment of stream processing topologies under changing data volume. These data volumes pose different loads to stream processing systems, whose resource provisioning needs to be continuously updated at runtime. First approaches already allow for resource provisioning on the level of virtual machines (VMs), but this only allows for coarse resource provisioning strategies. Based on current advances and benefits for containerized software systems, we have designed a cost-efficient resource provisioning approach and integrated it into the runtime of the Vienna ecosystem for elastic stream processing. Our resource provisioning approach aims to maximize the resource usage for VMs obtained from cloud providers. This strategy only releases processing capabilities at the end of the VM's minimal leasing duration instead of releasing them eagerly as soon as possible, as is the case for threshold-based approaches. This strategy improves service level agreement compliance by up to 25% and reduces operational cost by up to 36%.
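The release rule described above can be sketched as follows. This is a simplified illustration rather than the authors' implementation. It assumes that a VM is billed in fixed leasing periods, that times are given in seconds, and that a small guard interval before the period boundary is needed to complete the release in time. The function and parameter names are hypothetical.

```python
def should_release(lease_start: float, now: float, period: float,
                   idle: bool, guard: float = 60.0) -> bool:
    """Decide whether an idle VM should be handed back to the provider.

    The VM has already been paid for until the end of the current leasing
    period, so it is kept (and can absorb new load) until shortly before
    that boundary instead of being released as soon as it becomes idle.
    """
    if period <= 0 or now < lease_start:
        raise ValueError("period must be > 0 and now must be >= lease_start")
    if not idle:
        return False
    elapsed_in_period = (now - lease_start) % period
    remaining = period - elapsed_in_period
    # Release only inside the guard interval before the next billing boundary.
    return remaining <= guard
```

A threshold-based approach would instead return `idle` directly, giving up capacity that has already been paid for.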
