In the era of big data, as the amount of streaming data continues to
increase, stream processing tasks (SPTs) face serious challenges in
real-time processing scenarios that demand low latency and high throughput.
However, much of the current literature on the performance of SPTs focuses
on reactive approaches, which cannot reliably prevent system crashes caused
by the inherent performance volatility of SPTs. In this paper, a
novel throughput prediction method based on ExtraTree for SPTs is presented
to address these challenges. After the performance volatility of SPTs is
studied, a volatility detection algorithm is proposed to obtain reasonable
metric values. Moreover, a regression function selection algorithm is
proposed to output the performance values of SPTs in a relatively steady
state. Furthermore, an ExtraTree-based algorithm is proposed to predict the
throughput of SPTs. The experimental results from two open-source benchmarks
running on Apache Flink, a popular stream processing system (SPS), indicate
that the proposed method achieves an average accuracy of 90.535% and an
average efficiency of 0.835 s per 10,000 samples, which demonstrates the
effectiveness of the proposed method in predicting the throughput of SPTs.
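
To make the idea of obtaining metric values from a relatively steady state
more concrete, the following is a minimal sketch and not the volatility
detection algorithm of this paper. It assumes that volatility can be
approximated by the coefficient of variation (standard deviation divided by
mean) over fixed-size, non-overlapping windows of throughput samples. The
function name steady_windows, the parameters window and max_cv, and the
sample values are all hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev


def steady_windows(samples, window=5, max_cv=0.05):
    """Return (start_index, window_mean) for each steady window.

    Assumption for illustration only: a window counts as "steady" when
    its coefficient of variation (population std / mean) is <= max_cv.
    """
    steady = []
    # Walk over non-overlapping windows; an incomplete tail is ignored.
    for start in range(0, len(samples) - window + 1, window):
        chunk = samples[start:start + window]
        avg = mean(chunk)
        if avg <= 0:
            # The coefficient of variation is undefined for a
            # non-positive mean, so such windows are skipped.
            continue
        cv = pstdev(chunk) / avg
        if cv <= max_cv:
            steady.append((start, avg))
    return steady


# Hypothetical throughput samples (records/s), not measured data.
throughput = [
    100, 400, 900, 1500, 700,      # warm-up: highly volatile
    1000, 1010, 990, 1005, 995,    # relatively steady
    1000, 300, 1600, 1000, 1100,   # disturbed by spikes
]

result = steady_windows(throughput)
print(result)
```

Under these assumptions, only the middle window is kept, and its mean could
serve as one training target for a throughput regressor. A real detector
would likely need overlapping windows and a threshold tuned per workload.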