An adaptive load balancing strategy for stateful join operator in skewed data stream environments
Journal contribution posted on 2023-11-21, 04:42, authored by D Sun, C Zhang, Shang Gao, R Buyya
As one of the most computationally intensive operations in stream processing applications, the join operation can cause severe load imbalance when dealing with skewed data. Most popular solutions rely on monitoring-based dynamic balancing strategies, which struggle to adapt quickly to the changing frequency of the data stream and sometimes fail to correct the skewed load in the cluster. To address these issues, we propose using the prediction results of a deep reinforcement learning model to adjust the grouping strategy in advance, before the frequency of the data stream changes. This enables the system to adapt quickly to data stream fluctuations while managing resources for effective utilization. This paper makes the following contributions: 1) We explore the main factors that trigger load skewness in distributed stream join systems and carefully model the load balancing problem at the application level. 2) We develop a Gated Recurrent Unit Sequence-to-Sequence model to predict the key frequency distribution of streams, and propose a dynamic grouping algorithm and a feedback-based elastic resource scaling algorithm to solve, in real time, the load imbalance caused by hot keys. 3) We design and implement Aj-Stream, an adaptive stream join system built on Apache Storm and based on the prediction model and the proposed algorithms. 4) We evaluate the system's performance through extensive experiments on a large-scale real-world dataset and multiple synthetic datasets. The experimental results demonstrate that Aj-Stream exhibits stable throughput and latency with both static data streams of varying skewness and dynamic data streams. Compared with existing stream join systems, Aj-Stream achieved a 22.1% increase in system throughput and a 45.5% decrease in system latency when dealing with frequently fluctuating data streams.
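To make the idea of prediction-driven grouping concrete, the following is a minimal sketch under stated assumptions and not the Aj-Stream algorithm itself, which the abstract does not specify. It assumes that a predictor has already produced a relative frequency for each key, that keys at or above a hypothetical `hot_threshold` count as hot, and that each hot key is placed on the currently least-loaded join instance while all other keys fall back to hash partitioning. The function names, the threshold, and the imbalance measure (maximum load divided by mean load) are illustrative choices.

```python
import zlib


def stable_hash(key: str) -> int:
    """Deterministic hash, so routing is reproducible across processes."""
    return zlib.crc32(key.encode("utf-8"))


def build_routing_table(predicted_freq, num_instances, hot_threshold=0.1):
    """Build an explicit key -> instance table for predicted hot keys.

    predicted_freq: dict mapping key -> predicted relative frequency
                    (assumed to come from some forecasting model).
    hot_threshold:  hypothetical cut-off above which a key is treated as hot.
    Returns (table, load), where load[i] is the predicted load of instance i.
    """
    load = [0.0] * num_instances
    hot_keys = []
    for key, freq in predicted_freq.items():
        if freq >= hot_threshold:
            hot_keys.append(key)
        else:
            # Cold keys keep plain hash partitioning; account for their load.
            load[stable_hash(key) % num_instances] += freq

    table = {}
    # Place the heaviest hot keys first, each on the least-loaded instance.
    for key in sorted(hot_keys, key=lambda k: predicted_freq[k], reverse=True):
        target = min(range(num_instances), key=lambda i: load[i])
        table[key] = target
        load[target] += predicted_freq[key]
    return table, load


def route(key, table, num_instances):
    """Route a tuple: explicit table for hot keys, hash partitioning otherwise."""
    return table.get(key, stable_hash(key) % num_instances)


def imbalance(load):
    """Ratio of the maximum instance load to the mean load (1.0 is balanced)."""
    mean = sum(load) / len(load)
    return max(load) / mean if mean > 0 else 1.0


if __name__ == "__main__":
    predicted = {"a": 0.5, "b": 0.3, "c": 0.1, "d": 0.05, "e": 0.05}
    table, load = build_routing_table(predicted, num_instances=3)
    print("hot-key table:", table)
    print("predicted load per instance:", load)
    print("imbalance:", imbalance(load))
```

Because every key still maps to exactly one instance, tuples with the same key from both streams meet on the same instance, which keeps the join result correct. Splitting a single hot key across several instances would balance load further but would require replicating the matching tuples of the other stream, which this sketch deliberately leaves out.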