Distributed steam processing is necessary for a large class of stream-based applications. To exploit the full power of distributed computation, effective load distribution techniques must be developed to optimize the system performance and cope with time-varying loads. When traditional load balancing or load sharing strategies are applied to such systems, we find that they either fall short in achieving good load distribution or fail to maintain good task partition in the long run.
In this paper, we study two important issues of dynamic load distribution in the context of data-intensive stream processing. The first one is how to allocate processing resources for push-based tasks such that the average end-to-end data processing latency can be minimized. The second issue is how to maintain a good load distribution dynamically for long running continuous queries. We propose a new hybrid load distribution strategy that addresses the above concerns by load clustering. To achieve scalability, our algorithm is completely decentralized and asynchronous.