Abstract
Emerging large-scale monitoring applications rely on continuous tracking of complex data-analysis queries over collections of massive, physically-distributed data streams. Thus, in addition to the space- and time-efficiency requirements of conventional stream processing (at each remote monitor site), effective solutions also need to guarantee communication efficiency (over the underlying communication network). The complexity of the monitored query adds to the difficulty of the problem -- this is especially true for nonlinear queries (e.g., joins), where no obvious solutions exist for distributing the monitor condition across sites. The recently proposed geometric method offers a generic methodology for splitting an arbitrary (non-linear) global threshold-monitoring task into a collection of local site constraints; still, the approach relies on maintaining the complete stream(s) at each site, thus raising serious efficiency concerns for massive data streams. In this paper, we propose novel algorithms for efficiently tracking a broad class of complex aggregate queries in such distributed-streams settings. Our tracking schemes rely on a novel combination of the geometric method with compact sketch summaries of local data streams, and maintain approximate answers with provable error guarantees, while optimizing space and processing costs at each remote site and communication cost across the network. One of our key technical insights for the effective use of the geometric method lies in exploiting a much lower-dimensional space for monitoring the sketch-based estimation query. Due to the complex, highly nonlinear nature of these estimates, efficiently monitoring the local geometric constraints poses challenging algorithmic issues for which we propose novel solutions. Experimental results on real-life data streams verify the effectiveness of our approach.
- N. Alon, P. B. Gibbons, Y. Matias, and M. Szegedy. "Tracking Join and Self-Join Sizes in Limited Storage". In ACM PODS, 1999. Google Scholar
- N. Alon, Y. Matias, and M. Szegedy. "The Space Complexity of Approximating the Frequency Moments". In ACM STOC, 1996. Google Scholar
- B. Babcock and C. Olston. "Distributed Top-K Monitoring". In ACM SIGMOD, 2003. Google Scholar
- M. Charikar, K. Chen, and M. Farach-Colton. "Finding Frequent Items in Data Streams". In ICALP, 2002. Google Scholar
- D. Chu, A. Deshpande, J. M. Hellerstein, and W. Hong. "Approximate Data Collection in Sensor Networks using Probabilistic Models". In IEEE ICDE, 2006. Google Scholar
- G. Cormode and M. Garofalakis. Streaming in a connected world: querying and tracking distributed data streams. In ACM SIGMOD, 2007. Google Scholar
- G. Cormode and M. Garofalakis. "Approximate Continuous Querying of Distributed Streams". ACM TODS, 33(2), 2008. Google Scholar
- G. Cormode, M. Garofalakis, S. Muthukrishnan, and R. Rastogi. "Holistic Aggregates in a Networked World: Distributed Tracking of Approximate Quantiles". In ACM SIGMOD, 2005. Google Scholar
- G. Cormode, M. Garofalakis, and D. Sacharidis. "Fast Approximate Wavelet Tracking on Streams". In EDBT, 2006. Google Scholar
- G. Cormode and S. Muthukrishnan. "What's Hot and What's Not: Tracking Most Frequent Items Dynamically". In ACM PODS, 2003. Google Scholar
- G. Cormode and S. Muthukrishnan. "An improved data stream summary: The count-min sketch and its applications". In Jrnl. of Algorithms, 55(1), 2005. Google Scholar
- C. Cranor, T. Johnson, O. Spatscheck, and V. Shkapenyuk. "Gigascope: A Stream Database for Network Applications". In ACM SIGMOD, 2003. Google Scholar
- A. Das, S. Ganguly, M. Garofalakis, and R. Rastogi. "Distributed Set-Expression Cardinality Estimation". In VLDB, 2004. Google Scholar
- A. Deshpande, C. Guestrin, S. R. Madden, J. M. Hellerstein, and W. Hong. "Model-Driven Data Acquisition in Sensor Networks". In VLDB, 2004. Google Scholar
- A. Dobra, M. Garofalakis, J. Gehrke, and R. Rastogi. "Processing Complex Aggregate Queries over Data Streams". In ACM SIGMOD, 2002. Google Scholar
- S. Ganguly, M. Garofalakis, and R. Rastogi. "Processing Set Expressions over Continuous Update Streams". In ACM SIGMOD, 2003. Google Scholar
- N. Giatrakos, A. Deligiannakis, M. Garofalakis, I. Sharfman, and A. Schuster. "Prediction-based Geometric Monitoring over Distributed Data Streams". In ACM SIGMOD, 2012. Google Scholar
- P. B. Gibbons. "Distinct Sampling for Highly-Accurate Answers to Distinct Values Queries and Event Reports". In VLDB, 2001. Google Scholar
- A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. "How to Summarize the Universe: Dynamic Maintenance of Quantiles". In VLDB, 2002. Google Scholar
- A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. J. Strauss. "One-pass wavelet decomposition of data streams". IEEE TKDE, 15(3), 2003. Google Scholar
- M. B. Greenwald and S. Khanna. "Space-Efficient Online Computation of Quantile Summaries". In ACM SIGMOD, 2001. Google Scholar
- M. B. Greenwald and S. Khanna. "Power-Conserving Computation of Order-Statistics over Sensor Networks". In ACM PODS, 2004. Google Scholar
- A. Jain, E. Y. Chang, and Y.-F. Wang. "Adaptive stream resource management using Kalman Filters". In ACM SIGMOD, 2004. Google Scholar
- R. Keralapura, G. Cormode, and J. Ramamirtham. "Communication-efficient distributed monitoring of thresholded counts". In ACM SIGMOD, 2006. Google Scholar
- D. Keren, I. Sharfman, A. Schuster, and A. Livne. "Shape-Sensitive Geometric Monitoring". IEEE TKDE, 24(8), 2012. Google Scholar
- S. R. Madden, M. J. Franklin, J. M. Hellerstein, and W. Hong. "The Design of an Acquisitional Query Processor for Sensor Networks". In ACM SIGMOD, 2003. Google Scholar
- A. Manjhi, V. Shkapenyuk, K. Dhamdhere, and C. Olston. "Finding (Recently) Frequent Items in Distributed Data Streams". In IEEE ICDE, 2005. Google Scholar
- G. S. Manku and R. Motwani. "Approximate Frequency Counts over Data Streams". In VLDB, 2002. Google Scholar
- NII Shonan Workshop on Large-Scale Distributed Computation, Shonan Village, Japan, January 2012. http://www.nii.ac.jp/shonan/seminar011/.Google Scholar
- C. Olston, J. Jiang, and J. Widom. "Adaptive Filters for Continuous Queries over Distributed Data Streams". In ACM SIGMOD, 2003. Google Scholar
- I. Sharfman, A. Schuster, and D. Keren. "A geometric approach to monitoring threshold functions over distributed data streams". In ACM SIGMOD, 2006. Google Scholar
- N. Thaper, S. Guha, P. Indyk, and N. Koudas. "Dynamic Multidimensional Histograms". In ACM SIGMOD, 2002. Google Scholar
Index Terms
- Sketch-based geometric monitoring of distributed stream queries
Recommendations
Sketch-based querying of distributed sliding-window data streams
While traditional data-management systems focus on evaluating single, ad-hoc queries over static data sets in a centralized setting, several emerging applications require (possibly, continuous) answers to queries on dynamic data that is widely ...
Distributed stream join query processing with semijoins
This paper addresses the distributed stream processing of window-based multi-way join queries considering the semijoin as a key join operator. In distributed stream processing, data streams arriving at remote sites need to be shipped to the processing ...
Supporting Predicate-Window Queries in Data Stream Management Systems
ICDEW '06: Proceedings of the 22nd International Conference on Data Engineering WorkshopsThe window query model is widely used in data stream management systems where the focus of a continuous query is limited to a set of the most recent tuples. In this dissertation, we show that an interesting and important class of continuous queries can ...
Comments