ABSTRACT
The sliding window model is useful for discounting stale data in data stream applications. In this model, data elements arrive continually and only the most recent N elements are used when answering queries. We present a novel technique for solving two important and related problems in the sliding window model---maintaining variance and maintaining a k--median clustering. Our solution to the problem of maintaining variance provides a continually updated estimate of the variance of the last N values in a data stream with relative error of at most ε using O(1/ε2 log N) memory. We present a constant-factor approximation algorithm which maintains an approximate k--median solution for the last N data points using O(k/τ4 N2τ log2 N) memory, where τ < 1/2 is a parameter which trades off the space bound with the approximation factor of O(2O(1/τ)).
- V. Arya, N. Garg, R. Khandekar, V. Pandit, A. Meyerson, and K. Munagala. "Local Search Heuristics for k--median and Facility Location Problems." In Proc. 33rd ACM Symp. on Theory of Computing (STOC), 2001, pages 21--29. Google ScholarDigital Library
- B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. "Models and Issues in Data Stream Systems." In Proceedings of the 21st ACM Symposium on Principles of Databases Systems (PODS), 2002, pages 1--16. Google ScholarDigital Library
- B. Babcock, M. Datar, and R. Motwani. "Sampling from a Moving Window over Streaming Data." In Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2002, pages 633--634. Google ScholarDigital Library
- P. S. Bradley, U. M. Fayyad, and C. Reina. "Scaling clustering algorithms to large databases." In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD), 1998, pages 9--15.Google Scholar
- M. Charikar and S. Guha. "Improved Combinatorial Algorithms for the Facility Location and k--median Problems." In Proc. of the 40th Annual IEEE Symposium on Foundations of Computer Science(FOCS), 1999, pages 378--388. Google ScholarDigital Library
- M. Charikar, L. O'Callaghan, and R. Panigrahy. "Better Streaming Algorithms for Clustering Problems." In Proc. of 35th ACM Symposium on Theory of Computing (STOC), 2003. Google ScholarDigital Library
- M. Datar, A. Gionis, P. Indyk, and R. Motwani. "Maintaining Stream Statistics over Sliding Windows." In Proceedings of Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2002, pages 635--644. Google ScholarDigital Library
- P. Domingos and G. Hulten. "Mining High-speed Data Streams." In Proceedings of the 6th International Conference on Knowledge Discovery and Data Mining (KDD), 2000, pages 71--80. Google ScholarDigital Library
- P. Domingos, G. Hulten, and L. Spencer. "Mining Time-changing Data Streams." In Proceedings of the 7th International Conference on Knowledge Discovery and Data Mining (KDD), 2001, pages 97--106. Google ScholarDigital Library
- P. Gibbons, and S. Tirthapura. "Distributed Streams Algorithms for Sliding Windows" In Proc. of ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2002. Google ScholarDigital Library
- A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. "Surfing Wavelets on Streams: One-pass Summaries for Approximate Aggregate Queries." In Proc. 27th Conf. on Very Large Data Bases (VLDB), 2001, pages 79--88. Google ScholarDigital Library
- S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. "Clustering Data Streams." In Proceedings of the 41st Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2000, pages 359--366. Google ScholarDigital Library
- S. Guha, R. Rastogi, and K. Shim. "CURE: An efficient clustering algorithm for large databases." In Proc. of the ACM SIGMOD Intl. Conference on Management of Data (SIGMOD), 1998, pages 73--84. Google ScholarDigital Library
- M.R. Henzinger, P. Raghavan, and S. Rajagopalan "Computing on Data Streams.", Technical Report 1998--011, Compaq Systems Research Center, Palo Alto, CA, May, 1998.Google Scholar
- P. Indyk. "Sublinear Time Algorithms for Metric Space Problems." In Proceedings of the 31st Annual ACM Symposium on Theory of Computing (STOC), 1999, pages 428--434. Google ScholarDigital Library
- K. Jain and V. Vazirani. "Primal-Dual Approximation Algorithms for Metric Facility Location and k--median Problems." In Proc. 40th IEEE Symp. on Foundations of Computer Science (FOCS), 1999, pages 1--10. Google ScholarDigital Library
- R. R. Mettu and C. G. Plaxton. "The Online k--median Problem." In Proceedings of the 41st Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2000, pages 339--348. Google ScholarDigital Library
- N. Mishra, D. Oblinger, and L. Pitt. "Sublinear Time Approximate Clustering." In Proceedings of the 12th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2001, pages 439--447. Google ScholarDigital Library
- L. O'Callaghan, N. Mishra, A. Meyerson, S. Guha, and R. Motwani. "Streaming-Data Algorithms for High-Quality Clustering." In Proceedings of the 18th Annual IEEE International Conference on Data Engineering (ICDE), 2001. Google ScholarDigital Library
- C. R. Palmer and C. Faloutsos. "Density biased sampling: an improved method for data mining and clustering." In Proc. of the ACM SIGMOD Intl. Conference on Management of Data (SIGMOD), 2000, pages 82--92. Google ScholarDigital Library
- D. Pelleg and A. W. Moore. "Accelerating exact k--means algorithms with geometric reasoning." In Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining (KDD), 1999. Google ScholarDigital Library
- T. Zhang, R. Ramakrishnan, and M. Livny. "BIRCH: an efficient data clustering method for very large databases." In Proc. of the ACM SIGMOD Intl. Conference on Management of Data (SIGMOD), 1996, pages 103--114. Google ScholarDigital Library
Index Terms
- Maintaining variance and k-medians over data stream windows
Recommendations
Maintaining stream statistics over multiscale sliding windows
In this article, we propose a new multiscale sliding window model which differentiates data items in different time periods of the data stream, based on a reasonable monotonicity of resolution assumption. Our model, as a well-motivated extension of the ...
Maintaining Stream Statistics over Sliding Windows
We consider the problem of maintaining aggregates and statistics over data streams, with respect to the last N data elements seen so far. We refer to this model as the sliding window model. We consider the following basic problem: Given a stream of bits,...
Variance estimation over sliding windows
PODS '07: Proceedings of the twenty-sixth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systemsCapturing characteristics of large data streams has received considerable attention. The constraints in space and time restrict the data stream processing to only one pass (or a small number of passes). Processing data streams over sliding windows make ...
Comments