skip to main content
10.1145/773153.773176acmconferencesArticle/Chapter ViewAbstractPublication PagespodsConference Proceedingsconference-collections
Article

Maintaining variance and k-medians over data stream windows

Published:09 June 2003Publication History

ABSTRACT

The sliding window model is useful for discounting stale data in data stream applications. In this model, data elements arrive continually and only the most recent N elements are used when answering queries. We present a novel technique for solving two important and related problems in the sliding window model---maintaining variance and maintaining a k--median clustering. Our solution to the problem of maintaining variance provides a continually updated estimate of the variance of the last N values in a data stream with relative error of at most ε using O(1/ε2 log N) memory. We present a constant-factor approximation algorithm which maintains an approximate k--median solution for the last N data points using O(k/τ4 N log2 N) memory, where τ < 1/2 is a parameter which trades off the space bound with the approximation factor of O(2O(1/τ)).

References

  1. V. Arya, N. Garg, R. Khandekar, V. Pandit, A. Meyerson, and K. Munagala. "Local Search Heuristics for k--median and Facility Location Problems." In Proc. 33rd ACM Symp. on Theory of Computing (STOC), 2001, pages 21--29. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. "Models and Issues in Data Stream Systems." In Proceedings of the 21st ACM Symposium on Principles of Databases Systems (PODS), 2002, pages 1--16. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. B. Babcock, M. Datar, and R. Motwani. "Sampling from a Moving Window over Streaming Data." In Proceedings of the 13th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2002, pages 633--634. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. P. S. Bradley, U. M. Fayyad, and C. Reina. "Scaling clustering algorithms to large databases." In Proceedings of the 4th International Conference on Knowledge Discovery and Data Mining (KDD), 1998, pages 9--15.Google ScholarGoogle Scholar
  5. M. Charikar and S. Guha. "Improved Combinatorial Algorithms for the Facility Location and k--median Problems." In Proc. of the 40th Annual IEEE Symposium on Foundations of Computer Science(FOCS), 1999, pages 378--388. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. M. Charikar, L. O'Callaghan, and R. Panigrahy. "Better Streaming Algorithms for Clustering Problems." In Proc. of 35th ACM Symposium on Theory of Computing (STOC), 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. M. Datar, A. Gionis, P. Indyk, and R. Motwani. "Maintaining Stream Statistics over Sliding Windows." In Proceedings of Thirteenth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2002, pages 635--644. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. P. Domingos and G. Hulten. "Mining High-speed Data Streams." In Proceedings of the 6th International Conference on Knowledge Discovery and Data Mining (KDD), 2000, pages 71--80. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. P. Domingos, G. Hulten, and L. Spencer. "Mining Time-changing Data Streams." In Proceedings of the 7th International Conference on Knowledge Discovery and Data Mining (KDD), 2001, pages 97--106. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. P. Gibbons, and S. Tirthapura. "Distributed Streams Algorithms for Sliding Windows" In Proc. of ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. "Surfing Wavelets on Streams: One-pass Summaries for Approximate Aggregate Queries." In Proc. 27th Conf. on Very Large Data Bases (VLDB), 2001, pages 79--88. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. "Clustering Data Streams." In Proceedings of the 41st Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2000, pages 359--366. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. S. Guha, R. Rastogi, and K. Shim. "CURE: An efficient clustering algorithm for large databases." In Proc. of the ACM SIGMOD Intl. Conference on Management of Data (SIGMOD), 1998, pages 73--84. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. M.R. Henzinger, P. Raghavan, and S. Rajagopalan "Computing on Data Streams.", Technical Report 1998--011, Compaq Systems Research Center, Palo Alto, CA, May, 1998.Google ScholarGoogle Scholar
  15. P. Indyk. "Sublinear Time Algorithms for Metric Space Problems." In Proceedings of the 31st Annual ACM Symposium on Theory of Computing (STOC), 1999, pages 428--434. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. K. Jain and V. Vazirani. "Primal-Dual Approximation Algorithms for Metric Facility Location and k--median Problems." In Proc. 40th IEEE Symp. on Foundations of Computer Science (FOCS), 1999, pages 1--10. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. R. R. Mettu and C. G. Plaxton. "The Online k--median Problem." In Proceedings of the 41st Annual IEEE Symposium on Foundations of Computer Science (FOCS), 2000, pages 339--348. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. N. Mishra, D. Oblinger, and L. Pitt. "Sublinear Time Approximate Clustering." In Proceedings of the 12th Annual ACM-SIAM Symposium on Discrete Algorithms (SODA), 2001, pages 439--447. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. L. O'Callaghan, N. Mishra, A. Meyerson, S. Guha, and R. Motwani. "Streaming-Data Algorithms for High-Quality Clustering." In Proceedings of the 18th Annual IEEE International Conference on Data Engineering (ICDE), 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. C. R. Palmer and C. Faloutsos. "Density biased sampling: an improved method for data mining and clustering." In Proc. of the ACM SIGMOD Intl. Conference on Management of Data (SIGMOD), 2000, pages 82--92. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. D. Pelleg and A. W. Moore. "Accelerating exact k--means algorithms with geometric reasoning." In Proceedings of the 5th International Conference on Knowledge Discovery and Data Mining (KDD), 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. T. Zhang, R. Ramakrishnan, and M. Livny. "BIRCH: an efficient data clustering method for very large databases." In Proc. of the ACM SIGMOD Intl. Conference on Management of Data (SIGMOD), 1996, pages 103--114. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Maintaining variance and k-medians over data stream windows

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in
          • Published in

            cover image ACM Conferences
            PODS '03: Proceedings of the twenty-second ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
            June 2003
            291 pages
            ISBN:1581136706
            DOI:10.1145/773153
            • Conference Chair:
            • Frank Neven,
            • General Chair:
            • Catriel Beeri,
            • Program Chair:
            • Tova Milo

            Copyright © 2003 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States

            Publication History

            • Published: 9 June 2003

            Permissions

            Request permissions about this article.

            Request Permissions

            Check for updates

            Qualifiers

            • Article

            Acceptance Rates

            PODS '03 Paper Acceptance Rate27of136submissions,20%Overall Acceptance Rate642of2,707submissions,24%

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader