skip to main content
article

Statistical grid-based clustering over data streams

Published:01 March 2004Publication History
Skip Abstract Section

Abstract

A data stream is a massive unbounded sequence of data elements continuously generated at a rapid rate. Due to this reason, most algorithms for data streams sacrifice the correctness of their results for fast processing time. The processing time is greatly influenced by the amount of information that should be maintained. This paper proposes a statistical grid-based approach to clustering data elements of a data stream. Initially, the multidimensional data space of a data stream is partitioned into a set of mutually exclusive equal-size initial cells. When the support of a cell becomes high enough, the cell is dynamically divided into two mutually exclusive intermediate cells based on its distribution statistics. Three different ways of partitioning a dense cell are introduced. Eventually, a dense region of each initial cell is recursively partitioned until it becomes the smallest cell called a unit cell. A cluster of a data stream is a group of adjacent dense unit cells. In order to minimize the number of cells, a sparse intermediate or unit cell is pruned if its support becomes much less than a minimum support. Furthermore, in order to confine the usage of memory space, the size of a unit cell is dynamically minimized such that the result of clustering becomes as accurate as possible. The proposed algorithm is analyzed by a series of experiments to identify its various characteristics.

References

  1. M. Datar, A. Gionis, P. Indyk and R. Motwani. Maintaining stream statistics over sliding windows. In Proc. Of the 13th Annual ACM-SIAM Symp. on Discrete Algorithms, Jan. 2002 Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In Proc. Of the 28th Int'l Conference on Very Large Databases, Hong Kong, China, Aug. 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. M. Garofalakis, J. Gehrke and R. Rastogi. Querying and mining data streams: you only get one look. In the tutorial notes of the 28th Int'l Conference on Very Large Databases, Hong Kong, China, Aug. 2002.Google ScholarGoogle Scholar
  4. L. Kaufman and P. J. Rousseeuw. Finding Groups in Data. An Introduction to Cluster Analysis. Wiley, New York, 1990.Google ScholarGoogle Scholar
  5. S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proc. SIGMOD, pages 73--84, 1998 Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases, 1996.Google ScholarGoogle Scholar
  7. M. Ester, H. Kriegel, J. Sander, M. Wimmer, and X. Xu. Incremental clustering for mining in a data warehousing environment, In Proc. VLDB 24th, New York, 1998 Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Liadan O'Callaghan, Nina Mishra, Adam Meyerson, Sudipto Guha, and Rajeev Motwani. STREAM-data algorithms for high-quality clustering. In Proc. of IEEE International Conference on Data Engineering, March 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Cheng, C., Fu, A., and Zhang, Y. Entropy-based subspace clustering for mining numerical data. KDD-99, 84--93, San Diego, August 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image ACM SIGMOD Record
    ACM SIGMOD Record  Volume 33, Issue 1
    March 2004
    135 pages
    ISSN:0163-5808
    DOI:10.1145/974121
    Issue’s Table of Contents

    Copyright © 2004 Authors

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 1 March 2004

    Check for updates

    Qualifiers

    • article

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader