article

Statistical grid-based clustering over data streams

Authors:
Nam Hun Park

Yonsei University, Seoul, Korea

Yonsei University, Seoul, Korea
View Profile

,
Won Suk Lee

Yonsei University, Seoul, Korea

Yonsei University, Seoul, Korea
View Profile

Authors Info & Claims

ACM SIGMOD Record Volume 33 Issue 1March 2004pp 32–37https://doi.org/10.1145/974121.974127

Published:01 March 2004Publication History

ACM SIGMOD Record

Abstract

A data stream is a massive unbounded sequence of data elements continuously generated at a rapid rate. Due to this reason, most algorithms for data streams sacrifice the correctness of their results for fast processing time. The processing time is greatly influenced by the amount of information that should be maintained. This paper proposes a statistical grid-based approach to clustering data elements of a data stream. Initially, the multidimensional data space of a data stream is partitioned into a set of mutually exclusive equal-size initial cells. When the support of a cell becomes high enough, the cell is dynamically divided into two mutually exclusive intermediate cells based on its distribution statistics. Three different ways of partitioning a dense cell are introduced. Eventually, a dense region of each initial cell is recursively partitioned until it becomes the smallest cell called a unit cell. A cluster of a data stream is a group of adjacent dense unit cells. In order to minimize the number of cells, a sparse intermediate or unit cell is pruned if its support becomes much less than a minimum support. Furthermore, in order to confine the usage of memory space, the size of a unit cell is dynamically minimized such that the result of clustering becomes as accurate as possible. The proposed algorithm is analyzed by a series of experiments to identify its various characteristics.

References

M. Datar, A. Gionis, P. Indyk and R. Motwani. Maintaining stream statistics over sliding windows. In Proc. Of the 13th Annual ACM-SIAM Symp. on Discrete Algorithms, Jan. 2002 Google ScholarDigital Library
G. S. Manku and R. Motwani. Approximate frequency counts over data streams. In Proc. Of the 28th Int'l Conference on Very Large Databases, Hong Kong, China, Aug. 2002. Google ScholarDigital Library
M. Garofalakis, J. Gehrke and R. Rastogi. Querying and mining data streams: you only get one look. In the tutorial notes of the 28th Int'l Conference on Very Large Databases, Hong Kong, China, Aug. 2002.Google Scholar
L. Kaufman and P. J. Rousseeuw. Finding Groups in Data. An Introduction to Cluster Analysis. Wiley, New York, 1990.Google Scholar
S. Guha, R. Rastogi, and K. Shim. CURE: An efficient clustering algorithm for large databases. In Proc. SIGMOD, pages 73--84, 1998 Google ScholarDigital Library
M. Ester, H. Kriegel, J. Sander, and X. Xu. A density-based algorithm for discovering clusters in large spatial databases, 1996.Google Scholar
M. Ester, H. Kriegel, J. Sander, M. Wimmer, and X. Xu. Incremental clustering for mining in a data warehousing environment, In Proc. VLDB 24th, New York, 1998 Google ScholarDigital Library
Liadan O'Callaghan, Nina Mishra, Adam Meyerson, Sudipto Guha, and Rajeev Motwani. STREAM-data algorithms for high-quality clustering. In Proc. of IEEE International Conference on Data Engineering, March 2002. Google ScholarDigital Library
Cheng, C., Fu, A., and Zhang, Y. Entropy-based subspace clustering for mining numerical data. KDD-99, 84--93, San Diego, August 1999. Google ScholarDigital Library

Recommendations

Clustering data streams using grid-based synopsis

Continually advancing technology has made it feasible to capture data online for onward transmission as a steady flow of newly generated data points, termed as data stream. Continuity and unboundedness of data streams make storage of data and multiple ...
Read More
Clustering data streams
FOCS '00: Proceedings of the 41st Annual Symposium on Foundations of Computer Science

We study clustering under the data stream model of computation where: given a sequence of points, the objective is to maintain a consistently good clustering of the sequence observed so far, using a small amount of memory and time. The data stream model ...
Read More
Subspace clustering of data streams: new algorithms and effective evaluation measures

Nowadays, most streaming data sources are becoming high dimensional. Accordingly, subspace stream clustering, which aims at finding evolving clusters within subgroups of dimensions, has gained a significant importance. However, in spite of the rich ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM SIGMOD Record Volume 33, Issue 1
March 2004
135 pages
ISSN:0163-5808
DOI:10.1145/974121
Issue’s Table of Contents

Copyright © 2004 Authors
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 March 2004
Check for updates
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 87
  Total Citations
  View Citations
- 829
  Total Downloads
- Downloads (Last 12 months)18
- Downloads (Last 6 weeks)4
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Statistical grid-based clustering over data streams

ACM SIGMOD Record

Abstract

References

Cited By

Recommendations

Clustering data streams using grid-based synopsis

Clustering data streams

Subspace clustering of data streams: new algorithms and effective evaluation measures

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Check for updates

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Statistical grid-based clustering over data streams

ACM SIGMOD Record

Abstract

References

Cited By

Recommendations

Clustering data streams using grid-based synopsis

Clustering data streams

Subspace clustering of data streams: new algorithms and effective evaluation measures

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Check for updates

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media