Article

Clustering binary data streams with K-means

Author:
Carlos Ordonez

Teradata, a division of NCR, San Diego, CA

Teradata, a division of NCR, San Diego, CA
View Profile

DMKD '03: Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discoveryJune 2003Pages 12–19https://doi.org/10.1145/882082.882087

Published:13 June 2003Publication History

DMKD '03: Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery

Pages 12–19

ABSTRACT

Clustering data streams is an interesting Data Mining problem. This article presents three variants of the K-means algorithm to cluster binary data streams. The variants include On-line K-means, Scalable K-means, and Incremental K-means, a proposed variant introduced that finds higher quality solutions in less time. Higher quality of solutions are obtained with a mean-based initialization and incremental learning. The speedup is achieved through a simplified set of sufficient statistics and operations with sparse matrices. A summary table of clusters is maintained on-line. The K-means variants are compared with respect to quality of results and speed. The proposed algorithms can be used to monitor transactions.

References

C. Aggarwal, C. Procopiuc, J. Wolf, P. Yu, and Jong Park. Fast algorithms for projected clustering. In ACM SIGMOD Conference, 1999. Google ScholarDigital Library
C. Aggarwal and P. Yu. Finding generalized projected clusters in dimensional spaces. In ACM SIGMOD Conference, 2000. Google ScholarDigital Library
R. Agrawal, J. Gehrke, D. Gunopolos, and Prabhakar Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In ACM SIGMOD Conference, 1998. Google ScholarDigital Library
R. Agrawal, T. Imielinski, and Arun Swami. Mining association rules between sets of items in large databases. In ACM SIGMOD Conference, pages 207--216, 1993. Google ScholarDigital Library
R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB Conference, 1994. Google ScholarDigital Library
P. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In ACM KDD Conference, 1998.Google Scholar
M. Breunig, H. P. Kriegel, P. Kroger, and J. Sander. Data bubbles: Quality preserving performance boosting for hierarchical clustering. In ACM SIGMOD Conference, 2001. Google ScholarDigital Library
R. Dubes and A. K. Jain. Clustering Methodologies in Exploratory Data Analysis, pages 10--35. Academic Press, New York, 1980.Google Scholar
R. Duda and P. Hart. Pattern Classification and Scene Analysis, pages 10--45. J. Wiley and Sons, 1973.Google Scholar
F. Fanstrom, J. Lewis, and C. Elkan. Scalability for clustering algorithms revisited. SIGKDD Explorations, 2(1):51--57, June 2000. Google ScholarDigital Library
B. Fritzke. The LBG-U method for vector quantization -- an improvement over LBG inspired from neural networks. Neural Processing Letters, 5(1):35--45, 1997. Google ScholarDigital Library
V. Ganti, J. Gehrke, and R. Ramakrishnan. Cactus-clustering categorical data using summaries. In ACM KDD Conference, 1999. Google ScholarDigital Library
S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In FOCS, 2000. Google ScholarDigital Library
S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large databases. In SIGMOD Conference, 1998. Google ScholarDigital Library
S. Guha, R. Rastogi, and K. Shim. Rock: A robust clustering algorithm for categorical attributes. In ICDE Conference, 1999. Google ScholarDigital Library
J. Han, J. Pei, and Yiwei Yun. Mining frequent patterns without candidate generation. In ACM SIGMOD Conference, 2000. Google ScholarDigital Library
A. Hinneburg and D. Keim. Optimal grid-clustering: Towards breaking the curse of dimensionality. In VLDB Conference, 1999. Google ScholarDigital Library
Z. Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3), 1998. Google ScholarDigital Library
J. B. MacQueen. Some method for classification and analysis of multivariate observations. In Proc. of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967.Google Scholar
A. Nanopoulos, Y. Theodoridis, and Y. Manolopoulos. C2p: Clustering based on closest pairs. In VLDB Conference, 2001. Google ScholarDigital Library
L. O'Callaghan, N. Mishra, A. Meyerson, S. Guha, and R. Motwani. Streaming-data algorithms for high quality clustering. In IEEE ICDE, 2002.Google ScholarCross Ref
C. Ordonez and E. Omiecinski. FREM: Fast and robust EM clustering for large data sets. In ACM CIKM Conference, 2002. Google ScholarDigital Library
C. Ordonez, E. Omiecinski, and Norberto Ezquerra. A fast algorithm to cluster high dimensional basket data. In IEEE ICDM Conference, 2001. Google ScholarDigital Library
S. Roweis and Z. Ghahramani. A unifying review of Linear Gaussian Models. Neural Computation, 1999. Google ScholarDigital Library
M. Sato and S. Ishii. On-line EM algorithm for the normalized Gaussian network. Neural Computation, 12(2), 2000. Google ScholarDigital Library
T. Zhang, R. Ramakrishnan, and M. Livny. Birch: An efficient data clustering method for very large databases. In ACM SIGMOD Conference, 1996. Google ScholarDigital Library

Clustering binary data streams with K-means
1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Unsupervised learning
2. Information systems
  1. Information systems applications
    1. Data mining

Recommendations

Hybrid Bisect K-Means Clustering Algorithm
BCGIN '11: Proceedings of the 2011 International Conference on Business Computing and Global Informatization

In this paper, we present a hybrid clustering algorithm that combines divisive and agglomerative hierarchical clustering algorithm. Our method uses bisect K-means for divisive clustering algorithm and Unweighted Pair Group Method with Arithmetic Mean (...
Read More
Improved k- means clustering algorithm for two dimensional data
CCSEIT '12: Proceedings of the Second International Conference on Computational Science, Engineering and Information Technology

Clustering is a procedure of organizing the objects in groups whose member exhibits some kind of similarity. So a cluster is a collection of objects which are alike and are different from the objects belonging to other clusters. K-Means is one of ...
Read More
RK-Means Clustering: K-Means with Reliability

This paper presents an RK-means clustering algorithm which is developed for reliable data grouping by introducing a new reliability evaluation to the K-means clustering algorithm. The conventional K-means clustering algorithm has two shortfalls: 1) the ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
DMKD '03: Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery
June 2003
103 pages
ISBN:9781450374224
DOI:10.1145/882082
Conference Chairs:
Mohammed J. Zaki
Rensselaer Polytechnic Institute, Troy, New York
,
Charu C. Aggarwal
IBM T.J. Watson Research Center, Yorktown Heights, New York
Copyright © 2003 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 13 June 2003
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Qualifiers
- Article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 117
  Total Citations
  View Citations
- 4,017
  Total Downloads
- Downloads (Last 12 months)65
- Downloads (Last 6 weeks)15
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Clustering binary data streams with K-means

DMKD '03: Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery

ABSTRACT

References

Cited By

Recommendations

Hybrid Bisect K-Means Clustering Algorithm

Improved k- means clustering algorithm for two dimensional data

RK-Means Clustering: K-Means with Reliability