ABSTRACT
Clustering data streams is an interesting Data Mining problem. This article presents three variants of the K-means algorithm to cluster binary data streams. The variants include On-line K-means, Scalable K-means, and Incremental K-means, a proposed variant introduced that finds higher quality solutions in less time. Higher quality of solutions are obtained with a mean-based initialization and incremental learning. The speedup is achieved through a simplified set of sufficient statistics and operations with sparse matrices. A summary table of clusters is maintained on-line. The K-means variants are compared with respect to quality of results and speed. The proposed algorithms can be used to monitor transactions.
- C. Aggarwal, C. Procopiuc, J. Wolf, P. Yu, and Jong Park. Fast algorithms for projected clustering. In ACM SIGMOD Conference, 1999. Google ScholarDigital Library
- C. Aggarwal and P. Yu. Finding generalized projected clusters in dimensional spaces. In ACM SIGMOD Conference, 2000. Google ScholarDigital Library
- R. Agrawal, J. Gehrke, D. Gunopolos, and Prabhakar Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In ACM SIGMOD Conference, 1998. Google ScholarDigital Library
- R. Agrawal, T. Imielinski, and Arun Swami. Mining association rules between sets of items in large databases. In ACM SIGMOD Conference, pages 207--216, 1993. Google ScholarDigital Library
- R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB Conference, 1994. Google ScholarDigital Library
- P. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In ACM KDD Conference, 1998.Google Scholar
- M. Breunig, H. P. Kriegel, P. Kroger, and J. Sander. Data bubbles: Quality preserving performance boosting for hierarchical clustering. In ACM SIGMOD Conference, 2001. Google ScholarDigital Library
- R. Dubes and A. K. Jain. Clustering Methodologies in Exploratory Data Analysis, pages 10--35. Academic Press, New York, 1980.Google Scholar
- R. Duda and P. Hart. Pattern Classification and Scene Analysis, pages 10--45. J. Wiley and Sons, 1973.Google Scholar
- F. Fanstrom, J. Lewis, and C. Elkan. Scalability for clustering algorithms revisited. SIGKDD Explorations, 2(1):51--57, June 2000. Google ScholarDigital Library
- B. Fritzke. The LBG-U method for vector quantization -- an improvement over LBG inspired from neural networks. Neural Processing Letters, 5(1):35--45, 1997. Google ScholarDigital Library
- V. Ganti, J. Gehrke, and R. Ramakrishnan. Cactus-clustering categorical data using summaries. In ACM KDD Conference, 1999. Google ScholarDigital Library
- S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In FOCS, 2000. Google ScholarDigital Library
- S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large databases. In SIGMOD Conference, 1998. Google ScholarDigital Library
- S. Guha, R. Rastogi, and K. Shim. Rock: A robust clustering algorithm for categorical attributes. In ICDE Conference, 1999. Google ScholarDigital Library
- J. Han, J. Pei, and Yiwei Yun. Mining frequent patterns without candidate generation. In ACM SIGMOD Conference, 2000. Google ScholarDigital Library
- A. Hinneburg and D. Keim. Optimal grid-clustering: Towards breaking the curse of dimensionality. In VLDB Conference, 1999. Google ScholarDigital Library
- Z. Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3), 1998. Google ScholarDigital Library
- J. B. MacQueen. Some method for classification and analysis of multivariate observations. In Proc. of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967.Google Scholar
- A. Nanopoulos, Y. Theodoridis, and Y. Manolopoulos. C2p: Clustering based on closest pairs. In VLDB Conference, 2001. Google ScholarDigital Library
- L. O'Callaghan, N. Mishra, A. Meyerson, S. Guha, and R. Motwani. Streaming-data algorithms for high quality clustering. In IEEE ICDE, 2002.Google ScholarCross Ref
- C. Ordonez and E. Omiecinski. FREM: Fast and robust EM clustering for large data sets. In ACM CIKM Conference, 2002. Google ScholarDigital Library
- C. Ordonez, E. Omiecinski, and Norberto Ezquerra. A fast algorithm to cluster high dimensional basket data. In IEEE ICDM Conference, 2001. Google ScholarDigital Library
- S. Roweis and Z. Ghahramani. A unifying review of Linear Gaussian Models. Neural Computation, 1999. Google ScholarDigital Library
- M. Sato and S. Ishii. On-line EM algorithm for the normalized Gaussian network. Neural Computation, 12(2), 2000. Google ScholarDigital Library
- T. Zhang, R. Ramakrishnan, and M. Livny. Birch: An efficient data clustering method for very large databases. In ACM SIGMOD Conference, 1996. Google ScholarDigital Library
- Clustering binary data streams with K-means
Recommendations
Hybrid Bisect K-Means Clustering Algorithm
BCGIN '11: Proceedings of the 2011 International Conference on Business Computing and Global InformatizationIn this paper, we present a hybrid clustering algorithm that combines divisive and agglomerative hierarchical clustering algorithm. Our method uses bisect K-means for divisive clustering algorithm and Unweighted Pair Group Method with Arithmetic Mean (...
Improved k- means clustering algorithm for two dimensional data
CCSEIT '12: Proceedings of the Second International Conference on Computational Science, Engineering and Information TechnologyClustering is a procedure of organizing the objects in groups whose member exhibits some kind of similarity. So a cluster is a collection of objects which are alike and are different from the objects belonging to other clusters. K-Means is one of ...
RK-Means Clustering: K-Means with Reliability
This paper presents an RK-means clustering algorithm which is developed for reliable data grouping by introducing a new reliability evaluation to the K-means clustering algorithm. The conventional K-means clustering algorithm has two shortfalls: 1) the ...
Comments