skip to main content
10.1145/882082.882087acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
Article

Clustering binary data streams with K-means

Published:13 June 2003Publication History

ABSTRACT

Clustering data streams is an interesting Data Mining problem. This article presents three variants of the K-means algorithm to cluster binary data streams. The variants include On-line K-means, Scalable K-means, and Incremental K-means, a proposed variant introduced that finds higher quality solutions in less time. Higher quality of solutions are obtained with a mean-based initialization and incremental learning. The speedup is achieved through a simplified set of sufficient statistics and operations with sparse matrices. A summary table of clusters is maintained on-line. The K-means variants are compared with respect to quality of results and speed. The proposed algorithms can be used to monitor transactions.

References

  1. C. Aggarwal, C. Procopiuc, J. Wolf, P. Yu, and Jong Park. Fast algorithms for projected clustering. In ACM SIGMOD Conference, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. C. Aggarwal and P. Yu. Finding generalized projected clusters in dimensional spaces. In ACM SIGMOD Conference, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. R. Agrawal, J. Gehrke, D. Gunopolos, and Prabhakar Raghavan. Automatic subspace clustering of high dimensional data for data mining applications. In ACM SIGMOD Conference, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. R. Agrawal, T. Imielinski, and Arun Swami. Mining association rules between sets of items in large databases. In ACM SIGMOD Conference, pages 207--216, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB Conference, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. P. Bradley, U. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In ACM KDD Conference, 1998.Google ScholarGoogle Scholar
  7. M. Breunig, H. P. Kriegel, P. Kroger, and J. Sander. Data bubbles: Quality preserving performance boosting for hierarchical clustering. In ACM SIGMOD Conference, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. R. Dubes and A. K. Jain. Clustering Methodologies in Exploratory Data Analysis, pages 10--35. Academic Press, New York, 1980.Google ScholarGoogle Scholar
  9. R. Duda and P. Hart. Pattern Classification and Scene Analysis, pages 10--45. J. Wiley and Sons, 1973.Google ScholarGoogle Scholar
  10. F. Fanstrom, J. Lewis, and C. Elkan. Scalability for clustering algorithms revisited. SIGKDD Explorations, 2(1):51--57, June 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. B. Fritzke. The LBG-U method for vector quantization -- an improvement over LBG inspired from neural networks. Neural Processing Letters, 5(1):35--45, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. V. Ganti, J. Gehrke, and R. Ramakrishnan. Cactus-clustering categorical data using summaries. In ACM KDD Conference, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. S. Guha, N. Mishra, R. Motwani, and L. O'Callaghan. Clustering data streams. In FOCS, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. S. Guha, R. Rastogi, and K. Shim. Cure: An efficient clustering algorithm for large databases. In SIGMOD Conference, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. S. Guha, R. Rastogi, and K. Shim. Rock: A robust clustering algorithm for categorical attributes. In ICDE Conference, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. J. Han, J. Pei, and Yiwei Yun. Mining frequent patterns without candidate generation. In ACM SIGMOD Conference, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. A. Hinneburg and D. Keim. Optimal grid-clustering: Towards breaking the curse of dimensionality. In VLDB Conference, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Z. Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data Mining and Knowledge Discovery, 2(3), 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. J. B. MacQueen. Some method for classification and analysis of multivariate observations. In Proc. of the 5th Berkeley Symposium on Mathematical Statistics and Probability, 1967.Google ScholarGoogle Scholar
  20. A. Nanopoulos, Y. Theodoridis, and Y. Manolopoulos. C2p: Clustering based on closest pairs. In VLDB Conference, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. L. O'Callaghan, N. Mishra, A. Meyerson, S. Guha, and R. Motwani. Streaming-data algorithms for high quality clustering. In IEEE ICDE, 2002.Google ScholarGoogle ScholarCross RefCross Ref
  22. C. Ordonez and E. Omiecinski. FREM: Fast and robust EM clustering for large data sets. In ACM CIKM Conference, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. C. Ordonez, E. Omiecinski, and Norberto Ezquerra. A fast algorithm to cluster high dimensional basket data. In IEEE ICDM Conference, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. S. Roweis and Z. Ghahramani. A unifying review of Linear Gaussian Models. Neural Computation, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. M. Sato and S. Ishii. On-line EM algorithm for the normalized Gaussian network. Neural Computation, 12(2), 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. T. Zhang, R. Ramakrishnan, and M. Livny. Birch: An efficient data clustering method for very large databases. In ACM SIGMOD Conference, 1996. Google ScholarGoogle ScholarDigital LibraryDigital Library
  1. Clustering binary data streams with K-means

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        DMKD '03: Proceedings of the 8th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery
        June 2003
        103 pages
        ISBN:9781450374224
        DOI:10.1145/882082

        Copyright © 2003 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 13 June 2003

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • Article

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader