DOI: 10.1145/2020408.2020555
poster

An effective evaluation measure for clustering on evolving data streams

Published: 21 August 2011

ABSTRACT

Due to the ever-growing presence of data streams, there has been a considerable amount of research on stream mining algorithms. While many algorithms have been introduced that tackle the problem of clustering on evolving data streams, hardly any attention has been paid to appropriate evaluation measures. Measures developed for static scenarios, namely structural measures and ground-truth-based measures, cannot correctly reflect errors attributable to emerging, splitting, or moving clusters. These situations are inherent to the streaming context due to the dynamic changes in the data distribution. In this paper, we develop a novel evaluation measure for stream clustering called the Cluster Mapping Measure (CMM). CMM effectively indicates different types of errors by taking the important properties of evolving data streams into account. We show in extensive experiments on real and synthetic data that CMM is a robust measure for stream clustering evaluation.


Published in

KDD '11: Proceedings of the 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
August 2011
1446 pages
ISBN: 9781450308137
DOI: 10.1145/2020408

Copyright © 2011 ACM


Publisher: Association for Computing Machinery, New York, NY, United States
