ABSTRACT
Due to the ever-growing presence of data streams, there has been a considerable amount of research on stream mining algorithms. While many algorithms have been introduced that tackle the problem of clustering on evolving data streams, hardly any attention has been paid to appropriate evaluation measures. Measures developed for static scenarios, namely structural measures and ground-truth-based measures, cannot correctly reflect errors attributable to emerging, splitting, or moving clusters. These situations are inherent to the streaming context due to the dynamic changes in the data distribution. In this paper, we develop a novel evaluation measure for stream clustering called the Cluster Mapping Measure (CMM). CMM effectively indicates different types of errors by taking the important properties of evolving data streams into account. We show in extensive experiments on real and synthetic data that CMM is a robust measure for stream clustering evaluation.