poster

An effective evaluation measure for clustering on evolving data streams

Authors:
Hardy Kremer, RWTH Aachen University, Aachen, Germany
Philipp Kranen, RWTH Aachen University, Aachen, Germany
Timm Jansen, RWTH Aachen University, Aachen, Germany
Thomas Seidl, RWTH Aachen University, Aachen, Germany
Albert Bifet, University of Waikato, Hamilton, New Zealand
Geoff Holmes, University of Waikato, Hamilton, New Zealand
Bernhard Pfahringer, University of Waikato, Hamilton, New Zealand

KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining, August 2011, Pages 868–876
https://doi.org/10.1145/2020408.2020555

Published: 21 August 2011


ABSTRACT

Due to the ever-growing presence of data streams, there has been a considerable amount of research on stream mining algorithms. While many algorithms have been introduced that tackle the problem of clustering on evolving data streams, hardly any attention has been paid to appropriate evaluation measures. Measures developed for static scenarios, namely structural measures and ground-truth-based measures, cannot correctly reflect errors attributable to emerging, splitting, or moving clusters. These situations are inherent to the streaming context due to the dynamic changes in the data distribution. In this paper, we develop a novel evaluation measure for stream clustering called Cluster Mapping Measure (CMM). CMM effectively indicates different types of errors by taking the important properties of evolving data streams into account. We show in extensive experiments on real and synthetic data that CMM is a robust measure for stream clustering evaluation.
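To make the limitation of static measures concrete, the following minimal Python sketch computes purity, a common ground-truth-based measure, on a single evaluation window. It is not CMM and is not taken from the paper; the function name `purity`, the toy labels, and the window of six points are assumptions chosen for illustration. The sketch suggests that a clustering which wrongly splits one ground-truth class into two clusters can receive the same score as the correct clustering.

```python
from collections import Counter, defaultdict


def purity(true_labels, cluster_ids):
    """Fraction of points that belong to the majority class of their cluster.

    Illustrative static measure only; this is NOT the CMM from the paper.
    """
    if len(true_labels) != len(cluster_ids):
        raise ValueError("label sequences must have equal length")
    if not true_labels:
        raise ValueError("purity is undefined for an empty window")

    # Count ground-truth classes inside each found cluster.
    members = defaultdict(Counter)
    for cls, cid in zip(true_labels, cluster_ids):
        members[cid][cls] += 1

    # Sum the size of the majority class over all found clusters.
    majority_total = sum(max(counts.values()) for counts in members.values())
    return majority_total / len(true_labels)


# Hypothetical evaluation window: six points from one ground-truth class "A".
truth = ["A"] * 6
one_cluster = [0, 0, 0, 0, 0, 0]    # correct: a single found cluster
split_cluster = [0, 0, 0, 1, 1, 1]  # error: the class is split in two

purity_correct = purity(truth, one_cluster)
purity_split = purity(truth, split_cluster)
print(purity_correct, purity_split)
```

Both clusterings obtain the same purity here, so under these assumptions the splitting error is invisible to the measure. This is one example of the kind of error the abstract says static measures cannot reflect.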


Index Terms

Computing methodologies > Machine learning > Learning paradigms > Unsupervised learning > Cluster analysis
Information systems > Information systems applications > Data mining


Published in
KDD '11: Proceedings of the 17th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2011
1446 pages
ISBN: 9781450308137
DOI: 10.1145/2020408
General Chair: Chid Apte (IBM Research)
Program Chairs: Joydeep Ghosh (UT Austin), Padhraic Smyth (UC Irvine)
Copyright © 2011 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
evaluation measure
stream clustering