research-article

On the combination of relative clustering validity criteria

Authors:
Lucas Vendramin
Universidade de São Paulo (USP), São Carlos, São Paulo, Brazil

Pablo A. Jaskowiak
Universidade de São Paulo (USP), São Carlos, São Paulo, Brazil

Ricardo J. G. B. Campello
Universidade de São Paulo (USP), São Carlos, São Paulo, Brazil

SSDBM '13: Proceedings of the 25th International Conference on Scientific and Statistical Database Management, July 2013, Article No.: 4, Pages 1–12
https://doi.org/10.1145/2484838.2484844

Published: 29 July 2013

ABSTRACT

Many relative clustering validity criteria exist, and they are useful as quantitative measures for assessing the quality of data partitions. Each criterion has particular features that may make it more suitable for specific classes of problems. However, the user usually does not know a priori how well each criterion will perform, so choosing a specific criterion is not a trivial task. One way to circumvent this drawback is to combine different relative criteria in order to obtain more robust evaluations. So far, this approach has been applied only in an ad hoc fashion, and its real potential is not well understood. In this paper, we present an extensive study on the combination of relative criteria considering both synthetic and real datasets. The experiments involved 28 criteria and 4 different combination strategies applied to a varied collection of data partitions produced by 5 clustering algorithms. In total, 427,680 partitions of 972 synthetic datasets and 14,000 partitions of a collection of 400 image datasets were considered. Based on the results, we discuss the shortcomings and possible benefits of combining different relative criteria into a committee.
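
The abstract does not name the four combination strategies studied, so the following is only an illustrative sketch of what a committee of relative criteria could look like. It assumes one simple possibility, averaging the per-criterion ranks of each candidate partition, which may or may not match any strategy in the paper. The function names `ranks` and `committee_mean_rank` are hypothetical, and the scores are invented for illustration.

```python
from statistics import mean


def ranks(scores, higher_is_better=True):
    """Return 1-based ranks (1 = best); tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=higher_is_better)
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        # Extend `end` over the run of partitions tied with the one at `pos`.
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # mean of the 1-based positions pos+1 .. end+1
        for k in range(pos, end + 1):
            result[order[k]] = avg
        pos = end + 1
    return result


def committee_mean_rank(criteria):
    """criteria: list of (scores, higher_is_better) pairs, one per criterion.

    Returns the mean rank of each candidate partition across all criteria
    (an assumed combination rule, chosen here only for illustration).
    """
    per_criterion = [ranks(s, hib) for s, hib in criteria]
    return [mean(col) for col in zip(*per_criterion)]


# Invented scores for four candidate partitions under three criteria.
criteria = [
    ([0.41, 0.62, 0.55, 0.30], True),      # a criterion to maximize
    ([120.0, 310.0, 290.0, 95.0], True),   # another criterion to maximize
    ([1.10, 0.70, 0.65, 1.40], False),     # a criterion to minimize
]
mean_ranks = committee_mean_rank(criteria)
best = min(range(len(mean_ranks)), key=mean_ranks.__getitem__)
print(best, mean_ranks)
```

Working on ranks rather than raw scores sidesteps the fact that different criteria live on different scales and may point in opposite directions (some are maximized, others minimized). In this toy example the third criterion prefers a different partition than the first two, and the committee settles on the partition that ranks well under all three.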


Index Terms

1. Information systems
  1. Information systems applications
    1. Data mining


Published in
SSDBM '13: Proceedings of the 25th International Conference on Scientific and Statistical Database Management
July 2013
401 pages
ISBN: 9781450319218
DOI: 10.1145/2484838
Editors: Alex Szalay, Tamas Budavari, Magdalena Balazinska, Alexandra Meliou, Ahmet Sacan
Copyright © 2013 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
clustering validation
combinations of validity criteria
relative validity criteria
Acceptance Rates
Overall Acceptance Rate: 56 of 146 submissions, 38%