ABSTRACT
Many relative clustering validity criteria exist that are useful as quantitative measures for assessing the quality of data partitions. Each criterion has particular features that may make it more suitable for specific classes of problems. Nevertheless, the user usually does not know a priori how well each criterion will perform, so choosing a specific criterion is not a trivial task. One possible way to circumvent this drawback is to combine different relative criteria in order to obtain more robust evaluations. However, this approach has so far been applied only in an ad hoc fashion, and its real potential is not well understood. In this paper, we present an extensive study on the combination of relative criteria considering both synthetic and real datasets. The experiments involved 28 criteria and 4 different combination strategies applied to a varied collection of data partitions produced by 5 clustering algorithms. In total, 427,680 partitions of 972 synthetic datasets and 14,000 partitions of a collection of 400 image datasets were considered. Based on the results, we discuss the shortcomings and possible benefits of combining different relative criteria into a committee.
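To make the idea of a committee of criteria concrete, the following is a minimal sketch of one generic way to combine several relative criteria: rank the candidate partitions under each criterion and average the ranks. The abstract does not name the four combination strategies that were studied, so mean-rank aggregation is an assumption chosen only for illustration. The function names, the three criteria, and all score values are hypothetical.

```python
from statistics import mean


def ranks(scores, higher_is_better=True):
    """Return 1-based ranks (1 = best); tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=higher_is_better)
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        # Extend `end` over the run of partitions tied with the one at `pos`.
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg_rank
        pos = end + 1
    return result


def committee_mean_rank(score_table, higher_is_better):
    """Average the per-criterion ranks of each partition (lower = better)."""
    per_criterion = [ranks(score_table[name], higher_is_better[name])
                     for name in score_table]
    n_partitions = len(per_criterion[0])
    return [mean(r[p] for r in per_criterion) for p in range(n_partitions)]


# Hypothetical scores of three criteria for four candidate partitions.
scores = {
    "silhouette": [0.21, 0.55, 0.48, 0.30],              # higher is better
    "calinski_harabasz": [120.0, 310.0, 350.0, 150.0],   # higher is better
    "davies_bouldin": [1.40, 0.70, 0.90, 1.10],          # lower is better
}
direction = {
    "silhouette": True,
    "calinski_harabasz": True,
    "davies_bouldin": False,
}

combined = committee_mean_rank(scores, direction)
best = min(range(len(combined)), key=combined.__getitem__)
print("combined mean ranks:", combined)
print("partition selected by the committee:", best)
```

Working on ranks rather than raw values sidesteps the fact that different criteria live on different scales and may be either maximized or minimized. It also discards the size of the differences between partitions, which is one reason other combination schemes may behave differently.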