DOI: 10.1145/2484838.2484844
research-article

On the combination of relative clustering validity criteria

Published: 29 July 2013

ABSTRACT

Many relative clustering validity criteria exist, and they are useful as quantitative measures of the quality of data partitions. Each criterion has particular features that may make it more suitable for specific classes of problems. However, the user usually does not know a priori how well a given criterion will perform, so choosing one is not a trivial task. One way to circumvent this drawback is to combine different relative criteria in order to obtain more robust evaluations. So far, this approach has been applied only in an ad hoc fashion, and its real potential is not well understood. In this paper, we present an extensive study on the combination of relative criteria considering both synthetic and real datasets. The experiments involved 28 criteria and 4 different combination strategies applied to a varied collection of data partitions produced by 5 clustering algorithms. In total, 427,680 partitions of 972 synthetic datasets and 14,000 partitions of a collection of 400 image datasets were considered. Based on the results, we discuss the shortcomings and possible benefits of combining different relative criteria into a committee.
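To make the idea of a committee of criteria concrete, the following is a minimal sketch of one generic combination strategy: averaging the ranks that several criteria assign to the same candidate partitions. The abstract does not name the four strategies studied in the paper, so mean-rank aggregation is an assumption chosen for illustration. The function names `ranks` and `committee_mean_rank`, the choice of three criteria, and all score values are hypothetical.

```python
from statistics import mean


def ranks(scores, higher_is_better=True):
    """Return 1-based ranks (1 = best); tied scores share their average rank."""
    order = sorted(range(len(scores)), key=lambda i: scores[i],
                   reverse=higher_is_better)
    result = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        # Extend the window over all entries tied with the one at `pos`.
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg
        pos = end + 1
    return result


def committee_mean_rank(criteria):
    """Combine criteria by mean rank.

    `criteria` is a list of (scores, higher_is_better) pairs. Every score
    list evaluates the same candidate partitions in the same order.
    Returns one mean rank per partition (lower = better).
    """
    per_criterion = [ranks(s, hib) for s, hib in criteria]
    return [mean(col) for col in zip(*per_criterion)]


# Made-up scores for four candidate partitions (for example k = 2, 3, 4, 5).
silhouette = [0.41, 0.62, 0.58, 0.30]             # higher is better
davies_bouldin = [1.10, 0.55, 0.60, 1.40]         # lower is better
calinski_harabasz = [120.0, 310.0, 340.0, 90.0]   # higher is better

combined = committee_mean_rank([
    (silhouette, True),
    (davies_bouldin, False),
    (calinski_harabasz, True),
])
best = min(range(len(combined)), key=combined.__getitem__)
print(combined, best)
```

Ranks are used here instead of raw scores because the criteria live on different scales and some are minimized while others are maximized. Whether such a committee actually beats its individual members is the question the paper investigates; this sketch makes no claim either way.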


    • Published in

      SSDBM '13: Proceedings of the 25th International Conference on Scientific and Statistical Database Management
      July 2013
      401 pages
      ISBN: 9781450319218
      DOI: 10.1145/2484838

      Copyright © 2013 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States


      Acceptance Rates

      Overall Acceptance Rate: 56 of 146 submissions, 38%
