
Cross-Validation Approach to Evaluate Clustering Algorithms: An Experimental Study Using Multi-Label Datasets

  • Original Research
  • Published:
SN Computer Science

Abstract

Clustering validation is one of the most important and challenging parts of cluster analysis, as there is no ground-truth knowledge against which to compare the results. To date, evaluation methods for clustering algorithms have been used to determine the optimal number of clusters in the data, to assess the quality of clustering results through various validity criteria, and to compare results with other clustering schemes. In practice, it is also often important to build a model on a large amount of training data and then apply that model repeatedly to smaller amounts of new data. This amounts to assigning new data points to existing clusters constructed from the training set. However, very little practical guidance is available for measuring the prediction strength of a constructed model when predicting cluster labels for new samples. In this study, we propose an extension of the cross-validation procedure to evaluate the quality of a clustering model in predicting cluster membership for new data points. The performance score is measured as the root mean squared error (RMSE), based on the information from the multiple labels of the training and testing samples. Principal component analysis (PCA) followed by the k-means clustering algorithm was used to evaluate the proposed method. The clustering model was tested on three benchmark multi-label datasets and showed promising results, with an overall RMSE of less than 0.075 and a MAPE of less than 12.5% across the three datasets.
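The full pipeline (PCA for dimensionality reduction, then k-means, evaluated under cross-validation) is not reproduced in this preview, but the core idea the abstract describes can be sketched in plain Python: fit k-means on a training fold, summarize each cluster by the mean multi-label vector of its training members, assign each held-out point to its nearest centroid, and score the cluster's label profile against the point's true labels with RMSE. This is a minimal illustrative sketch under those assumptions; the function names and toy data are hypothetical, not the authors' code, and the PCA step is omitted.

```python
import math
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vec(vecs):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns the final centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(centroids[c], p))
            buckets[j].append(p)
        # Keep an old centroid if its bucket happens to be empty.
        centroids = [mean_vec(b) if b else centroids[j]
                     for j, b in enumerate(buckets)]
    return centroids

def cluster_label_profiles(points, labels, centroids):
    """Mean multi-label vector of the training members of each cluster."""
    k = len(centroids)
    members = [[] for _ in range(k)]
    for p, y in zip(points, labels):
        j = min(range(k), key=lambda c: sq_dist(centroids[c], p))
        members[j].append(y)
    return [mean_vec(m) if m else [0.0] * len(labels[0]) for m in members]

def rmse_on_test(test_points, test_labels, centroids, profiles):
    """Assign each held-out point to its nearest centroid, then score the
    cluster's label profile against the point's true label vector."""
    sq_errs = []
    for p, y in zip(test_points, test_labels):
        j = min(range(len(centroids)), key=lambda c: sq_dist(centroids[c], p))
        sq_errs.extend((yhat - yt) ** 2 for yhat, yt in zip(profiles[j], y))
    return math.sqrt(sum(sq_errs) / len(sq_errs))

# Toy "fold": two well-separated groups with distinct label vectors.
train_X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
train_Y = [[1, 0]] * 3 + [[0, 1]] * 3
centroids = kmeans(train_X, k=2)
profiles = cluster_label_profiles(train_X, train_Y, centroids)
err = rmse_on_test([[0.5, 0.5], [10.5, 10.5]], [[1, 0], [0, 1]],
                   centroids, profiles)
print(f"held-out RMSE: {err:.3f}")
```

In a cross-validation setting, this fit-profile-score loop would be repeated over each fold and the RMSE values averaged; a low score indicates that cluster membership learned on the training folds generalizes to unseen points.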

Figs. 1–4



Acknowledgements

The author would like to thank the reviewers of this paper for their supportive comments.

Funding

No funding was received for this study.

Author information


Corresponding author

Correspondence to Adane Nega Tarekegn.

Ethics declarations

Conflict of interest

The author declares no competing interests.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Tarekegn, A.N., Michalak, K. & Giacobini, M. Cross-Validation Approach to Evaluate Clustering Algorithms: An Experimental Study Using Multi-Label Datasets. SN COMPUT. SCI. 1, 263 (2020). https://doi.org/10.1007/s42979-020-00283-z


  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s42979-020-00283-z
