
Cross-Validation Approach to Evaluate Clustering Algorithms: An Experimental Study Using Multi-Label Datasets

  • Original Research
  • Published:
SN Computer Science

Abstract

Clustering validation is one of the most important and challenging parts of cluster analysis, as there is no ground-truth knowledge against which to compare the results. To date, evaluation methods for clustering algorithms have been used to determine the optimal number of clusters in the data, to assess the quality of clustering results through various validity criteria, and to compare results with other clustering schemes. In practice, it is also often important to build a model on a large amount of training data and then apply that model repeatedly to smaller amounts of new data. This amounts to assigning new data points to existing clusters constructed from the training set. However, very little practical guidance is available for measuring the prediction strength of a constructed model when predicting cluster labels for new samples. In this study, we propose an extension of the cross-validation procedure to evaluate the quality of a clustering model in predicting cluster membership for new data points. The performance score is measured as the root mean squared error (RMSE), based on the information from the multiple labels of the training and testing samples. Principal component analysis (PCA) followed by the k-means clustering algorithm was used to evaluate the proposed method. The clustering model was tested on three benchmark multi-label datasets and showed promising results, with an overall RMSE of less than 0.075 and a MAPE of less than 12.5% across the three datasets.
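The full pipeline (PCA for dimensionality reduction, then k-means, evaluated under cross-validation) is not reproduced in this preview, but the core idea the abstract describes can be sketched in plain Python: fit k-means on a training fold, summarize each cluster by the mean multi-label vector of its training members, assign each held-out point to its nearest centroid, and score the cluster's label profile against the point's true labels with RMSE. This is a minimal illustrative sketch under those assumptions; the function names and toy data are hypothetical, not the authors' code, and the PCA step is omitted.

```python
import math
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def mean_vec(vecs):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vecs)
    return [sum(v[i] for v in vecs) / n for i in range(len(vecs[0]))]

def kmeans(points, k, iters=50, seed=0):
    """Plain Lloyd's k-means; returns the final centroids."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: sq_dist(centroids[c], p))
            buckets[j].append(p)
        # Keep an old centroid if its bucket happens to be empty.
        centroids = [mean_vec(b) if b else centroids[j]
                     for j, b in enumerate(buckets)]
    return centroids

def cluster_label_profiles(points, labels, centroids):
    """Mean multi-label vector of the training members of each cluster."""
    k = len(centroids)
    members = [[] for _ in range(k)]
    for p, y in zip(points, labels):
        j = min(range(k), key=lambda c: sq_dist(centroids[c], p))
        members[j].append(y)
    return [mean_vec(m) if m else [0.0] * len(labels[0]) for m in members]

def rmse_on_test(test_points, test_labels, centroids, profiles):
    """Assign each held-out point to its nearest centroid, then score the
    cluster's label profile against the point's true label vector."""
    sq_errs = []
    for p, y in zip(test_points, test_labels):
        j = min(range(len(centroids)), key=lambda c: sq_dist(centroids[c], p))
        sq_errs.extend((yhat - yt) ** 2 for yhat, yt in zip(profiles[j], y))
    return math.sqrt(sum(sq_errs) / len(sq_errs))

# Toy "fold": two well-separated groups with distinct label vectors.
train_X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
train_Y = [[1, 0]] * 3 + [[0, 1]] * 3
centroids = kmeans(train_X, k=2)
profiles = cluster_label_profiles(train_X, train_Y, centroids)
err = rmse_on_test([[0.5, 0.5], [10.5, 10.5]], [[1, 0], [0, 1]],
                   centroids, profiles)
print(f"held-out RMSE: {err:.3f}")
```

In a cross-validation setting, this fit-profile-score loop would be repeated over each fold and the RMSE values averaged; a low score indicates that cluster membership learned on the training folds generalizes to unseen points.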

Figs. 1–4



Acknowledgements

The author would like to thank the reviewers of this paper for their supportive comments.

Funding

No funding was received for this study.

Author information


Corresponding author

Correspondence to Adane Nega Tarekegn.

Ethics declarations

Conflict of interest

The author declares no competing interests.

Ethical approval

This article does not contain any studies with human participants or animals performed by any of the authors.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Tarekegn, A.N., Michalak, K. & Giacobini, M. Cross-Validation Approach to Evaluate Clustering Algorithms: An Experimental Study Using Multi-Label Datasets. SN COMPUT. SCI. 1, 263 (2020). https://doi.org/10.1007/s42979-020-00283-z


  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s42979-020-00283-z
