
Experimental Comparisons of Clustering Approaches for Data Representation

Published: 30 March 2022

Abstract

Clustering approaches are used extensively in areas such as information retrieval, data integration, document classification, web mining, and query processing, among many other domains and disciplines. A large body of literature describes clustering algorithms for multivariate data sets, but comparatively little of it offers exhaustive theoretical analysis together with extensive experimental comparison. This experimental survey covers eleven clustering algorithms: their basic principles and techniques, important characteristics, application areas, run-time performance, and the internal, external, and stability validity of cluster quality, evaluated on five different data sets. The paper analyses how these algorithms behave on five multivariate data sets for data representation. To answer this question, we compared the efficiency of the eleven clustering approaches on the five data sets using three classes of validity metrics (internal, external, and stability) and determined the optimal score of each algorithm to assess its feasibility. In addition, four popular modern clustering algorithms are included with a theoretical discussion only. Our experimental results for the traditional clustering algorithms show that the algorithms behave differently across data sets with respect to running time (speed), accuracy, and data set size. The study emphasizes the need for more adaptive algorithms and for a deliberate balance between running time and accuracy, in both their theoretical and implementation aspects.
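To make the comparison protocol described above concrete, the sketch below shows how a few traditional clustering algorithms could be scored on a single multivariate data set with one internal metric (silhouette) and one external metric (adjusted Rand index), together with wall-clock running time. It is an illustrative example only, not the authors' experimental code; the use of scikit-learn, the Iris data set, and the fixed number of clusters are assumptions made here for brevity.

```python
# Illustrative sketch (not the survey's actual pipeline): compare a few
# clustering algorithms on one multivariate data set using an internal
# validity metric (silhouette), an external metric (adjusted Rand index),
# and wall-clock running time.
import time

from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering, KMeans, SpectralClustering
from sklearn.metrics import adjusted_rand_score, silhouette_score
from sklearn.preprocessing import StandardScaler

# Stand-in for one of the five multivariate data sets.
X, y_true = datasets.load_iris(return_X_y=True)
X = StandardScaler().fit_transform(X)

algorithms = {
    "k-means": KMeans(n_clusters=3, n_init=10, random_state=0),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
    "spectral": SpectralClustering(n_clusters=3, random_state=0),
}

for name, model in algorithms.items():
    start = time.perf_counter()
    labels = model.fit_predict(X)
    elapsed = time.perf_counter() - start
    print(f"{name:15s} "
          f"silhouette={silhouette_score(X, labels):.3f}  "
          f"ARI={adjusted_rand_score(y_true, labels):.3f}  "
          f"time={elapsed * 1000:.1f} ms")
```

Stability validity, as implemented for example in the clValid R package, is typically assessed by re-clustering the data with one column removed at a time and measuring agreement with the clustering of the full data; it is omitted from this sketch for brevity.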

Published in

ACM Computing Surveys, Volume 55, Issue 3 (March 2023), 772 pages
ISSN: 0360-0300
EISSN: 1557-7341
DOI: 10.1145/3514180


Publisher

Association for Computing Machinery, New York, NY, United States

Publication History

• Published: 30 March 2022
• Accepted: 1 October 2021
• Revised: 1 September 2021
• Received: 1 December 2019
