Experimental Comparisons of Clustering Approaches for Data Representation

Authors:
Sanjay Kumar Anand, NSUT East Campus (Formerly AIACTR), GGSIP University, New Delhi, India
Suresh Kumar, NSUT, Main Campus, New Delhi, India

ACM Computing Surveys, Volume 55, Issue 3, Article No.: 45, pp 1–33, https://doi.org/10.1145/3490384

Published: 30 March 2022

Abstract

Clustering approaches are used extensively in areas such as information retrieval (IR), data integration, document classification, web mining, and query processing, among many other domains and disciplines. A large body of literature describes clustering algorithms on multivariate data sets, but little of it presents them with both exhaustive theoretical analysis and experimental comparison. This experimental survey covers the basic principles and techniques of eleven clustering algorithms, including their important characteristics, application areas, run-time performance, and the internal, external, and stability validity of cluster quality, on five different data sets. The paper asks how these algorithms behave on five different multivariate data sets in data representation. To answer this question, we compared the efficiency of the eleven clustering approaches on the five data sets using three kinds of validity metrics (internal, external, and stability) and determined the optimal score to identify the feasible solution of each algorithm. In addition, we include four popular, modern clustering algorithms, for which we give only a theoretical discussion. Our experimental results, which cover only the traditional clustering algorithms, show that different algorithms behave differently on different data sets in terms of running time (speed), accuracy, and the size of the data set. This study emphasizes the need for more adaptive algorithms and for a deliberate balance between running time and accuracy, in both their theoretical and implementation aspects.
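
The comparison protocol outlined in the abstract (run each algorithm on a data set, time it, then score the result with internal and external validity indices) can be sketched roughly as follows. This is a minimal illustration and not the authors' experimental code: the toy points, the plain k-means with first-k seeding, the naive `split_by_order` baseline, and all function names are assumptions made for the example. Silhouette width stands in for the internal indices and the Rand index for the external ones; stability validation is not covered here.

```python
import math
import time
from itertools import combinations

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; the first k points seed the centroids (assumption)."""
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return labels

def split_by_order(points, k):
    """Deliberately naive baseline: cut the input order into k equal runs."""
    size = math.ceil(len(points) / k)
    return [i // size for i in range(len(points))]

def silhouette(points, labels):
    """Internal validity: mean silhouette width in [-1, 1]; higher is better."""
    scores = []
    for i, p in enumerate(points):
        by_cluster = {}
        for j, q in enumerate(points):
            if i != j:
                by_cluster.setdefault(labels[j], []).append(math.dist(p, q))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            scores.append(0.0)  # singleton or single-cluster case scores 0
            continue
        a = sum(own) / len(own)                                 # cohesion
        b = min(sum(d) / len(d) for d in by_cluster.values())   # separation
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

def rand_index(truth, labels):
    """External validity: fraction of point pairs on which two labelings agree."""
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum((truth[i] == truth[j]) == (labels[i] == labels[j])
                for i, j in pairs)
    return agree / len(pairs)

def compare(points, truth, algorithms, k):
    """Run every algorithm once and collect run time plus both validity scores."""
    results = {}
    for name, algo in algorithms.items():
        start = time.perf_counter()
        labels = algo(points, k)
        elapsed = time.perf_counter() - start
        results[name] = {
            "seconds": elapsed,
            "silhouette": silhouette(points, labels),
            "rand": rand_index(truth, labels),
        }
    return results

# Toy data (assumption): two well-separated squares, interleaved in input order.
POINTS = [(0, 0), (8, 8), (0, 1), (8, 9), (1, 0), (9, 8), (1, 1), (9, 9)]
TRUTH = [0, 1, 0, 1, 0, 1, 0, 1]

RESULTS = compare(POINTS, TRUTH,
                  {"k-means": kmeans, "split-by-order": split_by_order}, k=2)
for name, row in RESULTS.items():
    print(name, row)
```

On real data one would repeat each run over several data sets and seeds before drawing conclusions, since a single timing or score on a toy set says little about the trade-off between running time and accuracy that the abstract emphasizes.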

Index Terms

Experimental Comparisons of Clustering Approaches for Data Representation
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
    1. Redundancy
  2. Embedded and cyber-physical systems
    1. Embedded systems
    2. Robotics

Published in
ACM Computing Surveys Volume 55, Issue 3
March 2023
772 pages
ISSN: 0360-0300
EISSN: 1557-7341
DOI: 10.1145/3514180
Editor:
Albert Zomaya
University of Sydney, Australia
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 30 March 2022
- Accepted: 1 October 2021
- Revised: 1 September 2021
- Received: 1 December 2019
Author Tags
Clustering approach
internal validation
external validation
stability validation
optimal score
Qualifiers
- survey
- Refereed