Abstract
Clustering approaches are widely used in areas such as information retrieval (IR), data integration, document classification, web mining, and query processing, among many other domains and disciplines. Much of the current literature describes clustering algorithms on multivariate data sets. However, few studies present them with both an exhaustive theoretical analysis and experimental comparisons. This experimental survey covers the basic principles and techniques of eleven clustering algorithms, including their important characteristics, application areas, run-time performance, and the internal, external, and stability validity of cluster quality, on five different data sets. The paper analyses how these algorithms behave on five different multivariate data sets in data representation. To answer this question, we compared the efficiency of the eleven clustering approaches on the five data sets using three validity metrics (internal, external, and stability) and identified the optimal score to determine the feasible solution of each algorithm. In addition, we include four popular modern clustering algorithms, with theoretical discussion only. Our experimental results, which cover only the traditional clustering algorithms, show that different algorithms behave differently on different data sets in terms of running time (speed), accuracy, and data set size. This study emphasizes the need for more adaptive algorithms and for a deliberate balance between running time and accuracy, in both their theoretical and implementation aspects.
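As a rough illustration of the kind of comparison the abstract describes, the sketch below times two clustering procedures on one toy data set and scores each with an internal measure and an external measure. Everything in it is an assumption made for illustration: the six-point data set, the plain k-means with evenly spaced starting centers, the deliberately naive round-robin baseline, and the choice of silhouette width and Rand index as the internal and external measures. None of it is taken from the paper's actual setup, and stability validity is left out for brevity.

```python
import math
import time
from itertools import combinations


def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means; starting centers are evenly spaced input points."""
    centers = [list(points[i * len(points) // k]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        # Update step: each center moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def round_robin(points, k):
    """Naive baseline (hypothetical): ignores geometry, labels by position."""
    return [i % k for i in range(len(points))]


def silhouette(points, labels):
    """Internal validity: mean silhouette width (needs at least two clusters)."""
    def mean_dist(i, cluster):
        ds = [math.dist(points[i], points[j])
              for j in range(len(points)) if labels[j] == cluster and j != i]
        return sum(ds) / len(ds) if ds else None

    widths = []
    for i in range(len(points)):
        a = mean_dist(i, labels[i])
        if a is None:  # singleton cluster: width is taken as 0
            widths.append(0.0)
            continue
        b = min(mean_dist(i, c) for c in set(labels) if c != labels[i])
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)


def rand_index(truth, pred):
    """External validity: share of point pairs on which both labelings agree."""
    pairs = list(combinations(range(len(truth)), 2))
    agree = sum((truth[i] == truth[j]) == (pred[i] == pred[j]) for i, j in pairs)
    return agree / len(pairs)


def evaluate(name, algorithm, points, truth, k):
    """Run one algorithm and collect running time plus both validity scores."""
    start = time.perf_counter()
    labels = algorithm(points, k)
    seconds = time.perf_counter() - start
    return {"algorithm": name, "seconds": seconds,
            "silhouette": silhouette(points, labels),
            "rand_index": rand_index(truth, labels)}


# Toy data (assumed): two well-separated groups with known labels.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
truth = [0, 0, 0, 1, 1, 1]

results = [evaluate("k-means", kmeans, points, truth, 2),
           evaluate("round-robin", round_robin, points, truth, 2)]
for row in results:
    print(row)
```

Reporting running time next to both scores, per algorithm and per data set, is what makes the trade-off between speed and accuracy visible. A fuller comparison would repeat this loop over several data sets and add a stability measure.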
Index Terms
- Experimental Comparisons of Clustering Approaches for Data Representation