nach oben

Erschienen in:

2021 | OriginalPaper | Buchkapitel

Benchmarking in Cluster Analysis: A Study on Spectral Clustering, DBSCAN, and K-Means

verfasst von : Nivedha Murugesan, Irene Cho, Cristina Tortora

Erschienen in: Data Analysis and Rationality in a Complex World

Verlag: Springer International Publishing

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config

KI-gestützte Suche

Aus

Abstract

We perform a benchmarking study to identify the advantages and the drawbacks of Spectral Clustering and Density-Based Spatial Clustering of Applications with Noise (DBSCAN). We compare the two methods with the classic K-means clustering. The methods are performed on five simulated and three real data sets. The obtained clustering results are compared using external and internal indices, as well as run times. Although there is not one method that performs best on all types of data sets, we find that DBSCAN should generally be reserved for non-convex data with well-separated clusters or for data with many outliers. Spectral Clustering has better overall performance but with higher instability of the results compared to K-means, and longer run time.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

über 102.000 Bücher
über 537 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Finance + Banking
Management + Führung
Marketing + Vertrieb
Maschinenbau + Werkstoffe
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 390 Zeitschriften

aus folgenden Fachgebieten:

Automobil + Motoren
Bauwesen + Immobilien
Business IT + Informatik
Elektrotechnik + Elektronik
Energie + Nachhaltigkeit
Maschinenbau + Werkstoffe

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

über 67.000 Bücher
über 340 Zeitschriften

aus folgenden Fachgebieten:

Bauwesen + Immobilien
Business IT + Informatik
Finance + Banking
Management + Führung
Marketing + Vertrieb
Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Jetzt informieren

Vorheriges Kapitel Intertemporal Exploratory Analysis of E-Commerce From Greek Households from Official Statistics Data

Nächstes Kapitel Detection of Topics and Time Series Variation in Consumer Web Communication Data

Performed on a MacBook Pro 2017 - Processor: 2.3 GHz Intel Core i5; Memory: 8GB.

Link: https://irene1014.github.io/benchmarking-study-cluster-analysis/.

Barndorff-Nielsen, O.E.: Exponentially decreasing distributions for the logarithm of particle size. Proc. R. Soc. Lond. A. 353(1674), 401–419 (1977)

Becker, R.A., Wilks, A.R.: (Orig. S code), R. Brownrigg (R Version), T.P. Minka and A. Deckmyn (Enhancements), maps: Draw Geographical Maps. R package v. 3.3.0 (2018)

Becker, R.A., Wilks A.R.: (Original S code), R. Brownrigg (R Version), mapdata: Extra Map Databases. R package version 2.3.0. (2018)

Berhane, F.: Data distributions where Kmeans clustering fails: Can DBSCAN be a solution? https://datascience-enthusiast.com/Python/DBSCAN_Kmeans.html (2020)

Desgraupes, B.: clusterCrit: Clustering Indices. R package version 1.2.8. (2018)

Emadi, H.S., Mazinani, S.M.: A novel anomaly detection algorithm using DBSCAN and SVM in wireless sensor networks. Wirel. Pers Comm. 98, 2025–2035 (2018)

Ester, M., Kriegel, H., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: KDD-96 Proceedings, pp. 226-231 (1996)

Fitch, J.P., Khan, N.A.M., Tortora, C.: Back pain: a spectral clustering approach. Arch. Data Sci. Ser. B 1(1) (online first) (2019)

Francis, Z., Villagrasa, C., Clairand, I.: Simulation of DNA damage clustering after proton irradiation using an adapted DBSCAN algorithm. Comput. Methods Programs Biomed. 101(3), 265–270 (2011)

Genz, A., Bretz, F., Miwa, T., Mi, X., Leisch, F., Scheipl, F., Hothorn, T.: mvtnorm: Multivariate Normal and t Distributions, R package version 1.0-11 (2019)

Hahsler, M., Piekenbrock, M.: dbscan: Density Based Clustering of Applications with Noise (DBSCAN) and Related Algorithms, R package version 1.1-1 (2017)

Hornik, K., Grün, B.: movMF: An R package for fitting mixtures of von mises-fisher distributions. J. Stat. Soft. 58(10), 1–31 (2014)

Hubert, L., Arabie, P.: Comparing partitions. J. Classif. 2(1), 193–218 (1985)

Kahle, D., Wickham, H.: ggmap: spatial visualization with ggplot2. R J. 5(1), 144–161 (2013)

Karatzoglou, A., Smola, A., Hornik, K., Zeileis, A.: kernlab—An S4 Package for kernel methods in R. J. Stat. Soft. 11(9), 1–20 (2004)

Kassambara, A., Mundt, F.: factoextra: Extract and Visualize the Results of Multivariate Data Analyses, R package version 1.0.5 (2017)

Leisch F., Dimitriadou, E.: mlbench: Machine Learning Benchmark Problems, R package version 2.1-1 (2010)

Liu, Y., Li, Z., Xiong, H., Gao, X. Wu, J.: Understanding of internal clustering validation measures. In: Proceedings of the 2010 IEEE International Conference on Data Mining, pp. 911-916 (2010)

MacQueen, J.: Some methods for classification and analysis of multivariate observations. In: Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, vol. 1: Statistics, pp. 281-297. University of California Press, Berkeley, California (1967)

Maechler, M., Rousseeuw, P., Struyf, A., Hubert, M., Hornik, K.: cluster: Cluster Analysis Basics and Extensions, R package version 2.1.0 (2019)

McInnes, L., Healy, J., Astels, S.: Comparing python clustering algorithms. https://hdbscan.readthedocs.io/en/latest/comparing_clustering_algorithms.html

McLachlan, G.J., Basford, K.E.: Mixture Models: Inference and Applications to Clustering, Statistics, Textbooks and Monographs, vol. 84 (1988)

Mechelen, I.V., Boulesteix, A., Dangl, R., Dean, N., Guyon, I., Hennig, C., Leisch, F., Steinley, D.: Benchmarking in cluster analysis: A white paper (2018). arXiv:1809.10496v2

Murugesan, N., Cho, I., Tortora, C.: Benchmarking in cluster analysis: a study on spectral clustering, DBSCAN and K-means (github repository) https://irene1014.github.io/benchmarking-study-cluster-analysis (2020)

Paccanaro, A., Casbon, J.A., Saqi, M.A.S.: Spectral clustering of protein sequences. Nucleic. Acid Res. 34(5), 1571–1580 (2006)

Peng, H., Pavlidis, N., Eckley, I., Tsalamanis, I.: Subspace clustering of very sparse high-dimensional data. In: 2018 IEEE International Conference on Big Data, pp. 3780–3783 (2018)

Punzo, A., Mazza, A., McNicholas, P.D.: ContaminatedMixt: An R package for fitting parsimonious mixtures of multivariate contaminated normal distributions. J. Stat. Softw. 85(10), 1–25 (2018)

R Core Team: R: A language and environment for statistical computing, R Foundation for Statistical Computing, Vienna, Austria (2017)

Rousseeuw, P.J.: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. 20, 53–65 (1987)CrossRef

Sander, J., Ester, M., Kriegel, H., Xu, X.: Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Min. Knowl. Disc. 2, 169–194 (1998)CrossRef

Schubert, E., Sander, J., Ester, M., Kriegel, H., Xu, X.: DBSCAN revisited, revisited: why and how you should (Still) use DBSCAN. ACM T Database Syst. 42(3), 19 (2017)

Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 228, 888–905 (2000)

Tortora, C., ElSherbiny, A., Browne, R.P., Franczak B.C., McNicholas, P.D.: MixGHD: Model Based Clustering, Classification and Discriminant Analysis Using the Mixture of Generalized Hyperbolic Distributions, R package version 2.1 (2017)

Tortora, C., Franczak, B.C., Browne, R.P., McNicholas, P.D.: A mixture of coalesced generalized hyperbolic distributions. J. Classif. 36, 26–57 (2019)

Tukey, J.W.: A survey of sampling from contaminated distributions. In: Olkin, I., Ghurye, S.G., Hoeffding, W., Madow, W.G., Mann, H.B. (eds.) Contributions to Probability and Statistics: Essays in Honor of Harold Hotelling, pp 448–495 (1960)

Wang, S., Chen, F., Feng, J.: Spectral clustering of high-dimensional data via nonnegative matrix factorization. In: 2015 International Joint Conference on Neural Networks (IJCNN), pp 1–8 (2015)

Wickham, H.: ggplot2: Elegant Graphics for Data Analysis. Springer, NY (2016)

Wu, S., Feng, X., Zhou, W.: Spectral clustering of high-dimensional data exploiting sparse representation vectors. Neurocomputing 135, 229–239 (2014)

Titel: Benchmarking in Cluster Analysis: A Study on Spectral Clustering, DBSCAN, and K-Means
verfasst von: Nivedha Murugesan
Irene Cho
Cristina Tortora
Verlag: Springer International Publishing
Buch: Data Analysis and Rationality in a Complex World
Print ISBN: 978-3-030-60103-4

Electronic ISBN: 978-3-030-60104-1

Copyright-Jahr: 2021
DOI: https://doi.org/10.1007/978-3-030-60104-1_20

Springer Professional

Abstract

Bitte loggen Sie sich ein, um Zugang zu Ihrer Lizenz zu erhalten.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Springer Professional "Technik"

Springer Professional "Wirtschaft"

Premium Partner