2021 | OriginalPaper | Book chapter

Clustering Mixed-Type Data: A Benchmark Study on KAMILA and K-Prototypes

Written by: Jarrett Jimeno, Madhumita Roy, Cristina Tortora

Published in: Data Analysis and Rationality in a Complex World

Publisher: Springer International Publishing

Abstract

Benchmarking in cluster analysis is the process of determining which clustering techniques give the best results for different types of data structures, as well as setting a standard for the evaluation of newer clustering methods. There are many instances of benchmarking in cluster analysis for continuous data, but only a few for mixed-type data, i.e., data sets with nominal and continuous variables. Therefore, we explore the process of benchmarking various clustering methods on simulated mixed-type data sets with varying proportions of continuous and nominal variables. For this purpose, we test a newer clustering algorithm, KAMILA, against K-prototypes and against tandem analysis, in which data are preprocessed using multiple correspondence analysis and then clustered using K-means, fuzzy K-means, probabilistic distance clustering (PD), or Student-t mixture models.
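To make the idea of a mixed-type dissimilarity concrete, the following is a minimal Python sketch of the measure commonly associated with K-prototypes: squared Euclidean distance on the continuous variables plus a weight times the number of mismatching nominal values. It is an illustration only, not the code used in the study. The names `mixed_dissimilarity`, `assign`, the weight `gamma`, and the toy data are all assumptions introduced here.

```python
from typing import List, Sequence, Tuple

# A point is a pair: (continuous values, nominal values).
Point = Tuple[Sequence[float], Sequence[str]]


def mixed_dissimilarity(x: Point, prototype: Point, gamma: float) -> float:
    """Squared Euclidean distance on the continuous part plus gamma times
    the count of mismatching nominal values (simple matching)."""
    x_num, x_cat = x
    p_num, p_cat = prototype
    numeric = sum((a - b) ** 2 for a, b in zip(x_num, p_num))
    mismatches = sum(1 for a, b in zip(x_cat, p_cat) if a != b)
    return numeric + gamma * mismatches


def assign(points: List[Point], prototypes: List[Point], gamma: float) -> List[int]:
    """Return, for each point, the index of its nearest prototype."""
    return [
        min(
            range(len(prototypes)),
            key=lambda k: mixed_dissimilarity(p, prototypes[k], gamma),
        )
        for p in points
    ]


# Hypothetical toy data: two continuous and two nominal variables.
points = [
    ([0.0, 0.1], ["a", "x"]),
    ([0.2, 0.0], ["a", "y"]),
    ([5.0, 5.1], ["b", "y"]),
]
prototypes = [([0.0, 0.0], ["a", "x"]), ([5.0, 5.0], ["b", "y"])]

labels = assign(points, prototypes, gamma=1.0)
print(labels)
```

The weight `gamma` controls how much the nominal variables count relative to the continuous ones. The sketch shows only the assignment step; a full algorithm would also update the prototypes (means for continuous variables, modes for nominal ones) and iterate until the assignments stop changing.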

https://cristinatortora.github.io/Benchmark-on-Clustering-Mixed-Type-Data/.

Print ISBN: 978-3-030-60103-4
Electronic ISBN: 978-3-030-60104-1
Copyright year: 2021
DOI: https://doi.org/10.1007/978-3-030-60104-1_10
