
2021 | Original Paper | Book Chapter

Clustering Mixed-Type Data: A Benchmark Study on KAMILA and K-Prototypes

Authors: Jarrett Jimeno, Madhumita Roy, Cristina Tortora

Published in: Data Analysis and Rationality in a Complex World

Publisher: Springer International Publishing


Abstract

Benchmarking in cluster analysis is the process of analyzing which clustering techniques give the best results for different types of data structures, as well as setting a standard for the evaluation of newer clustering methods. There are many instances of benchmarking in cluster analysis for continuous data, but only a few for mixed-type data, i.e., data sets with both nominal and continuous variables. Therefore, we explore the process of benchmarking various clustering methods on simulated mixed-type data sets with varying proportions of continuous and nominal variables. For this purpose, we test a newer clustering algorithm, KAMILA, against K-prototypes and against tandem analysis, in which data are preprocessed using multiple correspondence analysis and then clustered using K-means, fuzzy K-means, probabilistic distance clustering (PD), or Student-t mixture models.
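The benchmarking setup described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' simulation design: the generator `simulate_mixed`, its parameters (`sep`, `levels`, `purity`), and the choice of the adjusted Rand index (Hubert and Arabie, 1985) as the score are assumptions made here for illustration only. The sketch draws data sets with a varying split between continuous and nominal variables and shows how a predicted partition could be scored against the known cluster labels.

```python
import random
from collections import Counter
from math import comb


def simulate_mixed(n, n_cont, n_nom, k=2, sep=3.0, levels=3, purity=0.8, seed=0):
    """Draw n rows with n_cont Gaussian and n_nom nominal variables from k clusters.

    Hypothetical generator: every parameter name and default is an assumption.
    """
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        g = rng.randrange(k)  # true cluster label of this row
        # Continuous variables: unit-variance Gaussians with cluster-specific means.
        cont = [rng.gauss(sep * g, 1.0) for _ in range(n_cont)]
        # Nominal variables: each cluster favours one level with probability `purity`.
        nom = [g % levels if rng.random() < purity else rng.randrange(levels)
               for _ in range(n_nom)]
        rows.append((cont, nom))
        labels.append(g)
    return rows, labels


def adjusted_rand_index(a, b):
    """Adjusted Rand index between two label lists of equal length."""
    n = len(a)
    if n < 2:
        return 1.0
    sum_ij = sum(comb(c, 2) for c in Counter(zip(a, b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(a).values())
    sum_b = sum(comb(c, 2) for c in Counter(b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = (sum_a + sum_b) / 2
    if max_index == expected:  # degenerate case: both partitions are trivial
        return 1.0
    return (sum_ij - expected) / (max_index - expected)


if __name__ == "__main__":
    total_vars = 6
    for n_cont in (1, 3, 5):  # vary the proportion of continuous variables
        rows, truth = simulate_mixed(200, n_cont, total_vars - n_cont, k=3, seed=n_cont)
        # Placeholder: a real study would use labels returned by a clustering method.
        predicted = [(g + 1) % 3 for g in truth]  # relabelled copy of the truth
        print(n_cont, total_vars - n_cont, adjusted_rand_index(truth, predicted))
```

In an actual benchmark, `predicted` would come from each method under comparison (for example KAMILA or K-prototypes), and the score would be averaged over many simulated data sets per setting.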


Metadata
Copyright year: 2021
DOI: https://doi.org/10.1007/978-3-030-60104-1_10
