Published in: Advances in Data Analysis and Classification 1/2022

06.10.2021 | Regular Article

Mixed Deep Gaussian Mixture Model: a clustering model for mixed datasets

Authors: Robin Fuchs, Denys Pommeret, Cinzia Viroli

Abstract

Clustering mixed data presents numerous challenges inherent to the highly heterogeneous nature of the variables. Despite this heterogeneity, a clustering algorithm should be able to extract discriminant information from the variables in order to form groups. In this work we introduce a model-based clustering method with a multilayer architecture, called the Mixed Deep Gaussian Mixture Model, which can be viewed as an automatic way to merge the clusterings performed separately on continuous and non-continuous data. The architecture is flexible and can be adapted to mixed data as well as to purely continuous or non-continuous data. In this sense, we generalize Generalized Linear Latent Variable Models and Deep Gaussian Mixture Models. We also design a new initialisation strategy and a data-driven method that selects the best model specification and the optimal number of clusters for a given dataset. In addition, our model provides continuous low-dimensional representations of the data, which can be a useful tool for visualizing mixed datasets. Finally, we validate the performance of our approach by comparing its results with those of state-of-the-art mixed-data clustering models on several commonly used datasets.
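As an illustrative sketch only (not the authors' MDGMM, whose layered architecture and estimation scheme are given in the paper), the Gaussian-mixture building block underlying such model-based clustering can be fitted by the EM algorithm. The minimal pure-Python example below fits a two-component, one-dimensional Gaussian mixture; the function name and the deterministic initialisation are choices made here for illustration.

```python
import math

def em_gmm_1d(data, n_iter=50):
    # Deterministic initialisation (an illustrative choice): means at the
    # data extremes, unit variances, equal mixing weights.
    mu = [min(data), max(data)]
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [pi[k] / math.sqrt(2 * math.pi * var[k])
                    * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                    for k in range(2)]
            s = sum(dens)
            resp.append([d / s for d in dens])
        # M-step: re-estimate mixing weights, means, and variances
        # (a small floor on the variance avoids degeneracy).
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, data)) / nk + 1e-6
    return mu, var, pi

# Two well-separated groups: EM recovers component means near 0 and 10.
data = [0.1, -0.2, 0.3, 0.0, 9.8, 10.1, 10.3, 9.9]
mu, var, pi = em_gmm_1d(data)
```

The MDGMM stacks such mixtures across layers and couples them with link functions so that non-continuous variables contribute to the same latent representation.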


Metadata
Title
Mixed Deep Gaussian Mixture Model: a clustering model for mixed datasets
Authors
Robin Fuchs
Denys Pommeret
Cinzia Viroli
Publication date
06.10.2021
Publisher
Springer Berlin Heidelberg
Published in
Advances in Data Analysis and Classification / Issue 1/2022
Print ISSN: 1862-5347
Electronic ISSN: 1862-5355
DOI
https://doi.org/10.1007/s11634-021-00466-3
