Published in: Advances in Data Analysis and Classification 1/2022

06.10.2021 | Regular Article

Mixed Deep Gaussian Mixture Model: a clustering model for mixed datasets

Authors: Robin Fuchs, Denys Pommeret, Cinzia Viroli

Abstract

Clustering mixed data presents numerous challenges inherent to the highly heterogeneous nature of the variables. Despite this heterogeneity, a clustering algorithm should be able to extract discriminant information from the variables in order to form groups. In this work we introduce a model-based clustering method with a multilayer architecture, called the Mixed Deep Gaussian Mixture Model, which can be viewed as an automatic way to merge the clusterings performed separately on continuous and non-continuous data. The architecture is flexible and can be adapted to mixed data as well as to purely continuous or non-continuous data. In this sense, we generalize Generalized Linear Latent Variable Models and Deep Gaussian Mixture Models. We also design a new initialisation strategy and a data-driven method that selects the best model specification and the optimal number of clusters for a given dataset. In addition, our model provides continuous low-dimensional representations of the data, which can be a useful tool for visualizing mixed datasets. Finally, we validate the performance of our approach by comparing its results with those of state-of-the-art mixed-data clustering models on several commonly used datasets.
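As an illustrative sketch only (not the authors' MDGMM, whose layered architecture and estimation scheme are given in the paper), the Gaussian-mixture building block underlying such model-based clustering can be fitted by the EM algorithm. The minimal pure-Python example below fits a two-component, one-dimensional Gaussian mixture; the function name and the deterministic initialisation are choices made here for illustration.

```python
import math

def em_gmm_1d(data, n_iter=50):
    # Deterministic initialisation (an illustrative choice): means at the
    # data extremes, unit variances, equal mixing weights.
    mu = [min(data), max(data)]
    var = [1.0, 1.0]
    pi = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each point.
        resp = []
        for x in data:
            dens = [pi[k] / math.sqrt(2 * math.pi * var[k])
                    * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                    for k in range(2)]
            s = sum(dens)
            resp.append([d / s for d in dens])
        # M-step: re-estimate mixing weights, means, and variances
        # (a small floor on the variance avoids degeneracy).
        for k in range(2):
            nk = sum(r[k] for r in resp)
            pi[k] = nk / len(data)
            mu[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var[k] = sum(r[k] * (x - mu[k]) ** 2
                         for r, x in zip(resp, data)) / nk + 1e-6
    return mu, var, pi

# Two well-separated groups: EM recovers component means near 0 and 10.
data = [0.1, -0.2, 0.3, 0.0, 9.8, 10.1, 10.3, 9.9]
mu, var, pi = em_gmm_1d(data)
```

The MDGMM stacks such mixtures across layers and couples them with link functions so that non-continuous variables contribute to the same latent representation.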


Metadata
Title
Mixed Deep Gaussian Mixture Model: a clustering model for mixed datasets
Authors
Robin Fuchs
Denys Pommeret
Cinzia Viroli
Publication date
06.10.2021
Publisher
Springer Berlin Heidelberg
Published in
Advances in Data Analysis and Classification / Issue 1/2022
Print ISSN: 1862-5347
Electronic ISSN: 1862-5355
DOI
https://doi.org/10.1007/s11634-021-00466-3
