Published in: Advances in Data Analysis and Classification 2/2016

1 June 2016 | Regular Article

Model based clustering for mixed data: clustMD

Authors: Damien McParland, Isobel Claire Gormley

Abstract

A model based clustering procedure for data of mixed type, clustMD, is developed using a latent variable model. It is proposed that a latent variable, following a mixture of Gaussian distributions, generates the observed data of mixed type. The observed data may be any combination of continuous, binary, ordinal or nominal variables. clustMD employs a parsimonious covariance structure for the latent variables, leading to a suite of six clustering models that vary in complexity and provide an elegant and unified approach to clustering mixed data. An expectation maximisation (EM) algorithm is used to estimate clustMD; in the presence of nominal data a Monte Carlo EM algorithm is required. The clustMD model is illustrated by clustering simulated mixed type data and prostate cancer patients, on whom mixed data have been recorded.
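
The generative idea in the abstract, a latent Gaussian mixture whose draws are turned into observed variables of different types, can be illustrated with a small simulation. The following is only a minimal sketch under stated assumptions and is not the authors' implementation: the function name `simulate_mixed`, the diagonal covariance, the zero threshold for the binary variable and the ordinal cut points are all hypothetical choices made for illustration. Nominal variables and the EM estimation step are not covered.

```python
import random
from bisect import bisect_right


def simulate_mixed(n, weights, means, sds, thresholds, seed=0):
    """Draw n rows of (continuous, binary, ordinal) data from a latent Gaussian mixture.

    Illustrative only: all names and parameter values are assumptions,
    and the covariance within each component is taken to be diagonal.
    """
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        # Pick a mixture component according to the mixing weights.
        g = rng.choices(range(len(weights)), weights=weights)[0]
        # Latent vector z, one Gaussian dimension per observed variable.
        z = [rng.gauss(m, s) for m, s in zip(means[g], sds[g])]
        continuous = z[0]                  # observed directly
        binary = int(z[1] > 0.0)           # assumed single threshold at zero
        # Ordinal level k when z[2] falls in the k-th interval between cut points.
        ordinal = bisect_right(thresholds, z[2]) + 1
        rows.append((continuous, binary, ordinal))
        labels.append(g)
    return rows, labels


rows, labels = simulate_mixed(
    n=200,
    weights=[0.5, 0.5],
    means=[[-2.0, -1.5, -1.0], [2.0, 1.5, 1.0]],
    sds=[[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]],
    thresholds=[-1.0, 0.0, 1.0],  # assumed cut points, giving 4 ordinal levels
)
```

Clustering reverses this process: given only `rows`, a method of this kind would estimate the component parameters and recover something close to `labels`.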

Metadata
Title
Model based clustering for mixed data: clustMD
Authors
Damien McParland
Isobel Claire Gormley
Publication date
1 June 2016
Publisher
Springer Berlin Heidelberg
Published in
Advances in Data Analysis and Classification / Issue 2/2016
Print ISSN: 1862-5347
Electronic ISSN: 1862-5355
DOI
https://doi.org/10.1007/s11634-016-0238-x
