Skip to main content
Erschienen in: Advances in Data Analysis and Classification 1/2013

01.03.2013 | Regular Article

Clustering and classification via cluster-weighted factor analyzers

verfasst von: Sanjeena Subedi, Antonio Punzo, Salvatore Ingrassia, Paul D. McNicholas

Erschienen in: Advances in Data Analysis and Classification | Ausgabe 1/2013

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

In model-based clustering and classification, the cluster-weighted model is a convenient approach when the random vector of interest is constituted by a response variable \(Y\) and by a vector \({\varvec{X}}\) of \(p\) covariates. However, its applicability may be limited when \(p\) is high. To overcome this problem, this paper assumes a latent factor structure for \({\varvec{X}}\) in each mixture component, under Gaussian assumptions. This leads to the cluster-weighted factor analyzers (CWFA) model. By imposing constraints on the variance of \(Y\) and the covariance matrix of \({\varvec{X}}\), a novel family of sixteen CWFA models is introduced for model-based clustering and classification. The alternating expectation-conditional maximization algorithm, for maximum likelihood estimation of the parameters of all models in the family, is described; to initialize the algorithm, a 5-step hierarchical procedure is proposed, which uses the nested structures of the models within the family and thus guarantees the natural ranking among the sixteen likelihoods. Artificial and real data show that these models have very good clustering and classification performance and that the algorithm is able to recover the parameters very well.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Anhänge
Nur mit Berechtigung zugänglich
Literatur
Zurück zum Zitat Airoldi, J, Hoffmann R (1984) Age variation in voles (Microtus californicus, M. ochrogaster) and its significance for systematic studies, Occasional papers of the Museum of Natural History, vol 111. University of Kansas, Lawrence Airoldi, J, Hoffmann R (1984) Age variation in voles (Microtus californicus, M. ochrogaster) and its significance for systematic studies, Occasional papers of the Museum of Natural History, vol 111. University of Kansas, Lawrence
Zurück zum Zitat Aitken AC (1926) On Bernoulli’s numerical solution of algebraic equations. Proc Royal Soc Edinburgh 46:289–305MATH Aitken AC (1926) On Bernoulli’s numerical solution of algebraic equations. Proc Royal Soc Edinburgh 46:289–305MATH
Zurück zum Zitat Andrews JL, McNicholas PD, Subedi S (2011) Model-based classification via mixtures of multivariate t-distributions. Comput Stat Data Anal 55(1):520–529MathSciNetMATHCrossRef Andrews JL, McNicholas PD, Subedi S (2011) Model-based classification via mixtures of multivariate t-distributions. Comput Stat Data Anal 55(1):520–529MathSciNetMATHCrossRef
Zurück zum Zitat Baek J, McLachlan GJ, Flack LK (2010) Mixtures of factor analyzers with common factor loadings: applications to the clustering and visualization of high-dimensional data. IEEE Trans Pattern Anal Mach Intell 32:1298–1309CrossRef Baek J, McLachlan GJ, Flack LK (2010) Mixtures of factor analyzers with common factor loadings: applications to the clustering and visualization of high-dimensional data. IEEE Trans Pattern Anal Mach Intell 32:1298–1309CrossRef
Zurück zum Zitat Bartholomew DJ, Knott M (1999) Latent variable models and factor analysis. In: Kendall’s library of statistics, vol 7, 2nd edn. Edward Arnold, London Bartholomew DJ, Knott M (1999) Latent variable models and factor analysis. In: Kendall’s library of statistics, vol 7, 2nd edn. Edward Arnold, London
Zurück zum Zitat Bartlett M (1953) Factor analysis in psychology as a statistician sees it. In: Uppsala symposium on psychological factor analysis, Number 3 in Nordisk Psykologi’s Monograph Series, Uppsala, Sweden, pp 23–34. Almquist and Wiksell, Uppsala Bartlett M (1953) Factor analysis in psychology as a statistician sees it. In: Uppsala symposium on psychological factor analysis, Number 3 in Nordisk Psykologi’s Monograph Series, Uppsala, Sweden, pp 23–34. Almquist and Wiksell, Uppsala
Zurück zum Zitat Biernacki C, Celeux G, Govaert G (2000) Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans Pattern Anal Mach Intell 22(7):719–725CrossRef Biernacki C, Celeux G, Govaert G (2000) Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans Pattern Anal Mach Intell 22(7):719–725CrossRef
Zurück zum Zitat Böhning D, Dietz E, Schaub R, Schlattmann P, Lindsay B (1994) The distribution of the likelihood ratio for mixtures of densities from the one-parameter exponential family. Annals Inst Stat Math 46(2):373–388MATHCrossRef Böhning D, Dietz E, Schaub R, Schlattmann P, Lindsay B (1994) The distribution of the likelihood ratio for mixtures of densities from the one-parameter exponential family. Annals Inst Stat Math 46(2):373–388MATHCrossRef
Zurück zum Zitat Browne RP, McNicholas PD (2012) Model-based clustering, classification, and discriminant analysis of data with mixed type. J Stat Plann Infer 142(11):2976–2984MathSciNetMATHCrossRef Browne RP, McNicholas PD (2012) Model-based clustering, classification, and discriminant analysis of data with mixed type. J Stat Plann Infer 142(11):2976–2984MathSciNetMATHCrossRef
Zurück zum Zitat Browne RP, McNicholas PD, Sparling MD (2012) Model-based learning using a mixture of mixtures of Gaussian and uniform distributions. IEEE Trans Pattern Anal Mach Intell 34(4):814–817CrossRef Browne RP, McNicholas PD, Sparling MD (2012) Model-based learning using a mixture of mixtures of Gaussian and uniform distributions. IEEE Trans Pattern Anal Mach Intell 34(4):814–817CrossRef
Zurück zum Zitat Carvalho C, Chang J, Lucas J, Nevins J, Wang Q, West M (2008) High-dimensional sparse factor modeling: applications in gene expression genomics. J Am Stat Assoc 103(484):1438–1456MathSciNetCrossRef Carvalho C, Chang J, Lucas J, Nevins J, Wang Q, West M (2008) High-dimensional sparse factor modeling: applications in gene expression genomics. J Am Stat Assoc 103(484):1438–1456MathSciNetCrossRef
Zurück zum Zitat Celeux G, Govaert G (1995) Gaussian parsimonious clustering models. Pattern Recogn 28(5):781–793CrossRef Celeux G, Govaert G (1995) Gaussian parsimonious clustering models. Pattern Recogn 28(5):781–793CrossRef
Zurück zum Zitat Dean N, Murphy TB, Downey G (2006) Using unlabelled data to update classification rules with applications in food authenticity studies. J Royal Stat Soc Ser C 55(1):1–14MathSciNetMATHCrossRef Dean N, Murphy TB, Downey G (2006) Using unlabelled data to update classification rules with applications in food authenticity studies. J Royal Stat Soc Ser C 55(1):1–14MathSciNetMATHCrossRef
Zurück zum Zitat Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the EM algorithm. J Royal Stat Soc Ser B 39(1):1–38MathSciNetMATH Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the EM algorithm. J Royal Stat Soc Ser B 39(1):1–38MathSciNetMATH
Zurück zum Zitat Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 97(458):611–631MathSciNetMATHCrossRef Fraley C, Raftery AE (2002) Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 97(458):611–631MathSciNetMATHCrossRef
Zurück zum Zitat Fraley C, Raftery AE, Murphy TB, Scrucca L (2012) mclust version 4 for R: normal mixture modeling for model-based clustering, classification, and density estimation. Technical report 597, Department of Statistics, University of Washington, Seattle, Washington, USA Fraley C, Raftery AE, Murphy TB, Scrucca L (2012) mclust version 4 for R: normal mixture modeling for model-based clustering, classification, and density estimation. Technical report 597, Department of Statistics, University of Washington, Seattle, Washington, USA
Zurück zum Zitat Frühwirth-Schnatter S (2006) Finite mixture and markov switching models. Springer, New YorkMATH Frühwirth-Schnatter S (2006) Finite mixture and markov switching models. Springer, New YorkMATH
Zurück zum Zitat Gershenfeld N (1997) Nonlinear inference and cluster-weighted modeling. Ann New York Acad Sci 808(1):18–24CrossRef Gershenfeld N (1997) Nonlinear inference and cluster-weighted modeling. Ann New York Acad Sci 808(1):18–24CrossRef
Zurück zum Zitat Ghahramani Z, Hinton G (1997) The EM algorithm for factor analyzers. Technical report CRG-TR-96-1, University of Toronto, Toronto Ghahramani Z, Hinton G (1997) The EM algorithm for factor analyzers. Technical report CRG-TR-96-1, University of Toronto, Toronto
Zurück zum Zitat Grün B, Leisch F (2008) Flexmix version 2: finite mixtures with concomitant variables and varying and constant parameters. J Stat Softw 28(4):1–35 Grün B, Leisch F (2008) Flexmix version 2: finite mixtures with concomitant variables and varying and constant parameters. J Stat Softw 28(4):1–35
Zurück zum Zitat Hosmer D Jr (1973) A comparison of iterative maximum likelihood estimates of the parameters of a mixture of two normal distributions under three different types of sample. Biometrics 29(4):761–770CrossRef Hosmer D Jr (1973) A comparison of iterative maximum likelihood estimates of the parameters of a mixture of two normal distributions under three different types of sample. Biometrics 29(4):761–770CrossRef
Zurück zum Zitat Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218CrossRef Hubert L, Arabie P (1985) Comparing partitions. J Classif 2(1):193–218CrossRef
Zurück zum Zitat Ingrassia S, Minotti SC, Vittadini G (2012b) Local statistical modeling via the cluster-weighted approach with elliptical distributions. J Classif 29(3):363–401 Ingrassia S, Minotti SC, Vittadini G (2012b) Local statistical modeling via the cluster-weighted approach with elliptical distributions. J Classif 29(3):363–401
Zurück zum Zitat Karlis D, Santourian A (2009) Model-based clustering with non-elliptically contoured distributions. Stat Comput 19(1):73–83MathSciNetCrossRef Karlis D, Santourian A (2009) Model-based clustering with non-elliptically contoured distributions. Stat Comput 19(1):73–83MathSciNetCrossRef
Zurück zum Zitat Leisch F (2004) Flexmix: a general framework for finite mixture models and latent class regression in R. J Stat Softw 11(8):1–18 Leisch F (2004) Flexmix: a general framework for finite mixture models and latent class regression in R. J Stat Softw 11(8):1–18
Zurück zum Zitat Lindsay BG (1995) Mixture models: theory, geometry and applications. In: NSF-CBMS regional conference series in probability and statistics, vol. 5. Institute of Mathematical Statistics, Hayward Lindsay BG (1995) Mixture models: theory, geometry and applications. In: NSF-CBMS regional conference series in probability and statistics, vol. 5. Institute of Mathematical Statistics, Hayward
Zurück zum Zitat McLachlan GJ, Basford KE (1988) Mixture models: inference and applications to clustering. Marcel Dekker, New YorkMATH McLachlan GJ, Basford KE (1988) Mixture models: inference and applications to clustering. Marcel Dekker, New YorkMATH
Zurück zum Zitat McLachlan GJ, D Peel (2000b) Mixtures of factor analyzers. In: Proceedings of the seventh international conference on machine learning, pp 599–606. Morgan Kaufmann, San Francisco. McLachlan GJ, D Peel (2000b) Mixtures of factor analyzers. In: Proceedings of the seventh international conference on machine learning, pp 599–606. Morgan Kaufmann, San Francisco.
Zurück zum Zitat McNicholas PD, Jampani KR, McDaid AF, Murphy TB, Banks L (2011) PGMM: Parsimonious Gaussian Mixture Models. R package version 1.0. McNicholas PD, Jampani KR, McDaid AF, Murphy TB, Banks L (2011) PGMM: Parsimonious Gaussian Mixture Models. R package version 1.0.
Zurück zum Zitat McNicholas PD, Murphy TB (2010a) Model-based clustering of longitudinal data. Can J Stat 38(1):153–168MathSciNetMATH McNicholas PD, Murphy TB (2010a) Model-based clustering of longitudinal data. Can J Stat 38(1):153–168MathSciNetMATH
Zurück zum Zitat McNicholas PD, Murphy TB (2010b) Model-based clustering of microarray expression data via latent Gaussian mixture models. Bioinformatics 26(21):2705–2712CrossRef McNicholas PD, Murphy TB (2010b) Model-based clustering of microarray expression data via latent Gaussian mixture models. Bioinformatics 26(21):2705–2712CrossRef
Zurück zum Zitat McNicholas PD, Murphy TB, McDaid AF, Frost D (2010) Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models. Comput Stat Data Anal 54(3):711–723MathSciNetMATHCrossRef McNicholas PD, Murphy TB, McDaid AF, Frost D (2010) Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models. Comput Stat Data Anal 54(3):711–723MathSciNetMATHCrossRef
Zurück zum Zitat McNicholas PD, Subedi S (2012) Clustering gene expression time course data using mixtures of multivariate t-distributions. J Stat Plann Infer 142(5):1114–1127MathSciNetMATHCrossRef McNicholas PD, Subedi S (2012) Clustering gene expression time course data using mixtures of multivariate t-distributions. J Stat Plann Infer 142(5):1114–1127MathSciNetMATHCrossRef
Zurück zum Zitat Meng XL, Rubin DB (1993) Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika 80(2):267–278MathSciNetMATHCrossRef Meng XL, Rubin DB (1993) Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika 80(2):267–278MathSciNetMATHCrossRef
Zurück zum Zitat Meng XL, van Dyk D (1997) The EM algorithm: an old folk-song sung to a fast new tune. J Royal Stat Soc Ser B (Stat Methodol) 59(3):511–567MATHCrossRef Meng XL, van Dyk D (1997) The EM algorithm: an old folk-song sung to a fast new tune. J Royal Stat Soc Ser B (Stat Methodol) 59(3):511–567MATHCrossRef
Zurück zum Zitat Press WH, Teukolsky SA, Vetterling WT, Flannery BP (1992) Numerical recipes in C: the art of scientific computation, 2nd edn. Cambridge University Press, Cambridge Press WH, Teukolsky SA, Vetterling WT, Flannery BP (1992) Numerical recipes in C: the art of scientific computation, 2nd edn. Cambridge University Press, Cambridge
Zurück zum Zitat Rand W (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66(336):846–850CrossRef Rand W (1971) Objective criteria for the evaluation of clustering methods. J Am Stat Assoc 66(336):846–850CrossRef
Zurück zum Zitat Sakamoto Y, Ishiguro M, Kitagawa G (1983) Akaike information criterion statistics. Reidel, Boston Sakamoto Y, Ishiguro M, Kitagawa G (1983) Akaike information criterion statistics. Reidel, Boston
Zurück zum Zitat Schöner B (2000) Probabilistic characterization and synthesis of complex data driven systems. Ph. D. thesis, MIT Schöner B (2000) Probabilistic characterization and synthesis of complex data driven systems. Ph. D. thesis, MIT
Zurück zum Zitat R Development Core Team (2012) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna R Development Core Team (2012) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna
Zurück zum Zitat Spearman C (1904) The proof and measurement of association between two things. Am J Psychol 15(1):72–101CrossRef Spearman C (1904) The proof and measurement of association between two things. Am J Psychol 15(1):72–101CrossRef
Zurück zum Zitat Tipping TE, Bishop CM (1999) Mixtures of probabilistic principal component analysers. Neural Comput 11(2):443–482CrossRef Tipping TE, Bishop CM (1999) Mixtures of probabilistic principal component analysers. Neural Comput 11(2):443–482CrossRef
Zurück zum Zitat Titterington DM, Smith AFM, Makov UE (1985) Statistical analysis of finite mixture distributions. Wiley, New YorkMATH Titterington DM, Smith AFM, Makov UE (1985) Statistical analysis of finite mixture distributions. Wiley, New YorkMATH
Zurück zum Zitat Wang Q, Carvalho C, Lucas J, West M (2007) BFRM: Bayesian factor regression modelling. Bull Int Soc Bayesian Anal 14(2):4–5 Wang Q, Carvalho C, Lucas J, West M (2007) BFRM: Bayesian factor regression modelling. Bull Int Soc Bayesian Anal 14(2):4–5
Zurück zum Zitat West M (2003) Bayesian factor regression models in the “large $p$, small $n$” paradigm. In: Bernardo J, Bayarri M, Berger J, Dawid A, Heckerman D, Smith A, West M (eds) Bayesian statistics, vol 7. Oxford University Press, Oxford, pp 723–732 West M (2003) Bayesian factor regression models in the “large $p$, small $n$” paradigm. In: Bernardo J, Bayarri M, Berger J, Dawid A, Heckerman D, Smith A, West M (eds) Bayesian statistics, vol 7. Oxford University Press, Oxford, pp 723–732
Zurück zum Zitat Wolfe JH (1963) Object cluster analysis of social areas. Master’s thesis, University of California, Berkeley Wolfe JH (1963) Object cluster analysis of social areas. Master’s thesis, University of California, Berkeley
Zurück zum Zitat Wolfe JH (1970) Pattern clustering by multivariate mixture analysis. Multivariate Behav Res 5(3):329–350CrossRef Wolfe JH (1970) Pattern clustering by multivariate mixture analysis. Multivariate Behav Res 5(3):329–350CrossRef
Zurück zum Zitat Woodbury MA (1950) Inverting modified matrices. Statistical Research Group, Memo. Rep. no. 42. Princeton University, Princeton Woodbury MA (1950) Inverting modified matrices. Statistical Research Group, Memo. Rep. no. 42. Princeton University, Princeton
Metadaten
Titel
Clustering and classification via cluster-weighted factor analyzers
verfasst von
Sanjeena Subedi
Antonio Punzo
Salvatore Ingrassia
Paul D. McNicholas
Publikationsdatum
01.03.2013
Verlag
Springer-Verlag
Erschienen in
Advances in Data Analysis and Classification / Ausgabe 1/2013
Print ISSN: 1862-5347
Elektronische ISSN: 1862-5355
DOI
https://doi.org/10.1007/s11634-013-0124-8

Weitere Artikel der Ausgabe 1/2013

Advances in Data Analysis and Classification 1/2013 Zur Ausgabe

Editorial

Editorial

Premium Partner