Skip to main content
Erschienen in: Advances in Data Analysis and Classification 3/2013

01.09.2013 | Regular Article

Dimension reduction for model-based clustering via mixtures of multivariate \(t\)-distributions

verfasst von: Katherine Morris, Paul D. McNicholas, Luca Scrucca

Erschienen in: Advances in Data Analysis and Classification | Ausgabe 3/2013

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

We introduce a dimension reduction method for model-based clustering obtained from a finite mixture of \(t\)-distributions. This approach is based on existing work on reducing dimensionality in the case of finite Gaussian mixtures. The method relies on identifying a reduced subspace of the data by considering the extent to which group means and group covariances vary. This subspace contains linear combinations of the original data, which are ordered by importance via the associated eigenvalues. Observations can be projected onto the subspace and the resulting set of variables captures most of the clustering structure available in the data. The approach is illustrated using simulated and real data, where it outperforms its Gaussian analogue.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
Zurück zum Zitat Andrews JL, McNicholas PD (2011a) Extending mixtures of multivariate \(t\)-factor analyzers. Stat Comput 21(3):361–373MathSciNetCrossRef Andrews JL, McNicholas PD (2011a) Extending mixtures of multivariate \(t\)-factor analyzers. Stat Comput 21(3):361–373MathSciNetCrossRef
Zurück zum Zitat Andrews JL, McNicholas PD (2011b) Mixtures of modified \(t\)-factor analyzers for model-based clustering, classification, and discriminant analysis. J Stat Plan Inference 141(4):1479–1486MathSciNetMATHCrossRef Andrews JL, McNicholas PD (2011b) Mixtures of modified \(t\)-factor analyzers for model-based clustering, classification, and discriminant analysis. J Stat Plan Inference 141(4):1479–1486MathSciNetMATHCrossRef
Zurück zum Zitat Andrews JL, McNicholas PD (2012a) Model-based clustering, classification, and discriminant analysis via mixtures of multivariate \(t\)-distributions: the \(t\)EIGEN family. Stat Comput 22(5):1021–1029MathSciNetMATHCrossRef Andrews JL, McNicholas PD (2012a) Model-based clustering, classification, and discriminant analysis via mixtures of multivariate \(t\)-distributions: the \(t\)EIGEN family. Stat Comput 22(5):1021–1029MathSciNetMATHCrossRef
Zurück zum Zitat Andrews JL, McNicholas PD (2012b) teigen: model-based clustering and classification with the multivariate t-distribution. R package version 1.0 Andrews JL, McNicholas PD (2012b) teigen: model-based clustering and classification with the multivariate t-distribution. R package version 1.0
Zurück zum Zitat Andrews JL, McNicholas PD, Subedi S (2011) Model-based classification via mixtures of multivariate \(t\)-distributions. Comput Stat Data Anal 55(1):520–529 Andrews JL, McNicholas PD, Subedi S (2011) Model-based classification via mixtures of multivariate \(t\)-distributions. Comput Stat Data Anal 55(1):520–529
Zurück zum Zitat Baek J, McLachlan GJ (2011) Mixtures of common t-factor analyzers for clustering high-dimensional microarray data. Bioinformatics 27:1269–1276CrossRef Baek J, McLachlan GJ (2011) Mixtures of common t-factor analyzers for clustering high-dimensional microarray data. Bioinformatics 27:1269–1276CrossRef
Zurück zum Zitat Banfield JD, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49(3): 803–821 Banfield JD, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49(3): 803–821
Zurück zum Zitat Boulesteix AL, Lambert-Lacroix S, Peyre J, Strimmer K (2011) plsgenomics: PLS analyses for genomics. R package version 1.2-6 Boulesteix AL, Lambert-Lacroix S, Peyre J, Strimmer K (2011) plsgenomics: PLS analyses for genomics. R package version 1.2-6
Zurück zum Zitat Bouveyron C, Brunet C (2012) Simultaneous model-based clustering and visualization in the Fisher discriminative subspace. Stat Comput 22(1):301–324MathSciNetCrossRef Bouveyron C, Brunet C (2012) Simultaneous model-based clustering and visualization in the Fisher discriminative subspace. Stat Comput 22(1):301–324MathSciNetCrossRef
Zurück zum Zitat Campbell NA, Mahon RJ (1974) A multivariate study of variation in two species of rock crab of genus leptograpsus. Aust J Zoo l 22:417–425CrossRef Campbell NA, Mahon RJ (1974) A multivariate study of variation in two species of rock crab of genus leptograpsus. Aust J Zoo l 22:417–425CrossRef
Zurück zum Zitat Celeux G, Govaert G (1995) Gaussian parsimonious clustering models. Pattern Recognit 28:781–793CrossRef Celeux G, Govaert G (1995) Gaussian parsimonious clustering models. Pattern Recognit 28:781–793CrossRef
Zurück zum Zitat Dean N, Raftery AE (2009) clustvarsel: Variable selection for model-based clustering. R package version 1.3 Dean N, Raftery AE (2009) clustvarsel: Variable selection for model-based clustering. R package version 1.3
Zurück zum Zitat Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J Royal Stat Soc 39(1):1–38MathSciNetMATH Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm. J Royal Stat Soc 39(1):1–38MathSciNetMATH
Zurück zum Zitat Forina M, Armanino C, Castino M, Ubigli M (1986) Multivariate data analysis as a discriminating method of the origin of wines. Vitis 25:189–201 Forina M, Armanino C, Castino M, Ubigli M (1986) Multivariate data analysis as a discriminating method of the origin of wines. Vitis 25:189–201
Zurück zum Zitat Fraley C, Raftery AE (1999) MCLUST: software for model-based cluster analysis. J Classif 16:297–306MATHCrossRef Fraley C, Raftery AE (1999) MCLUST: software for model-based cluster analysis. J Classif 16:297–306MATHCrossRef
Zurück zum Zitat Franczak B, Browne RP, McNicholas PD (2012) Mixtures of shifted asymmetric Laplace distributions. Arxiv, preprint arXiv:1207.1727v3 Franczak B, Browne RP, McNicholas PD (2012) Mixtures of shifted asymmetric Laplace distributions. Arxiv, preprint arXiv:1207.1727v3
Zurück zum Zitat Greselin F, Ingrassia S (2010a) Constrained monotone EM algorithms for mixtures of multivariate \(t\)-distributions. Stat Comput 20(1):9–22 Greselin F, Ingrassia S (2010a) Constrained monotone EM algorithms for mixtures of multivariate \(t\)-distributions. Stat Comput 20(1):9–22
Zurück zum Zitat Greselin F, Ingrassia S (2010b) Weakly homoscedastic constraints for mixtures of \(t\)-distributions. In: Fink A, Lausen B, Seidel W, Ultsch A (eds) Advances in Data Analysis, Data Handling and Business Intelligence. Studies in Classification, Data Analysis, and Knowledge Organization, Springer, Berlin Greselin F, Ingrassia S (2010b) Weakly homoscedastic constraints for mixtures of \(t\)-distributions. In: Fink A, Lausen B, Seidel W, Ultsch A (eds) Advances in Data Analysis, Data Handling and Business Intelligence. Studies in Classification, Data Analysis, and Knowledge Organization, Springer, Berlin
Zurück zum Zitat Hubert L, Arabie P (1985) Comparing partitions. J Classifi 2:193–218CrossRef Hubert L, Arabie P (1985) Comparing partitions. J Classifi 2:193–218CrossRef
Zurück zum Zitat Hubert M, Rousseeuw PJ, Vanden Branden K (2005) ROBPCA: a new approach to robust principal components analysis. Technometrics 47:64–79 Hubert M, Rousseeuw PJ, Vanden Branden K (2005) ROBPCA: a new approach to robust principal components analysis. Technometrics 47:64–79
Zurück zum Zitat Karlis D, Santourian A (2009) Model-based clustering with non-elliptically contoured distributions. Stat Comput 19:73–83MathSciNetCrossRef Karlis D, Santourian A (2009) Model-based clustering with non-elliptically contoured distributions. Stat Comput 19:73–83MathSciNetCrossRef
Zurück zum Zitat Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu CR, Peterson C, Meltzer PS (2001) Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med 7:673–679 Khan J, Wei JS, Ringner M, Saal LH, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu CR, Peterson C, Meltzer PS (2001) Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med 7:673–679
Zurück zum Zitat Lee SX, McLachlan GJ (2013) On mixtures of skew normal and skew t-distributions. Arxiv, preprint arXiv:1211.3602v3 Lee SX, McLachlan GJ (2013) On mixtures of skew normal and skew t-distributions. Arxiv, preprint arXiv:1211.3602v3
Zurück zum Zitat Li KC (1991) Sliced inverse regression for dimension reduction (with discussion). J Am Stat Assoc 86: 316–342 Li KC (1991) Sliced inverse regression for dimension reduction (with discussion). J Am Stat Assoc 86: 316–342
Zurück zum Zitat Loader C (2012) locfit: Local Regression, Likelihood and Density Estimation. R package version 1.5-8 Loader C (2012) locfit: Local Regression, Likelihood and Density Estimation. R package version 1.5-8
Zurück zum Zitat Maugis C, Celeux G, Martin-Magniette ML (2009) Variable selection for clustering with Gaussian mixture models. Biometrics 65:701–709MathSciNetMATHCrossRef Maugis C, Celeux G, Martin-Magniette ML (2009) Variable selection for clustering with Gaussian mixture models. Biometrics 65:701–709MathSciNetMATHCrossRef
Zurück zum Zitat McLachlan GJ, Krishnan T (2008) The EM algorithm and extensions, 2nd edn. Wiley, New YorkMATHCrossRef McLachlan GJ, Krishnan T (2008) The EM algorithm and extensions, 2nd edn. Wiley, New YorkMATHCrossRef
Zurück zum Zitat McLachlan GJ, Bean RW, Jones LT (2007) Extension of the mixture of factor analyzers model to incorporate the multivariate \(t\)-distribution. Comput Stat Data Anal 51(11):5327–5338MATHCrossRef McLachlan GJ, Bean RW, Jones LT (2007) Extension of the mixture of factor analyzers model to incorporate the multivariate \(t\)-distribution. Comput Stat Data Anal 51(11):5327–5338MATHCrossRef
Zurück zum Zitat McNicholas PD (2013) Model-based clustering and classification via mixtures of multivariate t-distributions. In: Giudici P, Ingrassia S, Vichi M (eds) Statistical models for data analysis, studies in classification, data analysis, and knowledge organization. Springer International Publishing, Switzerland McNicholas PD (2013) Model-based clustering and classification via mixtures of multivariate t-distributions. In: Giudici P, Ingrassia S, Vichi M (eds) Statistical models for data analysis, studies in classification, data analysis, and knowledge organization. Springer International Publishing, Switzerland
Zurück zum Zitat McNicholas PD, Murphy TB (2010) Model-based clustering of microarray expression data via latent Gaussian mixture models. Bioinformatics 26(21):2705–2712CrossRef McNicholas PD, Murphy TB (2010) Model-based clustering of microarray expression data via latent Gaussian mixture models. Bioinformatics 26(21):2705–2712CrossRef
Zurück zum Zitat McNicholas PD, Subedi S (2012) Clustering gene expression time course data using mixtures of multivariate t-distributions. J Stat Plan Inference 142(5):1114–1127MathSciNetMATHCrossRef McNicholas PD, Subedi S (2012) Clustering gene expression time course data using mixtures of multivariate t-distributions. J Stat Plan Inference 142(5):1114–1127MathSciNetMATHCrossRef
Zurück zum Zitat Peel D, McLachlan GJ (2000) Robust mixture modelling using the \(t\)-distribution. Stat Comput 10:339–348CrossRef Peel D, McLachlan GJ (2000) Robust mixture modelling using the \(t\)-distribution. Stat Comput 10:339–348CrossRef
Zurück zum Zitat Qiu WL, Joe H (2006) Generation of random clusters with specified degree of separation. J Classifi 23(2):315–334MathSciNetCrossRef Qiu WL, Joe H (2006) Generation of random clusters with specified degree of separation. J Classifi 23(2):315–334MathSciNetCrossRef
Zurück zum Zitat Raftery AE, Dean N (2006) Variable selection for model-based clustering. J Am Stat Assoc 101(473): 168–178 Raftery AE, Dean N (2006) Variable selection for model-based clustering. J Am Stat Assoc 101(473): 168–178
Zurück zum Zitat Reaven GM, Miller RG (1979) An attempt to define the nature of chemical diabetes using a multidimensional analysis. Diabetologia 16:17–24CrossRef Reaven GM, Miller RG (1979) An attempt to define the nature of chemical diabetes using a multidimensional analysis. Diabetologia 16:17–24CrossRef
Zurück zum Zitat Steane MA, McNicholas PD, Yada R (2012) Model-based classification via mixtures of multivariate t-factor analyzers. Commun Stat Simul Comput 41(4):510–523 Steane MA, McNicholas PD, Yada R (2012) Model-based classification via mixtures of multivariate t-factor analyzers. Commun Stat Simul Comput 41(4):510–523
Zurück zum Zitat Tibshirani R, Hastie T, Narasimhan B, Chu G (2002) Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Nat Acad Sci USA 99(10):6567–6572CrossRef Tibshirani R, Hastie T, Narasimhan B, Chu G (2002) Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Nat Acad Sci USA 99(10):6567–6572CrossRef
Zurück zum Zitat Vrbik I, McNicholas PD (2012) Analytic calculations for the EM algorithm for multivariate skew-mixture models. Stat Prob Lett 82(6):1169–1174MathSciNetMATHCrossRef Vrbik I, McNicholas PD (2012) Analytic calculations for the EM algorithm for multivariate skew-mixture models. Stat Prob Lett 82(6):1169–1174MathSciNetMATHCrossRef
Metadaten
Titel
Dimension reduction for model-based clustering via mixtures of multivariate -distributions
verfasst von
Katherine Morris
Paul D. McNicholas
Luca Scrucca
Publikationsdatum
01.09.2013
Verlag
Springer Berlin Heidelberg
Erschienen in
Advances in Data Analysis and Classification / Ausgabe 3/2013
Print ISSN: 1862-5347
Elektronische ISSN: 1862-5355
DOI
https://doi.org/10.1007/s11634-013-0137-3

Weitere Artikel der Ausgabe 3/2013

Advances in Data Analysis and Classification 3/2013 Zur Ausgabe

Premium Partner