2016 | OriginalPaper | Book chapter

Genetic Algorithms for Subset Selection in Model-Based Clustering

Written by: Luca Scrucca

Published in: Unsupervised Learning Algorithms

Publisher: Springer International Publishing

Abstract

Model-based clustering assumes that the data observed can be represented by a finite mixture model, where each cluster is represented by a parametric distribution. The Gaussian distribution is often employed in the multivariate continuous case. The identification of the subset of relevant clustering variables enables a parsimonious number of unknown parameters to be achieved, thus yielding a more efficient estimate, a clearer interpretation and often improved clustering partitions. This paper discusses variable or feature selection for model-based clustering. Following the approach of Raftery and Dean (J Am Stat Assoc 101(473):168–178, 2006), the problem of subset selection is recast as a model comparison problem, and BIC is used to approximate Bayes factors. The criterion proposed is based on the BIC difference between a candidate clustering model for the given subset and a model which assumes no clustering for the same subset. Thus, the problem amounts to finding the feature subset which maximises such a criterion. A search over the potentially vast solution space is performed using genetic algorithms, which are stochastic search algorithms that use techniques and concepts inspired by evolutionary biology and natural selection. Numerical experiments using real data applications are presented and discussed.
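
The search described in the abstract can be illustrated with a minimal sketch, which is not the chapter's implementation. It runs a simple genetic algorithm over binary masks, where each bit marks whether a feature is in the subset. The function name `ga_subset_search`, its parameters, the operators chosen (tournament selection, one-point crossover, bit-flip mutation, elitism) and the `toy_fitness` function are all assumptions made for illustration. In particular, `toy_fitness` is only a stand-in for the BIC difference criterion, which would require fitting mixture models.

```python
import random

def ga_subset_search(fitness, n_features, pop_size=30, generations=40,
                     p_crossover=0.8, p_mutation=0.05, seed=0):
    """Maximise `fitness` over non-empty binary masks of length n_features.

    Hypothetical helper: a plain generational GA, not the chapter's code.
    """
    rng = random.Random(seed)

    def repair(mask):
        # An empty subset cannot be clustered, so force at least one feature.
        if not any(mask):
            mask[rng.randrange(n_features)] = 1
        return mask

    def tournament(pop, scores):
        # Binary tournament: pick two distinct individuals, keep the fitter.
        a, b = rng.sample(range(len(pop)), 2)
        return pop[a] if scores[a] >= scores[b] else pop[b]

    population = [repair([rng.randint(0, 1) for _ in range(n_features)])
                  for _ in range(pop_size)]
    best_mask, best_score = None, float("-inf")

    for _ in range(generations):
        scores = [fitness(m) for m in population]
        for m, s in zip(population, scores):
            if s > best_score:
                best_mask, best_score = m[:], s

        # Elitism: the best mask seen so far always survives.
        next_pop = [best_mask[:]]
        while len(next_pop) < pop_size:
            p1 = tournament(population, scores)
            p2 = tournament(population, scores)
            child = p1[:]
            if rng.random() < p_crossover and n_features > 1:
                cut = rng.randrange(1, n_features)   # one-point crossover
                child = p1[:cut] + p2[cut:]
            # Bit-flip mutation, applied independently to each position.
            child = [1 - g if rng.random() < p_mutation else g for g in child]
            next_pop.append(repair(child))
        population = next_pop

    return best_mask, best_score


# Toy stand-in for the BIC difference (assumption): reward a hypothetical
# set of "relevant" features and penalise every irrelevant one.
RELEVANT = {0, 2, 5}

def toy_fitness(mask):
    chosen = {j for j, bit in enumerate(mask) if bit}
    return 10 * len(chosen & RELEVANT) - 3 * len(chosen - RELEVANT)

mask, score = ga_subset_search(toy_fitness, n_features=8)
print(mask, score)
```

To apply the idea to real data, `toy_fitness` would be replaced by a function that fits a clustering mixture model and a single-component model on the selected columns and returns the difference of their BIC values. That step depends on the mixture-modelling library used and is left out here.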

References
1.
Back, T., Fogel, D.B., Michalewicz, Z.: Evolutionary Computation 1: Basic Algorithms and Operators. IOP Publishing, Bristol and Philadelphia (2000)
3.
Bean, J.C.: Genetic algorithms and random keys for sequencing and optimization. ORSA J. Comput. 6(2), 154–160 (1994)
4.
Biernacki, C., Celeux, G., Govaert, G.: Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans. Pattern Anal. Mach. Intell. 22(7), 719–725 (2000)
5.
Celeux, G., Govaert, G.: Gaussian parsimonious clustering models. Pattern Recogn. 28, 781–793 (1995)
6.
Chang, W.C.: On using principal components before separating a mixture of two multivariate normal distributions. Appl. Stat. 32(3), 267–275 (1983)
7.
Chatterjee, S., Laudato, M., Lynch, L.A.: Genetic algorithms and their statistical applications: an introduction. Comput. Stat. Data Anal. 22, 633–651 (1996)
8.
11.
Dempster, A.P., Laird, N.M., Rubin, D.B.: Maximum likelihood from incomplete data via the EM algorithm (with discussion). J. R. Stat. Soc. Ser. B Stat. Methodol. 39, 1–38 (1977)
13.
Fraley, C., Raftery, A.E.: How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput. J. 41, 578–588 (1998)
14.
Fraley, C., Raftery, A.E.: Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc. 97(458), 611–631 (2002)
15.
Fraley, C., Raftery, A.E., Murphy, T.B., Scrucca, L.: MCLUST version 4 for R: Normal mixture modeling for model-based clustering, classification, and density estimation. Technical Report 597, Department of Statistics, University of Washington (2012)
16.
Goldberg, D.: Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley Professional, Boston, MA (1989)
17.
Haupt, R.L., Haupt, S.E.: Practical Genetic Algorithms, 2nd edn. Wiley, New York (2004)
18.
19.
21.
Keribin, C.: Consistent estimation of the order of mixture models. Sankhya Ser. A 62(1), 49–66 (2000)
22.
Law, M.H.C., Figueiredo, M.A.T., Jain, A.K.: Simultaneous feature selection and clustering using mixture models. IEEE Trans. Pattern Anal. Mach. Intell. 26(9), 1154–1166 (2004)
23.
Maugis, C., Celeux, G., Martin-Magniette, M.L.: Variable selection for clustering with Gaussian mixture models. Biometrics 65(3), 701–709 (2009)
24.
Maugis, C., Celeux, G., Martin-Magniette, M.L.: Variable selection in model-based clustering: a general variable role modeling. Comput. Stat. Data Anal. 53(11), 3872–3882 (2009)
25.
McLachlan, G.J., Krishnan, T.: The EM Algorithm and Extensions, 2nd edn. Wiley, Hoboken, NJ (2008)
28.
Neath, A.A., Cavanaugh, J.E.: The Bayesian information criterion: background, derivation, and applications. Wiley Interdiscip. Rev. Comput. Stat. 4(2), 199–203 (2012). doi:10.1002/wics.199
30.
31.
Roeder, K., Wasserman, L.: Practical Bayesian density estimation using mixtures of normals. J. Am. Stat. Assoc. 92(439), 894–902 (1997)
32.
Schwarz, G.: Estimating the dimension of a model. Ann. Stat. 6(2), 461–464 (1978)
35.
Scrucca, L.: Graphical tools for model-based mixture discriminant analysis. Adv. Data Anal. Classif. 8(2), 147–165 (2014)
37.
Ševčíková, H.: Statistical simulations on parallel computers. J. Comput. Graph. Stat. 13(4), 886–906 (2004)
38.
Winker, P., Gilli, M.: Applications of optimization heuristics to estimation and modelling problems. Comput. Stat. Data Anal. 47(2), 211–223 (2004)
Metadata
Title
Genetic Algorithms for Subset Selection in Model-Based Clustering
Written by
Luca Scrucca
Copyright year
2016
DOI
https://doi.org/10.1007/978-3-319-24211-8_3
