Skip to main content
Erschienen in: Advances in Data Analysis and Classification 3/2013

01.09.2013 | Regular Article

Model-based clustering of high-dimensional data streams with online mixture of probabilistic PCA

verfasst von: Anastasios Bellas, Charles Bouveyron, Marie Cottrell, Jérôme Lacaille

Erschienen in: Advances in Data Analysis and Classification | Ausgabe 3/2013

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

Model-based clustering is a popular tool which is renowned for its probabilistic foundations and its flexibility. However, model-based clustering techniques usually perform poorly when dealing with high-dimensional data streams, which are nowadays a frequent data type. To overcome this limitation of model-based clustering, we propose an online inference algorithm for the mixture of probabilistic PCA model. The proposed algorithm relies on an EM-based procedure and on a probabilistic and incremental version of PCA. Model selection is also considered in the online setting through parallel computing. Numerical experiments on simulated and real data demonstrate the effectiveness of our approach and compare it to state-of-the-art online EM-based algorithms.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
Zurück zum Zitat Aggarwal C, Han J, Wang J, Yu P (2004) A framework for projected clustering of high dimensional data streams. In: Proceedings of the 30th International Conference on very large data bases, vol. 30. VLDB Endowment, pp 852–863 Aggarwal C, Han J, Wang J, Yu P (2004) A framework for projected clustering of high dimensional data streams. In: Proceedings of the 30th International Conference on very large data bases, vol. 30. VLDB Endowment, pp 852–863
Zurück zum Zitat Arandjelović O, Cipolla R (2005) Incremental learning of temporally-coherent Gaussian mixture models. In: Proceedings of the British Machine Vision Conference. Oxford, UK, pp 759–768 Arandjelović O, Cipolla R (2005) Incremental learning of temporally-coherent Gaussian mixture models. In: Proceedings of the British Machine Vision Conference. Oxford, UK, pp 759–768
Zurück zum Zitat Babcock B, Datar M, Motwani R, O’Callaghan L (2003) Maintaining variance and k-medians over data stream windows. In: Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART Symposium on principles of database systems. ACM, pp 234–243 Babcock B, Datar M, Motwani R, O’Callaghan L (2003) Maintaining variance and k-medians over data stream windows. In: Proceedings of the 22nd ACM SIGMOD-SIGACT-SIGART Symposium on principles of database systems. ACM, pp 234–243
Zurück zum Zitat Baek J, McLachlan G, Flack L (2010) Mixtures of factor analyzers with common factor loadings: Applications to the clustering and visualization of high-dimensional data. Pattern Anal Mach Intell IEEE Trans 32(7):1298–1309CrossRef Baek J, McLachlan G, Flack L (2010) Mixtures of factor analyzers with common factor loadings: Applications to the clustering and visualization of high-dimensional data. Pattern Anal Mach Intell IEEE Trans 32(7):1298–1309CrossRef
Zurück zum Zitat Bartholomew D, Knott M, Moustaki I (2011) Latent variable models and factor analysis: a unified approach, vol 899. Wiley, New YorkCrossRef Bartholomew D, Knott M, Moustaki I (2011) Latent variable models and factor analysis: a unified approach, vol 899. Wiley, New YorkCrossRef
Zurück zum Zitat Basilevsky A (2009) Statistical factor analysis and related methods: theory and applications, vol 418. Wiley-Interscience, New York Basilevsky A (2009) Statistical factor analysis and related methods: theory and applications, vol 418. Wiley-Interscience, New York
Zurück zum Zitat Biernacki C, Celeux G, Govaert G (2000) Assessing a mixture model for clustering with the integrated completed likelihood. Pattern Anal Mach Intell IEEE Trans 22(7):719–725CrossRef Biernacki C, Celeux G, Govaert G (2000) Assessing a mixture model for clustering with the integrated completed likelihood. Pattern Anal Mach Intell IEEE Trans 22(7):719–725CrossRef
Zurück zum Zitat Bouveyron C, Brunet C (2012) Simultaneous model-based clustering and visualization in the fisher discriminative subspace. Stat Comput 22(1):301–324MathSciNetCrossRef Bouveyron C, Brunet C (2012) Simultaneous model-based clustering and visualization in the fisher discriminative subspace. Stat Comput 22(1):301–324MathSciNetCrossRef
Zurück zum Zitat Bouveyron C, Girard S, Schmid C (2007b) High-dimensional discriminant analysis. Commun Stat Theory Methods 36(14):2607–2623MathSciNetMATHCrossRef Bouveyron C, Girard S, Schmid C (2007b) High-dimensional discriminant analysis. Commun Stat Theory Methods 36(14):2607–2623MathSciNetMATHCrossRef
Zurück zum Zitat Celeux G, Govaert G (1992) A classification em algorithm for clustering and two stochastic versions. Comput Stat Data Anal 14(3):315–332MathSciNetMATHCrossRef Celeux G, Govaert G (1992) A classification em algorithm for clustering and two stochastic versions. Comput Stat Data Anal 14(3):315–332MathSciNetMATHCrossRef
Zurück zum Zitat Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B (Methodol) 39(1):1–38. doi:10.2307/2984875 Dempster A, Laird N, Rubin D (1977) Maximum likelihood from incomplete data via the EM algorithm. J R Stat Soc Ser B (Methodol) 39(1):1–38. doi:10.​2307/​2984875
Zurück zum Zitat Domingos P, Hulten G (2001) A general method for scaling up machine learning algorithms and its application to clustering. In: Proceedings of the 18th International Conference on Machine Learning, pp 106–113 Domingos P, Hulten G (2001) A general method for scaling up machine learning algorithms and its application to clustering. In: Proceedings of the 18th International Conference on Machine Learning, pp 106–113
Zurück zum Zitat Duda R, Har, P, Stork D (1995) Pattern classification and scene analysis, 2nd edn Duda R, Har, P, Stork D (1995) Pattern classification and scene analysis, 2nd edn
Zurück zum Zitat Figueiredo M, Jain A (2002) Unsupervised learning of finite mixture models. Pattern Anal Mach Intell IEEE Trans 24(3):381–396CrossRef Figueiredo M, Jain A (2002) Unsupervised learning of finite mixture models. Pattern Anal Mach Intell IEEE Trans 24(3):381–396CrossRef
Zurück zum Zitat Fraley C, Raftery A (2002) Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 97(458):611–631MathSciNetMATHCrossRef Fraley C, Raftery A (2002) Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc 97(458):611–631MathSciNetMATHCrossRef
Zurück zum Zitat Gaber M, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26CrossRef Gaber M, Zaslavsky A, Krishnaswamy S (2005) Mining data streams: a review. ACM Sigmod Record 34(2):18–26CrossRef
Zurück zum Zitat Ghahramani Z, Hinton G et al (1996) The em algorithm for mixtures of factor analyzers. Tech. rep., Technical Report CRG-TR-96-1, University of Toronto Ghahramani Z, Hinton G et al (1996) The em algorithm for mixtures of factor analyzers. Tech. rep., Technical Report CRG-TR-96-1, University of Toronto
Zurück zum Zitat Guha S, Mishra N, Motwani R, O’Callaghan L (2000) Clustering data streams. In: Foundations of Computer Science, 2000. In: Proceedings of 41st Annual Symposium on IEEE, pp 359–366 Guha S, Mishra N, Motwani R, O’Callaghan L (2000) Clustering data streams. In: Foundations of Computer Science, 2000. In: Proceedings of 41st Annual Symposium on IEEE, pp 359–366
Zurück zum Zitat Hall P, Hicks Y, Robinson T (2005) A method to add gaussian mixture models. Technical report, University of Bath Hall P, Hicks Y, Robinson T (2005) A method to add gaussian mixture models. Technical report, University of Bath
Zurück zum Zitat Hall P, Marshall D, Martin R (1998) Incremental eigenanalysis for classification. In: British Machine Vision Conference, vol 1. Citeseer, pp 286–295 Hall P, Marshall D, Martin R (1998) Incremental eigenanalysis for classification. In: British Machine Vision Conference, vol 1. Citeseer, pp 286–295
Zurück zum Zitat Jacques J, Bouveyron C, Girard S, Devos O, Duponchel L, Ruckebusch C (2010) Gaussian mixture models for the classification of high-dimensional vibrational spectroscopy data. J Chemom 24(11–12):719–727CrossRef Jacques J, Bouveyron C, Girard S, Devos O, Duponchel L, Ruckebusch C (2010) Gaussian mixture models for the classification of high-dimensional vibrational spectroscopy data. J Chemom 24(11–12):719–727CrossRef
Zurück zum Zitat Lindsay B (1995) Mixture models: theory, geometry and applications. In: JSTOR NSF-CBMS Regional Conference Series in probability and statistics. Lindsay B (1995) Mixture models: theory, geometry and applications. In: JSTOR NSF-CBMS Regional Conference Series in probability and statistics.
Zurück zum Zitat MacQueen J et al (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on mathematical statistics and probability, vol. 1. California, USA, p 14 MacQueen J et al (1967) Some methods for classification and analysis of multivariate observations. In: Proceedings of the 5th Berkeley Symposium on mathematical statistics and probability, vol. 1. California, USA, p 14
Zurück zum Zitat McLachlan G, Krishnan T (1997) The em algorithm and extensions. Wiley-Interscience, New York McLachlan G, Krishnan T (1997) The em algorithm and extensions. Wiley-Interscience, New York
Zurück zum Zitat McLachlan G, Peel D (2000) Finite mixture models, vol 299. Wiley-Interscience, New YorkMATHCrossRef McLachlan G, Peel D (2000) Finite mixture models, vol 299. Wiley-Interscience, New YorkMATHCrossRef
Zurück zum Zitat McLachlan G, Peel D, Bean R (2003) Modelling high-dimensional data by mixtures of factor analyzers. Comput Stat Data Anal 41(3):379–388MathSciNetMATHCrossRef McLachlan G, Peel D, Bean R (2003) Modelling high-dimensional data by mixtures of factor analyzers. Comput Stat Data Anal 41(3):379–388MathSciNetMATHCrossRef
Zurück zum Zitat McNicholas P, Murphy T, McDaid A, Frost D (2010) Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models. Comput Stat Data Anal 54(3):711–723MathSciNetMATHCrossRef McNicholas P, Murphy T, McDaid A, Frost D (2010) Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models. Comput Stat Data Anal 54(3):711–723MathSciNetMATHCrossRef
Zurück zum Zitat Neal R, Hinton G (1998) A view of the EM algorithm that justifies incremental, sparse, and other variants. Learn Graph Models 89:355–368CrossRef Neal R, Hinton G (1998) A view of the EM algorithm that justifies incremental, sparse, and other variants. Learn Graph Models 89:355–368CrossRef
Zurück zum Zitat O’callaghan L, Mishra N, Meyerson A, Guha S, Motwani R (2002) Streaming-data algorithms for high-quality clustering. In: Proceedings of 18th International Conference on Data Engineering, pp 685–694 O’callaghan L, Mishra N, Meyerson A, Guha S, Motwani R (2002) Streaming-data algorithms for high-quality clustering. In: Proceedings of 18th International Conference on Data Engineering, pp 685–694
Zurück zum Zitat Spearman C (1904) The proof and measurement of association between two things. Am J Psychol 15(1):72–101 Spearman C (1904) The proof and measurement of association between two things. Am J Psychol 15(1):72–101
Zurück zum Zitat Tipping M, Bishop C (1999) Mixtures of probabilistic principal component analyzers. Neural Comput 11(2):443–482CrossRef Tipping M, Bishop C (1999) Mixtures of probabilistic principal component analyzers. Neural Comput 11(2):443–482CrossRef
Zurück zum Zitat Titterington D (1984) Recursive parameter estimation using incomplete data. J R Stat Soc Ser B (Methodol) 46(2):257–267MathSciNetMATH Titterington D (1984) Recursive parameter estimation using incomplete data. J R Stat Soc Ser B (Methodol) 46(2):257–267MathSciNetMATH
Zurück zum Zitat Ueda N, Nakano R, Ghahramani Z, Hinton G (2000) Smem algorithm for mixture models. Neural Comput 12(9):2109–2128CrossRef Ueda N, Nakano R, Ghahramani Z, Hinton G (2000) Smem algorithm for mixture models. Neural Comput 12(9):2109–2128CrossRef
Zurück zum Zitat Wang WL, Lin TI (2013) An efficient ecm algorithm for maximum likelihood estimation in mixtures of t-factor analyzers. Comput Stat 28(2):751–759CrossRef Wang WL, Lin TI (2013) An efficient ecm algorithm for maximum likelihood estimation in mixtures of t-factor analyzers. Comput Stat 28(2):751–759CrossRef
Zurück zum Zitat Wu C (1983) On the convergence properties of the em algorithm. Ann Stat 11(1):95–103MATHCrossRef Wu C (1983) On the convergence properties of the em algorithm. Ann Stat 11(1):95–103MATHCrossRef
Zurück zum Zitat Zhao JH, Yu PL (2008) Fast ml estimation for the mixture of factor analyzers via an ecm algorithm. Neural Netw IEEE Trans 19(11):1956–1961CrossRef Zhao JH, Yu PL (2008) Fast ml estimation for the mixture of factor analyzers via an ecm algorithm. Neural Netw IEEE Trans 19(11):1956–1961CrossRef
Metadaten
Titel
Model-based clustering of high-dimensional data streams with online mixture of probabilistic PCA
verfasst von
Anastasios Bellas
Charles Bouveyron
Marie Cottrell
Jérôme Lacaille
Publikationsdatum
01.09.2013
Verlag
Springer Berlin Heidelberg
Erschienen in
Advances in Data Analysis and Classification / Ausgabe 3/2013
Print ISSN: 1862-5347
Elektronische ISSN: 1862-5355
DOI
https://doi.org/10.1007/s11634-013-0133-7

Weitere Artikel der Ausgabe 3/2013

Advances in Data Analysis and Classification 3/2013 Zur Ausgabe

Premium Partner