2016 | OriginalPaper | Book chapter

Application of Mixture Models to Large Datasets

Authors: Sharon X. Lee, Geoffrey McLachlan, Saumyadipta Pyne

Published in: Big Data Analytics

Publisher: Springer India

Abstract

Mixture distributions are commonly applied to modelling and to discriminant and cluster analyses in a wide variety of situations. We first consider normal and t-mixture models. As these models are highly parameterized, we review methods that enable them to be fitted to large datasets involving many observations and variables. Attention is then given to extensions of these mixture models to mixtures of skew normal and skew t-distributions for the segmentation of data into clusters of non-elliptical shape. The focus is then on the latter models in conjunction with the JCM (joint clustering and matching) procedure, which provides an automated approach to clustering the cells in a flow cytometry sample in which a large number of cells and their associated markers have been measured. For a class of multiple samples, we consider the use of JCM for matching the sample-specific clusters across the samples in the class and for improving the clustering of each individual sample. The supervised classification of a sample is also considered in the case where there are different classes of samples corresponding, for example, to different outcomes or treatment strategies for patients undergoing medical screening or treatment.
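To make the starting point of the abstract concrete, the following is a minimal sketch of fitting a two-component univariate normal mixture and using it for clustering. It assumes the standard EM algorithm, which the abstract itself does not name, and it is not the authors' implementation. The function names, the half-split initialisation, the fixed iteration count and the variance floor are all illustrative assumptions.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def posterior(x, weights, means, variances):
    """Posterior probabilities that x belongs to each mixture component."""
    dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(dens)
    return [d / total for d in dens]


def fit_two_component_mixture(data, n_iter=200):
    """Fit a two-component normal mixture by EM (illustrative sketch only)."""
    n = len(data)
    # Assumed initialisation: split the sorted data in half for the two means.
    ordered = sorted(data)
    half = n // 2
    means = [sum(ordered[:half]) / half, sum(ordered[half:]) / (n - half)]
    overall_mean = sum(data) / n
    overall_var = sum((x - overall_mean) ** 2 for x in data) / n
    variances = [overall_var, overall_var]
    weights = [0.5, 0.5]

    for _ in range(n_iter):
        # E-step: posterior membership probabilities for every observation.
        resp = [posterior(x, weights, means, variances) for x in data]
        # M-step: update mixing proportions, means and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk
            # Assumed floor that keeps a component from collapsing onto one point.
            variances[k] = max(var_k, 1e-6)
    return weights, means, variances


def assign_cluster(x, weights, means, variances):
    """Assign x to the component with the highest posterior probability."""
    probs = posterior(x, weights, means, variances)
    return max(range(len(probs)), key=probs.__getitem__)
```

Calling `fit_two_component_mixture(data)` on a list of floats returns the estimated mixing proportions, means and variances; `assign_cluster` then gives the hard clustering implied by the fitted model. The skew normal and skew t-mixtures and the JCM procedure discussed in the chapter go well beyond this sketch.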

Metadata
Title
Application of Mixture Models to Large Datasets
Authors
Sharon X. Lee
Geoffrey McLachlan
Saumyadipta Pyne
Copyright year
2016
Publisher
Springer India
DOI
https://doi.org/10.1007/978-81-322-3628-3_4