
05.07.2017 | Regular Paper

Using data to build a better EM: EM* for big data

Authors: Hasan Kurban, Mark Jenne, Mehmet M. Dalkilic

Published in: International Journal of Data Science and Analytics | Issue 2/2017

Abstract

Existing data mining techniques, particularly iterative learning algorithms, become overwhelmed by big data. While parallelism is an obvious and usually necessary strategy, we observe that (1) continually revisiting data and (2) visiting all data are two of the most prominent problems, especially for iterative, unsupervised algorithms such as the expectation maximization algorithm for clustering (EM-T). Our strategy is to embed EM-T into a nonlinear hierarchical data structure (a heap) that allows us to (1) separate data that needs to be revisited from data that does not and (2) narrow the iteration toward the data that is more difficult to cluster. We call this extended EM-T algorithm EM*. We show that EM* outperforms EM-T over large real-world and synthetic data sets. We conclude with some theoretical underpinnings that explain why EM* is successful.
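
To make the heap idea concrete, below is a minimal, illustrative Python sketch of heap-guided EM on a one-dimensional Gaussian mixture. It is not the authors' implementation: the function name em_star_sketch, the confidence threshold conf, and the retire-once policy are assumptions made for exposition; the paper's EM* organizes and revisits its heap according to its own criteria.

import heapq
import numpy as np

def em_star_sketch(x, k=2, conf=0.95, iters=50, seed=0):
    """Heap-guided EM sketch: retire confidently clustered points so each
    iteration narrows onto the points that are harder to cluster."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False).astype(float)  # initial means
    var = np.full(k, x.var())                                # initial variances
    pi = np.full(k, 1.0 / k)                                 # mixing weights
    active = x.astype(float)     # points still being revisited
    settled = []                 # min-heap of (responsibility, point, cluster)
    for _ in range(iters):
        # E-step on the active points only.
        d = active[:, None] - mu[None, :]
        dens = pi * np.exp(-0.5 * d**2 / var) / np.sqrt(2 * np.pi * var)
        r = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)
        best, labels = r.max(axis=1), r.argmax(axis=1)
        # Heap step: push decisively assigned points out of the iteration.
        done = best >= conf
        for ri, xi, ci in zip(best[done], active[done], labels[done]):
            heapq.heappush(settled, (ri, xi, ci))
        active, r = active[~done], r[~done]
        if active.size == 0:
            break
        # M-step over the remaining (difficult) points.
        nk = r.sum(axis=0) + 1e-12
        mu = (r * active[:, None]).sum(axis=0) / nk
        var = (r * (active[:, None] - mu)**2).sum(axis=0) / nk + 1e-6
        pi = nk / nk.sum()
    return mu, var, pi, settled

# Tiny demo on a synthetic two-component mixture.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 5000), rng.normal(6, 1, 5000)])
mu, var, pi, settled = em_star_sketch(x)
print(sorted(mu), len(settled), "points settled")

The min-heap keeps the least-confident settled point at its root, so a fuller variant could pop entries back into the active set if the model drifts; the point of the sketch is simply that each successive iteration touches fewer, harder points.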

Metadata
Title
Using data to build a better EM: EM* for big data
Authors
Hasan Kurban
Mark Jenne
Mehmet M. Dalkilic
Publication date
05.07.2017
Publisher
Springer International Publishing
Published in
International Journal of Data Science and Analytics / Issue 2/2017
Print ISSN: 2364-415X
Electronic ISSN: 2364-4168
DOI
https://doi.org/10.1007/s41060-017-0062-1
