
05.07.2017 | Regular Paper

Using data to build a better EM: EM* for big data

Authors: Hasan Kurban, Mark Jenne, Mehmet M. Dalkilic

Published in: International Journal of Data Science and Analytics | Issue 2/2017

Abstract

Existing data mining techniques, particularly iterative learning algorithms, become overwhelmed by big data. While parallelism is an obvious and usually necessary strategy, we observe that (1) continually revisiting data and (2) visiting all data are two of the most prominent problems, especially for iterative, unsupervised algorithms such as the expectation maximization algorithm for clustering (EM-T). Our strategy is to embed EM-T into a nonlinear hierarchical data structure (a heap) that allows us to (1) separate data that needs to be revisited from data that does not and (2) narrow the iteration toward the data that is more difficult to cluster. We call this extended EM-T algorithm EM*. We show that EM* outperforms EM-T over large real-world and synthetic data sets. We conclude with some theoretical underpinnings that explain why EM* is successful.
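
To make the heap idea concrete, below is a minimal, illustrative Python sketch of heap-guided EM on a one-dimensional Gaussian mixture. It is not the authors' implementation: the function name em_star_sketch, the confidence threshold conf, and the retire-once policy are assumptions made for exposition; the paper's EM* organizes and revisits its heap according to its own criteria.

import heapq
import numpy as np

def em_star_sketch(x, k=2, conf=0.95, iters=50, seed=0):
    """Heap-guided EM sketch: retire confidently clustered points so each
    iteration narrows onto the points that are harder to cluster."""
    rng = np.random.default_rng(seed)
    mu = rng.choice(x, size=k, replace=False).astype(float)  # initial means
    var = np.full(k, x.var())                                # initial variances
    pi = np.full(k, 1.0 / k)                                 # mixing weights
    active = x.astype(float)     # points still being revisited
    settled = []                 # min-heap of (responsibility, point, cluster)
    for _ in range(iters):
        # E-step on the active points only.
        d = active[:, None] - mu[None, :]
        dens = pi * np.exp(-0.5 * d**2 / var) / np.sqrt(2 * np.pi * var)
        r = dens / (dens.sum(axis=1, keepdims=True) + 1e-300)
        best, labels = r.max(axis=1), r.argmax(axis=1)
        # Heap step: push decisively assigned points out of the iteration.
        done = best >= conf
        for ri, xi, ci in zip(best[done], active[done], labels[done]):
            heapq.heappush(settled, (ri, xi, ci))
        active, r = active[~done], r[~done]
        if active.size == 0:
            break
        # M-step over the remaining (difficult) points.
        nk = r.sum(axis=0) + 1e-12
        mu = (r * active[:, None]).sum(axis=0) / nk
        var = (r * (active[:, None] - mu)**2).sum(axis=0) / nk + 1e-6
        pi = nk / nk.sum()
    return mu, var, pi, settled

# Tiny demo on a synthetic two-component mixture.
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 5000), rng.normal(6, 1, 5000)])
mu, var, pi, settled = em_star_sketch(x)
print(sorted(mu), len(settled), "points settled")

The min-heap keeps the least-confident settled point at its root, so a fuller variant could pop entries back into the active set if the model drifts; the point of the sketch is simply that each successive iteration touches fewer, harder points.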

Metadata
Title
Using data to build a better EM: EM* for big data
Authors
Hasan Kurban
Mark Jenne
Mehmet M. Dalkilic
Publication date
05.07.2017
Publisher
Springer International Publishing
Published in
International Journal of Data Science and Analytics / Issue 2/2017
Print ISSN: 2364-415X
Electronic ISSN: 2364-4168
DOI
https://doi.org/10.1007/s41060-017-0062-1
