
15-04-2021 | Regular Paper

Model averaging in distributed machine learning: a case study with Apache Spark

Authors: Yunyan Guo, Zhipeng Zhang, Jiawei Jiang, Wentao Wu, Ce Zhang, Bin Cui, Jianzhong Li

Published in: The VLDB Journal | Issue 4/2021

Abstract

The increasing popularity of Apache Spark has attracted many users to put their data into its ecosystem. On the other hand, it has been repeatedly observed in the literature that Spark is slow when it comes to distributed machine learning (ML). One remedy is to switch to specialized systems such as parameter servers, which are claimed to have better performance. Nonetheless, users then have to undergo the painful procedure of moving data into and out of Spark. In this paper, we investigate the performance bottlenecks of MLlib (an official Spark package for ML) in detail, focusing on its implementation of stochastic gradient descent (SGD)—the workhorse behind the training of many ML models. We show that the performance inferiority of Spark is caused by implementation issues rather than by fundamental flaws of the bulk synchronous parallel (BSP) model that governs Spark’s execution: we can significantly improve Spark’s performance by leveraging the well-known “model averaging” (MA) technique in distributed ML. Indeed, model averaging is not limited to SGD, and we further showcase an application of MA to training latent Dirichlet allocation (LDA) models within Spark. Our implementation is not intrusive and requires only light development effort. Experimental results reveal that the MA-based versions of SGD and LDA can be orders of magnitude faster than their counterparts that do not use MA.
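
To make the model-averaging idea concrete, below is a minimal Scala sketch of an MA-based SGD training loop on Spark RDDs. It is not the paper's MA-SGD implementation: LabeledExample, localSgdEpoch, maSgd, the squared-loss gradient, and the single local pass per round are all illustrative assumptions. What the sketch is meant to show is the structural difference from MLlib's gradient aggregation: each round broadcasts the global model, runs SGD locally on every partition, and then averages the resulting per-partition models rather than aggregating per-example gradients.

```scala
import org.apache.spark.rdd.RDD

// Illustrative types and names (LabeledExample, localSgdEpoch, maSgd) --
// not the paper's actual MA-SGD code.
case class LabeledExample(label: Double, features: Array[Double])

object MaSgdSketch {

  // One local SGD pass over a partition, starting from the broadcast model.
  // A plain squared-loss gradient is used purely for illustration.
  def localSgdEpoch(init: Array[Double],
                    data: Iterator[LabeledExample],
                    lr: Double): Array[Double] = {
    val w = init.clone()
    data.foreach { ex =>
      val pred = w.indices.map(i => w(i) * ex.features(i)).sum
      val err  = pred - ex.label
      var i = 0
      while (i < w.length) { w(i) -= lr * err * ex.features(i); i += 1 }
    }
    w
  }

  // Each round: broadcast the global model, run SGD locally on every
  // partition, then average the per-partition models (model averaging)
  // instead of aggregating per-example gradients.
  def maSgd(data: RDD[LabeledExample],
            dim: Int, rounds: Int, lr: Double): Array[Double] = {
    val sc = data.sparkContext
    var model = Array.fill(dim)(0.0)
    for (_ <- 1 to rounds) {
      val bcModel = sc.broadcast(model)
      val (sum, count) = data
        .mapPartitions(it => Iterator((localSgdEpoch(bcModel.value, it, lr), 1L)))
        .reduce { case ((w1, n1), (w2, n2)) =>
          (w1.indices.map(i => w1(i) + w2(i)).toArray, n1 + n2)
        }
      model = sum.map(_ / count)   // model averaging step
      bcModel.unpersist()
    }
    model
  }
}
```

In this sketch each worker sends a single model-sized update per round, which is the intuition behind the reduced communication cost of MA compared with per-batch gradient aggregation.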


Footnotes
1
It remains an open question how to ensure convergence when using model averaging (perhaps with an implementation different from the current MA-SGD) to train deep models.
 
2
We assign one task to each executor because when we increase the number of tasks per executor, the time per iteration increases due to the heavy communication overhead.
 
5
We ignore the intermediate aggregators in Fig. 2b.
 
6
However, the proof is very lengthy and technical, and thus is omitted here.
 
8
The default value of c is 49 to guarantee a relative error of \(10^{-8}\), though \(c=9\) is enough for a relative error of \(10^{-5}\).
 
9
SVM is representative of GLMs; in fact, linear models share a similar training process from a system perspective (see the illustrative sketch following these footnotes).
 
13
We chose the batch sizes as follows: 16k for NYTimes, 64k for PubMed, and 100k for CommonCrawl.
 
14
The speedup per iteration is computed by dividing the elapsed time (the right plot) by the number of iterations (the left plot).
 
15
Since L-BFGS in spark.ml performs normalization, we evaluate the models with the normalized data.
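
As a concrete illustration of footnote 9, the sketch below (illustrative Scala, not the paper's code; all names such as GlmGradients, hinge, logistic, and step are assumptions) shows that switching among GLMs only swaps the per-example gradient function, while the SGD update itself, and hence the system-level training process, stays the same.

```scala
// From a system perspective, GLM training differs only in the per-example
// gradient; the SGD loop, communication pattern, and model layout are shared.
object GlmGradients {
  // Returns the scalar d(loss)/d(margin); the update is w -= lr * scale * x.
  type Gradient = (Array[Double], Array[Double], Double) => Double

  def dot(w: Array[Double], x: Array[Double]): Double =
    w.indices.map(i => w(i) * x(i)).sum

  // Hinge loss (linear SVM): subgradient is -y if y * <w,x> < 1, else 0.
  val hinge: Gradient = (w, x, y) => if (y * dot(w, x) < 1.0) -y else 0.0

  // Logistic loss with labels y in {-1,+1}: gradient scale is p - (y+1)/2.
  val logistic: Gradient = (w, x, y) => {
    val p = 1.0 / (1.0 + math.exp(-dot(w, x)))
    p - (y + 1.0) / 2.0
  }

  // One SGD step shared by all GLMs: only `grad` changes across models.
  def step(w: Array[Double], x: Array[Double], y: Double,
           lr: Double, grad: Gradient): Unit = {
    val scale = grad(w, x, y)
    var i = 0
    while (i < w.length) { w(i) -= lr * scale * x(i); i += 1 }
  }
}
```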
 
Metadata
Title
Model averaging in distributed machine learning: a case study with Apache Spark
Authors
Yunyan Guo
Zhipeng Zhang
Jiawei Jiang
Wentao Wu
Ce Zhang
Bin Cui
Jianzhong Li
Publication date
15-04-2021
Publisher
Springer Berlin Heidelberg
Published in
The VLDB Journal / Issue 4/2021
Print ISSN: 1066-8888
Electronic ISSN: 0949-877X
DOI
https://doi.org/10.1007/s00778-021-00664-7
