Skip to main content
Top

2019 | OriginalPaper | Chapter

Benchmarking Distributed Data Processing Systems for Machine Learning Workloads

Authors : Christoph Boden, Tilmann Rabl, Sebastian Schelter, Volker Markl

Published in: Performance Evaluation and Benchmarking for the Era of Artificial Intelligence

Publisher: Springer International Publishing

Activate our intelligent search to find suitable subject content or patents.

search-config
loading …

Abstract

Distributed data processing systems have been widely adopted to robustly scale out computations on massive data sets to many compute nodes in recent years. These systems are also popular choices to scale out the training of machine learning models. However, there is a lack of benchmarks to assess how efficiently data processing systems actually perform at executing machine learning algorithms at scale. For example, the learning algorithms chosen in the corresponding systems papers tend to be those that fit well onto the system’s paradigm rather than state of the art methods. Furthermore, experiments in those papers often neglect important aspects such as addressing all aspects of scalability. In this paper, we share our experience in evaluating novel data processing systems and present a core set of experiments of a benchmark for distributed data processing systems for machine learning workloads, a rationale for their necessity as well as an experimental evaluation.

Dont have a licence yet? Then find out more about our products and how to get one now:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literature
4.
go back to reference Abadi, M., et al.: TensorFlow: a system for large-scale machine learning. In: OSDI, pp. 265–283. USENIX Association (2016) Abadi, M., et al.: TensorFlow: a system for large-scale machine learning. In: OSDI, pp. 265–283. USENIX Association (2016)
5.
go back to reference Alexandrov, A., et al.: The stratosphere platform for big data analytics. VLDB J. 23(6), 939–964 (2014)CrossRef Alexandrov, A., et al.: The stratosphere platform for big data analytics. VLDB J. 23(6), 939–964 (2014)CrossRef
7.
go back to reference Bell, R.M., Koren, Y.: Scalable collaborative filtering with jointly derived neighborhood interpolation weights. In: Seventh IEEE International Conference on Data Mining (ICDM 2007), pp. 43–52, October 2007 Bell, R.M., Koren, Y.: Scalable collaborative filtering with jointly derived neighborhood interpolation weights. In: Seventh IEEE International Conference on Data Mining (ICDM 2007), pp. 43–52, October 2007
8.
go back to reference Boden, C., Rabl, T., Markl, V.: Distributed machine learning-but at what cost? Boden, C., Rabl, T., Markl, V.: Distributed machine learning-but at what cost?
9.
go back to reference Boden, C., Spina, A., Rabl, T., Markl, V.: Benchmarking data flow systems for scalable machine learning. In: Proceedings of the 4th Algorithms and Systems on MapReduce and Beyond, BeyondMR 2017, pp. 5:1–5:10. ACM, New York (2017) Boden, C., Spina, A., Rabl, T., Markl, V.: Benchmarking data flow systems for scalable machine learning. In: Proceedings of the 4th Algorithms and Systems on MapReduce and Beyond, BeyondMR 2017, pp. 5:1–5:10. ACM, New York (2017)
10.
go back to reference Böse, J.-H., et al.: Probabilistic demand forecasting at scale. Proc. VLDB Endow. 10(12), 1694–1705 (2017)CrossRef Böse, J.-H., et al.: Probabilistic demand forecasting at scale. Proc. VLDB Endow. 10(12), 1694–1705 (2017)CrossRef
11.
go back to reference Brants, T., Popat, A.C., Xu, P., Och, F.J., Dean, J.: Large language models in machine translation. In: EMNLP, pp. 858–867 (2007) Brants, T., Popat, A.C., Xu, P., Och, F.J., Dean, J.: Large language models in machine translation. In: EMNLP, pp. 858–867 (2007)
12.
go back to reference Brin, S., Page, L.: The anatomy of a large-scale hypertextual Web search engine. Comput. Netw. ISDN Syst. 30(1), 107–117 (1998). Proceedings of the Seventh International World Wide Web ConferenceCrossRef Brin, S., Page, L.: The anatomy of a large-scale hypertextual Web search engine. Comput. Netw. ISDN Syst. 30(1), 107–117 (1998). Proceedings of the Seventh International World Wide Web ConferenceCrossRef
13.
go back to reference Bu, Y., Howe, B., Balazinska, M., Ernst, M.D.: Haloop: efficient iterative data processing on large clusters. Proc. VLDB Endow. 3(1–2), 285–296 (2010)CrossRef Bu, Y., Howe, B., Balazinska, M., Ernst, M.D.: Haloop: efficient iterative data processing on large clusters. Proc. VLDB Endow. 3(1–2), 285–296 (2010)CrossRef
14.
go back to reference Cai, Z., Gao, Z.J., Luo, S., Perez, L.L., Vagena, Z., Jermaine, C.: A comparison of platforms for implementing and running very large scale machine learning algorithms. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD 2014, pp. 1371–1382 (2014) Cai, Z., Gao, Z.J., Luo, S., Perez, L.L., Vagena, Z., Jermaine, C.: A comparison of platforms for implementing and running very large scale machine learning algorithms. In: Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data, SIGMOD 2014, pp. 1371–1382 (2014)
15.
go back to reference Caninil, K.: Sibyl: a system for large scale supervised machine learning (2012) Caninil, K.: Sibyl: a system for large scale supervised machine learning (2012)
16.
go back to reference Caruana, R., Karampatziakis, N., Yessenalina, A.: An empirical evaluation of supervised learning in high dimensions. In: Proceedings of the 25th International Conference on Machine Learning, ICML 2008, pp. 96–103. ACM, New York (2008) Caruana, R., Karampatziakis, N., Yessenalina, A.: An empirical evaluation of supervised learning in high dimensions. In: Proceedings of the 25th International Conference on Machine Learning, ICML 2008, pp. 96–103. ACM, New York (2008)
17.
go back to reference Caruana, R., Niculescu-Mizil, A.: An empirical comparison of supervised learning algorithms. In: Proceedings of the 23rd International Conference on Machine Learning, ICML 2006, pp. 161–168. ACM, New York (2006) Caruana, R., Niculescu-Mizil, A.: An empirical comparison of supervised learning algorithms. In: Proceedings of the 23rd International Conference on Machine Learning, ICML 2006, pp. 161–168. ACM, New York (2006)
18.
go back to reference Chapelle, O., Manavoglu, E., Rosales, R.: Simple and scalable response prediction for display advertising. ACM Trans. Intell. Syst. Technol. 5(4), 61:1–61:34 (2014)CrossRef Chapelle, O., Manavoglu, E., Rosales, R.: Simple and scalable response prediction for display advertising. ACM Trans. Intell. Syst. Technol. 5(4), 61:1–61:34 (2014)CrossRef
19.
go back to reference Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2016, pp. 785–794. ACM, New York (2016) Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2016, pp. 785–794. ACM, New York (2016)
20.
go back to reference Chen, T., et al.: MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. CoRR, abs/1512.01274 (2015) Chen, T., et al.: MXNet: a flexible and efficient machine learning library for heterogeneous distributed systems. CoRR, abs/1512.01274 (2015)
21.
go back to reference Coleman, C., et al.: DAWNBench: an end-to-end deep learning benchmark and competition. In: ML Systems Workshop @ NIPS 2017, vol. 100, no. 101, p. 102 (2017) Coleman, C., et al.: DAWNBench: an end-to-end deep learning benchmark and competition. In: ML Systems Workshop @ NIPS 2017, vol. 100, no. 101, p. 102 (2017)
22.
go back to reference Das, A.S., Datar, M., Garg, A., Rajaram, S.: Google news personalization: scalable online collaborative filtering. In: Proceedings of the 16th International Conference on World Wide Web, WWW 2007, pp. 271–280. ACM, New York (2007) Das, A.S., Datar, M., Garg, A., Rajaram, S.: Google news personalization: scalable online collaborative filtering. In: Proceedings of the 16th International Conference on World Wide Web, WWW 2007, pp. 271–280. ACM, New York (2007)
23.
go back to reference Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)CrossRef Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)CrossRef
24.
go back to reference Domingos, P.: A few useful things to know about machine learning. Commun. ACM 55(10), 78–87 (2012)CrossRef Domingos, P.: A few useful things to know about machine learning. Commun. ACM 55(10), 78–87 (2012)CrossRef
25.
go back to reference Ekanayake, J., et al.: Twister: a runtime for iterative MapReduce. In: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC 2010, pp. 810–818. ACM, New York (2010) Ekanayake, J., et al.: Twister: a runtime for iterative MapReduce. In: Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, HPDC 2010, pp. 810–818. ACM, New York (2010)
26.
go back to reference Ewen, S., Tzoumas, K., Kaufmann, M., Markl, V.: Spinning fast iterative data flows. Proc. VLDB Endow. 5, 1268–1279 (2012)CrossRef Ewen, S., Tzoumas, K., Kaufmann, M., Markl, V.: Spinning fast iterative data flows. Proc. VLDB Endow. 5, 1268–1279 (2012)CrossRef
27.
28.
go back to reference Ghazal, A., et al.: Bigbench: towards an industry standard benchmark for big data analytics. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, pp. 1197–1208. ACM, New York (2013) Ghazal, A., et al.: Bigbench: towards an industry standard benchmark for big data analytics. In: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, SIGMOD 2013, pp. 1197–1208. ACM, New York (2013)
29.
go back to reference Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. The MIT Press, Cambridge (2016)MATH Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. The MIT Press, Cambridge (2016)MATH
30.
go back to reference Gustafson, J.L.: Reevaluating Amdahl’s law. Commun. ACM 31(5), 532–533 (1988)CrossRef Gustafson, J.L.: Reevaluating Amdahl’s law. Commun. ACM 31(5), 532–533 (1988)CrossRef
31.
go back to reference Halevy, A., Norvig, P., Pereira, F.: The unreasonable effectiveness of data. IEEE Intell. Syst. 24(2), 8–12 (2009)CrossRef Halevy, A., Norvig, P., Pereira, F.: The unreasonable effectiveness of data. IEEE Intell. Syst. 24(2), 8–12 (2009)CrossRef
32.
go back to reference He, X., et al.: Practical lessons from predicting clicks on ads at Facebook. In: Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, ADKDD 2014, pp. 5:1–5:9. ACM, New York (2014) He, X., et al.: Practical lessons from predicting clicks on ads at Facebook. In: Proceedings of the Eighth International Workshop on Data Mining for Online Advertising, ADKDD 2014, pp. 5:1–5:9. ACM, New York (2014)
33.
go back to reference Hoffer, E., Hubara, I., Soudry, D.: Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In: NIPS, pp. 1729–1739 (2017) Hoffer, E., Hubara, I., Soudry, D.: Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In: NIPS, pp. 1729–1739 (2017)
34.
go back to reference Huang, S., Huang, J., Dai, J., Xie, T., Huang, B.: The HiBench benchmark suite: characterization of the MapReduce-based data analysis. In: Agrawal, D., Candan, K.S., Li, W.-S. (eds.) New Frontiers in Information and Software as Services. LNBIP, vol. 74, pp. 209–228. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-19294-4_9CrossRef Huang, S., Huang, J., Dai, J., Xie, T., Huang, B.: The HiBench benchmark suite: characterization of the MapReduce-based data analysis. In: Agrawal, D., Candan, K.S., Li, W.-S. (eds.) New Frontiers in Information and Software as Services. LNBIP, vol. 74, pp. 209–228. Springer, Heidelberg (2011). https://​doi.​org/​10.​1007/​978-3-642-19294-4_​9CrossRef
35.
go back to reference Jagadish, H.V., et al.: Big data and its technical challenges. Commun. ACM 57(7), 86–94 (2014)CrossRef Jagadish, H.V., et al.: Big data and its technical challenges. Commun. ACM 57(7), 86–94 (2014)CrossRef
36.
go back to reference Jimmy, L., Kolcz, A.: Large-scale machine learning at Twitter. In: SIGMOD 2012 (2012) Jimmy, L., Kolcz, A.: Large-scale machine learning at Twitter. In: SIGMOD 2012 (2012)
37.
go back to reference Koren, Y., Bell, R., Volinsky, C.: Matrix factorization techniques for recommender systems. Computer 42(8), 30–37 (2009)CrossRef Koren, Y., Bell, R., Volinsky, C.: Matrix factorization techniques for recommender systems. Computer 42(8), 30–37 (2009)CrossRef
38.
go back to reference LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436 (2015)CrossRef LeCun, Y., Bengio, Y., Hinton, G.: Deep learning. Nature 521(7553), 436 (2015)CrossRef
39.
go back to reference Lin, J., Dyer, C.: Data-Intensive Text Processing with MapReduce. Morgan and Claypool Publishers, San Rafael (2010)CrossRef Lin, J., Dyer, C.: Data-Intensive Text Processing with MapReduce. Morgan and Claypool Publishers, San Rafael (2010)CrossRef
40.
go back to reference Ling, X., Deng, W., Gu, C., Zhou, H., Li, C., Sun, F.: Model ensemble for click prediction in Bing search ads. In: Proceedings of the 26th International Conference on World Wide Web Companion, WWW 2017 Companion, pp. 689–698, Republic and Canton of Geneva, Switzerland. International World Wide Web Conferences Steering Committee (2017) Ling, X., Deng, W., Gu, C., Zhou, H., Li, C., Sun, F.: Model ensemble for click prediction in Bing search ads. In: Proceedings of the 26th International Conference on World Wide Web Companion, WWW 2017 Companion, pp. 689–698, Republic and Canton of Geneva, Switzerland. International World Wide Web Conferences Steering Committee (2017)
41.
go back to reference Low, Y., Bickson, D., Gonzalez, J., Guestrin, C., Kyrola, A., Hellerstein, J.M.: Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proce. VLDB Endow. 5(8), 716–727 (2012)CrossRef Low, Y., Bickson, D., Gonzalez, J., Guestrin, C., Kyrola, A., Hellerstein, J.M.: Distributed GraphLab: a framework for machine learning and data mining in the cloud. Proce. VLDB Endow. 5(8), 716–727 (2012)CrossRef
42.
go back to reference Low, Y., Gonzalez, J.E., Kyrola, A., Bickson, D., Guestrin, C.E., Hellerstein, J.: GraphLab: a new framework for parallel machine learning. arXiv preprint arXiv:1408.2041 (2014) Low, Y., Gonzalez, J.E., Kyrola, A., Bickson, D., Guestrin, C.E., Hellerstein, J.: GraphLab: a new framework for parallel machine learning. arXiv preprint arXiv:​1408.​2041 (2014)
43.
go back to reference Marcu, O.C., Costan, A., Antoniu, G., Pérez-Hernéndez, M.S.: Spark versus flink: understanding performance in big data analytics frameworks. IEEE CLUSTER 2016, 433–442 (2016) Marcu, O.C., Costan, A., Antoniu, G., Pérez-Hernéndez, M.S.: Spark versus flink: understanding performance in big data analytics frameworks. IEEE CLUSTER 2016, 433–442 (2016)
44.
go back to reference McMahan, H.B., et al.: Ad click prediction: a view from the trenches. In: KDD 2013. ACM (2013) McMahan, H.B., et al.: Ad click prediction: a view from the trenches. In: KDD 2013. ACM (2013)
45.
go back to reference McSherry, F., Isard, M., Murray, D.G.: Scalability! But at what cost? In: USENIX HOTOS 2015. USENIX Association (2015) McSherry, F., Isard, M., Murray, D.G.: Scalability! But at what cost? In: USENIX HOTOS 2015. USENIX Association (2015)
46.
go back to reference Meng, X., et al.: MLlib: machine learning in Apache spark. J. Mach. Learn. Res. 17(1), 1235–1241 (2016)MathSciNetMATH Meng, X., et al.: MLlib: machine learning in Apache spark. J. Mach. Learn. Res. 17(1), 1235–1241 (2016)MathSciNetMATH
47.
go back to reference Ousterhout, K., Rasti, R., Ratnasamy, S., Shenker, S., Chun, B.-G.: Making sense of performance in data analytics frameworks. In: Proceedings of the 12th USENIX Conference on Networked Systems Design and Implementation, NSDI 2015, pp. 293–307. USENIX Association, Berkeley (2015) Ousterhout, K., Rasti, R., Ratnasamy, S., Shenker, S., Chun, B.-G.: Making sense of performance in data analytics frameworks. In: Proceedings of the 12th USENIX Conference on Networked Systems Design and Implementation, NSDI 2015, pp. 293–307. USENIX Association, Berkeley (2015)
48.
go back to reference Richardson, M., Dominowska, E., Ragno, R.: Predicting clicks: estimating the click-through rate for new ads. In: WWW 2007. ACM (2007) Richardson, M., Dominowska, E., Ragno, R.: Predicting clicks: estimating the click-through rate for new ads. In: WWW 2007. ACM (2007)
49.
go back to reference Schelter, S., Boden, C., Schenck, M., Alexandrov, A., Markl, V.: Distributed matrix factorization with MapReduce using a series of broadcast-joins. In: ACM RecSys 2013 (2013) Schelter, S., Boden, C., Schenck, M., Alexandrov, A., Markl, V.: Distributed matrix factorization with MapReduce using a series of broadcast-joins. In: ACM RecSys 2013 (2013)
50.
go back to reference Shi, J., et al.: Clash of the Titans: MapReduce vs. spark for large scale data analytics. Proc. VLDB Endow. 8(13), 2110–2121 (2015)CrossRef Shi, J., et al.: Clash of the Titans: MapReduce vs. spark for large scale data analytics. Proc. VLDB Endow. 8(13), 2110–2121 (2015)CrossRef
51.
go back to reference Veiga, J., Expósito, R.R., Pardo, X.C., Taboada, G.L., Tourifio, J.: Performance evaluation of big data frameworks for large-scale data analytics. IEEE BigData 2016, 424–431 (2016) Veiga, J., Expósito, R.R., Pardo, X.C., Taboada, G.L., Tourifio, J.: Performance evaluation of big data frameworks for large-scale data analytics. IEEE BigData 2016, 424–431 (2016)
52.
go back to reference Weinberger, K., Dasgupta, A., Langford, J., Smola, A., Attenberg, J.: Feature hashing for large scale multitask learning. In: Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, pp. 1113–1120. ACM, New York (2009) Weinberger, K., Dasgupta, A., Langford, J., Smola, A., Attenberg, J.: Feature hashing for large scale multitask learning. In: Proceedings of the 26th Annual International Conference on Machine Learning, ICML 2009, pp. 1113–1120. ACM, New York (2009)
53.
go back to reference Yu, D., et al.: An introduction to computational networks and the computational network toolkit. Microsoft Technical report MSR-TR-2014-112 (2014) Yu, D., et al.: An introduction to computational networks and the computational network toolkit. Microsoft Technical report MSR-TR-2014-112 (2014)
54.
go back to reference Zaharia, M., et al.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: NSDI 2012 (2012) Zaharia, M., et al.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: NSDI 2012 (2012)
56.
go back to reference Zhuang, Y., Chin, W.-S., Juan, Y.-C., Lin, C.-J.: A fast parallel SGD for matrix factorization in shared memory systems. In: Proceedings of the 7th ACM Conference on Recommender Systems, RecSys 2013, pp. 249–256. ACM, New York (2013) Zhuang, Y., Chin, W.-S., Juan, Y.-C., Lin, C.-J.: A fast parallel SGD for matrix factorization in shared memory systems. In: Proceedings of the 7th ACM Conference on Recommender Systems, RecSys 2013, pp. 249–256. ACM, New York (2013)
Metadata
Title
Benchmarking Distributed Data Processing Systems for Machine Learning Workloads
Authors
Christoph Boden
Tilmann Rabl
Sebastian Schelter
Volker Markl
Copyright Year
2019
DOI
https://doi.org/10.1007/978-3-030-11404-6_4