Published in: The VLDB Journal 2/2024

20.09.2023 | Regular Paper

A systematic evaluation of machine learning on serverless infrastructure

Authors: Jiawei Jiang, Shaoduo Gan, Bo Du, Gustavo Alonso, Ana Klimovic, Ankit Singla, Wentao Wu, Sheng Wang, Ce Zhang


Abstract

Recently, the serverless paradigm of computing has inspired research on its applicability to data-intensive tasks such as ETL, database query processing, and machine learning (ML) model training. Recent efforts have proposed multiple systems for training large-scale ML models in a distributed manner on top of serverless infrastructures (e.g., AWS Lambda). Yet, there is so far no consensus on the design space for such systems when compared with systems built on top of classical “serverful” infrastructures. Indeed, a variety of factors can impact the performance of training ML models in a distributed environment, such as the optimization algorithm used and the synchronization protocol followed by parallel executors, and these factors must be carefully considered when designing serverless ML systems. To clarify contradictory observations from previous work, in this paper we present a systematic comparative study of serverless and serverful systems for distributed ML training. We present a design space that covers the choices made by previous systems on aspects such as optimization algorithms and synchronization protocols. We then implement a platform, LambdaML, that enables a fair comparison between serverless and serverful systems by navigating this design space. We further extend LambdaML toward automatic operation by designing a hyper-parameter tuning framework that leverages the elasticity of serverless infrastructure. We present empirical evaluation results using LambdaML on both single training jobs and multi-tenant workloads. Our results reveal that there is no “one size fits all” serverless solution given the current state of the art: one must choose different designs for different ML workloads. We also develop an analytic model, based on our empirical observations, that captures the cost/performance tradeoffs one has to consider when deciding between serverless and serverful designs for distributed ML training.
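To make the two design-space axes concrete, the following minimal sketch (illustrative only, not the authors' LambdaML implementation; all names are hypothetical) shows how stateless serverless workers might run mini-batch SGD and synchronize through an external store, since cloud functions such as AWS Lambda cannot open connections to one another directly. An in-memory dictionary stands in for the storage service (e.g., S3) that a real deployment would use.

```python
import numpy as np

class ExternalStore:
    """In-memory stand-in for the shared storage (e.g., S3 or a
    memory-based cache) that stateless serverless workers must use
    to exchange model state."""
    def __init__(self):
        self._blobs = {}

    def put(self, key, array):
        self._blobs[key] = np.array(array, copy=True)

    def get(self, key):
        return self._blobs[key]

def local_sgd_step(w, X, y, lr=0.1):
    """One mini-batch SGD step for logistic regression on one worker's shard."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))   # predicted probabilities
    grad = X.T @ (p - y) / len(y)        # gradient of the logistic loss
    return w - lr * grad

def synchronous_round(store, shards, round_id, w):
    """Bulk-synchronous protocol: every worker writes its update to the
    store, then the updates are averaged into a new global model.
    Asynchronous protocols would drop this barrier and let each worker
    read/update the latest model at its own pace."""
    for i, (X, y) in enumerate(shards):
        store.put(f"round{round_id}/worker{i}", local_sgd_step(w, X, y))
    updates = [store.get(f"round{round_id}/worker{i}") for i in range(len(shards))]
    return np.mean(updates, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))
    y = (X @ rng.normal(size=20) > 0).astype(float)
    shards = [(X[i::4], y[i::4]) for i in range(4)]  # 4 "cloud functions"
    store, w = ExternalStore(), np.zeros(20)
    for r in range(50):
        w = synchronous_round(store, shards, r, w)
    print("training accuracy:", ((X @ w > 0) == y).mean())
```

On the cost side, such a deployment is billed per function invocation and per unit of memory-time consumed, whereas a serverful cluster is billed per VM-hour regardless of utilization; this is the tradeoff the paper's analytic model captures.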


Footnotes
3
Although we made considerable efforts to run larger image datasets (e.g., ImageNet), performance was extremely slow, since Lambda provides no GPUs and caps the available memory.
 
4
We do not include regression models such as linear regression, but we believe the trade-off space would be the same, since the model complexity is similar.
 
7
Note that Hogwild! [57] trains RCV1 on a single machine within 9.5 s (excluding startup and data loading time), whereas the model training time of LambdaML is about 27 s. Hogwild! uses a lock-free asynchronous strategy and favors sparse datasets. Although its training algorithm differs from our setting, we believe it is important to report these numbers as a reference.
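For context, the lock-free pattern that Hogwild! relies on can be sketched as follows (illustrative only, not the original implementation): several threads on one machine update a shared weight vector with no locking, which preserves convergence in practice when the data are sparse and concurrent updates mostly touch disjoint coordinates.

```python
# Minimal sketch of Hogwild!-style lock-free asynchronous SGD (illustrative,
# not the original implementation). Threads update the shared vector w with
# no locks; on sparse data, races on individual coordinates are rare.
import threading
import numpy as np

rng = np.random.default_rng(1)
mask = rng.random((2000, 100)) < 0.05              # ~5% nonzeros: sparse data
X = mask * rng.normal(size=(2000, 100))
y = (X @ rng.normal(size=100) > 0).astype(float)
w = np.zeros(100)                                  # shared model, no lock

def worker(rows, lr=0.5):
    for i in rows:
        xi = X[i]
        p = 1.0 / (1.0 + np.exp(-(xi @ w)))        # reads possibly stale w
        nz = xi != 0
        w[nz] -= lr * (p - y[i]) * xi[nz]          # racy per-coordinate update

threads = [threading.Thread(target=worker, args=(range(t, 2000, 4),))
           for t in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("training accuracy:", ((X @ w > 0) == y).mean())
```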
 
8
Distributed PyTorch with ADMM achieves the best results when training LR and SVM; distributed PyTorch achieves the best results when training KM; and distributed PyTorch with SGD achieves the best results when training MN.
 
9
Due to space limitations, we show results for four representative tasks. The observed patterns are the same on the other workloads.
 
References
1. Abadi, D.J., Madden, S.R., Hachem, N.: Column-stores vs. row-stores: how different are they really? In: SIGMOD, pp. 967–980 (2008)
2. Abadi, M., Barham, P., Chen, J., et al.: TensorFlow: a system for large-scale machine learning. In: OSDI, pp. 265–283 (2016)
3. Akkus, I.E., Chen, R., Rimac, I., et al.: SAND: towards high-performance serverless computing. In: USENIX ATC, pp. 923–935 (2018)
4. Baldi, P., Sadowski, P., Whiteson, D.: Searching for exotic particles in high-energy physics with deep learning. Nat. Commun. 5(1), 1–9 (2014)
5. Baldini, I., Castro, P., Chang, K., et al.: Serverless computing: current trends and open problems. In: Research Advances in Cloud Computing, pp. 1–20 (2017)
6. Bergstra, J., Bengio, Y.: Random search for hyper-parameter optimization. JMLR 13(2), 281–305 (2012)
7. Bergstra, J., Yamins, D., Cox, D.D., et al.: Hyperopt: a Python library for optimizing the hyperparameters of machine learning algorithms. In: SciPy, vol. 13, p. 20 (2013)
8. Bhattacharjee, A., Barve, Y., Khare, S., Bao, S., Gokhale, A., Damiano, T.: Stratum: a serverless framework for the lifecycle management of machine learning-based data analytics tasks. In: OpML, pp. 59–61 (2019)
9. Boehm, M., Tatikonda, S., Reinwald, B., et al.: Hybrid parallelization strategies for large-scale machine learning in SystemML. VLDB 7(7), 553–564 (2014)
10. Boyd, S., Parikh, N., Chu, E., et al.: Distributed optimization and statistical learning via the alternating direction method of multipliers. Found. Trends Mach. Learn. 3(1), 1–122 (2011)
11. Cao, W., Zhang, Y., Yang, X., Li, F., Wang, S., Hu, Q., Cheng, X., Chen, Z., Liu, Z., Fang, J., et al.: PolarDB serverless: a cloud native database for disaggregated data centers. In: SIGMOD, pp. 2477–2489 (2021)
12. Carreira, J., Fonseca, P., Tumanov, A., Zhang, A., Katz, R.: Cirrus: a serverless framework for end-to-end ML workflows. In: SoCC, pp. 13–24 (2019)
13. Castro, P., Ishakian, V., Muthusamy, V., Slominski, A.: The rise of serverless computing. Commun. ACM 62(12), 44–54 (2019)
14. Chaturapruek, S., Duchi, J.C., Ré, C.: Asynchronous stochastic convex optimization: the noise is in the noise and SGD don't care. In: NeurIPS, pp. 1531–1539 (2015)
15. Chen, T., Guestrin, C.: XGBoost: a scalable tree boosting system. In: SIGKDD, pp. 785–794 (2016)
16. Dean, J., Corrado, G., Monga, R., et al.: Large scale distributed deep networks. In: NeurIPS, pp. 1223–1231 (2012)
17. Falkner, S., Klein, A., Hutter, F.: BOHB: robust and efficient hyperparameter optimization at scale. In: ICML, pp. 1437–1446 (2018)
18. Fard, A., Le, A., Larionov, G., Dhillon, W., Bear, C.: Vertica-ML: distributed machine learning in Vertica database. In: SIGMOD, pp. 755–768 (2020)
19. Feng, L., Kudva, P., Da Silva, D., Hu, J.: Exploring serverless computing for neural network training. In: CLOUD, pp. 334–341 (2018)
20. Fingler, H., Akshintala, A., Rossbach, C.J.: USETL: unikernels for serverless extract transform and load. Why should you settle for less? In: APSys, pp. 23–30 (2019)
21. Gropp, W., Lusk, E., Skjellum, A.: Using MPI: Portable Parallel Programming with the Message-Passing Interface, vol. 1 (1999)
22. Gupta, V., Kadhe, S., Courtade, T., Mahoney, M.W., Ramchandran, K.: OverSketched Newton: fast convex optimization for serverless systems. arXiv:1903.08857 (2019)
23. Hellerstein, J.M., Faleiro, J.M., Gonzalez, J., et al.: Serverless computing: one step forward, two steps back. In: CIDR (2019)
24. Hendrickson, S., Sturdevant, S., Harter, T., Venkataramani, V., Arpaci-Dusseau, A.C., Arpaci-Dusseau, R.H.: Serverless computation with OpenLambda. In: HotCloud (2016)
25. Ho, Q., Cipar, J., Cui, H., et al.: More effective distributed ML via a stale synchronous parallel parameter server. In: NeurIPS, pp. 1223–1231 (2013)
26. Hsieh, K., Harlap, A., Vijaykumar, N., Konomis, D., Ganger, G.R., Gibbons, P.B., Mutlu, O.: Gaia: geo-distributed machine learning approaching LAN speeds. In: NSDI, pp. 629–647 (2017)
27. Huang, Y., Jin, T., Wu, Y., et al.: FlexPS: flexible parallelism control in parameter server architecture. VLDB 11(5), 566–579 (2018)
28. Ishakian, V., Muthusamy, V., Slominski, A.: Serving deep learning models in a serverless platform. In: IC2E, pp. 257–262 (2018)
29. Jiang, J., Cui, B., Zhang, C., Fu, F.: DimBoost: boosting gradient boosting decision tree to higher dimensions. In: SIGMOD, pp. 1363–1376 (2018)
30. Jiang, J., Cui, B., Zhang, C., Yu, L.: Heterogeneity-aware distributed parameter servers. In: SIGMOD, pp. 463–478 (2017)
31. Jiang, J., Fu, F., Yang, T., Cui, B.: SketchML: accelerating distributed machine learning with data sketches. In: SIGMOD, pp. 1269–1284 (2018)
32. Jiang, J., Yu, L., Jiang, J., Liu, Y., Cui, B.: Angel: a new large-scale machine learning system. Natl. Sci. Rev. 5(2), 216–236 (2018)
33. Jonas, E., Schleier-Smith, J., Sreekanti, V., et al.: Cloud programming simplified: a Berkeley view on serverless computing. arXiv:1902.03383 (2019)
34. Kaoudi, Z., Quiané-Ruiz, J.A., Thirumuruganathan, S., Chawla, S., Agrawal, D.: A cost-based optimizer for gradient descent optimization. In: SIGMOD, pp. 977–992 (2017)
35. Kara, K., Eguro, K., Zhang, C., Alonso, G.: ColumnML: column-store machine learning with on-the-fly data transformation. VLDB 12(4), 348–361 (2018)
36. Klein, A., Falkner, S., Mansur, N., Hutter, F.: RoBO: a flexible and robust Bayesian optimization framework in Python. In: NIPS 2017 Bayesian Optimization Workshop, pp. 4–9 (2017)
37. Klimovic, A., Wang, Y., Kozyrakis, C., Stuedi, P., Pfefferle, J., Trivedi, A.: Understanding ephemeral storage for serverless analytics. In: USENIX ATC, pp. 789–794 (2018)
38. Klimovic, A., Wang, Y., Stuedi, P., Trivedi, A., Pfefferle, J., Kozyrakis, C.: Pocket: elastic ephemeral storage for serverless analytics. In: OSDI, pp. 427–444 (2018)
39. Kraska, T., Talwalkar, A., Duchi, J.C., Griffith, R., Franklin, M.J., Jordan, M.I.: MLbase: a distributed machine-learning system. In: CIDR, vol. 1, pp. 2–1 (2013)
40. Lewis, D.D., Yang, Y., Rose, T.G., Li, F.: RCV1: a new benchmark collection for text categorization research. JMLR 5(4), 361–397 (2004)
41. Li, L., Jamieson, K., DeSalvo, G., Rostamizadeh, A., Talwalkar, A.: Hyperband: a novel bandit-based approach to hyperparameter optimization. JMLR 18(1), 6765–6816 (2017)
42. Li, M., Andersen, D.G., Smola, A.J., Yu, K.: Communication efficient distributed machine learning with the parameter server. In: NeurIPS, pp. 19–27 (2014)
43. Li, S., Zhao, Y., Varma, R., et al.: PyTorch distributed: experiences on accelerating data parallel training. VLDB 13(12), 3005–3018 (2020)
44. Liaw, R., Bhardwaj, R., Dunlap, L., Zou, Y., Gonzalez, J.E., Stoica, I., Tumanov, A.: HyperSched: dynamic resource reallocation for model development on a deadline. In: SoCC, pp. 61–73 (2019)
45. Liu, J., Zhang, C.: Distributed learning systems with first-order methods. Found. Trends Databases 9, 1–100 (2020)
46. McSherry, F., Isard, M., Murray, D.G.: Scalability! But at what cost? In: HotOS (2015)
47. Meng, X., Bradley, J., Yavuz, B., et al.: MLlib: machine learning in Apache Spark. JMLR 17(1), 1235–1241 (2016)
48. Misra, U., Liaw, R., Dunlap, L., Bhardwaj, R., Kandasamy, K., Gonzalez, J.E., Stoica, I., Tumanov, A.: RubberBand: cloud-based hyperparameter tuning. In: EuroSys, pp. 327–342 (2021)
49. Müller, I., Marroquín, R., Alonso, G.: Lambada: interactive data analytics on cold data using serverless cloud infrastructure. In: SIGMOD, pp. 115–130 (2020)
50. Narayanan, D., Santhanam, K., Kazhamiaka, F., Phanishayee, A., Zaharia, M.: Heterogeneity-aware cluster scheduling policies for deep learning workloads. In: OSDI, pp. 481–498 (2020)
51. Ooi, B.C., Tan, K.L., Wang, S., et al.: SINGA: a distributed deep learning platform. In: MM, pp. 685–688 (2015)
52. Paszke, A., Gross, S., Massa, F., et al.: PyTorch: an imperative style, high-performance deep learning library. NeurIPS 32, 8026–8037 (2019)
53. Perron, M., Castro Fernandez, R., DeWitt, D., Madden, S.: Starling: a scalable query engine on cloud functions. In: SIGMOD, pp. 131–141 (2020)
54. Poppe, O., Guo, Q., Lang, W., Arora, P., Oslake, M., Xu, S., Kalhan, A.: Moneyball: proactive auto-scaling in Microsoft Azure SQL database serverless. In: VLDB (2022)
55. Pu, Q., Venkataraman, S., Stoica, I.: Shuffling, fast and slow: scalable analytics on serverless infrastructure. In: NSDI, pp. 193–206 (2019)
56. Rausch, T., Hummer, W., Muthusamy, V., Rashed, A., Dustdar, S.: Towards a serverless platform for edge AI. In: HotEdge (2019)
57. Recht, B., Re, C., Wright, S., Niu, F.: Hogwild!: a lock-free approach to parallelizing stochastic gradient descent. In: NeurIPS, pp. 693–701 (2011)
58. Schleier-Smith, J., Sreekanti, V., Khandelwal, A., Carreira, J., Yadwadkar, N.J., Popa, R.A., Gonzalez, J.E., Stoica, I., Patterson, D.A.: What serverless computing is and should become: the next phase of cloud computing. Commun. ACM 64(5), 76–84 (2021)
60. Sparks, E.R., Venkataraman, S., Kaftan, T., Franklin, M.J., Recht, B.: KeystoneML: optimizing pipelines for large-scale advanced analytics. In: ICDE, pp. 535–546 (2017)
61. Tandon, R., Lei, Q., Dimakis, A.G., Karampatziakis, N.: Gradient coding: avoiding stragglers in distributed learning. In: ICML, pp. 3368–3376 (2017)
62. Tang, H., Gan, S., Zhang, C., Zhang, T., Liu, J.: Communication compression for decentralized training. In: NeurIPS, pp. 7663–7673 (2018)
63. Tang, H., Lian, X., Yan, M., Zhang, C., Liu, J.: D²: decentralized training over decentralized data. In: ICML, pp. 4848–4856 (2018)
64. Wang, H., Niu, D., Li, B.: Distributed machine learning with a serverless architecture. In: INFOCOM, pp. 1288–1296 (2019)
65. Wang, J., Joshi, G.: Adaptive communication strategies to achieve the best error-runtime trade-off in local-update SGD. arXiv:1810.08313 (2018)
66. Wang, L., Li, M., Zhang, Y., Ristenpart, T., Swift, M.: Peeking behind the curtains of serverless platforms. In: USENIX ATC, pp. 133–146 (2018)
67. Wawrzoniak, M., Müller, I., Fraga Barcelos Paulus Bruno, R., Alonso, G.: Boxer: data analytics on network-enabled serverless platforms. In: CIDR (2021)
68. Wu, Y., Dinh, T.T.A., Hu, G., Zhang, M., Chee, Y.M., Ooi, B.C.: Serverless data science: are we there yet? A case study of model serving (2022)
69. Zhang, H., Li, J., Kara, K., Alistarh, D., Liu, J., Zhang, C.: ZipML: training linear models with end-to-end low precision, and a little bit of deep learning. In: ICML, pp. 4035–4043 (2017)
70. Zhang, Z., Jiang, J., Wu, W., Zhang, C., Yu, L., Cui, B.: MLlib*: fast training of GLMs using Spark MLlib. In: ICDE, pp. 1778–1789 (2019)
71. Zheng, S., Meng, Q., Wang, T., et al.: Asynchronous stochastic gradient descent with delay compensation. In: ICML, pp. 4120–4129 (2017)
72. Zinkevich, M., Weimer, M., Smola, A.J., Li, L.: Parallelized stochastic gradient descent. In: NeurIPS, pp. 2595–2603 (2010)
Metadata
Title
A systematic evaluation of machine learning on serverless infrastructure
Authors
Jiawei Jiang
Shaoduo Gan
Bo Du
Gustavo Alonso
Ana Klimovic
Ankit Singla
Wentao Wu
Sheng Wang
Ce Zhang
Publication date
20.09.2023
Publisher
Springer Berlin Heidelberg
Published in
The VLDB Journal / Issue 2/2024
Print ISSN: 1066-8888
Electronic ISSN: 0949-877X
DOI
https://doi.org/10.1007/s00778-023-00813-0
