Skip to main content
Erschienen in: Distributed and Parallel Databases 3/2019

24.08.2018

Abstract cost models for distributed data-intensive computations

verfasst von: Rundong Li, Ningfang Mi, Mirek Riedewald, Yizhou Sun, Yi Yao

Erschienen in: Distributed and Parallel Databases | Ausgabe 3/2019

Einloggen

Aktivieren Sie unsere intelligente Suche, um passende Fachinhalte oder Patente zu finden.

search-config
loading …

Abstract

We consider data analytics workloads on distributed architectures, in particular clusters of commodity machines. To find a job partitioning that minimizes running time, a cost model, which we more accurately refer to as makespan model, is needed. In attempting to find the simplest possible, but sufficiently accurate, such model, we explore piecewise linear functions of input, output, and computational complexity. They are abstract in the sense that they capture fundamental algorithm properties, but do not require explicit modeling of system and implementation details such as the number of disk accesses. We show how the simplified functional structure can be exploited to reduce optimization cost. In the general case, we identify a lower bound that can be used for search-space pruning. For applications with homogeneous tasks, we further demonstrate how to directly integrate the model into the makespan optimization process, reducing search-space dimensionality and thus complexity by orders of magnitude. Experimental results provide evidence of good prediction quality and successful makespan optimization across a variety of operators and cluster architectures.

Sie haben noch keine Lizenz? Dann Informieren Sie sich jetzt über unsere Produkte:

Springer Professional "Wirtschaft+Technik"

Online-Abonnement

Mit Springer Professional "Wirtschaft+Technik" erhalten Sie Zugriff auf:

  • über 102.000 Bücher
  • über 537 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Maschinenbau + Werkstoffe
  • Versicherung + Risiko

Jetzt Wissensvorsprung sichern!

Springer Professional "Technik"

Online-Abonnement

Mit Springer Professional "Technik" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 390 Zeitschriften

aus folgenden Fachgebieten:

  • Automobil + Motoren
  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Elektrotechnik + Elektronik
  • Energie + Nachhaltigkeit
  • Maschinenbau + Werkstoffe




 

Jetzt Wissensvorsprung sichern!

Springer Professional "Wirtschaft"

Online-Abonnement

Mit Springer Professional "Wirtschaft" erhalten Sie Zugriff auf:

  • über 67.000 Bücher
  • über 340 Zeitschriften

aus folgenden Fachgebieten:

  • Bauwesen + Immobilien
  • Business IT + Informatik
  • Finance + Banking
  • Management + Führung
  • Marketing + Vertrieb
  • Versicherung + Risiko




Jetzt Wissensvorsprung sichern!

Literatur
1.
Zurück zum Zitat Agarwal, R.C., Balle, S.M., Gustavson, F.G., Joshi, M., Palkar, P.: A three-dimensional approach to parallel matrix multiplication. IBM J. Res. Dev. 39(5), 575–582 (1995)CrossRef Agarwal, R.C., Balle, S.M., Gustavson, F.G., Joshi, M., Palkar, P.: A three-dimensional approach to parallel matrix multiplication. IBM J. Res. Dev. 39(5), 575–582 (1995)CrossRef
2.
Zurück zum Zitat Akdere, M., Cetintemel, U., Riondato, M., Upfal, E., Zdonik, S.: Learning-based query performance modeling and prediction. In: ICDE, pp. 390–401 (2012) Akdere, M., Cetintemel, U., Riondato, M., Upfal, E., Zdonik, S.: Learning-based query performance modeling and prediction. In: ICDE, pp. 390–401 (2012)
3.
Zurück zum Zitat Asanovic, K., Bodik, R., Demmel, J., Keaveny, T., Keutzer, K., Kubiatowicz, J., Morgan, N., Patterson, D., Sen, K., Wawrzynek, J., Wessel, D., Yelick, K.: A view of the parallel computing landscape. Commun. ACM 52(10), 56–67 (2009)CrossRef Asanovic, K., Bodik, R., Demmel, J., Keaveny, T., Keutzer, K., Kubiatowicz, J., Morgan, N., Patterson, D., Sen, K., Wawrzynek, J., Wessel, D., Yelick, K.: A view of the parallel computing landscape. Commun. ACM 52(10), 56–67 (2009)CrossRef
4.
Zurück zum Zitat Ballard, G., Buluc, A., Demmel, J., Grigori, L., Lipshitz, B., Schwartz, O., Toledo, S.: Communication optimal parallel multiplication of sparse random matrices. In: SPAA, pp. 222–231 (2013) Ballard, G., Buluc, A., Demmel, J., Grigori, L., Lipshitz, B., Schwartz, O., Toledo, S.: Communication optimal parallel multiplication of sparse random matrices. In: SPAA, pp. 222–231 (2013)
5.
Zurück zum Zitat Duggan, J., Cetintemel, U., Papaemmanouil, O., Upfal, E.: Performance prediction for concurrent database workloads. In: SIGMOD, pp. 337–348 (2011) Duggan, J., Cetintemel, U., Papaemmanouil, O., Upfal, E.: Performance prediction for concurrent database workloads. In: SIGMOD, pp. 337–348 (2011)
6.
Zurück zum Zitat Duggan, J., Papaemmanouil, O., Çetintemel, U., Upfal, E.: Contender: a resource modeling approach for concurrent query performance prediction. In: EDBT, pp. 109–120 (2014) Duggan, J., Papaemmanouil, O., Çetintemel, U., Upfal, E.: Contender: a resource modeling approach for concurrent query performance prediction. In: EDBT, pp. 109–120 (2014)
7.
Zurück zum Zitat Elmroth, E., Gustavson, F., Jonsson, I., K$\mathring{\text{a}}$gstr$\ddot{\text{ o }}$m, B.: Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Rev. 46(1), 3–45 (2004) Elmroth, E., Gustavson, F., Jonsson, I., K$\mathring{\text{a}}$gstr$\ddot{\text{ o }}$m, B.: Recursive blocked algorithms and hybrid data structures for dense matrix library software. SIAM Rev. 46(1), 3–45 (2004)
8.
Zurück zum Zitat Ganapathi, A., Kuno, H.A., Dayal, U., Wiener, J.L., Fox, A., Jordan, M.I., Patterson, D.A.: Predicting multiple metrics for queries: Better decisions enabled by machine learning. In: ICDE, pp. 592–603 (2009) Ganapathi, A., Kuno, H.A., Dayal, U., Wiener, J.L., Fox, A., Jordan, M.I., Patterson, D.A.: Predicting multiple metrics for queries: Better decisions enabled by machine learning. In: ICDE, pp. 592–603 (2009)
9.
Zurück zum Zitat Goodrich, M.T., Sitchinava, N., Zhang, Q.: Sorting, searching, and simulation in the mapreduce framework. In: ISAAC, pp. 374–383 (2011) Goodrich, M.T., Sitchinava, N., Zhang, Q.: Sorting, searching, and simulation in the mapreduce framework. In: ISAAC, pp. 374–383 (2011)
10.
Zurück zum Zitat Gounaris, A., Kougka, G., Tous, R., Montes, C.T., Torres, J.: Dynamic configuration of partitioning in spark applications. IEEE Trans. Parallel Distrib. Syst. 28(7), 1891–1904 (2017)CrossRef Gounaris, A., Kougka, G., Tous, R., Montes, C.T., Torres, J.: Dynamic configuration of partitioning in spark applications. IEEE Trans. Parallel Distrib. Syst. 28(7), 1891–1904 (2017)CrossRef
11.
Zurück zum Zitat Gunther, N.J., Puglia, P., Tomasette, K.: Hadoop superlinear scalability. Commun. ACM 58(4), 46–55 (2015)CrossRef Gunther, N.J., Puglia, P., Tomasette, K.: Hadoop superlinear scalability. Commun. ACM 58(4), 46–55 (2015)CrossRef
12.
Zurück zum Zitat Hahn, C., Warren, S., Eastman, R.: Extended edited synoptic cloud reports from ships and land stations over the globe, 1952–2009 (ndp-026c) (2012) Hahn, C., Warren, S., Eastman, R.: Extended edited synoptic cloud reports from ships and land stations over the globe, 1952–2009 (ndp-026c) (2012)
13.
Zurück zum Zitat Herodotou, H., Babu, S.: Profiling, what-if analysis, and cost-based optimization of mapreduce programs. VLDB 4(11), 1111–1122 (2011) Herodotou, H., Babu, S.: Profiling, what-if analysis, and cost-based optimization of mapreduce programs. VLDB 4(11), 1111–1122 (2011)
14.
Zurück zum Zitat Huang, B., Babu, S., Yang, J.: Cumulon: optimizing statistical data analysis in the cloud. In: Proceedings of SIGMOD, pp. 1–12 (2013) Huang, B., Babu, S., Yang, J.: Cumulon: optimizing statistical data analysis in the cloud. In: Proceedings of SIGMOD, pp. 1–12 (2013)
15.
Zurück zum Zitat Irony, D., Toledo, S., Tiskin, A.: Communication lower bounds for distributed-memory matrix multiplication. J. Parallel Distrib. Comput. 64(9), 1017–1026 (2004)CrossRefMATH Irony, D., Toledo, S., Tiskin, A.: Communication lower bounds for distributed-memory matrix multiplication. J. Parallel Distrib. Comput. 64(9), 1017–1026 (2004)CrossRefMATH
16.
Zurück zum Zitat Kaoudi, Z., Quiane-Ruiz, JA., Thirumuruganathan, S., Chawla, S., Agrawal, D.: A cost-based optimizer for gradient descent optimization. In: SIGMOD, pp. 977–992 (2017) Kaoudi, Z., Quiane-Ruiz, JA., Thirumuruganathan, S., Chawla, S., Agrawal, D.: A cost-based optimizer for gradient descent optimization. In: SIGMOD, pp. 977–992 (2017)
17.
Zurück zum Zitat Karloff, H., Suri, S., Vassilvitskii, S.: A model of computation for mapreduce. In: SODA, pp. 938–948 (2010) Karloff, H., Suri, S., Vassilvitskii, S.: A model of computation for mapreduce. In: SODA, pp. 938–948 (2010)
18.
Zurück zum Zitat Li, J., Naughton, J.F., Nehme, R.V.: Resource bricolage and resource selection for parallel database systems. VLDB J. 26(1), 31–54 (2017)CrossRef Li, J., Naughton, J.F., Nehme, R.V.: Resource bricolage and resource selection for parallel database systems. VLDB J. 26(1), 31–54 (2017)CrossRef
19.
Zurück zum Zitat Li, R., Riedewald, M., Deng, X.: Submodularity of distributed join computation. In: (Upcoming) SIGMOD (2018) Li, R., Riedewald, M., Deng, X.: Submodularity of distributed join computation. In: (Upcoming) SIGMOD (2018)
20.
Zurück zum Zitat Lichman, M.: UCI machine learning repository (2013) Lichman, M.: UCI machine learning repository (2013)
21.
Zurück zum Zitat Morton, K., Balazinska, M., Grossman, D.: Paratimer: a progress indicator for mapreduce dags. In: SIGMOD, pp. 507–518 (2010) Morton, K., Balazinska, M., Grossman, D.: Paratimer: a progress indicator for mapreduce dags. In: SIGMOD, pp. 507–518 (2010)
22.
Zurück zum Zitat Munson, A.M., Webb, K., Sheldon, D., Fink, D., Hochachka, W.M., Iliff, M., Riedewald, M., Sorokina, D., Sullivan, B., Wood, C., Kelling, S.: The Ebird Reference Dataset, Version 2014. Cornell Lab of Ornithology and National Audubon Society, Ithaca, NY (2014) Munson, A.M., Webb, K., Sheldon, D., Fink, D., Hochachka, W.M., Iliff, M., Riedewald, M., Sorokina, D., Sullivan, B., Wood, C., Kelling, S.: The Ebird Reference Dataset, Version 2014. Cornell Lab of Ornithology and National Audubon Society, Ithaca, NY (2014)
23.
Zurück zum Zitat Quinlan, J.R., et al.: Learning with continuous classes. Aust. Jt. Conf. Artif. Intell. 92, 343–348 (1992) Quinlan, J.R., et al.: Learning with continuous classes. Aust. Jt. Conf. Artif. Intell. 92, 343–348 (1992)
24.
Zurück zum Zitat Ramakrishnan, R., Gehrke, J.: Database Management Systems, 3rd edn. McGraw-Hill, New York (2003)MATH Ramakrishnan, R., Gehrke, J.: Database Management Systems, 3rd edn. McGraw-Hill, New York (2003)MATH
25.
Zurück zum Zitat Shi, J., Zou, J., Lu, J., Cao, Z., Li, S., Wang, C.: Mrtuner: a toolkit to enable holistic optimization for mapreduce jobs. In: VLDB, pp. 1319–1330 (2014) Shi, J., Zou, J., Lu, J., Cao, Z., Li, S., Wang, C.: Mrtuner: a toolkit to enable holistic optimization for mapreduce jobs. In: VLDB, pp. 1319–1330 (2014)
26.
Zurück zum Zitat Solomonik, E., Demmel, J.: Communication-optimal parallel 2.5d matrix multiplication and LU factorization algorithms. In: Euro-Par 2011 Parallel Processing, pp. 90–109 (2011) Solomonik, E., Demmel, J.: Communication-optimal parallel 2.5d matrix multiplication and LU factorization algorithms. In: Euro-Par 2011 Parallel Processing, pp. 90–109 (2011)
27.
Zurück zum Zitat Valsalam, V., Skjellum, A.: A framework for high-performance matrix multiplication based on hierarchical abstractions, algorithms and optimized low-level kernels. Concurr. Comput. 14(10), 805–839 (2002)CrossRefMATH Valsalam, V., Skjellum, A.: A framework for high-performance matrix multiplication based on hierarchical abstractions, algorithms and optimized low-level kernels. Concurr. Comput. 14(10), 805–839 (2002)CrossRefMATH
28.
Zurück zum Zitat van de Geijn, R.A., Watts, J.: Summa: Scalable Universal Matrix Multiplication Algorithm. University of Texas at Austin, Tech. rep. (1995) van de Geijn, R.A., Watts, J.: Summa: Scalable Universal Matrix Multiplication Algorithm. University of Texas at Austin, Tech. rep. (1995)
29.
Zurück zum Zitat Venkataraman, S., Yang, Z., Franklin, M., Recht, B., Stoica, I.: Ernest: efficient performance prediction for large-scale advanced analytics. In: NSDI, pp. 363–378 (2016) Venkataraman, S., Yang, Z., Franklin, M., Recht, B., Stoica, I.: Ernest: efficient performance prediction for large-scale advanced analytics. In: NSDI, pp. 363–378 (2016)
30.
Zurück zum Zitat Vieth, E.: Fitting piecewise linear regression functions to biological responses. J. Appl. Physiol. 67(1), 390–396 (1989)CrossRef Vieth, E.: Fitting piecewise linear regression functions to biological responses. J. Appl. Physiol. 67(1), 390–396 (1989)CrossRef
31.
Zurück zum Zitat Wang, G., Chan, CY.: Multi-query optimization in mapreduce framework. In: VLDB, pp. 145–156 (2013) Wang, G., Chan, CY.: Multi-query optimization in mapreduce framework. In: VLDB, pp. 145–156 (2013)
32.
Zurück zum Zitat White, B., Lepreau, J., Stoller, L., Ricci, R., Guruprasad, S., Newbold, M., Hibler, M., Barb, C, Joglekar, A.: An integrated experimental environment for distributed systems and networks. In: OSDI, pp. 255–270 (2002) White, B., Lepreau, J., Stoller, L., Ricci, R., Guruprasad, S., Newbold, M., Hibler, M., Barb, C, Joglekar, A.: An integrated experimental environment for distributed systems and networks. In: OSDI, pp. 255–270 (2002)
33.
Zurück zum Zitat Wu, W., Chi, Y., Hacígümüş, H., Naughton, J.F.: Towards predicting query execution time for concurrent and dynamic database workloads. VLDB 6(10), 925–936 (2013) Wu, W., Chi, Y., Hacígümüş, H., Naughton, J.F.: Towards predicting query execution time for concurrent and dynamic database workloads. VLDB 6(10), 925–936 (2013)
34.
Zurück zum Zitat Zhang, X., Chen, L., Wang, M.: Efficient multi-way theta-join processing using mapreduce. In: VLDB, pp. 1184–1195 (2012) Zhang, X., Chen, L., Wang, M.: Efficient multi-way theta-join processing using mapreduce. In: VLDB, pp. 1184–1195 (2012)
Metadaten
Titel
Abstract cost models for distributed data-intensive computations
verfasst von
Rundong Li
Ningfang Mi
Mirek Riedewald
Yizhou Sun
Yi Yao
Publikationsdatum
24.08.2018
Verlag
Springer US
Erschienen in
Distributed and Parallel Databases / Ausgabe 3/2019
Print ISSN: 0926-8782
Elektronische ISSN: 1573-7578
DOI
https://doi.org/10.1007/s10619-018-7244-2

Weitere Artikel der Ausgabe 3/2019

Distributed and Parallel Databases 3/2019 Zur Ausgabe