2018 | Original Paper | Book Chapter

Characterizing BigBench Queries, Hive, and Spark in Multi-cloud Environments

Written by: Nicolas Poggi, Alejandro Montero, David Carrera

Published in: Performance Evaluation and Benchmarking for the Analytics Era

Publisher: Springer International Publishing

Abstract

BigBench is the new standard (TPCx-BB) for benchmarking and testing Big Data systems. The TPCx-BB specification describes several business use cases (queries) that require a broad combination of data extraction techniques, including SQL, Map/Reduce (M/R), user code (UDF), and Machine Learning. However, there is currently no widespread knowledge of the different resource requirements and expected performance of each query, as there is for more established benchmarks. Moreover, over the last year, the Spark framework and APIs have been evolving very rapidly, with major improvements in performance and the stable release of v2. Our intent is to compare the current state of Spark to Hive's base implementation, which can use either the legacy M/R engine and Mahout or the current Tez and MLlib frameworks. At the same time, cloud providers currently offer convenient on-demand managed big data clusters (PaaS) with a pay-as-you-go model. In PaaS, analytical engines such as Hive and Spark come ready to use, with a general-purpose configuration and upgrade management. The study characterizes both the BigBench queries and the out-of-the-box performance of Spark and Hive versions in the cloud. It also compares popular PaaS offerings (Azure HDInsight, Amazon Web Services EMR, and Google Cloud Dataproc) in terms of reliability, data scalability (1 GB to 10 TB), versions, and settings. The query characterization highlights the similarities and differences between the Hive and Spark frameworks, and which queries are the most resource-consuming in terms of CPU, memory, and I/O. Scalability results show that configuration tuning is needed in most cloud providers as data scale grows, especially for Spark's memory usage. These results can help practitioners quickly test systems by picking a subset of the queries that stresses each of the categories. The results also show how Hive and Spark compare and what performance can be expected of each in PaaS.

Metadata
Title
Characterizing BigBench Queries, Hive, and Spark in Multi-cloud Environments
Written by
Nicolas Poggi
Alejandro Montero
David Carrera
Copyright Year
2018
DOI
https://doi.org/10.1007/978-3-319-72401-0_5
