2016 | OriginalPaper | Book chapter

How Data Volume Affects Spark Based Data Analytics on a Scale-up Server

Authors: Ahsan Javed Awan, Mats Brorsson, Vladimir Vlassov, Eduard Ayguade

Published in: Big Data Benchmarks, Performance Optimization, and Emerging Hardware

Publisher: Springer International Publishing

Abstract

The sheer increase in data volume over the last decade has triggered research into cluster computing frameworks that enable web enterprises to extract big insights from big data. While Apache Spark is gaining popularity for its superior scale-out performance on commodity machines, the impact of data volume on the performance of Spark-based data analytics in a scale-up configuration is not well understood. We present a deep-dive analysis of Spark-based applications on a large scale-up server machine. Our analysis reveals that Spark-based data analytics are DRAM bound and do not benefit from using more than 12 cores for an executor. As input data size grows, application performance degrades significantly because of a substantial increase in wait time during I/O operations and garbage collection, despite a 10% better instruction retirement rate (due to fewer L1 cache misses and higher core utilization). We match memory behaviour with the garbage collector to improve application performance by 1.6x to 3x.
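
As a rough illustration of the two tuning knobs the abstract mentions (cores per executor and the choice of garbage collector), the sketch below assembles a `spark-submit` command line. The helper name `build_spark_submit_args`, the jar name, and the default collector flag are assumptions made for illustration; the abstract does not state which collector suits which workload, nor how the authors launched their runs. Only the configuration keys `spark.executor.cores` and `spark.executor.extraJavaOptions` and the JVM flag `-XX:+UseParallelGC` are existing Spark and HotSpot options.

```python
from typing import List


def build_spark_submit_args(
    app_jar: str,
    executor_cores: int = 12,
    gc_flag: str = "-XX:+UseParallelGC",
) -> List[str]:
    """Return a spark-submit argument list (hypothetical helper).

    executor_cores defaults to 12 because the abstract reports no benefit
    beyond 12 cores per executor. gc_flag is an assumed placeholder: the
    abstract only says the collector is matched to memory behaviour.
    """
    if executor_cores < 1:
        raise ValueError("executor_cores must be at least 1")
    return [
        "spark-submit",
        "--conf", f"spark.executor.cores={executor_cores}",
        "--conf", f"spark.executor.extraJavaOptions={gc_flag}",
        app_jar,
    ]


if __name__ == "__main__":
    # "analytics-app.jar" is a made-up application name.
    print(" ".join(build_spark_submit_args("analytics-app.jar")))
```

Swapping `gc_flag` for another collector option (for example `-XX:+UseG1GC`) is how one might compare collectors for a given workload under this sketch.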

Metadata
Title
How Data Volume Affects Spark Based Data Analytics on a Scale-up Server
Authors
Ahsan Javed Awan
Mats Brorsson
Vladimir Vlassov
Eduard Ayguade
Copyright year
2016
DOI
https://doi.org/10.1007/978-3-319-29006-5_7