Published in: Cluster Computing 4/2017

19.09.2017

An experimental analysis of limitations of MapReduce for iterative algorithms on Spark

Authors: Minseo Kang, Jae-Gil Lee

Abstract

MapReduce is the most popular framework for distributed processing, and it has recently improved the scalability of data mining and machine learning algorithms significantly. However, MapReduce does not handle iterative algorithms efficiently, and many data mining and machine learning algorithms are iterative by nature. To overcome this limitation, many advanced distributed systems have been developed, including HaLoop, iMapReduce, Twister, and Spark. In this paper, we identify and categorize the limitations of MapReduce in handling iterative algorithms and then experimentally investigate their consequences using Spark, the most flexible and stable of these systems. According to our experimental results, the network I/O overhead was the factor that affected system performance the most. The disk I/O overhead also affected system performance, but less significantly than the network I/O overhead. The synchronization overhead affected system performance only when the static data was not cached.
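
The role of caching static data in an iterative job can be illustrated with a toy, single-process sketch. It is not Spark code and does not reproduce the paper's experiments; the names `Storage`, `kmeans_step`, and `run` are hypothetical, and one-dimensional k-means is assumed only as a simple example of an iterative algorithm. The sketch counts how often the static input is reloaded when every iteration behaves like a separate MapReduce job, compared with loading it once and keeping it in memory.

```python
from collections import defaultdict


class Storage:
    """Hypothetical stand-in for a distributed file system.

    It only counts how many times the static data set is scanned in full.
    """

    def __init__(self, points):
        self._points = points
        self.reads = 0

    def load(self):
        self.reads += 1
        return list(self._points)


def kmeans_step(points, centroids):
    """One iteration: a map phase (assign) followed by a reduce phase (average)."""
    groups = defaultdict(list)
    for p in points:
        nearest = min(range(len(centroids)), key=lambda i: abs(p - centroids[i]))
        groups[nearest].append(p)
    # Keep the old centroid if no point was assigned to it.
    return [
        sum(groups[i]) / len(groups[i]) if groups[i] else centroids[i]
        for i in range(len(centroids))
    ]


def run(storage, centroids, iterations, cache_static):
    """Run the loop, reloading the static data per iteration unless it is cached."""
    cached = storage.load() if cache_static else None
    for _ in range(iterations):
        points = cached if cache_static else storage.load()
        centroids = kmeans_step(points, centroids)
    return centroids


points = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
plain = Storage(points)    # every iteration rereads the input
cached = Storage(points)   # input is read once and reused

result_plain = run(plain, [0.0, 5.0], 5, cache_static=False)
result_cached = run(cached, [0.0, 5.0], 5, cache_static=True)

print(result_plain, plain.reads)
print(result_cached, cached.reads)
```

Both runs compute the same centroids; only the number of scans over the static data differs, and it grows with the iteration count in the uncached case.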


Metadata
Title
An experimental analysis of limitations of MapReduce for iterative algorithms on Spark
Authors
Minseo Kang
Jae-Gil Lee
Publication date
19.09.2017
Publisher
Springer US
Published in
Cluster Computing / Issue 4/2017
Print ISSN: 1386-7857
Electronic ISSN: 1573-7543
DOI
https://doi.org/10.1007/s10586-017-1167-y
