2020 | OriginalPaper | Book chapter

A Comparative Study of Join Algorithms in Spark

Authors: Anh-Cang Phan, Thuong-Cang Phan, Thanh-Ngoan Trieu

Published in: Future Data and Security Engineering

Publisher: Springer International Publishing

Abstract

In the era of information explosion, the amount of data generated increases day by day and has reached the scale of petabytes or even zettabytes. To extract useful information from a variety of huge data sources, we need efficient computational operations performed in a parallel and distributed manner on a cluster of computers. These operations involve complex and expensive processing. One of the most typical and frequently used operations in queries is the join, which combines two or more datasets into one. Although there are some studies on join operations in Spark, none has offered an adequate and systematic comparison of join algorithms in the Spark environment. This study is therefore dedicated to the join operation in Spark. It describes the important strategies for implementing the join operation in detail and exposes the advantages and disadvantages of each. In addition, the work provides a more thorough comparison of the joins by using a mathematical cost model and experimental verification.
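
As a minimal illustration of what a join does, the sketch below combines two small keyed datasets with an in-memory hash join in plain Python. It is not one of the chapter's algorithms and does not use Spark's API; the function `hash_join` and the sample data `users` and `orders` are hypothetical names chosen only for this example.

```python
from collections import defaultdict

def hash_join(left, right):
    """Equi-join two lists of (key, value) pairs on their keys (illustrative only)."""
    # Build phase: index the right-hand dataset by key.
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    # Probe phase: for each left record, emit one output row per matching right value.
    return [(key, lv, rv) for key, lv in left for rv in index.get(key, [])]

# Hypothetical sample data: (user_id, name) and (user_id, item).
users = [(1, "alice"), (2, "bob"), (3, "carol")]
orders = [(1, "book"), (1, "pen"), (3, "lamp"), (4, "desk")]

joined = hash_join(users, orders)
print(joined)
```

Keys that appear on only one side (user 2 and order key 4 here) produce no output row, which is the behavior of an inner equi-join. In a distributed setting the same build and probe steps would additionally require moving records with equal keys to the same node, which is where the strategies compared in the chapter differ.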

Metadata
Title
A Comparative Study of Join Algorithms in Spark
Authors
Anh-Cang Phan
Thuong-Cang Phan
Thanh-Ngoan Trieu
Copyright year
2020
DOI
https://doi.org/10.1007/978-3-030-63924-2_11