2020 | OriginalPaper | Chapter

A Comparative Study of Join Algorithms in Spark

Authors: Anh-Cang Phan, Thuong-Cang Phan, Thanh-Ngoan Trieu

Published in: Future Data and Security Engineering

Publisher: Springer International Publishing

Abstract

In the era of information explosion, the amount of data generated is increasing day by day, reaching the threshold of petabytes or even zettabytes. To extract useful information from a variety of huge data sources, we need efficient computational operations performed in a parallel and distributed manner on a cluster of computers. These operations involve a lot of complex and expensive processing. One of the typical and frequently used operations in queries is the join, which combines two or more datasets into one. Although there are some studies on join operations in Spark, none has provided an adequate and systematic comparison of join algorithms in the Spark environment. Therefore, this study is dedicated to the join operation in Spark. It describes the important strategies for implementing the join operation in detail and exposes the advantages and disadvantages of each one. In addition, the work provides a more thorough comparison of the joins by using a mathematical cost model and experimental verification.

Metadata
Title
A Comparative Study of Join Algorithms in Spark
Authors
Anh-Cang Phan
Thuong-Cang Phan
Thanh-Ngoan Trieu
Copyright Year
2020
DOI
https://doi.org/10.1007/978-3-030-63924-2_11