2020 | OriginalPaper | Chapter

A Comparative Study of Join Algorithms in Spark

Authors: Anh-Cang Phan, Thuong-Cang Phan, Thanh-Ngoan Trieu

Published in: Future Data and Security Engineering

Publisher: Springer International Publishing


Abstract

In the era of information explosion, the amount of data generated grows day by day, reaching the scale of petabytes or even zettabytes. To extract useful information from a variety of huge data sources, we need efficient computational operations performed in a parallel and distributed manner on a cluster of computers. These operations involve a great deal of complex and expensive processing. One of the most typical and frequently used operations in queries is the join, which combines two or more datasets into one. Although there are some studies on join operations in Spark, none provides an adequate and systematic comparison of join algorithms in the Spark environment. This study is therefore dedicated to the join operation in Spark. It describes the important strategies for implementing the join operation in detail and exposes the advantages and disadvantages of each one. In addition, the work provides a more thorough comparison of the joins by using a mathematical cost model and experimental verification.



Title: A Comparative Study of Join Algorithms in Spark
Authors: Anh-Cang Phan, Thuong-Cang Phan, Thanh-Ngoan Trieu
Publisher: Springer International Publishing
Book: Future Data and Security Engineering
Print ISBN: 978-3-030-63923-5

Electronic ISBN: 978-3-030-63924-2

Copyright Year: 2020
DOI: https://doi.org/10.1007/978-3-030-63924-2_11
