Published in: Cluster Computing 3/2014

01-09-2014

SmartJoin: a network-aware multiway join for MapReduce

Abstract

MapReduce is an effective tool for processing large amounts of data in parallel on a cluster of processors or computers. One common data processing task is the join operation, which combines two or more datasets based on values common to each. In this paper, we present SmartJoin, a network-aware multiway join for MapReduce that improves performance and considers network traffic when redistributing workload among reducers. SmartJoin achieves this by dynamically redistributing tuples directly between reducers using an intelligent network-aware algorithm. We show that the presented technique has significant potential to minimize the time required to join multiple datasets. In our evaluation, SmartJoin achieves up to a 39% improvement over the non-redistribution method, a 26.8% improvement over random redistribution, and a 27.6% improvement over worst join redistribution.
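
To make the join operation described in the abstract concrete, the following is a minimal single-process sketch of a reduce-side multiway equi-join in the MapReduce style. It is an illustration only and not SmartJoin's algorithm, which the abstract does not specify. All function names (`map_phase`, `shuffle`, `reduce_join`, `multiway_join`, `pick_target`) and the cost model in `pick_target` (reducer load plus a pairwise transfer cost) are assumptions introduced here to show what a network-aware choice of redistribution target might look like.

```python
from collections import defaultdict
from itertools import product


def map_phase(datasets):
    """Emit (join_key, (dataset_index, value)) pairs, as a mapper would."""
    for idx, rows in enumerate(datasets):
        for key, value in rows:
            yield key, (idx, value)


def shuffle(pairs, num_reducers):
    """Group pairs by key and assign each key to a reducer by hashing."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for key, tagged in pairs:
        partitions[hash(key) % num_reducers][key].append(tagged)
    return partitions


def reduce_join(groups, num_datasets):
    """Inner join: one output row per combination of tuples sharing a key."""
    out = []
    for key, tagged in groups.items():
        buckets = [[] for _ in range(num_datasets)]
        for idx, value in tagged:
            buckets[idx].append(value)
        # product() yields nothing if any dataset has no tuple for this key.
        for combo in product(*buckets):
            out.append((key, *combo))
    return out


def multiway_join(datasets, num_reducers=2):
    """Run map, shuffle and reduce in sequence and collect all joined rows."""
    partitions = shuffle(map_phase(datasets), num_reducers)
    rows = []
    for groups in partitions:
        rows.extend(reduce_join(groups, len(datasets)))
    return sorted(rows)


def pick_target(source, loads, transfer_cost):
    """Hypothetical cost model (an assumption, not from the paper):
    pick the other reducer with the lowest load plus transfer cost."""
    candidates = [r for r in range(len(loads)) if r != source]
    return min(candidates, key=lambda r: loads[r] + transfer_cost[source][r])


if __name__ == "__main__":
    users = [(1, "alice"), (2, "bob")]
    orders = [(1, "book"), (1, "pen"), (3, "lamp")]
    cities = [(1, "Oslo"), (2, "Rome")]
    print(multiway_join([users, orders, cities]))
    print(pick_target(0, [10, 2, 3], [[0, 5, 1], [5, 0, 1], [1, 1, 0]]))
```

In this toy run only key 1 appears in all three datasets, so the join yields two rows, one per matching order. In the `pick_target` call, reducer 1 has the lightest load but is expensive to reach from reducer 0, so the sketch selects reducer 2. A selection that ignores the network (for example, random redistribution) would not make that trade-off.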

Metadata
Title
SmartJoin: a network-aware multiway join for MapReduce
Publication date
01-09-2014
Published in
Cluster Computing / Issue 3/2014
Print ISSN: 1386-7857
Electronic ISSN: 1573-7543
DOI
https://doi.org/10.1007/s10586-014-0348-1
