Skip to main content
Log in

SmartJoin: a network-aware multiway join for MapReduce

  • Published:
Cluster Computing Aims and scope Submit manuscript

Abstract

MapReduce is an effective tool for processing large amounts of data in parallel using a cluster of processors or computers. One common data processing task is the join operation, which combines two or more datasets based on values common to each. In this paper, we present a network aware multi-way join for MapReduce (SmartJoin) that improves performance and considers network traffic when redistributing workload amongst reducers. SmartJoin achieves this by dynamically redistributing tuples directly between reducers with an intelligent network aware algorithm. We show that our presented technique has significant potential to minimize the time required to join multiple datasets. In our evaluation, we show that SmartJoin has up to 39 % improvement compared to the non-redistribution method, a 26.8 % improvement over random redistribution and 27.6 % improvement over worst join redistribution.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8

Similar content being viewed by others

References

  1. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51(1), 107–113 (2008)

    Article  Google Scholar 

  2. Hoefler, T., Lumsdaine, A., Dongarra, J.: Towards efficient MapReduce using MPI. Lecture Notes Comput. Sci. 5759, 240–249 (2009)

    Article  Google Scholar 

  3. White, T.: Hadoop the Definitive Guide, 2nd edn. O’Reilly, Sebastopol (2010)

    Google Scholar 

  4. Xhafa, F.: Processing and analysing large log data files of a virtual campus. J. Converg. 3(3), 1–8 (2012)

    Google Scholar 

  5. Augusto, J., Callaghan, V., Cook, D., Kameas, A., Satoh, I., Saba, T., Chorianopoulos, K., Howard, N., Cambria, E., Gupta, V.: Intelligent environments: a manifesto. Human-centric Comput. Inf. Sci. 3(12), 1–18 (2013)

    Google Scholar 

  6. Ihm, H.: Mining consumer attitude and behavior, an exploratory study on movie audience attitude extracted from twitter. J. Converg. 4(2), 29–35 (2013)

    Google Scholar 

  7. Afrati, F.N., Ullman, J.D.: Optimizing multiway joins in a map-reduce environment. IEEE Knowl. Data Eng. 23(9), 1282–1298 (2011)

    Article  Google Scholar 

  8. Lynden, S., Tanimura, Y., Kojima, I., Matono, A.: Dynamic data redistribution for MapReduce joins. In: IEEE Third International Conference on Cloud Computing Technology and Science (CloudCom), pp. 717–723 (2011).

  9. Chandar, J.: Join Algorithms using Map/Reduce. Master of science. Thesis. School of informatics, University of Edinburgh (2010).

  10. Palla, K.: A comparative analysis of join algorithms using the hadoop map/reduce framework. Master of science. Thesis. School of informatics, University of Edinburgh (2009).

  11. Blanas, S., Li, Y., Patel, J.M.: Design and evaluation of main memory hash join algorithms for multi-core cpus. In: Proceedings of the ACM SIGMOD (2011).

  12. Atta, F., Viglas, S.D., Niazi, S.: SAND Join–A skew handling join algorithm for Google’s MapReduce framework. In: Multitopic Conference (INMIC), 2011 IEEE 14th, International, pp. 170–175. (2011).

  13. Lee, T., Kim, K., Kim, H.J.: Join processing using Bloom filter in MapReduce. In: Proceedings of the 2012 ACM Research in Applied Computation Symposium 2012, pp. 100–105. (2012).

  14. Wang, X., Burns, R., Terzis, A., Deshpande, A.: Network-aware join processing in global-scale database federations. In: IEEE 24th International Conference on Data Engineering, 2008. ICDE 2008, pp. 586–595. (2008).

  15. Hunt, P., Konar, M., Junqueira, F.P., Reed, B.: ZooKeeper: Wait-free coordination for Internet-scale systems. In: USENIX ATC (2010).

  16. Palanisamy, B., Singh, A., Liu, L., Jain, B.: Purlieus: locality-aware resource allocation for MapReduce in a cloud. In: Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis 2011, pp. 58. (2011).

  17. Hindman, B., Konwinski, A., Zaharia, M., Ghodsi, A., Joseph, A.D., Katz, R., Shenker, S., Stoica, I.: Mesos: A platform for fine-grained resource sharing in the data center. In: Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation 2011, pp. 22–22. USENIX Association (2011).

  18. Ghodsi, A., Zaharia, M., Hindman, B., Konwinski, A., Shenker, S., Stoica, I.: Dominant resource fairness: fair allocation of multiple resource types. In: USENIX NSDI (2011).

  19. Lee, G., Tolia, N., Ranganathan, P., Katz, R.H.: Topology-aware resource allocation for data-intensive workloads. In: Proceedings of the First ACM Asia-Pacific Workshop on Workshop on Systems 2010, pp. 1–6. ACM (2010).

  20. Wang, X., Shen, D., Nie, T., Kou, Y., Yu, G.: The equi-join processing and optimization on ring architecture key/value database. In: Web Technologies and Applications, pp. 243–254 (2012).

  21. Jiang, D., Tung, A.K.H., Chen, G.: Map-join-reduce: Toward scalable and efficient data analysis on large clusters. Knowl. Data Eng. IEEE Trans. 23(9), 1299–1311 (2011)

    Article  Google Scholar 

  22. Lin, Y., Agrawal, D., Chen, C., Ooi, B.C., Wu, S.: Llama: leveraging columnar storage for scalable join processing in the mapreduce framework. In: Proceedings of the 2011 International Conference on Management of Data, pp. 961–972. ACM (2011).

  23. Myung, J., Lee, S.: Matrix chain multiplication via multi-way join algorithms in MapReduce. In: Proceedings of the 6th International Conference on Ubiquitous Information Management and Communication 2012, pp. 53. ACM (2012).

  24. Seidl, T., Fries, S., Boden, B.: MR-DSJ: distance-based self-join for large-scale vector data analysis with MapReduce. In: 15th BTW Conference on Database Systems for Business, Technology, and Web, Magdeburg, pp. 37–56 (2013).

  25. Afrati, F.N., Sarma, A.D., Menestrina, D., Parameswaran, A., Ullman, J.D.: Fuzzy joins using MapReduce. In: 2012 IEEE 28th International Conference on Data Engineering (ICDE), pp. 498–509. IEEE (2012).

  26. Metwally, A., Faloutsos, C.: V-smart-join: a scalable mapreduce framework for all-pair similarity joins of multisets and vectors. Proc. VLDB Endow. 5(8), 704–715 (2012)

    Article  Google Scholar 

  27. Gowraj, N., Ravi, P.V., Mouniga, V., Sumalatha, M.: S2MART: smart sql to Map-Reduce translators. In: Web Technologies and Applications, pp. 571–582. Springer (2013).

  28. Lee, R., Luo, T., Huai, Y., Wang, F., He, Y., Zhang, X.: Ysmart: Yet another sql-to-mapreduce translator. In: 31st International Conference on Distributed Computing Systems (ICDCS), pp. 25–36. IEEE (2011).

  29. Xu, Y., Hu, S.: QMapper: a tool for SQL optimization on hive using query rewriting. In: Proceedings of the 22nd International Conference on World Wide Web Companion, pp. 211–212. (2013).

  30. Lu, J., Guting, R.H.: Parallel secondo: boosting database engines with hadoop. In: IEEE 18th International Conference on Parallel and Distributed Systems (ICPADS), pp. 738–743. IEEE (2012).

  31. Chung, W.-C., Lin, H.-P., Chen, S.-C., Jiang, M.-F., Chung, Y.-C.: JackHare: a framework for SQL to NoSQL translation using MapReduce. Autom. Softw. Eng. 1–20 (2013).

  32. Stonebraker, M., Abadi, D., DeWitt, D.J., Madden, S., Paulson, E., Pavlo, A., Rasin, A.: MapReduce and parallel DBMSs: friends or foes? Commun. ACM 53(1), 64–71 (2010)

    Article  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Ching-Hsien Hsu.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Slagter, K., Hsu, CH., Chung, YC. et al. SmartJoin: a network-aware multiway join for MapReduce. Cluster Comput 17, 629–641 (2014). https://doi.org/10.1007/s10586-014-0348-1

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • Issue Date:

  • DOI: https://doi.org/10.1007/s10586-014-0348-1

Keywords

Navigation