skip to main content
research-article

Advanced join strategies for large-scale distributed computation

Published:01 August 2014Publication History
Skip Abstract Section

Abstract

Companies providing cloud-scale data services have increasing needs to store and analyze massive data sets (e.g., search logs, click streams, and web graph data). For cost and performance reasons, processing is typically done on large clusters of thousands of commodity machines by using high level scripting languages. In the recent past, there has been significant progress in adapting well-known techniques from traditional relational DBMSs to this new scenario. However, important challenges remain open. In this paper we study the very common join operation, discuss some unique challenges in the large-scale distributed scenario, and explain how to efficiently and robustly process joins in a distributed way. Specifically, we introduce novel execution strategies that leverage opportunities not available in centralized scenarios, and others that robustly handle data skew. We report experimental validations of our approaches on Scope production clusters, which power the Applications and Services Group at Microsoft.

References

  1. F. Afrati and J. Ullman. Optimizing multiway joins in a map-reduce environment. IEEE Transactions on Knowledge and Data Engineering, 23(9):1282--1298, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Apache. Hadoop. http://hadoop.apache.org/.Google ScholarGoogle Scholar
  3. S. Blanas, J. M. Patel, V. Ercegovac, J. Rao, E. J. Shekita, and Y. Tian. A comparison of join algorithms for log processing in MapReduce. In Proc. of the SIGMOD Conf., pages 975--986, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of OSDI Conference, pages 10--10, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. D. DeWitt, J. Naughton, D. Schneider, and S. S. Seshadri. Practical skew handling in parallel joins. In Proc. of the 18th VLDB Conf., pages 27--40, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. A. F. Gates, O. Natkovich, S. Chopra, P. Kamath, S. M. Narayanamurthy, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava. Building a high-level dataflow system on top of map-reduce: the pig experience. Proc. of the VLDB Endowment, 2: 1414--1425, August 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. G. Graefe. Query evaluation techniques for large databases. ACM Computing Surveys, 25(2):73--169, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. He Yongqiang. handle skewed keys for a join in a separate job. https://issues.apache.org/jira/browse/HIVE-964.Google ScholarGoogle Scholar
  9. K. Hua and C. Lee. Handling data skew in multiprocessor database computers using partition tuning. In Proc. of the 17th VLDB Conf., pages 525--535, 1991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. K. Hua, C. Lee, and C. Hua. Dynamic load balancing in multicomputer database systems using partition tuning. IEEE TKDE, 7(6):968--983, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In Proc. of EuroSys Conference, pages 59--72, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. M. Kitsuregawa and Y. Ogawa. Bucket spreading parallel hash: a new, robust, parallel hash join method for data skew in the super database computer (sdc). In Proc. of the 16th VLDB Conf., pages 210--221, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. W. Li, D. Gao, and R. Snodgrass. Skew handling techniques in sort-merge join. In Proc. of the SIGMOD Conf., pages 169--180, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Namit Jain. Skewed Join Optimization. https://issues.apache.org/jira/browse/HIVE-3086.Google ScholarGoogle Scholar
  15. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: A not-so-foreign language for data processing. In Proceedings of SIGMOD Conference, pages 1099--1110, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. S. Schelter, C. Boden, M. Schenck, A. Alexandrov, and V. Markl. Distributed matrix factorization with mapreduce using a series of broadcast-joins. In Proc. of the 7th ACM conf. on Recommender Systems, pages 281--284, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. A. Shatdal and J. F. Naughton. Using shared virtual memory for parallel join processing. In Proc. of the SIGMOD Conf., pages 119--128, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Sriranjan Manjunath. support for skewed outer join. https://issues.apache.org/jira/browse/PIG-1035.Google ScholarGoogle Scholar
  19. A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy. Hive -- a petabyte scale data warehouse using Hadoop. In Proceedings of ICDE Conference, pages 996--1005, 2010.Google ScholarGoogle ScholarCross RefCross Ref
  20. J. L. Wolf, D. M. Dias, and P. S. Yu. An effective algorithm for parallelizing sort merge joins in the presence of data skew. In Proceedings of the second international symposium on databases in parallel and distributed systems, DPDS '90, pages 103--115, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. J. L. Wolf, D. M. Dias, P. S. Yu, and J. Turek. An effective algorithm for parallelizing hash joins in the presence of data skew. In Proc. of the 7th ICDE Conf., pages 200--209, 1991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Y. Xu and P. Kostamaa. Efficient outer join data skew handling in parallel dbms. Proc. of the VLDB Endowment, 2(2):1390--1396, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Y. Xu, P. Kostamaa, X. Zhou, and L. Chen. Handling data skew in parallel joins in shared-nothing systems. In Proc. of the SIGMOD Conf., pages 1043--1052, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. J. Zhou, N. Bruno, M.-C. Wu, P.-Å. Larson, R. Chaiken, and D. Shakib. SCOPE: Parallel databases meet mapreduce. The VLDB Journal, 21(5):611--636, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. J. Zhou, P.-Å. Larson, and R. Chaiken. Incorporating partitioning and parallel plans into the SCOPE optimizer. In Proceedings of ICDE Conference, pages 1060--1071, 2010.Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. Advanced join strategies for large-scale distributed computation
        Index terms have been assigned to the content through auto-classification.

        Recommendations

        Comments

        Login options

        Check if you have access through your login credentials or your institution to get full access on this article.

        Sign in

        Full Access

        • Published in

          cover image Proceedings of the VLDB Endowment
          Proceedings of the VLDB Endowment  Volume 7, Issue 13
          August 2014
          466 pages
          ISSN:2150-8097
          Issue’s Table of Contents

          Publisher

          VLDB Endowment

          Publication History

          • Published: 1 August 2014
          Published in pvldb Volume 7, Issue 13

          Qualifiers

          • research-article

        PDF Format

        View or Download as a PDF file.

        PDF

        eReader

        View online with eReader.

        eReader