Abstract
Companies providing cloud-scale data services have increasing needs to store and analyze massive data sets (e.g., search logs, click streams, and web graph data). For cost and performance reasons, processing is typically done on large clusters of thousands of commodity machines by using high level scripting languages. In the recent past, there has been significant progress in adapting well-known techniques from traditional relational DBMSs to this new scenario. However, important challenges remain open. In this paper we study the very common join operation, discuss some unique challenges in the large-scale distributed scenario, and explain how to efficiently and robustly process joins in a distributed way. Specifically, we introduce novel execution strategies that leverage opportunities not available in centralized scenarios, and others that robustly handle data skew. We report experimental validations of our approaches on Scope production clusters, which power the Applications and Services Group at Microsoft.
- F. Afrati and J. Ullman. Optimizing multiway joins in a map-reduce environment. IEEE Transactions on Knowledge and Data Engineering, 23(9):1282--1298, 2011. Google ScholarDigital Library
- Apache. Hadoop. http://hadoop.apache.org/.Google Scholar
- S. Blanas, J. M. Patel, V. Ercegovac, J. Rao, E. J. Shekita, and Y. Tian. A comparison of join algorithms for log processing in MapReduce. In Proc. of the SIGMOD Conf., pages 975--986, 2010. Google ScholarDigital Library
- J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of OSDI Conference, pages 10--10, 2004. Google ScholarDigital Library
- D. DeWitt, J. Naughton, D. Schneider, and S. S. Seshadri. Practical skew handling in parallel joins. In Proc. of the 18th VLDB Conf., pages 27--40, 1992. Google ScholarDigital Library
- A. F. Gates, O. Natkovich, S. Chopra, P. Kamath, S. M. Narayanamurthy, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava. Building a high-level dataflow system on top of map-reduce: the pig experience. Proc. of the VLDB Endowment, 2: 1414--1425, August 2009. Google ScholarDigital Library
- G. Graefe. Query evaluation techniques for large databases. ACM Computing Surveys, 25(2):73--169, 1993. Google ScholarDigital Library
- He Yongqiang. handle skewed keys for a join in a separate job. https://issues.apache.org/jira/browse/HIVE-964.Google Scholar
- K. Hua and C. Lee. Handling data skew in multiprocessor database computers using partition tuning. In Proc. of the 17th VLDB Conf., pages 525--535, 1991. Google ScholarDigital Library
- K. Hua, C. Lee, and C. Hua. Dynamic load balancing in multicomputer database systems using partition tuning. IEEE TKDE, 7(6):968--983, 1995. Google ScholarDigital Library
- M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In Proc. of EuroSys Conference, pages 59--72, 2007. Google ScholarDigital Library
- M. Kitsuregawa and Y. Ogawa. Bucket spreading parallel hash: a new, robust, parallel hash join method for data skew in the super database computer (sdc). In Proc. of the 16th VLDB Conf., pages 210--221, 1990. Google ScholarDigital Library
- W. Li, D. Gao, and R. Snodgrass. Skew handling techniques in sort-merge join. In Proc. of the SIGMOD Conf., pages 169--180, 2002. Google ScholarDigital Library
- Namit Jain. Skewed Join Optimization. https://issues.apache.org/jira/browse/HIVE-3086.Google Scholar
- C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: A not-so-foreign language for data processing. In Proceedings of SIGMOD Conference, pages 1099--1110, 2008. Google ScholarDigital Library
- S. Schelter, C. Boden, M. Schenck, A. Alexandrov, and V. Markl. Distributed matrix factorization with mapreduce using a series of broadcast-joins. In Proc. of the 7th ACM conf. on Recommender Systems, pages 281--284, 2013. Google ScholarDigital Library
- A. Shatdal and J. F. Naughton. Using shared virtual memory for parallel join processing. In Proc. of the SIGMOD Conf., pages 119--128, 1993. Google ScholarDigital Library
- Sriranjan Manjunath. support for skewed outer join. https://issues.apache.org/jira/browse/PIG-1035.Google Scholar
- A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy. Hive -- a petabyte scale data warehouse using Hadoop. In Proceedings of ICDE Conference, pages 996--1005, 2010.Google ScholarCross Ref
- J. L. Wolf, D. M. Dias, and P. S. Yu. An effective algorithm for parallelizing sort merge joins in the presence of data skew. In Proceedings of the second international symposium on databases in parallel and distributed systems, DPDS '90, pages 103--115, 1990. Google ScholarDigital Library
- J. L. Wolf, D. M. Dias, P. S. Yu, and J. Turek. An effective algorithm for parallelizing hash joins in the presence of data skew. In Proc. of the 7th ICDE Conf., pages 200--209, 1991. Google ScholarDigital Library
- Y. Xu and P. Kostamaa. Efficient outer join data skew handling in parallel dbms. Proc. of the VLDB Endowment, 2(2):1390--1396, 2009. Google ScholarDigital Library
- Y. Xu, P. Kostamaa, X. Zhou, and L. Chen. Handling data skew in parallel joins in shared-nothing systems. In Proc. of the SIGMOD Conf., pages 1043--1052, 2008. Google ScholarDigital Library
- J. Zhou, N. Bruno, M.-C. Wu, P.-Å. Larson, R. Chaiken, and D. Shakib. SCOPE: Parallel databases meet mapreduce. The VLDB Journal, 21(5):611--636, 2012. Google ScholarDigital Library
- J. Zhou, P.-Å. Larson, and R. Chaiken. Incorporating partitioning and parallel plans into the SCOPE optimizer. In Proceedings of ICDE Conference, pages 1060--1071, 2010.Google ScholarCross Ref
Index Terms
- Advanced join strategies for large-scale distributed computation
Recommendations
Efficient large-scale distance-based join queries in spatialhadoop
Efficient processing of Distance-Based Join Queries (DBJQs) in spatial databases is of paramount importance in many application domains. The most representative and known DBJQs are the K Closest Pairs Query (KCPQ) and the ź Distance Join Query (źDJQ). ...
Lightweight Distributed Execution Engine for Large-Scale Spatial Join Query Processing
BIGDATACONGRESS '15: Proceedings of the 2015 IEEE International Congress on Big DataExisting Big Data systems are mostly designed for relational data. They are either incapable or inefficient in processing large-scale semi-structured data efficiently due to the inherent limitations on data abstraction, indexing support and exposure to ...
Scalable Distributed Stream Join Processing
SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of DataEfficient and scalable stream joins play an important role in performing real-time analytics for many cloud applications. However, like in conventional database processing, online theta-joins over data streams are computationally expensive and moreover, ...
Comments