research-article

Advanced join strategies for large-scale distributed computation

Authors:
Nicolas Bruno

Microsoft Corp.

Microsoft Corp.
View Profile

,
YongChul Kwon

Microsoft Corp.

Microsoft Corp.
View Profile

,
Ming-Chuan Wu

Microsoft Corp.

Microsoft Corp.
View Profile

Proceedings of the VLDB Endowment Volume 7 Issue 13pp 1484–1495https://doi.org/10.14778/2733004.2733020

Published:01 August 2014Publication History

Proceedings of the VLDB Endowment

Abstract

Companies providing cloud-scale data services have increasing needs to store and analyze massive data sets (e.g., search logs, click streams, and web graph data). For cost and performance reasons, processing is typically done on large clusters of thousands of commodity machines by using high level scripting languages. In the recent past, there has been significant progress in adapting well-known techniques from traditional relational DBMSs to this new scenario. However, important challenges remain open. In this paper we study the very common join operation, discuss some unique challenges in the large-scale distributed scenario, and explain how to efficiently and robustly process joins in a distributed way. Specifically, we introduce novel execution strategies that leverage opportunities not available in centralized scenarios, and others that robustly handle data skew. We report experimental validations of our approaches on Scope production clusters, which power the Applications and Services Group at Microsoft.

References

F. Afrati and J. Ullman. Optimizing multiway joins in a map-reduce environment. IEEE Transactions on Knowledge and Data Engineering, 23(9):1282--1298, 2011. Google ScholarDigital Library
Apache. Hadoop. http://hadoop.apache.org/.Google Scholar
S. Blanas, J. M. Patel, V. Ercegovac, J. Rao, E. J. Shekita, and Y. Tian. A comparison of join algorithms for log processing in MapReduce. In Proc. of the SIGMOD Conf., pages 975--986, 2010. Google ScholarDigital Library
J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of OSDI Conference, pages 10--10, 2004. Google ScholarDigital Library
D. DeWitt, J. Naughton, D. Schneider, and S. S. Seshadri. Practical skew handling in parallel joins. In Proc. of the 18th VLDB Conf., pages 27--40, 1992. Google ScholarDigital Library
A. F. Gates, O. Natkovich, S. Chopra, P. Kamath, S. M. Narayanamurthy, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava. Building a high-level dataflow system on top of map-reduce: the pig experience. Proc. of the VLDB Endowment, 2: 1414--1425, August 2009. Google ScholarDigital Library
G. Graefe. Query evaluation techniques for large databases. ACM Computing Surveys, 25(2):73--169, 1993. Google ScholarDigital Library
He Yongqiang. handle skewed keys for a join in a separate job. https://issues.apache.org/jira/browse/HIVE-964.Google Scholar
K. Hua and C. Lee. Handling data skew in multiprocessor database computers using partition tuning. In Proc. of the 17th VLDB Conf., pages 525--535, 1991. Google ScholarDigital Library
K. Hua, C. Lee, and C. Hua. Dynamic load balancing in multicomputer database systems using partition tuning. IEEE TKDE, 7(6):968--983, 1995. Google ScholarDigital Library
M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In Proc. of EuroSys Conference, pages 59--72, 2007. Google ScholarDigital Library
M. Kitsuregawa and Y. Ogawa. Bucket spreading parallel hash: a new, robust, parallel hash join method for data skew in the super database computer (sdc). In Proc. of the 16th VLDB Conf., pages 210--221, 1990. Google ScholarDigital Library
W. Li, D. Gao, and R. Snodgrass. Skew handling techniques in sort-merge join. In Proc. of the SIGMOD Conf., pages 169--180, 2002. Google ScholarDigital Library
Namit Jain. Skewed Join Optimization. https://issues.apache.org/jira/browse/HIVE-3086.Google Scholar
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: A not-so-foreign language for data processing. In Proceedings of SIGMOD Conference, pages 1099--1110, 2008. Google ScholarDigital Library
S. Schelter, C. Boden, M. Schenck, A. Alexandrov, and V. Markl. Distributed matrix factorization with mapreduce using a series of broadcast-joins. In Proc. of the 7th ACM conf. on Recommender Systems, pages 281--284, 2013. Google ScholarDigital Library
A. Shatdal and J. F. Naughton. Using shared virtual memory for parallel join processing. In Proc. of the SIGMOD Conf., pages 119--128, 1993. Google ScholarDigital Library
Sriranjan Manjunath. support for skewed outer join. https://issues.apache.org/jira/browse/PIG-1035.Google Scholar
A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, H. Liu, and R. Murthy. Hive -- a petabyte scale data warehouse using Hadoop. In Proceedings of ICDE Conference, pages 996--1005, 2010.Google ScholarCross Ref
J. L. Wolf, D. M. Dias, and P. S. Yu. An effective algorithm for parallelizing sort merge joins in the presence of data skew. In Proceedings of the second international symposium on databases in parallel and distributed systems, DPDS '90, pages 103--115, 1990. Google ScholarDigital Library
J. L. Wolf, D. M. Dias, P. S. Yu, and J. Turek. An effective algorithm for parallelizing hash joins in the presence of data skew. In Proc. of the 7th ICDE Conf., pages 200--209, 1991. Google ScholarDigital Library
Y. Xu and P. Kostamaa. Efficient outer join data skew handling in parallel dbms. Proc. of the VLDB Endowment, 2(2):1390--1396, 2009. Google ScholarDigital Library
Y. Xu, P. Kostamaa, X. Zhou, and L. Chen. Handling data skew in parallel joins in shared-nothing systems. In Proc. of the SIGMOD Conf., pages 1043--1052, 2008. Google ScholarDigital Library
J. Zhou, N. Bruno, M.-C. Wu, P.-Å. Larson, R. Chaiken, and D. Shakib. SCOPE: Parallel databases meet mapreduce. The VLDB Journal, 21(5):611--636, 2012. Google ScholarDigital Library
J. Zhou, P.-Å. Larson, and R. Chaiken. Incorporating partitioning and parallel plans into the SCOPE optimizer. In Proceedings of ICDE Conference, pages 1060--1071, 2010.Google ScholarCross Ref

Index Terms

Advanced join strategies for large-scale distributed computation
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
      2. Parallel and distributed DBMSs
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Index terms have been assigned to the content through auto-classification.

Recommendations

Efficient large-scale distance-based join queries in spatialhadoop

Efficient processing of Distance-Based Join Queries (DBJQs) in spatial databases is of paramount importance in many application domains. The most representative and known DBJQs are the K Closest Pairs Query (KCPQ) and the ź Distance Join Query (źDJQ). ...
Read More
Lightweight Distributed Execution Engine for Large-Scale Spatial Join Query Processing
BIGDATACONGRESS '15: Proceedings of the 2015 IEEE International Congress on Big Data

Existing Big Data systems are mostly designed for relational data. They are either incapable or inefficient in processing large-scale semi-structured data efficiently due to the inherent limitations on data abstraction, indexing support and exposure to ...
Read More
Scalable Distributed Stream Join Processing
SIGMOD '15: Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data

Efficient and scalable stream joins play an important role in performing real-time analytics for many cloud applications. However, like in conventional database processing, online theta-joins over data streams are computationally expensive and moreover, ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Proceedings of the VLDB Endowment Volume 7, Issue 13
August 2014
466 pages
ISSN:2150-8097
Editors:
H. V. Jagadish
University of Michigan
,
Aoying Zhou
East Normal University, China
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 August 2014
Published in pvldb Volume 7, Issue 13
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 16
  Total Citations
  View Citations
- 337
  Total Downloads
- Downloads (Last 12 months)26
- Downloads (Last 6 weeks)5
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Advanced join strategies for large-scale distributed computation

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Efficient large-scale distance-based join queries in spatialhadoop

Lightweight Distributed Execution Engine for Large-Scale Spatial Join Query Processing

Scalable Distributed Stream Join Processing

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Advanced join strategies for large-scale distributed computation

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Efficient large-scale distance-based join queries in spatialhadoop

Lightweight Distributed Execution Engine for Large-Scale Spatial Join Query Processing

Scalable Distributed Stream Join Processing

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media