research-article

Efficient multi-way theta-join processing using MapReduce

Authors:
Xiaofei Zhang

HKUST, Hong Kong

HKUST, Hong Kong
View Profile

,
Lei Chen

HKUST, Hong Kong

HKUST, Hong Kong
View Profile

,
Min Wang

HP Labs China, Beijing, China

HP Labs China, Beijing, China
View Profile

Proceedings of the VLDB Endowment Volume 5 Issue 11pp 1184–1195https://doi.org/10.14778/2350229.2350238

Published:01 July 2012Publication History

Proceedings of the VLDB Endowment

Abstract

Multi-way Theta-join queries are powerful in describing complex relations and therefore widely employed in real practices. However, existing solutions from traditional distributed and parallel databases for multi-way Theta-join queries cannot be easily extended to fit a shared-nothing distributed computing paradigm, which is proven to be able to support OLAP applications over immense data volumes. In this work, we study the problem of efficient processing of multi-way Theta-join queries using MapReduce from a cost-effective perspective. Although there have been some works using the (key, value) pair-based programming model to support join operations, efficient processing of multi-way Theta-join queries has never been fully explored. The substantial challenge lies in, given a number of processing units (that can run Map or Reduce tasks), mapping a multi-way Theta-join query to a number of MapReduce jobs and having them executed in a well scheduled sequence, such that the total processing time span is minimized. Our solution mainly includes two parts: 1) cost metrics for both single MapReduce job and a number of MapReduce jobs executed in a certain order; 2) the efficient execution of a chain-typed Theta-join with only one MapReduce job. Comparing with the query evaluation strategy proposed in [23] and the widely adopted Pig Latin and Hive SQL solutions, our method achieves significant improvement of the join processing efficiency.

References

Transaction processing performance council. http://www.tpc.org/.Google Scholar
F. Afrati and J. Ullman. Optimizing multiway joins in a map-reduce environment. TKDE, 23(9):1282--1298, 2011. Google ScholarDigital Library
P. Agrawal and et al. Scheduling shared scans of large data files. PVLDB, 1(1):958--969, 2008. Google ScholarDigital Library
D. Battré and et al. Nephele/pacts: a programming model and execution framework for web-scale analytical processing. In SoCC, pages 119--130, 2010. Google ScholarDigital Library
D. Borthakur and et al. Apache hadoop goes realtime at facebook. In SIGMOD, pages 1071--1080, 2011. Google ScholarDigital Library
G. Brightwell and et al. Note on counting eulerian circuits. CoRR, cs.CC/0405067, 2004.Google Scholar
Y. Cao and et al. Es²: A cloud data storage system for supporting both oltp and olap. In ICDE, pages 291--302, 2011. Google ScholarDigital Library
S. Chaudhuri and et al. Optimization of real conjunctive queries. In PODS, pages 59--70, 1993. Google ScholarDigital Library
T. Condie and et al. Mapreduce online. In NSDI, pages 313--328, 2010. Google ScholarDigital Library
T. H. Cormen and et al. Introduction to Algorithms (3. ed.). MIT Press, 2009. Google ScholarDigital Library
S. Das and et al. G-store: a scalable data store for transactional multi key access in the cloud. In SoCC, pages 163--174, 2010. Google ScholarDigital Library
J. Dean and et al. Mapreduce: simplified data processing on large clusters. Commun. ACM, 51(1):107--113, 2008. Google ScholarDigital Library
J. Dittrich and et al. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). PVLDB, 3(1):518--529, 2010. Google ScholarDigital Library
U. Feige. A threshold of ln n for approximating set cover. J. ACM, 45(4):634--652, 1998. Google ScholarDigital Library
I. Gelfand and et al. Calculus of Variations. Dover Publ., 2000.Google Scholar
A. Gibbons. Algorithmic Graph Theory. Cambridge University Press, 1985.Google Scholar
Y. He and et al. Rcfile: A fast and space-efficient data placement structure in mapreduce-based warehouse systems. In ICDE, pages 1199--1208, 2011. Google ScholarDigital Library
A. Iosup and et al. Performance analysis of cloud computing services for many-tasks scientific computing. TPDS, pages 931--945, 2011. Google ScholarDigital Library
K. Jansen. Scheduling malleable parallel tasks: An asymptotic fully polynomial time approximation scheme. Algorithmica, 39(1):59--81, 2004. Google ScholarDigital Library
E. T. Jaynes. Probability theory: The logic of science. Cambridge University Press, Cambridge, 2003.Google Scholar
D. Jiang and et al. The performance of mapreduce: An in-depth study. PVLDB, 3(1):472--483, 2010. Google ScholarDigital Library
C. Lee and et al. Optimizing large join queries using a graph-based approach. TKDE, 13(2):298--315, 2001. Google ScholarDigital Library
R. Lee and et al. Ysmart: Yet another sql-to-mapreduce translator. In ICDCS, pages 25--36, 2011. Google ScholarDigital Library
T. Nykiel and et al. Mrshare: Sharing across multiple queries in mapreduce. PVLDB, 3(1):494--505, 2010. Google ScholarDigital Library
A. Okcan and et al. Processing theta-joins using mapreduce. In SIGMOD, pages 949--960, 2011. Google ScholarDigital Library
K.-L. Tan and et al. A note on the strategy space of multiway join query optimization problem in parallel systems. SIGMOD Record, 20(4):81--82, 1991. Google ScholarDigital Library
R. Vernica and et al. Efficient parallel set-similarity joins using mapreduce. In SIGMOD, pages 495--506, 2010. Google ScholarDigital Library
S. Wu and et al. Query optimization for massively parallel data processing. In SoCC, pages 1--13, 2011. Google ScholarDigital Library
J. Zhou and et al. Incorporating partitioning and parallel plans into the scope optimizer. In ICDE, pages 1060--1071, 2010.Google ScholarCross Ref

Recommendations

Processing theta-joins using MapReduce
SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data

Joins are essential for many data analysis tasks, but are not supported directly by the MapReduce paradigm. While there has been progress on equi-joins, implementation of join algorithms in MapReduce in general is not sufficiently understood. We study ...
Read More
Two MRJs for Multi-way Theta-Join in MapReduce
IDCS 2013: Proceedings of the 6th International Conference on Internet and Distributed Computing Systems - Volume 8223

MapReduce is the most popular platform used in cloud computing for large-scale data processing. Generally, data processing involves multi-way Theta-joins join operations.Although multi-way Theta-joins could be processed in MapReduce by using a sequence ...
Read More
Multi-way spatial join selectivity for the ring join graph

Efficient spatial query processing is very important since the applications of the spatial DBMS (e.g. GIS, CAD/CAM, LBS) handle massive amount of data and consume much time. Many spatial queries contain the multi-way spatial join due to the fact that ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

Proceedings of the VLDB Endowment Volume 5, Issue 11
July 2012
608 pages
ISSN:2150-8097
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 July 2012
Published in pvldb Volume 5, Issue 11
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 34
  Total Citations
  View Citations
- 743
  Total Downloads
- Downloads (Last 12 months)4
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Efficient multi-way theta-join processing using MapReduce

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

Processing theta-joins using MapReduce

Two MRJs for Multi-way Theta-Join in MapReduce

Multi-way spatial join selectivity for the ring join graph

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Efficient multi-way theta-join processing using MapReduce

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

Processing theta-joins using MapReduce

Two MRJs for Multi-way Theta-Join in MapReduce

Multi-way spatial join selectivity for the ring join graph

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media