Abstract
Multi-way Theta-join queries are powerful in describing complex relations and therefore widely employed in real practices. However, existing solutions from traditional distributed and parallel databases for multi-way Theta-join queries cannot be easily extended to fit a shared-nothing distributed computing paradigm, which is proven to be able to support OLAP applications over immense data volumes. In this work, we study the problem of efficient processing of multi-way Theta-join queries using MapReduce from a cost-effective perspective. Although there have been some works using the (key, value) pair-based programming model to support join operations, efficient processing of multi-way Theta-join queries has never been fully explored. The substantial challenge lies in, given a number of processing units (that can run Map or Reduce tasks), mapping a multi-way Theta-join query to a number of MapReduce jobs and having them executed in a well scheduled sequence, such that the total processing time span is minimized. Our solution mainly includes two parts: 1) cost metrics for both single MapReduce job and a number of MapReduce jobs executed in a certain order; 2) the efficient execution of a chain-typed Theta-join with only one MapReduce job. Comparing with the query evaluation strategy proposed in [23] and the widely adopted Pig Latin and Hive SQL solutions, our method achieves significant improvement of the join processing efficiency.
- Transaction processing performance council. http://www.tpc.org/.Google Scholar
- F. Afrati and J. Ullman. Optimizing multiway joins in a map-reduce environment. TKDE, 23(9):1282--1298, 2011. Google ScholarDigital Library
- P. Agrawal and et al. Scheduling shared scans of large data files. PVLDB, 1(1):958--969, 2008. Google ScholarDigital Library
- D. Battré and et al. Nephele/pacts: a programming model and execution framework for web-scale analytical processing. In SoCC, pages 119--130, 2010. Google ScholarDigital Library
- D. Borthakur and et al. Apache hadoop goes realtime at facebook. In SIGMOD, pages 1071--1080, 2011. Google ScholarDigital Library
- G. Brightwell and et al. Note on counting eulerian circuits. CoRR, cs.CC/0405067, 2004.Google Scholar
- Y. Cao and et al. Es2: A cloud data storage system for supporting both oltp and olap. In ICDE, pages 291--302, 2011. Google ScholarDigital Library
- S. Chaudhuri and et al. Optimization of real conjunctive queries. In PODS, pages 59--70, 1993. Google ScholarDigital Library
- T. Condie and et al. Mapreduce online. In NSDI, pages 313--328, 2010. Google ScholarDigital Library
- T. H. Cormen and et al. Introduction to Algorithms (3. ed.). MIT Press, 2009. Google ScholarDigital Library
- S. Das and et al. G-store: a scalable data store for transactional multi key access in the cloud. In SoCC, pages 163--174, 2010. Google ScholarDigital Library
- J. Dean and et al. Mapreduce: simplified data processing on large clusters. Commun. ACM, 51(1):107--113, 2008. Google ScholarDigital Library
- J. Dittrich and et al. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). PVLDB, 3(1):518--529, 2010. Google ScholarDigital Library
- U. Feige. A threshold of ln n for approximating set cover. J. ACM, 45(4):634--652, 1998. Google ScholarDigital Library
- I. Gelfand and et al. Calculus of Variations. Dover Publ., 2000.Google Scholar
- A. Gibbons. Algorithmic Graph Theory. Cambridge University Press, 1985.Google Scholar
- Y. He and et al. Rcfile: A fast and space-efficient data placement structure in mapreduce-based warehouse systems. In ICDE, pages 1199--1208, 2011. Google ScholarDigital Library
- A. Iosup and et al. Performance analysis of cloud computing services for many-tasks scientific computing. TPDS, pages 931--945, 2011. Google ScholarDigital Library
- K. Jansen. Scheduling malleable parallel tasks: An asymptotic fully polynomial time approximation scheme. Algorithmica, 39(1):59--81, 2004. Google ScholarDigital Library
- E. T. Jaynes. Probability theory: The logic of science. Cambridge University Press, Cambridge, 2003.Google Scholar
- D. Jiang and et al. The performance of mapreduce: An in-depth study. PVLDB, 3(1):472--483, 2010. Google ScholarDigital Library
- C. Lee and et al. Optimizing large join queries using a graph-based approach. TKDE, 13(2):298--315, 2001. Google ScholarDigital Library
- R. Lee and et al. Ysmart: Yet another sql-to-mapreduce translator. In ICDCS, pages 25--36, 2011. Google ScholarDigital Library
- T. Nykiel and et al. Mrshare: Sharing across multiple queries in mapreduce. PVLDB, 3(1):494--505, 2010. Google ScholarDigital Library
- A. Okcan and et al. Processing theta-joins using mapreduce. In SIGMOD, pages 949--960, 2011. Google ScholarDigital Library
- K.-L. Tan and et al. A note on the strategy space of multiway join query optimization problem in parallel systems. SIGMOD Record, 20(4):81--82, 1991. Google ScholarDigital Library
- R. Vernica and et al. Efficient parallel set-similarity joins using mapreduce. In SIGMOD, pages 495--506, 2010. Google ScholarDigital Library
- S. Wu and et al. Query optimization for massively parallel data processing. In SoCC, pages 1--13, 2011. Google ScholarDigital Library
- J. Zhou and et al. Incorporating partitioning and parallel plans into the scope optimizer. In ICDE, pages 1060--1071, 2010.Google ScholarCross Ref
Recommendations
Processing theta-joins using MapReduce
SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of dataJoins are essential for many data analysis tasks, but are not supported directly by the MapReduce paradigm. While there has been progress on equi-joins, implementation of join algorithms in MapReduce in general is not sufficiently understood. We study ...
Two MRJs for Multi-way Theta-Join in MapReduce
IDCS 2013: Proceedings of the 6th International Conference on Internet and Distributed Computing Systems - Volume 8223MapReduce is the most popular platform used in cloud computing for large-scale data processing. Generally, data processing involves multi-way Theta-joins join operations.Although multi-way Theta-joins could be processed in MapReduce by using a sequence ...
Multi-way spatial join selectivity for the ring join graph
Efficient spatial query processing is very important since the applications of the spatial DBMS (e.g. GIS, CAD/CAM, LBS) handle massive amount of data and consume much time. Many spatial queries contain the multi-way spatial join due to the fact that ...
Comments