skip to main content
research-article

Efficient multi-way theta-join processing using MapReduce

Published:01 July 2012Publication History
Skip Abstract Section

Abstract

Multi-way Theta-join queries are powerful in describing complex relations and therefore widely employed in real practices. However, existing solutions from traditional distributed and parallel databases for multi-way Theta-join queries cannot be easily extended to fit a shared-nothing distributed computing paradigm, which is proven to be able to support OLAP applications over immense data volumes. In this work, we study the problem of efficient processing of multi-way Theta-join queries using MapReduce from a cost-effective perspective. Although there have been some works using the (key, value) pair-based programming model to support join operations, efficient processing of multi-way Theta-join queries has never been fully explored. The substantial challenge lies in, given a number of processing units (that can run Map or Reduce tasks), mapping a multi-way Theta-join query to a number of MapReduce jobs and having them executed in a well scheduled sequence, such that the total processing time span is minimized. Our solution mainly includes two parts: 1) cost metrics for both single MapReduce job and a number of MapReduce jobs executed in a certain order; 2) the efficient execution of a chain-typed Theta-join with only one MapReduce job. Comparing with the query evaluation strategy proposed in [23] and the widely adopted Pig Latin and Hive SQL solutions, our method achieves significant improvement of the join processing efficiency.

References

  1. Transaction processing performance council. http://www.tpc.org/.Google ScholarGoogle Scholar
  2. F. Afrati and J. Ullman. Optimizing multiway joins in a map-reduce environment. TKDE, 23(9):1282--1298, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. P. Agrawal and et al. Scheduling shared scans of large data files. PVLDB, 1(1):958--969, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. D. Battré and et al. Nephele/pacts: a programming model and execution framework for web-scale analytical processing. In SoCC, pages 119--130, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. D. Borthakur and et al. Apache hadoop goes realtime at facebook. In SIGMOD, pages 1071--1080, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. G. Brightwell and et al. Note on counting eulerian circuits. CoRR, cs.CC/0405067, 2004.Google ScholarGoogle Scholar
  7. Y. Cao and et al. Es2: A cloud data storage system for supporting both oltp and olap. In ICDE, pages 291--302, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. S. Chaudhuri and et al. Optimization of real conjunctive queries. In PODS, pages 59--70, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. T. Condie and et al. Mapreduce online. In NSDI, pages 313--328, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. T. H. Cormen and et al. Introduction to Algorithms (3. ed.). MIT Press, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. S. Das and et al. G-store: a scalable data store for transactional multi key access in the cloud. In SoCC, pages 163--174, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. J. Dean and et al. Mapreduce: simplified data processing on large clusters. Commun. ACM, 51(1):107--113, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. J. Dittrich and et al. Hadoop++: Making a yellow elephant run like a cheetah (without it even noticing). PVLDB, 3(1):518--529, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. U. Feige. A threshold of ln n for approximating set cover. J. ACM, 45(4):634--652, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. I. Gelfand and et al. Calculus of Variations. Dover Publ., 2000.Google ScholarGoogle Scholar
  16. A. Gibbons. Algorithmic Graph Theory. Cambridge University Press, 1985.Google ScholarGoogle Scholar
  17. Y. He and et al. Rcfile: A fast and space-efficient data placement structure in mapreduce-based warehouse systems. In ICDE, pages 1199--1208, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. A. Iosup and et al. Performance analysis of cloud computing services for many-tasks scientific computing. TPDS, pages 931--945, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. K. Jansen. Scheduling malleable parallel tasks: An asymptotic fully polynomial time approximation scheme. Algorithmica, 39(1):59--81, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. E. T. Jaynes. Probability theory: The logic of science. Cambridge University Press, Cambridge, 2003.Google ScholarGoogle Scholar
  21. D. Jiang and et al. The performance of mapreduce: An in-depth study. PVLDB, 3(1):472--483, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. C. Lee and et al. Optimizing large join queries using a graph-based approach. TKDE, 13(2):298--315, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. R. Lee and et al. Ysmart: Yet another sql-to-mapreduce translator. In ICDCS, pages 25--36, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. T. Nykiel and et al. Mrshare: Sharing across multiple queries in mapreduce. PVLDB, 3(1):494--505, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. A. Okcan and et al. Processing theta-joins using mapreduce. In SIGMOD, pages 949--960, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. K.-L. Tan and et al. A note on the strategy space of multiway join query optimization problem in parallel systems. SIGMOD Record, 20(4):81--82, 1991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. R. Vernica and et al. Efficient parallel set-similarity joins using mapreduce. In SIGMOD, pages 495--506, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. S. Wu and et al. Query optimization for massively parallel data processing. In SoCC, pages 1--13, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. J. Zhou and et al. Incorporating partitioning and parallel plans into the scope optimizer. In ICDE, pages 1060--1071, 2010.Google ScholarGoogle ScholarCross RefCross Ref

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 5, Issue 11
    July 2012
    608 pages

    Publisher

    VLDB Endowment

    Publication History

    • Published: 1 July 2012
    Published in pvldb Volume 5, Issue 11

    Qualifiers

    • research-article

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader