skip to main content
research-article

Multi-query optimization in MapReduce framework

Published:01 November 2013Publication History
Skip Abstract Section

Abstract

MapReduce has recently emerged as a new paradigm for large-scale data analysis due to its high scalability, fine-grained fault tolerance and easy programming model. Since different jobs often share similar work (e.g., several jobs scan the same input file or produce the same map output), there are many opportunities to optimize the performance for a batch of jobs. In this paper, we propose two new techniques for multi-job optimization in the MapReduce framework. The first is a generalized grouping technique (which generalizes the recently proposed MRShare technique) that merges multiple jobs into a single job thereby enabling the merged jobs to share both the scan of the input file as well as the communication of the common map output. The second is a materialization technique that enables multiple jobs to share both the scan of the input file as well as the communication of the common map output via partial materialization of the map output of some jobs (in the map and/or reduce phase). Our second contribution is the proposal of a new optimization algorithm that given an input batch of jobs, produces an optimal plan by a judicious partitioning of the jobs into groups and an optimal assignment of the processing technique to each group. Our experimental results on Hadoop demonstrate that our new approach significantly outperforms the state-of-the-art technique, MRShare, by up to 107%.

References

  1. F. N. Afrati and J. D. Ullman. Optimizing joins in a mapreduce environment. In EDBT, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. J. Dean and S. Ghemawat. Mapreduce: simplified data processing on large clusters. In OSDI, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. I. Elghandour and A. Aboulnaga. Restore: Reusing results of mapreduce jobs. In VLDB, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. L. Fegaras, C. Li, and U. Gupta. An optimization framework for map-reduce queries. In EDBT, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. A. F. Gates, O. Natkovich, S. Chopra, P. Kamath, S. M. Narayanamurthy, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava. Building a high-level dataflow system on top of map-reduce: the pig experience. In VLDB, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. H. Herodotou and S. Babu. Profiling, what-if analysis, and cost-based optimization of mapreduce programs. In VLDB, 2011.Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. H. Herodotou, F. Dong, and S. Babu. Mapreduce programming and cost-based optimization? crossing this chasm with starfish. In VLDB, 2011.Google ScholarGoogle Scholar
  8. E. Jahani, M. J. Cafarella, and C. Ré. Automatic optimization for mapreduce programs. In VLDB, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. J. Jestes, K. Yi, and F. Li. Building wavelet histograms on large data in mapreduce. In VLDB, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. F. Li, B. C. Ooi, M. T. Özsu, and S. Wu. Distributed data management using mapreduce. ACM Computing Surveys. To appear in 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. H. Lim, H. Herodotou, and S. Babu. Stubby: A transformation-based optimizer for mapreduce workflows. In VLDB, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. T. Nykiel, M. Potamias, C. Mishra, G. Kollios, and N. Koudas. Mrshare: sharing across multiple queries in mapreduce. In VLDB, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. C. Olston, B. Reed, A. Silberstein, and U. Srivastava. Automatic optimization of parallel dataflow programs. In ATC, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: A not-so-foreign language for data processing. In SIGMOD, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Y. Shi, X. Meng, F. Wang, and Y. Gan. Hedc: a histogram estimator for data in the cloud. In CloudDb, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive-a warehousing solution over a map-reduce framework. In VLDB, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, N. Zhang, S. Antony, and H. Liu. Hive-a petabyte scale data warehouse using hadoop. In ICDE, 2010.Google ScholarGoogle ScholarCross RefCross Ref
  18. G. Wang and C.-Y. Chan. Multi-query optimization in mapreduce framework. Technical Report http://www.comp.nus.edu.sg/~g0800170/techreport-MJQ.pdf, National University of Singapore, February 2013.Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. T. White. Hadoop: The Definitive Guide. O'Reilly Media, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. S. Wu, F. Li, S. Mehrotra, and B. C. Ooi. Query optimization for massively parallel data processing. In SOCC, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 7, Issue 3
    November 2013
    84 pages

    Publisher

    VLDB Endowment

    Publication History

    • Published: 1 November 2013
    Published in pvldb Volume 7, Issue 3

    Qualifiers

    • research-article

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader