skip to main content
research-article

MRShare: sharing across multiple queries in MapReduce

Published:01 September 2010Publication History
Skip Abstract Section

Abstract

Large-scale data analysis lies in the core of modern enterprises and scientific research. With the emergence of cloud computing, the use of an analytical query processing infrastructure (e.g., Amazon EC2) can be directly mapped to monetary value. MapReduce has been a popular framework in the context of cloud computing, designed to serve long running queries (jobs) which can be processed in batch mode. Taking into account that different jobs often perform similar work, there are many opportunities for sharing. In principle, sharing similar work reduces the overall amount of work, which can lead to reducing monetary charges incurred while utilizing the processing infrastructure. In this paper we propose a sharing framework tailored to MapReduce.

Our framework, MRShare, transforms a batch of queries into a new batch that will be executed more efficiently, by merging jobs into groups and evaluating each group as a single query. Based on our cost model for MapReduce, we define an optimization problem and we provide a solution that derives the optimal grouping of queries. Experiments in our prototype, built on top of Hadoop, demonstrate the overall effectiveness of our approach and substantial savings.

References

  1. Amazon EC2. http://aws.amazon.com/ec2/.Google ScholarGoogle Scholar
  2. Blogscope. http://www.blogscope.net/.Google ScholarGoogle Scholar
  3. Hadoop project. http://hadoop.apache.org/.Google ScholarGoogle Scholar
  4. Saving energy in datacenters. http://www1.eere.energy.gov/industry/datacenters/.Google ScholarGoogle Scholar
  5. A. Abouzeid, K. Bajda-Pawlikowski, D. Abadi, A. Rasin, and A. Silberschatz. HadoopDB: An architectural hybrid of mapreduce and dbms technologies for analytical workloads. In VLDB, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. P. Agrawal, D. Kifer, and C. Olston. Scheduling shared scans of large data files. Proc. VLDB Endow., 1(1):958--969, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. G. Candea, N. Polyzotis, and R. Vingralek. A scalable, predictable join operator for highly concurrent data warehouses. In VLDB, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. Scope: easy and efficient parallel processing of massive data sets. Proc. VLDB Endow., 1(2):1265--1276, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. S. Chaudhuri and K. Shim. Optimization of queries with user-defined predicates. ACM Trans. Database Syst., 24(2):177--228, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. J. Cohen, B. Dolan, M. Dunlap, J. M. Hellerstein, and C. Welton. Mad skills: New analysis practices for big data. PVLDB, 2(2):1481--1492, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI '04, pages 137--150. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. S. Finkelstein. Common expression analysis in database applications. In SIGMOD '82, pages 235--245, 1982. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. E. Friedman, P. Pawlowski, and J. Cieslewicz. Sql/mapreduce: A practical approach to self-describing, polymorphic, and parallelizable user-defined functions. In VLDB, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. A. Gates, O. Natkovich, S. Chopra, P. Kamath, S. Narayanam, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava. Building a highlevel dataflow system on top of mapreduce: The pig experience. PVLDB, 2(2):1414--1425, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. S. Harizopoulos, V. Shkapenyuk, and A. Ailamaki. Qpipe: a simultaneously pipelined relational query engine. In SIGMOD '05, pages 383--394, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. C. Olston, B. Reed, A. Silberstein, and U. Srivastava. Automatic optimization of parallel dataflow programs. In USENIX Annual Tech. Conf., pages 267--273, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig latin: a not-so-foreign language for data processing. In SIGMOD, pages 1099--1110, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A comparison of approaches to large-scale data analysis. In SIGMOD '09, pages 165--178, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data: Parallel analysis with sawzall. Scientific Programming, 13(4):277--298, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. L. Qiao, V. Raman, F. Reiss, P. J. Haas, and G. M. Lohman. Main-memory scan sharing for multi-core cpus. Proc. VLDB Endow., 1(1):610--621, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. T. K. Sellis. Multiple-query optimization. ACM Trans. Database Syst., 13(1):23--52, 1988. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive - a warehousing solution over a map-reduce framework. In VLDB, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. H.-c. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker. Map-reduce-merge: simplified relational data processing on large clusters. In SIGMOD, pages 1029--1040, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. K. Gunda, and J. Currey. Dryadlinq: A system for general-purpose distributed data-parallel computing using a high-level language. In OSDI, pages 1--14, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. J. Zhou, P.-A. Larson, J.-C. Freytag, and W. Lehner. Efficient exploitation of similar subexpressions for query processing. In SIGMOD '07, pages 533--544, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. M. Zukowski, S. Héman, N. Nes, and P. Boncz. Cooperative scans: dynamic bandwidth sharing in a dbms. In VLDB '07, pages 723--734, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 3, Issue 1-2
    September 2010
    1658 pages

    Publisher

    VLDB Endowment

    Publication History

    • Published: 1 September 2010
    Published in pvldb Volume 3, Issue 1-2

    Qualifiers

    • research-article

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader