skip to main content
10.1145/3183713.3196891acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article
Public Access

Automatically Leveraging MapReduce Frameworks for Data-Intensive Applications

Published:27 May 2018Publication History

ABSTRACT

MapReduce is a popular programming paradigm for developing large-scale, data-intensive computation. Many frameworks that implement this paradigm have recently been developed. To leverage these frameworks, however, developers must become familiar with their APIs and rewrite existing code. We present Casper, a new tool that automatically translates sequential Java programs into the MapReduce paradigm. Casper identifies potential code fragments to rewrite and translates them in two steps: (1) Casper uses program synthesis to search for a program summary (i.e., a functional specification) of each code fragment. The summary is expressed using a high-level intermediate language resembling the MapReduce paradigm and verified to be semantically equivalent to the original using a theorem prover. (2) Casper generates executable code from the summary, using either the Hadoop, Spark, or Flink API. We evaluated Casper by automatically converting real-world, sequential Java benchmarks to MapReduce. The resulting benchmarks perform up to 48.2x faster compared to the original.

References

  1. Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. 2006. Compilers: Principles, Techniques, and Tools (2Nd Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Alexander Alexandrov, Asterios Katsifodimos, Georgi Krastev, and Volker Markl. 2016. Implicit Parallelism Through Deep Language Embedding. SIGMOD Rec. 45, 1 (June 2016), 51--58. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Saman P. Amarasinghe, Jennifer-Ann M. Anderson, Monica S. Lam, and Chau- Wen Tseng. 1995. An Overview of the SUIF Compiler for Scalable Parallel Machines. In PPSC. 662--667.Google ScholarGoogle Scholar
  4. Apache Flink 2018. https://flink.apache.org/. (2018). Accessed on: 2018-04-09.Google ScholarGoogle Scholar
  5. Apache Hadoop 2018. http://hadoop.apache.org. (2018). Accessed on: 2018-04-09.Google ScholarGoogle Scholar
  6. Apache Hive 2018. http://hive.apache.org. (2018). Accessed on: 2018-04-09.Google ScholarGoogle Scholar
  7. Apache Pig 2018. https://pig.apache.org/. (2018). Accessed on: 2018-04-09.Google ScholarGoogle Scholar
  8. Apache Spark 2018. https://spark.apache.org. (2018). Accessed on: 2018-04-09.Google ScholarGoogle Scholar
  9. Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015. Spark SQL: Relational Data Processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15). ACM, New York, NY, USA, 1383--1394. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. William Blume, Rudolf Eigenmann, Jay Hoeflinger, David A. Padua, Paul Petersen, Lawrence Rauchwerger, and Peng Tu. 1994. Automatic Detection of Parallelism: A grand challenge for high performance computing. IEEE P&DT 2, 3 (1994), 37--47. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Rastislav Bodík and Barbara Jobstmann. 2013. Algorithmic program synthesis: introduction. International Journal on Software Tools for Technology Transfer 15 (2013), 397--411. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Matthias Boehm, Michael W. Dusenberry, Deron Eriksson, Alexandre V. Evfimievski, Faraz Makari Manshadi, Niketan Pansare, Berthold Reinwald, Frederick R. Reiss, Prithviraj Sen, Arvind C. Surve, and Shirish Tatikonda. 2016. SystemML: Declarative Machine Learning on Spark. Proc. VLDB Endow. 9, 13 (Sept. 2016), 1425--1436. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. 2005. A Non-Local Algorithm for Image Denoising. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05). IEEE Computer Society, Washington, DC, USA, 60--65. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Yu-Fang Chen, Lei Song, and ZhilinWu. 2016. The Commutativity Problem of the MapReduce Framework: A Transducer-based Approach. CoRR abs/1605.01497 (2016).Google ScholarGoogle Scholar
  15. Alvin Cheung, Samuel Madden, Armando Solar-Lezama, Owen Arden, and Andrew C. Myers. 2014. Using Program Analysis to Improve Database Applications. IEEE Data Eng. Bull. 37, 1 (2014), 48--59.Google ScholarGoogle Scholar
  16. Alvin Cheung and Armando Solar-Lezama. 2016. Computer-Assisted Query Formulation. Foundations and Trends in Programming Languages 3, 1 (2016), 1--94. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Alvin Cheung, Armando Solar-Lezama, and Samuel Madden. 2013. Optimizing Database-backed Applications with Query Synthesis. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '13). ACM, New York, NY, USA, 3--14. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Andrew Crotty, Alex Galakatos, Kayhan Dursun, Tim Kraska, Ugur Çetintemel, and Stanley B. Zdonik. 2014. Tupleware: Redefining Modern Analytics. CoRR abs/1406.6667 (2014).Google ScholarGoogle Scholar
  19. Przemyslaw Daca, Thomas A. Henzinger, and Andrey Kupriyanov. 2016. Array Folds Logic. CoRR abs/1603.06850 (2016).Google ScholarGoogle Scholar
  20. Jerome Darbon, Alexandre Cunha, Tony F. Chan, Stanley Osher, and Grant J. Jensen. 2008. Fast nonlocal filtering applied to electron cryomicroscopy.. In ISBI. IEEE, 1331--1334.Google ScholarGoogle Scholar
  21. Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM 51, 1 (Jan. 2008), 107--113. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. K. Venkatesh Emani, Karthik Ramachandra, Subhro Bhattacharya, and S. Sudarshan. 2016. Extracting Equivalent SQL from Imperative Code in Database Applications. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16). ACM, New York, NY, USA, 1781--1796. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Grigory Fedyukovich, Maaz Bin Safeer Ahmad, and Rastislav Bodik. 2017. Gradual Synthesis for Static Parallelization of Single-pass Array-processing Programs. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2017). ACM, New York, NY, USA, 572--585. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Fiji: ImageJ 2018. https://github.com/fiji. (2018). Accessed on: 2018-04-09.Google ScholarGoogle Scholar
  25. Sumit Gulwani. 2010. Dimensions in Program Synthesis. In Proceedings of the 12th International ACM SIGPLAN Symposium on Principles and Practice of Declarative Programming (PPDP '10). ACM, New York, NY, USA, 13--24. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. C. A. R. Hoare. 1969. An Axiomatic Basis for Computer Programming. Commun. ACM 12, 10 (Oct. 1969), 576--580. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. ImageJ 2018. https://imagej.net/Welcome. (2018). Accessed on: 2018-04-09.Google ScholarGoogle Scholar
  28. Shoaib Kamil, Alvin Cheung, Shachar Itzhaky, and Armando Solar-Lezama. 2016. Verified Lifting of Stencil Computations. SIGPLAN Not. 51, 6 (June 2016), 711--726. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Alfons Kemper, Thomas Neumann, Florian Funke, Viktor Leis, and Henrik Mühe. 2012. HyPer: Adapting Columnar Main-Memory Data Management for Transactional AND Query Processing. IEEE Data Eng. Bull. 35, 1 (2012), 46--51.Google ScholarGoogle Scholar
  30. Yannis Klonatos, Christoph Koch, Tiark Rompf, and Hassan Chafi. 2014. Building Efficient Query Engines in a High-level Language. Proc. VLDB Endow. 7, 10 (June 2014), 853--864. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. K. Rustan M. Leino. 2010. Dafny: An Automatic Program Verifier for Functional Correctness. In Proceedings of the 16th International Conference on Logic for Programming, Artificial Intelligence, and Reasoning (LPAR'10). Springer-Verlag, Berlin, Heidelberg, 348--370. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. MagPie Analysis Repository 2018. https://github.com/thisMagpie/Analysis. (2018). Accessed on: 2018-04-09.Google ScholarGoogle Scholar
  33. John Matthews, J. Strother Moore, Sandip Ray, and Daron Vroon. 2006. Verification Condition Generation via Theorem Proving. In Proceedings of the 13th International Conference on Logic for Programming, Artificial Intelligence, and Reasoning (LPAR'06). Springer-Verlag, Berlin, Heidelberg, 362--376. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Cedric Nugteren and Henk Corporaal. 2012. Introducing 'Bones': A Parallelizing Source-to-source Compiler Based on Algorithmic Skeletons. In Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units (GPGPU-5). ACM, New York, NY, USA, 1--10. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Shoumik Palkar, James J. Thomas, Anil Shanbhag, Deepak Narayanan, Holger Pirk, Malte Schwarzkopf, Saman Amarasinghe, and Matei Zaharia. 2017. Weld: A Common Runtime for High Performance Data Analytics. (January 2017).Google ScholarGoogle Scholar
  36. Spiros Papadimitriou and Jimeng Sun. 2008. DisCo: Distributed Co-clustering with Map-Reduce: A Case Study Towards Petabyte-Scale End-to-End Mining. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining (ICDM '08). IEEE Computer Society, Washington, DC, USA, 512--521. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Polyglot 2018. http://www.cs.cornell.edu/Projects/polyglot/. (2018). Accessed on: 2018-04-09.Google ScholarGoogle Scholar
  38. Cosmin Radoi, Stephen J. Fink, Rodric Rabbah, and Manu Sridharan. 2014. Translating Imperative Code to MapReduce. In Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages &Applications (OOPSLA '14). ACM, New York, NY, USA, 909--927. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Christos Kozyrakis. 2007. Evaluating MapReduce for Multi-core and Multiprocessor Systems. In Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture (HPCA '07). IEEE Computer Society,Washington, DC, USA, 13--24. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Veselin Raychev, Madanlal Musuvathi, and Todd Mytkowicz. 2015. Parallelizing User-defined Aggregations Using Symbolic Execution. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP '15). ACM, New York, NY, USA, 153--167. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. P. Griffiths Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. 1979. Access Path Selection in a Relational Database Management System. In Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data (SIGMOD '79). ACM, New York, NY, USA, 23--34. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Sketch 2018. https://people.csail.mit.edu/asolar/. (2018). Accessed on: 2018-04-09.Google ScholarGoogle Scholar
  43. Yannis Smaragdakis and George Balatsouras. 2015. Pointer Analysis. Found. Trends Program. Lang. 2, 1 (April 2015), 1--69. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Calvin Smith and Aws Albarghouthi. 2016. MapReduce Program Synthesis. SIGPLAN Not. 51, 6 (June 2016), 326--340. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Armando Solar-Lezama. 2008. Program Synthesis by Sketching. Ph.D. Dissertation. Berkeley, CA, USA. Advisor(s) Bodik, Rastislav. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Spark GitHub Repository 2018. https://github.com/apache/spark/tree/master/ examples/src/main/scala/org/apache/spark/examples. (2018). Accessed on: 2018- 01--20.Google ScholarGoogle Scholar
  47. Glynn Winskel. 1993. The Formal Semantics of Programming Languages: An Introduction. MIT Press, Cambridge, MA, USA. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Automatically Leveraging MapReduce Frameworks for Data-Intensive Applications

      Recommendations

      Comments

      Login options

      Check if you have access through your login credentials or your institution to get full access on this article.

      Sign in
      • Published in

        cover image ACM Conferences
        SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data
        May 2018
        1874 pages
        ISBN:9781450347037
        DOI:10.1145/3183713

        Copyright © 2018 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 27 May 2018

        Permissions

        Request permissions about this article.

        Request Permissions

        Check for updates

        Qualifiers

        • research-article

        Acceptance Rates

        SIGMOD '18 Paper Acceptance Rate90of461submissions,20%Overall Acceptance Rate785of4,003submissions,20%

      PDF Format

      View or Download as a PDF file.

      PDF

      eReader

      View online with eReader.

      eReader