research-article

Public Access

Automatically Leveraging MapReduce Frameworks for Data-Intensive Applications

Authors:
Maaz Bin Safeer Ahmad

University of Washington, Seattle, WA, USA

University of Washington, Seattle, WA, USA
View Profile

,
Alvin Cheung

University of Washington, Seattle, WA, USA

University of Washington, Seattle, WA, USA
View Profile

SIGMOD '18: Proceedings of the 2018 International Conference on Management of DataMay 2018Pages 1205–1220https://doi.org/10.1145/3183713.3196891

Published:27 May 2018Publication History

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

Pages 1205–1220

ABSTRACT

MapReduce is a popular programming paradigm for developing large-scale, data-intensive computation. Many frameworks that implement this paradigm have recently been developed. To leverage these frameworks, however, developers must become familiar with their APIs and rewrite existing code. We present Casper, a new tool that automatically translates sequential Java programs into the MapReduce paradigm. Casper identifies potential code fragments to rewrite and translates them in two steps: (1) Casper uses program synthesis to search for a program summary (i.e., a functional specification) of each code fragment. The summary is expressed using a high-level intermediate language resembling the MapReduce paradigm and verified to be semantically equivalent to the original using a theorem prover. (2) Casper generates executable code from the summary, using either the Hadoop, Spark, or Flink API. We evaluated Casper by automatically converting real-world, sequential Java benchmarks to MapReduce. The resulting benchmarks perform up to 48.2x faster compared to the original.

References

Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman. 2006. Compilers: Principles, Techniques, and Tools (2Nd Edition). Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA. Google ScholarDigital Library
Alexander Alexandrov, Asterios Katsifodimos, Georgi Krastev, and Volker Markl. 2016. Implicit Parallelism Through Deep Language Embedding. SIGMOD Rec. 45, 1 (June 2016), 51--58. Google ScholarDigital Library
Saman P. Amarasinghe, Jennifer-Ann M. Anderson, Monica S. Lam, and Chau- Wen Tseng. 1995. An Overview of the SUIF Compiler for Scalable Parallel Machines. In PPSC. 662--667.Google Scholar
Apache Flink 2018. https://flink.apache.org/. (2018). Accessed on: 2018-04-09.Google Scholar
Apache Hadoop 2018. http://hadoop.apache.org. (2018). Accessed on: 2018-04-09.Google Scholar
Apache Hive 2018. http://hive.apache.org. (2018). Accessed on: 2018-04-09.Google Scholar
Apache Pig 2018. https://pig.apache.org/. (2018). Accessed on: 2018-04-09.Google Scholar
Apache Spark 2018. https://spark.apache.org. (2018). Accessed on: 2018-04-09.Google Scholar
Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015. Spark SQL: Relational Data Processing in Spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15). ACM, New York, NY, USA, 1383--1394. Google ScholarDigital Library
William Blume, Rudolf Eigenmann, Jay Hoeflinger, David A. Padua, Paul Petersen, Lawrence Rauchwerger, and Peng Tu. 1994. Automatic Detection of Parallelism: A grand challenge for high performance computing. IEEE P&DT 2, 3 (1994), 37--47. Google ScholarDigital Library
Rastislav Bodík and Barbara Jobstmann. 2013. Algorithmic program synthesis: introduction. International Journal on Software Tools for Technology Transfer 15 (2013), 397--411. Google ScholarDigital Library
Matthias Boehm, Michael W. Dusenberry, Deron Eriksson, Alexandre V. Evfimievski, Faraz Makari Manshadi, Niketan Pansare, Berthold Reinwald, Frederick R. Reiss, Prithviraj Sen, Arvind C. Surve, and Shirish Tatikonda. 2016. SystemML: Declarative Machine Learning on Spark. Proc. VLDB Endow. 9, 13 (Sept. 2016), 1425--1436. Google ScholarDigital Library
Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. 2005. A Non-Local Algorithm for Image Denoising. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '05). IEEE Computer Society, Washington, DC, USA, 60--65. Google ScholarDigital Library
Yu-Fang Chen, Lei Song, and ZhilinWu. 2016. The Commutativity Problem of the MapReduce Framework: A Transducer-based Approach. CoRR abs/1605.01497 (2016).Google Scholar
Alvin Cheung, Samuel Madden, Armando Solar-Lezama, Owen Arden, and Andrew C. Myers. 2014. Using Program Analysis to Improve Database Applications. IEEE Data Eng. Bull. 37, 1 (2014), 48--59.Google Scholar
Alvin Cheung and Armando Solar-Lezama. 2016. Computer-Assisted Query Formulation. Foundations and Trends in Programming Languages 3, 1 (2016), 1--94. Google ScholarDigital Library
Alvin Cheung, Armando Solar-Lezama, and Samuel Madden. 2013. Optimizing Database-backed Applications with Query Synthesis. In Proceedings of the 34th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '13). ACM, New York, NY, USA, 3--14. Google ScholarDigital Library
Andrew Crotty, Alex Galakatos, Kayhan Dursun, Tim Kraska, Ugur Çetintemel, and Stanley B. Zdonik. 2014. Tupleware: Redefining Modern Analytics. CoRR abs/1406.6667 (2014).Google Scholar
Przemyslaw Daca, Thomas A. Henzinger, and Andrey Kupriyanov. 2016. Array Folds Logic. CoRR abs/1603.06850 (2016).Google Scholar
Jerome Darbon, Alexandre Cunha, Tony F. Chan, Stanley Osher, and Grant J. Jensen. 2008. Fast nonlocal filtering applied to electron cryomicroscopy.. In ISBI. IEEE, 1331--1334.Google Scholar
Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified Data Processing on Large Clusters. Commun. ACM 51, 1 (Jan. 2008), 107--113. Google ScholarDigital Library
K. Venkatesh Emani, Karthik Ramachandra, Subhro Bhattacharya, and S. Sudarshan. 2016. Extracting Equivalent SQL from Imperative Code in Database Applications. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD '16). ACM, New York, NY, USA, 1781--1796. Google ScholarDigital Library
Grigory Fedyukovich, Maaz Bin Safeer Ahmad, and Rastislav Bodik. 2017. Gradual Synthesis for Static Parallelization of Single-pass Array-processing Programs. In Proceedings of the 38th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI 2017). ACM, New York, NY, USA, 572--585. Google ScholarDigital Library
Fiji: ImageJ 2018. https://github.com/fiji. (2018). Accessed on: 2018-04-09.Google Scholar
Sumit Gulwani. 2010. Dimensions in Program Synthesis. In Proceedings of the 12th International ACM SIGPLAN Symposium on Principles and Practice of Declarative Programming (PPDP '10). ACM, New York, NY, USA, 13--24. Google ScholarDigital Library
C. A. R. Hoare. 1969. An Axiomatic Basis for Computer Programming. Commun. ACM 12, 10 (Oct. 1969), 576--580. Google ScholarDigital Library
ImageJ 2018. https://imagej.net/Welcome. (2018). Accessed on: 2018-04-09.Google Scholar
Shoaib Kamil, Alvin Cheung, Shachar Itzhaky, and Armando Solar-Lezama. 2016. Verified Lifting of Stencil Computations. SIGPLAN Not. 51, 6 (June 2016), 711--726. Google ScholarDigital Library
Alfons Kemper, Thomas Neumann, Florian Funke, Viktor Leis, and Henrik Mühe. 2012. HyPer: Adapting Columnar Main-Memory Data Management for Transactional AND Query Processing. IEEE Data Eng. Bull. 35, 1 (2012), 46--51.Google Scholar
Yannis Klonatos, Christoph Koch, Tiark Rompf, and Hassan Chafi. 2014. Building Efficient Query Engines in a High-level Language. Proc. VLDB Endow. 7, 10 (June 2014), 853--864. Google ScholarDigital Library
K. Rustan M. Leino. 2010. Dafny: An Automatic Program Verifier for Functional Correctness. In Proceedings of the 16th International Conference on Logic for Programming, Artificial Intelligence, and Reasoning (LPAR'10). Springer-Verlag, Berlin, Heidelberg, 348--370. Google ScholarDigital Library
MagPie Analysis Repository 2018. https://github.com/thisMagpie/Analysis. (2018). Accessed on: 2018-04-09.Google Scholar
John Matthews, J. Strother Moore, Sandip Ray, and Daron Vroon. 2006. Verification Condition Generation via Theorem Proving. In Proceedings of the 13th International Conference on Logic for Programming, Artificial Intelligence, and Reasoning (LPAR'06). Springer-Verlag, Berlin, Heidelberg, 362--376. Google ScholarDigital Library
Cedric Nugteren and Henk Corporaal. 2012. Introducing 'Bones': A Parallelizing Source-to-source Compiler Based on Algorithmic Skeletons. In Proceedings of the 5th Annual Workshop on General Purpose Processing with Graphics Processing Units (GPGPU-5). ACM, New York, NY, USA, 1--10. Google ScholarDigital Library
Shoumik Palkar, James J. Thomas, Anil Shanbhag, Deepak Narayanan, Holger Pirk, Malte Schwarzkopf, Saman Amarasinghe, and Matei Zaharia. 2017. Weld: A Common Runtime for High Performance Data Analytics. (January 2017).Google Scholar
Spiros Papadimitriou and Jimeng Sun. 2008. DisCo: Distributed Co-clustering with Map-Reduce: A Case Study Towards Petabyte-Scale End-to-End Mining. In Proceedings of the 2008 Eighth IEEE International Conference on Data Mining (ICDM '08). IEEE Computer Society, Washington, DC, USA, 512--521. Google ScholarDigital Library
Polyglot 2018. http://www.cs.cornell.edu/Projects/polyglot/. (2018). Accessed on: 2018-04-09.Google Scholar
Cosmin Radoi, Stephen J. Fink, Rodric Rabbah, and Manu Sridharan. 2014. Translating Imperative Code to MapReduce. In Proceedings of the 2014 ACM International Conference on Object Oriented Programming Systems Languages &Applications (OOPSLA '14). ACM, New York, NY, USA, 909--927. Google ScholarDigital Library
Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary Bradski, and Christos Kozyrakis. 2007. Evaluating MapReduce for Multi-core and Multiprocessor Systems. In Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture (HPCA '07). IEEE Computer Society,Washington, DC, USA, 13--24. Google ScholarDigital Library
Veselin Raychev, Madanlal Musuvathi, and Todd Mytkowicz. 2015. Parallelizing User-defined Aggregations Using Symbolic Execution. In Proceedings of the 25th Symposium on Operating Systems Principles (SOSP '15). ACM, New York, NY, USA, 153--167. Google ScholarDigital Library
P. Griffiths Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. 1979. Access Path Selection in a Relational Database Management System. In Proceedings of the 1979 ACM SIGMOD International Conference on Management of Data (SIGMOD '79). ACM, New York, NY, USA, 23--34. Google ScholarDigital Library
Sketch 2018. https://people.csail.mit.edu/asolar/. (2018). Accessed on: 2018-04-09.Google Scholar
Yannis Smaragdakis and George Balatsouras. 2015. Pointer Analysis. Found. Trends Program. Lang. 2, 1 (April 2015), 1--69. Google ScholarDigital Library
Calvin Smith and Aws Albarghouthi. 2016. MapReduce Program Synthesis. SIGPLAN Not. 51, 6 (June 2016), 326--340. Google ScholarDigital Library
Armando Solar-Lezama. 2008. Program Synthesis by Sketching. Ph.D. Dissertation. Berkeley, CA, USA. Advisor(s) Bodik, Rastislav. Google ScholarDigital Library
Spark GitHub Repository 2018. https://github.com/apache/spark/tree/master/ examples/src/main/scala/org/apache/spark/examples. (2018). Accessed on: 2018- 01--20.Google Scholar
Glynn Winskel. 1993. The Formal Semantics of Programming Languages: An Introduction. MIT Press, Cambridge, MA, USA. Google ScholarDigital Library

Index Terms

Automatically Leveraging MapReduce Frameworks for Data-Intensive Applications
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Parallel and distributed DBMSs
        MapReduce-based systems
2. Software and its engineering
  1. Software notations and tools
    1. Compilers

Recommendations

Optimizing Data-Intensive Applications Automatically By Leveraging Parallel Data Processing Frameworks
SIGMOD '17: Proceedings of the 2017 ACM International Conference on Management of Data

In this demonstration we will showcase CASPER, a novel tool that enables sequential data-intensive programs to automatically leverage the optimizations provided by parallel data processing frameworks. The goal of CASPER is to reduce the inertia against ...
Read More
MapReduce: Review and open challenges

The continuous increase in computational capacity over the past years has produced an overwhelming flow of data or big data, which exceeds the capabilities of conventional processing tools. Big data signify a new era in data exploration and utilization. ...
Read More
MapReduce program synthesis
PLDI '16: Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation

By abstracting away the complexity of distributed systems, large-scale data processing platforms—MapReduce, Hadoop, Spark, Dryad, etc.—have provided developers with simple means for harnessing the power of the cloud. In this paper, we ask whether we ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data
May 2018
1874 pages
ISBN:9781450347037
DOI:10.1145/3183713
General Chairs:
Gautam Das
University of Texas at Arlington, USA
,
Christopher Jermaine
Rice University, USA
,
Philip Bernstein
Microsoft Research, USA
Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 27 May 2018
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
compilers
mapreduce
program synthesis
verification
verified lifting
Qualifiers
- research-article
Conference

Acceptance Rates
SIGMOD '18 Paper Acceptance Rate90of461submissions,20%Overall Acceptance Rate785of4,003submissions,20%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 22
  Total Citations
  View Citations
- 1,317
  Total Downloads
- Downloads (Last 12 months)184
- Downloads (Last 6 weeks)15
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Automatically Leveraging MapReduce Frameworks for Data-Intensive Applications

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Optimizing Data-Intensive Applications Automatically By Leveraging Parallel Data Processing Frameworks

MapReduce: Review and open challenges

MapReduce program synthesis

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Automatically Leveraging MapReduce Frameworks for Data-Intensive Applications

SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

ABSTRACT

References

Cited By

Index Terms

Recommendations

Optimizing Data-Intensive Applications Automatically By Leveraging Parallel Data Processing Frameworks

MapReduce: Review and open challenges

MapReduce program synthesis

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media