article

MapReduce program synthesis

Authors:
Calvin Smith

University of Wisconsin-Madison, USA

University of Wisconsin-Madison, USA
View Profile

,
Aws Albarghouthi

University of Wisconsin-Madison, USA

University of Wisconsin-Madison, USA
View Profile

Authors Info & Claims

ACM SIGPLAN Notices Volume 51 Issue 6June 2016pp 326–340https://doi.org/10.1145/2980983.2908102

Published:02 June 2016Publication History

ACM SIGPLAN Notices

Abstract

By abstracting away the complexity of distributed systems, large-scale data processing platforms—MapReduce, Hadoop, Spark, Dryad, etc.—have provided developers with simple means for harnessing the power of the cloud. In this paper, we ask whether we can automatically synthesize MapReduce-style distributed programs from input–output examples. Our ultimate goal is to enable end users to specify large-scale data analyses through the simple interface of examples. We thus present a new algorithm and tool for synthesizing programs composed of efficient data-parallel operations that can execute on cloud computing infrastructure. We evaluate our tool on a range of real-world big-data analysis tasks and general computations. Our results demonstrate the efficiency of our approach and the small number of examples it requires to synthesize correct, scalable programs.

References

Amazon web services. aws.amazon.com.Google Scholar
Apache flink. flink.apache.org.Google Scholar
Apache spark. spark.apache.org.Google Scholar
Enron emails dataset. cs.cmu.edu/˜enron/.Google Scholar
Google cloud. cloud.google.com.Google Scholar
Improving spark application performance. chapeau.freevariable.com/2014/09/ improving-spark-application-performance.html.Google Scholar
Microsoft azure. azure.microsoft.com.Google Scholar
Twitter streaming apis. dev.twitter.com/streaming.Google Scholar
Wikipedia dumps. dumps.wikimedia.org.Google Scholar
Yelp dataset challenge. yelp.com/dataset_challenge.Google Scholar
Aws Albarghouthi, Sumit Gulwani, and Zachary Kincaid. Recursive program synthesis. In CAV, 2013.Google Scholar
Sattam Alsubaiee, Yasser Altowim, Hotham Altwaijry, Alexander Behm, Vinayak R. Borkar, Yingyi Bu, Michael J. Carey, Inci Cetindil, Madhusudan Cheelangi, Khurram Faraaz, Eugenia Gabrielova, Raman Grover, Zachary Heilbron, Young-Seok Kim, Chen Li, Guangqiang Li, Ji Mahn Ok, Nicola Onose, Pouria Pirzadeh, Vassilis J. Tsotras, Rares Vernica, Jian Wen, and Till Westmann. Asterixdb: A scalable, open source BDMS. PVLDB, (14), 2014. Google ScholarDigital Library
Daniel W. Barowy, Sumit Gulwani, Ted Hart, and Benjamin G. Zorn. Flashrelate: extracting relational data from semi-structured spreadsheets using examples. In PLDI, 2015. Google ScholarDigital Library
Gilles Barthe, Juan Manuel Crespo, Sumit Gulwani, César Kunz, and Mark Marron. From relational verification to SIMD loop synthesis. In PPOPP, 2013. Google ScholarDigital Library
Gilles Barthe, Juan Manuel Crespo, and César Kunz. Relational verification using product programs. In FM, 2011. Google ScholarDigital Library
Gilles Barthe, Pedro R. D’Argenio, and Tamara Rezk. Secure information flow by self-composition. In CSFW, 2004. Google ScholarDigital Library
Gilles Barthe, Marco Gaboardi, Emilio Jes´us Gallego Arias, Justin Hsu, César Kunz, and Pierre-Yves Strub. Proving differential privacy in hoare logic. In CSF, 2014. Google ScholarDigital Library
Nick Benton. Simple relational correctness proofs for static analyses and program transformations. In POPL, 2004. Google ScholarDigital Library
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. JMLR, 3:993–1022, 2003. Google ScholarDigital Library
P. Oscar Boykin, Sam Ritchie, Ian O’Connell, and Jimmy Lin. Summingbird: A framework for integrating batch and online mapreduce computations. PVLDB, 7(13):1441–1451, 2014. Google ScholarDigital Library
Michael Carbin, Deokhwan Kim, Sasa Misailovic, and Martin C. Rinard. Proving acceptability properties of relaxed nondeterministic approximate programs. In PLDI, 2012. Google ScholarDigital Library
Yu-Fang Chen, Chih-Duo Hong, Nishant Sinha, and Bow-Yaw Wang. Commutativity of reducers. In TACAS, 2015. Google ScholarDigital Library
Michael R. Clarkson and Fred B. Schneider. Hyperproperties. JCS, (6), 2010. Google ScholarDigital Library
Leonardo Mendonc¸a de Moura and Nikolaj Bjørner. Z3: an efficient SMT solver. In TACAS, 2008.Google ScholarDigital Library
Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, 2004. Google ScholarDigital Library
John K. Feser, Swarat Chaudhuri, and Isil Dillig. Synthesizing data structure transformations from input-output examples. In PLDI, 2015. Google ScholarDigital Library
Tim Freeman and Frank Pfenning. Refinement types for ML. In David S. Wise, editor, PLDI, 1991. Google ScholarDigital Library
Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, and Ion Stoica. Graphx: Graph processing in a distributed dataflow framework. In OSDI, 2014. Google ScholarDigital Library
Sumit Gulwani. Automating string processing in spreadsheets using input-output examples. In POPL, 2011. Google ScholarDigital Library
Sumit Gulwani, William R. Harris, and Rishabh Singh. Spreadsheet data manipulation using examples. CACM, (8), 2012. Google ScholarDigital Library
Tihomir Gvero, Viktor Kuncak, Ivan Kuraj, and Ruzica Piskac. Complete completion using types and weights. In PLDI, 2013. Google ScholarDigital Library
Daniel Halperin, Victor Teixeira de Almeida, Lee Lee Choo, Shumo Chu, Paraschos Koutris, Dominik Moritz, Jennifer Ortiz, Vaspol Ruamviboonsuk, Jingjing Wang, Andrew Whitaker, Shengliang Xu, Magdalena Balazinska, Bill Howe, and Dan Suciu. Demonstration of the myria big data management service. In Curtis E. Dyreson, Feifei Li, and M. Tamer Özsu, editors, SIGMOD, 2014. Google ScholarDigital Library
William R. Harris and Sumit Gulwani. Spreadsheet table transformations from examples. In PLDI, 2011. Google ScholarDigital Library
Susmit Jha, Sumit Gulwani, Sanjit A. Seshia, and Ashish Tiwari. Oracle-guided component-based program synthesis. In ICSE, 2010. Google ScholarDigital Library
David Walker Jonathan Frankle, Peter-Michael Osera and Steve Zdancewic. Example-directed synthesis: A typetheoretic interpretation. In POPL, 2016. Google ScholarDigital Library
Sean Kandel, Andreas Paepcke, Joseph M. Hellerstein, and Jeffrey Heer. Wrangler: interactive visual specification of data transformation scripts. In Desney S. Tan, Saleema Amershi, Bo Begole, Wendy A. Kellogg, and Manas Tungare, editors, CHI, pages 3363–3372. ACM, 2011. Google ScholarDigital Library
Emanuel Kitzelmann and Ute Schmid. Inductive synthesis of functional programs: An explanation based generalization approach. JMLR, 2006. Google ScholarDigital Library
Etienne Kneuss, Ivan Kuraj, Viktor Kuncak, and Philippe Suter. Synthesis modulo recursive functions. In OOPSLA, 2013. Google ScholarDigital Library
Viktor Kuncak, Mika¨el Mayer, Ruzica Piskac, and Philippe Suter. Complete functional synthesis. In PLDI, 2010. Google ScholarDigital Library
Vu Le and Sumit Gulwani. Flashextract: a framework for data extraction by examples. In PLDI, 2014. Google ScholarDigital Library
Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman. Mining of Massive Datasets, 2nd Ed. Cambridge University Press, 2014. Google ScholarDigital Library
Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. Graphlab: A new framework for parallel machine learning. In UAI, 2010.Google ScholarDigital Library
Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD, 2010. Google ScholarDigital Library
Donald Miner and Adam Shook. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. O’Reilly, 1st edition, 2012. Google ScholarDigital Library
Peter-Michael Osera and Steve Zdancewic. Type-andexample-directed program synthesis. In PLDI, 2015. Google ScholarDigital Library
Daniel Perelman, Sumit Gulwani, Dan Grossman, and Peter Provost. Test-driven synthesis. In PLDI, 2014. Google ScholarDigital Library
Nadia Polikarpova and Armando Solar-Lezama. Program synthesis from polymorphic refinement types. CoRR, abs/1510.08419, 2015.Google Scholar
Oleksander Polozov and Sumit Gulwani. Flashmeta: A framework for inductive program synthesis. In OOPSLA, 2015. Google ScholarDigital Library
Dimitrios Prountzos, Roman Manevich, and Keshav Pingali. Elixir: a system for synthesizing concurrent graph programs. In OOPSLA, 2012. Google ScholarDigital Library
Dimitrios Prountzos, Roman Manevich, and Keshav Pingali. Synthesizing parallel graph programs via automated planning. In PLDI, 2015. Google ScholarDigital Library
Cosmin Radoi, Stephen J. Fink, Rodric M. Rabbah, and Manu Sridharan. Translating imperative code to mapreduce. In Andrew P. Black and Todd D. Millstein, editors, OOPSLA, 2014. Google ScholarDigital Library
Veselin Raychev, Madanlal Musuvathi, and Todd Mytkowicz. Parallelizing user-defined aggregations using symbolic execution. In SOSP, 2015. Google ScholarDigital Library
Rishabh Singh and Sumit Gulwani. Learning semantic string transformations from examples. PVLDB, (8), 2012. Google ScholarDigital Library
Rishabh Singh and Sumit Gulwani. Synthesizing number transformations from input-output examples. In CAV, 2012. Google ScholarDigital Library
Phillip D. Summers. A methodology for lisp program construction from examples. In POPL, 1976. Google ScholarDigital Library
Philippe Suter, Ali Sinan Köksal, and Viktor Kuncak. Satisfiability modulo recursive programs. In SAS, 2011. Google ScholarDigital Library
Quoc Trung Tran, Chee-Yong Chan, and Srinivasan Parthasarathy. Query by output. In Ugur C ¸ etintemel, Stanley B. Zdonik, Donald Kossmann, and Nesime Tatbul, editors, SIGMOD, pages 535–548. ACM, 2009. Google ScholarDigital Library
Tom White. Hadoop - The Definitive Guide: Storage and Analysis at Internet Scale. 2015. Google ScholarDigital Library
Zhilei Xu, Shoaib Kamil, and Armando Solar-Lezama. MSL: A synthesis enabled language for distributed implementations. In SC, 2014. Google ScholarDigital Library
Yuan Yu, Pradeep Kumar Gunda, and Michael Isard. Distributed aggregation for data-parallel computing: interfaces and implementations. In SOSP, 2009. Google ScholarDigital Library
Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, ´ Ulfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. Dryadlinq: A system for general-purpose distributed dataparallel computing using a high-level language. In OSDI, 2008. Google ScholarDigital Library
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012. Google ScholarDigital Library
Anna Zaks and Amir Pnueli. Covac: Compiler validation by program analysis of the cross-product. In FM, 2008. Google ScholarDigital Library
Sai Zhang and Yuyin Sun. Automatically synthesizing SQL queries from input-output examples. In Ewen Denney, Tevfik Bultan, and Andreas Zeller, editors, ASE, pages 224–234. IEEE, 2013.Google Scholar

Recommendations

MapReduce program synthesis
PLDI '16: Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation

By abstracting away the complexity of distributed systems, large-scale data processing platforms—MapReduce, Hadoop, Spark, Dryad, etc.—have provided developers with simple means for harnessing the power of the cloud. In this paper, we ask whether we ...
Read More
Automatically Leveraging MapReduce Frameworks for Data-Intensive Applications
SIGMOD '18: Proceedings of the 2018 International Conference on Management of Data

MapReduce is a popular programming paradigm for developing large-scale, data-intensive computation. Many frameworks that implement this paradigm have recently been developed. To leverage these frameworks, however, developers must become familiar with ...
Read More
Algorithmic program synthesis: introduction

Program synthesis is a process of producing an executable program from a specification. Algorithmic synthesis produces the program automatically, without an intervention from an expert. While classical compilation falls under the definition of ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM SIGPLAN Notices Volume 51, Issue 6
PLDI '16
June 2016
726 pages
ISSN:0362-1340
EISSN:1558-1160
DOI:10.1145/2980983
Editor:
Andy Gill
University of Kansas, Lawrence, KS
Issue’s Table of Contents
PLDI '16: Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation
June 2016
726 pages
ISBN:9781450342612
DOI:10.1145/2908080
General Chair:
Chandra Krintz
University of California at Santa Barbara, USA
,
Program Chair:
Emery Berger
University of Massachusetts at Amherst, USA
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 2 June 2016
Check for updates
Author Tags
data analysis
program synthesis
verification
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 75
  Total Citations
  View Citations
- 877
  Total Downloads
- Downloads (Last 12 months)38
- Downloads (Last 6 weeks)4
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

MapReduce program synthesis

ACM SIGPLAN Notices

Abstract

References

Cited By

Recommendations

MapReduce program synthesis

Automatically Leveraging MapReduce Frameworks for Data-Intensive Applications

Algorithmic program synthesis: introduction