Abstract
By abstracting away the complexity of distributed systems, large-scale data processing platforms—MapReduce, Hadoop, Spark, Dryad, etc.—have provided developers with simple means for harnessing the power of the cloud. In this paper, we ask whether we can automatically synthesize MapReduce-style distributed programs from input–output examples. Our ultimate goal is to enable end users to specify large-scale data analyses through the simple interface of examples. We thus present a new algorithm and tool for synthesizing programs composed of efficient data-parallel operations that can execute on cloud computing infrastructure. We evaluate our tool on a range of real-world big-data analysis tasks and general computations. Our results demonstrate the efficiency of our approach and the small number of examples it requires to synthesize correct, scalable programs.
- Amazon web services. aws.amazon.com.Google Scholar
- Apache flink. flink.apache.org.Google Scholar
- Apache spark. spark.apache.org.Google Scholar
- Enron emails dataset. cs.cmu.edu/˜enron/.Google Scholar
- Google cloud. cloud.google.com.Google Scholar
- Improving spark application performance. chapeau.freevariable.com/2014/09/ improving-spark-application-performance.html.Google Scholar
- Microsoft azure. azure.microsoft.com.Google Scholar
- Twitter streaming apis. dev.twitter.com/streaming.Google Scholar
- Wikipedia dumps. dumps.wikimedia.org.Google Scholar
- Yelp dataset challenge. yelp.com/dataset_challenge.Google Scholar
- Aws Albarghouthi, Sumit Gulwani, and Zachary Kincaid. Recursive program synthesis. In CAV, 2013.Google Scholar
- Sattam Alsubaiee, Yasser Altowim, Hotham Altwaijry, Alexander Behm, Vinayak R. Borkar, Yingyi Bu, Michael J. Carey, Inci Cetindil, Madhusudan Cheelangi, Khurram Faraaz, Eugenia Gabrielova, Raman Grover, Zachary Heilbron, Young-Seok Kim, Chen Li, Guangqiang Li, Ji Mahn Ok, Nicola Onose, Pouria Pirzadeh, Vassilis J. Tsotras, Rares Vernica, Jian Wen, and Till Westmann. Asterixdb: A scalable, open source BDMS. PVLDB, (14), 2014. Google ScholarDigital Library
- Daniel W. Barowy, Sumit Gulwani, Ted Hart, and Benjamin G. Zorn. Flashrelate: extracting relational data from semi-structured spreadsheets using examples. In PLDI, 2015. Google ScholarDigital Library
- Gilles Barthe, Juan Manuel Crespo, Sumit Gulwani, César Kunz, and Mark Marron. From relational verification to SIMD loop synthesis. In PPOPP, 2013. Google ScholarDigital Library
- Gilles Barthe, Juan Manuel Crespo, and César Kunz. Relational verification using product programs. In FM, 2011. Google ScholarDigital Library
- Gilles Barthe, Pedro R. D’Argenio, and Tamara Rezk. Secure information flow by self-composition. In CSFW, 2004. Google ScholarDigital Library
- Gilles Barthe, Marco Gaboardi, Emilio Jes´us Gallego Arias, Justin Hsu, César Kunz, and Pierre-Yves Strub. Proving differential privacy in hoare logic. In CSF, 2014. Google ScholarDigital Library
- Nick Benton. Simple relational correctness proofs for static analyses and program transformations. In POPL, 2004. Google ScholarDigital Library
- David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. JMLR, 3:993–1022, 2003. Google ScholarDigital Library
- P. Oscar Boykin, Sam Ritchie, Ian O’Connell, and Jimmy Lin. Summingbird: A framework for integrating batch and online mapreduce computations. PVLDB, 7(13):1441–1451, 2014. Google ScholarDigital Library
- Michael Carbin, Deokhwan Kim, Sasa Misailovic, and Martin C. Rinard. Proving acceptability properties of relaxed nondeterministic approximate programs. In PLDI, 2012. Google ScholarDigital Library
- Yu-Fang Chen, Chih-Duo Hong, Nishant Sinha, and Bow-Yaw Wang. Commutativity of reducers. In TACAS, 2015. Google ScholarDigital Library
- Michael R. Clarkson and Fred B. Schneider. Hyperproperties. JCS, (6), 2010. Google ScholarDigital Library
- Leonardo Mendonc¸a de Moura and Nikolaj Bjørner. Z3: an efficient SMT solver. In TACAS, 2008.Google ScholarDigital Library
- Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, 2004. Google ScholarDigital Library
- John K. Feser, Swarat Chaudhuri, and Isil Dillig. Synthesizing data structure transformations from input-output examples. In PLDI, 2015. Google ScholarDigital Library
- Tim Freeman and Frank Pfenning. Refinement types for ML. In David S. Wise, editor, PLDI, 1991. Google ScholarDigital Library
- Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, and Ion Stoica. Graphx: Graph processing in a distributed dataflow framework. In OSDI, 2014. Google ScholarDigital Library
- Sumit Gulwani. Automating string processing in spreadsheets using input-output examples. In POPL, 2011. Google ScholarDigital Library
- Sumit Gulwani, William R. Harris, and Rishabh Singh. Spreadsheet data manipulation using examples. CACM, (8), 2012. Google ScholarDigital Library
- Tihomir Gvero, Viktor Kuncak, Ivan Kuraj, and Ruzica Piskac. Complete completion using types and weights. In PLDI, 2013. Google ScholarDigital Library
- Daniel Halperin, Victor Teixeira de Almeida, Lee Lee Choo, Shumo Chu, Paraschos Koutris, Dominik Moritz, Jennifer Ortiz, Vaspol Ruamviboonsuk, Jingjing Wang, Andrew Whitaker, Shengliang Xu, Magdalena Balazinska, Bill Howe, and Dan Suciu. Demonstration of the myria big data management service. In Curtis E. Dyreson, Feifei Li, and M. Tamer Özsu, editors, SIGMOD, 2014. Google ScholarDigital Library
- William R. Harris and Sumit Gulwani. Spreadsheet table transformations from examples. In PLDI, 2011. Google ScholarDigital Library
- Susmit Jha, Sumit Gulwani, Sanjit A. Seshia, and Ashish Tiwari. Oracle-guided component-based program synthesis. In ICSE, 2010. Google ScholarDigital Library
- David Walker Jonathan Frankle, Peter-Michael Osera and Steve Zdancewic. Example-directed synthesis: A typetheoretic interpretation. In POPL, 2016. Google ScholarDigital Library
- Sean Kandel, Andreas Paepcke, Joseph M. Hellerstein, and Jeffrey Heer. Wrangler: interactive visual specification of data transformation scripts. In Desney S. Tan, Saleema Amershi, Bo Begole, Wendy A. Kellogg, and Manas Tungare, editors, CHI, pages 3363–3372. ACM, 2011. Google ScholarDigital Library
- Emanuel Kitzelmann and Ute Schmid. Inductive synthesis of functional programs: An explanation based generalization approach. JMLR, 2006. Google ScholarDigital Library
- Etienne Kneuss, Ivan Kuraj, Viktor Kuncak, and Philippe Suter. Synthesis modulo recursive functions. In OOPSLA, 2013. Google ScholarDigital Library
- Viktor Kuncak, Mika¨el Mayer, Ruzica Piskac, and Philippe Suter. Complete functional synthesis. In PLDI, 2010. Google ScholarDigital Library
- Vu Le and Sumit Gulwani. Flashextract: a framework for data extraction by examples. In PLDI, 2014. Google ScholarDigital Library
- Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman. Mining of Massive Datasets, 2nd Ed. Cambridge University Press, 2014. Google ScholarDigital Library
- Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. Graphlab: A new framework for parallel machine learning. In UAI, 2010.Google ScholarDigital Library
- Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD, 2010. Google ScholarDigital Library
- Donald Miner and Adam Shook. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. O’Reilly, 1st edition, 2012. Google ScholarDigital Library
- Peter-Michael Osera and Steve Zdancewic. Type-andexample-directed program synthesis. In PLDI, 2015. Google ScholarDigital Library
- Daniel Perelman, Sumit Gulwani, Dan Grossman, and Peter Provost. Test-driven synthesis. In PLDI, 2014. Google ScholarDigital Library
- Nadia Polikarpova and Armando Solar-Lezama. Program synthesis from polymorphic refinement types. CoRR, abs/1510.08419, 2015.Google Scholar
- Oleksander Polozov and Sumit Gulwani. Flashmeta: A framework for inductive program synthesis. In OOPSLA, 2015. Google ScholarDigital Library
- Dimitrios Prountzos, Roman Manevich, and Keshav Pingali. Elixir: a system for synthesizing concurrent graph programs. In OOPSLA, 2012. Google ScholarDigital Library
- Dimitrios Prountzos, Roman Manevich, and Keshav Pingali. Synthesizing parallel graph programs via automated planning. In PLDI, 2015. Google ScholarDigital Library
- Cosmin Radoi, Stephen J. Fink, Rodric M. Rabbah, and Manu Sridharan. Translating imperative code to mapreduce. In Andrew P. Black and Todd D. Millstein, editors, OOPSLA, 2014. Google ScholarDigital Library
- Veselin Raychev, Madanlal Musuvathi, and Todd Mytkowicz. Parallelizing user-defined aggregations using symbolic execution. In SOSP, 2015. Google ScholarDigital Library
- Rishabh Singh and Sumit Gulwani. Learning semantic string transformations from examples. PVLDB, (8), 2012. Google ScholarDigital Library
- Rishabh Singh and Sumit Gulwani. Synthesizing number transformations from input-output examples. In CAV, 2012. Google ScholarDigital Library
- Phillip D. Summers. A methodology for lisp program construction from examples. In POPL, 1976. Google ScholarDigital Library
- Philippe Suter, Ali Sinan Köksal, and Viktor Kuncak. Satisfiability modulo recursive programs. In SAS, 2011. Google ScholarDigital Library
- Quoc Trung Tran, Chee-Yong Chan, and Srinivasan Parthasarathy. Query by output. In Ugur C ¸ etintemel, Stanley B. Zdonik, Donald Kossmann, and Nesime Tatbul, editors, SIGMOD, pages 535–548. ACM, 2009. Google ScholarDigital Library
- Tom White. Hadoop - The Definitive Guide: Storage and Analysis at Internet Scale. 2015. Google ScholarDigital Library
- Zhilei Xu, Shoaib Kamil, and Armando Solar-Lezama. MSL: A synthesis enabled language for distributed implementations. In SC, 2014. Google ScholarDigital Library
- Yuan Yu, Pradeep Kumar Gunda, and Michael Isard. Distributed aggregation for data-parallel computing: interfaces and implementations. In SOSP, 2009. Google ScholarDigital Library
- Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, ´ Ulfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. Dryadlinq: A system for general-purpose distributed dataparallel computing using a high-level language. In OSDI, 2008. Google ScholarDigital Library
- Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012. Google ScholarDigital Library
- Anna Zaks and Amir Pnueli. Covac: Compiler validation by program analysis of the cross-product. In FM, 2008. Google ScholarDigital Library
- Sai Zhang and Yuyin Sun. Automatically synthesizing SQL queries from input-output examples. In Ewen Denney, Tevfik Bultan, and Andreas Zeller, editors, ASE, pages 224–234. IEEE, 2013.Google Scholar
Recommendations
MapReduce program synthesis
PLDI '16: Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and ImplementationBy abstracting away the complexity of distributed systems, large-scale data processing platforms—MapReduce, Hadoop, Spark, Dryad, etc.—have provided developers with simple means for harnessing the power of the cloud. In this paper, we ask whether we ...
Automatically Leveraging MapReduce Frameworks for Data-Intensive Applications
SIGMOD '18: Proceedings of the 2018 International Conference on Management of DataMapReduce is a popular programming paradigm for developing large-scale, data-intensive computation. Many frameworks that implement this paradigm have recently been developed. To leverage these frameworks, however, developers must become familiar with ...
Algorithmic program synthesis: introduction
Program synthesis is a process of producing an executable program from a specification. Algorithmic synthesis produces the program automatically, without an intervention from an expert. While classical compilation falls under the definition of ...
Comments