skip to main content
article

MapReduce program synthesis

Published:02 June 2016Publication History
Skip Abstract Section

Abstract

By abstracting away the complexity of distributed systems, large-scale data processing platforms—MapReduce, Hadoop, Spark, Dryad, etc.—have provided developers with simple means for harnessing the power of the cloud. In this paper, we ask whether we can automatically synthesize MapReduce-style distributed programs from input–output examples. Our ultimate goal is to enable end users to specify large-scale data analyses through the simple interface of examples. We thus present a new algorithm and tool for synthesizing programs composed of efficient data-parallel operations that can execute on cloud computing infrastructure. We evaluate our tool on a range of real-world big-data analysis tasks and general computations. Our results demonstrate the efficiency of our approach and the small number of examples it requires to synthesize correct, scalable programs.

References

  1. Amazon web services. aws.amazon.com.Google ScholarGoogle Scholar
  2. Apache flink. flink.apache.org.Google ScholarGoogle Scholar
  3. Apache spark. spark.apache.org.Google ScholarGoogle Scholar
  4. Enron emails dataset. cs.cmu.edu/˜enron/.Google ScholarGoogle Scholar
  5. Google cloud. cloud.google.com.Google ScholarGoogle Scholar
  6. Improving spark application performance. chapeau.freevariable.com/2014/09/ improving-spark-application-performance.html.Google ScholarGoogle Scholar
  7. Microsoft azure. azure.microsoft.com.Google ScholarGoogle Scholar
  8. Twitter streaming apis. dev.twitter.com/streaming.Google ScholarGoogle Scholar
  9. Wikipedia dumps. dumps.wikimedia.org.Google ScholarGoogle Scholar
  10. Yelp dataset challenge. yelp.com/dataset_challenge.Google ScholarGoogle Scholar
  11. Aws Albarghouthi, Sumit Gulwani, and Zachary Kincaid. Recursive program synthesis. In CAV, 2013.Google ScholarGoogle Scholar
  12. Sattam Alsubaiee, Yasser Altowim, Hotham Altwaijry, Alexander Behm, Vinayak R. Borkar, Yingyi Bu, Michael J. Carey, Inci Cetindil, Madhusudan Cheelangi, Khurram Faraaz, Eugenia Gabrielova, Raman Grover, Zachary Heilbron, Young-Seok Kim, Chen Li, Guangqiang Li, Ji Mahn Ok, Nicola Onose, Pouria Pirzadeh, Vassilis J. Tsotras, Rares Vernica, Jian Wen, and Till Westmann. Asterixdb: A scalable, open source BDMS. PVLDB, (14), 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Daniel W. Barowy, Sumit Gulwani, Ted Hart, and Benjamin G. Zorn. Flashrelate: extracting relational data from semi-structured spreadsheets using examples. In PLDI, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Gilles Barthe, Juan Manuel Crespo, Sumit Gulwani, César Kunz, and Mark Marron. From relational verification to SIMD loop synthesis. In PPOPP, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Gilles Barthe, Juan Manuel Crespo, and César Kunz. Relational verification using product programs. In FM, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Gilles Barthe, Pedro R. D’Argenio, and Tamara Rezk. Secure information flow by self-composition. In CSFW, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Gilles Barthe, Marco Gaboardi, Emilio Jes´us Gallego Arias, Justin Hsu, César Kunz, and Pierre-Yves Strub. Proving differential privacy in hoare logic. In CSF, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Nick Benton. Simple relational correctness proofs for static analyses and program transformations. In POPL, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent dirichlet allocation. JMLR, 3:993–1022, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. P. Oscar Boykin, Sam Ritchie, Ian O’Connell, and Jimmy Lin. Summingbird: A framework for integrating batch and online mapreduce computations. PVLDB, 7(13):1441–1451, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Michael Carbin, Deokhwan Kim, Sasa Misailovic, and Martin C. Rinard. Proving acceptability properties of relaxed nondeterministic approximate programs. In PLDI, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Yu-Fang Chen, Chih-Duo Hong, Nishant Sinha, and Bow-Yaw Wang. Commutativity of reducers. In TACAS, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Michael R. Clarkson and Fred B. Schneider. Hyperproperties. JCS, (6), 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Leonardo Mendonc¸a de Moura and Nikolaj Bjørner. Z3: an efficient SMT solver. In TACAS, 2008.Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Jeffrey Dean and Sanjay Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. John K. Feser, Swarat Chaudhuri, and Isil Dillig. Synthesizing data structure transformations from input-output examples. In PLDI, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Tim Freeman and Frank Pfenning. Refinement types for ML. In David S. Wise, editor, PLDI, 1991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, and Ion Stoica. Graphx: Graph processing in a distributed dataflow framework. In OSDI, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Sumit Gulwani. Automating string processing in spreadsheets using input-output examples. In POPL, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Sumit Gulwani, William R. Harris, and Rishabh Singh. Spreadsheet data manipulation using examples. CACM, (8), 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Tihomir Gvero, Viktor Kuncak, Ivan Kuraj, and Ruzica Piskac. Complete completion using types and weights. In PLDI, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Daniel Halperin, Victor Teixeira de Almeida, Lee Lee Choo, Shumo Chu, Paraschos Koutris, Dominik Moritz, Jennifer Ortiz, Vaspol Ruamviboonsuk, Jingjing Wang, Andrew Whitaker, Shengliang Xu, Magdalena Balazinska, Bill Howe, and Dan Suciu. Demonstration of the myria big data management service. In Curtis E. Dyreson, Feifei Li, and M. Tamer Özsu, editors, SIGMOD, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. William R. Harris and Sumit Gulwani. Spreadsheet table transformations from examples. In PLDI, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Susmit Jha, Sumit Gulwani, Sanjit A. Seshia, and Ashish Tiwari. Oracle-guided component-based program synthesis. In ICSE, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. David Walker Jonathan Frankle, Peter-Michael Osera and Steve Zdancewic. Example-directed synthesis: A typetheoretic interpretation. In POPL, 2016. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Sean Kandel, Andreas Paepcke, Joseph M. Hellerstein, and Jeffrey Heer. Wrangler: interactive visual specification of data transformation scripts. In Desney S. Tan, Saleema Amershi, Bo Begole, Wendy A. Kellogg, and Manas Tungare, editors, CHI, pages 3363–3372. ACM, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Emanuel Kitzelmann and Ute Schmid. Inductive synthesis of functional programs: An explanation based generalization approach. JMLR, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Etienne Kneuss, Ivan Kuraj, Viktor Kuncak, and Philippe Suter. Synthesis modulo recursive functions. In OOPSLA, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Viktor Kuncak, Mika¨el Mayer, Ruzica Piskac, and Philippe Suter. Complete functional synthesis. In PLDI, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Vu Le and Sumit Gulwani. Flashextract: a framework for data extraction by examples. In PLDI, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Jure Leskovec, Anand Rajaraman, and Jeffrey D. Ullman. Mining of Massive Datasets, 2nd Ed. Cambridge University Press, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. Graphlab: A new framework for parallel machine learning. In UAI, 2010.Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. Pregel: a system for large-scale graph processing. In SIGMOD, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Donald Miner and Adam Shook. MapReduce Design Patterns: Building Effective Algorithms and Analytics for Hadoop and Other Systems. O’Reilly, 1st edition, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Peter-Michael Osera and Steve Zdancewic. Type-andexample-directed program synthesis. In PLDI, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Daniel Perelman, Sumit Gulwani, Dan Grossman, and Peter Provost. Test-driven synthesis. In PLDI, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Nadia Polikarpova and Armando Solar-Lezama. Program synthesis from polymorphic refinement types. CoRR, abs/1510.08419, 2015.Google ScholarGoogle Scholar
  48. Oleksander Polozov and Sumit Gulwani. Flashmeta: A framework for inductive program synthesis. In OOPSLA, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Dimitrios Prountzos, Roman Manevich, and Keshav Pingali. Elixir: a system for synthesizing concurrent graph programs. In OOPSLA, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Dimitrios Prountzos, Roman Manevich, and Keshav Pingali. Synthesizing parallel graph programs via automated planning. In PLDI, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. Cosmin Radoi, Stephen J. Fink, Rodric M. Rabbah, and Manu Sridharan. Translating imperative code to mapreduce. In Andrew P. Black and Todd D. Millstein, editors, OOPSLA, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  52. Veselin Raychev, Madanlal Musuvathi, and Todd Mytkowicz. Parallelizing user-defined aggregations using symbolic execution. In SOSP, 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Rishabh Singh and Sumit Gulwani. Learning semantic string transformations from examples. PVLDB, (8), 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. Rishabh Singh and Sumit Gulwani. Synthesizing number transformations from input-output examples. In CAV, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. Phillip D. Summers. A methodology for lisp program construction from examples. In POPL, 1976. Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. Philippe Suter, Ali Sinan Köksal, and Viktor Kuncak. Satisfiability modulo recursive programs. In SAS, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. Quoc Trung Tran, Chee-Yong Chan, and Srinivasan Parthasarathy. Query by output. In Ugur C ¸ etintemel, Stanley B. Zdonik, Donald Kossmann, and Nesime Tatbul, editors, SIGMOD, pages 535–548. ACM, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. Tom White. Hadoop - The Definitive Guide: Storage and Analysis at Internet Scale. 2015. Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. Zhilei Xu, Shoaib Kamil, and Armando Solar-Lezama. MSL: A synthesis enabled language for distributed implementations. In SC, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. Yuan Yu, Pradeep Kumar Gunda, and Michael Isard. Distributed aggregation for data-parallel computing: interfaces and implementations. In SOSP, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, ´ Ulfar Erlingsson, Pradeep Kumar Gunda, and Jon Currey. Dryadlinq: A system for general-purpose distributed dataparallel computing using a high-level language. In OSDI, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J. Franklin, Scott Shenker, and Ion Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. Anna Zaks and Amir Pnueli. Covac: Compiler validation by program analysis of the cross-product. In FM, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  64. Sai Zhang and Yuyin Sun. Automatically synthesizing SQL queries from input-output examples. In Ewen Denney, Tevfik Bultan, and Andreas Zeller, editors, ASE, pages 224–234. IEEE, 2013.Google ScholarGoogle Scholar

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image ACM SIGPLAN Notices
    ACM SIGPLAN Notices  Volume 51, Issue 6
    PLDI '16
    June 2016
    726 pages
    ISSN:0362-1340
    EISSN:1558-1160
    DOI:10.1145/2980983
    • Editor:
    • Andy Gill
    Issue’s Table of Contents
    • cover image ACM Conferences
      PLDI '16: Proceedings of the 37th ACM SIGPLAN Conference on Programming Language Design and Implementation
      June 2016
      726 pages
      ISBN:9781450342612
      DOI:10.1145/2908080
      • General Chair:
      • Chandra Krintz,
      • Program Chair:
      • Emery Berger

    Copyright © 2016 ACM

    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 2 June 2016

    Check for updates

    Qualifiers

    • article

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader