skip to main content
research-article

Building a high-level dataflow system on top of Map-Reduce: the Pig experience

Published:01 August 2009Publication History
Skip Abstract Section

Abstract

Increasingly, organizations capture, transform and analyze enormous data sets. Prominent examples include internet companies and e-science. The Map-Reduce scalable dataflow paradigm has become popular for these applications. Its simple, explicit dataflow programming model is favored by some over the traditional high-level declarative approach: SQL. On the other hand, the extreme simplicity of Map-Reduce leads to much low-level hacking to deal with the many-step, branching dataflows that arise in practice. Moreover, users must repeatedly code standard operations such as join by hand. These practices waste time, introduce bugs, harm readability, and impede optimizations.

Pig is a high-level dataflow system that aims at a sweet spot between SQL and Map-Reduce. Pig offers SQL-style high-level data manipulation constructs, which can be assembled in an explicit dataflow and interleaved with custom Map- and Reduce-style functions or executables. Pig programs are compiled into sequences of Map-Reduce jobs, and executed in the Hadoop Map-Reduce environment. Both Pig and Hadoop are open-source projects administered by the Apache Software Foundation.

This paper describes the challenges we faced in developing Pig, and reports performance comparisons between Pig execution and raw Map-Reduce execution.

References

  1. Hadoop: Open-source implementation of MapReduce. http://hadoop.apache.org.Google ScholarGoogle Scholar
  2. Pig Mix Benchmark. http://wiki.apache.org/pig/PigMix.Google ScholarGoogle Scholar
  3. The Hive Project. http://hadoop.apache.org/hive/.Google ScholarGoogle Scholar
  4. The Pig Project. http://hadoop.apache.org/pig.Google ScholarGoogle Scholar
  5. K. Beyer, V. Ercegovac, and E. Shekita. Jaql: A JSON query language. http://www.jaql.org/.Google ScholarGoogle Scholar
  6. R. Chaiken, B. Jenkins, P. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and efficient parallel processing of massive data sets. In Proc. VLDB, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Cloudera. http://www.cloudera.com.Google ScholarGoogle Scholar
  8. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. OSDI, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. D. J. DeWitt, J. F. Naughton, D. A. Schneider, and S. Seshadri. Practical skew handling in parallel joins. In Proc. VLDB, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. R. Epstein, M. Stonebraker, and E. Wong. Distributed query processing in a relational data base system. In Proc. ACM SIGMOD, 1978. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. G. Graefe. Volcano -- an extensible and parallel query evaluation system. IEEE TKDE, 6(1), 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. J. Data Mining and Knowledge Discovery, 1(1), 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. T. Hofmann. Latent semantic models for collaborative filtering. ACM Trans. Information Systems, 22(1), 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. C. Olston, S. Chopra, and U. Srivastava. Generating example data for dataflow programs. In Proc. ACM SIGMOD, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In Proc. ACM SIGMOD, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data: Parallel analysis with sawzall. Scientific Programming Journal, 13(4):227--298, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. J. Rao, H. Pirahesh, C. Mohan, and G. Lohman. Compiled query execution engine using JVM. In Proc. ICDE, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. M. A. Shah, M. J. Franklin, S. Madden, and J. M. Hellerstein. Java support for data-intensive systems: Experiences building the telegraph dataflow system. ACM SIGMOD Record, 30(4), 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. T. Strohman. Efficient Processing of Complex Features for Information Retrieval. PhD thesis, University of Massachusetts Amherst, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Y. Yu, M. Isard, D. Fetterly, M. Badiu, U. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proc. OSDI, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Building a high-level dataflow system on top of Map-Reduce: the Pig experience

          Recommendations

          Comments

          Login options

          Check if you have access through your login credentials or your institution to get full access on this article.

          Sign in

          Full Access

          • Published in

            cover image Proceedings of the VLDB Endowment
            Proceedings of the VLDB Endowment  Volume 2, Issue 2
            August 2009
            367 pages

            Publisher

            VLDB Endowment

            Publication History

            • Published: 1 August 2009
            Published in pvldb Volume 2, Issue 2

            Qualifiers

            • research-article

          PDF Format

          View or Download as a PDF file.

          PDF

          eReader

          View online with eReader.

          eReader