Abstract
Increasingly, organizations capture, transform and analyze enormous data sets. Prominent examples include internet companies and e-science. The Map-Reduce scalable dataflow paradigm has become popular for these applications. Its simple, explicit dataflow programming model is favored by some over the traditional high-level declarative approach: SQL. On the other hand, the extreme simplicity of Map-Reduce leads to much low-level hacking to deal with the many-step, branching dataflows that arise in practice. Moreover, users must repeatedly code standard operations such as join by hand. These practices waste time, introduce bugs, harm readability, and impede optimizations.
Pig is a high-level dataflow system that aims at a sweet spot between SQL and Map-Reduce. Pig offers SQL-style high-level data manipulation constructs, which can be assembled in an explicit dataflow and interleaved with custom Map- and Reduce-style functions or executables. Pig programs are compiled into sequences of Map-Reduce jobs, and executed in the Hadoop Map-Reduce environment. Both Pig and Hadoop are open-source projects administered by the Apache Software Foundation.
This paper describes the challenges we faced in developing Pig, and reports performance comparisons between Pig execution and raw Map-Reduce execution.
- Hadoop: Open-source implementation of MapReduce. http://hadoop.apache.org.Google Scholar
- Pig Mix Benchmark. http://wiki.apache.org/pig/PigMix.Google Scholar
- The Hive Project. http://hadoop.apache.org/hive/.Google Scholar
- The Pig Project. http://hadoop.apache.org/pig.Google Scholar
- K. Beyer, V. Ercegovac, and E. Shekita. Jaql: A JSON query language. http://www.jaql.org/.Google Scholar
- R. Chaiken, B. Jenkins, P. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and efficient parallel processing of massive data sets. In Proc. VLDB, 2008. Google ScholarDigital Library
- Cloudera. http://www.cloudera.com.Google Scholar
- J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. OSDI, 2004. Google ScholarDigital Library
- D. J. DeWitt, J. F. Naughton, D. A. Schneider, and S. Seshadri. Practical skew handling in parallel joins. In Proc. VLDB, 1992. Google ScholarDigital Library
- R. Epstein, M. Stonebraker, and E. Wong. Distributed query processing in a relational data base system. In Proc. ACM SIGMOD, 1978. Google ScholarDigital Library
- G. Graefe. Volcano -- an extensible and parallel query evaluation system. IEEE TKDE, 6(1), 1994. Google ScholarDigital Library
- J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. J. Data Mining and Knowledge Discovery, 1(1), 1997. Google ScholarDigital Library
- T. Hofmann. Latent semantic models for collaborative filtering. ACM Trans. Information Systems, 22(1), 2004. Google ScholarDigital Library
- C. Olston, S. Chopra, and U. Srivastava. Generating example data for dataflow programs. In Proc. ACM SIGMOD, 2009. Google ScholarDigital Library
- C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In Proc. ACM SIGMOD, 2008. Google ScholarDigital Library
- R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data: Parallel analysis with sawzall. Scientific Programming Journal, 13(4):227--298, 2005. Google ScholarDigital Library
- J. Rao, H. Pirahesh, C. Mohan, and G. Lohman. Compiled query execution engine using JVM. In Proc. ICDE, 2006. Google ScholarDigital Library
- M. A. Shah, M. J. Franklin, S. Madden, and J. M. Hellerstein. Java support for data-intensive systems: Experiences building the telegraph dataflow system. ACM SIGMOD Record, 30(4), 2001. Google ScholarDigital Library
- T. Strohman. Efficient Processing of Complex Features for Information Retrieval. PhD thesis, University of Massachusetts Amherst, 2007. Google ScholarDigital Library
- Y. Yu, M. Isard, D. Fetterly, M. Badiu, U. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proc. OSDI, 2008. Google ScholarDigital Library
Index Terms
- Building a high-level dataflow system on top of Map-Reduce: the Pig experience
Recommendations
Scale-out beyond map-reduce
KDD '13: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data miningThe amount and variety of data being collected in the enterprise is growing at a staggering pace. The default now is to capture and store any and all data, in anticipation of potential future strategic value, and vast amounts of data are being generated ...
In-Map/In-Reduce: Concurrent Job Execution in MapReduce
TRUSTCOM '14: Proceedings of the 2014 IEEE 13th International Conference on Trust, Security and Privacy in Computing and CommunicationsHadoop based Map Reduce (MR) has emerged as big data processing mechanism in terms of its data intensive applications. In data intensive systems, analysis and visualizations as a result of various algorithms can lead to differentiable and comparable ...
Scale-out Beyond Map-Reduce
HIPC '15: Proceedings of the 2015 IEEE 22nd International Conference on High Performance Computing (HiPC)Until recently, data was gathered for well-defined objectives such as auditing, forensics, reporting and line-ofbusiness operations; now, exploratory and predictive analysis is becoming ubiquitous, and the default increasingly is to capture and store ...
Comments