research-article

Building a high-level dataflow system on top of Map-Reduce: the Pig experience

Authors:
Alan F. Gates

Yahoo!, Inc.

Yahoo!, Inc.
View Profile

,
Olga Natkovich

Yahoo!, Inc.

Yahoo!, Inc.
View Profile

,
Shubham Chopra

Yahoo!, Inc.

Yahoo!, Inc.
View Profile

,
Pradeep Kamath

Yahoo!, Inc.

Yahoo!, Inc.
View Profile

,
Shravan M. Narayanamurthy

Yahoo!, Inc.

Yahoo!, Inc.
View Profile

,
Christopher Olston

Yahoo!, Inc.

Yahoo!, Inc.
View Profile

,
Benjamin Reed

Yahoo!, Inc.

Yahoo!, Inc.
View Profile

,
Santhosh Srinivasan

Yahoo!, Inc.

Yahoo!, Inc.
View Profile

,
Utkarsh Srivastava

Yahoo!, Inc.

Yahoo!, Inc.
View Profile

Proceedings of the VLDB Endowment Volume 2 Issue 2pp 1414–1425https://doi.org/10.14778/1687553.1687568

Published:01 August 2009Publication History

Proceedings of the VLDB Endowment

Abstract

Increasingly, organizations capture, transform and analyze enormous data sets. Prominent examples include internet companies and e-science. The Map-Reduce scalable dataflow paradigm has become popular for these applications. Its simple, explicit dataflow programming model is favored by some over the traditional high-level declarative approach: SQL. On the other hand, the extreme simplicity of Map-Reduce leads to much low-level hacking to deal with the many-step, branching dataflows that arise in practice. Moreover, users must repeatedly code standard operations such as join by hand. These practices waste time, introduce bugs, harm readability, and impede optimizations.

Pig is a high-level dataflow system that aims at a sweet spot between SQL and Map-Reduce. Pig offers SQL-style high-level data manipulation constructs, which can be assembled in an explicit dataflow and interleaved with custom Map- and Reduce-style functions or executables. Pig programs are compiled into sequences of Map-Reduce jobs, and executed in the Hadoop Map-Reduce environment. Both Pig and Hadoop are open-source projects administered by the Apache Software Foundation.

This paper describes the challenges we faced in developing Pig, and reports performance comparisons between Pig execution and raw Map-Reduce execution.

References

Hadoop: Open-source implementation of MapReduce. http://hadoop.apache.org.Google Scholar
Pig Mix Benchmark. http://wiki.apache.org/pig/PigMix.Google Scholar
The Hive Project. http://hadoop.apache.org/hive/.Google Scholar
The Pig Project. http://hadoop.apache.org/pig.Google Scholar
K. Beyer, V. Ercegovac, and E. Shekita. Jaql: A JSON query language. http://www.jaql.org/.Google Scholar
R. Chaiken, B. Jenkins, P. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and efficient parallel processing of massive data sets. In Proc. VLDB, 2008. Google ScholarDigital Library
Cloudera. http://www.cloudera.com.Google Scholar
J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proc. OSDI, 2004. Google ScholarDigital Library
D. J. DeWitt, J. F. Naughton, D. A. Schneider, and S. Seshadri. Practical skew handling in parallel joins. In Proc. VLDB, 1992. Google ScholarDigital Library
R. Epstein, M. Stonebraker, and E. Wong. Distributed query processing in a relational data base system. In Proc. ACM SIGMOD, 1978. Google ScholarDigital Library
G. Graefe. Volcano -- an extensible and parallel query evaluation system. IEEE TKDE, 6(1), 1994. Google ScholarDigital Library
J. Gray, S. Chaudhuri, A. Bosworth, A. Layman, D. Reichart, M. Venkatrao, F. Pellow, and H. Pirahesh. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. J. Data Mining and Knowledge Discovery, 1(1), 1997. Google ScholarDigital Library
T. Hofmann. Latent semantic models for collaborative filtering. ACM Trans. Information Systems, 22(1), 2004. Google ScholarDigital Library
C. Olston, S. Chopra, and U. Srivastava. Generating example data for dataflow programs. In Proc. ACM SIGMOD, 2009. Google ScholarDigital Library
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. In Proc. ACM SIGMOD, 2008. Google ScholarDigital Library
R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data: Parallel analysis with sawzall. Scientific Programming Journal, 13(4):227--298, 2005. Google ScholarDigital Library
J. Rao, H. Pirahesh, C. Mohan, and G. Lohman. Compiled query execution engine using JVM. In Proc. ICDE, 2006. Google ScholarDigital Library
M. A. Shah, M. J. Franklin, S. Madden, and J. M. Hellerstein. Java support for data-intensive systems: Experiences building the telegraph dataflow system. ACM SIGMOD Record, 30(4), 2001. Google ScholarDigital Library
T. Strohman. Efficient Processing of Complex Features for Information Retrieval. PhD thesis, University of Massachusetts Amherst, 2007. Google ScholarDigital Library
Y. Yu, M. Isard, D. Fetterly, M. Badiu, U. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. In Proc. OSDI, 2008. Google ScholarDigital Library

Index Terms

Building a high-level dataflow system on top of Map-Reduce: the Pig experience

Recommendations

Scale-out beyond map-reduce
KDD '13: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining

The amount and variety of data being collected in the enterprise is growing at a staggering pace. The default now is to capture and store any and all data, in anticipation of potential future strategic value, and vast amounts of data are being generated ...
Read More
In-Map/In-Reduce: Concurrent Job Execution in MapReduce
TRUSTCOM '14: Proceedings of the 2014 IEEE 13th International Conference on Trust, Security and Privacy in Computing and Communications

Hadoop based Map Reduce (MR) has emerged as big data processing mechanism in terms of its data intensive applications. In data intensive systems, analysis and visualizations as a result of various algorithms can lead to differentiable and comparable ...
Read More
Scale-out Beyond Map-Reduce
HIPC '15: Proceedings of the 2015 IEEE 22nd International Conference on High Performance Computing (HiPC)

Until recently, data was gathered for well-defined objectives such as auditing, forensics, reporting and line-ofbusiness operations; now, exploratory and predictive analysis is becoming ubiquitous, and the default increasingly is to capture and store ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

Proceedings of the VLDB Endowment Volume 2, Issue 2
August 2009
367 pages
ISSN:2150-8097
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 August 2009
Published in pvldb Volume 2, Issue 2
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 111
  Total Citations
  View Citations
- 2,251
  Total Downloads
- Downloads (Last 12 months)28
- Downloads (Last 6 weeks)7
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Building a high-level dataflow system on top of Map-Reduce: the Pig experience

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Scale-out beyond map-reduce

In-Map/In-Reduce: Concurrent Job Execution in MapReduce

Scale-out Beyond Map-Reduce

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Building a high-level dataflow system on top of Map-Reduce: the Pig experience

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Index Terms

Recommendations

Scale-out beyond map-reduce

In-Map/In-Reduce: Concurrent Job Execution in MapReduce

Scale-out Beyond Map-Reduce

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media