research-article

Nephele/PACTs: a programming model and execution framework for web-scale analytical processing

Authors:
Dominic Battré (Technische Universität Berlin, Berlin, Germany)
Stephan Ewen (Technische Universität Berlin, Berlin, Germany)
Fabian Hueske (Technische Universität Berlin, Berlin, Germany)
Odej Kao (Technische Universität Berlin, Berlin, Germany)
Volker Markl (Technische Universität Berlin, Berlin, Germany)
Daniel Warneke (Technische Universität Berlin, Berlin, Germany)

SoCC '10: Proceedings of the 1st ACM symposium on Cloud computing, June 2010, Pages 119–130
https://doi.org/10.1145/1807128.1807148

Published: 10 June 2010

ABSTRACT

We present a parallel data processor centered around a programming model of so-called Parallelization Contracts (PACTs) and the scalable parallel execution engine Nephele [18]. The PACT programming model is a generalization of the well-known map/reduce programming model, extending it with further second-order functions, as well as with Output Contracts that give guarantees about the behavior of a function. We describe methods to transform a PACT program into a data flow for Nephele, which executes its sequential building blocks in parallel and deals with communication, synchronization and fault tolerance. Our definition of PACTs allows several types of optimizations to be applied to the data flow during the transformation.

The system as a whole is designed to be as generic as (and compatible with) map/reduce systems, while overcoming several of their major weaknesses: 1) The functions map and reduce alone are not sufficient to express many data processing tasks both naturally and efficiently. 2) Map/reduce ties a program to a single fixed execution strategy, which is robust but highly suboptimal for many tasks. 3) Map/reduce makes no assumptions about the behavior of the functions. Hence, it offers only very limited optimization opportunities. With a set of examples and experiments, we illustrate how our system is able to naturally represent and efficiently execute several tasks that do not fit the map/reduce model well.
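To make the idea of second-order functions concrete, the following is a minimal single-process sketch, not the Nephele/PACT API. It models map and reduce as functions that take a user function and decide how records are handed to it, and adds one further two-input function that pairs records with equal keys. All names (`map_contract`, `reduce_contract`, `match_contract`, `same_key`) are hypothetical and chosen only for illustration. The `same_key` marker is an assumed stand-in for an output contract: a declaration that a user function leaves the key unchanged, which an optimizer could use to avoid repartitioning.

```python
from collections import defaultdict


def map_contract(records, udf):
    """Call the user function once for every (key, value) record."""
    out = []
    for key, value in records:
        out.extend(udf(key, value))
    return out


def reduce_contract(records, udf):
    """Group records by key and call the user function once per group."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    out = []
    for key in sorted(groups):
        out.extend(udf(key, groups[key]))
    return out


def match_contract(left, right, udf):
    """Two-input function: call the user function once for every pair
    of records from both inputs that share the same key."""
    right_by_key = defaultdict(list)
    for key, value in right:
        right_by_key[key].append(value)
    out = []
    for key, left_value in left:
        for right_value in right_by_key.get(key, []):
            out.extend(udf(key, left_value, right_value))
    return out


def same_key(udf):
    """Hypothetical output-contract marker: the function promises to
    emit records under the key it received."""
    udf.preserves_key = True
    return udf


def needs_repartition(udf):
    """An optimizer could skip a shuffle when the key is preserved."""
    return not getattr(udf, "preserves_key", False)


@same_key
def attach_customer(key, item, name):
    # Emits under the incoming key, so the declared contract holds.
    return [(key, (name, item))]


def count_items(key, values):
    return [(key, len(values))]


orders = [(1, "book"), (2, "pen"), (1, "lamp")]
customers = [(1, "Ada"), (2, "Bob")]

joined = match_contract(orders, customers, attach_customer)
counts = reduce_contract(joined, count_items)
```

In this sketch the join of `orders` and `customers` is written directly with the two-input function rather than being encoded in a map and a reduce step, and because `attach_customer` is marked as key-preserving, `needs_repartition(attach_customer)` is false, so the data would not have to be reshuffled before the following reduce on the same key.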

References

[1] Hadoop. URL: http://hadoop.apache.org.
[2] TPC-H. URL: http://www.tpc.org/tpch/.
[3] K. Beyer, V. Ercegovac, J. Rao, and E. Shekita. Jaql: A JSON Query Language. URL: http://jaql.org.
[4] R. Chaiken, B. Jenkins, P.-A. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and Efficient Parallel Processing of Massive Data Sets. PVLDB, 1(2):1265–1276, 2008.
[5] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, pages 137–150, 2004.
[6] J. Delmerico, N. Byrnes, A. Bruno, M. Jones, S. Gallo, and V. Chaudhary. Comparing the Performance of Clusters, Hadoop, and Active Disks on Microarray Correlation Computations. In International Conference on High Performance Computing, 2009.
[7] D. J. DeWitt, R. H. Gerber, G. Graefe, M. L. Heytens, K. B. Kumar, and M. Muralikrishna. GAMMA - A High Performance Dataflow Database Machine. In W. W. Chu, G. Gardarin, S. Ohsuga, and Y. Kambayashi, editors, VLDB, pages 228–237. Morgan Kaufmann, 1986.
[8] E. Friedman, P. M. Pawlowski, and J. Cieslewicz. SQL/MapReduce: A Practical Approach to Self-describing, Polymorphic, and Parallelizable User-defined Functions. PVLDB, 2(2):1402–1413, 2009.
[9] S. Fushimi, M. Kitsuregawa, and H. Tanaka. An Overview of The System Software of A Parallel Relational Database Machine GRACE. In W. W. Chu, G. Gardarin, S. Ohsuga, and Y. Kambayashi, editors, VLDB, pages 209–219. Morgan Kaufmann, 1986.
[10] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks. In P. Ferreira, T. R. Gross, and L. Veiga, editors, EuroSys, pages 59–72. ACM, 2007.
[11] C. Olston, B. Reed, A. Silberstein, and U. Srivastava. Automatic Optimization of Parallel Dataflow Programs. In R. Isaacs and Y. Zhou, editors, USENIX Annual Technical Conference, pages 267–273. USENIX Association, 2008.
[12] C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A Not-So-Foreign Language for Data Processing. In J. T.-L. Wang, editor, SIGMOD Conference, pages 1099–1110. ACM, 2008.
[13] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A Comparison of Approaches to Large-Scale Data Analysis. In U. Cetintemel, S. B. Zdonik, D. Kossmann, and N. Tatbul, editors, SIGMOD Conference, pages 165–178. ACM, 2009.
[14] D. A. Schneider and D. J. DeWitt. A Performance Evaluation of Four Parallel Join Algorithms in a Shared-Nothing Multiprocessor Environment. In J. Clifford, B. G. Lindsay, and D. Maier, editors, SIGMOD Conference, pages 110–121. ACM Press, 1989.
[15] P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access Path Selection in a Relational Database Management System. In P. A. Bernstein, editor, SIGMOD Conference, pages 23–34. ACM, 1979.
[16] J. W. Stamos and H. C. Young. A Symmetric Fragment and Replicate Algorithm for Distributed Joins. IEEE Trans. Parallel Distrib. Syst., 4(12):1345–1354, 1993.
[17] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive - A Warehousing Solution Over a Map-Reduce Framework. PVLDB, 2(2):1626–1629, 2009.
[18] D. Warneke and O. Kao. Nephele: Efficient Parallel Data Processing in the Cloud. In I. Raicu, I. T. Foster, and Y. Zhao, editors, SC-MTAGS. ACM, 2009.
[19] C. Yang, C. Yen, C. Tan, and S. Madden. Osprey: Implementing MapReduce-Style Fault Tolerance in a Shared-Nothing Distributed Database. In ICDE, 2009.
[20] H. Yang, A. Dasdan, R.-L. Hsiao, and D. S. Parker. Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters. In C. Y. Chan, B. C. Ooi, and A. Zhou, editors, SIGMOD Conference, pages 1029–1040. ACM, 2007.
[21] Y. Yu, M. Isard, D. Fetterly, M. Budiu, U. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. In R. Draves and R. van Renesse, editors, OSDI, pages 1–14. USENIX Association, 2008.

Index Terms

Nephele/PACTs: a programming model and execution framework for web-scale analytical processing
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Parallel and distributed DBMSs

Published in
SoCC '10: Proceedings of the 1st ACM symposium on Cloud computing
June 2010
264 pages
ISBN: 9781450300360
DOI: 10.1145/1807128
General Chair: Joseph M. Hellerstein (University of California, Berkeley, USA)
Program Chairs: Surajit Chaudhuri (Microsoft Research, USA), Mendel Rosenblum (Stanford University, USA)
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
cloud computing
map reduce
web-scale data
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 169 of 722 submissions, 23%
Article Metrics
- Total Citations: 200
- Total Downloads: 1,112
- Downloads (Last 12 months): 30
- Downloads (Last 6 weeks): 1