research-article

Spinning fast iterative data flows

Authors:
Stephan Ewen

Technische Universität Berlin, Germany

Technische Universität Berlin, Germany
View Profile

,
Kostas Tzoumas

Technische Universität Berlin, Germany

Technische Universität Berlin, Germany
View Profile

,
Moritz Kaufmann

Technische Universität Berlin, Germany

Technische Universität Berlin, Germany
View Profile

,
Volker Markl

Technische Universität Berlin, Germany

Technische Universität Berlin, Germany
View Profile

Proceedings of the VLDB Endowment Volume 5 Issue 11pp 1268–1279https://doi.org/10.14778/2350229.2350245

Published:01 July 2012Publication History

Proceedings of the VLDB Endowment

Abstract

Parallel dataflow systems are a central part of most analytic pipelines for big data. The iterative nature of many analysis and machine learning algorithms, however, is still a challenge for current systems. While certain types of bulk iterative algorithms are supported by novel dataflow frameworks, these systems cannot exploit computational dependencies present in many algorithms, such as graph algorithms. As a result, these algorithms are inefficiently executed and have led to specialized systems based on other paradigms, such as message passing or shared memory.

We propose a method to integrate incremental iterations, a form of workset iterations, with parallel dataflows. After showing how to integrate bulk iterations into a dataflow system and its optimizer, we present an extension to the programming model for incremental iterations. The extension alleviates for the lack of mutable state in dataflows and allows for exploiting the sparse computational dependencies inherent in many iterative algorithms. The evaluation of a prototypical implementation shows that those aspects lead to up to two orders of magnitude speedup in algorithm runtime, when exploited. In our experiments, the improved dataflow system is highly competitive with specialized systems while maintaining a transparent and unified dataflow abstraction.

References

L. Afanasiev, T. Grust, M. Marx, J. Rittinger, and J. Teubner. Recursion in XQuery: Put your distributivity safety belt on. In EDBT, pp. 345--356, 2009. Google ScholarDigital Library
F. N. Afrati, V. R. Borkar, M. J. Carey, N. Polyzotis, and J. D. Ullman. Map-reduce extensions and recursive queries. In EDBT, pp. 1--8, 2011. Google ScholarDigital Library
Apache Giraph. http://incubator.apache.org/giraph/.Google Scholar
Apache Hadoop. http://hadoop.apache.org.Google Scholar
Apache Mahout. http://mahout.apache.org.Google Scholar
F. Bancilhon, R. Ramakrishnan. An amateur's introduction to recursive query processing strategies. In SIGMOD, pp. 16--52, 1986. Google ScholarDigital Library
D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, and D. Warneke. Nephele/PACTs: A programming model and execution framework for web-scale analytical processing. In SoCC, pp. 119--130, 2010. Google ScholarDigital Library
C. Beeri and R. Ramakrishnan. On the power of Magic. In PODS, pp. 269--284, 1987. Google ScholarDigital Library
A. Behm, V. R. Borkar, M. J. Carey, R. Grover, C. Li, N. Onose, R. Vernica, A. Deutsch, Y. Papakonstantinou, and V. J. Tsotras. ASTERIX: Towards a scalable, semistructured data platform for evolving-world models. Distributed and Parallel Databases, 29(3):185--216, 2011. Google ScholarDigital Library
P. Boldi and S. Vigna. The WebGraph framework I: Compression techniques. In WWW, pp. 595--601, 2004. Google ScholarDigital Library
V. R. Borkar, M. J. Carey, R. Grover, N. Onose, and R. Vernica. Hyracks: A flexible and extensible foundation for data-intensive computing. In ICDE, pp. 1151--1162, 2011. Google ScholarDigital Library
Y. Bu, V. R. Borkar, M. J. Carey, J. Rosen, N. Polyzotis, T. Condie, M. Weimer, R. Ramakrishnan. Scaling Datalog for machine learning on Big Data. In CoRR, abs/1203.0160, 2012.Google Scholar
Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. HaLoop: Efficient iterative data processing on large clusters. PVLDB, 3(1):285--296, 2010. Google ScholarDigital Library
J. Cai and R. Paige. Program derivation by fixed point computation. Sci. Comput. Program., 11(3):197--261, 1989. Google ScholarDigital Library
R. Chaiken, B. Jenkins, P.-Å. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and efficient parallel processing of massive data sets. PVLDB, 1(2):1265--1276, 2008. Google ScholarDigital Library
J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pp. 137--150, 2004. Google ScholarDigital Library
D. J. DeWitt, R. H. Gerber, G. Graefe, M. L. Heytens, K. B. Kumar, and M. Muralikrishna. GAMMA - A high performance dataflow database machine. In VLDB, pp. 228--237, 1986. Google ScholarDigital Library
J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox. Twister: A runtime for iterative MapReduce. In HPDC, pp. 810--818, 2010. Google ScholarDigital Library
S. Fushimi, M. Kitsuregawa, and H. Tanaka. An overview of the system software of a parallel relational database machine GRACE. In VLDB, pp. 209--219, 1986. Google ScholarDigital Library
G. Graefe and W. J. McKenna. The Volcano optimizer generator: Extensibility and efficient search. In ICDE, pp. 209--218, 1993. Google ScholarDigital Library
U. Güntzer, W. Kießling, and R. Bayer. On the evaluation of recursion in (deductive) database systems by efficient differential fixpoint iteration. In ICDE, pp. 120--129, 1987. Google ScholarDigital Library
F. Hueske, M. Peters, M. J. Sax, A. Rheinländer, R. Bergmann, A. Krettek, K. Tzoumas. Opening the black boxes in data flow optimization. PVLDB, Vol. 5, 2012 (to appear). Google ScholarDigital Library
M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In EuroSys, pp. 59--72, 2007. Google ScholarDigital Library
J. B. Kam and J. D. Ullman. Monotone data flow analysis frameworks. Acta Inf., Vol. 7, No. 3: pp. 305--317, 1977.Google ScholarDigital Library
S. Kamvar, T. Haveliwala, and G. Golub. Adaptive methods for the computation of PageRank. Technical Report 2003--26, Stanford InfoLab, 2003.Google Scholar
U. Kang, C. E. Tsourakakis, and C. Faloutsos. Pegasus: A peta-scale graph mining system. In ICDM, pp. 229--238, 2009. Google ScholarDigital Library
T-H. Lai, Y-C. Tseng, X. Dong. A more efficient message-optimal algorithm for distributed termination detection. In IPPS, pp. 646--649, 1992 Google ScholarDigital Library
Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A new framework for parallel machine learning. In UAI, pp. 340--349, 2010.Google Scholar
G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In SIGMOD, pp. 135--146, 2010. Google ScholarDigital Library
D. G. Murray, M. Schwarzkopf, C. Smowton, S. Smith, A. Madhavapeddy, and S. Hand. Ciel: A universal execution engine for distributed data-flow computing. In NSDI, 2011. Google ScholarDigital Library
L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report No. 1999--66, Stanford InfoLab, 1999.Google Scholar
D. A. Schneider and D. J. DeWitt. A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment. In SIGMOD, pp. 110--121, 1989. Google ScholarDigital Library
P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In SIGMOD, pp. 23--34, 1979. Google ScholarDigital Library
L. G. Valiant. General purpose parallel architectures. In Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity (A), pp. 943--972, 1990. Google ScholarDigital Library
M. Weimer, T. Condie, and R. Ramakrishnan. Machine learning in ScalOps, a higher order cloud computing language. In NIPS BigLearn, Vol. 9, pp. 389--396, 2011.Google Scholar
M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In HotCloud, pp. 1--7, 2010. Google ScholarDigital Library
M. Zinkevich, M. Weimer, A. J. Smola, and L. Li. Parallelized stochastic gradient descent. In NIPS, pp. 2595--2603, 2010.Google ScholarDigital Library

Recommendations

Efficient Iterative Methods Applied to the Solution of Transonic Flows

We investigate the use of an inexact Newton's method to solve the potential equations in the transonic regime. As a test case, we solve the two-dimensional steady transonic small disturbance equation. Approximate factorization/ADI techniques have ...
Read More
Comparison of the Parallel Fast Marching Method, the Fast Iterative Method, and the Parallel Semi-ordered Fast Iterative Method

Solving the eikonal equation allows to compute a monotone front propagation of anisotropic nature and is thus a widely applied technique in different areas of science and engineering. Various methods are available out of which only a subset is suitable ...
Read More
Iterative methods for variational and complementarity problems

In this paper, we study both the local and global convergence of various iterative methods for solving the variational inequality and the nonlinear complementarity problems. Included among such methods are the Newton and several successive ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

Proceedings of the VLDB Endowment Volume 5, Issue 11
July 2012
608 pages
ISSN:2150-8097
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 July 2012
Published in pvldb Volume 5, Issue 11
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 48
  Total Citations
  View Citations
- 541
  Total Downloads
- Downloads (Last 12 months)1
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Spinning fast iterative data flows

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

Efficient Iterative Methods Applied to the Solution of Transonic Flows

Comparison of the Parallel Fast Marching Method, the Fast Iterative Method, and the Parallel Semi-ordered Fast Iterative Method

Iterative methods for variational and complementarity problems

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Spinning fast iterative data flows

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

Efficient Iterative Methods Applied to the Solution of Transonic Flows

Comparison of the Parallel Fast Marching Method, the Fast Iterative Method, and the Parallel Semi-ordered Fast Iterative Method

Iterative methods for variational and complementarity problems

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media