Abstract
Parallel dataflow systems are a central part of most analytic pipelines for big data. The iterative nature of many analysis and machine learning algorithms, however, is still a challenge for current systems. While certain types of bulk iterative algorithms are supported by novel dataflow frameworks, these systems cannot exploit computational dependencies present in many algorithms, such as graph algorithms. As a result, these algorithms are inefficiently executed and have led to specialized systems based on other paradigms, such as message passing or shared memory.
We propose a method to integrate incremental iterations, a form of workset iterations, with parallel dataflows. After showing how to integrate bulk iterations into a dataflow system and its optimizer, we present an extension to the programming model for incremental iterations. The extension alleviates for the lack of mutable state in dataflows and allows for exploiting the sparse computational dependencies inherent in many iterative algorithms. The evaluation of a prototypical implementation shows that those aspects lead to up to two orders of magnitude speedup in algorithm runtime, when exploited. In our experiments, the improved dataflow system is highly competitive with specialized systems while maintaining a transparent and unified dataflow abstraction.
- L. Afanasiev, T. Grust, M. Marx, J. Rittinger, and J. Teubner. Recursion in XQuery: Put your distributivity safety belt on. In EDBT, pp. 345--356, 2009. Google ScholarDigital Library
- F. N. Afrati, V. R. Borkar, M. J. Carey, N. Polyzotis, and J. D. Ullman. Map-reduce extensions and recursive queries. In EDBT, pp. 1--8, 2011. Google ScholarDigital Library
- Apache Giraph. http://incubator.apache.org/giraph/.Google Scholar
- Apache Hadoop. http://hadoop.apache.org.Google Scholar
- Apache Mahout. http://mahout.apache.org.Google Scholar
- F. Bancilhon, R. Ramakrishnan. An amateur's introduction to recursive query processing strategies. In SIGMOD, pp. 16--52, 1986. Google ScholarDigital Library
- D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, and D. Warneke. Nephele/PACTs: A programming model and execution framework for web-scale analytical processing. In SoCC, pp. 119--130, 2010. Google ScholarDigital Library
- C. Beeri and R. Ramakrishnan. On the power of Magic. In PODS, pp. 269--284, 1987. Google ScholarDigital Library
- A. Behm, V. R. Borkar, M. J. Carey, R. Grover, C. Li, N. Onose, R. Vernica, A. Deutsch, Y. Papakonstantinou, and V. J. Tsotras. ASTERIX: Towards a scalable, semistructured data platform for evolving-world models. Distributed and Parallel Databases, 29(3):185--216, 2011. Google ScholarDigital Library
- P. Boldi and S. Vigna. The WebGraph framework I: Compression techniques. In WWW, pp. 595--601, 2004. Google ScholarDigital Library
- V. R. Borkar, M. J. Carey, R. Grover, N. Onose, and R. Vernica. Hyracks: A flexible and extensible foundation for data-intensive computing. In ICDE, pp. 1151--1162, 2011. Google ScholarDigital Library
- Y. Bu, V. R. Borkar, M. J. Carey, J. Rosen, N. Polyzotis, T. Condie, M. Weimer, R. Ramakrishnan. Scaling Datalog for machine learning on Big Data. In CoRR, abs/1203.0160, 2012.Google Scholar
- Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. HaLoop: Efficient iterative data processing on large clusters. PVLDB, 3(1):285--296, 2010. Google ScholarDigital Library
- J. Cai and R. Paige. Program derivation by fixed point computation. Sci. Comput. Program., 11(3):197--261, 1989. Google ScholarDigital Library
- R. Chaiken, B. Jenkins, P.-Å. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and efficient parallel processing of massive data sets. PVLDB, 1(2):1265--1276, 2008. Google ScholarDigital Library
- J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pp. 137--150, 2004. Google ScholarDigital Library
- D. J. DeWitt, R. H. Gerber, G. Graefe, M. L. Heytens, K. B. Kumar, and M. Muralikrishna. GAMMA - A high performance dataflow database machine. In VLDB, pp. 228--237, 1986. Google ScholarDigital Library
- J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox. Twister: A runtime for iterative MapReduce. In HPDC, pp. 810--818, 2010. Google ScholarDigital Library
- S. Fushimi, M. Kitsuregawa, and H. Tanaka. An overview of the system software of a parallel relational database machine GRACE. In VLDB, pp. 209--219, 1986. Google ScholarDigital Library
- G. Graefe and W. J. McKenna. The Volcano optimizer generator: Extensibility and efficient search. In ICDE, pp. 209--218, 1993. Google ScholarDigital Library
- U. Güntzer, W. Kießling, and R. Bayer. On the evaluation of recursion in (deductive) database systems by efficient differential fixpoint iteration. In ICDE, pp. 120--129, 1987. Google ScholarDigital Library
- F. Hueske, M. Peters, M. J. Sax, A. Rheinländer, R. Bergmann, A. Krettek, K. Tzoumas. Opening the black boxes in data flow optimization. PVLDB, Vol. 5, 2012 (to appear). Google ScholarDigital Library
- M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In EuroSys, pp. 59--72, 2007. Google ScholarDigital Library
- J. B. Kam and J. D. Ullman. Monotone data flow analysis frameworks. Acta Inf., Vol. 7, No. 3: pp. 305--317, 1977.Google ScholarDigital Library
- S. Kamvar, T. Haveliwala, and G. Golub. Adaptive methods for the computation of PageRank. Technical Report 2003--26, Stanford InfoLab, 2003.Google Scholar
- U. Kang, C. E. Tsourakakis, and C. Faloutsos. Pegasus: A peta-scale graph mining system. In ICDM, pp. 229--238, 2009. Google ScholarDigital Library
- T-H. Lai, Y-C. Tseng, X. Dong. A more efficient message-optimal algorithm for distributed termination detection. In IPPS, pp. 646--649, 1992 Google ScholarDigital Library
- Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A new framework for parallel machine learning. In UAI, pp. 340--349, 2010.Google Scholar
- G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In SIGMOD, pp. 135--146, 2010. Google ScholarDigital Library
- D. G. Murray, M. Schwarzkopf, C. Smowton, S. Smith, A. Madhavapeddy, and S. Hand. Ciel: A universal execution engine for distributed data-flow computing. In NSDI, 2011. Google ScholarDigital Library
- L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report No. 1999--66, Stanford InfoLab, 1999.Google Scholar
- D. A. Schneider and D. J. DeWitt. A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment. In SIGMOD, pp. 110--121, 1989. Google ScholarDigital Library
- P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In SIGMOD, pp. 23--34, 1979. Google ScholarDigital Library
- L. G. Valiant. General purpose parallel architectures. In Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity (A), pp. 943--972, 1990. Google ScholarDigital Library
- M. Weimer, T. Condie, and R. Ramakrishnan. Machine learning in ScalOps, a higher order cloud computing language. In NIPS BigLearn, Vol. 9, pp. 389--396, 2011.Google Scholar
- M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In HotCloud, pp. 1--7, 2010. Google ScholarDigital Library
- M. Zinkevich, M. Weimer, A. J. Smola, and L. Li. Parallelized stochastic gradient descent. In NIPS, pp. 2595--2603, 2010.Google ScholarDigital Library
Recommendations
Efficient Iterative Methods Applied to the Solution of Transonic Flows
We investigate the use of an inexact Newton's method to solve the potential equations in the transonic regime. As a test case, we solve the two-dimensional steady transonic small disturbance equation. Approximate factorization/ADI techniques have ...
Comparison of the Parallel Fast Marching Method, the Fast Iterative Method, and the Parallel Semi-ordered Fast Iterative Method
Solving the eikonal equation allows to compute a monotone front propagation of anisotropic nature and is thus a widely applied technique in different areas of science and engineering. Various methods are available out of which only a subset is suitable ...
Iterative methods for variational and complementarity problems
In this paper, we study both the local and global convergence of various iterative methods for solving the variational inequality and the nonlinear complementarity problems. Included among such methods are the Newton and several successive ...
Comments