skip to main content
research-article

Spinning fast iterative data flows

Published:01 July 2012Publication History
Skip Abstract Section

Abstract

Parallel dataflow systems are a central part of most analytic pipelines for big data. The iterative nature of many analysis and machine learning algorithms, however, is still a challenge for current systems. While certain types of bulk iterative algorithms are supported by novel dataflow frameworks, these systems cannot exploit computational dependencies present in many algorithms, such as graph algorithms. As a result, these algorithms are inefficiently executed and have led to specialized systems based on other paradigms, such as message passing or shared memory.

We propose a method to integrate incremental iterations, a form of workset iterations, with parallel dataflows. After showing how to integrate bulk iterations into a dataflow system and its optimizer, we present an extension to the programming model for incremental iterations. The extension alleviates for the lack of mutable state in dataflows and allows for exploiting the sparse computational dependencies inherent in many iterative algorithms. The evaluation of a prototypical implementation shows that those aspects lead to up to two orders of magnitude speedup in algorithm runtime, when exploited. In our experiments, the improved dataflow system is highly competitive with specialized systems while maintaining a transparent and unified dataflow abstraction.

References

  1. L. Afanasiev, T. Grust, M. Marx, J. Rittinger, and J. Teubner. Recursion in XQuery: Put your distributivity safety belt on. In EDBT, pp. 345--356, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. F. N. Afrati, V. R. Borkar, M. J. Carey, N. Polyzotis, and J. D. Ullman. Map-reduce extensions and recursive queries. In EDBT, pp. 1--8, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Apache Giraph. http://incubator.apache.org/giraph/.Google ScholarGoogle Scholar
  4. Apache Hadoop. http://hadoop.apache.org.Google ScholarGoogle Scholar
  5. Apache Mahout. http://mahout.apache.org.Google ScholarGoogle Scholar
  6. F. Bancilhon, R. Ramakrishnan. An amateur's introduction to recursive query processing strategies. In SIGMOD, pp. 16--52, 1986. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, and D. Warneke. Nephele/PACTs: A programming model and execution framework for web-scale analytical processing. In SoCC, pp. 119--130, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. C. Beeri and R. Ramakrishnan. On the power of Magic. In PODS, pp. 269--284, 1987. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. A. Behm, V. R. Borkar, M. J. Carey, R. Grover, C. Li, N. Onose, R. Vernica, A. Deutsch, Y. Papakonstantinou, and V. J. Tsotras. ASTERIX: Towards a scalable, semistructured data platform for evolving-world models. Distributed and Parallel Databases, 29(3):185--216, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. P. Boldi and S. Vigna. The WebGraph framework I: Compression techniques. In WWW, pp. 595--601, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. V. R. Borkar, M. J. Carey, R. Grover, N. Onose, and R. Vernica. Hyracks: A flexible and extensible foundation for data-intensive computing. In ICDE, pp. 1151--1162, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. Y. Bu, V. R. Borkar, M. J. Carey, J. Rosen, N. Polyzotis, T. Condie, M. Weimer, R. Ramakrishnan. Scaling Datalog for machine learning on Big Data. In CoRR, abs/1203.0160, 2012.Google ScholarGoogle Scholar
  13. Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst. HaLoop: Efficient iterative data processing on large clusters. PVLDB, 3(1):285--296, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. J. Cai and R. Paige. Program derivation by fixed point computation. Sci. Comput. Program., 11(3):197--261, 1989. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. R. Chaiken, B. Jenkins, P.-Å. Larson, B. Ramsey, D. Shakib, S. Weaver, and J. Zhou. SCOPE: Easy and efficient parallel processing of massive data sets. PVLDB, 1(2):1265--1276, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In OSDI, pp. 137--150, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. D. J. DeWitt, R. H. Gerber, G. Graefe, M. L. Heytens, K. B. Kumar, and M. Muralikrishna. GAMMA - A high performance dataflow database machine. In VLDB, pp. 228--237, 1986. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox. Twister: A runtime for iterative MapReduce. In HPDC, pp. 810--818, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. S. Fushimi, M. Kitsuregawa, and H. Tanaka. An overview of the system software of a parallel relational database machine GRACE. In VLDB, pp. 209--219, 1986. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. G. Graefe and W. J. McKenna. The Volcano optimizer generator: Extensibility and efficient search. In ICDE, pp. 209--218, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. U. Güntzer, W. Kießling, and R. Bayer. On the evaluation of recursion in (deductive) database systems by efficient differential fixpoint iteration. In ICDE, pp. 120--129, 1987. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. F. Hueske, M. Peters, M. J. Sax, A. Rheinländer, R. Bergmann, A. Krettek, K. Tzoumas. Opening the black boxes in data flow optimization. PVLDB, Vol. 5, 2012 (to appear). Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly. Dryad: Distributed data-parallel programs from sequential building blocks. In EuroSys, pp. 59--72, 2007. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. J. B. Kam and J. D. Ullman. Monotone data flow analysis frameworks. Acta Inf., Vol. 7, No. 3: pp. 305--317, 1977.Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. S. Kamvar, T. Haveliwala, and G. Golub. Adaptive methods for the computation of PageRank. Technical Report 2003--26, Stanford InfoLab, 2003.Google ScholarGoogle Scholar
  26. U. Kang, C. E. Tsourakakis, and C. Faloutsos. Pegasus: A peta-scale graph mining system. In ICDM, pp. 229--238, 2009. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. T-H. Lai, Y-C. Tseng, X. Dong. A more efficient message-optimal algorithm for distributed termination detection. In IPPS, pp. 646--649, 1992 Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Y. Low, J. Gonzalez, A. Kyrola, D. Bickson, C. Guestrin, and J. M. Hellerstein. GraphLab: A new framework for parallel machine learning. In UAI, pp. 340--349, 2010.Google ScholarGoogle Scholar
  29. G. Malewicz, M. H. Austern, A. J. C. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski. Pregel: A system for large-scale graph processing. In SIGMOD, pp. 135--146, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. D. G. Murray, M. Schwarzkopf, C. Smowton, S. Smith, A. Madhavapeddy, and S. Hand. Ciel: A universal execution engine for distributed data-flow computing. In NSDI, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical Report No. 1999--66, Stanford InfoLab, 1999.Google ScholarGoogle Scholar
  32. D. A. Schneider and D. J. DeWitt. A performance evaluation of four parallel join algorithms in a shared-nothing multiprocessor environment. In SIGMOD, pp. 110--121, 1989. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. P. G. Selinger, M. M. Astrahan, D. D. Chamberlin, R. A. Lorie, and T. G. Price. Access path selection in a relational database management system. In SIGMOD, pp. 23--34, 1979. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. L. G. Valiant. General purpose parallel architectures. In Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity (A), pp. 943--972, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. M. Weimer, T. Condie, and R. Ramakrishnan. Machine learning in ScalOps, a higher order cloud computing language. In NIPS BigLearn, Vol. 9, pp. 389--396, 2011.Google ScholarGoogle Scholar
  36. M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In HotCloud, pp. 1--7, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. M. Zinkevich, M. Weimer, A. J. Smola, and L. Li. Parallelized stochastic gradient descent. In NIPS, pp. 2595--2603, 2010.Google ScholarGoogle ScholarDigital LibraryDigital Library

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 5, Issue 11
    July 2012
    608 pages

    Publisher

    VLDB Endowment

    Publication History

    • Published: 1 July 2012
    Published in pvldb Volume 5, Issue 11

    Qualifiers

    • research-article

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader