skip to main content
research-article

Summingbird: a framework for integrating batch and online MapReduce computations

Published:01 August 2014Publication History
Skip Abstract Section

Abstract

Summingbird is an open-source domain-specific language implemented in Scala and designed to integrate online and batch MapReduce computations in a single framework. Summingbird programs are written using dataflow abstractions such as sources, sinks, and stores, and can run on different execution platforms: Hadoop for batch processing (via Scalding/Cascading) and Storm for online processing. Different execution modes require different bindings for the dataflow abstractions (e.g., HDFS files or message queues for the source) but do not require any changes to the program logic. Furthermore, Summingbird can operate in a hybrid processing mode that transparently integrates batch and online results to efficiently generate up-to-date aggregations over long time spans. The language was designed to improve developer productivity and address pain points in building analytics solutions at Twitter where often, the same code needs to be written twice (once for batch processing and again for online processing) and indefinitely maintained in parallel. Our key insight is that certain algebraic structures provide the theoretical foundation for integrating batch and online processing in a seamless fashion. This means that Summingbird imposes constraints on the types of aggregations that can be performed, although in practice we have not found these constraints to be overly restrictive for a broad range of analytics tasks at Twitter.

References

  1. T. Akidau, A. Balikov, K. Bekiroğlu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom, and S. Whittle. MillWheel: Fault-tolerant stream processing at internet scale. VLDB, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. R. Ananthanarayanan, V. Basker, S. Das, A. Gupta, H. Jiang, T. Qiu, A. Reznichenko, D. Ryabkov, M. Singh, and S. Venkataraman. Photon: Fault-tolerant and scalable joining of continuous data streams. SIGMOD, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. P. Bhatotia, A. Wieder, R. Rodrigues, U. A. Acar, and R. Pasquini. Incoop: MapReduce for incremental computations. SoCC, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. B. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422--426, 1970. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. L. Bottou. Large-scale machine learning with stochastic gradient descent. COMPSTAT, 2010.Google ScholarGoogle Scholar
  6. A. Z. Broder. On the resemblance and containment of documents. SEQUENCES, 1997. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring streams---a new class of data management applications. VLDB, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears. MapReduce online. NSDI, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. Journal of Algorithms, 55(1):58--75, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. P. Flajolet, É. Fusy, O. Gandouet, and F. Meunier. Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. Conference on Analysis of Algorithms, 2007.Google ScholarGoogle Scholar
  11. B. Gedik, H. Andrade, K.-L. Wu, P. Yu, and M. Doo. SPADE: The System S declarative stream processing engine. SIGMOD, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. J. Gehrke. Special issue on data stream processing. Bulletin of the Technical Committee on Data Engineering, 26(1):2, 2003.Google ScholarGoogle Scholar
  13. M. Grinev, M. Grineva, M. Hentschel, and D. Kossmann. Analytics for the real-time web. VLDB, 2011.Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. M. Hayes and S. Shah. Hourglass: A library for incremental processing on Hadoop. Big Data, 2013.Google ScholarGoogle Scholar
  15. S. Heule, M. Nunkesser, and A. Hall. HyperLogLog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm. EDBT, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica. Mesos: A platform for fine-grained resource sharing in the data center. NSDI, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. M. Izbicki. Algebraic classifiers: A generic approach to fast cross-validation, online training, and parallel training. ICML, 2013.Google ScholarGoogle Scholar
  18. J. Kreps, N. Narkhede, and J. Rao. Kafka: A distributed messaging system for log processing. NetDB, 2011.Google ScholarGoogle Scholar
  19. S. Krishnamurthy, M. Franklin, J. Davis, D. Farina, P. Golovko, A. Li, and N. Thombre. Continuous analytics over discontinuous streams. SIGMOD, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. V. Kumar, H. Andrade, B. Gedik, and K.-L. Wu. DEDUCE: At the intersection of MapReduce and stream processing. EDBT, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. W. Lam, L. Liu, S. Prasad, A. Rajaraman, Z. Vacheri, and A. Doan. Muppet: MapReduce-style processing of fast data. VLDB, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. B. Lampson. Hints for computer system design. SOSP, 1983. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. G. Lee, J. Lin, C. Liu, A. Lorek, and D. Ryaboy. The unified logging infrastructure for data analytics at Twitter. VLDB, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. J. Lin and A. Kolcz. Large-scale machine learning at Twitter. SIGMOD, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. J. Lin and D. Ryaboy. Scaling big data mining infrastructure: The Twitter experience. SIGKDD Explorations, 14(2):6--19, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. D. Logothetis, C. Olston, B. Reed, K. C. Webb, and K. Yocum. Stateful bulk processing for incremental analytics. SoCC, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. E. Meijer and G. Bierman. A co-relational model of data for large shared data banks. Communications of the ACM, 54(4):49--58, 2011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. G. Mishne, J. Dalton, Z. Li, A. Sharma, and J. Lin. Fast data in the era of big data: Twitter's real-time related query suggestion architecture. SIGMOD, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. S. A. Myers, A. Sharma, P. Gupta, and J. Lin. Information network or social network? The structure of the Twitter follow graph. WWW Companion, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. D. Navalho, S. Duarte, N. Preguiça, and M. Shapiro. Incremental stream processing using computational conflict-free replicated data types. CloudDP, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed stream computing platform. KDCloud Workshop at ICDM, 2010. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. SIGMOD, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming Journal, 13(4):277--298, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Z. Qian, Y. He, C. Su, Z. Wu, H. Zhu, T. Zhang, L. Zhou, Y. Yu, and Z. Zhang. TimeStream: Reliable stream computation in the cloud. EuroSys, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. M. Shapiro, N. Preguiça, C. Baquero, and M. Zawirski. A comprehensive study of Convergent and Commutative Replicated Data Types. Technical report, INRIA, 2011.Google ScholarGoogle Scholar
  36. D. Simoncelli, M. Dusi, F. Gringoli, and S. Niccolini. Stream-monitoring with BlockMon: Convergence of network measurements and data analytics platforms. ACM SIGCOMM Computer Communication Review, 43(2):30--35, 2013. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, N. Bhagat, S. Mittal, and D. Ryaboy. Storm @Twitter. SIGMOD, 2014. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. OSDI, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient Distributed Datasets: A fault-tolerant abstraction for in-memory cluster computing. NSDI, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica. Discretized streams: An efficient and fault-tolerant model for stream processing on large clusters. HotCloud, 2012. Google ScholarGoogle ScholarDigital LibraryDigital Library

Recommendations

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image Proceedings of the VLDB Endowment
    Proceedings of the VLDB Endowment  Volume 7, Issue 13
    August 2014
    466 pages
    ISSN:2150-8097
    Issue’s Table of Contents

    Publisher

    VLDB Endowment

    Publication History

    • Published: 1 August 2014
    Published in pvldb Volume 7, Issue 13

    Qualifiers

    • research-article

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader