research-article

Summingbird: a framework for integrating batch and online MapReduce computations

Authors:
Oscar Boykin

Twitter, Inc. San Francisco, California

Twitter, Inc. San Francisco, California
View Profile

,
Sam Ritchie

Twitter, Inc. San Francisco, California

Twitter, Inc. San Francisco, California
View Profile

,
Ian O'Connell

Twitter, Inc. San Francisco, California

Twitter, Inc. San Francisco, California
View Profile

,
Jimmy Lin

Twitter, Inc. San Francisco, California

Twitter, Inc. San Francisco, California
View Profile

Proceedings of the VLDB Endowment Volume 7 Issue 13pp 1441–1451https://doi.org/10.14778/2733004.2733016

Published:01 August 2014Publication History

Proceedings of the VLDB Endowment

Abstract

Summingbird is an open-source domain-specific language implemented in Scala and designed to integrate online and batch MapReduce computations in a single framework. Summingbird programs are written using dataflow abstractions such as sources, sinks, and stores, and can run on different execution platforms: Hadoop for batch processing (via Scalding/Cascading) and Storm for online processing. Different execution modes require different bindings for the dataflow abstractions (e.g., HDFS files or message queues for the source) but do not require any changes to the program logic. Furthermore, Summingbird can operate in a hybrid processing mode that transparently integrates batch and online results to efficiently generate up-to-date aggregations over long time spans. The language was designed to improve developer productivity and address pain points in building analytics solutions at Twitter where often, the same code needs to be written twice (once for batch processing and again for online processing) and indefinitely maintained in parallel. Our key insight is that certain algebraic structures provide the theoretical foundation for integrating batch and online processing in a seamless fashion. This means that Summingbird imposes constraints on the types of aggregations that can be performed, although in practice we have not found these constraints to be overly restrictive for a broad range of analytics tasks at Twitter.

References

T. Akidau, A. Balikov, K. Bekiroğlu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom, and S. Whittle. MillWheel: Fault-tolerant stream processing at internet scale. VLDB, 2013. Google ScholarDigital Library
R. Ananthanarayanan, V. Basker, S. Das, A. Gupta, H. Jiang, T. Qiu, A. Reznichenko, D. Ryabkov, M. Singh, and S. Venkataraman. Photon: Fault-tolerant and scalable joining of continuous data streams. SIGMOD, 2013. Google ScholarDigital Library
P. Bhatotia, A. Wieder, R. Rodrigues, U. A. Acar, and R. Pasquini. Incoop: MapReduce for incremental computations. SoCC, 2011. Google ScholarDigital Library
B. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422--426, 1970. Google ScholarDigital Library
L. Bottou. Large-scale machine learning with stochastic gradient descent. COMPSTAT, 2010.Google Scholar
A. Z. Broder. On the resemblance and containment of documents. SEQUENCES, 1997. Google ScholarDigital Library
D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring streams---a new class of data management applications. VLDB, 2002. Google ScholarDigital Library
T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears. MapReduce online. NSDI, 2010. Google ScholarDigital Library
G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. Journal of Algorithms, 55(1):58--75, 2005. Google ScholarDigital Library
P. Flajolet, É. Fusy, O. Gandouet, and F. Meunier. Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. Conference on Analysis of Algorithms, 2007.Google Scholar
B. Gedik, H. Andrade, K.-L. Wu, P. Yu, and M. Doo. SPADE: The System S declarative stream processing engine. SIGMOD, 2008. Google ScholarDigital Library
J. Gehrke. Special issue on data stream processing. Bulletin of the Technical Committee on Data Engineering, 26(1):2, 2003.Google Scholar
M. Grinev, M. Grineva, M. Hentschel, and D. Kossmann. Analytics for the real-time web. VLDB, 2011.Google ScholarDigital Library
M. Hayes and S. Shah. Hourglass: A library for incremental processing on Hadoop. Big Data, 2013.Google Scholar
S. Heule, M. Nunkesser, and A. Hall. HyperLogLog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm. EDBT, 2013. Google ScholarDigital Library
B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica. Mesos: A platform for fine-grained resource sharing in the data center. NSDI, 2011. Google ScholarDigital Library
M. Izbicki. Algebraic classifiers: A generic approach to fast cross-validation, online training, and parallel training. ICML, 2013.Google Scholar
J. Kreps, N. Narkhede, and J. Rao. Kafka: A distributed messaging system for log processing. NetDB, 2011.Google Scholar
S. Krishnamurthy, M. Franklin, J. Davis, D. Farina, P. Golovko, A. Li, and N. Thombre. Continuous analytics over discontinuous streams. SIGMOD, 2010. Google ScholarDigital Library
V. Kumar, H. Andrade, B. Gedik, and K.-L. Wu. DEDUCE: At the intersection of MapReduce and stream processing. EDBT, 2010. Google ScholarDigital Library
W. Lam, L. Liu, S. Prasad, A. Rajaraman, Z. Vacheri, and A. Doan. Muppet: MapReduce-style processing of fast data. VLDB, 2012. Google ScholarDigital Library
B. Lampson. Hints for computer system design. SOSP, 1983. Google ScholarDigital Library
G. Lee, J. Lin, C. Liu, A. Lorek, and D. Ryaboy. The unified logging infrastructure for data analytics at Twitter. VLDB, 2012. Google ScholarDigital Library
J. Lin and A. Kolcz. Large-scale machine learning at Twitter. SIGMOD, 2012. Google ScholarDigital Library
J. Lin and D. Ryaboy. Scaling big data mining infrastructure: The Twitter experience. SIGKDD Explorations, 14(2):6--19, 2012. Google ScholarDigital Library
D. Logothetis, C. Olston, B. Reed, K. C. Webb, and K. Yocum. Stateful bulk processing for incremental analytics. SoCC, 2010. Google ScholarDigital Library
E. Meijer and G. Bierman. A co-relational model of data for large shared data banks. Communications of the ACM, 54(4):49--58, 2011. Google ScholarDigital Library
G. Mishne, J. Dalton, Z. Li, A. Sharma, and J. Lin. Fast data in the era of big data: Twitter's real-time related query suggestion architecture. SIGMOD, 2013. Google ScholarDigital Library
S. A. Myers, A. Sharma, P. Gupta, and J. Lin. Information network or social network? The structure of the Twitter follow graph. WWW Companion, 2014. Google ScholarDigital Library
D. Navalho, S. Duarte, N. Preguiça, and M. Shapiro. Incremental stream processing using computational conflict-free replicated data types. CloudDP, 2013. Google ScholarDigital Library
L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed stream computing platform. KDCloud Workshop at ICDM, 2010. Google ScholarDigital Library
C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. SIGMOD, 2008. Google ScholarDigital Library
R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming Journal, 13(4):277--298, 2005. Google ScholarDigital Library
Z. Qian, Y. He, C. Su, Z. Wu, H. Zhu, T. Zhang, L. Zhou, Y. Yu, and Z. Zhang. TimeStream: Reliable stream computation in the cloud. EuroSys, 2013. Google ScholarDigital Library
M. Shapiro, N. Preguiça, C. Baquero, and M. Zawirski. A comprehensive study of Convergent and Commutative Replicated Data Types. Technical report, INRIA, 2011.Google Scholar
D. Simoncelli, M. Dusi, F. Gringoli, and S. Niccolini. Stream-monitoring with BlockMon: Convergence of network measurements and data analytics platforms. ACM SIGCOMM Computer Communication Review, 43(2):30--35, 2013. Google ScholarDigital Library
A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, N. Bhagat, S. Mittal, and D. Ryaboy. Storm @Twitter. SIGMOD, 2014. Google ScholarDigital Library
Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. OSDI, 2008. Google ScholarDigital Library
M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient Distributed Datasets: A fault-tolerant abstraction for in-memory cluster computing. NSDI, 2012. Google ScholarDigital Library
M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica. Discretized streams: An efficient and fault-tolerant model for stream processing on large clusters. HotCloud, 2012. Google ScholarDigital Library

Recommendations

Layout-sensitive language extensibility with SugarHaskell
Haskell '12: Proceedings of the 2012 Haskell Symposium

Programmers need convenient syntax to write elegant and concise programs. Consequently, the Haskell standard provides syntactic sugar for some scenarios (e.g., do notation for monadic code), authors of Haskell compilers provide syntactic sugar for more ...
Read More
Layout-sensitive language extensibility with SugarHaskell
Haskell '12

Programmers need convenient syntax to write elegant and concise programs. Consequently, the Haskell standard provides syntactic sugar for some scenarios (e.g., do notation for monadic code), authors of Haskell compilers provide syntactic sugar for more ...
Read More
Business leadership: Influence and Lead ! Fundamentals for Personal and Profess
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
Proceedings of the VLDB Endowment Volume 7, Issue 13
August 2014
466 pages
ISSN:2150-8097
Editors:
H. V. Jagadish
University of Michigan
,
Aoying Zhou
East Normal University, China
Issue’s Table of Contents
Sponsors
In-Cooperation
Publisher
VLDB Endowment
Publication History
- Published: 1 August 2014
Published in pvldb Volume 7, Issue 13
Qualifiers
- research-article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 35
  Total Citations
  View Citations
- 479
  Total Downloads
- Downloads (Last 12 months)10
- Downloads (Last 6 weeks)2
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Summingbird: a framework for integrating batch and online MapReduce computations

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

Layout-sensitive language extensibility with SugarHaskell

Layout-sensitive language extensibility with SugarHaskell

Business leadership: Influence and Lead ! Fundamentals for Personal and Profess

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Summingbird: a framework for integrating batch and online MapReduce computations

Proceedings of the VLDB Endowment

Abstract

References

Cited By

Recommendations

Layout-sensitive language extensibility with SugarHaskell

Layout-sensitive language extensibility with SugarHaskell

Business leadership: Influence and Lead ! Fundamentals for Personal and Profess

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media