Abstract
Summingbird is an open-source domain-specific language implemented in Scala and designed to integrate online and batch MapReduce computations in a single framework. Summingbird programs are written using dataflow abstractions such as sources, sinks, and stores, and can run on different execution platforms: Hadoop for batch processing (via Scalding/Cascading) and Storm for online processing. Different execution modes require different bindings for the dataflow abstractions (e.g., HDFS files or message queues for the source) but do not require any changes to the program logic. Furthermore, Summingbird can operate in a hybrid processing mode that transparently integrates batch and online results to efficiently generate up-to-date aggregations over long time spans. The language was designed to improve developer productivity and address pain points in building analytics solutions at Twitter where often, the same code needs to be written twice (once for batch processing and again for online processing) and indefinitely maintained in parallel. Our key insight is that certain algebraic structures provide the theoretical foundation for integrating batch and online processing in a seamless fashion. This means that Summingbird imposes constraints on the types of aggregations that can be performed, although in practice we have not found these constraints to be overly restrictive for a broad range of analytics tasks at Twitter.
- T. Akidau, A. Balikov, K. Bekiroğlu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom, and S. Whittle. MillWheel: Fault-tolerant stream processing at internet scale. VLDB, 2013. Google ScholarDigital Library
- R. Ananthanarayanan, V. Basker, S. Das, A. Gupta, H. Jiang, T. Qiu, A. Reznichenko, D. Ryabkov, M. Singh, and S. Venkataraman. Photon: Fault-tolerant and scalable joining of continuous data streams. SIGMOD, 2013. Google ScholarDigital Library
- P. Bhatotia, A. Wieder, R. Rodrigues, U. A. Acar, and R. Pasquini. Incoop: MapReduce for incremental computations. SoCC, 2011. Google ScholarDigital Library
- B. Bloom. Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, 13(7):422--426, 1970. Google ScholarDigital Library
- L. Bottou. Large-scale machine learning with stochastic gradient descent. COMPSTAT, 2010.Google Scholar
- A. Z. Broder. On the resemblance and containment of documents. SEQUENCES, 1997. Google ScholarDigital Library
- D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, G. Seidman, M. Stonebraker, N. Tatbul, and S. Zdonik. Monitoring streams---a new class of data management applications. VLDB, 2002. Google ScholarDigital Library
- T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein, K. Elmeleegy, and R. Sears. MapReduce online. NSDI, 2010. Google ScholarDigital Library
- G. Cormode and S. Muthukrishnan. An improved data stream summary: The count-min sketch and its applications. Journal of Algorithms, 55(1):58--75, 2005. Google ScholarDigital Library
- P. Flajolet, É. Fusy, O. Gandouet, and F. Meunier. Hyperloglog: The analysis of a near-optimal cardinality estimation algorithm. Conference on Analysis of Algorithms, 2007.Google Scholar
- B. Gedik, H. Andrade, K.-L. Wu, P. Yu, and M. Doo. SPADE: The System S declarative stream processing engine. SIGMOD, 2008. Google ScholarDigital Library
- J. Gehrke. Special issue on data stream processing. Bulletin of the Technical Committee on Data Engineering, 26(1):2, 2003.Google Scholar
- M. Grinev, M. Grineva, M. Hentschel, and D. Kossmann. Analytics for the real-time web. VLDB, 2011.Google ScholarDigital Library
- M. Hayes and S. Shah. Hourglass: A library for incremental processing on Hadoop. Big Data, 2013.Google Scholar
- S. Heule, M. Nunkesser, and A. Hall. HyperLogLog in practice: Algorithmic engineering of a state of the art cardinality estimation algorithm. EDBT, 2013. Google ScholarDigital Library
- B. Hindman, A. Konwinski, M. Zaharia, A. Ghodsi, A. D. Joseph, R. Katz, S. Shenker, and I. Stoica. Mesos: A platform for fine-grained resource sharing in the data center. NSDI, 2011. Google ScholarDigital Library
- M. Izbicki. Algebraic classifiers: A generic approach to fast cross-validation, online training, and parallel training. ICML, 2013.Google Scholar
- J. Kreps, N. Narkhede, and J. Rao. Kafka: A distributed messaging system for log processing. NetDB, 2011.Google Scholar
- S. Krishnamurthy, M. Franklin, J. Davis, D. Farina, P. Golovko, A. Li, and N. Thombre. Continuous analytics over discontinuous streams. SIGMOD, 2010. Google ScholarDigital Library
- V. Kumar, H. Andrade, B. Gedik, and K.-L. Wu. DEDUCE: At the intersection of MapReduce and stream processing. EDBT, 2010. Google ScholarDigital Library
- W. Lam, L. Liu, S. Prasad, A. Rajaraman, Z. Vacheri, and A. Doan. Muppet: MapReduce-style processing of fast data. VLDB, 2012. Google ScholarDigital Library
- B. Lampson. Hints for computer system design. SOSP, 1983. Google ScholarDigital Library
- G. Lee, J. Lin, C. Liu, A. Lorek, and D. Ryaboy. The unified logging infrastructure for data analytics at Twitter. VLDB, 2012. Google ScholarDigital Library
- J. Lin and A. Kolcz. Large-scale machine learning at Twitter. SIGMOD, 2012. Google ScholarDigital Library
- J. Lin and D. Ryaboy. Scaling big data mining infrastructure: The Twitter experience. SIGKDD Explorations, 14(2):6--19, 2012. Google ScholarDigital Library
- D. Logothetis, C. Olston, B. Reed, K. C. Webb, and K. Yocum. Stateful bulk processing for incremental analytics. SoCC, 2010. Google ScholarDigital Library
- E. Meijer and G. Bierman. A co-relational model of data for large shared data banks. Communications of the ACM, 54(4):49--58, 2011. Google ScholarDigital Library
- G. Mishne, J. Dalton, Z. Li, A. Sharma, and J. Lin. Fast data in the era of big data: Twitter's real-time related query suggestion architecture. SIGMOD, 2013. Google ScholarDigital Library
- S. A. Myers, A. Sharma, P. Gupta, and J. Lin. Information network or social network? The structure of the Twitter follow graph. WWW Companion, 2014. Google ScholarDigital Library
- D. Navalho, S. Duarte, N. Preguiça, and M. Shapiro. Incremental stream processing using computational conflict-free replicated data types. CloudDP, 2013. Google ScholarDigital Library
- L. Neumeyer, B. Robbins, A. Nair, and A. Kesari. S4: Distributed stream computing platform. KDCloud Workshop at ICDM, 2010. Google ScholarDigital Library
- C. Olston, B. Reed, U. Srivastava, R. Kumar, and A. Tomkins. Pig Latin: A not-so-foreign language for data processing. SIGMOD, 2008. Google ScholarDigital Library
- R. Pike, S. Dorward, R. Griesemer, and S. Quinlan. Interpreting the data: Parallel analysis with Sawzall. Scientific Programming Journal, 13(4):277--298, 2005. Google ScholarDigital Library
- Z. Qian, Y. He, C. Su, Z. Wu, H. Zhu, T. Zhang, L. Zhou, Y. Yu, and Z. Zhang. TimeStream: Reliable stream computation in the cloud. EuroSys, 2013. Google ScholarDigital Library
- M. Shapiro, N. Preguiça, C. Baquero, and M. Zawirski. A comprehensive study of Convergent and Commutative Replicated Data Types. Technical report, INRIA, 2011.Google Scholar
- D. Simoncelli, M. Dusi, F. Gringoli, and S. Niccolini. Stream-monitoring with BlockMon: Convergence of network measurements and data analytics platforms. ACM SIGCOMM Computer Communication Review, 43(2):30--35, 2013. Google ScholarDigital Library
- A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, N. Bhagat, S. Mittal, and D. Ryaboy. Storm @Twitter. SIGMOD, 2014. Google ScholarDigital Library
- Y. Yu, M. Isard, D. Fetterly, M. Budiu, Ú. Erlingsson, P. K. Gunda, and J. Currey. DryadLINQ: A system for general-purpose distributed data-parallel computing using a high-level language. OSDI, 2008. Google ScholarDigital Library
- M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient Distributed Datasets: A fault-tolerant abstraction for in-memory cluster computing. NSDI, 2012. Google ScholarDigital Library
- M. Zaharia, T. Das, H. Li, S. Shenker, and I. Stoica. Discretized streams: An efficient and fault-tolerant model for stream processing on large clusters. HotCloud, 2012. Google ScholarDigital Library
Recommendations
Layout-sensitive language extensibility with SugarHaskell
Haskell '12: Proceedings of the 2012 Haskell SymposiumProgrammers need convenient syntax to write elegant and concise programs. Consequently, the Haskell standard provides syntactic sugar for some scenarios (e.g., do notation for monadic code), authors of Haskell compilers provide syntactic sugar for more ...
Layout-sensitive language extensibility with SugarHaskell
Haskell '12Programmers need convenient syntax to write elegant and concise programs. Consequently, the Haskell standard provides syntactic sugar for some scenarios (e.g., do notation for monadic code), authors of Haskell compilers provide syntactic sugar for more ...
Comments