Abstract
Modern Stream Processing Engines (SPEs) process large data volumes under tight latency constraints. Many SPEs execute processing pipelines using message passing on shared-nothing architectures and apply a partition-based scale-out strategy to handle high-velocity input streams. Furthermore, many state-of-the-art SPEs rely on a Java Virtual Machine to achieve platform independence and speed up system development by abstracting from the underlying hardware.
In this paper, we show that taking the underlying hardware into account is essential to exploit modern hardware efficiently. To this end, we conduct an extensive experimental analysis of current SPEs and SPE design alternatives optimized for modern hardware. Our analysis highlights potential bottlenecks and reveals that state-of-the-art SPEs are not capable of fully exploiting current and emerging hardware trends, such as multi-core processors and high-speed networks. Based on our analysis, we describe a set of design changes to the common architecture of SPEs to scale-up on modern hardware. We show that the single-node throughput can be increased by up to two orders of magnitude compared to state-of-the-art SPEs by applying specialized code generation, fusing operators, batch-style parallelization strategies, and optimized windowing. This speedup allows for deploying typical streaming applications on a single or a few nodes instead of large clusters.
- Apache Storm Trident. URL storm.apache.org/releases/1.1.1/Trident-tutorial.html.Google Scholar
- Disni: Direct storage and networking interface, 2017. Retrieved March 30, 2018, from https://github.com/zrlio/disni.Google Scholar
- I. t. association. infiniband roadmap, 2017. Retrieved October 30, 2017, from http://www.infinibandta.org/.Google Scholar
- Package java.util.concurrent, 2017. Retrieved October 30, 2017, from https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/package-summary.html.Google Scholar
- A ReaderWriter queue which shows superior performance in benchmarks, 2017. Retrieved October 30, 2017, from https://github.com/cameron314/readerwriterqueue/tree/master/benchmarks.Google Scholar
- Sparkrdma shufflemanager plugin, 2017. Retrieved March 30, 2018, from https://github.com/Mellanox/SparkRDMA.Google Scholar
- A SPSC implementation from Facebook, 2017. Retrieved October 30, 2017, from https://github.com/facebook/folly/tree/master/folly.Google Scholar
- A SPSC queue implemented by the Boost library, 2017. Retrieved October 30, 2017, from www.boost.org/doc/libs/1\_64\_0/doc/html/boost/lockfree/queue.html.Google Scholar
- A SPSC queue implemented by the TBB library, 2017. Retrieved October 30, 2017, from https://www.threadingbuildingblocks.org/docs/doxygen/a00035.html.Google Scholar
- Hotspot virtual machine garbage collection tuning guide, 2018. Retrieved November 2, 2018, from https://docs.oracle.com/en/java/javase/11/gctuning/available-collectors.html.Google Scholar
- D. J. Abadi, Y. Ahmad, M. Balazinska, U. Cetintemel, M. Cherniack, J.-H. Hwang, W. Lindner, A. Maskey, A. Rasin, E. Ryvkina, et al. The design of the borealis stream processing engine. In CIDR, volume 5, pages 277--289, 2005.Google Scholar
- D. J. Abadi, D. Carney, U. Çetintemel, M. Cherniack, C. Convey, S. Lee, M. Stonebraker, N. Tatbul, and S. Zdonik. Aurora: a new model and architecture for data stream management. The VLDB Journal, 12(2):120--139, 2003. Google ScholarDigital Library
- T. Akidau, A. Balikov, K. Bekiroğlu, S. Chernyak, J. Haberman, R. Lax, S. McVeety, D. Mills, P. Nordstrom, and S. Whittle. Millwheel: fault-tolerant stream processing at internet scale. PVLDB, 6(11):1033--1044, 2013. Google ScholarDigital Library
- T. Akidau, R. Bradshaw, C. Chambers, S. Chernyak, R. J. Fernández-Moctezuma, R. Lax, S. McVeety, D. Mills, F. Perry, E. Schmidt, et al. The dataflow model: a practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. PVLDB, 8(12):1792--1803, 2015. Google ScholarDigital Library
- A. Alexandrov, R. Bergmann, S. Ewen, J.-C. Freytag, F. Hueske, A. Heise, O. Kao, M. Leich, U. Leser, V. Markl, et al. The stratosphere platform for big data analytics. The VLDB Journal, 23(6):939--964, 2014. Google ScholarDigital Library
- A. Arasu, M. Cherniack, E. Galvez, D. Maier, A. S. Maskey, E. Ryvkina, M. Stonebraker, and R. Tibbetts. Linear road: a stream data management benchmark. In VLDB, pages 480--491. 2004. Google ScholarDigital Library
- A. Arasu and J. Widom. Resource sharing in continuous sliding-window aggregates. In VLDB, pages 336--347, 2004. Google ScholarDigital Library
- M. Armbrust. Making apache spark the fastest open source streaming engine, 2017. Databricks Engineering Blog, URL https://databricks.com/blog/2017/06/06/simple-super-fast-streaming-engine-apache-spark.html.Google Scholar
- D. Battré, S. Ewen, F. Hueske, O. Kao, V. Markl, and D. Warneke. Nephele/PACTs: A programming model and execution framework for web-scale analytical processing. In SoCC, pages 119--130. ACM, 2010. Google ScholarDigital Library
- T. Bernhardt and A. Vasseur. Esper: Event stream processing and correlation, 2007.Google Scholar
- A. Biem, E. Bouillet, H. Feng, A. Ranganathan, A. Riabov, O. Verscheure, H. Koutsopoulos, and C. Moran. IBM infosphere streams for scalable, real-time, intelligent transportation services. In SIGMOD, pages 1093--1104. ACM, 2010. Google ScholarDigital Library
- C. Binnig, A. Crotty, A. Galakatos, T. Kraska, and E. Zamanian. The end of slow networks: It's time for a redesign. PVLDB, 9(7):528--539, 2016. Google ScholarDigital Library
- P. Boncz, M. Zukowski, and N. Nes. MonetDB/X100: hyper-pipelining query execution. In CIDR, volume 5, pages 225--237, 2005.Google Scholar
- I. Botan, Y. Cho, R. Derakhshan, N. Dindar, L. Haas, K. Kim, C. Lee, G. Mundada, M.-C. Shan, N. Tatbul, et al. Design and implementation of the maxstream federated stream processing architecture. 2009.Google Scholar
- I. Botan, D. Kossmann, P. M. Fischer, T. Kraska, D. Florescu, and R. Tamosevicius. Extending xquery with window functions. In VLDB, pages 75--86. 2007. Google ScholarDigital Library
- L. Braun, T. Etter, G. Gasparis, M. Kaufmann, D. Kossmann, D. Widmer, A. Avitzur, A. Iliopoulos, E. Levy, and N. Liang. Analytics in motion: High performance event-processing and real-time analytics in the same database. In SIGMOD, pages 251--264. ACM, 2015. Google ScholarDigital Library
- S. Breß, H. Funke, and J. Teubner. Robust query processing in co-processor-accelerated databases. In SIGMOD, pages 1891--1906. ACM, 2016. Google ScholarDigital Library
- C. Cameron, J. Singer, and D. Vengerov. The judgment of forseti: Economic utility for dynamic heap sizing of multiple runtimes. In ACM SIGPLAN Notices, volume 50, pages 143--156. ACM, 2015. Google ScholarDigital Library
- P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas. Apache Flink: Stream and batch processing in a single engine. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 36(4), 2015.Google Scholar
- P. Carbone, A. Katsifodimos, S. Ewen, V. Markl, S. Haridi, and K. Tzoumas. Apache flink: Stream and batch processing in a single engine. IEEE Data Engineering Bulletin, 36(4), 2015.Google Scholar
- P. Carbone, J. Traub, A. Katsifodimos, S. Haridi, and V. Markl. Cutty: Aggregate sharing for user-defined windows. In CIKM, pages 1201--1210. 2016. Google ScholarDigital Library
- R. Castro Fernandez, M. Migliavacca, E. Kalyvianaki, and P. Pietzuch. Integrating scale out and fault tolerance in stream processing using operator state management. In SIGMOD, pages 725--736. 2013. Google ScholarDigital Library
- C. Chambers, A. Raniwala, F. Perry, S. Adams, R. Henry, R. Bradshaw, and Nathan. Flumejava: Easy, efficient data-parallel pipelines. In ACM SIGPLAN (PLDI), pages 363--375. 2010. Google ScholarDigital Library
- B. Chandramouli, J. Goldstein, M. Barnett, R. DeLine, D. Fisher, J. C. Platt, J. F. Terwilliger, and J. Wernsing. Trill: A high-performance incremental query processor for diverse analytics. PVLDB, 8(4):401--412, 2014. Google ScholarDigital Library
- S. Chandrasekaran, O. Cooper, A. Deshpande, M. J. Franklin, J. M. Hellerstein, W. Hong, S. Krishnamurthy, S. R. Madden, F. Reiss, and M. A. Shah. Telegraphcq: continuous dataflow processing. In SIGMOD, pages 668--668. ACM, 2003. Google ScholarDigital Library
- J. Chen, D. J. DeWitt, F. Tian, and Y. Wang. Niagaracq: A scalable continuous query system for internet databases. In ACM SIGMOD Record, volume 29, pages 379--390. ACM, 2000. Google ScholarDigital Library
- Q. Chen, M. Hsu, and H. Zeller. Experience in continuous analytics as a service (CaaaS). In EDBT, pages 509--514. 2011. Google ScholarDigital Library
- S. Chintapalli, D. Dagit, B. Evans, R. Farivar, T. Graves, M. Holderbaugh, Z. Liu, K. Nusbaum, K. Patil, B. J. Peng, et al. Benchmarking streaming computation engines: Storm, Flink and Spark streaming. In International Parallel and Distributed Processing Symposium, Workshops, pages 1789--1792. 2016.Google ScholarCross Ref
- S. Chintapalli, D. Dagit, B. Evans, R. Farivar, T. Graves, M. Holderbaugh, Z. Liu, K. Nusbaum, K. Patil, B. J. Peng, and P. Poulosky. Benchmarking streaming computation engines at yahoo, 2015.Google Scholar
- R. Consortium. RDMA Protocol Verbs Specification (Version 1.0). April 2003.Google Scholar
- I. Corporation. Intel® 64 and IA-32 Architectures Software Developer's Manual. Number 325462-044US. August 2012.Google Scholar
- J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. In OSDI, pages 10--10. USENIX Association, 2004. Google ScholarDigital Library
- S. Ewen. Off-heap memory in apache flink and the curious jit compiler, 2015. Retrieved November 2, 2018, from https://flink.apache.org/news/2015/09/16/off-heap-memory.html.Google Scholar
- F. Färber, S. K. Cha, J. Primsch, C. Bornhövd, S. Sigg, and W. Lehner. Sap hana database: data management for modern business applications. ACM Sigmod Record, 40(4):45--51, 2012. Google ScholarDigital Library
- R. C. Fernandez, M. Migliavacca, E. Kalyvianaki, and P. Pietzuch. Making state explicit for imperative big data processing. In 2014 USENIX Annual Technical Conference (USENIX ATC 14), pages 49--60, Philadelphia, PA, 2014. USENIX Association. Google ScholarDigital Library
- P. M. Fischer, K. S. Esmaili, and R. J. Miller. Stream schema: providing and exploiting static metadata for data stream processing. In EDBT, pages 207--218. 2010. Google ScholarDigital Library
- B. Gedik. Generic windowing support for extensible stream processing systems. Software: Practice and Experience, 44(9):1105--1128, 2014. Google ScholarDigital Library
- B. Gedik, H. Andrade, and K. Wu. A code generation approach to optimizing high-performance distributed data stream processing. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, CIKM 2009, pages 847--856, 2009. Google ScholarDigital Library
- J. Grier. Extending the yahoo! streaming benchmark, 2016.Google Scholar
- M. Grossniklaus, D. Maier, J. Miller, S. Moorthy, and K. Tufte. Frames: data-driven windows. In DEBS, pages 13--24. 2016. Google ScholarDigital Library
- M. Herlihy, N. Shavit, and M. Tzafrir. Hopscotch hashing. In International Symposium on Distributed Computing, pages 350--364. Springer, 2008. Google ScholarDigital Library
- M. Hirzel, H. Andrade, B. Gedik, G. Jacques-Silva, R. Khandekar, V. Kumar, M. Mendell, H. Nasgaard, S. Schneider, R. Soulé, et al. IBM streams processing language: Analyzing big data in motion. IBM Journal of Research and Development, 57(3/4):7--1, 2013. Google ScholarDigital Library
- M. Hirzel, R. Soulé, S. Schneider, B. Gedik, and R. Grimm. A catalog of stream processing optimizations. ACM Comput. Surv., 46(4):46:1--46:34, 2014. Google ScholarDigital Library
- N. Jain, L. Amini, H. Andrade, R. King, Y. Park, P. Selo, and C. Venkatramani. Design, implementation, and evaluation of the linear road bnchmark on the stream processing core. In SIGMOD, pages 431--442. 2006. Google ScholarDigital Library
- A. Kalia, M. Kaminsky, and D. G. Andersen. Design guidelines for high performance rdma systems. In Proceedings of the 2016 USENIX Conference on Usenix Annual Technical Conference, USENIX ATC '16, pages 437--450, Berkeley, CA, USA, 2016. USENIX Association. Google ScholarDigital Library
- A. Kipf, V. Pandey, J. Böttcher, L. Braun, T. Neumann, and A. Kemper. Analytics on fast data: Main-memory database systems versus modern streaming systems. In EDBT, pages 49--60. 2017.Google Scholar
- A. Koliousis, M. Weidlich, R. Castro Fernandez, A. L. Wolf, P. Costa, and P. Pietzuch. Saber: Window-based hybrid stream processing for heterogeneous architectures. In SIGMOD, pages 555--569. ACM, 2016. Google ScholarDigital Library
- K. Krikellas, S. Viglas, and M. Cintra. Generating code for holistic query evaluation. In ICDE, pages 613--624. IEEE, 2010.Google ScholarCross Ref
- S. Krishnamurthy, M. J. Franklin, J. Davis, D. Farina, P. Golovko, A. Li, and N. Thombre. Continuous analytics over discontinuous streams. In SIGMOD, pages 1081--1092. 2010. Google ScholarDigital Library
- S. Krishnamurthy, C. Wu, and M. Franklin. On-the-fly sharing for streamed aggregation. In SIGMOD, pages 623--634. 2006. Google ScholarDigital Library
- V. Leis, P. Boncz, A. Kemper, and T. Neumann. Morsel-driven parallelism: A NUMA-aware query evaluation framework for the many-core age. In SIGMOD, pages 743--754. ACM, 2014. Google ScholarDigital Library
- V. Leis, F. Scheibner, A. Kemper, and T. Neumann. The art of practical synchronization. In DaMoN, page 3. ACM, 2016. Google ScholarDigital Library
- J. Li, D. Maier, K. Tufte, V. Papadimos, and P. A. Tucker. No pane, no gain: efficient evaluation of sliding-window aggregates over data streams. ACM SIGMOD Record, 34(1):39--44, 2005. Google ScholarDigital Library
- E. Liarou, R. Goncalves, and S. Idreos. Exploiting the power of relational databases for efficient stream processing. In EDBT, pages 323--334. 2009. Google ScholarDigital Library
- D. Lion, A. Chiu, H. Sun, X. Zhuang, N. Grcevski, and D. Yuan. Don't get caught in the cold, warm-up your jvm: Understand and eliminate jvm warm-up overhead in data-parallel systems. In OSDI, pages 383--400, 2016. Google ScholarDigital Library
- H. Miao, H. Park, M. Jeon, G. Pekhimenko, K. S. McKinley, and F. X. Lin. Streambox: Modern stream processing on a multicore machine. In 2017 USENIX Annual Technical Conference (USENIX ATC 17), pages 617--629, Santa Clara, CA, 2017. USENIX Association. Google ScholarDigital Library
- C. Mitchell, Y. Geng, and J. Li. Using one-sided rdma reads to build a fast, cpu-efficient key-value store. In Proceedings of the 2013 USENIX Conference on Annual Technical Conference, USENIX ATC'13, pages 103--114, Berkeley, CA, USA, 2013. USENIX Association. Google ScholarDigital Library
- D. G. Murray, F. McSherry, R. Isaacs, M. Isard, P. Barham, and M. Abadi. Naiad: a timely dataflow system. In SOSP, pages 439--455. ACM, 2013. Google ScholarDigital Library
- T. Neumann. Efficiently compiling efficient query plans for modern hardware. PVLDB, 4(9):539--550, 2011. Google ScholarDigital Library
- K. Ousterhout, R. Rasti, S. Ratnasamy, S. Shenker, and B.-G. Chun. Making sense of performance in data analytics frameworks. In NSDI, pages 293--307. USENIX Association, 2015. Google ScholarDigital Library
- I. Pandis, R. Johnson, N. Hardavellas, and A. Ailamaki. Data-oriented transaction execution. PVLDB, 3(1--2):928--939, 2010. Google ScholarDigital Library
- P. Pietzuch, P. Garefalakis, A. Koliousis, H. Pirk, and G. Theodorakis. Do we need distributed stream processing?, 2018.Google Scholar
- I. Psaroudakis, T. Scheuer, N. May, A. Sellami, and A. Ailamaki. Scaling up concurrent main-memory column-store scans: Towards adaptive numa-aware data and task placement. PVLDB, 8(12):1442--1453, 2015. Google ScholarDigital Library
- W. Rödiger, T. Mühlbauer, A. Kemper, and T. Neumann. High-speed query processing over high-speed networks. PVLDB, 9(4):228--239, 2015. Google ScholarDigital Library
- M. J. Sax, M. Castellanos, Q. Chen, and M. Hsu. Performance optimization for distributed intra-node-parallel streaming systems. In Workshops Proceedings of the 29th IEEE International Conference on Data Engineering, ICDE 2013, Brisbane, Australia, April 8--12, 2013, pages 62--69, 2013.Google ScholarCross Ref
- T. Schneider. Analyzing 1.1 billion nyc taxi and uber trips, with a vengeance, 2015.Google Scholar
- M. Svensson. Benchmarking the performance of a data stream management system. PhD thesis, MSc thesis report, Uppsala University, 2007.Google Scholar
- K. Tangwongsan, M. Hirzel, and S. Schneider. Low-latency sliding-window aggregation in worst-case constant time. In DEBS, pages 66--77, New York, NY, USA, 2017. ACM. Google ScholarDigital Library
- K. Tangwongsan, M. Hirzel, S. Schneider, and K.-L. Wu. General incremental sliding-window aggregation. PVLDB, 8(7):702--713, 2015. Google ScholarDigital Library
- A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, et al. Storm@ twitter. In SIGMOD, pages 147--156. ACM, 2014. Google ScholarDigital Library
- A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J. M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, et al. Storm@Twitter. In SIGMOD, pages 147--156. 2014. Google ScholarDigital Library
- J. Traub, P. Grulich, A. R. Cuéllar, S. Breß, A. Katsifodimos, T. Rabl, and V. Markl. Efficient window aggregation with general stream slicing. In 22nd International Conference on Extending Database Technology (EDBT), 2019.Google Scholar
- J. Traub, P. M. Grulich, A. R. Cuellar, S. Breß, A. Katsifodimos, T. Rabl, and V. Markl. Scotty: Efficient window aggregation for out-of-order stream processing. In 34th International Conference on Data Engineering (ICDE), pages 1300--1303. IEEE, 2018.Google ScholarCross Ref
- A. Trivedi, P. Stuedi, J. Pfefferle, R. Stoica, B. Metzler, I. Koltsidas, and N. Ioannou. On the {ir}relevance of network performance for data processing. In USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 16). USENIX Association, 2016. Google ScholarDigital Library
- D. Vyukov. Single-producer/single-consumer queue. Intel Developer Zonw, URL software.intel.com/en-us/articles/single-producer-single-consumer-queue, 2015.Google Scholar
- Y. Wu and K.-L. Tan. Chronostream: Elastic stateful stream computation in the cloud. In ICDE, pages 723--734. IEEE, 2015.Google ScholarCross Ref
- R. Xin and J. Rosen. Project tungsten: Bringing apache spark closer to bare metal, 2015. Retrieved November 2, 2018, from URL https://databricks.com/blog/2015/04/28/project-tungsten-bringing-spark-closer-to-bare-metal.html.Google Scholar
- B. Yavuz. Benchmarking structured streaming on databricks runtime against state-of-the-art streaming systems, 2017.Google Scholar
- M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In NSDI, pages 2--2. 2012. Google ScholarDigital Library
- M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. Discretized streams: Fault-tolerant streaming computation at scale. In SOSP, pages 423--438. ACM, 2013. Google ScholarDigital Library
- M. Zaharia, R. S. Xin, P. Wendell, T. Das, M. Armbrust, A. Dave, X. Meng, J. Rosen, S. Venkataraman, M. J. Franklin, et al. Apache spark: A unified engine for big data processing. Communications of the ACM, 59(11):56--65, 2016. Google ScholarDigital Library
- E. Zeitler and T. Risch. Scalable splitting of massive data streams. In DASFAA, pages 184--198. 2010. Google ScholarDigital Library
- E. Zeitler and T. Risch. Massive scale-out of expensive continuous queries. PVLDB, 4(11):1181--1188, 2011.Google ScholarDigital Library
- S. Zhang, B. He, D. Dahlmeier, A. C. Zhou, and T. Heinze. Revisiting the design of data stream processing systems on multi-core processors. In ICDE, pages 659--670. 2017.Google ScholarCross Ref
- M. Zukowski, P. A. Boncz, et al. Vectorwise: Beyond column stores. IEEE Data Eng. Bull., 2012.Google Scholar
Recommendations
ATI Stream Profiler: a tool to optimize an OpenCL kernel on ATI Radeon GPUs
SIGGRAPH '10: ACM SIGGRAPH 2010 PostersModern GPUs have been shown to be highly efficient machines for data-parallel applications such as graphics, image, video processing, or physical simulation applications. For example, a single ATI Radeon™ HD 5870 GPU has a theoretical peak of 2.72 ...
Brook for GPUs: Stream Computing on Graphics Hardware
Seminal Graphics Papers: Pushing the Boundaries, Volume 2In this paper, we present Brook for GPUs, a system for general-purpose computation on programmable graphics hardware. Brook extends C to include simple data-parallel constructs, enabling the use of the GPU as a streaming co-processor. We present a ...
Comments