Abstract
Window aggregation is a core operation in data stream processing. Existing aggregation techniques focus on reducing latency, eliminating redundant computations, or minimizing memory usage. However, each technique operates under different assumptions with respect to workload characteristics, such as properties of aggregation functions (e.g., invertible, associative), window types (e.g., sliding, sessions), windowing measures (e.g., time- or count-based), and stream (dis)order. In this article, we present Scotty, an efficient and general open-source operator for sliding-window aggregation in stream processing systems, such as Apache Flink, Apache Beam, Apache Samza, Apache Kafka, Apache Spark, and Apache Storm. One can easily extend Scotty with user-defined aggregation functions and window types. Scotty implements the concept of general stream slicing and derives workload characteristics from aggregation queries to improve performance without sacrificing its general applicability. We provide an in-depth view on the algorithms of the general stream slicing approach. Our experiments show that Scotty outperforms alternative solutions.
- Tyler Akidau, Robert Bradshaw, et al. 2015. The dataflow model: A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proc. VLDB Endow. 8, 12 (2015), 1792--1803.Google ScholarDigital Library
- Alexander Alexandrov, Rico Bergmann, Stephan Ewen, et al. 2014. The Stratosphere platform for big data analytics. VLDB J. 23, 6 (2014), 939--964.Google ScholarDigital Library
- Apache Apex. 2018. Enterprise-grade unified stream and batch processing engine. Retrieved from https://apex.apache.org/.Google Scholar
- Apache Beam. 2018. An advanced unified programming model. Retrieved from https://beam.apache.org/.Google Scholar
- Arvind Arasu and Jennifer Widom. 2004. Resource sharing in continuous sliding-window aggregates. Proceedings of the International Conference on Very Large Data Bases (PVLDB’04). 336--347.Google ScholarCross Ref
- Michael Armbrust, Tathagata Das, Joseph Torres, Burak Yavuz, Shixiong Zhu, Reynold Xin, et al. 2018. Structured streaming: A declarative API for real-time applications in Apache Spark. In Proceedings of the ACM Special Interest Group on Management of Data (SIGMOD’18). 601--613.Google Scholar
- Ahmed Awad, Jonas Traub, and Sherif Sakr. 2019. Adaptive watermarks: A concept drift-based approach for predicting event-time progress in data streams. In Proceedings of the International Conference on Extending Database Technology (EDBT’19).Google Scholar
- Cagri Balkesen and Nesime Tatbul. 2011. Scalable data partitioning techniques for parallel sliding window processing over data streams. In Proceedings of the International Workshop on Data Management for Sensor Networks (DMSN’11).Google Scholar
- Lawrence Benson, Philipp M. Grulich, Steffen Zeuch, Volker Markl, and Tilmann Rabl. 2020. Disco: Efficient distributed window aggregation. In Proceedings of the International Conference on Extending Database Technology (EDBT’20).Google Scholar
- Pramod Bhatotia, Umut A. Acar, Flavio P. Junqueira, and Rodrigo Rodrigues. 2014. Slider: Incremental sliding window analytics. In Proceedings of the International Middleware Conference. ACM, 61--72.Google ScholarDigital Library
- Brice Bingman. 2018. Poor performance with sliding time windows. In Flink Jira Issues. Retrieved from issues.apache.org/jira/browse/FLINK-6990.Google Scholar
- Irina Botan, Roozbeh Derakhshan, Nihal Dindar, Laura Haas, Renée J. Miller, and Nesime Tatbul. 2010. SECRET: A model for analysis of the execution semantics of stream processing systems. Proc. VLDB Endow. 3, 1--2 (2010), 232--243.Google ScholarDigital Library
- Paris Carbone. 2018. Scalable and Reliable Data Stream Processing. Ph.D. Dissertation. KTH Stockholm.Google Scholar
- Paris Carbone, Stephan Ewen, Gyula Fóra, Seif Haridi, Stefan Richter, and Kostas Tzoumas. 2017. State management in Apache Flink: Consistent stateful distributed stream processing. Proc. VLDB Endow. 10, 12 (2017), 1718--1729.Google ScholarDigital Library
- Paris Carbone, Gyula Fóra, Stephan Ewen, Seif Haridi, and Kostas Tzoumas. 2015. Lightweight asynchronous snapshots for distributed dataflows. Retrieved from https://arxiv.org/abs/1506.08603.Google Scholar
- Paris Carbone, Asterios Katsifodimos, Stephan Ewen, Volker Markl, Seif Haridi, and Kostas Tzoumas. 2015. Apache Flink: Stream and batch processing in a single engine. IEEE Data Eng. Bull. 38, 4 (2015), 28--38.Google Scholar
- Paris Carbone, Jonas Traub, Asterios Katsifodimos, Seif Haridi, and Volker Markl. 2016. Cutty: Aggregate sharing for user-defined windows. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM’16). 1201--1210.Google ScholarDigital Library
- Badrish Chandramouli, Jonathan Goldstein, Mike Barnett, Robert DeLine, Danyel Fisher, John C. Platt, James F. Terwilliger, and John Wernsing. 2014. Trill: A high-performance incremental query processor for diverse analytics. Proceedings of the International Conference on Very Large Data Bases (PVLDB’14) 8, 4 (2014), 401--412.Google ScholarDigital Library
- Sanket Chintapalli, Derek Dagit, Bobby Evans et al. 2016. Benchmarking streaming computation engines: Storm, Flink and Spark streaming. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW’16). 1789--1792.Google ScholarCross Ref
- Xenofontas Dimitropoulos, Paul Hurley, Andreas Kind, and Marc Ph Stoecklin. 2009. On the 95-percentile billing method. In Passive and Active Network Measurement. Springer, 207--216.Google Scholar
- Buğra Gedik. 2014. Generic windowing support for extensible stream processing systems. Softw.: Pract. Exp. 44, 9 (2014), 1105--1128.Google ScholarDigital Library
- Thanaa M. Ghanem, Moustafa A. Hammad, Mohamed F. Mokbel, Walid G. Aref, and Ahmed K. Elmagarmid. 2007. Incremental evaluation of sliding-window queries over data streams. IEEE Trans. Knowl. Data Eng. 19, 1 (2007), 57--72.Google ScholarDigital Library
- Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don Reichart, Murali Venkatrao, Frank Pellow, and Hamid Pirahesh. 1997. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Min. Knowl. Discov. 1, 1 (1997), 29--53.Google ScholarDigital Library
- Michael Grossniklaus, David Maier, James Miller, Sharmadha Moorthy, and Kristin Tufte. 2016. Frames: Data-driven windows. In Proceedings of the ACM International Conference on Distributed and Event-based Systems (DEBS’16).Google ScholarDigital Library
- Philipp M. Grulich, Sebastian Breß, Jonas Traub, Tilmann Rabl, Janis von Bleichert, Zongxiong Chen, Steffen Zeuch, and Volker Markl. 2020. Grizzly: Efficient stream processing through adaptive query compilation. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM New York, NY.Google ScholarDigital Library
- Philipp M. Grulich, Jonas Traub, Sebastian Breß, Asterios Katsifodimos, Tilmann Rabl, and Volker Markl. 2019. Generating reproducible out-of-order data streams. In Proceedings of the ACM International Conference on Distributed and Event-based Systems (DEBS’19). 256--257.Google ScholarDigital Library
- Shenoda Guirguis, Mohamed A. Sharaf, Panos K. Chrysanthis, and Alexandros Labrinidis. 2011. Optimized processing of multiple aggregate continuous queries. In Proceedings of the Conference on Information and Knowledge Management (CIKM’11). ACM, 1515--1524.Google ScholarDigital Library
- Shenoda Guirguis, Mohamed A. Sharaf, Panos K. Chrysanthis, and Alexandros Labrinidis. 2012. Three-level processing of multiple aggregate continuous queries. In Proceedings of the IEEE International Conference on Data Engineering (ICDE’12). 929--940.Google ScholarDigital Library
- Guenter Hesse, Christoph Matthies, Kelvin Glass, Johannes Huegle, and Matthias Uflacker. 2019. Quantitative impact evaluation of an abstraction layer for data stream processing systems. In Proceedings of the International Conference on Distributed Computing Systems (ICDCS’19). IEEE, 1381--1392.Google ScholarCross Ref
- Martin Hirzel, Henrique Andrade, Buğra Gedik, Vibhore Kumar, Giuliano Losa, Howard Nasgaard, Robert Soulé, and Kun-Lung Wu. 2009. SPL stream processing language specification. IBM Res. Report (2009). http://cs.yale.edu/homes/soule/pubs/rc24897.pdf.Google Scholar
- Martin Hirzel, Scott Schneider, and Kanat Tangwongsan. 2017. Sliding-window aggregation algorithms: Tutorial. In Proceedings of the ACM International Conference on Distributed and Event-based Systems (DEBS’17). 11--14.Google ScholarDigital Library
- Martin Hirzel, Robert Soulé, Scott Schneider, Buğra Gedik, and Robert Grimm. 2014. A catalog of stream processing optimizations. Comput. Surveys 46, 4 (2014), 46.Google ScholarDigital Library
- Kartik Hosanagar, John Chuang, Ramayya Krishnan, and Michael D. Smith. 2008. Service adoption and pricing of content delivery network (CDN) services. Manage. Sci. 54, 9 (2008), 1579--1593.Google ScholarDigital Library
- Ryan Huebsch, Minos Garofalakis, Joseph M. Hellerstein, and Ion Stoica. 2007. Sharing aggregate computation for distributed queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 485--496.Google ScholarDigital Library
- Julius Hülsmann, Jonas Traub, and Volker Markl. 2020. Demand-based sensor data gathering with multi-query optimization. Proc. VLDB Endow. 13, 12 (2020), 2801--2804.Google ScholarDigital Library
- Zbigniew Jerzak, Thomas Heinze, Matthias Fehr, Daniel Gröber, Raik Hartung, and Nenad Stojanovic. 2012. The DEBS 2012 grand challenge. In Proceedings of the ACM International Conference on Distributed and Event-based Systems (DEBS’12). 393--398.Google Scholar
- Uwe Jugel, Zbigniew Jerzak, Gregor Hackenbroich, and Volker Markl. 2014. M4: A visualization-oriented time series data aggregation. In Proceedings of the International Conference on Very Large Data Bases (PVLDB’14). 797--808.Google ScholarDigital Library
- Jeyhun Karimov, Tilmann Rabl, Asterios Katsifodimos, Roman Samarev, Henri Heiskanen, and Volker Markl. 2018. Benchmarking distributed stream processing engines. In IEEE Proceedings of the IEEE International Conference on Data Engineering (ICDE’18).Google Scholar
- Alexandros Koliousis, Matthias Weidlich, Raul Castro Fernandez, Alexander L. Wolf, Paolo Costa, and Peter Pietzuch. 2016. SABER: Window-based hybrid stream processing for heterogeneous architectures. In Proceedings of the ACM Special Interest Group on Management of Data (SIGMOD’16). 555--569.Google Scholar
- Jay Kreps. 2016. Introducing Kafka streams: Stream processing made simple. Confluent Blog. Retrieved from https://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/.Google Scholar
- Sailesh Krishnamurthy, Michael J. Franklin, Jeffrey Davis, Daniel Farina, Pasha Golovko, Alan Li, and Neil Thombre. 2010. Continuous analytics over discontinuous streams. In Proceedings of the ACM Special Interest Group on Management of Data (SIGMOD’10). 1081--1092.Google ScholarDigital Library
- Sailesh Krishnamurthy, Chung Wu, and Michael Franklin. 2006. On-the-fly sharing for streamed aggregation. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 623--634.Google ScholarDigital Library
- Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, and Peter A. Tucker. 2005. No pane, no gain: Efficient evaluation of sliding-window aggregates over data streams. SIGMOD Record 34, 1 (2005), 39--44.Google ScholarDigital Library
- Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, and Peter A. Tucker. 2005. Semantics and evaluation techniques for window aggregates in data streams. In Proceedings of the ACM Special Interest Group on Management of Data (SIGMOD’05). 311--322.Google Scholar
- Jin Li, Kristin Tufte, David Maier, and Vassilis Papadimos. 2008. AdaptWID: An adaptive, memory-efficient window aggregation implementation. IEEE Internet Comput. 12, 6 (2008), 22--29.Google ScholarDigital Library
- Jin Li, Kristin Tufte, Vladislav Shkapenyuk, Vassilis Papadimos, Theodore Johnson, and David Maier. 2008. Out-of-order processing: A new architecture for high-performance stream systems. Proc. VLDB Endow. 1, 1 (2008), 274--288.Google ScholarDigital Library
- Christopher Mutschler, Holger Ziekow, and Zbigniew Jerzak. 2013. The DEBS 2013 grand challenge. In Proceedings of the ACM International Conference on Distributed and Event-based Systems (DEBS’13). 289--294.Google Scholar
- Shadi A. Noghabi, Kartik Paramasivam, Yi Pan, Navina Ramesh, Jon Bringhurst, Indranil Gupta, and Roy H. Campbell. 2017. Samza: Stateful scalable stream processing at LinkedIn. Proc. VLDB Endow. 10, 12 (2017), 1634--1645.Google ScholarDigital Library
- OpenJDK. 2018. JMH benchmarking suite project website. Retrieved from http://openjdk.java.net/projects/code-tools/jmh/.Google Scholar
- OpenJDK. 2018. Nashorn project, ObjectSizeCalculator. Retrieved from http://openjdk.java.net/projects/nashorn/.Google Scholar
- Kostas Patroumpas et al. 2006. Window specification over data streams. In Proceedings of the International Conference on Extending Database Technology (EDBT’06).Google Scholar
- David Salomon. 2007. Variable-length Codes for Data Compression. Springer.Google ScholarDigital Library
- Anatoli U. Shein, Panos K. Chrysanthis, and Alexandros Labrinidis. 2015. F1: Accelerating the optimization of aggregate continuous queries. In Proceedings of the Conference on Information and Knowledge Management (CIKM’15). 1151--1160.Google ScholarDigital Library
- Anatoli U. Shein, Panos K. Chrysanthis, and Alexandros Labrinidis. 2017. Flatfit: Accelerated incremental sliding-window aggregation for real-time analytics. In Proceedings of the International Conference on Scientific and Statistical Database Management.Google ScholarDigital Library
- Anatoli U. Shein, Panos K. Chrysanthis, and Alexandros Labrinidis. 2018. SlickDeque: High throughput and low latency incremental sliding-window aggregation. In Proceedings of the International Conference on Extending Database Technology (EDBT’18).Google Scholar
- Leo Syinchwun. 2016. Lightweight event time window. In Flink Jira Issues. Retrieved from issues.apache.org/jira/browse/FLINK-5387.Google Scholar
- Kanat Tangwongsan, Martin Hirzel, and Scott Schneider. 2017. Low-latency sliding-window aggregation in worst-case constant time. In Proceedings of the ACM International Conference on Distributed and Event-based Systems (DEBS’17). 66--77.Google ScholarDigital Library
- Kanat Tangwongsan, Martin Hirzel, and Scott Schneider. 2019. Optimal and general out-of-order sliding-window aggregation. Proc. VLDB Endow. 12, 10 (2019), 1167--1180.Google ScholarDigital Library
- Kanat Tangwongsan, Martin Hirzel, Scott Schneider, and Kun-Lung Wu. 2015. General incremental sliding-window aggregation. Proceedings of the International Conference on Very Large Data Bases (PVLDB’15) 8, 7 (2015), 702--713.Google ScholarDigital Library
- Joseph Torres, Michael Armbrust, Tathagata Das, and Shixiong Zhu. 2018. Introducing low-latency continuous processing mode in structured streaming in Apache Spark 2.3. Databricks Blog. Retrieved from https://databricks.com/blog/2018/03/20/low-latency-continuous-processing-mode-in-structured-streaming-in-apache-spark-2-3-0.html.Google Scholar
- Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, et al. 2014. Storm@twitter. In Proceedings of the ACM Special Interest Group on Management of Data (SIGMOD’14). 147--156.Google Scholar
- Jonas Traub. 2019. Demand-based data stream gathering, processing, and transmission. TU Berlin, PhD Thesis.Google Scholar
- Jonas Traub, Sebastian Breß, Tilmann Rabl, Asterios Katsifodimos, and Volker Markl. 2017. Optimized on-demand data streaming from sensor nodes. Proceedings of the Symposium on Cloud Computing (SoCC’17). 586--597.Google ScholarDigital Library
- Jonas Traub, Philipp M. Grulich, Alejandro Rodriguez Cuellar, Sebastian Breß, Asterios Katsifodimos, Tilmann Rabl, and Volker Markl. 2018. Scotty: Efficient window aggregation for out-of-order stream processing. In Proceedings of the IEEE International Conference on Data Engineering (ICDE’18).Google ScholarCross Ref
- Jonas Traub, Philipp M. Grulich, Alejandro Rodriguez Cuellar, Sebastian Breß, Asterios Katsifodimos, Tilmann Rabl, and Volker Markl. 2019. Efficient window aggregation with general stream slicing. In Proceedings of the International Conference on Extending Database Technology (EDBT’17).Google Scholar
- Jonas Traub, Julius Hülsmann, Sebastian Breß, Tilmann Rabl, and Volker Markl. 2019. SENSE: Scalable data acquisition from distributed sensors with guaranteed time coherence. Retrieved from https://arxiv.org/abs/1912.04648.Google Scholar
- Jonas Traub, Nikolaas Steenbergen, Philipp M. Grulich, Tilmann Rabl, and Volker Markl. 2017. I2: Interactive real-time visualization for streaming data. In Proceedings of the International Conference on Extending Database Technology (EDBT’17). 526--529.Google Scholar
- Peter A. Tucker, David Maier, Tim Sheard, and Leonidas Fegaras. 2003. Exploiting punctuation semantics in continuous data streams. IEEE Trans. Knowl. Data Eng. 15, 3 (2003), 555--568.Google ScholarDigital Library
- Kostas Tzoumas et al. 2015. High-throughput, low-latency, and exactly-once stream processing with Apache Flink. Retrieved from data-artisans.com/blog/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink.Google Scholar
- Mikhail Vorontsov. 2013. Memory consumption of popular Java data types - part 2. Java Performance Tuning Guide. Retrieved from http://java-performance.info/memory-consumption-of-java-data-types-2/.Google Scholar
- Guozhang Wang. 2017. Enabling exactly-once in Kafka streams. Confluent Blog. Retrieved from https://www.confluent.io/blog/enabling-exactly-once-kafka-streams/.Google Scholar
- Jark Wu. 2017. Improve performance of sliding time window with pane optimization. In Flink Jira Issues. Retrieved from issues.apache.org/jira/browse/FLINK-7001.Google Scholar
- Yuan Yu, Pradeep Kumar Gunda, and Michael Isard. 2009. Distributed aggregation for data-parallel computing: Interfaces and implementations. In Proceedings of the ACM Special Interest Group on Operating Systems (SIGOPS’09). 247--260.Google Scholar
- Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, and Ion Stoica. 2012. Discretized streams: An efficient and fault-tolerant model for stream processing on large clusters. In Proceedings of the USENIX Conference on Hot Topics in Cloud Computing.Google Scholar
- Matei Zaharia, Reynold S. Xin, Patrick Wendell, et al. 2016. Apache Spark: A unified engine for big data processing. Commun. ACM 59, 11 (2016), 56--65.Google ScholarDigital Library
- Steffen Zeuch, Ankit Chaudhary, Bonaventura Del Monte, et al. 2020. The NebulaStream Platform: Data and application management for the Internet of Things. In Proceedings of the Conference on Innovative Data Systems Research (CIDR’20).Google Scholar
- Steffen Zeuch, Bonaventura Del Monte, Jeyhun Karimov, Clemens Lutz, Manuel Renz, Jonas Traub, Sebastian Breß, Tilmann Rabl, and Volker Markl. 2019. Analyzing efficient stream processing on modern hardware. In Proceedings of the International Conference on Very Large Data Bases (PVLDB’19).Google ScholarDigital Library
- Shuhao Zhang, Feng Zhang, Yingjun Wu, Bingsheng He, and Paul Johns. 2020. Hardware-conscious stream processing: A survey. ACM SIGMOD Record 48, 4 (2020), 18--29.Google ScholarDigital Library
Index Terms
- Scotty: General and Efficient Open-source Window Aggregation for Stream Processing Systems
Recommendations
Experimental study on the performance and resource utilization of data streaming frameworks
CCGrid '18: Proceedings of the 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid ComputingWith the advent of the Internet of Things (IoT), data stream processing have gained increased attention due to the ever-increasing need to process heterogeneous and voluminous data streams. This work addresses the problem of selecting a correct stream ...
Speeding up the multimedia feature extraction: a comparative study on the big data approach
The current explosion of multimedia data is significantly increasing the amount of potential knowledge. However, to get to the actual information requires to apply novel content-based techniques which in turn require time consuming extraction of ...
HYAS: Hybrid Autoscaler Agent for Apache Flink
Web EngineeringAbstractApache Flink is a distributed processing engine for stateful computations over unbounded and bounded data streams. Despite its versatility, Apache Flink cannot automatically and optimally adjust its computing resources to match the requirements of ...
Comments