skip to main content
research-article

Scotty: General and Efficient Open-source Window Aggregation for Stream Processing Systems

Published:27 March 2021Publication History
Skip Abstract Section

Abstract

Window aggregation is a core operation in data stream processing. Existing aggregation techniques focus on reducing latency, eliminating redundant computations, or minimizing memory usage. However, each technique operates under different assumptions with respect to workload characteristics, such as properties of aggregation functions (e.g., invertible, associative), window types (e.g., sliding, sessions), windowing measures (e.g., time- or count-based), and stream (dis)order. In this article, we present Scotty, an efficient and general open-source operator for sliding-window aggregation in stream processing systems, such as Apache Flink, Apache Beam, Apache Samza, Apache Kafka, Apache Spark, and Apache Storm. One can easily extend Scotty with user-defined aggregation functions and window types. Scotty implements the concept of general stream slicing and derives workload characteristics from aggregation queries to improve performance without sacrificing its general applicability. We provide an in-depth view on the algorithms of the general stream slicing approach. Our experiments show that Scotty outperforms alternative solutions.

References

  1. Tyler Akidau, Robert Bradshaw, et al. 2015. The dataflow model: A practical approach to balancing correctness, latency, and cost in massive-scale, unbounded, out-of-order data processing. Proc. VLDB Endow. 8, 12 (2015), 1792--1803.Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Alexander Alexandrov, Rico Bergmann, Stephan Ewen, et al. 2014. The Stratosphere platform for big data analytics. VLDB J. 23, 6 (2014), 939--964.Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Apache Apex. 2018. Enterprise-grade unified stream and batch processing engine. Retrieved from https://apex.apache.org/.Google ScholarGoogle Scholar
  4. Apache Beam. 2018. An advanced unified programming model. Retrieved from https://beam.apache.org/.Google ScholarGoogle Scholar
  5. Arvind Arasu and Jennifer Widom. 2004. Resource sharing in continuous sliding-window aggregates. Proceedings of the International Conference on Very Large Data Bases (PVLDB’04). 336--347.Google ScholarGoogle ScholarCross RefCross Ref
  6. Michael Armbrust, Tathagata Das, Joseph Torres, Burak Yavuz, Shixiong Zhu, Reynold Xin, et al. 2018. Structured streaming: A declarative API for real-time applications in Apache Spark. In Proceedings of the ACM Special Interest Group on Management of Data (SIGMOD’18). 601--613.Google ScholarGoogle Scholar
  7. Ahmed Awad, Jonas Traub, and Sherif Sakr. 2019. Adaptive watermarks: A concept drift-based approach for predicting event-time progress in data streams. In Proceedings of the International Conference on Extending Database Technology (EDBT’19).Google ScholarGoogle Scholar
  8. Cagri Balkesen and Nesime Tatbul. 2011. Scalable data partitioning techniques for parallel sliding window processing over data streams. In Proceedings of the International Workshop on Data Management for Sensor Networks (DMSN’11).Google ScholarGoogle Scholar
  9. Lawrence Benson, Philipp M. Grulich, Steffen Zeuch, Volker Markl, and Tilmann Rabl. 2020. Disco: Efficient distributed window aggregation. In Proceedings of the International Conference on Extending Database Technology (EDBT’20).Google ScholarGoogle Scholar
  10. Pramod Bhatotia, Umut A. Acar, Flavio P. Junqueira, and Rodrigo Rodrigues. 2014. Slider: Incremental sliding window analytics. In Proceedings of the International Middleware Conference. ACM, 61--72.Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Brice Bingman. 2018. Poor performance with sliding time windows. In Flink Jira Issues. Retrieved from issues.apache.org/jira/browse/FLINK-6990.Google ScholarGoogle Scholar
  12. Irina Botan, Roozbeh Derakhshan, Nihal Dindar, Laura Haas, Renée J. Miller, and Nesime Tatbul. 2010. SECRET: A model for analysis of the execution semantics of stream processing systems. Proc. VLDB Endow. 3, 1--2 (2010), 232--243.Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. Paris Carbone. 2018. Scalable and Reliable Data Stream Processing. Ph.D. Dissertation. KTH Stockholm.Google ScholarGoogle Scholar
  14. Paris Carbone, Stephan Ewen, Gyula Fóra, Seif Haridi, Stefan Richter, and Kostas Tzoumas. 2017. State management in Apache Flink: Consistent stateful distributed stream processing. Proc. VLDB Endow. 10, 12 (2017), 1718--1729.Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Paris Carbone, Gyula Fóra, Stephan Ewen, Seif Haridi, and Kostas Tzoumas. 2015. Lightweight asynchronous snapshots for distributed dataflows. Retrieved from https://arxiv.org/abs/1506.08603.Google ScholarGoogle Scholar
  16. Paris Carbone, Asterios Katsifodimos, Stephan Ewen, Volker Markl, Seif Haridi, and Kostas Tzoumas. 2015. Apache Flink: Stream and batch processing in a single engine. IEEE Data Eng. Bull. 38, 4 (2015), 28--38.Google ScholarGoogle Scholar
  17. Paris Carbone, Jonas Traub, Asterios Katsifodimos, Seif Haridi, and Volker Markl. 2016. Cutty: Aggregate sharing for user-defined windows. In Proceedings of the ACM Conference on Information and Knowledge Management (CIKM’16). 1201--1210.Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Badrish Chandramouli, Jonathan Goldstein, Mike Barnett, Robert DeLine, Danyel Fisher, John C. Platt, James F. Terwilliger, and John Wernsing. 2014. Trill: A high-performance incremental query processor for diverse analytics. Proceedings of the International Conference on Very Large Data Bases (PVLDB’14) 8, 4 (2014), 401--412.Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Sanket Chintapalli, Derek Dagit, Bobby Evans et al. 2016. Benchmarking streaming computation engines: Storm, Flink and Spark streaming. In Proceedings of the IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW’16). 1789--1792.Google ScholarGoogle ScholarCross RefCross Ref
  20. Xenofontas Dimitropoulos, Paul Hurley, Andreas Kind, and Marc Ph Stoecklin. 2009. On the 95-percentile billing method. In Passive and Active Network Measurement. Springer, 207--216.Google ScholarGoogle Scholar
  21. Buğra Gedik. 2014. Generic windowing support for extensible stream processing systems. Softw.: Pract. Exp. 44, 9 (2014), 1105--1128.Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Thanaa M. Ghanem, Moustafa A. Hammad, Mohamed F. Mokbel, Walid G. Aref, and Ahmed K. Elmagarmid. 2007. Incremental evaluation of sliding-window queries over data streams. IEEE Trans. Knowl. Data Eng. 19, 1 (2007), 57--72.Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don Reichart, Murali Venkatrao, Frank Pellow, and Hamid Pirahesh. 1997. Data cube: A relational aggregation operator generalizing group-by, cross-tab, and sub-totals. Data Min. Knowl. Discov. 1, 1 (1997), 29--53.Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Michael Grossniklaus, David Maier, James Miller, Sharmadha Moorthy, and Kristin Tufte. 2016. Frames: Data-driven windows. In Proceedings of the ACM International Conference on Distributed and Event-based Systems (DEBS’16).Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Philipp M. Grulich, Sebastian Breß, Jonas Traub, Tilmann Rabl, Janis von Bleichert, Zongxiong Chen, Steffen Zeuch, and Volker Markl. 2020. Grizzly: Efficient stream processing through adaptive query compilation. In Proceedings of the ACM SIGMOD International Conference on Management of Data. ACM New York, NY.Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Philipp M. Grulich, Jonas Traub, Sebastian Breß, Asterios Katsifodimos, Tilmann Rabl, and Volker Markl. 2019. Generating reproducible out-of-order data streams. In Proceedings of the ACM International Conference on Distributed and Event-based Systems (DEBS’19). 256--257.Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Shenoda Guirguis, Mohamed A. Sharaf, Panos K. Chrysanthis, and Alexandros Labrinidis. 2011. Optimized processing of multiple aggregate continuous queries. In Proceedings of the Conference on Information and Knowledge Management (CIKM’11). ACM, 1515--1524.Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Shenoda Guirguis, Mohamed A. Sharaf, Panos K. Chrysanthis, and Alexandros Labrinidis. 2012. Three-level processing of multiple aggregate continuous queries. In Proceedings of the IEEE International Conference on Data Engineering (ICDE’12). 929--940.Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Guenter Hesse, Christoph Matthies, Kelvin Glass, Johannes Huegle, and Matthias Uflacker. 2019. Quantitative impact evaluation of an abstraction layer for data stream processing systems. In Proceedings of the International Conference on Distributed Computing Systems (ICDCS’19). IEEE, 1381--1392.Google ScholarGoogle ScholarCross RefCross Ref
  30. Martin Hirzel, Henrique Andrade, Buğra Gedik, Vibhore Kumar, Giuliano Losa, Howard Nasgaard, Robert Soulé, and Kun-Lung Wu. 2009. SPL stream processing language specification. IBM Res. Report (2009). http://cs.yale.edu/homes/soule/pubs/rc24897.pdf.Google ScholarGoogle Scholar
  31. Martin Hirzel, Scott Schneider, and Kanat Tangwongsan. 2017. Sliding-window aggregation algorithms: Tutorial. In Proceedings of the ACM International Conference on Distributed and Event-based Systems (DEBS’17). 11--14.Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Martin Hirzel, Robert Soulé, Scott Schneider, Buğra Gedik, and Robert Grimm. 2014. A catalog of stream processing optimizations. Comput. Surveys 46, 4 (2014), 46.Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Kartik Hosanagar, John Chuang, Ramayya Krishnan, and Michael D. Smith. 2008. Service adoption and pricing of content delivery network (CDN) services. Manage. Sci. 54, 9 (2008), 1579--1593.Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Ryan Huebsch, Minos Garofalakis, Joseph M. Hellerstein, and Ion Stoica. 2007. Sharing aggregate computation for distributed queries. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 485--496.Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. Julius Hülsmann, Jonas Traub, and Volker Markl. 2020. Demand-based sensor data gathering with multi-query optimization. Proc. VLDB Endow. 13, 12 (2020), 2801--2804.Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Zbigniew Jerzak, Thomas Heinze, Matthias Fehr, Daniel Gröber, Raik Hartung, and Nenad Stojanovic. 2012. The DEBS 2012 grand challenge. In Proceedings of the ACM International Conference on Distributed and Event-based Systems (DEBS’12). 393--398.Google ScholarGoogle Scholar
  37. Uwe Jugel, Zbigniew Jerzak, Gregor Hackenbroich, and Volker Markl. 2014. M4: A visualization-oriented time series data aggregation. In Proceedings of the International Conference on Very Large Data Bases (PVLDB’14). 797--808.Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Jeyhun Karimov, Tilmann Rabl, Asterios Katsifodimos, Roman Samarev, Henri Heiskanen, and Volker Markl. 2018. Benchmarking distributed stream processing engines. In IEEE Proceedings of the IEEE International Conference on Data Engineering (ICDE’18).Google ScholarGoogle Scholar
  39. Alexandros Koliousis, Matthias Weidlich, Raul Castro Fernandez, Alexander L. Wolf, Paolo Costa, and Peter Pietzuch. 2016. SABER: Window-based hybrid stream processing for heterogeneous architectures. In Proceedings of the ACM Special Interest Group on Management of Data (SIGMOD’16). 555--569.Google ScholarGoogle Scholar
  40. Jay Kreps. 2016. Introducing Kafka streams: Stream processing made simple. Confluent Blog. Retrieved from https://www.confluent.io/blog/introducing-kafka-streams-stream-processing-made-simple/.Google ScholarGoogle Scholar
  41. Sailesh Krishnamurthy, Michael J. Franklin, Jeffrey Davis, Daniel Farina, Pasha Golovko, Alan Li, and Neil Thombre. 2010. Continuous analytics over discontinuous streams. In Proceedings of the ACM Special Interest Group on Management of Data (SIGMOD’10). 1081--1092.Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Sailesh Krishnamurthy, Chung Wu, and Michael Franklin. 2006. On-the-fly sharing for streamed aggregation. In Proceedings of the ACM SIGMOD International Conference on Management of Data. 623--634.Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, and Peter A. Tucker. 2005. No pane, no gain: Efficient evaluation of sliding-window aggregates over data streams. SIGMOD Record 34, 1 (2005), 39--44.Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Jin Li, David Maier, Kristin Tufte, Vassilis Papadimos, and Peter A. Tucker. 2005. Semantics and evaluation techniques for window aggregates in data streams. In Proceedings of the ACM Special Interest Group on Management of Data (SIGMOD’05). 311--322.Google ScholarGoogle Scholar
  45. Jin Li, Kristin Tufte, David Maier, and Vassilis Papadimos. 2008. AdaptWID: An adaptive, memory-efficient window aggregation implementation. IEEE Internet Comput. 12, 6 (2008), 22--29.Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Jin Li, Kristin Tufte, Vladislav Shkapenyuk, Vassilis Papadimos, Theodore Johnson, and David Maier. 2008. Out-of-order processing: A new architecture for high-performance stream systems. Proc. VLDB Endow. 1, 1 (2008), 274--288.Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Christopher Mutschler, Holger Ziekow, and Zbigniew Jerzak. 2013. The DEBS 2013 grand challenge. In Proceedings of the ACM International Conference on Distributed and Event-based Systems (DEBS’13). 289--294.Google ScholarGoogle Scholar
  48. Shadi A. Noghabi, Kartik Paramasivam, Yi Pan, Navina Ramesh, Jon Bringhurst, Indranil Gupta, and Roy H. Campbell. 2017. Samza: Stateful scalable stream processing at LinkedIn. Proc. VLDB Endow. 10, 12 (2017), 1634--1645.Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. OpenJDK. 2018. JMH benchmarking suite project website. Retrieved from http://openjdk.java.net/projects/code-tools/jmh/.Google ScholarGoogle Scholar
  50. OpenJDK. 2018. Nashorn project, ObjectSizeCalculator. Retrieved from http://openjdk.java.net/projects/nashorn/.Google ScholarGoogle Scholar
  51. Kostas Patroumpas et al. 2006. Window specification over data streams. In Proceedings of the International Conference on Extending Database Technology (EDBT’06).Google ScholarGoogle Scholar
  52. David Salomon. 2007. Variable-length Codes for Data Compression. Springer.Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Anatoli U. Shein, Panos K. Chrysanthis, and Alexandros Labrinidis. 2015. F1: Accelerating the optimization of aggregate continuous queries. In Proceedings of the Conference on Information and Knowledge Management (CIKM’15). 1151--1160.Google ScholarGoogle ScholarDigital LibraryDigital Library
  54. Anatoli U. Shein, Panos K. Chrysanthis, and Alexandros Labrinidis. 2017. Flatfit: Accelerated incremental sliding-window aggregation for real-time analytics. In Proceedings of the International Conference on Scientific and Statistical Database Management.Google ScholarGoogle ScholarDigital LibraryDigital Library
  55. Anatoli U. Shein, Panos K. Chrysanthis, and Alexandros Labrinidis. 2018. SlickDeque: High throughput and low latency incremental sliding-window aggregation. In Proceedings of the International Conference on Extending Database Technology (EDBT’18).Google ScholarGoogle Scholar
  56. Leo Syinchwun. 2016. Lightweight event time window. In Flink Jira Issues. Retrieved from issues.apache.org/jira/browse/FLINK-5387.Google ScholarGoogle Scholar
  57. Kanat Tangwongsan, Martin Hirzel, and Scott Schneider. 2017. Low-latency sliding-window aggregation in worst-case constant time. In Proceedings of the ACM International Conference on Distributed and Event-based Systems (DEBS’17). 66--77.Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. Kanat Tangwongsan, Martin Hirzel, and Scott Schneider. 2019. Optimal and general out-of-order sliding-window aggregation. Proc. VLDB Endow. 12, 10 (2019), 1167--1180.Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. Kanat Tangwongsan, Martin Hirzel, Scott Schneider, and Kun-Lung Wu. 2015. General incremental sliding-window aggregation. Proceedings of the International Conference on Very Large Data Bases (PVLDB’15) 8, 7 (2015), 702--713.Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. Joseph Torres, Michael Armbrust, Tathagata Das, and Shixiong Zhu. 2018. Introducing low-latency continuous processing mode in structured streaming in Apache Spark 2.3. Databricks Blog. Retrieved from https://databricks.com/blog/2018/03/20/low-latency-continuous-processing-mode-in-structured-streaming-in-apache-spark-2-3-0.html.Google ScholarGoogle Scholar
  61. Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, et al. 2014. Storm@twitter. In Proceedings of the ACM Special Interest Group on Management of Data (SIGMOD’14). 147--156.Google ScholarGoogle Scholar
  62. Jonas Traub. 2019. Demand-based data stream gathering, processing, and transmission. TU Berlin, PhD Thesis.Google ScholarGoogle Scholar
  63. Jonas Traub, Sebastian Breß, Tilmann Rabl, Asterios Katsifodimos, and Volker Markl. 2017. Optimized on-demand data streaming from sensor nodes. Proceedings of the Symposium on Cloud Computing (SoCC’17). 586--597.Google ScholarGoogle ScholarDigital LibraryDigital Library
  64. Jonas Traub, Philipp M. Grulich, Alejandro Rodriguez Cuellar, Sebastian Breß, Asterios Katsifodimos, Tilmann Rabl, and Volker Markl. 2018. Scotty: Efficient window aggregation for out-of-order stream processing. In Proceedings of the IEEE International Conference on Data Engineering (ICDE’18).Google ScholarGoogle ScholarCross RefCross Ref
  65. Jonas Traub, Philipp M. Grulich, Alejandro Rodriguez Cuellar, Sebastian Breß, Asterios Katsifodimos, Tilmann Rabl, and Volker Markl. 2019. Efficient window aggregation with general stream slicing. In Proceedings of the International Conference on Extending Database Technology (EDBT’17).Google ScholarGoogle Scholar
  66. Jonas Traub, Julius Hülsmann, Sebastian Breß, Tilmann Rabl, and Volker Markl. 2019. SENSE: Scalable data acquisition from distributed sensors with guaranteed time coherence. Retrieved from https://arxiv.org/abs/1912.04648.Google ScholarGoogle Scholar
  67. Jonas Traub, Nikolaas Steenbergen, Philipp M. Grulich, Tilmann Rabl, and Volker Markl. 2017. I2: Interactive real-time visualization for streaming data. In Proceedings of the International Conference on Extending Database Technology (EDBT’17). 526--529.Google ScholarGoogle Scholar
  68. Peter A. Tucker, David Maier, Tim Sheard, and Leonidas Fegaras. 2003. Exploiting punctuation semantics in continuous data streams. IEEE Trans. Knowl. Data Eng. 15, 3 (2003), 555--568.Google ScholarGoogle ScholarDigital LibraryDigital Library
  69. Kostas Tzoumas et al. 2015. High-throughput, low-latency, and exactly-once stream processing with Apache Flink. Retrieved from data-artisans.com/blog/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink.Google ScholarGoogle Scholar
  70. Mikhail Vorontsov. 2013. Memory consumption of popular Java data types - part 2. Java Performance Tuning Guide. Retrieved from http://java-performance.info/memory-consumption-of-java-data-types-2/.Google ScholarGoogle Scholar
  71. Guozhang Wang. 2017. Enabling exactly-once in Kafka streams. Confluent Blog. Retrieved from https://www.confluent.io/blog/enabling-exactly-once-kafka-streams/.Google ScholarGoogle Scholar
  72. Jark Wu. 2017. Improve performance of sliding time window with pane optimization. In Flink Jira Issues. Retrieved from issues.apache.org/jira/browse/FLINK-7001.Google ScholarGoogle Scholar
  73. Yuan Yu, Pradeep Kumar Gunda, and Michael Isard. 2009. Distributed aggregation for data-parallel computing: Interfaces and implementations. In Proceedings of the ACM Special Interest Group on Operating Systems (SIGOPS’09). 247--260.Google ScholarGoogle Scholar
  74. Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, and Ion Stoica. 2012. Discretized streams: An efficient and fault-tolerant model for stream processing on large clusters. In Proceedings of the USENIX Conference on Hot Topics in Cloud Computing.Google ScholarGoogle Scholar
  75. Matei Zaharia, Reynold S. Xin, Patrick Wendell, et al. 2016. Apache Spark: A unified engine for big data processing. Commun. ACM 59, 11 (2016), 56--65.Google ScholarGoogle ScholarDigital LibraryDigital Library
  76. Steffen Zeuch, Ankit Chaudhary, Bonaventura Del Monte, et al. 2020. The NebulaStream Platform: Data and application management for the Internet of Things. In Proceedings of the Conference on Innovative Data Systems Research (CIDR’20).Google ScholarGoogle Scholar
  77. Steffen Zeuch, Bonaventura Del Monte, Jeyhun Karimov, Clemens Lutz, Manuel Renz, Jonas Traub, Sebastian Breß, Tilmann Rabl, and Volker Markl. 2019. Analyzing efficient stream processing on modern hardware. In Proceedings of the International Conference on Very Large Data Bases (PVLDB’19).Google ScholarGoogle ScholarDigital LibraryDigital Library
  78. Shuhao Zhang, Feng Zhang, Yingjun Wu, Bingsheng He, and Paul Johns. 2020. Hardware-conscious stream processing: A survey. ACM SIGMOD Record 48, 4 (2020), 18--29.Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Scotty: General and Efficient Open-source Window Aggregation for Stream Processing Systems

            Recommendations

            Comments

            Login options

            Check if you have access through your login credentials or your institution to get full access on this article.

            Sign in

            Full Access

            PDF Format

            View or Download as a PDF file.

            PDF

            eReader

            View online with eReader.

            eReader

            HTML Format

            View this article in HTML Format .

            View HTML Format