ABSTRACT
Existing stream processing frameworks operate either under data stream paradigm processing data record by record to favor low latency, or under operation stream paradigm processing data in micro-batches to desire high throughput. For complex and mutable data processing requirements, this dilemma brings the selection and deployment of stream processing frameworks into an embarrassing situation. Moreover, current data stream or operation stream paradigms cannot handle data burst efficiently, which probably results in noticeable performance degradation. This paper introduces a dual-paradigm stream processing, called DO (Data and Operation) that can adapt to stream data volatility. It enables data to be processed in micro-batches (i.e., operation stream) when data burst occurs to achieve high throughput, while data is processed record by record (i.e., data stream) in the remaining time to sustain low latency. DO embraces a method to detect data bursts, identify the main operations affected by the data burst and switch paradigms accordingly. Our insight behind DO's design is that the trade-off between latency and throughput of stream processing frameworks can be dynamically achieved according to data communication among operations in a fine-grained manner (i.e., operation level) instead of framework level. We implement a prototype stream processing framework that adopts DO. Our experimental results show that our framework with DO can achieve 5x speedup over operation stream under low data stream sizes, and outperforms data stream on throughput by 2.1x to 3.2x under data burst.
- Tyler Akidau, Alex Balikov, Kaya Bekiroğlu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, and Sam Whittle. 2013. MillWheel: Fault-tolerant Stream Processing at Internet Scale. Proceedings of the VLDB Endowment. 6, 11 (Aug. 2013), 1033--1044. Google ScholarDigital Library
- Leonardo Aniello, Roberto Baldoni, and Leonardo Querzoni. 2013. Adaptive Online Scheduling in Storm. In Proceedings of the 7th ACM International Conference on Distributed Event-based Systems (DEBS '13). ACM, 207--218. Google ScholarDigital Library
- Brian Babcock, Shivnath Babu, Rajeev Motwani, and Mayur Datar. 2003. Chain: Operator Scheduling for Memory Minimization in Data Stream Systems. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD '03). ACM, 253--264. Google ScholarDigital Library
- Brian Babcock, Mayur Datar, and Rajeev Motwani. 2003. Load shedding techniques for data stream systems. In Proceedings of the 2003 Management and Processing of Data Streams Workshop (MPDS '03), Vol. 577. ACM.Google Scholar
- Yahoo! Streaming Benchmark. 2015. https://yahooeng.tumblr.com/post/135321837876/Google Scholar
- Pramod Bhatotia, Alexander Wieder, Rodrigo Rodrigues, Umut A. Acar, and Rafael Pasquin. 2011. Incoop: MapReduce for Incremental Computations. In Proceedings of the 2011 ACM Symposium on Cloud Computing (SOCC '11). ACM, Article 7, 14 pages. Google ScholarDigital Library
- Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, and Russell Sears. 2010. MapReduce Online. In Proceedings of the 7th USENIX Symposium on Networked Systems Design and Implementation (NSDI '10). USENIX, 313--328. Google ScholarDigital Library
- Tathagata Das, Yuan Zhong, Ion Stoica, and Scott Shenker. 2014. Adaptive Stream Processing Using Dynamic Batch Sizing. In Proceedings of the 2014 ACM Symposium on Cloud Computing (SOCC '14). ACM, Article 16, 13 pages. Google ScholarDigital Library
- Structured Streaming Programming Guide. 2018. http://spark.apache.org/docs/latest/structured-streaming-programming-guide.htmlGoogle Scholar
- Martin Hirzel, Robert Soulé, Scott Schneider, Buğra Gedik, and Robert Grimm. 2014. A Catalog of Stream Processing Optimizations. ACM Comput. Surv. 46, 4, Article 46 (March2014), 34 pages. Google ScholarDigital Library
- JStorm. 2015. http://jstorm.io/Google Scholar
- Asterios Katsifodimos and Sebastian Schelter. 2016. Apache Flink: Stream Analytics at Scale. In Proceedings of the 2016 IEEE International Conference on Cloud Engineering Workshop (IC2EW '16). IEEE, 193--193.Google ScholarCross Ref
- Wilhelm Kleiminger, Evangelia Kalyvianaki, and Peter Pietzuch. 2011. Balancing load in stream processing with the cloud. In Proceedings of the 27th IEEE International Conference on Data Engineering Workshops (ICDEW '11). IEEE, 16--21. Google ScholarDigital Library
- Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthik Ramasamy, and Siddarth Taneja. 2015. Twitter Heron: Stream Processing at Scale. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15). ACM, 239--250. Google ScholarDigital Library
- Julie Letchner, Christopher Ré, Magdalena Balazinska, and Matthai Philipose. 2010. Approximation trade-offs in Markovian stream processing: An empirical study. In Proceedings of the 26th IEEE International Conference on Data Engineering (ICDE '10). IEEE, 936--939.Google ScholarCross Ref
- Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: A Timely Dataflow System. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP '13). ACM, 439--455. Google ScholarDigital Library
- Leonardo Neumeyer, Bruce Robbins, Anish Nair, and Anand Kesari. 2010. S4: Distributed Stream Computing Platform. In Proceedings of the 2010 IEEE International Conference on Data Mining Workshops (ICDMW '10). IEEE, 170--177. Google ScholarDigital Library
- Christopher Olston, Greg Chiou, Laukik Chitnis, Francis Liu, Yiping Han, Mattias Larsson, Andreas Neumann, Vellanki B.N. Rao, Vijayanand Sankarasubramanian, Siddharth Seth, Chao Tian, Topher ZiCornell, and Xiaodan Wang. 2011. Nova: Continuous Pig/Hadoop Workflows. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (SIGMOD '11). ACM, 1081--1090. Google ScholarDigital Library
- Samza. 2015. http://samza.apache.org/Google Scholar
- Michael Stonebraker, Uğur Çetintemel, and Stan Zdonik. 2005. The 8 Requirements of Real-time Stream Processing. SIGMOD Rec. 34, 4 (Dec. 2005), 42--47. Google ScholarDigital Library
- Jaspar Subhlok and Gary Vondran. 1996. Optimal Latency-throughput Tradeoffs for Data Parallel Pipelines. In Proceedings of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '96). ACM, 62--71. Google ScholarDigital Library
- Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, and Dmitriy Ryaboy. 2014. Storm@twitter. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD '14). ACM, 147--156. Google ScholarDigital Library
- Trident. 2015. http://storm.apache.org/releases/1.1.0/Trident-tutorial.htmlGoogle Scholar
- Shivaram Venkataraman, Aurojit Panda, Kay Ousterhout, Michael Armbrust, Ali Ghodsi, Michael J. Franklin, Benjamin Recht, and Ion Stoica. 2017. Drizzle: Fast and Adaptable Stream Processing at Scale. In Proceedings of the 26th ACM Symposium on Operating Systems Principles (SOSP '17). ACM, 374--389. Google ScholarDigital Library
- Naga Vydyanathan, Umit Catalyurek, Tahsin Kurc, Ponnuswamy Sadayappan, and Joel Saltz. 2011. Optimizing latency and throughput of application workflows on clusters. Parallel Comput. 37, 10 (2011), 694--712. Google ScholarDigital Library
- Walter Willinger, Murad S. Taqqu, and Ashok Erramilli. 1996. A Bibliographical Guide to Self-Similar Traffic and Performance Modeling for Modern High-Speed Networks. Stochastic Networks Theory and Applications (1996), 339--366.Google Scholar
- Jielong Xu, Zhenhua Chen, Jian Tang, and Sen Su. 2014. T-Storm: Traffic-Aware Online Scheduling in Storm. In Proceedings of the 34th IEEE International Conference on Distributed Computing Systems (ICDCS '14). IEEE, 535--544. Google ScholarDigital Library
- Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica. 2013. Discretized Streams: Fault-tolerant Streaming Computation at Scale. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP '13). ACM, 423--438. Google ScholarDigital Library
Recommendations
Characterization of Big Data Stream Processing Pipeline: A Case Study using Flink and Kafka
BDCAT '17: Proceedings of the Fourth IEEE/ACM International Conference on Big Data Computing, Applications and TechnologiesIn recent years there has been a surge in applications focusing on streaming data to generate insights in real-time. Both academia, as well as industry, have tried to address this use case by developing a variety of Stream Processing Engines (SPEs) with ...
Towards collaborative data reduction in stream-processing systems
We consider a distributed system that disseminates high-volume event streams to many simultaneous monitoring applications over a low-bandwidth network. For bandwidth efficiency, we propose a collaborative data-reduction mechanism, 'group-aware stream ...
Massively-parallel stream processing under QoS constraints with Nephele
HPDC '12: Proceedings of the 21st international symposium on High-Performance Parallel and Distributed ComputingToday, a growing number of commodity devices, like mobile phones or smart meters, is equipped with rich sensors and capable of producing continuous data streams. The sheer amount of these devices and the resulting overall data volumes of the streams ...
Comments