research-article

Dual-Paradigm Stream Processing

Authors:
Song Wu

SCTS/CGCL, Huazhong University of Science and Technology, Wuhan, China

SCTS/CGCL, Huazhong University of Science and Technology, Wuhan, China
View Profile

,
Zhiyi Liu

SCTS/CGCL, Huazhong University of Science and Technology, Wuhan, China

SCTS/CGCL, Huazhong University of Science and Technology, Wuhan, China
View Profile

,
Shadi Ibrahim

Inria, IMT Atlantique, Nantes, France

Inria, IMT Atlantique, Nantes, France
View Profile

,
Lin Gu

SCTS/CGCL, Huazhong University of Science and Technology, Wuhan, China

SCTS/CGCL, Huazhong University of Science and Technology, Wuhan, China
View Profile

,
Hai Jin

SCTS/CGCL, Huazhong University of Science and Technology, Wuhan, China

SCTS/CGCL, Huazhong University of Science and Technology, Wuhan, China
View Profile

,
Fei Chen

SCTS/CGCL, Huazhong University of Science and Technology, Wuhan, China

SCTS/CGCL, Huazhong University of Science and Technology, Wuhan, China
View Profile

ICPP '18: Proceedings of the 47th International Conference on Parallel ProcessingAugust 2018Article No.: 83Pages 1–10https://doi.org/10.1145/3225058.3225120

Published:13 August 2018Publication History

ICPP '18: Proceedings of the 47th International Conference on Parallel Processing

Pages 1–10

ABSTRACT

Existing stream processing frameworks operate either under data stream paradigm processing data record by record to favor low latency, or under operation stream paradigm processing data in micro-batches to desire high throughput. For complex and mutable data processing requirements, this dilemma brings the selection and deployment of stream processing frameworks into an embarrassing situation. Moreover, current data stream or operation stream paradigms cannot handle data burst efficiently, which probably results in noticeable performance degradation. This paper introduces a dual-paradigm stream processing, called DO (Data and Operation) that can adapt to stream data volatility. It enables data to be processed in micro-batches (i.e., operation stream) when data burst occurs to achieve high throughput, while data is processed record by record (i.e., data stream) in the remaining time to sustain low latency. DO embraces a method to detect data bursts, identify the main operations affected by the data burst and switch paradigms accordingly. Our insight behind DO's design is that the trade-off between latency and throughput of stream processing frameworks can be dynamically achieved according to data communication among operations in a fine-grained manner (i.e., operation level) instead of framework level. We implement a prototype stream processing framework that adopts DO. Our experimental results show that our framework with DO can achieve 5x speedup over operation stream under low data stream sizes, and outperforms data stream on throughput by 2.1x to 3.2x under data burst.

References

Tyler Akidau, Alex Balikov, Kaya Bekiroğlu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, and Sam Whittle. 2013. MillWheel: Fault-tolerant Stream Processing at Internet Scale. Proceedings of the VLDB Endowment. 6, 11 (Aug. 2013), 1033--1044. Google ScholarDigital Library
Leonardo Aniello, Roberto Baldoni, and Leonardo Querzoni. 2013. Adaptive Online Scheduling in Storm. In Proceedings of the 7th ACM International Conference on Distributed Event-based Systems (DEBS '13). ACM, 207--218. Google ScholarDigital Library
Brian Babcock, Shivnath Babu, Rajeev Motwani, and Mayur Datar. 2003. Chain: Operator Scheduling for Memory Minimization in Data Stream Systems. In Proceedings of the 2003 ACM SIGMOD International Conference on Management of Data (SIGMOD '03). ACM, 253--264. Google ScholarDigital Library
Brian Babcock, Mayur Datar, and Rajeev Motwani. 2003. Load shedding techniques for data stream systems. In Proceedings of the 2003 Management and Processing of Data Streams Workshop (MPDS '03), Vol. 577. ACM.Google Scholar
Yahoo! Streaming Benchmark. 2015. https://yahooeng.tumblr.com/post/135321837876/Google Scholar
Pramod Bhatotia, Alexander Wieder, Rodrigo Rodrigues, Umut A. Acar, and Rafael Pasquin. 2011. Incoop: MapReduce for Incremental Computations. In Proceedings of the 2011 ACM Symposium on Cloud Computing (SOCC '11). ACM, Article 7, 14 pages. Google ScholarDigital Library
Tyson Condie, Neil Conway, Peter Alvaro, Joseph M. Hellerstein, Khaled Elmeleegy, and Russell Sears. 2010. MapReduce Online. In Proceedings of the 7th USENIX Symposium on Networked Systems Design and Implementation (NSDI '10). USENIX, 313--328. Google ScholarDigital Library
Tathagata Das, Yuan Zhong, Ion Stoica, and Scott Shenker. 2014. Adaptive Stream Processing Using Dynamic Batch Sizing. In Proceedings of the 2014 ACM Symposium on Cloud Computing (SOCC '14). ACM, Article 16, 13 pages. Google ScholarDigital Library
Structured Streaming Programming Guide. 2018. http://spark.apache.org/docs/latest/structured-streaming-programming-guide.htmlGoogle Scholar
Martin Hirzel, Robert Soulé, Scott Schneider, Buğra Gedik, and Robert Grimm. 2014. A Catalog of Stream Processing Optimizations. ACM Comput. Surv. 46, 4, Article 46 (March2014), 34 pages. Google ScholarDigital Library
JStorm. 2015. http://jstorm.io/Google Scholar
Asterios Katsifodimos and Sebastian Schelter. 2016. Apache Flink: Stream Analytics at Scale. In Proceedings of the 2016 IEEE International Conference on Cloud Engineering Workshop (IC2EW '16). IEEE, 193--193.Google ScholarCross Ref
Wilhelm Kleiminger, Evangelia Kalyvianaki, and Peter Pietzuch. 2011. Balancing load in stream processing with the cloud. In Proceedings of the 27th IEEE International Conference on Data Engineering Workshops (ICDEW '11). IEEE, 16--21. Google ScholarDigital Library
Sanjeev Kulkarni, Nikunj Bhagat, Maosong Fu, Vikas Kedigehalli, Christopher Kellogg, Sailesh Mittal, Jignesh M. Patel, Karthik Ramasamy, and Siddarth Taneja. 2015. Twitter Heron: Stream Processing at Scale. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD '15). ACM, 239--250. Google ScholarDigital Library
Julie Letchner, Christopher Ré, Magdalena Balazinska, and Matthai Philipose. 2010. Approximation trade-offs in Markovian stream processing: An empirical study. In Proceedings of the 26th IEEE International Conference on Data Engineering (ICDE '10). IEEE, 936--939.Google ScholarCross Ref
Derek G. Murray, Frank McSherry, Rebecca Isaacs, Michael Isard, Paul Barham, and Martín Abadi. 2013. Naiad: A Timely Dataflow System. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP '13). ACM, 439--455. Google ScholarDigital Library
Leonardo Neumeyer, Bruce Robbins, Anish Nair, and Anand Kesari. 2010. S4: Distributed Stream Computing Platform. In Proceedings of the 2010 IEEE International Conference on Data Mining Workshops (ICDMW '10). IEEE, 170--177. Google ScholarDigital Library
Christopher Olston, Greg Chiou, Laukik Chitnis, Francis Liu, Yiping Han, Mattias Larsson, Andreas Neumann, Vellanki B.N. Rao, Vijayanand Sankarasubramanian, Siddharth Seth, Chao Tian, Topher ZiCornell, and Xiaodan Wang. 2011. Nova: Continuous Pig/Hadoop Workflows. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (SIGMOD '11). ACM, 1081--1090. Google ScholarDigital Library
Samza. 2015. http://samza.apache.org/Google Scholar
Michael Stonebraker, Uğur Çetintemel, and Stan Zdonik. 2005. The 8 Requirements of Real-time Stream Processing. SIGMOD Rec. 34, 4 (Dec. 2005), 42--47. Google ScholarDigital Library
Jaspar Subhlok and Gary Vondran. 1996. Optimal Latency-throughput Tradeoffs for Data Parallel Pipelines. In Proceedings of the 8th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA '96). ACM, 62--71. Google ScholarDigital Library
Ankit Toshniwal, Siddarth Taneja, Amit Shukla, Karthik Ramasamy, Jignesh M. Patel, Sanjeev Kulkarni, Jason Jackson, Krishna Gade, Maosong Fu, Jake Donham, Nikunj Bhagat, Sailesh Mittal, and Dmitriy Ryaboy. 2014. Storm@twitter. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD '14). ACM, 147--156. Google ScholarDigital Library
Trident. 2015. http://storm.apache.org/releases/1.1.0/Trident-tutorial.htmlGoogle Scholar
Shivaram Venkataraman, Aurojit Panda, Kay Ousterhout, Michael Armbrust, Ali Ghodsi, Michael J. Franklin, Benjamin Recht, and Ion Stoica. 2017. Drizzle: Fast and Adaptable Stream Processing at Scale. In Proceedings of the 26th ACM Symposium on Operating Systems Principles (SOSP '17). ACM, 374--389. Google ScholarDigital Library
Naga Vydyanathan, Umit Catalyurek, Tahsin Kurc, Ponnuswamy Sadayappan, and Joel Saltz. 2011. Optimizing latency and throughput of application workflows on clusters. Parallel Comput. 37, 10 (2011), 694--712. Google ScholarDigital Library
Walter Willinger, Murad S. Taqqu, and Ashok Erramilli. 1996. A Bibliographical Guide to Self-Similar Traffic and Performance Modeling for Modern High-Speed Networks. Stochastic Networks Theory and Applications (1996), 339--366.Google Scholar
Jielong Xu, Zhenhua Chen, Jian Tang, and Sen Su. 2014. T-Storm: Traffic-Aware Online Scheduling in Storm. In Proceedings of the 34th IEEE International Conference on Distributed Computing Systems (ICDCS '14). IEEE, 535--544. Google ScholarDigital Library
Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, and Ion Stoica. 2013. Discretized Streams: Fault-tolerant Streaming Computation at Scale. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP '13). ACM, 423--438. Google ScholarDigital Library

Recommendations

Characterization of Big Data Stream Processing Pipeline: A Case Study using Flink and Kafka
BDCAT '17: Proceedings of the Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies

In recent years there has been a surge in applications focusing on streaming data to generate insights in real-time. Both academia, as well as industry, have tried to address this use case by developing a variety of Stream Processing Engines (SPEs) with ...
Read More
Towards collaborative data reduction in stream-processing systems

We consider a distributed system that disseminates high-volume event streams to many simultaneous monitoring applications over a low-bandwidth network. For bandwidth efficiency, we propose a collaborative data-reduction mechanism, 'group-aware stream ...
Read More
Massively-parallel stream processing under QoS constraints with Nephele
HPDC '12: Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing

Today, a growing number of commodity devices, like mobile phones or smart meters, is equipped with rich sensors and capable of producing continuous data streams. The sheer amount of these devices and the resulting overall data volumes of the streams ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in

ICPP '18: Proceedings of the 47th International Conference on Parallel Processing
August 2018
945 pages
ISBN:9781450365109
DOI:10.1145/3225058

Copyright © 2018 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 13 August 2018
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Data burst
Dual-paradigm approach
High throughput
Low latency
Stream processing
Qualifiers
- research-article
- Research
- Refereed limited
Conference

Acceptance Rates
ICPP '18 Paper Acceptance Rate91of313submissions,29%Overall Acceptance Rate91of313submissions,29%
More
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 1
  Total Citations
  View Citations
- 181
  Total Downloads
- Downloads (Last 12 months)6
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Dual-Paradigm Stream Processing

ICPP '18: Proceedings of the 47th International Conference on Parallel Processing

ABSTRACT

References

Cited By

Recommendations

Characterization of Big Data Stream Processing Pipeline: A Case Study using Flink and Kafka

Towards collaborative data reduction in stream-processing systems

Massively-parallel stream processing under QoS constraints with Nephele

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Dual-Paradigm Stream Processing

ICPP '18: Proceedings of the 47th International Conference on Parallel Processing

ABSTRACT

References

Cited By

Recommendations

Characterization of Big Data Stream Processing Pipeline: A Case Study using Flink and Kafka

Towards collaborative data reduction in stream-processing systems

Massively-parallel stream processing under QoS constraints with Nephele

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Acceptance Rates

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media