Abstract
Over the past few years, Stream Processing Engines (SPEs) have emerged as a new class of software systems, enabling low latency processing of streams of data arriving at high rates. As SPEs mature and get used in monitoring applications that must continuously run (e.g., in network security monitoring), a significant challenge arises: SPEs must be able to handle various software and hardware faults that occur, masking them to provide high availability (HA). In this article, we develop, implement, and evaluate DPC (Delay, Process, and Correct), a protocol to handle crash failures of processing nodes and network failures in a distributed SPE.
Like previous approaches to HA, DPC uses replication and masks many types of node and network failures. In the presence of network partitions, the designer of any replication system faces a choice between providing availability or data consistency across the replicas. In DPC, this choice is made explicit: the user specifies an availability bound (no result should be delayed by more than a specified delay threshold even under failure if the corresponding input is available), and DPC attempts to minimize the resulting inconsistency between replicas (not all of which might have seen the input data) while meeting the given delay threshold. Although conceptually simple, the DPC protocol tolerates the occurrence of multiple simultaneous failures as well as any further failures that occur during recovery.
This article describes DPC and its implementation in the Borealis SPE. We show that DPC enables a distributed SPE to maintain low-latency processing at all times, while also achieving eventual consistency, where applications eventually receive the complete and correct output streams. Furthermore, we show that, independent of system size and failure location, it is possible to handle failures almost up-to the user-specified bound in a manner that meets the required availability without introducing any inconsistency.
- Abadi, D. J., Ahmad, Y., Balazinska, M., Çetintemel, U., Cherniack, M., Hwang, J.-H., Lindner, W., Maskey, A. S., Rasin, A., Ryvkina, E., Tatbul, N., Xing, Y., and Zdonik, S. 2005. The design of the Borealis stream processing engine. In Proceedings of the 2nd Biennial Conference on Innovative Data Systems Research (CIDR).Google Scholar
- Abadi, D. J., Carney, D., Çetintemel, U., Cherniack, M., Convey, C., Lee, S., Stonebraker, M., Tatbul, N., and Zdonik, S. 2003. Aurora: A new model and architecture for data stream management. VLDB J. 12, 2. Google ScholarDigital Library
- Aleri. http://www.aleri.com/index.html.Google Scholar
- Alonso, G., Günthör, R., Kamath, M., Agrawal, D., El Abbadi, A., and Mohan, C. 1995. Exotica/FMDC: Handling disconnected clients in a workflow management system. In Proceedings of the 3rd International Conference on Cooperative Information Systems.Google Scholar
- Alonso, G. and Mohan, C. 1997. WFMS: The next generation of distributed processing tools. S. Jajodia and L. Kerschberg, Eds. In Advanced Transaction Models and Architectures, Kluwer.Google Scholar
- Alonso, G., Mohan, C., Günthör, R., Agrawal, D., El Abbadi, A., and Kamath, M. 1995. Exotica/FMQM: A persistent message-based architecture for distributed workflow management. In Proceedings of IFIP WG8.1 Working Conference on Information Systems for Decentralized Organizations.Google Scholar
- Alonso, G., Reinwald, B., and Mohan, C. 1997. Distributed data management in workflow environments. In Proceedings of the 7th ACM RIDE. Google ScholarDigital Library
- Balakrishnan, H., Balazinska, M., Carney, D., Çetintemel, U., Cherniack, M., Convey, C., Galvez, E., Salz, J., Stonebraker, M., Tatbul, N., Tibbets, R., and Zdonik, S. 2004. Retrospective on Aurora. VLDB J. 13, 4. Google ScholarDigital Library
- Balazinska, M. 2006. Fault-tolerance and load management in a distributed stream processing system. Ph.D. thesis, Massachusetts Institute of Technology. Google ScholarDigital Library
- Balazinska, M., Balakrishnan, H., Madden, S., and Stonebraker, M. 2005. Fault-tolerance in the Borealis distributed stream processing system. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- Bernstein, P. A., Hsu, M., and Mann, B. 1990. Implementing recoverable requests using queues. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- Brewer, E. A. 2001. Lessons from giant-scale services. IEEE Internet Comput. 5, 4, 46--55. Google ScholarDigital Library
- Chandrasekaran, S., Cooper, O., Deshpande, A., Franklin, M. J., Hellerstein, J. M., Hong, W., Krishnamurthy, S., Madden, S., Raman, V., Reiss, F., and Shah, M. 2003. TelegraphCQ: Continuous dataflow processing for an uncertain world. In Proceedings of the 1st Biennial Conference on Innovative Data Systems Research (CIDR).Google Scholar
- Cherniack, M., Balakrishnan, H., Balazinska, M., Carney, D., Çetintemel, U., Xing, Y., and Zdonik, S. 2003. Scalable distributed stream processing. In Proceedings of the 1st Biennial Conference on Innovative Data Systems Research (CIDR).Google Scholar
- Coral8. http://coral8.com/.Google Scholar
- Cranor, C., Johnson, T., Shkapenyuk, V., and Spatscheck, O. 2003. Gigascope: A stream database for network applications. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- Elnozahy, E. N. M., Alvisi, L., Wang, Y.-M., and Johnson, D. B. 2002. A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv. 34, 3, 375--408. Google ScholarDigital Library
- Fekete, A., Gupta, D., Luchangco, V., Lynch, N., and Shvartsman, A. 1996. Eventually-serializable data services. In Proceedings of the 15th ACM Symposium on Principles of Distributed Computing (PODC'96). Google ScholarDigital Library
- Garcia-Molina, H. and Barbara, D. 1985. How to assign votes in a distributed system. J. ACM 32, 4, 841--860. Google ScholarDigital Library
- Gifford, D. K. 1979. Weighted voting for replicated data. In Proceedings of the 7th ACM Symposium on Operating Systems Principles (SOSP). Google ScholarDigital Library
- Gilbert, S. and Lynch, N. 2002. Brewer's conjecture and the feasibility of consistent, available, partition-tolerant Web services. ACM SIGACT News 33, 2. Google ScholarDigital Library
- Gray, J., Helland, P., O'Neil, P., and Shasha, D. 1996. The dangers of replication and a solution. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- Gray, J. and Reuters, A. 1993. Transaction Processing: Concepts and Techniques. Morgan Kaufmann. Google ScholarDigital Library
- Hsu, M. 1995. Special issue on workflow systems. IEEE Data Eng. Bull. 18, 1.Google Scholar
- Hwang, J.-H., Balazinska, M., Rasin, A., Çetintemel, U., Stonebraker, M., and Zdonik, S. 2005. High-availability algorithms for distributed stream processing. In Proceedings of the 21st International Conference on Data Engineering (ICDE). Google ScholarDigital Library
- Kamath, M., Alonso, G., Guenthor, R., and Mohan, C. 1996. Providing high availability in very large workflow management systems. In Proceedings of the 5th International Conference on Extending Database Technology. Google ScholarDigital Library
- Kao, B. and Garcia-Molina, H. 1995. An overview of real-time database systems. In Advances in Real-Time Systems. Prentice-Hall. Google ScholarDigital Library
- Kawell, L., Beckhardt, S., Halvorsen, T., Ozzie, R., and Greif, I. 1988. Replicated document management in a group communication system. In Proceedings of the ACM Conference on Computer-Supported Cooperative Work (CSCW). Google ScholarDigital Library
- Lam, K.-W., Son, S. H., Hung, S.-L., and Wang, Z. 2000. Scheduling transactions with stringent real-time constraints. Inform. Syst. 25, 6, 431--452. Google ScholarDigital Library
- Lomet, D. and Tuttle, M. 2003. A theory of redo recovery. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- Melliar-Smith, P. M. and Moser, L. E. 2004. Progress in real-time fault tolerance. In Proceedings of the 23rd IEEE International Symposium on Reliable Distributed Systems (SRDS'04). Google ScholarDigital Library
- Mossé, D., Melhem, R., and Ghosh, S. 2003. A nonpreemptive real-time scheduler with recovery from transient faults and its implementation. IEEE Trans. Softw. Engin. 29, 8. Google ScholarDigital Library
- Motwani, R., Widom, J., Arasu, A., Babcock, B., Babu, S., Datar, M., Manku, G., Olston, C., Rosenstein, J., and Varma, R. 2003. Query processing, approximation, and resource management in a data stream management system. In Proceedings of the 1st Biennial Conference on Innovative Data Systems Research (CIDR).Google Scholar
- Naughton et al. 2001. The Niagara Internet query system. IEEE Data Engin. Bull. 24, 2.Google Scholar
- Olston, C. 2003. Approximate replication. Ph.D. thesis, Stanford University. Google ScholarDigital Library
- Olston, C., Jiang, J., and Widom, J. 2003. Adaptive filters for continuous queries over distributed data streams. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- Radoslavov, P., Govindan, R., and Estrin, D. 2001. Topology-informed Internet replica placement. In Proceedings of the 6th International Workshop on Web Caching and Content Distribution (WCW'01).Google Scholar
- Raman, V. and Hellerstein, J. M. 2002. Partial results for online query processing. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- Saito, Y. and Shapiro, M. 2005. Optimistic replication. ACM Comput. Surv. 37, 1, 42--81. Google ScholarDigital Library
- Shah, M., Hellerstein, J., and Brewer, E. 2004. Highly-available, fault-tolerant, parallel dataflows. In Proceedings of the ACM SIGMOD International Conference on Management of Data. Google ScholarDigital Library
- Srivastava, U. and Widom, J. 2004. Flexible time management in data stream systems. In Proceedings of the 23rd ACM Symposium on Principles of Database Systems (PODS). Google ScholarDigital Library
- StreamBase. http://www.streambase.com/.Google Scholar
- Strom, R. E. 2004. Fault-tolerance in the SMILE stateful publish-subscribe system. In Proceedings of the 3rd International Workshop on Distributed Event-Based Systems (DEBS'04).Google ScholarCross Ref
- Tang, X., Chi, H., and Chanson, S. T. 2007. Optimal replica placement under TTL-based consistency. IEEE Trans. Parall. Distrib. Syst. 18, 3, 351--363. Google ScholarDigital Library
- Terry, D. B., Theimer, M., Petersen, K., Demers, A. J., Spreitzer, M., and Hauser, C. 1995. Managing update conflicts in Bayou, a weakly connected replicated storage system. In Proceedings of the 15th ACM Symposium on Operating Systems Principles (SOSP). Google ScholarDigital Library
- The NTP Project. NTP: The Network Time Protocol. http://www.ntp.org/.Google Scholar
- Tucker, P. A. and Maier, D. 2003. Dealing with disorder. In Proceedings of the Workshop on Management and Processing of Data Streams (MPDS).Google Scholar
- Urbano, R. 2003. Oracle Streams Replication Administrator's Guide, 10g Release 1 (10.1). Oracle Corporation.Google Scholar
Index Terms
- Fault-tolerance in the borealis distributed stream processing system
Recommendations
Fault-tolerant stream processing using a distributed, replicated file system
We present SGuard, a new fault-tolerance technique for distributed stream processing engines (SPEs) running in clusters of commodity servers. SGuard is less disruptive to normal stream processing and leaves more resources available for normal stream ...
Fault-tolerance in the Borealis distributed stream processing system
SIGMOD '05: Proceedings of the 2005 ACM SIGMOD international conference on Management of dataWe present a replication-based approach to fault-tolerant distributed stream processing in the face of node failures, network failures, and network partitions. Our approach aims to reduce the degree of inconsistency in the system while guaranteeing that ...
Borealis-R: a replication-transparent stream processing system for wide-area monitoring applications
SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of dataBorealis-R is a replication-based system for both fast and
highly-available processing of data streams over wide-area networks. In Borealis-R, multiple operator replicas send outputs to downstream replicas, allowing each replica to use whichever data ...
Comments