
MillWheel: fault-tolerant stream processing at internet scale

Published: 01 August 2013

Abstract

MillWheel is a framework for building low-latency data-processing applications that is widely used at Google. Users specify a directed computation graph and application code for individual nodes, and the system manages persistent state and the continuous flow of records, all within the envelope of the framework's fault-tolerance guarantees.

This paper describes MillWheel's programming model as well as its implementation. The case study of a continuous anomaly detector in use at Google serves to motivate how many of MillWheel's features are used. MillWheel's programming model provides a notion of logical time, making it simple to write time-based aggregations. MillWheel was designed from the outset with fault tolerance and scalability in mind. In practice, we find that MillWheel's unique combination of scalability, fault tolerance, and a versatile programming model lends itself to a wide variety of problems at Google.
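To make the programming model concrete: a user writes per-node code that reacts to incoming records and to timers that fire as logical time advances. MillWheel itself exposes a C++ interface; the Python sketch below is only an illustration of that style of node, and every name in it (Record, WindowedCounter, process_record, process_timer) is hypothetical rather than part of the framework's actual API.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Record:
    key: str          # e.g. a search query
    timestamp: int    # logical (event) time, in seconds
    value: int = 1

class WindowedCounter:
    """Illustrative node: counts records per key in one-minute windows.

    The framework (not shown) would call process_record for each incoming
    record and process_timer once the low watermark passes a window's end,
    guaranteeing that all earlier-timestamped records have been seen.
    """

    def __init__(self, emit):
        self.counts = defaultdict(int)   # stands in for persistent state
        self.emit = emit                 # downstream production callback

    def window_of(self, timestamp: int) -> int:
        return timestamp - (timestamp % 60)

    def process_record(self, record: Record) -> None:
        self.counts[(record.key, self.window_of(record.timestamp))] += record.value

    def process_timer(self, window_end: int) -> None:
        # Fired when logical time (the low watermark) passes window_end:
        # the window is complete and can safely be emitted downstream.
        for (key, window), count in list(self.counts.items()):
            if window + 60 <= window_end:
                self.emit(key, window, count)
                del self.counts[(key, window)]

# Tiny driver standing in for the framework's record delivery and timers.
if __name__ == "__main__":
    node = WindowedCounter(emit=lambda k, w, c: print(f"{k} @ {w}: {c}"))
    for ts in (3, 10, 59, 61):
        node.process_record(Record(key="flu symptoms", timestamp=ts))
    node.process_timer(window_end=60)   # emits the [0, 60) window
```

In the real system the counter's state would live in the framework-managed persistent store and the timer would be registered through the framework, so a restarted node could resume from its last checkpoint.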



        Reviews

        Peng Li

In this paper, the authors introduce MillWheel, a framework for processing real-time streaming data that is used at Google. It follows the streaming-computation paradigm: users specify a computation graph and provide code for the individual nodes. Once the computation starts, the application receives continuous data and processes it as it arrives in order to keep latency low. MillWheel targets latency-sensitive applications, such as detecting spikes or dips in search queries and delivering ads.

The MillWheel model features persistent storage, low watermarks, and duplicate prevention. Persistent storage holds the state of each node, so that if a node fails the system can resume the computation, without losing data, by reading the stored state. The low-watermark subsystem tracks the timestamps of the data processed at each node, providing a basis for fault tolerance and duplicate prevention. With these features, MillWheel processes continuous streams of records carrying keys and timestamps. A user who wants to trigger work at a future time can register it with the optional timer mechanism.

Fault tolerance is a key feature of MillWheel. Because the framework may run continuously on thousands of nodes, the chance of node failure is high. When a node fails, results it has already produced must be preserved; if the node keeps state, that state must be restored so that later computation remains correct; and once the node restarts, duplicated results must be prevented. To achieve these goals, MillWheel provides exactly-once delivery with the help of persistent storage and record acknowledgments. For computations whose repeated execution on the same input may yield different results, MillWheel offers "strong productions," which checkpoint produced records before they are sent downstream. This checkpointing, however, can be unnecessary overhead for idempotent computations (those in which the same input always yields the same output); for these, MillWheel lets users turn off strong productions and rely on weak productions to improve performance.

Computation state is stored both on disk and in memory: disk provides room for large amounts of state, while memory provides speed. To ensure consistency, all state writes are wrapped in per-key atomic operations. Work migrations and node restarts, however, can leave behind zombie writers and network remnants that issue stale writes; to guard against these, every write carries a unique sequencer token, and writes bearing a stale token are rejected (a sketch of this check appears below). Users can also tune the granularity of state updates for performance, according to the expected likelihood of failures.
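The sequencer mechanism just described can be illustrated with a minimal sketch. Nothing below is MillWheel's actual code or API; the store, the acquire/write methods, and the token scheme are hypothetical stand-ins for the per-key ownership and stale-write rejection the review describes.

```python
import itertools

class SequencedStore:
    """Toy per-key store that accepts a write only from the current owner.

    A real system would persist state durably (e.g. in a Bigtable-like store)
    and bundle the token check with the write in one atomic operation; this
    sketch only shows the invalidation logic.
    """

    def __init__(self):
        self._state = {}       # key -> value
        self._owner_seq = {}   # key -> sequencer token of the current owner
        self._next_seq = itertools.count(1)

    def acquire(self, key: str) -> int:
        """Called when a worker takes ownership of a key (e.g. after migration).
        Issuing a new token implicitly invalidates the previous owner's token."""
        token = next(self._next_seq)
        self._owner_seq[key] = token
        return token

    def write(self, key: str, value, token: int) -> bool:
        """Apply the write only if the caller's token is still current."""
        if self._owner_seq.get(key) != token:
            return False       # stale (zombie) writer: reject
        self._state[key] = value
        return True

store = SequencedStore()
old = store.acquire("query:flu")      # original worker owns the key
new = store.acquire("query:flu")      # key migrates to a new worker
assert store.write("query:flu", {"count": 7}, new) is True
assert store.write("query:flu", {"count": 3}, old) is False  # zombie write rejected
print(store._state)                    # {'query:flu': {'count': 7}}
```

The essential property is that ownership changes mint a fresh token, so any write still in flight from a superseded worker fails the token check instead of silently corrupting state.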
MillWheel has been implemented on Google's clusters. Streams are delivered via remote procedure calls, load distribution is handled by a replicated master, and persistent state is kept in a database such as Bigtable or Spanner. Low watermarks are computed conservatively by a central authority, and each interested consumer derives its own low watermark from its own records and the values reported by the senders it subscribes to (a small sketch of this aggregation follows the review). Experiments show that with weak productions the median record delay at a node is a few milliseconds; with strong productions and exactly-once delivery enabled, the median delay rises to tens of milliseconds, which is still within human reaction time. A framework-level cache is also used to reduce traffic to the storage layer.

MillWheel is used in various Google-internal systems, such as ad delivery and image processing for Google Street View. The authors point out, however, that some applications are not a good fit: if an application cannot be parallelized well across keys, bottleneck stages can slow down the whole computation.

Online Computing Reviews Service
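As a complement to the review's note on low-watermark computation, here is a minimal sketch of the consumer-side folding step. The idea (a computation's watermark is the minimum of its own oldest pending work and its senders' watermarks) follows the paper's description, but the function and its arguments are illustrative only, not MillWheel code.

```python
def low_watermark(oldest_pending_work, sender_watermarks):
    """Conservative low watermark for one computation.

    oldest_pending_work: timestamp of the oldest record not yet fully
        processed by this computation, or None if it has no pending work.
    sender_watermarks: low-watermark values reported (via the central
        authority in the real system) by every upstream computation.

    Taking the minimum over both ensures the watermark never runs ahead
    of data that is still in flight upstream.
    """
    candidates = list(sender_watermarks)
    if oldest_pending_work is not None:
        candidates.append(oldest_pending_work)
    if not candidates:
        raise ValueError("a source computation must supply its own watermark")
    return min(candidates)

# Example: upstream stages report watermarks 1050 and 1072, and this stage
# still holds an unprocessed record with timestamp 1060.
print(low_watermark(1060, [1050, 1072]))   # -> 1050
```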


• Published in
  Proceedings of the VLDB Endowment, Volume 6, Issue 11 (August 2013), 237 pages
  Publisher: VLDB Endowment
• Published: 1 August 2013
