ABSTRACT
Network failures are inevitable. Interfaces go down, devices crash and resources become exhausted. It is the responsibility of the control software to provide reliable services on top of unreliable components and throughout unpredictable events. Guaranteeing the correctness of the controller under all types of failures is therefore essential for network operations. Yet, this is also an almost impossible task due to the complexity of the control software, the underlying network, and the lack of precision in simulation tools.
Instead, we argue that testing network control software should follow in the footsteps of large-scale distributed systems, such as those of Netflix or Google, which deliberately induce live failures in their production environments during working hours and analyze how their control software reacts.
In this paper, we describe Armageddon, a framework for introducing sustainable and systematic chaos in networks. When we cause failures, we do so without violating operator-specified network invariants (e.g., end-to-end connectivity). The injected failures also guarantee some notion of coverage. If the controller can sustain all of the failures, then it can be considered resilient with a high degree of confidence. We describe efficient algorithms to compute failure scenarios and implement them in a prototype. Applied to real-world networks, our algorithms achieve a coverage of 80% of the links within only three iterations of failures.
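To make the idea of invariant-preserving failure rounds concrete, the following is a minimal sketch and not the algorithm from the paper. It assumes the invariant is plain end-to-end connectivity and that coverage means the fraction of links failed at least once. In each round it builds a spanning tree that prefers links already covered, then fails the not-yet-covered links outside that tree, so the surviving links always contain a spanning tree. All names (`spanning_tree`, `failure_rounds`, `is_connected`) and the example topology are hypothetical.

```python
from collections import deque


def spanning_tree(nodes, links, prefer):
    """Kruskal-style spanning tree; links in `prefer` are tried first."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = set()
    # Preferred links sort first because False < True.
    for link in sorted(links, key=lambda l: (l not in prefer, l)):
        ru, rv = find(link[0]), find(link[1])
        if ru != rv:
            parent[ru] = rv
            tree.add(link)
    return tree


def is_connected(nodes, links):
    """Breadth-first search check that every node is reachable."""
    nodes = list(nodes)
    adjacency = {n: set() for n in nodes}
    for u, v in links:
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, queue = {nodes[0]}, deque([nodes[0]])
    while queue:
        for neighbor in adjacency[queue.popleft()] - seen:
            seen.add(neighbor)
            queue.append(neighbor)
    return len(seen) == len(nodes)


def failure_rounds(nodes, links, max_rounds=3):
    """Return (rounds, coverage): link sets to fail per round, fraction covered."""
    links = {tuple(sorted(l)) for l in links}
    covered, rounds = set(), []
    for _ in range(max_rounds):
        # Keeping covered links in the tree pushes uncovered links outside it.
        tree = spanning_tree(nodes, links, prefer=covered)
        to_fail = links - tree - covered
        if not to_fail:
            break  # only bridges (or nothing) remain uncovered
        rounds.append(to_fail)
        covered |= to_fail
    return rounds, len(covered) / len(links)


# Hypothetical topology: a 4-cycle with one chord, plus a pendant bridge d-e.
NODES = ["a", "b", "c", "d", "e"]
LINKS = [("a", "b"), ("b", "c"), ("c", "d"), ("a", "d"), ("a", "c"), ("d", "e")]

rounds, coverage = failure_rounds(NODES, LINKS)
for i, failed in enumerate(rounds, start=1):
    print(f"round {i}: fail {sorted(failed)}")
print(f"coverage: {coverage:.2f}")
```

In this sketch a bridge such as `d-e` is never failed, because removing it would always break connectivity, so full coverage is unreachable on topologies that contain bridges. Richer invariants (for example, capacity or path-length bounds) would need a different feasibility check than a spanning tree.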