DOI: 10.1145/2834050.2834099

Destroying networks for fun (and profit)

Published: 16 November 2015

ABSTRACT

Network failures are inevitable. Interfaces go down, devices crash, and resources become exhausted. It is the responsibility of the control software to provide reliable services on top of unreliable components and throughout unpredictable events. Guaranteeing the correctness of the controller under all types of failures is therefore essential for network operations. Yet this is also an almost impossible task, owing to the complexity of the control software and the underlying network, and to the lack of precision in simulation tools.

Instead, we argue that testing network control software should follow in the footsteps of large-scale distributed systems, such as those of Netflix or Google, which deliberately induce live failures in their production environments during working hours and analyze how their control software reacts.

In this paper, we describe Armageddon, a framework for introducing sustainable and systematic chaos in networks. When we cause failures, we do so without violating some operator-specified network invariants (e.g., end-to-end connectivity). The injected failures also guarantee some notion of coverage. If the controller can sustain all of the failures, then it can be considered resilient with a high degree of confidence. We describe efficient algorithms to compute failure scenarios and implement them in a prototype. Applied to real-world networks, our algorithms achieve a coverage of 80% of the links within only three iterations of failures.
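
To make the idea of invariant-preserving failures with link coverage concrete, here is a minimal sketch in Python. It is not Armageddon's algorithm, which this page does not describe. It assumes a connected, undirected topology, takes connectivity as the only invariant, and measures coverage as the fraction of links failed at least once. The names `spanning_tree` and `failure_rounds` are hypothetical. In each round the sketch builds a spanning tree that prefers links already covered, then fails every uncovered link outside that tree, so the surviving links always contain a spanning tree.

```python
from typing import Dict, Iterable, List, Sequence, Set, Tuple

Link = Tuple[str, str]  # undirected link, stored as an ordered pair


def _find(parent: Dict[str, str], node: str) -> str:
    # Union-find lookup with path halving.
    while parent[node] != node:
        parent[node] = parent[parent[node]]
        node = parent[node]
    return node


def spanning_tree(nodes: Iterable[str], links: Sequence[Link],
                  preferred: Set[Link]) -> Set[Link]:
    """Kruskal-style spanning tree that tries `preferred` links first."""
    parent = {n: n for n in nodes}
    tree: Set[Link] = set()
    # Stable sort: preferred links (key False) come before the others.
    for u, v in sorted(links, key=lambda link: link not in preferred):
        root_u, root_v = _find(parent, u), _find(parent, v)
        if root_u != root_v:
            parent[root_u] = root_v
            tree.add((u, v))
    return tree


def failure_rounds(nodes: Sequence[str], links: Sequence[Link],
                   max_rounds: int = 10) -> List[Set[Link]]:
    """Return one set of links to fail per round (assumed invariant: connectivity)."""
    covered: Set[Link] = set()
    rounds: List[Set[Link]] = []
    for _ in range(max_rounds):
        # Keeping already-covered links in the tree frees uncovered ones to fail.
        tree = spanning_tree(nodes, links, preferred=covered)
        to_fail = {l for l in links if l not in tree and l not in covered}
        if not to_fail:
            break  # every remaining uncovered link is needed for connectivity
        rounds.append(to_fail)
        covered |= to_fail
    return rounds


# Hypothetical topology: a triangle a-b-c with a pendant node d.
nodes = ["a", "b", "c", "d"]
links = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
rounds = failure_rounds(nodes, links)
coverage = len(set().union(*rounds)) / len(links)
print(rounds, coverage)
```

On this toy topology the sketch fails one triangle link per round and never fails the link to `d`, since removing it would disconnect the network. Preferring covered links when building the tree is a design choice of this sketch: it pushes uncovered links out of the tree so that they become eligible to fail in the current round.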



Published in

HotNets-XIV: Proceedings of the 14th ACM Workshop on Hot Topics in Networks
November 2015
189 pages
ISBN: 9781450340472
DOI: 10.1145/2834050

Copyright © 2015 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States


Qualifiers

• research-article
• Research
• Refereed limited

Acceptance Rates

Overall Acceptance Rate: 110 of 460 submissions, 24%
