research-article

Packet-Level Telemetry in Large Datacenter Networks

Published: 17 August 2015

Abstract

Debugging faults in complex networks often requires capturing and analyzing traffic at the packet level. For this task, datacenter networks (DCNs) present unique challenges because of their scale, traffic volume, and diversity of faults. To troubleshoot faults in a timely manner, DCN administrators must a) identify affected packets within a large volume of traffic; b) track them across multiple network components; c) analyze traffic traces for fault patterns; and d) test or confirm potential causes. To our knowledge, no tool today achieves both the specificity and the scale required for this task.

We present Everflow, a packet-level network telemetry system for large DCNs. Everflow traces specific packets by implementing a powerful packet filter on top of the "match and mirror" functionality of commodity switches. It shuffles captured packets to multiple analysis servers using load balancers built on switch ASICs, and it sends "guided probes" to test or confirm potential faults. We present experiments that demonstrate Everflow's scalability, and we share experiences of troubleshooting network faults gathered from running it for over 6 months in Microsoft's DCNs.
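
To make the "match and mirror" and shuffling ideas concrete, the following is a minimal software sketch, not Everflow's implementation (which the abstract says runs on switch hardware). Everything in it is an illustrative assumption: the `Packet` fields, the rule that mirrors TCP packets carrying SYN, FIN, or RST flags, and the CRC32-based choice of analysis server are hypothetical stand-ins for whatever rules and load-balancing scheme the real system uses.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    """Hypothetical packet summary: a 5-tuple plus TCP flags."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    proto: str
    tcp_flags: frozenset = frozenset()


def matches_mirror_rule(pkt):
    """Assumed rule: mirror TCP packets that open or close a connection."""
    return pkt.proto == "tcp" and bool(pkt.tcp_flags & {"SYN", "FIN", "RST"})


def pick_analyzer(pkt, num_analyzers):
    """Hash the 5-tuple so every mirrored packet of a flow reaches one analyzer."""
    key = f"{pkt.src_ip}|{pkt.dst_ip}|{pkt.src_port}|{pkt.dst_port}|{pkt.proto}"
    return zlib.crc32(key.encode()) % num_analyzers


def mirror_and_shuffle(packets, num_analyzers):
    """Filter packets with the mirror rule, then spread them across analyzers."""
    buckets = {i: [] for i in range(num_analyzers)}
    for pkt in packets:
        if matches_mirror_rule(pkt):
            buckets[pick_analyzer(pkt, num_analyzers)].append(pkt)
    return buckets
```

Hashing on the flow identifier rather than picking a server at random keeps all mirrored packets of one flow together, which is what lets a single analyzer reason about that flow without consulting the others.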

    • Published in

      ACM SIGCOMM Computer Communication Review, Volume 45, Issue 4
      SIGCOMM '15
      October 2015
      659 pages
      ISSN: 0146-4833
      DOI: 10.1145/2829988

      SIGCOMM '15: Proceedings of the 2015 ACM Conference on Special Interest Group on Data Communication
      August 2015
      684 pages
      ISBN: 9781450335423
      DOI: 10.1145/2785956

      Copyright © 2015 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States
