DOI: 10.1145/1921168.1921175
research-article

G-RCA: a generic root cause analysis platform for service quality management in large IP networks

Published: 30 November 2010

ABSTRACT

As IP networks have become the mainstay of an increasingly diverse set of applications, ranging from Internet games and streaming video to e-commerce and online banking, and even to mission-critical 911 service, best-effort service is no longer acceptable. This requires a transformation in network management, from detecting and replacing individual faulty network elements to managing service quality as a whole.

In this paper we describe the design and development of a Generic Root Cause Analysis platform (G-RCA) for service quality management (SQM) in large IP networks. G-RCA contains a comprehensive service dependency model that includes network topological and cross-layer relationships, protocol interactions, and control plane dependencies. G-RCA abstracts the RCA process into three parts: signature identification for symptom and diagnostic events, temporal and spatial event correlation, and reasoning and inference logic. G-RCA provides a flexible rule specification language that allows operators to quickly customize G-RCA into different RCA tools as new problems need to be investigated. G-RCA is also integrated with data trending, manual data exploration, and statistical correlation mining capabilities. G-RCA has proven to be a highly effective SQM platform in several different applications, and we present results on BGP flaps, PIM flaps in Multicast VPN service, and end-to-end throughput drops in CDN service.
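The three-part abstraction in the abstract (event signatures, temporal and spatial correlation, and reasoning logic) can be illustrated with a small sketch. This is not G-RCA's implementation or its rule language. The event names, the `DEPENDS_ON` map, the time margins, and the priority-based tie-break are all hypothetical, chosen only to show how a symptom event might be joined with diagnostic events through a dependency model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    name: str      # event signature, e.g. "bgp_flap" (hypothetical)
    location: str  # network element where the event was observed
    start: float   # start time in seconds
    end: float     # end time in seconds


# Hypothetical dependency model: each location maps to the
# locations it directly depends on.
DEPENDS_ON = {
    "router-a:bgp-session-1": {"router-a:link-1", "router-a"},
    "router-a:link-1": {"router-a"},
}


def spatially_related(symptom_loc, diag_loc, deps):
    """Return True if diag_loc is symptom_loc or one of its dependencies."""
    seen, stack = set(), [symptom_loc]
    while stack:
        loc = stack.pop()
        if loc == diag_loc:
            return True
        if loc in seen:
            continue
        seen.add(loc)
        stack.extend(deps.get(loc, ()))
    return False


def temporally_related(symptom, diag, before=60.0, after=5.0):
    """Return True if diag overlaps the symptom interval widened by margins."""
    window_start = symptom.start - before
    window_end = symptom.end + after
    return diag.start <= window_end and diag.end >= window_start


def diagnose(symptom, diagnostics, deps, priorities):
    """Pick the highest-priority diagnostic event correlated with the symptom.

    The priority rule is an assumed stand-in for reasoning logic.
    """
    candidates = [
        d for d in diagnostics
        if d.name in priorities
        and temporally_related(symptom, d)
        and spatially_related(symptom.location, d.location, deps)
    ]
    if not candidates:
        return "unknown"
    return max(candidates, key=lambda d: priorities[d.name]).name


symptom = Event("bgp_flap", "router-a:bgp-session-1", 100.0, 101.0)
diagnostics = [
    Event("link_down", "router-a:link-1", 95.0, 99.0),
    Event("cpu_high", "router-a", 90.0, 110.0),
    Event("link_down", "router-b:link-9", 95.0, 99.0),  # unrelated element
]
priorities = {"link_down": 2, "cpu_high": 1}
print(diagnose(symptom, diagnostics, DEPENDS_ON, priorities))
```

In this sketch the link failure on `router-b` is discarded by the spatial check, while the two events on `router-a` both correlate with the symptom and the assumed priority table decides between them.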

Published in

Co-NEXT '10: Proceedings of the 6th International COnference
November 2010
349 pages
ISBN: 9781450304481
DOI: 10.1145/1921168

Copyright © 2010 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

Overall acceptance rate: 198 of 789 submissions, 25%
