ABSTRACT
As IP networks have become the mainstay of an increasingly diverse set of applications, ranging from Internet games and streaming video to e-commerce and online banking, and even to mission-critical 911 service, best-effort service is no longer acceptable. This requires a transformation in network management, from detecting and replacing individual faulty network elements to managing service quality as a whole.
In this paper we describe the design and development of a Generic Root Cause Analysis platform (G-RCA) for service quality management (SQM) in large IP networks. G-RCA contains a comprehensive service dependency model that includes network topological and cross-layer relationships, protocol interactions, and control plane dependencies. G-RCA abstracts the RCA process into signature identification for symptom and diagnostic events, temporal and spatial event correlation, and reasoning and inference logic. G-RCA provides a flexible rule specification language that allows operators to quickly customize G-RCA into different RCA tools as new problems need to be investigated. G-RCA is also integrated with data trending, manual data exploration, and statistical correlation mining capabilities. G-RCA has proven to be a highly effective SQM platform in several different applications, and we present results for BGP flaps, PIM flaps in Multicast VPN service, and end-to-end throughput drops in CDN service.
Index Terms
- G-RCA: a generic root cause analysis platform for service quality management in large IP networks