DOI: 10.1145/1755913.1755926
research-article

Fingerprinting the datacenter: automated classification of performance crises

Published: 13 April 2010

ABSTRACT

Contemporary datacenters comprise hundreds or thousands of machines running applications requiring high availability and responsiveness. Although a performance crisis is easily detected by monitoring key end-to-end performance indicators (KPIs) such as response latency or request throughput, the variety of conditions that can lead to KPI degradation makes it difficult to select appropriate recovery actions. We propose and evaluate a methodology for automatic classification and identification of crises, and in particular for detecting whether a given crisis has been seen before, so that a known solution may be immediately applied. Our approach is based on a new and efficient representation of the datacenter's state called a fingerprint, constructed by statistical selection and summarization of the hundreds of performance metrics typically collected on such systems. Our evaluation uses 4 months of trouble-ticket data from a production datacenter with hundreds of machines running a 24x7 enterprise-class user-facing application. In experiments in a realistic and rigorous operational setting, our approach provides operators the information necessary to initiate recovery actions with 80% correctness in an average of 10 minutes, which is 50 minutes earlier than the deadline provided to us by the operators. To the best of our knowledge this is the first rigorous evaluation of any such approach on a large-scale production installation.
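The abstract does not spell out how a fingerprint is built or compared, so the following is only a minimal sketch of the general idea under stated assumptions: each metric is summarized across machines by its median, the summary is discretized against a "normal" band, and a new crisis is matched to the nearest stored fingerprint if it is close enough. The function names, the metric names, the bands, and the distance threshold are all hypothetical and are not taken from the paper.

```python
import math
from statistics import median

def fingerprint(snapshot, normal_bands):
    """Build a discretized state vector (assumed scheme, not the paper's).

    snapshot:     {metric name: [one value per machine]}
    normal_bands: {metric name: (low, high)} range considered normal
    Each metric becomes -1 (below band), 0 (inside), or +1 (above).
    """
    fp = []
    for metric in sorted(normal_bands):          # fixed metric order
        m = median(snapshot[metric])             # summarize across machines
        low, high = normal_bands[metric]
        fp.append(-1 if m < low else 1 if m > high else 0)
    return tuple(fp)

def distance(a, b):
    """Euclidean distance between two fingerprints of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def identify(fp, known, threshold):
    """Return the label of the closest known crisis, or None if none is
    within `threshold` (i.e. the crisis looks previously unseen)."""
    best_label, best_dist = None, float("inf")
    for label, past in known.items():
        d = distance(fp, past)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= threshold else None

# Hypothetical data for illustration only.
bands = {"cpu": (10, 80), "latency_ms": (0, 200), "queue_len": (0, 50)}
now = {"cpu": [95, 90, 85], "latency_ms": [500, 450, 100], "queue_len": [10, 5, 20]}
known = {"overloaded front end": (1, 1, 0), "database stall": (0, 1, 1)}
print(identify(fingerprint(now, bands), known, threshold=1.0))
```

In this sketch the threshold controls the trade-off the abstract alludes to: a tight threshold reports more crises as new, while a loose one risks applying a known remedy to a crisis that only resembles an earlier one.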


Published in

EuroSys '10: Proceedings of the 5th European conference on Computer systems
April 2010
388 pages
ISBN: 9781605585772
DOI: 10.1145/1755913

Copyright © 2010 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 241 of 1,308 submissions (18%)
