Fingerprinting the datacenter: automated classification of performance crises

Authors:
Peter Bodik, UC Berkeley, Berkeley, CA, USA
Moises Goldszmidt, Microsoft Research, Mountain View, CA, USA
Armando Fox, UC Berkeley, Berkeley, CA, USA
Dawn B. Woodard, Cornell University, Ithaca, NY, USA
Hans Andersen, Microsoft, Redmond, WA, USA

EuroSys '10: Proceedings of the 5th European conference on Computer systems, April 2010, Pages 111–124
https://doi.org/10.1145/1755913.1755926

Published: 13 April 2010

ABSTRACT

Contemporary datacenters comprise hundreds or thousands of machines running applications requiring high availability and responsiveness. Although a performance crisis is easily detected by monitoring key end-to-end performance indicators (KPIs) such as response latency or request throughput, the variety of conditions that can lead to KPI degradation makes it difficult to select appropriate recovery actions. We propose and evaluate a methodology for automatic classification and identification of crises, and in particular for detecting whether a given crisis has been seen before, so that a known solution may be immediately applied. Our approach is based on a new and efficient representation of the datacenter's state called a fingerprint, constructed by statistical selection and summarization of the hundreds of performance metrics typically collected on such systems. Our evaluation uses 4 months of trouble-ticket data from a production datacenter with hundreds of machines running a 24x7 enterprise-class user-facing application. In experiments in a realistic and rigorous operational setting, our approach provides operators the information necessary to initiate recovery actions with 80% correctness in an average of 10 minutes, which is 50 minutes earlier than the deadline provided to us by the operators. To the best of our knowledge this is the first rigorous evaluation of any such approach on a large-scale production installation.
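
The abstract describes a fingerprint as a statistical summary of many performance metrics that lets a new crisis be matched against previously seen ones, but it does not give the construction details. The following is only a minimal sketch under stated assumptions, not the authors' implementation: it assumes each metric is summarized across machines by a few quantiles, each summary is marked hot, normal, or cold relative to cutoffs taken from crisis-free history, and two fingerprints are compared by Euclidean distance. The function names, the quantile choices, the 2nd/98th percentile cutoffs, and the `max_distance` parameter are all hypothetical.

```python
import numpy as np

# Assumed summary quantiles across machines (not stated in the abstract).
QUANTILES = (25, 50, 95)


def summarize(epoch):
    """Summarize one epoch: array (machines, metrics) -> flat vector of
    len(QUANTILES) * metrics quantile values."""
    return np.percentile(epoch, QUANTILES, axis=0).ravel()


def thresholds(normal_epochs, low=2, high=98):
    """Derive cold/hot cutoffs for every summary value from crisis-free epochs.
    The 2nd/98th percentile defaults are an assumption of this sketch."""
    summaries = np.array([summarize(e) for e in normal_epochs])
    cold = np.percentile(summaries, low, axis=0)
    hot = np.percentile(summaries, high, axis=0)
    return cold, hot


def fingerprint(epochs, cold, hot):
    """Average the per-epoch states over a crisis window.
    Each state is -1 (cold), 0 (normal), or +1 (hot)."""
    states = []
    for epoch in epochs:
        s = summarize(epoch)
        states.append(np.where(s > hot, 1, np.where(s < cold, -1, 0)))
    return np.mean(states, axis=0)


def identify(new_fp, known, max_distance):
    """Return the label of the closest known fingerprint, or None when no
    known crisis lies within max_distance (treated as a new crisis)."""
    best_label, best_dist = None, float("inf")
    for label, fp in known.items():
        dist = float(np.linalg.norm(new_fp - fp))
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label if best_dist <= max_distance else None
```

In use, `thresholds` would be fit once on historical crisis-free data, `fingerprint` computed over the first few epochs of an ongoing crisis, and `identify` called against a dictionary of labeled past crises. Returning `None` for a distant fingerprint reflects the abstract's goal of detecting whether a crisis has been seen before.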

Index Terms

Fingerprinting the datacenter: automated classification of performance crises
1. Computer systems organization
  1. Dependable and fault-tolerant systems and networks
2. General and reference
  1. Cross-computing tools and techniques
    1. Reliability

Published in
EuroSys '10: Proceedings of the 5th European conference on Computer systems
April 2010, 388 pages
ISBN: 9781605585772
DOI: 10.1145/1755913
General Chair: Christine Morin, INRIA Rennes, France
Program Chair: Gilles Muller, INRIA/LIP6, France
Copyright © 2010 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
datacenters
performance
web applications
Qualifiers
- research-article
Acceptance Rates
Overall acceptance rate: 241 of 1,308 submissions, 18%