ABSTRACT
Contemporary datacenters comprise hundreds or thousands of machines running applications requiring high availability and responsiveness. Although a performance crisis is easily detected by monitoring key end-to-end performance indicators (KPIs) such as response latency or request throughput, the variety of conditions that can lead to KPI degradation makes it difficult to select appropriate recovery actions. We propose and evaluate a methodology for automatic classification and identification of crises, and in particular for detecting whether a given crisis has been seen before, so that a known solution may be immediately applied. Our approach is based on a new and efficient representation of the datacenter's state called a fingerprint, constructed by statistical selection and summarization of the hundreds of performance metrics typically collected on such systems. Our evaluation uses 4 months of trouble-ticket data from a production datacenter with hundreds of machines running a 24x7 enterprise-class user-facing application. In experiments in a realistic and rigorous operational setting, our approach provides operators the information necessary to initiate recovery actions with 80% correctness in an average of 10 minutes, which is 50 minutes earlier than the deadline provided to us by the operators. To the best of our knowledge this is the first rigorous evaluation of any such approach on a large-scale production installation.
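The fingerprint idea above can be illustrated with a minimal sketch. Everything specific in it is an assumption made for illustration and is not stated in the abstract: the three summary quantiles, the three-level discretization against a per-entry "normal" band, the Euclidean distance, and all function and parameter names (`summarize`, `discretize`, `identify`, `threshold`). The statistical selection of relevant metrics is assumed to have happened already, so `per_machine` holds only the selected metrics.

```python
import math
from typing import Dict, List, Optional, Sequence

# Assumed summary quantiles; the abstract does not name any.
QUANTILES = (0.25, 0.50, 0.95)


def quantile(values: Sequence[float], q: float) -> float:
    """Linearly interpolated quantile of a non-empty sequence."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def summarize(per_machine: Dict[str, List[float]]) -> List[float]:
    """Collapse each metric's per-machine readings into a few quantiles."""
    summary: List[float] = []
    for metric in sorted(per_machine):  # fixed order keeps vectors comparable
        summary.extend(quantile(per_machine[metric], q) for q in QUANTILES)
    return summary


def discretize(summary: Sequence[float],
               low: Sequence[float],
               high: Sequence[float]) -> List[int]:
    """Map each entry to -1 (below its normal band), 0 (inside), or +1 (above)."""
    return [-1 if v < lo else 1 if v > hi else 0
            for v, lo, hi in zip(summary, low, high)]


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two fingerprints of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def identify(current: Sequence[int],
             known: Dict[str, Sequence[int]],
             threshold: float) -> Optional[str]:
    """Return the label of the nearest known crisis, or None if none is close enough."""
    best_label, best_dist = None, float("inf")
    for label, fingerprint in known.items():
        d = distance(current, fingerprint)
        if d < best_dist:
            best_label, best_dist = label, d
    return best_label if best_dist <= threshold else None
```

In this sketch, a return value of `None` from `identify` stands for a crisis that has not been seen before, while a label points operators to a previously recorded crisis whose known remedy could be tried first. How the normal bands and the threshold would be chosen is left open here.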
Index Terms
- Fingerprinting the datacenter: automated classification of performance crises