
Performance Anomaly Detection and Bottleneck Identification

Published: 22 July 2015

Abstract

To meet stringent performance requirements, system administrators must effectively detect undesirable performance behaviours, identify potential root causes, and take adequate corrective measures. The problem of uncovering and understanding performance anomalies and their causes (bottlenecks) in different system and application domains is well studied. To assess progress and research trends and to identify open challenges, we have reviewed major contributions in the area and present our findings in this survey. Our approach provides an overview of anomaly detection and bottleneck identification research as it relates to the performance of computing systems. By identifying fundamental elements of the problem, we are able to categorize existing solutions based on multiple factors such as the detection goals, nature of applications and systems, system observability, and detection methods.

References

  1. Sandip Agarwala, Fernando Alegre, Karsten Schwan, and Jegannathan Mehalingham. 2007. E2EProf: Automated end-to-end performance management for enterprise systems. In Proceedings of the 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’07). IEEE, 749--758. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Marcos K. Aguilera, Jeffrey C. Mogul, Janet L. Wiener, Patrick Reynolds, and Athicha Muthitacharoen. 2003. Performance debugging for distributed systems of black boxes. ACM SIGOPS Operating Systems Review 37, 74--89. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. E. Alpaydin. 2014. Introduction to Machine Learning. MIT Press. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Paul Barham, Rebecca Isaacs, Richard Mortier, and Dushyanth Narayanan. 2003. Magpie: Online modelling and performance-aware systems. In Proceedings of the 9th Workshop on Hot Topics in Operating Systems (HotOS IX). 85--90. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Roberto Battiti. 1994. Using mutual information for selecting features in supervised neural net learning. IEEE Transactions on Neural Networks 5, 4, 537--550. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Muli Ben-Yehuda, David Breitgand, Michael Factor, Hillel Kolodner, Valentin Kravtsov, and Dan Pelleg. 2009. NAP: A building block for remediating performance bottlenecks via black box network analysis. In Proceedings of the 6th International Conference on Autonomic Computing. ACM, 179--188. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Frank M. Bereznay and Kaiser Permanente. 2006. Did something change? using statistical techniques to interpret service and resource metrics. In Proceedings of the International CMG Conference. 229--242.Google ScholarGoogle Scholar
  8. Pavel Berkhin. 2006. A survey of clustering data mining techniques. In Grouping Multidimensional Data. Springer, 25--71.Google ScholarGoogle Scholar
  9. Kanishka Bhaduri, Kamalika Das, and Bryan L. Matthews. 2011. Detecting abnormal machine characteristics in cloud infrastructures. In Proceedings of the IEEE 11th International Conference on Data Mining Workshops (ICDMW’11). IEEE, 137--144. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Walter Binder, Jarle Hulaas, and Philippe Moret. 2007. Advanced java bytecode instrumentation. In Proceedings of the 5th International Symposium on Principles and Practice of Programming in Java. ACM, 135--144. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Peter Bodík, Moises Goldszmidt, and Armando Fox. 2008. HiLighter: Automatically building robust signatures of performance behavior for small-and large-scale systems. In SysML. USENIX Association.Google ScholarGoogle Scholar
  12. Peter Bodik, Moises Goldszmidt, Armando Fox, Dawn B. Woodard, and Hans Andersen. 2010. Fingerprinting the datacenter: Automated classification of performance crises. In Proceedings of the 5th European Conference on Computer Systems. ACM, 111--124. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. George E. P. Box and George C. Tiao. 1975. Intervention analysis with applications to economic and environmental problems. J. Amer. Statist. Assoc. 70, 349, 70--79.Google ScholarGoogle ScholarCross RefCross Ref
  14. John S. Breese and Russ Blake. 1995. Automating computer bottleneck detection with belief nets. In Proceedings of the 11th Conference on Uncertainty in Artificial Intelligence. Morgan Kaufmann, 36--45. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. Jack Brey and Rick Sironi. 1990. Managing at the knee of the curve (The use of SPC in managing a data center). In Proceedings of the International CMG Conference. 895--901.Google ScholarGoogle Scholar
  16. Shaun Burke. 2001. Missing values, outliers, robust statistics & non-parametric methods. LC-GC Europe Online Supplement, Statistics & Data Analysis 2, 19--24.Google ScholarGoogle Scholar
  17. Rajkumar Buyya, Rodrigo N. Calheiros, and Xiaorong Li. 2012. Autonomic cloud computing: Open challenges and architectural elements. In Proceedings of the 3rd International Conference on Emerging Applications of Information Technology (EAIT’12). IEEE, 3--10.Google ScholarGoogle ScholarCross RefCross Ref
  18. Jeffrey P. Buzen and Annie W. Shum. 1995. Masf-multivariate adaptive statistical filtering. In Proceedings of the International CMG Conference. 1--10.Google ScholarGoogle Scholar
  19. Giuliano Casale, Amir Kalbasi, Diwakar Krishnamurthy, and Jerry Rolia. 2009. Automatic stress testing of multi-tier systems by dynamic bottleneck switch generation. In Proceedings of the 10th ACM/IFIP/USENIX International Conference on Middleware. Springer-Verlag New York, 20. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. Giuliano Casale, Ningfang Mi, Ludmila Cherkasova, and Evgenia Smirni. 2012. Dealing with burstiness in multi-tier applications: Models and their parameterization. IEEE Transactions on Software Engineering 38, 5, 1040--1053. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Varun Chandola, Arindam Banerjee, and Vipin Kumar. 2009. Anomaly detection: A survey. ACM Computing Surveys (CSUR) 41, 3, 15. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. Mike Y. Chen, Emre Kiciman, Eugene Fratkin, Armando Fox, and Eric Brewer. 2002. Pinpoint: Problem determination in large, dynamic internet services. In Proceedings of International Conference on Dependable Systems and Networks. IEEE, 595--604. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Ludmila Cherkasova, Kivanc Ozonat, Ningfang Mi, Julie Symons, and Evgenia Smirni. 2008. Anomaly? application change? or workload change? Towards automated detection of application performance anomaly and change. In Proceedings of the IEEE International Conference on Dependable Systems and Networks with FTCS and DCC. IEEE, 452--461.Google ScholarGoogle ScholarCross RefCross Ref
  24. Ludmila Cherkasova, Kivanc Ozonat, Ningfang Mi, Julie Symons, and Evgenia Smirni. 2009. Automated anomaly detection and performance modeling of enterprise applications. ACM Transactions on Computer Systems (TOCS) 27, 3, 6. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. I-Hsin Chung, Guojing Cong, David Klepacki, Simone Sbaraglia, Seetharami Seelam, and Hui-Fang Wen. 2008. A framework for automated performance bottleneck detection. In Proceedings of the IEEE International Symposium on Parallel and Distributed Processing (IPDPS’08). IEEE, 1--7.Google ScholarGoogle Scholar
  26. Ira Cohen, Jeffrey S. Chase, Moises Goldszmidt, Terence Kelly, and Julie Symons. 2004. Correlating instrumentation data to system states: A building block for automated diagnosis and control. In OSDI, Vol. 4. 16--16. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Ira Cohen, Steve Zhang, Moises Goldszmidt, Julie Symons, Terence Kelly, and Armando Fox. 2005. Capturing, indexing, clustering, and retrieving system history. In ACM SIGOPS Operating Systems Review, Vol. 39. ACM, 105--118. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. Brian F. Cooper, Adam Silberstein, Erwin Tam, Raghu Ramakrishnan, and Russell Sears. 2010. Benchmarking cloud serving systems with YCSB. In Proceedings of the 1st ACM Symposium on Cloud Computing. ACM, 143--154. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Marc Courtois and Murray Woodside. 2000. Using regression splines for software performance analysis. In Proceedings of the 2nd International Workshop on Software and Performance. ACM, 105--114. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Kaustav Das. 2009. Detecting patterns of anomalies. Technical Report CMU-ML-09-101. PhD thesis. Carnegie Mellon University, Department of Machine Learning. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. Daniel Joseph Dean, Hiep Nguyen, and Xiaohui Gu. 2012. Ubl: Unsupervised behavior learning for predicting performance anomalies in virtualized cloud systems. In Proceedings of the 9th International Conference on Autonomic Computing. ACM, 191--200. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Daniel J. Dean, Hiep Nguyen, Peipei Wang, and Xiaohui Gu. 2014. PerfCompass: Toward runtime performance anomaly fault localization for infrastructure-as-a-service clouds. In Proceedings of the 6th USENIX Conference on Hot Topics in Cloud Computing. USENIX Association, 16--16. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Anh Vu Do, Junliang Chen, Chen Wang, Young Choon Lee, Albert Y. Zomaya, and Bing Bing Zhou. 2011. Profiling applications for virtual machine placement in clouds. In Proceedings of the IEEE International Conference on Cloud Computing (CLOUD’11). IEEE, 660--667. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Evolven. 2011. Downtime, Outages and Failures—Understanding Their True Costs. Retrieved March 11, 2015 from http://www.evolven.com/blog/downtime-outages-and-failures-understanding-their-true-costs.html.Google ScholarGoogle Scholar
  35. Imola K. Fodor. 2002. A survey of dimension reduction techniques. Technical Report UCRL-ID-148494. Lawrence Livermore National Laboratory.Google ScholarGoogle Scholar
  36. Song Fu. 2011. Performance metric selection for autonomic anomaly detection on cloud computing systems. In Proceedings of the Global Telecommunications Conference (GLOBECOM’11). IEEE, 1--5.Google ScholarGoogle Scholar
  37. Song Fu, Jianguo Liu, and Husanbir Pannu. 2012. A Hybrid anomaly detection framework in cloud computing using one-class and two-class support vector machines. In Advanced Data Mining and Applications. Springer, 726--738.Google ScholarGoogle Scholar
  38. Alessio Gambi and Giovanni Toffetti. 2012. Modeling cloud performance with kriging. In Proceedings of the 2012 International Conference on Software Engineering. IEEE Press, 1439--1440. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Chunye Gong, Jie Liu, Qiang Zhang, Haitao Chen, and Zhenghu Gong. 2010. The characteristics of cloud computing. In Proceedings of the 39th International Conference on Parallel Processing Workshops (ICPPW’10). IEEE, 275--279. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Brendan Gregg. 2013. Systems Performance: Enterprise and the Cloud. Pearson Education. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Frank E. Grubbs. 1969. Procedures for detecting outlying observations in samples. Technometrics 11, 1, 1--21.Google ScholarGoogle ScholarCross RefCross Ref
  42. Xiaohui Gu and Haixun Wang. 2009. Online anomaly prediction for robust cluster systems. In Proceedings of the IEEE 25th International Conference on Data Engineering (ICDE’09). IEEE, 1000--1011. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Qiang Guan and Song Fu. 2013a. Adaptive anomaly identification by exploring metric subspace in cloud computing infrastructures. In Proceedings of the IEEE 32nd International Symposium on Reliable Distributed Systems (SRDS’13). IEEE, 205--214. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. Qiang Guan and Song Fu. 2013b. Wavelet-based multi-scale anomaly identification in cloud computing systems. In Proceedings of the Global Communications Conference (GLOBECOM’13). IEEE, 1379--1384.Google ScholarGoogle Scholar
  45. Qiang Guan, Song Fu, Nathan DeBardeleben, and Sean Blanchard. 2013. Exploring time and frequency domains for accurate and automated anomaly detection in cloud computing systems. In Proceedings of the IEEE 19th Pacific Rim International Symposium on Dependable Computing (PRDC’13). IEEE, 196--205. Google ScholarGoogle ScholarDigital LibraryDigital Library
  46. Qiang Guan, Ziming Zhang, and Song Fu. 2011. Proactive failure management by integrated unsupervised and semi-supervised learning for dependable cloud systems. In Proceedings of the 6th International Conference on Availability, Reliability and Security (ARES’11). IEEE, 83--90. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Qiang Guan, Ziming Zhang, and Song Fu. 2012. Ensemble of bayesian predictors and decision trees for proactive failure management in cloud computing systems. Journal of Communications 7, 1, 52--61.Google ScholarGoogle ScholarCross RefCross Ref
  48. Dan Gunter, Brian L. Tierney, Aaron Brown, Martin Swany, John Bresnahan, and Jennifer M. Schopf. 2007. Log summarization and anomaly detection for troubleshooting distributed systems. In Proceedings of the 8th IEEE/ACM International Conference on Grid Computing. IEEE, 226--234. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Neil J. Gunther. 2004. Benchmarking blunders and things that go bump in the night. CoRR. http://arxiv.org/abs/cs.PF/0404043Google ScholarGoogle Scholar
  50. Neil J. Gunther. 2011. Analyzing Computer System Performance with Perl:: PDQ. Springer. Google ScholarGoogle ScholarDigital LibraryDigital Library
  51. Masum Z. Hasan, Edgar Magana, Alexander Clemm, Lew Tucker, and Sree Lakshmi D. Gudreddi. 2012. Integrated and autonomic cloud resource scaling. In Proceedings of the Network Operations and Management Symposium (NOMS’12). IEEE, 1327--1334.Google ScholarGoogle ScholarCross RefCross Ref
  52. Victoria J. Hodge and Jim Austin. 2004. A survey of outlier detection methodologies. Artificial Intelligence Review 22, 2, 85--126. Google ScholarGoogle ScholarDigital LibraryDigital Library
  53. Cheng Huang. 2011. Public DNS System and Global Traffic Management. Retrieved April 15, 2014 from http://research.microsoft.com/en-us/um/people/chengh/slides/pubdns11.pptx.pdf.Google ScholarGoogle ScholarCross RefCross Ref
  54. Su-Yun Huang, Mei-Hsien Lee, and Chuhsing Kate Hsiao. 2006. Kernel canonical correlation analysis and its applications to nonlinear measures of association and test of independence. Institute of Statistical Science: Academia Sinica, Taiwan.Google ScholarGoogle Scholar
  55. Tian Huang, Yan Zhu, Qiannan Zhang, Yongxin Zhu, Dongyang Wang, Meikang Qiu, and Lei Liu. 2013. An LOF-based adaptive anomaly detection scheme for cloud computing. In Proceedings of the IEEE 37th Annual Computer Software and Applications Conference Workshops (COMPSACW’13). IEEE, 206--211. Google ScholarGoogle ScholarDigital LibraryDigital Library
  56. Waheed Iqbal, Matthew N. Dailey, David Carrera, and Paul Janecek. 2010. SLA-driven automatic bottleneck detection and resolution for read intensive multi-tier applications hosted on a cloud. In Advances in Grid and Pervasive Computing. Springer, 37--46. Google ScholarGoogle ScholarDigital LibraryDigital Library
  57. Brendan Jennings and Rolf Stadler. 2014. Resource management in clouds: Survey and research challenges. Journal of Network and Systems Management, 1--53. Google ScholarGoogle ScholarDigital LibraryDigital Library
  58. Gueyoung Jung, Galen Swint, Jason Parekh, Calton Pu, and Akhil Sahai. 2006. Detecting bottleneck in n-tier it applications through analysis. In Large Scale Management of Distributed Systems. Springer, 149--160. Google ScholarGoogle ScholarDigital LibraryDigital Library
  59. Srikanth Kandula, Ratul Mahajan, Patrick Verkaik, Sharad Agarwal, Jitendra Padhye, and Victor Bahl. 2009. Detailed diagnosis in computer networks. In ACM SIGCOMM. Google ScholarGoogle ScholarDigital LibraryDigital Library
  60. Hui Kang, Xiaoyun Zhu, and Jennifer L. Wong. 2012. DAPA: diagnosing application performance anomalies for virtualized infrastructures. In Presented as part of the 2nd USENIX Workshop on Hot Topics in Management of Internet, Cloud, and Enterprise Networks and Services. USENIX. Google ScholarGoogle ScholarDigital LibraryDigital Library
  61. Terence Kelly. 2005a. Detecting performance anomalies in global applications. In Proceedings of the 2nd Workshop on Real, Large Distributed Systems (WORLDS’05). Google ScholarGoogle ScholarDigital LibraryDigital Library
  62. Terence Kelly. 2005b. Transaction mix performance models: Methods and application to performance anomaly detection. In Proceedings of the 20th ACM Symposium on Operating Systems Principles. ACM, 1--3. Google ScholarGoogle ScholarDigital LibraryDigital Library
  63. Kissmetrics. 2014. How Loading Time Affects Your Bottom Line. Retrieved April 15, 2014 from http://blog.kissmetrics.com/loading-time/.Google ScholarGoogle Scholar
  64. David Kleinbaum, Lawrence Kupper, Azhar Nizam, and Eli Rosenberg. 2013. Applied Regression Analysis and Other Multivariable Methods. Cengage Learning.Google ScholarGoogle Scholar
  65. Seth Koehler, Greg Stitt, and Alan D. George. 2011. Platform-aware bottleneck detection for reconfigurable computing applications. ACM Transactions on Reconfigurable Technology and Systems (TRETS) 4, 3, 30. Google ScholarGoogle ScholarDigital LibraryDigital Library
  66. S. B. Kotsiantis. 2007. Supervised Machine Learning: A review of classification techniques. Informatica 31, 249--268.Google ScholarGoogle Scholar
  67. Zhiling Lan, Ziming Zheng, and Yawei Li. 2010. Toward automated anomaly identification in large-scale systems. IEEE Transactions on Parallel and Distributed Systems 21, 2, 174--187. Google ScholarGoogle ScholarDigital LibraryDigital Library
  68. Aleksandar Lazarevic, Levent Ertöz, Vipin Kumar, Aysel Ozgur, and Jaideep Srivastava. 2003. A comparative study of anomaly detection schemes in network intrusion detection. In Proceedings of SIAM International Conference on Data Mining. SIAM, 25--36.Google ScholarGoogle ScholarCross RefCross Ref
  69. Aleksandar Lazarevic, Nisheeth Srivastava, Ashutosh Tiwari, Josh Isom, Nikunj C. Oza, and Jaideep Srivastava. 2009. Theoretically optimal distributed anomaly detection. In Proceedings of the IEEE International Conference on Data Mining Workshops (ICDMW’09). IEEE, 515--520. Google ScholarGoogle ScholarDigital LibraryDigital Library
  70. Benjamin C. Lee and David M. Brooks. 2006. Accurate and efficient regression modeling for microarchitectural performance and power prediction. In ACM SIGPLAN Notices, Vol. 41. ACM, 185--194. Google ScholarGoogle ScholarDigital LibraryDigital Library
  71. Donghun Lee, Sang K. Cha, and Arthur H. Lee. 2012. A performance anomaly detection and analysis framework for DBMS development. IEEE Transactions on Knowledge and Data Engineering 24, 8, 1345--1360. Google ScholarGoogle ScholarDigital LibraryDigital Library
  72. Han Bok Lee and Benjamin G. Zorn. 1997. BIT: A Tool for instrumenting java bytecodes. In Proceedings of the USENIX Symposium on Internet Technologies and Systems. 73--82. Google ScholarGoogle ScholarDigital LibraryDigital Library
  73. Wenke Lee and Dong Xiang. 2001. Information-theoretic measures for anomaly detection. In Proceedings of IEEE Symposium on Security and Privacy (S&P’’01). IEEE, 130--143. Google ScholarGoogle ScholarDigital LibraryDigital Library
  74. Li Li and Allen D. Malony. 2006. Model-based performance diagnosis of master-worker parallel computations. In Euro-Par 2006 Parallel Processing. Springer, 35--46. Google ScholarGoogle ScholarDigital LibraryDigital Library
  75. Yihua Liao and V. Rao Vemuri. 2002. Use of k-nearest neighbor classifier for intrusion detection. Computers & Security 21, 5, 439--448. Google ScholarGoogle ScholarDigital LibraryDigital Library
  76. David J. Lilja. 2005. Measuring Computer Performance: A Practitioner’s Guide. Cambridge University Press.Google ScholarGoogle ScholarCross RefCross Ref
  77. Joao Paulo Magalhaes and L. Moura Silva. 2011. Adaptive profiling for root-cause analysis of performance anomalies in web-based applications. In Proceedings of the 10th IEEE International Symposium on Network Computing and Applications (NCA’11). IEEE, 171--178. Google ScholarGoogle ScholarDigital LibraryDigital Library
  78. Joao Paulo Magalhaes and Luis Moura Silva. 2010. Detection of performance anomalies in web-based applications. In Proceedings of the 9th IEEE International Symposium on Network Computing and Applications (NCA’10). IEEE, 60--67. Google ScholarGoogle ScholarDigital LibraryDigital Library
  79. João Paulo Magalhães and Luis Moura Silva. 2011. Root-cause analysis of performance anomalies in web-based applications. In Proceedings of the 2011 ACM Symposium on Applied Computing. ACM, 209--216. Google ScholarGoogle ScholarDigital LibraryDigital Library
  80. Nihar R. Mahapatra and Balakrishna Venkatrao. 1999. The processor-memory bottleneck: Problems and solutions. Crossroads 5, 3es, 2. Google ScholarGoogle ScholarDigital LibraryDigital Library
  81. Simon Malkowski, Markus Hedwig, Jason Parekh, Calton Pu, and Akhil Sahai. 2007. Bottleneck detection using statistical intervention analysis. In Managing Virtualization of Networks and Services. Springer, 122--134. Google ScholarGoogle ScholarDigital LibraryDigital Library
  82. Simon Malkowski, Markus Hedwig, and Calton Pu. 2009. Experimental evaluation of N-tier systems: Observation and analysis of multi-bottlenecks. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC’09). IEEE, 118--127. Google ScholarGoogle ScholarDigital LibraryDigital Library
  83. Markos Markou and Sameer Singh. 2003. Novelty detection: A review part 1: Statistical approaches. Signal processing 83, 12, 2481--2497. Google ScholarGoogle ScholarDigital LibraryDigital Library
  84. Andrew McHugh. 2013. Top 10 Web Outages of 2013. Retrieved March 11, 2015 from http://blog.smartbear.com/performance/top-10-web-outages-of-2013/.Google ScholarGoogle Scholar
  85. Bob Melander, Mats Bjorkman, and Per Gunningberg. 2000. A new end-to-end probing and analysis method for estimating bandwidth bottlenecks. In Proceedings of the Global Telecommunications Conference (GLOBECOM’00). IEEE, Vol. 1. IEEE, 415--420.Google ScholarGoogle ScholarCross RefCross Ref
  86. Ningfang Mi, Giuliano Casale, Ludmila Cherkasova, and Evgenia Smirni. 2008a. Burstiness in multi-tier applications: Symptoms, causes, and new models. In Proceedings of the 9th ACM/IFIP/USENIX International Conference on Middleware. Springer-Verlag, New York, 265--286. Google ScholarGoogle ScholarDigital LibraryDigital Library
  87. Ningfang Mi, Ludmila Cherkasova, Kivanc Ozonat, Julie Symons, and Evgenia Smirni. 2008b. Analysis of application performance and its change via representative application signatures. In Proceedings of the Network Operations and Management Symposium. IEEE, 216--223.Google ScholarGoogle ScholarCross RefCross Ref
  88. Jogesh K. Muppala, Steven P. Woolet, and Kishor S. Trivedi. 1991. Real-time systems performance in the presence of failures. Computer 24, 5, 37--47. Google ScholarGoogle ScholarDigital LibraryDigital Library
  89. A. S. Navaz, V. Sangeetha, and C. Prabhadevi. 2013. Entropy based anomaly detection system to prevent ddos attacks in cloud. International Journal of Computer Applications (0975-8887) 62, 15. http://arxiv.org/abs/1308.6745Google ScholarGoogle Scholar
  90. John E. Neilson, C. Murray Woodside, Dorina C. Petriu, and Shikharesh Majumdar. 1995. Software bottlenecking in client-server systems and rendezvous networks. IEEE Transactions on Software Engineering 21, 9, 776--782. Google ScholarGoogle ScholarDigital LibraryDigital Library
  91. Hiep Nguyen, Zhiming Shen, Yongmin Tan, and Xiaohui Gu. 2013. FChain: Toward black-box online fault localization for cloud systems. In Proceedings of the IEEE 33rd International Conference on Distributed Computing Systems (ICDCS’13). IEEE, 21--30. Google ScholarGoogle ScholarDigital LibraryDigital Library
  92. George Nychis, Vyas Sekar, David G. Andersen, Hyong Kim, and Hui Zhang. 2008. An empirical evaluation of entropy-based traffic anomaly detection. In Proceedings of the 8th ACM SIGCOMM conference on Internet Measurement. ACM, 151--156. Google ScholarGoogle ScholarDigital LibraryDigital Library
  93. John S. Oakland. 2008. Statistical Process control. Routledge.Google ScholarGoogle Scholar
  94. Husanbir S. Pannu, Jianguo Liu, and Song Fu. 2012. A self-evolving anomaly detection framework for developing highly dependable utility clouds. In Proceedings of the Global Communications Conference (GLOBECOM’12). IEEE, 1605--1610.Google ScholarGoogle ScholarCross RefCross Ref
  95. Iakovos Panourgias. 2011. NUMA Effects on Multicore, Multisocket Systems. The University of Edinburgh.Google ScholarGoogle Scholar
  96. Jason Parekh, Gueyoung Jung, Galen Swint, Calton Pu, and Akhil Sahai. 2006. Issues in bottleneck detection in multi-tier enterprise applications. In Proceedings of the 14th IEEE International Workshop on Quality of Service (IWQoS’06). IEEE, 302--303.Google ScholarGoogle ScholarCross RefCross Ref
  97. Emanuel Parzen. 1962. On estimation of a probability density function and mode. The Annals of Mathematical Statistics, 1065--1076.Google ScholarGoogle ScholarCross RefCross Ref
  98. Johannes Passing. 2005. Profiling, monitoring and tracing in SAP web application server. Seminar Systems Modelling, Hasso Plattner Insitute for Software Systems Engineering.Google ScholarGoogle Scholar
  99. Manjula Peiris, James H. Hill, Jorgen Thelin, Sergey Bykov, Gabriel Kliot, and Christian Konig. 2014. PAD: Performance anomaly detection in multi-server distributed systems. In Proceedings of the 7th IEEE International Conference on Cloud Computing (CLOUD’14). IEEE. Google ScholarGoogle ScholarDigital LibraryDigital Library
  100. Soila Pertet and Priya Narasimhan. 2005. Causes of failure in web applications (cmu-pdl-05-109). Parallel Data Laboratory, 48.Google ScholarGoogle Scholar
  101. Rob Powers, Moises Goldszmidt, and Ira Cohen. 2005. Short term performance forecasting in enterprise systems. In Proceedings of the 11th ACM SIGKDD International Conference on Knowledge Discovery in Data Mining. ACM, 801--807. Google ScholarGoogle ScholarDigital LibraryDigital Library
  102. Calton Pu, Akhil Sahai, Jason Parekh, Gueyoung Jung, Ji Bae, You-Kyung Cha, Timothy Garcia, Danesh Irani, Jae Lee, and Qifeng Lin. 2007. An observation-based approach to performance characterization of distributed n-tier applications. In IEEE 10th International Symposium on Workload Characterization. IISWC 2007. IEEE, 161--170. Google ScholarGoogle ScholarDigital LibraryDigital Library
  103. Xing Pu, Ling Liu, Yiduo Mei, Sankaran Sivathanu, Younggyun Koh, and Calton Pu. 2010. Understanding performance interference of i/o workload in virtualized cloud environments. In Proceedings of the IEEE 3rd International Conference on Cloud Computing (CLOUD’10). IEEE, 51--58. Google ScholarGoogle ScholarDigital LibraryDigital Library
  104. Sutharshan Rajasegarar, Christopher Leckie, and Marimuthu Palaniswami. 2008. Anomaly detection in wireless sensor networks. Wireless Communications, IEEE 15, 4, 34--40. Google ScholarGoogle ScholarDigital LibraryDigital Library
  105. Christoph Rathfelder, Stefan Becker, Klaus Krogmann, and Ralf Reussner. 2012. Workload-aware system monitoring using performance predictions applied to a large-scale e-mail system. In Proceedings of the Joint Working IEEE/IFIP Conference on Software Architecture (WICSA) and European Conference on Software Architecture (ECSA’12). IEEE, 31--40. Google ScholarGoogle ScholarDigital LibraryDigital Library
  106. Charles Reiss, John Wilkes, and Joseph L. Hellerstein. 2011. Google cluster-usage traces: Format+ schema. Google Inc., White Paper.Google ScholarGoogle Scholar
  107. Douglas Reynolds. 2009. Gaussian mixture models. Encyclopedia of Biometrics, 659--663.Google ScholarGoogle Scholar
  108. S. Rogers and M. Girolami. 2011. A First Course in Machine Learning. Taylor & Francis. Google ScholarGoogle ScholarDigital LibraryDigital Library
  109. Raja R. Sambasivan, Alice X. Zheng, Michael De Rosa, Elie Krevat, Spencer Whitman, Michael Stroucken, William Wang, Lianghong Xu, and Gregory R. Ganger. 2011. Diagnosing performance changes by comparing request flows. In Proceedings of the 8th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, 43--56. Google ScholarGoogle ScholarDigital LibraryDigital Library
  110. Bianca Schroeder, Garth Gibson, and others. 2010. A Large-scale study of failures in high-performance-computing systems. IEEE Transactions on Dependable and Secure Computing 7, 4, 337--350. Google ScholarGoogle ScholarDigital LibraryDigital Library
  111. Craig A. Shallahamer. 1995. Predicting Computing System Capacity and Throughput. Oracle Corporation White Paper. Retrieved from http://www.orapub.com.Google ScholarGoogle Scholar
  112. Claude Elwood Shannon. 2001. A mathematical theory of communication. ACM SIGMOBILE Mobile Computing and Communications Review 5, 1, 3--55. Google ScholarGoogle ScholarDigital LibraryDigital Library
  113. Bikash Sharma, Praveen Jayachandran, Akshat Verma, and Chita R. Das. 2013. CloudPD: Problem determination and diagnosis in shared dynamic clouds. In Proceedings of the 43rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN’13). IEEE, 1--12. Google ScholarGoogle ScholarDigital LibraryDigital Library
  114. Sameer Shende. 1999. Profiling and tracing in Linux. In Proceedings of the Extreme Linux Workshop, Vol. 2. Citeseer.Google ScholarGoogle Scholar
  115. Derek Smith, Qiang Guan, and Song Fu. 2010. An anomaly detection framework for autonomic management of compute cloud systems. In Proceedings of the IEEE 34th Annual Computer Software and Applications Conference Workshops (COMPSACW’10). IEEE, 376--381. Google ScholarGoogle ScholarDigital LibraryDigital Library
  116. Ralf Steuer, Jürgen Kurths, Carsten O. Daub, Janko Weise, and Joachim Selbig. 2002. The mutual information: Detecting and evaluating dependencies between variables. Bioinformatics 18, Suppl 2, S231--S240.Google ScholarGoogle ScholarCross RefCross Ref
  117. Yongmin Tan and Xiaohui Helen Adviser-Gu. 2012. Online Performance Anomaly Prediction and Prevention for Complex Distributed Systems. North Carolina State University.Google ScholarGoogle Scholar
  118. Yongmin Tan and Xiaohui Gu. 2010. On predictability of system anomalies in real world. In Proceedings of the IEEE International Symposium on Modeling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS’’10). IEEE, 133--140. Google ScholarGoogle ScholarDigital LibraryDigital Library
  119. Yongmin Tan, Xiaohui Gu, and Haixun Wang. 2010. Adaptive system anomaly prediction for large-scale hosting infrastructures. In Proceedings of the 29th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing. ACM, 173--182.
  120. Yongmin Tan, Hiep Nguyen, Zhiming Shen, Xiaohui Gu, Chitra Venkatramani, and Deepak Rajan. 2012. Prepare: Predictive performance anomaly prevention for virtualized cloud systems. In Proceedings of the IEEE 32nd International Conference on Distributed Computing Systems (ICDCS’12). IEEE, 285--294.
  121. Jean-Claude Tarby, Houcine Ezzedine, José Rouillard, Chi Dung Tran, Philippe Laporte, and Christophe Kolski. 2007. Traces using aspect oriented programming and interactive agent-based architecture for early usability evaluation: Basic principles and comparison. In Human-Computer Interaction. Interaction Design and Usability. Springer, 632--641.
  122. Igor Trubin. 2005. Capturing workload pathology by statistical exception detection system. In Proceedings of the Computer Measurement Group. Citeseer.
  123. Igor A. Trubin and Linwood Merritt. 2004. Mainframe global and workload level statistical exception detection system, based on MASF. In Proceedings of the International CMG Conference. 671--678.
  124. John W. Tukey. 1977. Exploratory data analysis. Addison-Wesley, Reading, Mass.
  125. Arno Wagner and Bernhard Plattner. 2005. Entropy based worm and anomaly detection in fast IP networks. In 14th IEEE International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprise. IEEE, 172--177.
  126. Christian Walck. 2007. Handbook on statistical distributions for experimentalists. Internal Report SUF-PFY/96-01, University of Stockholm.
  127. Chengwei Wang, Karsten Schwan, and Matthew Wolf. 2009. Ebat: An entropy based online anomaly tester for data center management. In Proceedings of the IFIP/IEEE International Symposium on Integrated Network Management-Workshops. IEEE, 79--80.
  128. Chengwei Wang, Vanish Talwar, Karsten Schwan, and Parthasarathy Ranganathan. 2010. Online detection of utility cloud anomalies using metric distributions. In Proceedings of the Network Operations and Management Symposium (NOMS’10). IEEE, 96--103.
  129. Chengwei Wang, Krishnamurthy Viswanathan, Lakshminarayan Choudur, Vanish Talwar, Wade Satterfield, and Karsten Schwan. 2011. Statistical techniques for online anomaly detection in data centers. In Proceedings of the IFIP/IEEE International Symposium on Integrated Network Management (IM’11). IEEE, 385--392.
  130. Haichuan Wang, Qiming Teng, Xiao Zhong, and Peter F. Sweeney. 2009. Understanding cross-tier delay of multi-tier application using selective invocation context extraction. In Proceedings of the 10th ACM/IFIP/USENIX International Conference on Middleware. Springer-Verlag, New York, 34.
  131. Qingyang Wang, Yasuhiko Kanemasa, Jack Li, Deepal Jayasinghe, Toshihiro Shimizu, Masazumi Matsubara, Motoyuki Kawaba, and Calton Pu. 2013a. Detecting transient bottlenecks in n-tier applications through fine-grained analysis. In Proceedings of the IEEE 33rd International Conference on Distributed Computing Systems (ICDCS’13). IEEE, 31--40.
  132. Qingyang Wang, Yasuhiko Kanemasa, Jack Li, Deepal Jayasinghe, Toshihiro Shimizu, Masazumi Matsubara, Motoyuki Kawaba, and Calton Pu. 2013b. An experimental study of rapidly alternating bottlenecks in n-tier applications. In Proceedings of the IEEE 6th International Conference on Cloud Computing (CLOUD’13). IEEE, 171--178.
  133. Tao Wang, Jun Wei, Feng Qin, WenBo Zhang, Hua Zhong, and Tao Huang. 2013. Detecting performance anomaly with correlation analysis for Internetware. Science China Information Sciences 56, 8, 1--15.
  134. Tao Wang, Jun Wei, Wenbo Zhang, Hua Zhong, and Tao Huang. 2014. Workload-aware anomaly detection for web applications. Journal of Systems and Software 89, 19--32.
  135. Tao Wang, Wenbo Zhang, Jun Wei, and Hua Zhong. 2012. Workload-aware online anomaly detection in enterprise applications with local outlier factor. In Proceedings of the IEEE 36th Annual Computer Software and Applications Conference (COMPSAC’12). IEEE, 25--34.
  136. Pengcheng Xiong, Calton Pu, Xiaoyun Zhu, and Rean Griffith. 2013. vPerfGuard: An automated model-driven framework for application performance diagnosis in consolidated cloud environments. In Proceedings of the ACM/SPEC International Conference on Performance Engineering. ACM, 271--282.
  137. Yahoo! 2014. Webscope dataset—Computer System Data. Retrieved from http://webscope.sandbox.yahoo.com/catalog.php?datatype=.
  138. Lingyun Yang, Chuang Liu, Jennifer M. Schopf, and Ian Foster. 2007. Anomaly detection and diagnosis in grid environments. In Proceedings of the 2007 ACM/IEEE Conference on Supercomputing. IEEE, 1--9.
  139. Li Yu and Zhiling Lan. 2013. A scalable, non-parametric anomaly detection framework for Hadoop. In Proceedings of the 2013 ACM Cloud and Autonomic Computing Conference. ACM, 22.
  140. Minlan Yu, Albert Greenberg, Dave Maltz, Jennifer Rexford, Lihua Yuan, Srikanth Kandula, and Changhoon Kim. 2011. Profiling network performance for multi-tier data center applications. In Proceedings of Symposium on Networked System Design and Implementation. 57--70.
  141. Qi Zhang, Lu Cheng, and Raouf Boutaba. 2010. Cloud computing: State-of-the-art and research challenges. Journal of Internet Services and Applications 1, 1, 7--18.
  142. Qi Zhang, Ludmila Cherkasova, Guy Mathews, Wayne Greene, and Evgenia Smirni. 2007b. R-capriccio: A capacity planning and anomaly detection tool for enterprise services with live workloads. In Middleware 2007. Springer, 244--265.
  143. Qi Zhang, Ludmila Cherkasova, and Evgenia Smirni. 2007a. A regression-based analytic model for dynamic resource provisioning of multi-tier applications. In Proceedings of the 4th International Conference on Autonomic Computing (ICAC’07). IEEE, 27--27.
  144. Steve Zhang, Ira Cohen, Moises Goldszmidt, Julie Symons, and Armando Fox. 2005. Ensembles of models for automated diagnosis of system performance problems. In Proceedings of International Conference on Dependable Systems and Networks. IEEE, 644--653.

    • Published in

      ACM Computing Surveys  Volume 48, Issue 1
      September 2015
      592 pages
      ISSN:0360-0300
      EISSN:1557-7341
      DOI:10.1145/2808687
      • Editor: Sartaj Sahni

      Copyright © 2015 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 22 July 2015
      • Accepted: 1 May 2015
      • Revised: 1 March 2015
      • Received: 1 December 2014

      Qualifiers

      • survey
      • Research
      • Refereed
