survey

Performance Anomaly Detection and Bottleneck Identification

Authors:
Olumuyiwa Ibidunmoye, Umeå University, Umeå, Sweden
Francisco Hernández-Rodriguez, Umeå University, Umeå, Sweden
Erik Elmroth, Umeå University, Umeå, Sweden

ACM Computing Surveys, Volume 48, Issue 1, Article No. 4, pp. 1–35. https://doi.org/10.1145/2791120

Published: 22 July 2015

Abstract

To meet stringent performance requirements, system administrators must effectively detect undesirable performance behaviours, identify potential root causes, and take adequate corrective measures. The problem of uncovering and understanding performance anomalies and their causes (bottlenecks) in different system and application domains is well studied. To assess progress and research trends and to identify open challenges, we have reviewed major contributions in the area and present our findings in this survey. The survey provides an overview of anomaly detection and bottleneck identification research as it relates to the performance of computing systems. By identifying the fundamental elements of the problem, we categorize existing solutions according to multiple factors, such as detection goals, the nature of applications and systems, system observability, and detection methods.

Qi Zhang, Ludmila Cherkasova, and Evgenia Smirni. 2007a. A regression-based analytic model for dynamic resource provisioning of multi-tier applications. In Proceedings of the 4th International Conference on Autonomic Computing (ICAC’07). IEEE, 27--27. Google ScholarDigital Library
Steve Zhang, Ira Cohen, Moises Goldszmidt, Julie Symons, and Armando Fox. 2005. Ensembles of models for automated diagnosis of system performance problems. In Proceedings of International Conference on Dependable Systems and Networks. IEEE, 644--653. Google ScholarDigital Library

Index Terms

Performance Anomaly Detection and Bottleneck Identification
1. Software and its engineering
  1. Software creation and management
    1. Software development process management
      1. Software development methods

Published in
ACM Computing Surveys Volume 48, Issue 1
September 2015
592 pages
ISSN:0360-0300
EISSN:1557-7341
DOI:10.1145/2808687
Editor:
Sartaj Sahni
Department of Computer and Information Science and Engineering / University of Florida / Gainesville, FL 32611
Copyright © 2015 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 22 July 2015
- Accepted: 1 May 2015
- Revised: 1 March 2015
- Received: 1 December 2014
Author Tags
Systems performance
bottleneck detection
performance anomaly detection
performance problem identification
Qualifiers
- survey
- Research
- Refereed