Abstract
To meet stringent performance requirements, system administrators must effectively detect undesirable performance behaviours, identify their potential root causes, and take adequate corrective measures. The problem of uncovering and understanding performance anomalies and their causes (bottlenecks) in different system and application domains is well studied. To assess progress, identify research trends, and highlight open challenges, we have reviewed major contributions in the area and present our findings in this survey. The survey provides an overview of anomaly detection and bottleneck identification research as it relates to the performance of computing systems. By identifying the fundamental elements of the problem, we categorize existing solutions along multiple factors, such as detection goals, the nature of applications and systems, system observability, and detection methods.
Index Terms
- Performance Anomaly Detection and Bottleneck Identification