ABSTRACT
Anomaly detection is a critical step towards building a secure and trustworthy system. The primary purpose of a system log is to record system states and significant events at critical points, to help debug system failures and perform root cause analysis. Such log data is available in nearly all computer systems. Because log data is a valuable resource for understanding system status and performance issues, system logs are naturally an excellent source of information for online monitoring and anomaly detection. We propose DeepLog, a deep neural network model utilizing Long Short-Term Memory (LSTM), to model a system log as a natural language sequence. This allows DeepLog to automatically learn log patterns from normal execution and to detect anomalies when log patterns deviate from the model trained on log data from normal execution. In addition, we demonstrate how to incrementally update the DeepLog model in an online fashion so that it can adapt to new log patterns over time. Furthermore, DeepLog constructs workflows from the underlying system log so that, once an anomaly is detected, users can diagnose it and perform root cause analysis effectively. Extensive experimental evaluations over large log data show that DeepLog outperforms existing log-based anomaly detection methods based on traditional data mining methodologies.
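The detection idea in the abstract (learn which log entries normally follow which, flag deviations, and keep updating the model online) can be illustrated with a minimal sketch. This is not the DeepLog implementation: a simple count-based model over fixed-length histories is swapped in for the LSTM so the example stays self-contained, and the names `train`, `update`, `detect`, `window`, and `top_k` are hypothetical choices made for illustration, as are the example log keys.

```python
from collections import Counter, defaultdict


def update(model, seq, window=2):
    """Add one normal log-key sequence to the model in place (online update)."""
    for i in range(window, len(seq)):
        history = tuple(seq[i - window:i])
        model[history][seq[i]] += 1
    return model


def train(sequences, window=2):
    """Count which log key follows each history of `window` keys in normal runs.

    Stand-in for the LSTM: a count table instead of a learned sequence model.
    """
    model = defaultdict(Counter)
    for seq in sequences:
        update(model, seq, window)
    return model


def detect(model, seq, window=2, top_k=1):
    """Return positions whose observed key is not among the top_k predicted keys."""
    anomalies = []
    for i in range(window, len(seq)):
        history = tuple(seq[i - window:i])
        # .get avoids inserting unseen histories into the defaultdict.
        counts = model.get(history, Counter())
        predicted = [key for key, _ in counts.most_common(top_k)]
        if seq[i] not in predicted:
            anomalies.append(i)
    return anomalies


if __name__ == "__main__":
    # Hypothetical log keys from three identical normal executions.
    normal_runs = [["open", "read", "write", "close"]] * 3
    model = train(normal_runs)

    print(detect(model, ["open", "read", "write", "close"]))   # no deviations
    print(detect(model, ["open", "read", "delete", "close"]))  # deviating positions
```

Calling `update` with a sequence that an operator has confirmed as normal adds its patterns to the model, so a later `detect` call with a larger `top_k` no longer reports them. This mirrors the online adaptation the abstract describes, only in a much simpler form.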
Index Terms
- DeepLog: Anomaly Detection and Diagnosis from System Logs through Deep Learning