ABSTRACT
Diagnosing production-run failures is a challenging yet important task. Most previous work focuses on offsite diagnosis, i.e., diagnosis at the development site with the programmers present. This is insufficient for production-run failures because: (1) it is difficult to reproduce failures offsite for diagnosis; (2) offsite diagnosis cannot provide timely guidance for recovery or security purposes; (3) it is infeasible to provide a programmer to diagnose every production-run failure; and (4) privacy concerns limit the release of information (e.g., core dumps) to programmers.
To address production-run failures, we propose a system, called Triage, that automatically performs onsite software failure diagnosis at the very moment of failure. It provides a detailed diagnosis report, including the nature of the failure, triggering conditions, related code and variables, the fault propagation chain, and potential fixes. Triage achieves this by leveraging lightweight reexecution support to efficiently capture the failure environment, repeatedly replay the moment of failure, and dynamically analyze the occurring failure using different diagnosis techniques. Triage employs a failure diagnosis protocol that mimics the steps a human takes in debugging. This extensible protocol provides a framework that enables the use of various existing and new diagnosis techniques. We also propose a new failure diagnosis technique, delta analysis, to identify failure-related conditions, code, and variables.
We evaluate these ideas in real-system experiments with 10 real software failures from 9 open source applications, including four servers. Triage accurately diagnoses the evaluated failures, providing likely root causes and even the fault propagation chain, while keeping normal-run overhead under 5%. Finally, our user study of the diagnosis and repair of real bugs shows that Triage saves time (99.99% confidence), reducing the total time to fix by almost half.
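The abstract does not spell out how repeated replay narrows down triggering conditions, so the following is only an illustrative sketch and not Triage's delta analysis. It shows a simplified form of delta-debugging-style input minimization: a recorded input is replayed many times with parts removed, and a removal is kept whenever the failure still reproduces. The function name `minimize_failing_input`, the `fails` callback (standing in for "roll back to a checkpoint and replay"), and the sample request log are all assumptions made for this example.

```python
from typing import Callable, List, Sequence


def minimize_failing_input(
    data: Sequence[str],
    fails: Callable[[List[str]], bool],
) -> List[str]:
    """Shrink `data` while `fails` keeps returning True (simplified ddmin).

    `fails` is a hypothetical stand-in for replaying the program from a
    checkpoint with the candidate input and observing whether it still fails.
    """
    current = list(data)
    if not fails(current):
        raise ValueError("the starting input must reproduce the failure")

    chunks = 2  # how many pieces the input is currently split into
    while len(current) >= 2:
        size = max(1, len(current) // chunks)
        reduced = False
        # Try dropping one chunk at a time and replaying the remainder.
        for start in range(0, len(current), size):
            candidate = current[:start] + current[start + size:]
            if candidate and fails(candidate):
                current = candidate          # failure persists: keep the cut
                chunks = max(chunks - 1, 2)
                reduced = True
                break
        if not reduced:
            if size == 1:
                break                        # no single element can be removed
            chunks = min(chunks * 2, len(current))  # retry with finer pieces
    return current


# Hypothetical request log in which one request triggers the failure.
requests = ["GET /a", "GET /b", "POST /crash", "GET /c"]
trigger = minimize_failing_input(requests, lambda reqs: "POST /crash" in reqs)
print(trigger)  # ['POST /crash']
```

Each call to `fails` corresponds to one replay, so the cost of this kind of search is dominated by how cheaply the failure moment can be re-executed, which is why lightweight reexecution support matters for onsite diagnosis.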