DOI: 10.1145/1294261.1294275

Triage: diagnosing production run failures at the user's site

Published: 14 October 2007

ABSTRACT

Diagnosing production run failures is a challenging yet important task. Most previous work focuses on offsite diagnosis, i.e., development-site diagnosis with the programmers present. This is insufficient for production-run failures because: (1) it is difficult to reproduce failures offsite for diagnosis; (2) offsite diagnosis cannot provide timely guidance for recovery or security purposes; (3) it is infeasible to provide a programmer to diagnose every production run failure; and (4) privacy concerns limit the release of information (e.g., core dumps) to programmers.

To address production-run failures, we propose a system, called Triage, that automatically performs onsite software failure diagnosis at the very moment of failure. It provides a detailed diagnosis report, including the failure nature, triggering conditions, related code and variables, the fault propagation chain, and potential fixes. Triage achieves this by leveraging lightweight reexecution support to efficiently capture the failure environment and repeatedly replay the moment of failure, dynamically analyzing the occurring failure with different diagnosis techniques. Triage employs a failure diagnosis protocol that mimics the steps a human takes in debugging. This extensible protocol provides a framework that enables the use of various existing and new diagnosis techniques. We also propose a new failure diagnosis technique, delta analysis, to identify failure-related conditions, code, and variables.
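The delta analysis step lends itself to a brief illustration. The sketch below is a hedged approximation, assuming the replay engine can record a per-replay trace of executed basic blocks; the trace format, labels, and function name are hypothetical, not Triage's actual interfaces. The idea is to diff a failing replay's trace against the closest non-failing replay, so that code executed only on the failure path surfaces as failure-related.

```python
# Hedged sketch of trace-based delta analysis (not Triage's real API):
# diff the basic-block trace of a failing replay against a similar
# non-failing replay; blocks unique to the failing run become suspects.
import difflib

def delta_analysis(failing_trace: list[str], passing_trace: list[str]) -> list[str]:
    """Return basic blocks that appear only in the failing run, in order."""
    matcher = difflib.SequenceMatcher(a=passing_trace, b=failing_trace, autojunk=False)
    suspects = []
    for tag, _i1, _i2, j1, j2 in matcher.get_opcodes():
        # 'insert' and 'replace' regions exist only on the failing side (b).
        if tag in ("insert", "replace"):
            suspects.extend(failing_trace[j1:j2])
    return suspects

# Hypothetical traces: "function:block" labels collected while replaying
# from a checkpoint taken shortly before the failure.
failing = ["main:0", "parse:0", "parse:3", "copy:0", "copy:7"]
passing = ["main:0", "parse:0", "parse:1", "copy:0"]
print(delta_analysis(failing, passing))  # ['parse:3', 'copy:7']
```

Within the repeated-replay loop described above, such a diff would run after cheaper protocol steps, narrowing attention to the differing code paths and the variables they touch.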

We evaluate these ideas in real system experiments with 10 real software failures from 9 open source applications, including four servers. Triage accurately diagnoses the evaluated failures, providing likely root causes and even the fault propagation chain, while keeping normal-run overhead under 5%. Finally, our user study of the diagnosis and repair of real bugs shows that Triage saves time (99.99% confidence), reducing the total time to fix by almost half.


Supplemental Material

1294275.mp4 (mp4, 184.2 MB)


Published in

SOSP '07: Proceedings of twenty-first ACM SIGOPS symposium on Operating systems principles
October 2007, 378 pages
ISBN: 9781595935915
DOI: 10.1145/1294261

Also published in ACM SIGOPS Operating Systems Review, Volume 41, Issue 6 (SOSP '07), December 2007, 363 pages
ISSN: 0163-5980
DOI: 10.1145/1323293

        Copyright © 2007 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States


        Acceptance Rates

Overall Acceptance Rate: 131 of 716 submissions (18%)

