article

Rx: Treating bugs as allergies—a safe method to survive software failures

Authors:
Feng Qin

The Ohio State University, Columbus, OH

The Ohio State University, Columbus, OH
View Profile

,
Joseph Tucek

University of Illinois at Urbana-Champaign, Urbana, IL

University of Illinois at Urbana-Champaign, Urbana, IL
View Profile

,
Yuanyuan Zhou

University of Illinois at Urbana-Champaign, Urbana, IL

University of Illinois at Urbana-Champaign, Urbana, IL
View Profile

,
Jagadeesan Sundaresan

Citadel Investment Group, Chicago, IL

Citadel Investment Group, Chicago, IL
View Profile

Authors Info & Claims

ACM Transactions on Computer Systems Volume 25 Issue 3pp 7–eshttps://doi.org/10.1145/1275517.1275519

Published:01 August 2007Publication History

ACM Transactions on Computer Systems

Abstract

Many applications demand availability. Unfortunately, software failures greatly reduce system availability. Prior work on surviving software failures suffers from one or more of the following limitations: required application restructuring, inability to address deterministic software bugs, unsafe speculation on program execution, and long recovery time.

This paper proposes an innovative safe technique, called Rx, which can quickly recover programs from many types of software bugs, both deterministic and nondeterministic. Our idea, inspired from allergy treatment in real life, is to rollback the program to a recent checkpoint upon a software failure, and then to reexecute the program in a modified environment. We base this idea on the observation that many bugs are correlated with the execution environment, and therefore can be avoided by removing the “allergen” from the environment. Rx requires few to no modifications to applications and provides programmers with additional feedback for bug diagnosis.

We have implemented Rx on Linux. Our experiments with five server applications that contain seven bugs of various types show that Rx can survive six out of seven software failures and provide transparent fast recovery within 0.017--0.16 seconds, 21--53 times faster than the whole program restart approach for all but one case (CVS). In contrast, the two tested alternatives, a whole program restart approach and a simple rollback and reexecution without environmental changes, cannot successfully recover the four servers (Squid, Apache, CVS, and ypserv) that contain deterministic bugs, and have only a 40% recovery rate for the server (MySQL) that contains a nondeterministic concurrency bug. Additionally, Rx's checkpointing system is lightweight, imposing small time and space overheads.

References

Abrams, M., Standridge, C. R., Abdulla, G., Williams, S., and Fox, E. A. 1995. Caching proxies: Limitations and potentials. In Proceedings of the 4th International WWW Conference.Google Scholar
Alvisi, L. and Marzullo, K. 1996. Trade-offs in implementing optimal message logging protocols. In Proceedings of the 15th ACM Symposium on the Principles of Distributed Computing. Google ScholarDigital Library
Ammann, P. E. and Knight, J. C. 1988. Data diversity: An approach to software fault tolerance. IEEE Trans. Comput. 37, 4, 418--425. Google ScholarDigital Library
Amza, C., Cox, A., and Zwaenepoel, W. 2000. Data replication strategies for fault tolerance and availability on commodity clusters. In Proceedings of the International Conference on Dependable Systems and Networks. Google ScholarDigital Library
Avizienis, A. 1985. The N-version approach to fault-tolerant software. IEEE Trans. Softw. Engin. 11, 12. Google ScholarDigital Library
Avizienis, A. and Chen, L. 1977. On the implementation of N-version programming for software fault tolerance during execution. In Proceedings of the 1st International Computer Software and Applications Conference.Google Scholar
Bartlett, J. F. 1981. A NonStop kernel. In Proceedings of the 8th Symposium on Operating Systems Principles. Google ScholarDigital Library
Berger, E. D. and Zorn, B. G. 2006. Diehard: Probabilistic memory safety for unsafe languages. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. Google ScholarDigital Library
Birman, K. P. 1996. Building Secure and Reliable Network Applications. Manning ISBN: 1-884777-29-5, Chapter 19. Google ScholarDigital Library
Bobbio, A. and Sereno, M. 1998. Fine grained software rejuvenation models. In Proceedings of the International Computer Performance and Dependability Symposium.Google Scholar
Bohra, A., Neamtiu, I., Gallard, P., Sultan, F., and Iftode, L. 2004. Remote repair of operating system state using backdoors. In Proceedings of the International Conference on Autonomic Computing. Google ScholarDigital Library
Borg, A., Baumbach, J., and Glazer, S. 1983. A message system supporting fault tolerance. In Proceedings of the 9th Symposium on Operating Systems Principles. Google ScholarDigital Library
Borg, A., Blau, W., Graetsch, W., Herrmann, F., and Oberle, W. 1989. Fault tolerance under UNIX. ACM Trans. Comput. Syst. 7, 1. Google ScholarDigital Library
Bressoud, T. C. and Schneider, F. B. 1996. Hypervisor-based fault tolerance. ACM Trans. Comput. Syst. 14, 1 (Feb), 80--107. Google ScholarDigital Library
Bush, W., Pincus, J., and Sielaff, D. 2000. A static analyzer for finding dynamic programming errors. Softw. Pract. Experi. 30, 7, 775--802. Google ScholarDigital Library
Candea, G., Cutler, J., Fox, A., Doshi, R., Garg, P., and Gowda, R. 2002. Reducing recovery time in a small recursively restartable system. In Proceedings of the International Conference on Dependable Systems and Networks. Google ScholarDigital Library
Candea, G., Kawamoto, S., Fujiki, Y., Friedman, G., and Fox, A. 2004. Microreboot---A technique for cheap recovery. In Proceedings of the 6th Symposium on Operating System Design and Implementation. Google ScholarDigital Library
Castro, M. and Liskov, B. 1999. Practical byzantine fault tolerance. In Proceedings of the 3rd Symposium on Operating System Design and Implementation. Google ScholarDigital Library
Castro, M. and Liskov, B. 2000. Proactive recovery in a Byzantine-Fault-Tolerant system. In Proceedings of the 4th Symposium on Operating System Design and Implementation. Google ScholarDigital Library
CERT/CC. Advisories. http://www.cert.org/advisories/.Google Scholar
Chandra, S. and Chen, P. M. 2000. Whither generic recovery from application faults? A fault study using open-source software. In Proceedings of the International Conference on Dependable Systems and Networks. Google ScholarDigital Library
Chandra, S. and Chen, P. M. 2002. The impact of recovery mechanisms on the likelihood of saving corrupted state. In Proceedings of the 13th International Symposium on Software Reliability Engineering. Google ScholarDigital Library
Chen, Y., Plank, J. S., and Li, K. 1997. CLIP: A checkpointing tool for message-passing parallel programs. In Proceedings of the ACM/IEEE Supercomputing Conference. Google ScholarDigital Library
Condit, J., Harren, M., McPeak, S., Necula, G. C., and Weimer, W. 2003. CCured in the real world. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. Google ScholarDigital Library
Cowan, C., Pu, C., Maier, D., Walpole, J., Bakke, P., Beattie, S., Grier, A., Wagle, P., Zhang, Q., and Hinton, H. 1998. StackGuard: Automatic adaptive detection and prevention of buffer-overflow attacks. In Proceedings of the 7th USENIX Security Symposium. Google ScholarDigital Library
Dunlap, G. W., King, S. T., Cinar, S., Basrai, M. A., and Chen, P. M. 2002. Revirt: Enabling intrusion analysis through virtual-machine logging and replay. In Proceedings of the 5th Symposium on Operating System Design and Implementation. Google ScholarDigital Library
Elnozahy, E. N. M., Alvisi, L., Wang, Y.-M., and Johnson, D. B. 2002. A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv. 34, 3, 375--408. Google ScholarDigital Library
Feldman, Y. A. and Schneider, H. 1993. Simulating reactive systems by deduction. ACM Trans. Softw. Engin. Method. 2, 2, 128--175. Google ScholarDigital Library
Garg, S., Puliafito, A., Telek, M., and Trivedi, K. S. 1997. On the analysis of software rejuvenation policies. In Proceedings of the Annual Conference on Computer Assurance.Google Scholar
Gray, J. 1986. Why do computers stop and what can be done about it? In Proceedings of the 5th Symposium on Reliable Distributed Systems.Google Scholar
Gu, W., Kalbarczyk, Z., Iyer, R. K., and Yang, Z.-Y. 2003. Characterization of Linux kernel behavior under errors. In Proceedings of the International Conference on Dependable Systems and Networks.Google Scholar
Hallem, S., Chelf, B., Xie, Y., and Engler, D. 2002. A system and language for building system-specific, static analyses. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. 69--82. Google ScholarDigital Library
Hasting, R. and Joyce, B. 1992. Purify: Fast detection of memory leaks and access errors. In Proceedings of the USENIX Winter Technical Conference.Google Scholar
Heine, D. and Lam, M. 2003. A practical flow-sensitive and context-sensitive c and c++ memory leak detector. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. Google ScholarDigital Library
Huang, Y., Kintala, C., Kolettis, N., and Fulton, N. D. 1995. Software rejuvenation: Analysis, module and applications. In Proceedings of the 25th Annual International Symposium on Fault-Tolerant Computing. Google ScholarDigital Library
Johnson, D. and Zwaenepoel, W. 1988. Recovery in distributed systems using optimistic message logging and checkpointing. In Proceedings of the 7th Annual ACM Symposium on Principles of Distributed Computing. Google ScholarDigital Library
Johnson, D. B. and Zwaenepoel, W. 1990. Recovery in distributed systems using optimistic message logging and check-pointing. J. Algor. 11, 3, 462--491. Google ScholarDigital Library
Li, K., Naughton, J., and Plank, J. 1990. Concurrent real-time checkpoint for parallel programs. In Proceedings of the 2nd ACM SIGPLAN Symposium on Princiles & Practice of Parallel Programming. Google ScholarDigital Library
Li, Z., Tan, L., Wang, X., Lu, S., Zhou, Y., and Zhai, C. 2006. Have things changed now?---an empirical study of bug characteristics in modern open source software. In Procedings of the 1st Workshop on Architecture and System Support for Improving Software Dependability. Google ScholarDigital Library
Lowell, D. E., Chandra, S., and Chen, P. M. 2000. Exploring failure transparency and the limits of generic recovery. In Proceedings of the 4th Symposium on Operating System Design and Implementation. Google ScholarDigital Library
Lowell, D. E. and Chen, P. M. 1997. Free transactions with rio vista. In Proceedings of the 16th Symposium on Operating Systems Principles. Google ScholarDigital Library
Lowell, D. E. and Chen, P. M. 1998. Discount checking: Transparent, low-overhead recovery for general applications. Tech. rep., CSE-TR-410-99, University of Michigan.Google Scholar
Luotonen, A. and Altis, K. 1994. World-wide web proxies. In Proceedings of the 1st International WWW Conference. Google ScholarDigital Library
Mosberger, D. and Jin, T. 1998. httperf---a tool for measuring web server performance. SIGMETRICS Perform. Eval. Rev. 26, 3, 31--37. Google ScholarDigital Library
Patterson, D., Brown, A., Broadwell, P., Candea, G., Chen, M., Cutler, J., Enriquez, P., Fox, A., Kiciman, E., Merzbacher, M., Oppenheimer, D., Sastry, N., Tetzlaff, W., Traupman, J., and Treuhaft, N. 2002. Recovery oriented computing (ROC): Motivation, definition, techniques, and case studies. Tech. rep., UCB//CSD-02-1175, University of California, Berkeley. Google ScholarDigital Library
Plank, J. S., Li, K., and Puening, M. A. 1998. Diskless checkpointing. IEEE Transactions on Parallel and Distrib. Syst. 9, 10, 972--986. Google ScholarDigital Library
Prvulovic, M. and Torrellas, J. 2003. ReEnact: Using thread-level speculation mechanisms to debug data races in multithreaded codes. In Proceedings of the 30th International Symposium on Computer Architecture. Google ScholarDigital Library
Qin, F., Lu, S., and Zhou, Y. 2005. Safemem: Exploiting ECC-memory for detecting memory leaks and memory corruption during production runs. In Proceedings of the 11th International Symposium on High-Performance Computer Architecture. Google ScholarDigital Library
Randell, B. 1975. System structure for software fault tolerance. IEEE Trans. Softw. Engin. 1, 2, 220--232.Google ScholarDigital Library
Randell, B., Lee, P. A., and Treleaven, P. C. 1978. Reliability issues in computing system design. ACM Comput. Surv. 10, 2, 123--165. Google ScholarDigital Library
Rinard, M., Cadar, C., Dumitran, D., Roy, D. M., Leu, T., and Beebee, Jr., W. S. 2004. Enhancing server availability and security through failure-oblivious computing. In Proceedings of the 6th Symposium on Operating System Design and Implementation. Google ScholarDigital Library
Rocco, E. and Igou, B. 2001. The critical role of high-availability infrastructure support services.Google Scholar
Rodrigues, R., Castro, M., and Liskov, B. 2001. BASE: Using abstraction to improve fault tolerance. In Proceedings of the 18th Symposium on Operating Systems Principles. Google ScholarDigital Library
Russinovich, M. and Cogswell, B. 1996. Replay for concurrent non-deterministic shared-memory applications. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation. Google ScholarDigital Library
Santry, D. S., Feeley, M. J., Hutchinson, N. C., Veitch, A. C., Carton, R. W., and Ofir, J. 1999. Deciding when to forget in the Elephant file system. In Proceedings of the 17th ACM Symposium on Operating System Principles. Google ScholarDigital Library
Savage, S., Burrows, M., Nelson, G., Sobalvarro, P., and Anderson, T. 1997. Eraser: A dynamic data race detector for multithreaded programs. ACM Trans. Comput. Syst. 15, 4, 391--411. Google ScholarDigital Library
Scott, D. 1998. Assessing the costs of application downtime. Gartner Group.Google Scholar
Sidiroglou, S., Locasto, M. E., Boyd, S. W., and Keromytis, A. D. 2005. Building a reactive immune system for software services. In Proceedings of the USENIX Annual Technical Conference. Google ScholarDigital Library
Srinivasan, S., Andrews, C., Kandula, S., and Zhou, Y. 2004. Flashback: A light-weight extension for rollback and deterministic replay for software debugging. In Proceedings of the USENIX Annual Technical Conference. Google ScholarDigital Library
Staniford, S., Paxson, V., and Weaver, N. 2002. How to own the internet in your spare time. In Proceedings of the 11th USENIX Security Symposium. Google ScholarDigital Library
Stoller, S. D. 2002. Testing concurrent Java programs using randomized scheduling. In Proceedings of the 2nd Workshop on Runtime Verification.Google ScholarCross Ref
Strom, R. and Yemini, S. 1985. Optimistic recovery in distributed systems. ACM Trans. Comput. Syst. 3, 3, 204--226. Google ScholarDigital Library
Sullivan, M. and Chillarege, R. 1991. Software defects and their impact on system availability---A study of field failures in operating systems. In Proceedings of the 21th Annual International Symposium on Fault-Tolerant Computing.Google Scholar
Sullivan, M. and Chillarege, R. 1992. A comparison of software defects in database management systems and operating systems. In Proceedings of the 22th Annual International Symposium on Fault-Tolerant Computing.Google Scholar
Swift, M. M., Annamalai, M., Bershad, B. N., and Levy, H. M. 2004. Recovering device drivers. In Proceedings of the 6th Symposium on Operating System Design and Implementation. Google ScholarDigital Library
Trent, G. and Sake, M. 1995. Webstone: The first generation in http server benchmarking. http://www.cs.virginia.edu/~zaher/classes/cs851/papers/webstone.Google Scholar
Vogels, W., Dumitriu, D., Agrawal, A., Chia, T., and Guo, K. 1998. Scalability of the Microsoft cluster service. In Proceedings of the 2nd USENIX Windows NT Symposium. Google ScholarDigital Library
Vogels, W., Dumitriu, D., Birman, K., Gamache, R., Massa, M., Short, R., Vert, J., Barrera, J., and Gray, J. 1998. The design and architecture of the Microsoft cluster service. In Proceedings of the 28th Annual International Symposium on Fault-Tolerant Computing. Google ScholarDigital Library
Wang, Y.-M., Huang, Y., and Fuchs, W. K. 1993. Progressive retry for software error recovery in distributed systems. In Proceedings of the 23rd Annual International Symposium on Fault-Tolerant Computing.Google Scholar
Wang, Y.-M., Huang, Y., Vo, K.-P., Chung, P.-Y., and Kintala, C. 1995a. Checkpointing and its applications. In Proceedings of the 25th Annual International Symposium on Fault-Tolerant Computing. Google ScholarDigital Library
Wang, Y.-M., Huang, Y., Vo, K.-P., Chung, P.-Y., and Kintala, C. M. R. 1995b. Checkpointing and its applications. In Proceedings of the 25th Annual International Symposium on Fault-Tolerant Computing. Google ScholarDigital Library
Zhou, Y., Chen, P. M., and Li, K. 1999. Fast cluster failover using virtual memory-mapped communication. In Proceedings of the ACM International Conference on Supercomputing. Google ScholarDigital Library

Index Terms

Rx: Treating bugs as allergies—a safe method to survive software failures
1. General and reference
  1. Cross-computing tools and techniques
    1. Reliability
2. Software and its engineering
  1. Software organization and properties
    1. Extra-functional properties
      1. Software reliability

Recommendations

Rx: treating bugs as allergies---a safe method to survive software failures
SOSP '05: Proceedings of the twentieth ACM symposium on Operating systems principles

Many applications demand availability. Unfortunately, software failures greatly reduce system availability. Prior work on surviving software failures suffers from one or more of the following limitations: Required application restructuring, inability to ...
Read More
Rx: treating bugs as allergies---a safe method to survive software failures
SOSP '05

Many applications demand availability. Unfortunately, software failures greatly reduce system availability. Prior work on surviving software failures suffers from one or more of the following limitations: Required application restructuring, inability to ...
Read More
Modeling and analysis of software aging and software failure

Many studies reported that system suffered from outages more due to software faults than hardware faults. Recently, the phenomenon of "software aging", which was caused by aging-related faults, is observed in many software systems. Software aging, ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in

ACM Transactions on Computer Systems Volume 25, Issue 3
August 2007
121 pages
ISSN:0734-2071
EISSN:1557-7333
DOI:10.1145/1275517
Issue’s Table of Contents

Copyright © 2007 ACM
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 1 August 2007
Published in tocs Volume 25, Issue 3

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Availability
Bug
Reliability
Software Failure
Qualifiers
- article
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 47
  Total Citations
  View Citations
- 1,553
  Total Downloads
- Downloads (Last 12 months)14
- Downloads (Last 6 weeks)1
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Rx: Treating bugs as allergies—a safe method to survive software failures

ACM Transactions on Computer Systems

Abstract

References

Cited By

Index Terms

Recommendations

Rx: treating bugs as allergies---a safe method to survive software failures

Rx: treating bugs as allergies---a safe method to survive software failures

Modeling and analysis of software aging and software failure

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Rx: Treating bugs as allergies—a safe method to survive software failures

ACM Transactions on Computer Systems

Abstract

References

Cited By

Index Terms

Recommendations

Rx: treating bugs as allergies---a safe method to survive software failures

Rx: treating bugs as allergies---a safe method to survive software failures

Modeling and analysis of software aging and software failure

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media