ABSTRACT
We describe a strategy for enabling existing commodity operating systems to recover from unexpected run-time errors in nearly any part of the kernel, including core kernel components. Our approach is dynamic and request-oriented: it isolates the effects of a fault to the requests that caused the fault rather than to static kernel components. The approach is based on the notion of "recovery domains," an organizing principle that enables rollback of the state affected by a request in a multithreaded system with minimal impact on other requests or threads. We have applied this approach to v2.4.22 and v2.6.27 of the Linux kernel. It required 132 lines of changed or new code; all other changes are performed by a simple compiler instrumentation pass. Our experiments show that the approach can recover from otherwise fatal faults with minimal collateral impact during a recovery event.
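The idea of rolling back only the state touched by a faulting request can be illustrated with a toy, single-threaded sketch. This is not the paper's implementation. The names `RecoveryDomain`, `write`, `rollback`, and `run_request` are hypothetical, a Python dictionary stands in for kernel state, and an explicit `write` call stands in for a store that a compiler pass might instrument. The sketch ignores concurrency and dependencies between requests.

```python
class RecoveryDomain:
    """Hypothetical per-request undo log (illustrative only)."""

    def __init__(self, request_id):
        self.request_id = request_id
        # Each entry: (state dict, key, whether key existed, old value).
        self.undo_log = []

    def write(self, state, key, value):
        # Record the old value before overwriting it, as an
        # instrumented store might do.
        self.undo_log.append((state, key, key in state, state.get(key)))
        state[key] = value

    def rollback(self):
        # Undo writes in reverse order so the earliest value wins.
        while self.undo_log:
            state, key, had_key, old_value = self.undo_log.pop()
            if had_key:
                state[key] = old_value
            else:
                state.pop(key, None)


def run_request(request_id, handler, state):
    """Run one request; on a fault, undo only that request's writes."""
    domain = RecoveryDomain(request_id)
    try:
        handler(domain, state)
        return True
    except Exception:
        domain.rollback()
        return False


shared = {"open_files": 0}


def good(domain, state):
    domain.write(state, "open_files", state["open_files"] + 1)


def faulty(domain, state):
    domain.write(state, "open_files", state["open_files"] + 1)
    domain.write(state, "scratch", "partial")
    raise RuntimeError("simulated fault")


ok_good = run_request(1, good, shared)
ok_faulty = run_request(2, faulty, shared)
print(ok_good, ok_faulty, shared)
```

In this sketch the effects of the first request survive, while both writes made by the faulting request are undone, so the fault is contained to the request that caused it.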
Recovery domains: an organizing principle for recoverable operating systems
ASPLOS 2009