Abstract
One price of extensibility and distribution, as implemented in QuickSilver, is a more complicated set of failure modes, and the consequent necessity of dealing with them. In traditional operating systems, services (e.g., file, display) are intrinsic pieces of the kernel. Process state is maintained in kernel tables, and the kernel contains explicit cleanup code (e.g., to close files, reclaim memory, and get rid of process images after hardware or software failures). QuickSilver, however, is structured according to the client-server model, and as in many systems of its type, system services are implemented by user-level processes that maintain a substantial amount of client process state. Examples of this state are the open files, screen windows, address space, etc., belonging to a process. Failure resilience in such an environment requires that clients and servers be aware of problems involving each other. Examples of the way one would like the system to behave include having files closed and windows removed from the screen when a client terminates, and having clients see bad return codes (rather than hanging) when a file server crashes. This motivates a number of design goals:
- Properly written programs (especially servers) should be resilient to external process and machine failures, and should be able to recover all resources associated with failed entities.
- Server processes should contain their own recovery code. The kernel should not make any distinction between system service processes and normal application processes.
- To avoid the proliferation of ad-hoc recovery mechanisms, there should be a uniform system-wide architecture for recovery management.
- A client may invoke several independent servers to perform a set of logically related activities (a unit of work) that must execute atomically in the presence of failures; that is, either all the related activities should occur or none of them should. The recovery mechanism should support this.
In QuickSilver, recovery is based on the database notion of atomic transactions, which are made available as a system service to be used by other, higher-level servers. This allows meeting all the above design goals. Software portability is important in the QuickSilver environment, dictating that transaction-based recovery be accessible to conventional programming languages rather than a special-purpose one such as Argus [Liskov84]. To accommodate servers with diverse recovery demands, the low-level primitives of commit coordination and log recovery are exposed directly rather than building recovery on top of a stable-storage mechanism such as in CPR [Attanasio87] or recoverable objects such as those in Camelot [Spector87] or Clouds [Allchin&McKendry83].
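The all-or-nothing behavior of a unit of work that spans several servers can be illustrated with a generic two-phase commit. The sketch below is not QuickSilver's interface or protocol, which this abstract does not describe; `Server`, `run_unit_of_work`, and the `fail_on_prepare` flag are hypothetical names, and the state is kept in memory with no log, so it shows only the coordination step, not log recovery.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Server:
    """Hypothetical participant that stages work until told to commit or abort."""
    name: str
    fail_on_prepare: bool = False                   # simulate a server that cannot commit
    staged: dict = field(default_factory=dict)      # transaction id -> pending updates
    committed: list = field(default_factory=list)   # updates made permanent

    def do_work(self, tid, update):
        # Record the update tentatively; nothing is visible yet.
        self.staged.setdefault(tid, []).append(update)

    def prepare(self, tid):
        # Phase 1 vote: True means "able to commit", False forces an abort.
        return not self.fail_on_prepare

    def commit(self, tid):
        # Phase 2: make the staged updates permanent.
        self.committed.extend(self.staged.pop(tid, []))

    def abort(self, tid):
        # Phase 2: discard the staged updates.
        self.staged.pop(tid, None)


def run_unit_of_work(tid, steps):
    """Run (server, update) steps atomically. Returns True if committed."""
    participants = []
    for server, update in steps:
        server.do_work(tid, update)
        if server not in participants:
            participants.append(server)

    # Phase 1: collect votes. Phase 2: commit everywhere or abort everywhere.
    if all(s.prepare(tid) for s in participants):
        for s in participants:
            s.commit(tid)
        return True
    for s in participants:
        s.abort(tid)
    return False
```

With a file server and a window server as participants, a single "no" vote from either one discards the staged work on both, which is the failure atomicity the last design goal asks for. A real implementation would also have to log votes and outcomes so that the decision survives a crash of the coordinator or of a participant.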
- [Allchin&McKendry83] Allchin, J. E., McKendry, M. S., Synchronization and recovery of actions, Proceedings of the Second ACM Symposium on Principles of Distributed Computing (August 1983) pp. 31-44.
- [Attanasio87] Attanasio, C. R., CPR supervisor support for relational database facility, IBM Technical Report RC 12416 (January 1987).
- [Liskov84] Liskov, B., Overview of the Argus language and system, MIT Laboratory for Computer Science (February 1984).
- [Spector87] Spector, A., et al., Camelot: A distributed transaction facility for Mach and the Internet - An Interim Report, CMU Technical Report CMU-CS-87-129 (June 1987).
- [Moss85] Moss, E. B., Nested Transactions: an Approach to Reliable Distributed Computing, MIT Press (1985).
- [Obermarck82] Obermarck, R., Distributed deadlock detection algorithm, ACM Transactions on Database Systems Volume 7, Number 2 (June 1982) pp. 187-208.
Recovery management in QuickSilver
This paper describes QuickSilver, developed at the IBM Almaden Research Center, which uses atomic transactions as a unified failure recovery mechanism for a client-server structured distributed system. Transactions allow failure atomicity for related ...
SOSP '87: Proceedings of the eleventh ACM Symposium on Operating Systems Principles