Recovery management in QuickSilver

Authors:
R. Haskin, IBM Almaden Research Center, San Jose, CA
Y. Malachi, IBM Almaden Research Center, San Jose, CA
W. Sawdon, IBM Almaden Research Center, San Jose, CA
G. Chan, IBM Almaden Research Center, San Jose, CA

ACM SIGOPS Operating Systems Review, Volume 21, Issue 5, Nov. 1987, pp. 107–108, https://doi.org/10.1145/37499.37512

Published: 01 November 1987

Abstract

One price of extensibility and distribution, as implemented in QuickSilver, is a more complicated set of failure modes, and the consequent necessity of dealing with them. In traditional operating systems, services (e.g., file, display) are intrinsic pieces of the kernel. Process state is maintained in kernel tables, and the kernel contains explicit cleanup code (e.g., to close files, reclaim memory, and get rid of process images after hardware or software failures). QuickSilver, however, is structured according to the client-server model, and as in many systems of its type, system services are implemented by user-level processes that maintain a substantial amount of client process state. Examples of this state are the open files, screen windows, address space, etc., belonging to a process. Failure resilience in such an environment requires that clients and servers be aware of problems involving each other. Examples of the way one would like the system to behave include having files closed and windows removed from the screen when a client terminates, and having clients see bad return codes (rather than hanging) when a file server crashes. This motivates a number of design goals:

Properly written programs (especially servers) should be resilient to external process and machine failures, and should be able to recover all resources associated with failed entities.
Server processes should contain their own recovery code. The kernel should not make any distinction between system service processes and normal application processes.
To avoid the proliferation of ad-hoc recovery mechanisms, there should be a uniform system-wide architecture for recovery management.
A client may invoke several independent servers to perform a set of logically related activities (a unit of work) that must execute atomically in the presence of failures: either all of the related activities occur or none of them do. The recovery mechanism should support this.
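
To make the last goal concrete, here is a minimal sketch of an all-or-nothing unit of work spread over two independent servers. Everything in it is an assumption made for illustration: the `Server` class, its method names, and the `run_unit_of_work` helper are hypothetical, and the vote-then-decide scheme is a common textbook technique, not something the abstract attributes to QuickSilver.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    """Hypothetical server that holds per-unit-of-work client state."""
    name: str
    healthy: bool = True
    pending: dict = field(default_factory=dict)    # unit id -> tentative work
    committed: list = field(default_factory=list)  # work that took effect

    def do_work(self, uid, item):
        # Record the work tentatively; nothing is visible yet.
        self.pending.setdefault(uid, []).append(item)

    def prepare(self, uid):
        # Vote yes only if this server can guarantee the commit.
        return self.healthy

    def commit(self, uid):
        self.committed.extend(self.pending.pop(uid, []))

    def abort(self, uid):
        # Discard all tentative state held for the failed unit of work.
        self.pending.pop(uid, None)


def run_unit_of_work(uid, servers, work):
    """Apply work[name] at each server, then commit everywhere or nowhere."""
    for s in servers:
        s.do_work(uid, work[s.name])
    if all(s.prepare(uid) for s in servers):
        for s in servers:
            s.commit(uid)
        return True
    for s in servers:
        s.abort(uid)
    return False
```

If either server votes no, both discard their tentative state, so a file is never left open for a window that was never created. The same `abort` path is what lets a server reclaim resources held on behalf of a failed client.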

In QuickSilver, recovery is based on the database notion of atomic transactions, which are made available as a system service to be used by other, higher-level servers. This allows meeting all the above design goals. Software portability is important in the QuickSilver environment, dictating that transaction-based recovery be accessible to conventional programming languages rather than a special-purpose one such as Argus [Liskov84]. To accommodate servers with diverse recovery demands, the low-level primitives of commit coordination and log recovery are exposed directly rather than building recovery on top of a stable-storage mechanism such as in CPR [Attanasio87] or recoverable objects such as those in Camelot [Spector87] or Clouds [Allchin&McKendry83].
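
As a rough illustration of what a log recovery primitive offers a server, the sketch below rebuilds a server's state after a crash by replaying only the updates of committed transactions. The record layout `(tid, kind, payload)` and the `recover` function are hypothetical, and redo-only replay is one simple policy chosen for brevity; the abstract does not describe QuickSilver's log format or recovery algorithm.

```python
def recover(log):
    """Rebuild state from a log, redoing only committed transactions.

    `log` is a list of (tid, kind, payload) tuples, where kind is
    "update", "commit", or "abort". This layout is an assumption.
    """
    # First pass: find which transactions reached a commit record.
    committed = {tid for tid, kind, _ in log if kind == "commit"}

    # Second pass: redo committed updates in their original log order.
    state = {}
    for tid, kind, payload in log:
        if kind == "update" and tid in committed:
            key, value = payload
            state[key] = value
    return state
```

Updates from a transaction with no commit record are simply skipped, which gives the server the all-or-nothing behavior across a crash without any recovery code that is specific to a particular client.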

References

Allchin & McKendry 83 Allchin, J. E., McKendry, M. S., Synchronization and recovery of actions, Proceedings of the Second ACM Symposium on Principles of Distributed Computing (August 1983) pp. 31-44.
Attanasio 87 Attanasio, C. R., CPR supervisor support for relational database facility, IBM Technical Report RC 12416 (January 1987).
Liskov 84 Liskov, B., Overview of the Argus language and system, MIT Laboratory for Computer Science (February 1984).
Spector 87 Spector, A., et al., Camelot: A distributed transaction facility for Mach and the Internet - An Interim Report, CMU Technical Report CMU-CS-87-129 (June 1987).
Moss 85 Moss, E. B., Nested Transactions: An Approach to Reliable Distributed Computing, MIT Press (1985).
Obermarck 82 Obermarck, R., Distributed deadlock detection algorithm, ACM Transactions on Database Systems, Volume 7, Number 2 (June 1982) pp. 187-208.

Index Terms

Recovery management in QuickSilver
1. General and reference
  1. Cross-computing tools and techniques
    1. Reliability
2. Software and its engineering
  1. Software organization and properties
    1. Extra-functional properties
      1. Software reliability
    2. Software system structures
      1. Distributed systems organizing principles

Published in
ACM SIGOPS Operating Systems Review Volume 21, Issue 5
Nov. 1987
162 pages
ISSN:0163-5980
DOI:10.1145/37499
Editor:
William M. Waite
SOSP '87: Proceedings of the eleventh ACM Symposium on Operating systems principles
November 1987
162 pages
ISBN:089791242X
DOI:10.1145/41457
Chairman:
Les Belady
Copyright © 1987 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Qualifiers
- article