Recovery management in QuickSilver

Authors:
Roger Haskin, Yoni Malachi, Gregory Chan

IBM Almaden Research Center, San Jose, CA

ACM Transactions on Computer Systems, Volume 6, Issue 1, pp. 82–108
https://doi.org/10.1145/35037.35060

Published: 01 February 1988

Abstract

This paper describes QuickSilver, developed at the IBM Almaden Research Center, which uses atomic transactions as a unified failure recovery mechanism for a client-server structured distributed system. Transactions allow failure atomicity for related activities at a single server or at a number of independent servers. Rather than bundling transaction management into a dedicated language or recoverable object manager, QuickSilver exposes the basic commit protocol and log recovery primitives, allowing clients and servers to tailor their recovery techniques to their specific needs. Servers can implement their own log recovery protocols rather than being required to use a system-defined protocol. These decisions allow servers to make their own choices to balance simplicity, efficiency, and recoverability.
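
To make the idea of an exposed commit protocol concrete, the following is a minimal sketch of a generic two-phase commit in which each server keeps its own log. It is not QuickSilver's interface: the abstract does not specify the protocol's details, and every name here (`Participant`, `LoggingServer`, `run_two_phase_commit`, the log record strings) is a hypothetical chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Protocol


class Participant(Protocol):
    """Hypothetical interface a server implements to take part in commit."""

    def prepare(self, tid: int) -> bool: ...
    def commit(self, tid: int) -> None: ...
    def abort(self, tid: int) -> None: ...


@dataclass
class LoggingServer:
    """Illustrative server that keeps its own recovery log (a plain list here)."""

    name: str
    can_commit: bool = True
    log: List[str] = field(default_factory=list)

    def prepare(self, tid: int) -> bool:
        # Record the vote before answering, as a recoverable server would.
        self.log.append(f"{tid}:{'prepared' if self.can_commit else 'refused'}")
        return self.can_commit

    def commit(self, tid: int) -> None:
        self.log.append(f"{tid}:committed")

    def abort(self, tid: int) -> None:
        self.log.append(f"{tid}:aborted")


def run_two_phase_commit(tid: int, participants: List[Participant]) -> bool:
    """Commit only if every participant votes yes; otherwise abort all."""
    votes = [p.prepare(tid) for p in participants]  # phase 1: collect votes
    decision = all(votes)
    for p in participants:                          # phase 2: broadcast outcome
        if decision:
            p.commit(tid)
        else:
            p.abort(tid)
    return decision
```

Because the coordinator depends only on the three-method interface, a server with cheaper recovery needs could implement `prepare`, `commit`, and `abort` without writing any log at all, which is the kind of per-server choice the abstract describes.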

Reviews

Reviewer: Jason Gait

QuickSilver is a network operating system for IBM workstations connected by a token ring. QuickSilver provides system services as user-level processes that maintain client states. Servers are resilient to external failure and can recover resources associated with failed clients. The commit protocol and log recovery primitives are available to applications, so servers can tailor recovery techniques to their requirements, trading off simplicity and efficiency against recoverability.

The authors have adopted a high-overhead transaction mechanism in QuickSilver, but with the policy of using it only when necessary. To this end, servers are divided into four types: those that have volatile internal states and only require signaling capability, such as the window manager; those that manage replicated volatile states and use transaction commit for atomicity, like the name server; those that manage recoverable states and require a full panoply of recovery mechanisms, like the file server; and those that manipulate long-lived states and require log service for checkpointing. Only those that manage recoverable states are truly expensive in QuickSilver. Transaction overhead is further reduced by providing alternative commit protocols to servers, so servers can choose how much to pay for recovery.

Interprocess communication (IPC) addresses in QuickSilver are evidently site-dependent (contrary to the authors' statement in Section 2.1), so IPC is location sensitive. Thus services (except for transaction management) are bound to nodes, migration is expensive, and load balancing (usually a fundamental rationale for a network operating system) is probably impractical. The QuickSilver IPC mechanism is heavily loaded, with responsibility for guaranteeing delivery and message ordering, for enforcing security constraints, and for maintaining transaction connectivity graphs. These overheads slow down processes that do not require the benefits provided and to some extent defeat the authors' goal of paying optional overhead for optional services.

The paper contains a comprehensive review of possible approaches and a wide-ranging survey of the distributed operating system literature.
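
The four server types in the review can be written down as a small lookup table. This is only a sketch of the reviewer's classification: the enum, the mechanism labels (`"signaling"`, `"commit"`, `"log"`), and the reading of "a full panoply of recovery mechanisms" as all three labels are assumptions made for illustration, not names taken from the paper.

```python
from enum import Enum
from typing import Dict, FrozenSet


class ServerClass(Enum):
    """The four server types named in the review (identifiers are hypothetical)."""

    VOLATILE = "volatile internal state"               # e.g. window manager
    REPLICATED_VOLATILE = "replicated volatile state"  # e.g. name server
    RECOVERABLE = "recoverable state"                  # e.g. file server
    LONG_LIVED = "long-lived state"


# Mechanisms each class relies on, following the review's summary.
# Treating "full panoply" as all three mechanisms is an assumption.
REQUIRED: Dict[ServerClass, FrozenSet[str]] = {
    ServerClass.VOLATILE: frozenset({"signaling"}),
    ServerClass.REPLICATED_VOLATILE: frozenset({"commit"}),
    ServerClass.RECOVERABLE: frozenset({"signaling", "commit", "log"}),
    ServerClass.LONG_LIVED: frozenset({"log"}),
}


def needs(server_class: ServerClass, mechanism: str) -> bool:
    """Return True if servers of this class use the given mechanism."""
    return mechanism in REQUIRED[server_class]
```

A table like this makes the "pay only for what you use" policy explicit: a window-manager-style server never touches the log, while only the recoverable class pulls in every mechanism.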

Published in

ACM Transactions on Computer Systems Volume 6, Issue 1
Feb. 1988
152 pages
ISSN:0734-2071
EISSN:1557-7333
DOI:10.1145/35037

Copyright © 1988 ACM
Publisher
Association for Computing Machinery
New York, NY, United States