Recovery management in QuickSilver

Published: 01 February 1988

Abstract

This paper describes QuickSilver, developed at the IBM Almaden Research Center, which uses atomic transactions as a unified failure recovery mechanism for a client-server structured distributed system. Transactions allow failure atomicity for related activities at a single server or at a number of independent servers. Rather than bundling transaction management into a dedicated language or recoverable object manager, QuickSilver exposes the basic commit protocol and log recovery primitives, allowing clients and servers to tailor their recovery techniques to their specific needs. Servers can implement their own log recovery protocols rather than being required to use a system-defined protocol. These decisions allow servers to make their own choices to balance simplicity, efficiency, and recoverability.


Reviews

Jason Gait

QuickSilver is a network operating system for IBM workstations connected by a token ring. QuickSilver provides system services as user-level processes that maintain client states. Servers are resilient to external failure and can recover resources associated with failed clients. The commit protocol and log recovery primitives are available to applications so servers can tailor recovery techniques to requirements, trading off simplicity and efficiency against recoverability. The authors have adopted a high-overhead transaction mechanism in QuickSilver, but with the policy of using it only when necessary. To this end, servers are divided into four types: those that have volatile internal states and only require signaling capability, such as the window manager; those that manage replicated volatile states and use transaction commit for atomicity, like the name server; those that manage recoverable states and require a full panoply of recovery mechanisms, like the file server; and those that manipulate long-lived states and require log service for checkpointing. Only those that manage recoverable states are truly expensive in QuickSilver. Transaction overhead is further reduced by providing alternative commit protocols to servers, so servers can choose how much to pay for recovery.

Interprocess communication (IPC) addresses in QuickSilver are evidently site-dependent (contrary to the authors' statement in section 2.1), so IPC is location sensitive. Thus services (except for transaction management) are bound to nodes, migration is expensive, and load balancing (usually a fundamental rationale for a network operating system) is probably impractical. The QuickSilver IPC mechanism is heavily loaded, with responsibility for guaranteeing delivery and message ordering, for enforcing security constraints, and for maintaining transaction connectivity graphs. These overheads slow down processes that do not require the benefits provided and to some extent defeat the authors' goal of paying optional overhead for optional services. The paper contains a comprehensive review of possible approaches and a wide-ranging survey of the distributed operating system literature.
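
The idea that servers choose how much to pay for recovery can be illustrated with a small sketch. This is not QuickSilver's interface: the names `Participation`, `Server`, and `run_transaction` are hypothetical, and the code models only three of the four server classes mentioned in the review (notification only, voting without a log, and voting with a log) using a generic two-phase commit.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class Participation(Enum):
    # Hypothetical participation levels, loosely modeled on the review's server classes.
    NOTIFY_ONLY = "notify_only"      # volatile state: only told the outcome
    VOTING = "voting"                # votes in the commit protocol, keeps no log
    VOTING_LOGGED = "voting_logged"  # votes and writes log records for recovery


@dataclass
class Server:
    name: str
    participation: Participation
    can_commit: bool = True
    log: list = field(default_factory=list)
    outcome: Optional[str] = None

    def prepare(self, tid: int) -> bool:
        """Phase 1: return this server's vote; log it only if the server is recoverable."""
        if self.participation is Participation.VOTING_LOGGED:
            self.log.append(("prepared", tid))
        return self.can_commit

    def finish(self, tid: int, outcome: str) -> None:
        """Phase 2: learn the outcome; log it only if the server is recoverable."""
        if self.participation is Participation.VOTING_LOGGED:
            self.log.append((outcome, tid))
        self.outcome = outcome


def run_transaction(tid: int, servers: list) -> str:
    """Generic two-phase commit: only voting servers are asked to prepare."""
    voters = [s for s in servers if s.participation is not Participation.NOTIFY_ONLY]
    votes = [s.prepare(tid) for s in voters]
    outcome = "committed" if all(votes) else "aborted"
    for s in servers:  # every server, including notify-only ones, hears the outcome
        s.finish(tid, outcome)
    return outcome
```

In this sketch only the logged server pays for log writes, while the notify-only server does no work until the outcome is announced. That mirrors, in a very simplified form, the policy of using the expensive mechanism only where it is needed.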


Published in

ACM Transactions on Computer Systems, Volume 6, Issue 1
Feb. 1988
152 pages
ISSN: 0734-2071
EISSN: 1557-7333
DOI: 10.1145/35037

Copyright © 1988 ACM

Publisher

Association for Computing Machinery
New York, NY, United States
