skip to main content
article
Free Access

Resourceful systems for fault tolerance, reliability, and safety

Published:01 March 1990Publication History
Skip Abstract Section

Abstract

Above all, it is vital to recognize that completely guaranteed behavior is impossible and that there are inherent risks in relying on computer systems in critical environments. The unforeseen consequences are often the most disastrous [Neumann 1986].

Section 1 of this survey reviews the current state of the art of system reliability, safety, and fault tolerance. The emphasis is on the contribution of software to these areas. Section 2 reviews current approaches to software fault tolerance. It discusses why some of the assumptions underlying hardware fault tolerance do not hold for software. It argues that the current software fault tolerance techniques are more accurately thought of as delayed debugging than as fault tolerance. It goes on to show that in providing both backtracking and executable specifications, logic programming offers most of the tools currently used in software fault tolerance. Section 3 presents a generalization of the recovery block approach to software fault tolerance, called resourceful systems. Systems are resourceful if they are able to determine whether they have achieved their goals or, if not, to develop and carry out alternate plans. Section 3 develops an approach to designing resourceful systems based upon a functionally rich architecture and an explicit goal orientation.

References

  1. ABBOTT, R. J. 1986. An Integrated Approach to Software Development. John Wiley & Sons, New York. Google ScholarGoogle Scholar
  2. ABDEL-GHALY, A. A., CHAN, P. Y., AND LITTLEWOOD, B. 1986. Evaluation of competent software reliability predictions. IEEE Trans. Softw. Eng. SE-12, 9 (Sept.), 950-967. Google ScholarGoogle Scholar
  3. AGRE, P. E., AND CHAPMAN, D. 1987. Pengi: An implementation of a theory of activity. In Proceedings o/AAAI 87. American Association for Artificial Intelligence, pp. 268-272.Google ScholarGoogle Scholar
  4. ANDERSON, T. 1985. Resilient Computing Systems, vol. I. John Wiley & Sons, New York. Google ScholarGoogle Scholar
  5. ANDERSON, T., AND LEE, P. A. 1981. Fault Tolerance Principles and Practice. Prentice/Hall International, London. Google ScholarGoogle Scholar
  6. AVlZIENIS, A. 1985a. Fault-tolerant computer systems. Class notes. Univ. of California, Los Angeles.Google ScholarGoogle Scholar
  7. AVlZlENIS, A. 1985b. The N-version approach to fault-tolerant software. IEEE Trans. Softw. Eng. SE-11, 12 (Dec.), 1491-1501.Google ScholarGoogle Scholar
  8. AVIZlENIS, A., AND KELLY, N. J. 1984. Fault Tolerance by design diversity:Concepts and experiments. IEEE Comput. 17, 8 {Aug.), 67-80.Google ScholarGoogle Scholar
  9. AVIZIENIS, A., AND LAPRIE, J. C. 1986. Dependable computing: From concepts to design diversity. Proc. IEEE 74, 5 (May), 629-638.Google ScholarGoogle Scholar
  10. BARSTOW, D. R. 1985. Domain-specific automatic programming. IEEE Trans. Softw. Eng. SE-11, 11 (Nov.), 1321-1336. Google ScholarGoogle Scholar
  11. BASTANI, F. B., AND YEN, I. L. 1985. Analysis of an inherently fault tolerant program. In Proceedings of COMPSAC 85. IEEE Computer Society, pp. 428-436.Google ScholarGoogle Scholar
  12. BERLINER, H. 1980. Computer backgammon. Scientific American (June).Google ScholarGoogle Scholar
  13. BISHOP, P. G., EsP, D. G., AND BARNES, M., HUM- PHREYS, P., DAHLL, G., AND LAHTI, J. 1986. PODS: A project on diverse software. IEEE Trans. Softw. Eng. SE-12, 9 (Sept.), 929-940. Google ScholarGoogle Scholar
  14. BOLOGNA, S., AND LEVESON, N. G. 1986. Special issue on reliability and safety in real-time process control. IEEE Trans. Softw. Eng. SE-12, 9 (Sept.), 877-996.Google ScholarGoogle Scholar
  15. BOWEN, T. P., WIGLE, G. B., AND TSAI, J. T. 1985. Specification of software quality attributes. Tech. Rep. RADC-TR-85-37, Rome Air Development Center.Google ScholarGoogle Scholar
  16. BROOKS, F. P., JR. 1987. No silver bullet: Essence and accidents of software engineering. IEEE Comput. 20, 4 (Apr.), 10-19. Google ScholarGoogle Scholar
  17. CHA, S. D., KNIGHT, J. C., LEVESON, N. G., AND SHIMEALL, T. J. 1987. An empirical study of software error detection using self-checks. In Digest of Papers FTCS-17:17th Annual Symposium on Fault Tolerant Computing. IEEE Computer Society, (July 1987), pp. 156-161.Google ScholarGoogle Scholar
  18. CHAPMAN, D. 1985. Planning for conjunctive goals. Tech. Rep. TR 802, MIT Artificial Intelligence Laboratory. Google ScholarGoogle Scholar
  19. DALE, C. J. 1982. Software reliability evaluation methods. Tech. Rep. ST-26750, British Aerospace Dynamics Group.Google ScholarGoogle Scholar
  20. DENNING, P. J. 1976. Fault-tolerant operating systems. Comput. Surv. 8, 4 (Dec.), 359-389. Google ScholarGoogle Scholar
  21. FLOYD, R. W. 1967. Assigning meaning to programs. In Proceedings of the Symposia on Applied Mathematics (Providence, R.I.), vol. 19, American Mathematics Society, pp. 19-32.Google ScholarGoogle Scholar
  22. PREY, P. W. 1977. Chess Skill in Man and Machine. Springer-Verlag, Berlin. Google ScholarGoogle Scholar
  23. GEORGEFF, M. P., AND LANSKY, A. L. 1987. Reactive reasoning and planning. In Proceedings of AAAI 87. American Association for Artificial Intelligence, pp. 677-682.Google ScholarGoogle Scholar
  24. GILLEY, G. C. 1987. Architectural design methods of transient fault protection. In Proceedings of AIAA Computers in Aerospace VI Conference. AIAA, pp. 78-82.Google ScholarGoogle Scholar
  25. GOEL, A. L. 1985. Software reliability models: Assumptions, limitations, and applicability. IEEE Trans. Softw. Eng. S WE-11, 12 (Dec.), 1411-1423.Google ScholarGoogle Scholar
  26. GOEL, A. L., AND BASTANI, F. B. 1985. Special issue on software and reliability. IEEE Trans. Softw. Eng. SE-11, 1 (Dec.), 1490-1577.Google ScholarGoogle Scholar
  27. GOEL, A. L., AND BASTANI, F. B. 1986. Special issue on software reliability. IEEE Trans. Softw. Eng. SE-12, 12 (Jan.), 1-182. Google ScholarGoogle Scholar
  28. RAY, J. 1986. Why do computers stop and what can be done about it? In Proceedings of the 5th Symposium on Reliability and Distributed Software and Database Systems. IEEE Computer Society, pp. 3-12.Google ScholarGoogle Scholar
  29. HAMMING, R. W.1950. Error detecting and error correcting codes. Bell Syst. Tech. J. 26, 4 (Apr.), 147-160.Google ScholarGoogle Scholar
  30. HECHT, H. 1976. Fault-tolerant software for realtime applications. Comput. Surv. 8, 4 (Dec.), 391- 409. Google ScholarGoogle Scholar
  31. HORNING, j. J. ET AL. 1974. A program structure for error detection and recovery, in Lecture Notes in Computer Science 16, E. Gelenbe and C. Kaiser, Eds. Springer-Verlag, Berlin, pp. 171-187. Google ScholarGoogle Scholar
  32. JELINSKI, Z., AND MORANDA, P. B. 1972. Software reliability research. In Statistical Computer Performance Evaluation, Freiberger, Ed. Academic Press, New York, pp. 485-502.Google ScholarGoogle Scholar
  33. KEILLER, P. A., LITTLEWOOD, B., MILLER, D. R., AND SOFER, A. 1983. Comparison of software reliability predictions. In Digest of the 13th International Symposium on Fault-Tolerant Computing, IEEE Computer Society, pp. 128-143.Google ScholarGoogle Scholar
  34. KNIGHT, J. C., AND LEVESON, N. G. 1986. An experimental evaluation of the assumption of independence in multiversion programming. IEEE Trans. Softw. Eng. SE-12, 1 (Jan.), 96-109. Google ScholarGoogle Scholar
  35. KOVED, L., AND WALDBAUM, G. 1986. improving availability of software subsystems through online error detection. IBM Syst. J. 25, 1 (Jan.), 96-109.Google ScholarGoogle Scholar
  36. LEVESON, N. G. 1986. Software safety: Why, what and how. Computing Surveys 18, 2 (June), 125- 163. Google ScholarGoogle Scholar
  37. LINDEN, T. A. 1976. Operating system structures to support security and reliable software. Comput. Surv. 8, 4 {Dec.), 409-445. Google ScholarGoogle Scholar
  38. LITTLEWOOD, B., AND VERRLL, J. L. 1973. A Bayesian reliability growth model for computer software. J. Roy. Stat. Soc. C 22 (Sept.), 332-346.Google ScholarGoogle Scholar
  39. LUCKHAM, D., AND VON HENKE, E. W. 1984. An overview of Anna, a specification language for Ada. Tech. Rep. 84-265, Stanford Computer Systems Laboratory. Google ScholarGoogle Scholar
  40. MERRIAM-WEBSTER. 1987. Webster's Ninth New Collegiate Dictionary. Merriam-Webster, Inc., Springfield, Mass.Google ScholarGoogle Scholar
  41. MOSTOW, J. Ed. 1985. Special issue on artificial intelligence and software engineering. IEEE Trans. Softw. Eng. SE-11, 11 (Nov.), 1257-1351. Google ScholarGoogle Scholar
  42. MUSA, J. D. 1975. A theory of software reliability and its application. IEEE Trans. Softw. Eng., SE-1, 9 (Sept.), 312-327.Google ScholarGoogle Scholar
  43. NEUMANN, P. G. 1986. On hierarchical design of computer systems for critical applications. IEEE Trans. Softw. Eng. SE-12, 9 (Sept.), 905-920. Google ScholarGoogle Scholar
  44. NILSSON, N. J. 1980. Principles of Artificial Intelligence. Tioga, Palo Alto. Google ScholarGoogle Scholar
  45. PARTSCH, H., AND STEINBRUGGEN, R. 1983. Program transformation systems. A CM Comput. Surv. 15, 3 (Sept.), 199-236. Google ScholarGoogle Scholar
  46. RANDELL, B. 1977. System structuring for software fault tolerance. In Current Trends in Programming Methodology, R. T. Yeh, Ed. Prentice-Hall, Englewood Cliffs, N.J., pp. 195-219.Google ScholarGoogle Scholar
  47. RENNELS, D. A. 1984. Fault-tolerant computing-- Concepts and examples. IEEE Trans. Comput. C-33, 12 (Dec.), 1116-1129.Google ScholarGoogle Scholar
  48. SEVIORA, R. E. 1987. Knowledge-based program debugging systems. IEEE Softw. 4, 3 (May), 20-32.Google ScholarGoogle Scholar
  49. SHOOMAN, M. 1973. Operational testing and software reliability during program development. In Proceedings of the IEEE Symposium on Computer Software Reliability (New York), IEEE Computer Society, pp. 51-57.Google ScholarGoogle Scholar
  50. STERLING, L., AND SHAPIRO, E. 1986. The Art of Prolog. MIT Press, Cambridge, Mass.Google ScholarGoogle Scholar
  51. STOTT, C. B. 1987. Review of resilient computing systems: vol. I. IEEE Computer 20, 6 (June), 117- 118.Google ScholarGoogle Scholar
  52. STROM, R. E., AND YEMINI, S. 1986. Typestate: A programming language concept for enhancing software reliability. IEEE Trans. Softw. Eng. SE- 12, 1 (Jan.), 157-171. Google ScholarGoogle Scholar
  53. TAYLOR, J. R. 1982. An integrated approach to the treatment of design and specification errors in electronic systems and software. In Reliability in Electrical and Electronic Components and Systems, E. Lauger and J. Moltoft, Eds., North- Holland, Amsterdam.Google ScholarGoogle Scholar
  54. TAYLOR, D. J., AND BLACK, J. P. 1982. Principles of data structure error correction. IEEE Trans. Comput. C-31, 7 (July), 602-608.Google ScholarGoogle Scholar
  55. TAYLOR, D. J., AND SEGER, C.-J. H. 1986. Robust storage structures for crash recovery. IEEE Trans. Comput. C-35, 4 (Apr.), 288-295. Google ScholarGoogle Scholar
  56. TAYLOR, D. J., MORGAN, D. E., AND BLACK, J. P. 1980. Redundancy in data structures: Improving software fault tolerance. IEEE Trans. Softw. Eng. SE-6, 1 (Nov.), 585-594.Google ScholarGoogle Scholar

Recommendations

Reviews

Bruce M. McMillin

Abbott surveys various classical software techniques for ensuring system fault tolerance. He also presents the less widely known concept of “resourceful systems,” which adapt to changes in environment and preserve their functionality. The survey is mostly complete, but sketchy in places. Some details may have been omitted because of space limitations. The author does omit one important aspect of software fault tolerance—application-oriented techniques for parallel computing systems and systolic arrays. Fault tolerance for multiple-computation processors is a growing area of study. Abbott sometimes confuses program proofs and verification, which could confuse the neophyte. Also, the analogy between logic programming languages and the concept of Recovery Blocks stretches a point. All in all, the paper is a good introduction for a class or seminar on the subject.

Access critical reviews of Computing literature here

Become a reviewer for Computing Reviews.

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Sign in

Full Access

  • Published in

    cover image ACM Computing Surveys
    ACM Computing Surveys  Volume 22, Issue 1
    March 1990
    63 pages
    ISSN:0360-0300
    EISSN:1557-7341
    DOI:10.1145/78949
    Issue’s Table of Contents

    Copyright © 1990 ACM

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    • Published: 1 March 1990
    Published in csur Volume 22, Issue 1

    Permissions

    Request permissions about this article.

    Request Permissions

    Check for updates

    Qualifiers

    • article

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader