Abstract
Above all, it is vital to recognize that completely guaranteed behavior is impossible and that there are inherent risks in relying on computer systems in critical environments. The unforeseen consequences are often the most disastrous [Neumann 1986].
Section 1 of this survey reviews the current state of the art of system reliability, safety, and fault tolerance. The emphasis is on the contribution of software to these areas. Section 2 reviews current approaches to software fault tolerance. It discusses why some of the assumptions underlying hardware fault tolerance do not hold for software. It argues that the current software fault tolerance techniques are more accurately thought of as delayed debugging than as fault tolerance. It goes on to show that in providing both backtracking and executable specifications, logic programming offers most of the tools currently used in software fault tolerance. Section 3 presents a generalization of the recovery block approach to software fault tolerance, called resourceful systems. Systems are resourceful if they are able to determine whether they have achieved their goals or, if not, to develop and carry out alternate plans. Section 3 develops an approach to designing resourceful systems based upon a functionally rich architecture and an explicit goal orientation.
- ABBOTT, R. J. 1986. An Integrated Approach to Software Development. John Wiley & Sons, New York. Google Scholar
- ABDEL-GHALY, A. A., CHAN, P. Y., AND LITTLEWOOD, B. 1986. Evaluation of competent software reliability predictions. IEEE Trans. Softw. Eng. SE-12, 9 (Sept.), 950-967. Google Scholar
- AGRE, P. E., AND CHAPMAN, D. 1987. Pengi: An implementation of a theory of activity. In Proceedings o/AAAI 87. American Association for Artificial Intelligence, pp. 268-272.Google Scholar
- ANDERSON, T. 1985. Resilient Computing Systems, vol. I. John Wiley & Sons, New York. Google Scholar
- ANDERSON, T., AND LEE, P. A. 1981. Fault Tolerance Principles and Practice. Prentice/Hall International, London. Google Scholar
- AVlZIENIS, A. 1985a. Fault-tolerant computer systems. Class notes. Univ. of California, Los Angeles.Google Scholar
- AVlZlENIS, A. 1985b. The N-version approach to fault-tolerant software. IEEE Trans. Softw. Eng. SE-11, 12 (Dec.), 1491-1501.Google Scholar
- AVIZlENIS, A., AND KELLY, N. J. 1984. Fault Tolerance by design diversity:Concepts and experiments. IEEE Comput. 17, 8 {Aug.), 67-80.Google Scholar
- AVIZIENIS, A., AND LAPRIE, J. C. 1986. Dependable computing: From concepts to design diversity. Proc. IEEE 74, 5 (May), 629-638.Google Scholar
- BARSTOW, D. R. 1985. Domain-specific automatic programming. IEEE Trans. Softw. Eng. SE-11, 11 (Nov.), 1321-1336. Google Scholar
- BASTANI, F. B., AND YEN, I. L. 1985. Analysis of an inherently fault tolerant program. In Proceedings of COMPSAC 85. IEEE Computer Society, pp. 428-436.Google Scholar
- BERLINER, H. 1980. Computer backgammon. Scientific American (June).Google Scholar
- BISHOP, P. G., EsP, D. G., AND BARNES, M., HUM- PHREYS, P., DAHLL, G., AND LAHTI, J. 1986. PODS: A project on diverse software. IEEE Trans. Softw. Eng. SE-12, 9 (Sept.), 929-940. Google Scholar
- BOLOGNA, S., AND LEVESON, N. G. 1986. Special issue on reliability and safety in real-time process control. IEEE Trans. Softw. Eng. SE-12, 9 (Sept.), 877-996.Google Scholar
- BOWEN, T. P., WIGLE, G. B., AND TSAI, J. T. 1985. Specification of software quality attributes. Tech. Rep. RADC-TR-85-37, Rome Air Development Center.Google Scholar
- BROOKS, F. P., JR. 1987. No silver bullet: Essence and accidents of software engineering. IEEE Comput. 20, 4 (Apr.), 10-19. Google Scholar
- CHA, S. D., KNIGHT, J. C., LEVESON, N. G., AND SHIMEALL, T. J. 1987. An empirical study of software error detection using self-checks. In Digest of Papers FTCS-17:17th Annual Symposium on Fault Tolerant Computing. IEEE Computer Society, (July 1987), pp. 156-161.Google Scholar
- CHAPMAN, D. 1985. Planning for conjunctive goals. Tech. Rep. TR 802, MIT Artificial Intelligence Laboratory. Google Scholar
- DALE, C. J. 1982. Software reliability evaluation methods. Tech. Rep. ST-26750, British Aerospace Dynamics Group.Google Scholar
- DENNING, P. J. 1976. Fault-tolerant operating systems. Comput. Surv. 8, 4 (Dec.), 359-389. Google Scholar
- FLOYD, R. W. 1967. Assigning meaning to programs. In Proceedings of the Symposia on Applied Mathematics (Providence, R.I.), vol. 19, American Mathematics Society, pp. 19-32.Google Scholar
- PREY, P. W. 1977. Chess Skill in Man and Machine. Springer-Verlag, Berlin. Google Scholar
- GEORGEFF, M. P., AND LANSKY, A. L. 1987. Reactive reasoning and planning. In Proceedings of AAAI 87. American Association for Artificial Intelligence, pp. 677-682.Google Scholar
- GILLEY, G. C. 1987. Architectural design methods of transient fault protection. In Proceedings of AIAA Computers in Aerospace VI Conference. AIAA, pp. 78-82.Google Scholar
- GOEL, A. L. 1985. Software reliability models: Assumptions, limitations, and applicability. IEEE Trans. Softw. Eng. S WE-11, 12 (Dec.), 1411-1423.Google Scholar
- GOEL, A. L., AND BASTANI, F. B. 1985. Special issue on software and reliability. IEEE Trans. Softw. Eng. SE-11, 1 (Dec.), 1490-1577.Google Scholar
- GOEL, A. L., AND BASTANI, F. B. 1986. Special issue on software reliability. IEEE Trans. Softw. Eng. SE-12, 12 (Jan.), 1-182. Google Scholar
- RAY, J. 1986. Why do computers stop and what can be done about it? In Proceedings of the 5th Symposium on Reliability and Distributed Software and Database Systems. IEEE Computer Society, pp. 3-12.Google Scholar
- HAMMING, R. W.1950. Error detecting and error correcting codes. Bell Syst. Tech. J. 26, 4 (Apr.), 147-160.Google Scholar
- HECHT, H. 1976. Fault-tolerant software for realtime applications. Comput. Surv. 8, 4 (Dec.), 391- 409. Google Scholar
- HORNING, j. J. ET AL. 1974. A program structure for error detection and recovery, in Lecture Notes in Computer Science 16, E. Gelenbe and C. Kaiser, Eds. Springer-Verlag, Berlin, pp. 171-187. Google Scholar
- JELINSKI, Z., AND MORANDA, P. B. 1972. Software reliability research. In Statistical Computer Performance Evaluation, Freiberger, Ed. Academic Press, New York, pp. 485-502.Google Scholar
- KEILLER, P. A., LITTLEWOOD, B., MILLER, D. R., AND SOFER, A. 1983. Comparison of software reliability predictions. In Digest of the 13th International Symposium on Fault-Tolerant Computing, IEEE Computer Society, pp. 128-143.Google Scholar
- KNIGHT, J. C., AND LEVESON, N. G. 1986. An experimental evaluation of the assumption of independence in multiversion programming. IEEE Trans. Softw. Eng. SE-12, 1 (Jan.), 96-109. Google Scholar
- KOVED, L., AND WALDBAUM, G. 1986. improving availability of software subsystems through online error detection. IBM Syst. J. 25, 1 (Jan.), 96-109.Google Scholar
- LEVESON, N. G. 1986. Software safety: Why, what and how. Computing Surveys 18, 2 (June), 125- 163. Google Scholar
- LINDEN, T. A. 1976. Operating system structures to support security and reliable software. Comput. Surv. 8, 4 {Dec.), 409-445. Google Scholar
- LITTLEWOOD, B., AND VERRLL, J. L. 1973. A Bayesian reliability growth model for computer software. J. Roy. Stat. Soc. C 22 (Sept.), 332-346.Google Scholar
- LUCKHAM, D., AND VON HENKE, E. W. 1984. An overview of Anna, a specification language for Ada. Tech. Rep. 84-265, Stanford Computer Systems Laboratory. Google Scholar
- MERRIAM-WEBSTER. 1987. Webster's Ninth New Collegiate Dictionary. Merriam-Webster, Inc., Springfield, Mass.Google Scholar
- MOSTOW, J. Ed. 1985. Special issue on artificial intelligence and software engineering. IEEE Trans. Softw. Eng. SE-11, 11 (Nov.), 1257-1351. Google Scholar
- MUSA, J. D. 1975. A theory of software reliability and its application. IEEE Trans. Softw. Eng., SE-1, 9 (Sept.), 312-327.Google Scholar
- NEUMANN, P. G. 1986. On hierarchical design of computer systems for critical applications. IEEE Trans. Softw. Eng. SE-12, 9 (Sept.), 905-920. Google Scholar
- NILSSON, N. J. 1980. Principles of Artificial Intelligence. Tioga, Palo Alto. Google Scholar
- PARTSCH, H., AND STEINBRUGGEN, R. 1983. Program transformation systems. A CM Comput. Surv. 15, 3 (Sept.), 199-236. Google Scholar
- RANDELL, B. 1977. System structuring for software fault tolerance. In Current Trends in Programming Methodology, R. T. Yeh, Ed. Prentice-Hall, Englewood Cliffs, N.J., pp. 195-219.Google Scholar
- RENNELS, D. A. 1984. Fault-tolerant computing-- Concepts and examples. IEEE Trans. Comput. C-33, 12 (Dec.), 1116-1129.Google Scholar
- SEVIORA, R. E. 1987. Knowledge-based program debugging systems. IEEE Softw. 4, 3 (May), 20-32.Google Scholar
- SHOOMAN, M. 1973. Operational testing and software reliability during program development. In Proceedings of the IEEE Symposium on Computer Software Reliability (New York), IEEE Computer Society, pp. 51-57.Google Scholar
- STERLING, L., AND SHAPIRO, E. 1986. The Art of Prolog. MIT Press, Cambridge, Mass.Google Scholar
- STOTT, C. B. 1987. Review of resilient computing systems: vol. I. IEEE Computer 20, 6 (June), 117- 118.Google Scholar
- STROM, R. E., AND YEMINI, S. 1986. Typestate: A programming language concept for enhancing software reliability. IEEE Trans. Softw. Eng. SE- 12, 1 (Jan.), 157-171. Google Scholar
- TAYLOR, J. R. 1982. An integrated approach to the treatment of design and specification errors in electronic systems and software. In Reliability in Electrical and Electronic Components and Systems, E. Lauger and J. Moltoft, Eds., North- Holland, Amsterdam.Google Scholar
- TAYLOR, D. J., AND BLACK, J. P. 1982. Principles of data structure error correction. IEEE Trans. Comput. C-31, 7 (July), 602-608.Google Scholar
- TAYLOR, D. J., AND SEGER, C.-J. H. 1986. Robust storage structures for crash recovery. IEEE Trans. Comput. C-35, 4 (Apr.), 288-295. Google Scholar
- TAYLOR, D. J., MORGAN, D. E., AND BLACK, J. P. 1980. Redundancy in data structures: Improving software fault tolerance. IEEE Trans. Softw. Eng. SE-6, 1 (Nov.), 585-594.Google Scholar
Recommendations
Fault Tolerance in Multiprocessor Systems Without Dedicated Redundancy
An algorithm called RAFT (recursive algorithm for fault tolerance) for achieving fault tolerance in multiprocessor systems is described. Through the use of a combination of dynamic space- and time- redundancy techniques, RAFT achieves fault tolerance in ...
Application-Aware Byzantine Fault Tolerance
DASC '14: Proceedings of the 2014 IEEE 12th International Conference on Dependable, Autonomic and Secure ComputingByzantine fault tolerance has been intensively studied over the past decade as a way to enhance the intrusion resilience of computer systems. However, state-machine-based Byzantine fault tolerance algorithms require deterministic application processing ...
Comments