Abstract
This paper critically surveys methods for the automated probabilistic diagnosis of large multiprocessor systems. In recent years, much of the work on system-level diagnosis has focused on probabilistic methods, which can diagnose intermittently faulty processing nodes and can be applied in general situations on general interconnection networks. The theory behind the probabilistic diagnosis methods is explained, and the various diagnosis algorithms are described in simple terms with the aid of a running example. The diagnosis methods are compared and analyzed, and a chart is produced, showing the comparative advantages of the various diagnosis algorithms on the basis of several factors important to the probabilistic diagnosis.
- BARSI, F., GRADONI, F., AND MAESTRINI, P. 1976. A theory of diagnosability of digital systems. IEEE Trans. Comput. C-25, 6 (June), 585 593.Google ScholarDigital Library
- BERMAN, P., AND PELC, A. 1990. Distributed probabihstic fault diagnosis in multiprocessor systems. Dig. Papers FTCS-20 (June), 340 346.Google Scholar
- BIANCHINI, R. P., JR., AND BUSKENS, R. W. 1992. Implementation of on-line distributed systemlevel diagnosis theory. IEEE Trans. Comput. 41, 5 (May), 616-626. Google ScholarDigital Library
- BIANCHINI, R., JR., GOODWlN, K., ~'~D NYDICK, D. S. 1990. Practical application and implementation of distributed system-level diagnosis theory. Dig. Papers FTCS-20 (June), 332-339.Google Scholar
- BLOUGH, D. M. 1988. Fault detection and diagnosis in multiprocessor systems. Ph.D. dissertation, The Johns Hopkins University, Baltimore, Md. Google ScholarDigital Library
- BLOUGH, D. M., AND PELC, A. 1993. Diagnosis and repair in multiprocessor systems. IEEE Trans. Comput. 42, 2 (Feb.), 205-217. Google ScholarDigital Library
- BLOUGH, D. M., AND PELC, A. 1992. Complexity of fault diagnosis in comparison models. IEEE Trans. Comput. 41, 3 (Mar.), 318-324. Google ScholarDigital Library
- BLOUGH, D. M., SULLIVAN, G. F., AND MASSON, G. M. 1992a. Efficient diagnosis of multiprocessor systems under probabilistic models. IEEE Trans. Comput. 41, 9 (Sept.), 1126-1136. Google ScholarDigital Library
- BLOUGH, D. M., SULLIVAN, G. F., AND MASSON, G. M. 1992b Intermittent fault diagnosis in multiprocessor systems. IEEE Trans. Comput. 41, 11 (Nov.), 1430 1441 Google ScholarDigital Library
- BLOUNT, M. L. 1977. Probablhstic treatment of diagnosis in digital systems Dig. Papers FTCS-7, 72-77.Google Scholar
- BUSKENS. R. W., ANn BIANCHINI, R. P., JR. 1993. Distributed on-line diagnosis in the presence of arbitrary faults. Dtg. Papers FTCS-21 (June), 470 479.Google Scholar
- DAHBURA, A. T. 1988. System-level diagnosis: A perspective for the third decade. In Concurrent Computation: Algorithms, Architectures, Technologies Plenum, New York.Google Scholar
- DAHBURA, A. T., SABNANI, K. K., AND KING, L. L. 1987. The comparison approach to multiproessor fault diagnosis. IEEE Trans. Comput. C-36, 3 (Mar.), 373-378. Google ScholarDigital Library
- FRmDMAN, A. D. 1975. A new measure of digital system diagnosis. Dig. Papers FTCS-5, 167-170.Google Scholar
- FRIEDMAN, A. D., AND SIMONCINI, L. 1980. Systemlevel fault diagnosis. Computer 13, 3 (Mar.), 47 53.Google ScholarDigital Library
- FUSSELL, D., AND RANGARAJAN, S. 1989. Probabilistic diagnosis of multiprocessor systems with arbitrary connectivity. Dig. Papers FTCS-19, 560-565.Google Scholar
- HAKIMI, S. L., AND AMIN, A. T. 1974. Characterization of connection assignment of diagnosable systems. IEEE Trans. Comput. C-23, I (Jan.), 86 88.Google ScholarDigital Library
- HAKIMI, S. L., AND NAKAJIMA, K. 1984. On adaptive system diagnosis. IEEE Trans. Comput. C-33, 3 (Mar.), 234-240.Google ScholarDigital Library
- HORST, R., JEWETT, D., AND LENOSK~, D. 1993. The risk of data corruption in microprocessor-based systems. Dig. Papers FTCS-23 (June), 576-585.Google Scholar
- HOSSEINi, S. H., KUHL, J. G., AND REDDY, S. M. 1988. On self-fault diagnosis of the distributed systems. IEEE Trans. Comput. 37, 2 (Feb.), 248-251. Google ScholarDigital Library
- HUANG, S., Xu, J., AND CHEN, T. 1989. Characterization and design of sequentially t-diagnosable systems. Dig. Papers FTCS-19 (June), 554-559.Google Scholar
- KAVIANPOUR, A., AND FRIEDMAN, A. D. 1978. Efficient design of easily diagnosable systems. In the 3rd USA-JAPAN Computer Conference. 251 257.Google Scholar
- KAVlANPOUR, A., ANn KIM, K. H. 1991. Diagnosabilities of hypercubes under the pessimistic onestep diagnosis strategy. IEEE Trans. Comput. 40, 2 (Feb.), 232-237. Google ScholarDigital Library
- KIME, C. 1986. System diagnosis. In Fault- Tolerant Computing Theory and Techniques. Vol. 2. Prentice-Hall, Englewood Cliffs, N.J. Google ScholarDigital Library
- KUHL, J. G., AND REDDY, S. M. 1980. Distributed fault-tolerance for large multiprocessor systems. In the 7th International Symposium on Computer Architecture. ACM, New York, 23-30. Google ScholarDigital Library
- LEE, S. 1990. Probabilistic multiprocessor and multicomputer diagnosis. Ph.D. dissertation, The Univ. of Michigan, Ann Arbor, Mich. Google ScholarDigital Library
- LEE, S., AND SHIN, K. G. 1993. Optimal and efficient probabilistic distributed diagnosis schemes. IEEE Trans. Comput. 42, 7 (July), 882-886. Google ScholarDigital Library
- LEE, S., AND SHIN, K. G. 1990a. Optimal multiple syndrome probabilistic diagnosis. Dig. Papers FTCS-20 (June), 324 331.Google Scholar
- LEE, S., AND SHIN, K. G. 1990b. Interleaved all-toall reliable broadcast on meshes and hypercubes. In the 1990 International Conference on Parallel Processing. Vol. III. Pennsylvania State University, University Park, Pa., 110 113.Google Scholar
- MAHESWARI, S. H., AND HAKIMI, S. L. 1976. On models for diagnosable systems and probabilistic fault diagnosis. IEEE Trans. Comput. C-25, 3 (Mar.), 228 236.Google Scholar
- MALLELA, S., AND MASSON, G. M. 1980. Diagnosis without repair for hybrid fault situations. IEEE Trans. Comput. C-29, 6 (June), 461-470.Google ScholarDigital Library
- MALLELA, S., AND MASSON, G. M. 1978. Diagnosable systems for intermittent faults. IEEE Trans. Comput. C-27, 6 (June), 560-566.Google ScholarDigital Library
- PELC, A. 1993. Efficient distributed diagnosis in the presence of random faults. Dig. Papers FTCS-23 (June), 462 469.Google Scholar
- PREPARATA, F. P., METZE, G., AND CHIEN, R. T. 1967. On the connection assignment problem of diagnosable systems. IEEE Trans. Elec. Comput. EC-16, 6 (Dec.), 848-854.Google ScholarCross Ref
- RAMANATHAN, P., AND SHIN, K. G. 1988. Reliable broadcast in hypercube multicomputers. IEEE Trans. Comput. 37, 12 (Dec.), 1654-1657. Google ScholarDigital Library
- RANGARAJAN, S., AND FUSSELL, D. 1992. Diagnosing arbitrarily connected parallel computers with high probability. IEEE Trans. Comput. 41, 5 (May), 606-615. Google ScholarDigital Library
- RUSSELL, J. D., AND KIME, C. R. 1975. System fault diagnosis: Closure and diagnosability with repair. IEEE Trans. Comput. C-24, 11 (Nov.), 1078-1088.Google ScholarDigital Library
- SCHEINERMAN, E. 1987. Almost sure fault tolerance in random graphs. SIAM J. Comput. 16, 12 (Dec.), 1124 1134. Google ScholarDigital Library
- SIEWlOREK, D. P., AND SWARZ, R. S. 1982. The Theory and Practice of Reliable System Design. Digital Equipment Corporation, Bedford, Mass.Google Scholar
- SOMANI, A. K., AND AGRAWAL, V. K. 1992. Distributed diagnosis algorithms for regular interconnected structures. IEEE Trans. Comput. 41, 7 (July), 899-906. Google ScholarDigital Library
- SOMANI, A. K., AND AGRAWAL, V. K. 1989. Distributed syndrome decoding for regular interconnected structures. Dig. Papers FTCS-19 (June), 70 77.Google Scholar
- SOMANI, A. K., AGRAWAL, V. K., AND AVIS, D. 1987. A generalized theory for system level diagnosis. IEEE Trans. Comput. C-36, 5 (May), 538-546. Google ScholarDigital Library
- YANG, C. L., AND MASSON, G. M. 1988. A distributed algorithm for fault diagnosis in systems with soft failures. IEEE Trans. Comput. 37, 11 (Nov.), 1476-1479. Google ScholarDigital Library
- YANG, C. L., AND MASSON, G. M. 1986. A fault identification algorithm for t~-diagnosable systems. IEEE Trans. Comput. C-35, 6 (June), 503-510. Google ScholarDigital Library
Index Terms
- Probabilistic diagnosis of multiprocessor systems
Recommendations
On Probabilistic Diagnosis of Multiprocessor Systems Using Multiple Syndromes
This paper addresses the distributed self-diagnosis of a multiprocessor/multicomputersystem based on fault syndromes formed by comparison testing. The authors show thatby using multiple fault syndromes, it is possible to achieve significantly better ...
A Quick Pessimistic Diagnosis Algorithm for Hypercube-Like Multiprocessor Systems under the PMC Model
Processor fault diagnosis is an essential subject for the reliability of a multiprocessor system. The precise strategy and the pessimistic strategy are two classical diagnostic strategies which are based on the well-known PMC model. The precise strategy ...
Comparison Diagnosis in Large Multiprocessor Systems
ATS '96: Proceedings of the 5th Asian Test SymposiumWe present the bounded symmetric comparison (BSC) model for comparison-based system-level diagnosis. It is based on the symmetric comparison model of Chwa and Hakimi but includes a limit on the number of PEs that can produce identical, faulty results ...
Comments