skip to main content
article
Free Access

Probabilistic diagnosis of multiprocessor systems

Authors Info & Claims
Published:01 March 1994Publication History
Skip Abstract Section

Abstract

This paper critically surveys methods for the automated probabilistic diagnosis of large multiprocessor systems. In recent years, much of the work on system-level diagnosis has focused on probabilistic methods, which can diagnose intermittently faulty processing nodes and can be applied in general situations on general interconnection networks. The theory behind the probabilistic diagnosis methods is explained, and the various diagnosis algorithms are described in simple terms with the aid of a running example. The diagnosis methods are compared and analyzed, and a chart is produced, showing the comparative advantages of the various diagnosis algorithms on the basis of several factors important to the probabilistic diagnosis.

References

  1. BARSI, F., GRADONI, F., AND MAESTRINI, P. 1976. A theory of diagnosability of digital systems. IEEE Trans. Comput. C-25, 6 (June), 585 593.Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. BERMAN, P., AND PELC, A. 1990. Distributed probabihstic fault diagnosis in multiprocessor systems. Dig. Papers FTCS-20 (June), 340 346.Google ScholarGoogle Scholar
  3. BIANCHINI, R. P., JR., AND BUSKENS, R. W. 1992. Implementation of on-line distributed systemlevel diagnosis theory. IEEE Trans. Comput. 41, 5 (May), 616-626. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. BIANCHINI, R., JR., GOODWlN, K., ~'~D NYDICK, D. S. 1990. Practical application and implementation of distributed system-level diagnosis theory. Dig. Papers FTCS-20 (June), 332-339.Google ScholarGoogle Scholar
  5. BLOUGH, D. M. 1988. Fault detection and diagnosis in multiprocessor systems. Ph.D. dissertation, The Johns Hopkins University, Baltimore, Md. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. BLOUGH, D. M., AND PELC, A. 1993. Diagnosis and repair in multiprocessor systems. IEEE Trans. Comput. 42, 2 (Feb.), 205-217. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. BLOUGH, D. M., AND PELC, A. 1992. Complexity of fault diagnosis in comparison models. IEEE Trans. Comput. 41, 3 (Mar.), 318-324. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. BLOUGH, D. M., SULLIVAN, G. F., AND MASSON, G. M. 1992a. Efficient diagnosis of multiprocessor systems under probabilistic models. IEEE Trans. Comput. 41, 9 (Sept.), 1126-1136. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. BLOUGH, D. M., SULLIVAN, G. F., AND MASSON, G. M. 1992b Intermittent fault diagnosis in multiprocessor systems. IEEE Trans. Comput. 41, 11 (Nov.), 1430 1441 Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. BLOUNT, M. L. 1977. Probablhstic treatment of diagnosis in digital systems Dig. Papers FTCS-7, 72-77.Google ScholarGoogle Scholar
  11. BUSKENS. R. W., ANn BIANCHINI, R. P., JR. 1993. Distributed on-line diagnosis in the presence of arbitrary faults. Dtg. Papers FTCS-21 (June), 470 479.Google ScholarGoogle Scholar
  12. DAHBURA, A. T. 1988. System-level diagnosis: A perspective for the third decade. In Concurrent Computation: Algorithms, Architectures, Technologies Plenum, New York.Google ScholarGoogle Scholar
  13. DAHBURA, A. T., SABNANI, K. K., AND KING, L. L. 1987. The comparison approach to multiproessor fault diagnosis. IEEE Trans. Comput. C-36, 3 (Mar.), 373-378. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. FRmDMAN, A. D. 1975. A new measure of digital system diagnosis. Dig. Papers FTCS-5, 167-170.Google ScholarGoogle Scholar
  15. FRIEDMAN, A. D., AND SIMONCINI, L. 1980. Systemlevel fault diagnosis. Computer 13, 3 (Mar.), 47 53.Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. FUSSELL, D., AND RANGARAJAN, S. 1989. Probabilistic diagnosis of multiprocessor systems with arbitrary connectivity. Dig. Papers FTCS-19, 560-565.Google ScholarGoogle Scholar
  17. HAKIMI, S. L., AND AMIN, A. T. 1974. Characterization of connection assignment of diagnosable systems. IEEE Trans. Comput. C-23, I (Jan.), 86 88.Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. HAKIMI, S. L., AND NAKAJIMA, K. 1984. On adaptive system diagnosis. IEEE Trans. Comput. C-33, 3 (Mar.), 234-240.Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. HORST, R., JEWETT, D., AND LENOSK~, D. 1993. The risk of data corruption in microprocessor-based systems. Dig. Papers FTCS-23 (June), 576-585.Google ScholarGoogle Scholar
  20. HOSSEINi, S. H., KUHL, J. G., AND REDDY, S. M. 1988. On self-fault diagnosis of the distributed systems. IEEE Trans. Comput. 37, 2 (Feb.), 248-251. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. HUANG, S., Xu, J., AND CHEN, T. 1989. Characterization and design of sequentially t-diagnosable systems. Dig. Papers FTCS-19 (June), 554-559.Google ScholarGoogle Scholar
  22. KAVIANPOUR, A., AND FRIEDMAN, A. D. 1978. Efficient design of easily diagnosable systems. In the 3rd USA-JAPAN Computer Conference. 251 257.Google ScholarGoogle Scholar
  23. KAVlANPOUR, A., ANn KIM, K. H. 1991. Diagnosabilities of hypercubes under the pessimistic onestep diagnosis strategy. IEEE Trans. Comput. 40, 2 (Feb.), 232-237. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. KIME, C. 1986. System diagnosis. In Fault- Tolerant Computing Theory and Techniques. Vol. 2. Prentice-Hall, Englewood Cliffs, N.J. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. KUHL, J. G., AND REDDY, S. M. 1980. Distributed fault-tolerance for large multiprocessor systems. In the 7th International Symposium on Computer Architecture. ACM, New York, 23-30. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. LEE, S. 1990. Probabilistic multiprocessor and multicomputer diagnosis. Ph.D. dissertation, The Univ. of Michigan, Ann Arbor, Mich. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. LEE, S., AND SHIN, K. G. 1993. Optimal and efficient probabilistic distributed diagnosis schemes. IEEE Trans. Comput. 42, 7 (July), 882-886. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. LEE, S., AND SHIN, K. G. 1990a. Optimal multiple syndrome probabilistic diagnosis. Dig. Papers FTCS-20 (June), 324 331.Google ScholarGoogle Scholar
  29. LEE, S., AND SHIN, K. G. 1990b. Interleaved all-toall reliable broadcast on meshes and hypercubes. In the 1990 International Conference on Parallel Processing. Vol. III. Pennsylvania State University, University Park, Pa., 110 113.Google ScholarGoogle Scholar
  30. MAHESWARI, S. H., AND HAKIMI, S. L. 1976. On models for diagnosable systems and probabilistic fault diagnosis. IEEE Trans. Comput. C-25, 3 (Mar.), 228 236.Google ScholarGoogle Scholar
  31. MALLELA, S., AND MASSON, G. M. 1980. Diagnosis without repair for hybrid fault situations. IEEE Trans. Comput. C-29, 6 (June), 461-470.Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. MALLELA, S., AND MASSON, G. M. 1978. Diagnosable systems for intermittent faults. IEEE Trans. Comput. C-27, 6 (June), 560-566.Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. PELC, A. 1993. Efficient distributed diagnosis in the presence of random faults. Dig. Papers FTCS-23 (June), 462 469.Google ScholarGoogle Scholar
  34. PREPARATA, F. P., METZE, G., AND CHIEN, R. T. 1967. On the connection assignment problem of diagnosable systems. IEEE Trans. Elec. Comput. EC-16, 6 (Dec.), 848-854.Google ScholarGoogle ScholarCross RefCross Ref
  35. RAMANATHAN, P., AND SHIN, K. G. 1988. Reliable broadcast in hypercube multicomputers. IEEE Trans. Comput. 37, 12 (Dec.), 1654-1657. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. RANGARAJAN, S., AND FUSSELL, D. 1992. Diagnosing arbitrarily connected parallel computers with high probability. IEEE Trans. Comput. 41, 5 (May), 606-615. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. RUSSELL, J. D., AND KIME, C. R. 1975. System fault diagnosis: Closure and diagnosability with repair. IEEE Trans. Comput. C-24, 11 (Nov.), 1078-1088.Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. SCHEINERMAN, E. 1987. Almost sure fault tolerance in random graphs. SIAM J. Comput. 16, 12 (Dec.), 1124 1134. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. SIEWlOREK, D. P., AND SWARZ, R. S. 1982. The Theory and Practice of Reliable System Design. Digital Equipment Corporation, Bedford, Mass.Google ScholarGoogle Scholar
  40. SOMANI, A. K., AND AGRAWAL, V. K. 1992. Distributed diagnosis algorithms for regular interconnected structures. IEEE Trans. Comput. 41, 7 (July), 899-906. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. SOMANI, A. K., AND AGRAWAL, V. K. 1989. Distributed syndrome decoding for regular interconnected structures. Dig. Papers FTCS-19 (June), 70 77.Google ScholarGoogle Scholar
  42. SOMANI, A. K., AGRAWAL, V. K., AND AVIS, D. 1987. A generalized theory for system level diagnosis. IEEE Trans. Comput. C-36, 5 (May), 538-546. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. YANG, C. L., AND MASSON, G. M. 1988. A distributed algorithm for fault diagnosis in systems with soft failures. IEEE Trans. Comput. 37, 11 (Nov.), 1476-1479. Google ScholarGoogle ScholarDigital LibraryDigital Library
  44. YANG, C. L., AND MASSON, G. M. 1986. A fault identification algorithm for t~-diagnosable systems. IEEE Trans. Comput. C-35, 6 (June), 503-510. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. Probabilistic diagnosis of multiprocessor systems

            Recommendations

            Reviews

            Ismet Gungor

            After presenting an introductory general overview of deterministic diagnosis methods for multiprocessor systems consisting of large numbers of processing elements, this paper provides an analysis of probabilistic diagnosis methods for multiprocessor systems. Methods that define the entire fault set (or a subset) uniquely from a given syndrome are categorized as deterministic, whereas methods that try to diagnose faulty nodes with high probability are categorized as probabilistic. The basic limitation of deterministic methods is that they are not of use in case of intermittent faults, which are known to account for a large percentage of the faults in real systems. The authors highlight the basic difficulties in the application of the probabilistic diagnosis methods, which use probability parameters to model the behavior of faulty and non-faulty nodes, and evaluate the quality of the diagnoses generated. There is no assurance that a probabilistic method will produce a correct and complete diagnosis. The expectation, however, is that with larger testing graphs and more example syndromes, the methods outlined and compared in the paper will perform reasonably well. Although different aspects of the methods are assessed, including computational complexity, the most important aspect, namely diagnostic accuracy, is not assessed. Diagnostic accuracy is of vital importance for any method that is to be used for practical purposes. The subject has been extensively treated theoretically in the literature, as is evident from the long list of references provided. The usefulness of these methods has yet to be demonstrated through the application of the methods on real systems.

            Access critical reviews of Computing literature here

            Become a reviewer for Computing Reviews.

            Comments

            Login options

            Check if you have access through your login credentials or your institution to get full access on this article.

            Sign in

            Full Access

            • Published in

              cover image ACM Computing Surveys
              ACM Computing Surveys  Volume 26, Issue 1
              March 1994
              132 pages
              ISSN:0360-0300
              EISSN:1557-7341
              DOI:10.1145/174666
              Issue’s Table of Contents

              Copyright © 1994 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 1 March 1994
              Published in csur Volume 26, Issue 1

              Permissions

              Request permissions about this article.

              Request Permissions

              Check for updates

              Qualifiers

              • article

            PDF Format

            View or Download as a PDF file.

            PDF

            eReader

            View online with eReader.

            eReader