
2011 | Original Paper | Book chapter

9. Software-Level Soft-Error Mitigation Techniques

Written by: Maurizio Rebaudengo, Matteo Sonza Reorda, Massimo Violante

Published in: Soft Errors in Modern Electronic Systems

Publisher: Springer US


Abstract

In several application domains, the effects of soft errors on processor-based systems cannot be addressed in hardware, whether by changing the technology, the components, or the architecture. In these cases, an attractive solution is to modify only the software: redundancy introduced in the code and in the data provides the ability to detect, and possibly correct, errors without modifying the underlying hardware. This chapter gives an overview of the methods based on this technique, outlining their characteristics and summarizing their advantages and limitations.
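
As a rough illustration of what redundancy in the code and in the data can look like, the sketch below keeps two copies of an accumulator, repeats every update on both, and compares them after each step. It is not taken from the chapter: the names `checked_sum`, `SoftErrorDetected`, and the `inject_fault_at` parameter are hypothetical, and the parameter exists only to simulate a bit flip in one copy.

```python
class SoftErrorDetected(Exception):
    """Raised when the two copies of a value disagree."""


def checked_sum(values, inject_fault_at=None):
    """Sum `values` while maintaining a duplicated accumulator.

    `inject_fault_at` is a hypothetical test hook: at that loop index,
    one bit of the primary copy is flipped to mimic a soft error.
    """
    total_a = 0  # primary copy of the data
    total_b = 0  # redundant copy of the data
    for i, v in enumerate(values):
        total_a += v  # original operation
        total_b += v  # duplicated operation
        if i == inject_fault_at:
            total_a ^= 1 << 3  # simulated bit flip (assumption, for demo only)
        if total_a != total_b:  # consistency check between the copies
            raise SoftErrorDetected("accumulator copies diverged")
    return total_a


if __name__ == "__main__":
    print(checked_sum([1, 2, 3]))  # fault-free run
    try:
        checked_sum([1, 2, 3], inject_fault_at=1)
    except SoftErrorDetected as exc:
        print("detected:", exc)
```

The check detects the corruption but cannot tell which copy is wrong; correcting it would need further redundancy, such as a third copy with voting.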


Footnotes
1
The term alternate reflects sequential execution, which is a feature specific to the recovery block approach.
 
2
Task duplication [40] was introduced to detect transient faults by duplicating the computation of a task on two processors. If the results of the two executions do not match, the task is executed again on another processor until a pair of processors produces identical results. This scheme does not use checkpoints, so every time a fault is detected the task has to be restarted from the beginning.
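
The task duplication scheme in footnote 2 can be sketched as follows. This is one possible reading of the description, not the implementation from [40]: processors are modelled as plain callables, and `run_with_task_duplication`, `healthy`, `faulty`, and `answer` are hypothetical names introduced for the example.

```python
def run_with_task_duplication(task, processors):
    """Run `task` on successive processors until two results agree.

    `processors` is a list of callables, each standing in for one
    processor (an assumption of this sketch). No checkpoints are used,
    so every execution starts the task from its beginning.
    """
    results = []
    for execute_on in processors:
        result = execute_on(task)  # full re-execution from the start
        if result in results:      # a pair of processors agrees
            return result
        results.append(result)
    raise RuntimeError("no pair of processors produced identical results")


def healthy(task):
    """Model of a fault-free processor."""
    return task()


def faulty(task):
    """Model of a processor hit by a transient fault (one result bit flips)."""
    return task() ^ 0x10


def answer():
    return 42


if __name__ == "__main__":
    # The first pair disagrees, so a third processor is used.
    print(run_with_task_duplication(answer, [healthy, faulty, healthy]))
```

In this sketch two processors that fail in exactly the same way would still be accepted as a matching pair, which is a limitation of comparison-based detection in general.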
 
References
1.
M. Rebaudengo, M. Sonza Reorda, M. Torchiano, M. Violante, Soft-error detection through software fault-tolerance techniques. Proceedings of the IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems, 1999, pp. 210–218
2.
M. Rebaudengo, M. Sonza Reorda, M. Torchiano, M. Violante, A source-to-source compiler for generating dependable software. Proceedings of the IEEE International Workshop on Source Code Analysis and Manipulation, 2001, pp. 33–42
3.
P. Cheynet, B. Nicolescu, R. Velazco, M. Rebaudengo, M. Sonza Reorda, M. Violante, Experimentally evaluating an automatic approach for generating safety-critical software with respect to transient errors. IEEE Transactions on Nuclear Science 47(6), 2000, 2231–2236
4.
A. Benso, S. Chiusano, P. Prinetto, L. Tagliaferri, A C/C++ source-to-source compiler for dependable applications. Proceedings of the IEEE International Conference on Dependable Systems and Networks, 2000, pp. 71–78
5.
N. Oh, P.P. Shirvani, E.J. McCluskey, Error detection by duplicated instructions in super-scalar processors. IEEE Transactions on Reliability 51(1), 2002, 63–75
6.
G. Sohi, M. Franklin, K. Saluja, A study of time-redundant fault tolerance techniques for high-performance pipelined computers. 19th International Symposium on Fault Tolerant Computing, 1989, pp. 436–443
7.
C. Bolchini, A software methodology for detecting hardware faults in VLIW data paths. IEEE Transactions on Reliability 52(4), 2003, 458–468
8.
N. Oh, E.J. McCluskey, Error detection by selective procedure call duplication for low energy consumption. IEEE Transactions on Reliability 51(4), 2002, 392–402
9.
K. Echtle, B. Hinz, T. Nikolov, On hardware fault detection by diverse software. Proceedings of the 13th International Conference on Fault-Tolerant Systems and Diagnostics, 1990, pp. 362–367
10.
H. Engel, Data flow transformations to detect results which are corrupted by hardware faults. Proceedings of the IEEE High-Assurance System Engineering Workshop, 1997, pp. 279–285
11.
M. Jochim, Detecting processor hardware faults by means of automatically generated virtual duplex systems. Proceedings of the International Conference on Dependable Systems and Networks, 2002, pp. 399–408
12.
S.K. Reinhardt, S.S. Mukherjee, Transient fault detection via simultaneous multithreading. Proceedings of the 27th International Symposium on Computer Architecture, 2000, pp. 25–36
13.
E. Rotenberg, AR-SMT: a microarchitectural approach to fault tolerance in microprocessors. 29th International Symposium on Fault-Tolerant Computing, 1999, pp. 84–91
14.
N. Oh, S. Mitra, E.J. McCluskey, ED4I: error detection by diverse data and duplicated instructions. IEEE Transactions on Computers 51(2), 2002, 180–199
15.
M. Hiller, Executable assertions for detecting data errors in embedded control systems. Proceedings of the IEEE International Conference on Dependable Systems and Networks, 2000, pp. 24–33
16.
J. Vinter, J. Aidemark, P. Folkesson, J. Karlsson, Reducing critical failures for control algorithms using executable assertions and best effort recovery. Proceedings of the IEEE International Conference on Dependable Systems and Networks, 2001, pp. 347–356
17.
S.S. Yau, F.-C. Chen, An approach to concurrent control flow checking. IEEE Transactions on Software Engineering 6(2), 1980, 126–137
18.
N. Oh, P.P. Shirvani, E.J. McCluskey, Control-flow checking by software signatures. IEEE Transactions on Reliability 51(2), 2002, 111–122
19.
Z. Alkhalifa, V.S.S. Nair, N. Krishnamurthy, J.A. Abraham, Design and evaluation of system-level checks for on-line control flow error detection. IEEE Transactions on Parallel and Distributed Systems 10(6), 1999, 627–641
20.
O. Goloubeva, M. Rebaudengo, M. Sonza Reorda, M. Violante, Soft-error detection using control flow assertions. Proceedings of the 18th International Symposium on Defect and Fault Tolerance in VLSI Systems, 3–5 November 2003, pp. 581–588
21.
R. Vemu, J.A. Abraham, CEDA: control-flow error detection through assertions. Proceedings of the 12th IEEE International On-Line Testing Symposium, 2006, pp. 151–158
22.
R. Vemu, J.A. Abraham, Budget-dependent control-flow error detection. Proceedings of the 14th IEEE International On-Line Testing Symposium, 2008, pp. 73–78
23.
C. Babbage, On the mathematical powers of the calculating engine, unpublished manuscript, December 1837, Oxford, Buxton Ms7, Museum of History of Science. Printed in The Origins of Digital Computers: Selected Papers, B. Randell (ed.), Springer, Berlin, 1974, pp. 17–52
24.
A. Avizienis, J.C. Laprie, Dependable computing: from concepts to design diversity. Proceedings of the IEEE 74(5), 1986, 629–638
25.
A. Avizienis, The N-version approach to fault-tolerant software. IEEE Transactions on Software Engineering 11(12), 1985, 1491–1501
26.
B. Randell, System structure for software fault tolerance. IEEE Transactions on Software Engineering 1(2), 1975, 220–232
27.
D. Pradhan, Fault-Tolerant Computer System Design. Prentice-Hall, Englewood Cliffs, NJ, 1996
28.
J.P. Kelly, T.I. McVittie, W.I. Yamamoto, Implementing design diversity to achieve fault tolerance. IEEE Software 8(4), 1991, 61–71
29.
J.H. Lala, L.S. Alger, Hardware and software fault tolerance: a unified architectural approach. Proceedings of the 18th International Symposium on Fault-Tolerant Computing, 1988, pp. 240–245
30.
C.E. Price, Fault tolerant avionics for the space shuttle. Proceedings of the 10th IEEE/AIAA Digital Avionics Systems Conference, 1991, pp. 203–206
31.
D. Briere, P. Traverse, AIRBUS A320/A330/A340 electrical flight controls: a family of fault-tolerant systems. Proceedings of the 23rd International Symposium on Fault-Tolerant Computing, 1993, pp. 616–623
32.
R. Riter, Modeling and testing a critical fault-tolerant multi-process system. Proceedings of the 25th International Symposium on Fault-Tolerant Computing, 1995, pp. 516–521
33.
G. Hagelin, ERICSSON safety system for railway control. Proceedings of the Workshop on Design Diversity in Action, Springer, Vienna, 1988, pp. 11–21
34.
H. Kantz, C. Koza, The ELEKTRA railway signalling system: field experience with an actively replicated system with diversity. Proceedings of the 25th International Symposium on Fault-Tolerant Computing, 1995, pp. 453–458
35.
A. Amendola, L. Impagliazzo, P. Marmo, G. Mongardi, G. Sartore, Architecture and safety requirements of the ACC railway interlocking system. Proceedings of IEEE International Computer Performance and Dependability Symposium, 1996, pp. 21–29
36.
A.M. Tyrrell, Recovery blocks and algorithm-based fault tolerance, EUROMICRO 96. Beyond 2000: Hardware and Software Design Strategies. Proceedings of the 22nd EuroMicro Conference, 1996, pp. 292–299
37.
K.M. Chandy, C.V. Ramamoorthy, Rollback and recovery strategies for computer programs. IEEE Transactions on Computers 21(6), 1972, 546–556
38.
W.K. Fuchs, C.-C.J. Li, CATCH – compiler-assisted techniques for checkpointing. Proceedings of the 20th Fault-Tolerant Computing Symposium, 1990, pp. 74–81
39.
J. Long, W.K. Fuchs, J.A. Abraham, Compiler-assisted static checkpoint insertion. Proceedings of the 22nd Fault-Tolerant Computing Symposium, 1992, pp. 58–65
40.
D.K. Pradhan, N.H. Vaidya, Roll-forward checkpointing scheme: a novel fault-tolerant architecture. IEEE Transactions on Computers 43(10), 1994, 1163–1174
41.
A. Ziv, J. Bruck, Performance optimization of checkpointing scheme with task duplication. IEEE Transactions on Computers 46(12), 1997, 1381–1386
42.
K.H. Huang, J.A. Abraham, Algorithm-based fault tolerance for matrix operations. IEEE Transactions on Computers C-33(6), 1984, 518–528
43.
A. Roy-Chowdhury, P. Banerjee, Tolerance determination for algorithm based checks using simplified error analysis. Proceedings of the IEEE International Fault Tolerant Computing Symposium, 1993
44.
M. Rebaudengo, M. Sonza Reorda, M. Violante, A new software-based technique for low-cost fault-tolerant application. Proceedings of the IEEE Annual Reliability and Maintainability Symposium, 2003, pp. 25–28
45.
M. Rebaudengo, M. Sonza Reorda, M. Violante, A new approach to software-implemented fault tolerance. Journal of Electronic Testing: Theory and Applications 20, 2004, 433–437
46.
B. Nicolescu, R. Velazco, M. Sonza Reorda, Effectiveness and limitations of various software techniques for “soft error” detection: a comparative study. Proceedings of the IEEE 7th International On-Line Testing Workshop, 2001, pp. 172–177
Metadata
Title
Software-Level Soft-Error Mitigation Techniques
Written by
Maurizio Rebaudengo
Matteo Sonza Reorda
Massimo Violante
Copyright year
2011
Publisher
Springer US
DOI
https://doi.org/10.1007/978-1-4419-6993-4_9
