
2020 | OriginalPaper | Chapter

8. Recovery Preparation

Authors: Igor Schagaev, Eugene Zouev, Kaegi Thomas

Published in: Software Design for Resilient Computer Systems

Publisher: Springer International Publishing


Abstract

In the last section, we showed how the hardware integrity of a computing system can be efficiently ensured using hardware-checking schemes and sequences of system software testing procedures. However, to recover from a fault, it is also necessary to eliminate the effects the error had on the computation, i.e., on the software code and data space. In GAFT, this corresponds to preparation for recovery. We now show how software has to be organized to be able to recover itself. In other words, we review different strategies by which software can, after an error has been detected, ensure that the error did not affect the software state or, if this cannot be ensured, what precautions the software has to take to be able to re-establish a correct state. First, we review the state of the art; we then introduce a new technology and show its power and limitations. Next, we show how hardware can assist software in the process of recovery preparation. All generic approaches to recovery preparation need so-called stable storage, i.e., storage that is nonvolatile, reliable, and fast. If no direct hardware support is available, stable storage must be implemented in software. We present a possible software implementation of such a stable storage.


Metadata
Title: Recovery Preparation
Authors: Igor Schagaev, Eugene Zouev, Kaegi Thomas
Copyright Year: 2020
DOI: https://doi.org/10.1007/978-3-030-21244-5_8