2020 | OriginalPaper | Chapter

8. Recovery Preparation

Authors: Igor Schagaev, Eugene Zouev, Kaegi Thomas

Published in: Software Design for Resilient Computer Systems

Publisher: Springer International Publishing

Abstract

In the previous section, we showed how the hardware integrity of a computing system can be ensured efficiently using hardware-checking schemes and sequences of system software test procedures. To recover from a fault, however, it is also necessary to eliminate the effects the error had on the computation, i.e., on the software code and data space. In GAFT, this corresponds to preparation for recovery. We now show how software has to be organized to support its own recovery. In other words, we review different strategies by which software can, after an error has been detected, ensure that the error did not affect the software state or, if this cannot be ensured, take the precautions needed to re-establish a correct software state. First, we review the state of the art; we then introduce a new technology and show its power and limitations. Next, we show how hardware can assist software in the process of recovery preparation. All generic approaches to recovery preparation need so-called stable storage: storage that is nonvolatile, reliable, and fast. If no direct hardware support is available, stable storage must be implemented in software, and we present one possible software implementation of it.

Title: Recovery Preparation
Authors: Igor Schagaev, Eugene Zouev, Kaegi Thomas
Publisher: Springer International Publishing
Book: Software Design for Resilient Computer Systems
Print ISBN: 978-3-030-21243-8
Electronic ISBN: 978-3-030-21244-5
Copyright Year: 2020
DOI: https://doi.org/10.1007/978-3-030-21244-5_8
