An efficient recovery scheme for fault-tolerant mobile computing systems

https://doi.org/10.1016/S0167-739X(02)00095-XGet rights and content

Abstract

This paper presents an efficient recovery scheme to provide fault-tolerance for the mobile computing systems. The proposed scheme is based on the message logging and the independent checkpointing, and for the efficient management of the recovery information, such as checkpoints and message logs, the movement-based scheme is suggested. The mobile host carrying its recovery information to the nearby mobile support station can recover instantly in case of a failure. However, the support stations visited by the mobile host may have to experience the high failure-free execution cost for transferring the recovery information and accessing the stable storage. On the other hand, the recovery cost can be too high, if the recovery information remains dispersed over a number of mobile support stations. The movement-based scheme considers both of the failure-free execution cost and the failure-recovery cost. Hence, while a mobile host moves within a certain range, the recovery information of the mobile host remains at the support stations where the information was first saved. However, if the mobile host moves out of the range, it transfers the recovery information into the nearby support station. As a result, the scheme can control the transfer cost as well as the recovery cost. The performance of the proposed scheme is evaluated with extensive simulation study.

Introduction

Many algorithms supporting distributed services are nowadays extended to continue their services in the mobile environment [5]. However, straightforward extension of existing algorithms are not well adopted to the new environment, since the mobile computing system introduces a new challenge such as the handling of mobile hosts. The mobile host, called an MH, has some special properties. For example, it keeps moving from a cell to another cell and is connected to the mobile support station, called an MSS, via a wireless network which has the low bandwidth and very fragile connection. The MHs usually carry the small memory and disk spaces, and the low battery capacity of MHs can also be a problem.

The checkpointing–recovery is one of the distributed services to provide the fault-tolerance for the system. Considering the fact that the MHs are vulnerable to the failure, it is very desirable for the mobile computing system to be equipped with the checkpointing–recovery facility. Many checkpointing–recovery schemes have been proposed for the distributed systems, however, these schemes cannot be directly used in the mobile environment, as most of the other distributed services. Especially, the following properties of the mobile computing system introduce new design issues for the implementation of checkpointing–recovery.

First, the low bandwidth of the wireless network makes it impossible to use the schemes enforcing a large number of system messages or a large size of information carried in a message. Also, since the checkpoints of an MH must be transferred to the MSSs due to the lack of disk spaces on MHs, the schemes enforcing frequent checkpointing cannot be used and it is plausible to make an MH to take its checkpoints on its own schedule. While the MH moves around the cells, the checkpoints of an MH has to be stored on several MSSs, and hence, in case of a failure of the MH, a mechanism to trace and retrieve the proper checkpoint must be provided. Moreover, checkpointing and recovery schemes enforcing any kind of synchronization or coordination among the MHs may not be used, because of the frequent disconnection to save the power consumption of MHs.

Considering these properties, coordinated checkpointing and recovery schemes [9], [12], [13], [23], which require the processes in the system to synchronize their checkpointing and recovery activities, are not suitable for the mobile environment. The low network bandwidth cannot afford the large number of coordination messages exchanged between the MHs, and also some MHs temporarily disconnected from the network may not participate in the coordination. Some schemes have been proposed to achieve the checkpointing coordination with the less message overhead and the fewer number of processes participating in the coordination [8], [19].

Communication-pattern-based checkpointing [20] allows a process to take consistent checkpoints independently, whenever it changes its communication status from the sending mode to the receiving mode. This scheme does not require any coordination overhead for checkpointing and recovery, and an extended version of this scheme for the mobile environment has been proposed in [1]. However, since the frequency of checkpointing in this scheme depends on the communication pattern of an MH, the MH may have to transfer a checkpoint with every outgoing message, in the worst case, and the wireless network with the low bandwidth cannot afford it.

Communication-induced checkpointing [7], [24], [26] is the scheme with the least checkpointing overhead, since an index or a timestamp carried in a message makes the processes to eventually take globally consistent checkpoints. For the mobile computing environment, some extended algorithms of this checkpointing scheme have been proposed [15], [16]. However, when the recovery is concerned, checkpointing-only schemes have a common problem in rollback. Because of the livelock problem [13], which causes recursive rollbacks, either the rollback of the related processes has to be synchronized as proposed in [13] or a centralized coordination is required [6].

One way to guarantee the asynchronous recovery [10], [22] is to use the message logging in addition to the checkpointing. Causal logging scheme [3], [4], [11], however, requires a large size of log space and a large amount of dependency information to be carried in a message, which can be a serious drawback in the mobile environment. Hence, instead of causal logging, a scheme to implement the optimistic logging [21], [22], [25] with the less message overhead and a little inference with MHs has been proposed in [17]. Also, in spite of the heavy stable storage access overhead, pessimistic logging is considered as a good design choice for the mobile environment, because of its simple implementation and the capability of asynchronous recovery [18], [27].

In the mobile computing environment, the MHs are frequently disconnected from the network without a failure and the exchange of a large number of coordination messages is considered too costly. Hence, recovery coordination among the MHs is not plausible and asynchronous recovery must be sought. With the asynchronous recovery, a process can independently decide its rollback and after the rollback, it can immediately resume the computation without waiting for any coordination message from another process. To achieve the asynchrony in recovery, it is desirable to log the messages in addition to the independent checkpointing.

Most checkpointing schemes for mobile computing systems have suggested that MSSs should manage the checkpoints for the MHs, because of the storage limitation of MHs. The stable storage of the MSS is also a reasonable choice for the message log. Since the messages heading to an MHs are routed through the MSSs, message logging by the MSS may not impose any extra communication overhead. However, the mobility of MHs introduces a new design issue for the storage management. As an MH moves around the cells, the storages for checkpoints and message logs of an MH become dispersed over a number of MSSs, and in case of a failure, an MH must locate the latest or any consistent checkpoint and also has to locate the sequence of logged messages, which in turn increase the recovery cost.

Considering relatively high failure rate of MHs, instant recovery from a failure must be very important. However, checkpoints and logs distributed over the network may cause the severe message traffic and the delay in recovery. For the efficient implementation, the mobility of MHs must be carefully traced and a proper mechanism for gathering distributed recovery information must be prepared. Also, since the size of a stable storage managed by each MSS is limited, checkpoints and logs no longer required for any recovery must be discarded for the reuse of the space. Such garbage collection may have to put extra communication overhead if the logs are dispersed over a large number of cells.

Some works related to the distributed storage management have been proposed in [18], [27]. For the fast recovery, it is desirable for checkpoints and message logs to be near the MH on recovery, and hence, checkpoints and logs of an MH in [18] keep moving as the MH performs the handoff between two cells. As a result, instant recovery can be possible, however, the failure-free communication overhead cannot be negligible, considering the size of a checkpoint and logged messages. One suggestion made in [27] utilizes the home of each MH to maintain the recovery information. As an MH moves, it transfers checkpoints or logs to the home, and in case of a failure, it can find the recovery information at home. However, if the MH is far from home, the transfer cost can also be a problem.

This paper presents an efficient recovery scheme based on pessimistic message logging for the mobile computing environment. With the message logging and periodic checkpointing, asynchronous recovery can be achieved even in case of multiple and concurrent failure occurrences. For the management of the recovery information, such as checkpoints and message logs, the movement-based scheme is used. The MH carrying its recovery information to its current MSS can recover instantly in case of a failure. However, the MSSs visited by the MH have to experience the high failure-free cost to transfer the recovery information and access the stable storage. On the other hand, the recovery cost can be too high, if the recovery information is dispersed over a large number of support stations. The movement-based scheme considers both of the costs. While the MH moves within a certain range, recovery information of the MH is not moved. However, if the MH moves out of the range, it transfers the recovery information nearby. As a result, the scheme controls the transfer cost as well as the recovery cost. The proposed scheme is evaluated with the earlier schemes in [18] with extensive simulation study.

The rest of this paper is organized as follows: Section 2 describes the system model and the basic components of fault-tolerant mobile computing systems. In Section 3, the proposed recovery scheme is presented and the performance of the proposed scheme is compared with the earlier ones with the extensive simulation results in Section 4. Finally, Section 5 concludes the paper.

Section snippets

Mobile computing system model

The mobile computing system considered in this paper follows the model presented in [1], [2]. The system consists of a set of mobile hosts (MHs) and static mobile support stations (MSSs). A set of dynamic and wireless communication link can be established between an MH and MSS; and a set of high speed static and wired communication link is assumed between MSSs. A region covered by an MSS is called a cell. An MH residing in a cell can be connected to the MSS servicing the cell and the MH can

The proposed recovery scheme

The proposed recovery scheme is based on independent checkpointing, pessimistic message logging and asynchronous rollback-recovery. For the efficient management of distributed log storages, a scheme based on MH’s movement is proposed.

Performance study

The performance of the proposed scheme is evaluated and compared with the earlier schemes proposed in [18].

Conclusions

In this paper, we have presented an efficient recovery scheme based on message logging and checkpointing for mobile computing systems. The mobile host carrying its recovery information to its current mobile support station can recover instantly in case of a failure. However, the mobile support stations visited by the mobile host have to experience high failure-free execution cost to transfer the recovery information and access the stable storage. On the other hand, the recovery cost can be too

Taesoon Park received her BS degree in computer engineering from Seoul National University, Seoul, Korea, in 1987, and MS and PhD degrees in computer science from Texas A&M University, College Station, TX, in 1989 and 1994, respectively. She is currently an Assistant Professor in Sejong University, Seoul, Korea. Her research interests are in the areas of fault-tolerant and distributed computing systems.

References (27)

  • K. Venkatesh et al.

    Optimal checkpointing and local recording for domino-free rollback recovery

    Inform. Proc. Lett.

    (1987)
  • A. Acharya, B.R. Badrinath, Checkpointing distributed applications on mobile computers, in: Proceedings of the Third...
  • I.F. Ayildiz, J.S.M. Ho, On location management for personal communications networks, in: Proceedings of the IEEE...
  • L. Alvisi, B. Hoppoe, K. Marzullo, Nonblocking and orphan-free message logging protocols, in: Proceedings of the 23rd...
  • L. Alvisi, K. Marzullo, Message logging: pessimistic, optimistic and causal, in: Proceedings of the 15th International...
  • B.R. Badrinath, A. Acharya, T. Imielinski, Structuring distributed algorithms for mobile hosts, in: Proceedings of the...
  • B. Bhargava, S.R. Lian, Independent checkpointing and concurrent rollback for recovery—an optimistic approach, in:...
  • D. Briatico, A. Ciuffoletti, L. Simoncini, A distributed domino-effect free recovery algorithm, in: Proceedings of the...
  • G. Cao, M. Singhal, Low-cost checkpointing with mutable checkpoints in mobile computing systems, in: Proceedings of the...
  • M. Chandy et al.

    Distributed snapshot: determining global states of distributed systems

    ACM Trans. Comput. Syst.

    (1985)
  • O.P. Damani, V.K. Garg, How to recover efficiently and asynchronously when optimism fails, in: Proceedings of the 16th...
  • E.N. Elnozahy et al.

    Manetho: transparent rollback-recovery with low overhead, limited rollback, and fast output commit

    IEEE Trans. Comput.

    (1992)
  • J.L. Kim et al.

    An efficient algorithm for checkpointing recovery in distributed systems

    IEEE Trans. Parallel Distrib. Syst.

    (1993)
  • Cited by (0)

    Taesoon Park received her BS degree in computer engineering from Seoul National University, Seoul, Korea, in 1987, and MS and PhD degrees in computer science from Texas A&M University, College Station, TX, in 1989 and 1994, respectively. She is currently an Assistant Professor in Sejong University, Seoul, Korea. Her research interests are in the areas of fault-tolerant and distributed computing systems.

    Namyoon Woo received his BS and MS degree in Computer Science from Korea Advanced Institute of Science and Technology (KAIST), Taejon, Korea, in 1997 and 1999 respectively. Currently he is a PhD student at Seoul National University. His research interests include fault tolerance, distributed systems, grids, multiple task scheduling and user-level communications.

    Heon Y. Yeom is an Associate Professor with the Department of Computer Science and Engineering, Seoul National University. He received his BS degree in computer science from Seoul National University in 1984 and received his MS and PhD degrees in computer science from Texas A&M University, in 1986 and 1992, respectively. From 1986 to 1990, he worked with Texas Transportation Institute as a Systems Analyst and from 1992 to 1993, he was with Samsung Data Systems as a Research Scientist. He joined the Department of Computer Science, Seoul National University in 1993, where he currently teaches and researches on distributed systems, multimedia systems and transaction processing.

    An earlier version of this work has appeared in the Proceedings of the 2001 International Conference on Parallel and Distributed Systems.

    View full text