Soft-Checkpointing Based Hybrid Synchronous Checkpointing Protocol for Mobile Distributed Systems

Minimum-process coordinated checkpointing is a suitable approach to introduce fault tolerance in mobile distributed systems transparently. To balance the checkpointing overhead against the loss of computation on recovery, the authors propose a hybrid checkpointing algorithm, wherein an all-process coordinated checkpoint is taken after the minimum-process coordinated checkpointing algorithm has been executed a fixed number of times. In coordinated checkpointing, if a single process fails to take its checkpoint, all the checkpointing effort is wasted, because each process has to abort its tentative checkpoint. To take a tentative checkpoint, an MH (Mobile Host) needs to transfer large checkpoint data to its local MSS (Mobile Support Station) over wireless channels. The authors therefore propose that, in the first phase, all concerned MHs take only soft checkpoints. A soft checkpoint is similar to a mutable checkpoint. If some process fails to take its checkpoint in the first phase, the MHs need to abort only their soft checkpoints. The effort of taking a soft checkpoint is negligible compared with that of a tentative one. In the minimum-process coordinated checkpointing algorithm, an effort has been made to minimize the number of useless checkpoints and the blocking of processes using a probabilistic approach.
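The two-phase idea can be illustrated with a minimal Python sketch. It is not the authors' protocol: the `MobileHost` class, the `state_size` byte count, and the `mss_storage` dictionary standing in for the MSS's stable storage are all hypothetical, and mutable-checkpoint handling, dependency tracking, and the final commit of tentative checkpoints are omitted. It only shows why aborting after the soft-checkpoint phase costs no wireless transfer.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class MobileHost:
    """Hypothetical mobile host; `state_size` models checkpoint data in bytes."""
    name: str
    state_size: int
    can_checkpoint: bool = True      # False simulates a failure in phase 1
    soft: Optional[int] = None       # soft checkpoint kept locally on the MH
    wireless_bytes: int = 0          # bytes sent to the local MSS so far

    def take_soft_checkpoint(self) -> bool:
        # Phase 1: save state locally only; nothing crosses the wireless link.
        if not self.can_checkpoint:
            return False
        self.soft = self.state_size
        return True

    def discard_soft_checkpoint(self) -> None:
        # Aborting a soft checkpoint costs no wireless transfer.
        self.soft = None

    def convert_to_tentative(self, mss_storage: dict) -> None:
        # Phase 2: ship the soft checkpoint to the MSS as a tentative checkpoint.
        assert self.soft is not None
        self.wireless_bytes += self.soft
        mss_storage[self.name] = self.soft
        self.soft = None


def coordinated_checkpoint(hosts: list[MobileHost], mss_storage: dict) -> bool:
    """Return True if every host's checkpoint reached the MSS."""
    # Phase 1: every concerned MH takes a soft checkpoint.
    results = [h.take_soft_checkpoint() for h in hosts]
    if not all(results):
        # One failure: everyone drops only a cheap, local soft checkpoint.
        for h in hosts:
            h.discard_soft_checkpoint()
        return False
    # Phase 2: only now is checkpoint data transferred over wireless channels.
    for h in hosts:
        h.convert_to_tentative(mss_storage)
    # Making the tentative checkpoints permanent is not modelled here.
    return True
```

If any host fails in phase 1, every host's `wireless_bytes` stays at zero; only a fully successful first phase triggers the expensive transfer to the MSS.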

Development of Distributed Systems from Design to Application and Maintenance, 2012, pp. 87-100

DOI: 10.4018/978-1-4666-2647-8.ch006

Author(s): Parveen Kumar, Rachit Garg

Keyword(s): Distributed Systems, Fault Tolerance, Wireless Channels, Probabilistic Approach, Fixed Number, Mobile Host, Single Process, Coordinated Checkpointing, Coordinated Checkpoint

A Low-Cost Hybrid Coordinated Checkpointing Protocol for Mobile Distributed Systems

Mobile Information Systems, Vol 4 (1), pp. 13-32, 2008 (cited by ~7)

DOI: 10.1155/2008/982349

Author(s): Parveen Kumar

Keyword(s): Distributed Systems, Low Cost, Fixed Number, Mobile Nodes, Battery Power, Coordinated Checkpointing, The Cost, Coordinated Checkpoint, Low Activity, Blocking Algorithm

Mobile distributed systems raise new issues such as mobility, low bandwidth of wireless channels, disconnections, limited battery power and lack of reliable stable storage on mobile nodes. In minimum-process coordinated checkpointing, some processes may not take a checkpoint for several checkpoint initiations. In the case of a recovery after a fault, such processes may roll back to a much earlier checkpointed state and thus cause a greater loss of computation. In all-process coordinated checkpointing, the recovery line is advanced for all processes, but the checkpointing overhead may be exceedingly high. To optimize both metrics, the checkpointing overhead and the loss of computation on recovery, we propose a hybrid checkpointing algorithm, wherein an all-process coordinated checkpoint is taken after the minimum-process coordinated checkpointing algorithm has been executed a fixed number of times. Thus, mobile nodes with low activity or in doze mode need not be disturbed by minimum-process checkpointing, and the recovery line is advanced for every process after each all-process checkpoint. Additionally, we try to minimize the information piggybacked onto each computation message. For minimum-process checkpointing, we design a blocking algorithm in which no useless checkpoints are taken and an effort has been made to minimize the blocking of processes. We propose to delay selected messages at the receiver end. By doing so, processes are allowed to perform their normal computation, send messages and partially receive messages during their blocking period. The proposed minimum-process blocking algorithm takes no useless checkpoints at the cost of a very small amount of blocking.
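The alternation between the two kinds of coordinated checkpoint can be sketched as a simple counter. This is a hedged illustration, not the paper's algorithm: the class name, the parameter `k` (standing in for the "fixed number" of minimum-process rounds), and the assumption that each round's dependent set is computed elsewhere are all hypothetical.

```python
class HybridCheckpointScheduler:
    """Decide which kind of coordinated checkpoint each initiation runs.

    `k` is a hypothetical tuning parameter: the number of minimum-process
    checkpoints taken before an all-process checkpoint is forced.
    """

    def __init__(self, k: int, all_processes: set[str]):
        if k < 1:
            raise ValueError("k must be at least 1")
        self.k = k
        self.all_processes = set(all_processes)
        self.min_process_rounds = 0

    def next_round(self, dependent: set[str]) -> tuple[str, set[str]]:
        """Return the checkpoint kind and the processes asked to checkpoint.

        `dependent` is the minimum set for this initiation (computed elsewhere).
        """
        if self.min_process_rounds == self.k:
            # Advance the recovery line for every process, including
            # low-activity or dozing mobile nodes skipped so far.
            self.min_process_rounds = 0
            return "all-process", set(self.all_processes)
        self.min_process_rounds += 1
        # Only processes in the dependency set are disturbed.
        return "minimum-process", set(dependent)
```

With `k = 2`, successive initiations yield minimum-process, minimum-process, all-process, and then the cycle repeats. A larger `k` disturbs dozing nodes less often but lets their recovery line lag further behind.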

A Minimum-Process Coordinated Checkpointing Protocol for Mobile Computing Systems

International Journal of Foundations of Computer Science, Vol 19 (04), pp. 1015-1038, 2008 (cited by ~3)

DOI: 10.1142/s0129054108006108

Author(s): Sunil Kumar Gupta, R. K. Chauhan, Parveen Kumar

Keyword(s): Distributed Systems, Wireless Channels, Mobile Nodes, High Failure Rate, Computing Systems, Battery Power, The Status, Minimum Number, Status Information, Coordinated Checkpointing

A checkpoint is a designated place in a program at which normal processing is interrupted specifically to preserve the status information necessary to allow processing to resume at a later time. A checkpointing algorithm for mobile distributed systems needs to handle many new issues, such as mobility, low bandwidth of wireless channels, lack of stable storage on mobile nodes, disconnections, limited battery power and a high failure rate of mobile nodes. These issues make traditional checkpointing techniques unsuitable for such environments. Minimum-process coordinated checkpointing is an attractive approach to introduce fault tolerance in mobile distributed systems transparently. This approach is domino-free, requires at most two checkpoints of a process on stable storage, and forces only a minimum number of processes to checkpoint. However, it requires extra synchronization messages, blocking of the underlying computation, or taking some useless checkpoints. In this paper, we design a minimum-process checkpointing algorithm for mobile distributed systems in which no useless checkpoint is taken. We reduce the blocking of processes by allowing them to perform their normal computations, send messages and receive selected messages during their blocking period.
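In the usual formulation, forcing "only a minimum number of processes" to checkpoint means checkpointing the initiator together with every process it transitively depends on. The sketch below computes that set with a breadth-first search over a hypothetical `depends_on` map. It shows the generic idea behind minimum-process protocols, not this paper's specific algorithm, which tracks dependencies with piggybacked information and also handles blocking and selective message reception.

```python
from collections import deque


def minimum_checkpoint_set(initiator: str, depends_on: dict[str, set[str]]) -> set[str]:
    """Processes that must checkpoint along with `initiator`.

    `depends_on[p]` holds the processes from which p has received a message
    since its last checkpoint (a hypothetical input format; real protocols
    track this with dependency information piggybacked on messages).
    """
    needed = {initiator}
    queue = deque([initiator])
    while queue:
        p = queue.popleft()
        # Any sender whose message p has already received must also
        # checkpoint, otherwise that message would become an orphan.
        for q in depends_on.get(p, set()):
            if q not in needed:
                needed.add(q)
                queue.append(q)
    return needed
```

For example, if p1 has received from p2 and p2 from p3, a checkpoint initiated by p1 must include p2 and p3, whereas p4, which has only received from p1, can be left alone: p1's new checkpoint records that send, so the message is merely in transit rather than orphaned.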

Theory and Practice of Fault Tolerance in Distributed Systems

1987

DOI: 10.21236/ada187559

Author(s): K. M. Chandy, J. Misra

Keyword(s): Distributed Systems, Fault Tolerance, Theory And Practice

The robust middleware approach for transparent and systematic fault tolerance in parallel and distributed systems

2003 International Conference on Parallel Processing, 2003. Proceedings.

DOI: 10.1109/icpp.2003.1240566

Author(s): Chi-Hsiang Yeh

Keyword(s): Distributed Systems, Fault Tolerance

Fault Tolerance in Distributed Systems Using Fused Data Structures

IEEE Transactions on Parallel and Distributed Systems, Vol 24 (4), pp. 701-715, 2013 (cited by ~7)

DOI: 10.1109/tpds.2012.96

Author(s): B. Balasubramanian, V. K. Garg

Keyword(s): Distributed Systems, Fault Tolerance, Data Structures

Optimizing fault tolerance in embedded distributed systems

IEEE Micro, Vol 20 (4), pp. 76-84, 2000

DOI: 10.1109/40.865869

Author(s): S. Draber

Keyword(s): Distributed Systems, Fault Tolerance

Consistency of Replicated Datasets in Grid Computing

Handbook of Research on Grid Technologies and Utility Computing, 2009, pp. 49-58

DOI: 10.4018/978-1-60566-184-1.ch006

Author(s): Gianni Pucciani, Flavia Donno, Andrea Domenici, Heinz Stockinger

Keyword(s): Distributed Systems, Fault Tolerance, Grid Computing, Data Management, Data Replication, Data Access, Grid Middleware, Pros And Cons, Consistency Problem, Replica Consistency

Data replication is a well-known technique used in distributed systems in order to improve fault tolerance and make data access faster. Several copies of a dataset are created and placed at different nodes, so that users can access the replica closest to them, and at the same time the data access load is distributed among the replicas. In today’s Grid middleware solutions, data management services allow users to replicate datasets (i.e., flat files or databases) among storage elements within a Grid, but replicas are often considered read-only because of the absence of mechanisms able to propagate updates and enforce replica consistency. This entry analyzes the replica consistency problem and provides hints for the development of a Replica Consistency Service, highlighting the main issues and pros and cons of several approaches.
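One common way to keep replicas consistent is a single-master scheme with version numbers and asynchronous update propagation. The sketch below illustrates that one approach only; the class names, the `write`/`propagate`/`stale_replicas` methods, and the site names are hypothetical and do not describe any particular Grid middleware or the Replica Consistency Service discussed in the entry.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical copy of a dataset stored at one site."""
    site: str
    data: bytes = b""
    version: int = 0


@dataclass
class MasterReplicaCatalog:
    """Hypothetical single-master scheme: only the master accepts writes."""
    master: Replica
    secondaries: list[Replica] = field(default_factory=list)
    pending: list[Replica] = field(default_factory=list)

    def write(self, data: bytes) -> int:
        # Update the master copy and bump its version number.
        self.master.data = data
        self.master.version += 1
        # Asynchronous propagation: secondaries are queued, not updated yet.
        self.pending = list(self.secondaries)
        return self.master.version

    def propagate(self) -> None:
        # Push the master's current content to every queued secondary.
        for r in self.pending:
            r.data, r.version = self.master.data, self.master.version
        self.pending.clear()

    def stale_replicas(self) -> list[str]:
        # A replica is stale if its version lags behind the master's.
        return [r.site for r in self.secondaries if r.version < self.master.version]
```

Between `write` and `propagate`, readers of a secondary replica can observe stale data, which is the trade-off asynchronous schemes accept in exchange for faster writes.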

Increasing the fault tolerance of distributed systems for the Hyper de Bruijn topology with excess code

2019 IEEE International Conference on Advanced Trends in Information Theory (ATIT)

DOI: 10.1109/atit49449.2019.9030487

Author(s): Heorhii Loutskii, Artem Volokyta, Pavlo Rehida, Oleksandr Honcharenko, Bohdan Ivanishchev, ...

Keyword(s): Distributed Systems, Fault Tolerance, De Bruijn
