Optimal equidistant checkpointing of fault tolerant systems subject to correlated failure
Checkpointing is a technique to back up work at periodic intervals so that if computation fails, it will not be necessary to restart from the beginning but will instead be able to restart from the latest checkpoint. Performing checkpointing operations requires time. Therefore, it is necessary to consider the tradeoff between the time to perform checkpointing operations and the time saved when computation restarts at a checkpoint. This article presents a method to model the impact of correlated failures on an application that performs a specified amount of computation and implements checkpointing operations at equidistant periods during this computation. We develop a Markov model and superimpose a correlated life distribution. Two cases are considered. The first assumes that reaching a checkpoint resets the failure distribution. The second allows the probability of failure to progress. We illustrate the approach through a series of examples. The results indicate that correlation can negatively impact checkpointing, necessitating more frequent checkpointing and increasing the total time required, but that the approach can still identify the optimal number of equidistant checkpoints, despite this correlation.