Fault-Tolerant Distributed Computing in Full-Information Networks

Author(s):  
Shafi Goldwasser ◽  
Elan Pavlov ◽  
Vinod Vaikuntanathan
Author(s):  
Yongning Zhai ◽  
Weiwei Li

For the distributed computing system, excessive or deficient checkpointing operations would result in severe performance degradation. To minimize the expected computation execution of the long-running application with a general failure distribution, the optimal equidistant checkpoint interval for fault tolerant performance optimization is analyzed and derived in this paper. More precisely, the optimal checkpointing period to determine the proper checkpoint sequence is proposed, and the derivation of the expected effective rate of the defined computation cycle is introduced. Corresponding to the maximal expected effective rate, the constraint of the optimal checkpoint sequence can be obtained. From the constraint of optimality, the optimal equidistant checkpoint interval can be obtained according to the minimal fault tolerant overhead ratio. By the numerical results, the proposal is practical to determine a proper equidistant checkpoint interval for fault tolerant performance optimization.


Sign in / Sign up

Export Citation Format

Share Document