High Performance Computing Systems with Various Checkpointing Schemes

Finding the failure rate of a system is a crucial step in the analysis of high performance computing systems. To deal with failures, a fault-tolerant mechanism called the checkpoint/restart technique was introduced. However, performing this mechanism incurs additional costs. Thus, we propose two models for different schemes (full and incremental checkpoint schemes). The models, which are based on the reliability of the system, are used to determine the checkpoint placements. Both proposed models balance checkpoint overhead against re-computing time. Because each incremental checkpoint adds extra cost during the recovery period, a method to find the number of incremental checkpoints between two consecutive full checkpoints is given. Our simulation suggests that in most cases our incremental checkpoint model reduces the waste time more than the full checkpoint model does. The waste times produced by both models range from 2% to 28% of the application completion time, depending on the checkpoint overheads.
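The trade-off between checkpoint overhead and re-computing time can be illustrated with a minimal sketch. It does not reproduce the reliability-based models of the abstract above; it swaps in Young's classical first-order approximation, which assumes a constant failure rate. The function names and the example figures (a 60 s checkpoint cost, a 24 h mean time between failures) are assumptions chosen for illustration only.

```python
import math


def young_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's first-order approximation of the checkpoint interval.

    Assumes a constant failure rate (exponential failures) and a
    checkpoint cost much smaller than the mean time between failures.
    Both arguments use the same time unit.
    """
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)


def waste_fraction(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order estimate of the fraction of time wasted.

    The first term is the overhead of writing checkpoints; the second is
    the expected re-computation after a failure (half an interval on
    average, once per MTBF).
    """
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


if __name__ == "__main__":
    cost = 60.0          # hypothetical checkpoint cost in seconds
    mtbf = 24 * 3600.0   # hypothetical mean time between failures in seconds
    best = young_interval(cost, mtbf)
    for label, t in (("half", best / 2), ("optimal", best), ("double", best * 2)):
        print(f"{label:8s} interval={t:9.1f}s waste={waste_fraction(t, cost, mtbf):.4f}")
```

Checkpointing more often than the estimated interval raises the overhead term, and checkpointing less often raises the re-computation term, which is the balance the abstract refers to.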


Algorithm Based Fault Tolerant and Check Pointing for High Performance Computing Systems

Journal of Applied Sciences ◽ 10.3923/jas.2009.3947.3956 ◽ 2009 ◽ Vol 9 (22) ◽ pp. 3947-3956 ◽ Cited By ~ 8

Author(s): Hodjatollah Hamidi ◽ A. Vafaei ◽ A.H. Monadjemi

Keyword(s): High Performance Computing ◽ High Performance ◽ Fault Tolerant ◽ Computing Systems ◽ Performance Computing


FT-PBLAS: PBLAS-Based Fault-Tolerant Linear Algebra Computation on High-performance Computing Systems

IEEE Access ◽ 10.1109/access.2020.2975832 ◽ 2020 ◽ Vol 8 ◽ pp. 42674-42688

Author(s): Yanchao Zhu ◽ Yi Liu ◽ Guozhen Zhang

Keyword(s): High Performance Computing ◽ Linear Algebra ◽ High Performance ◽ Fault Tolerant ◽ Computing Systems ◽ Performance Computing ◽ Algebra Computation


Optimizing Checkpoint Restart with Data Deduplication

Scientific Programming ◽ 10.1155/2016/9315493 ◽ 2016 ◽ Vol 2016 ◽ pp. 1-11 ◽ Cited By ~ 6

Author(s): Zhengyu Chen ◽ Jianhua Sun ◽ Hao Chen

Keyword(s): Detailed Analysis ◽ High Performance Computing ◽ High Performance ◽ Fault Tolerant ◽ Data Deduplication ◽ Software Faults ◽ Distributed Programs ◽ Computing Systems ◽ Redundancy Elimination ◽ Performance Computing

The increasing scale of computer systems, in both size and complexity, brings more frequent hardware and software faults; thus fault-tolerant techniques have become an essential component of high-performance computing systems. Checkpoint restart is a typical and widely used method for tolerating runtime faults. However, the exploding sizes of checkpoint files that need to be saved to external storage pose a major scalability challenge, necessitating efficient approaches to reducing the amount of checkpointing data. In this paper, we first motivate the need for redundancy elimination with a detailed analysis of checkpoint data from real scenarios. Based on the analysis, we apply inline data deduplication to reduce checkpoint size. We use DMTCP, an open-source checkpoint restart package, to validate our method. Our experiments show that our method reduces the size of checkpoint files by 20% for single-computer programs and by 47% for distributed programs.
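The idea of inline deduplication of checkpoint data can be sketched as follows. This is not the authors' implementation and does not use any DMTCP interface; it is a minimal illustration assuming fixed-size chunks identified by SHA-256 digests, with an in-memory dictionary standing in for external storage. The names `dedup_write`, `restore`, and `CHUNK_SIZE` are hypothetical.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size in bytes


def dedup_write(data: bytes, store: dict, chunk_size: int = CHUNK_SIZE) -> list:
    """Split checkpoint data into fixed-size chunks and store each unique chunk once.

    Returns the "recipe": the ordered list of chunk digests needed to
    rebuild the checkpoint. `store` maps digest -> chunk bytes and is
    shared across checkpoints so repeated chunks are not stored again.
    """
    recipe = []
    for offset in range(0, len(data), chunk_size):
        chunk = data[offset:offset + chunk_size]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:  # only previously unseen chunks consume space
            store[digest] = chunk
        recipe.append(digest)
    return recipe


def restore(recipe: list, store: dict) -> bytes:
    """Rebuild the original checkpoint bytes from its recipe."""
    return b"".join(store[digest] for digest in recipe)


if __name__ == "__main__":
    store = {}
    # Synthetic "checkpoint" with two identical chunks and one distinct chunk.
    checkpoint = b"A" * (2 * CHUNK_SIZE) + b"B" * CHUNK_SIZE
    recipe = dedup_write(checkpoint, store)
    stored = sum(len(c) for c in store.values())
    print(f"logical size={len(checkpoint)} stored size={stored}")
    assert restore(recipe, store) == checkpoint
```

Fixed-size chunking is the simplest choice; it misses redundancy when data shifts by a few bytes, which content-defined chunking would handle at the cost of more computation.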
