A fault tolerant systolic mesh for linear system solution

1998 ◽  
Vol 67 (3-4) ◽  
pp. 315-332
Author(s):  
K. Bhuvaneswari ◽  
C. Siva Ram Murthy
Automatica ◽  
2012 ◽  
Vol 48 (8) ◽  
pp. 1676-1682 ◽  
Author(s):  
Lijun Liu ◽  
Yi Shen ◽  
Earl H. Dowell ◽  
Chunhui Zhu

2013 ◽  
Vol 8 (3) ◽  
Author(s):  
Shaohua Wang ◽  
Juan Liu ◽  
Lin Chen

2021 ◽  
Vol 8 (4) ◽  
pp. 1-19
Author(s):  
Xuejiao Kang ◽  
David F. Gleich ◽  
Ahmed Sameh ◽  
Ananth Grama

As parallel and distributed systems scale, fault tolerance is an increasingly important problem, particularly on systems with limited I/O capacity and bandwidth. Erasure-coded computations address this problem by augmenting a given problem instance with redundant data and then solving the augmented problem in a fault-oblivious manner in a faulty parallel environment. In the event of faults, a computationally inexpensive procedure is used to compute the true solution from a potentially fault-prone solution. These techniques are significantly more efficient than conventional solutions to the fault tolerance problem. In this article, we show how we can minimize, to optimality, the overhead associated with our problem augmentation techniques for linear system solvers. Specifically, we present a technique that adaptively augments the problem only when faults are detected. At any point in execution, we only solve a system whose size is identical to the original input system. This has several advantages in terms of maintaining the size and conditioning of the system, as well as in only adding the minimal amount of computation needed to tolerate observed faults. We present, in detail, the augmentation process, the parallel formulation, and an evaluation of the performance of our technique. Specifically, we show that the proposed adaptive fault tolerance mechanism has minimal overhead in terms of FLOP counts with respect to the original solver executing in a non-faulty environment, has good convergence properties, and yields excellent parallel performance. We also demonstrate that our approach significantly outperforms an optimized application-level checkpointing scheme that checkpoints only the needed data structures.
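
The idea of replacing lost unknowns with coded ones while keeping the system at its original size can be illustrated with a small sketch. This is not the authors' algorithm: the function name `solve_with_erasures`, the Vandermonde-style encoding matrix `E`, the `failed` index set, and the dense direct solver (standing in for an iterative parallel solver) are all assumptions made for illustration. With Z built from the identity columns of the surviving unknowns plus the k columns of E, solving (Z^T A Z) z = Z^T b and setting x = Z z recovers the solution of A x = b whenever Z is invertible.

```python
def matmul(X, Y):
    """Dense matrix product of two lists-of-rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def transpose(X):
    return [list(col) for col in zip(*X)]

def solve(M, rhs):
    """Solve M z = rhs by Gaussian elimination with partial pivoting."""
    n = len(M)
    aug = [list(M[i]) + [rhs[i]] for i in range(n)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(aug[r][c]))
        aug[c], aug[p] = aug[p], aug[c]
        for r in range(c + 1, n):
            f = aug[r][c] / aug[c][c]
            for j in range(c, n + 1):
                aug[r][j] -= f * aug[c][j]
    z = [0.0] * n
    for i in reversed(range(n)):
        s = sum(aug[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (aug[i][n] - s) / aug[i][i]
    return z

def solve_with_erasures(A, b, E, failed):
    """Hypothetical helper: recover x with A x = b when unknowns in `failed` are lost.

    Assumes len(failed) equals the number of columns of E, so the coded
    system has the same size n as the original one.
    """
    n, k = len(A), len(E[0])
    assert len(failed) == k
    alive = [j for j in range(n) if j not in failed]
    # Z = [identity columns of surviving unknowns | encoding columns], n x n.
    Z = [[1.0 if i == j else 0.0 for j in alive] + list(E[i]) for i in range(n)]
    Zt = transpose(Z)
    M = matmul(matmul(Zt, A), Z)  # coded system, still n x n
    rhs = [sum(Zt[i][j] * b[j] for j in range(n)) for i in range(n)]
    z = solve(M, rhs)
    return [sum(Z[i][j] * z[j] for j in range(n)) for i in range(n)]  # x = Z z

# Small symmetric positive definite test problem (tridiagonal 4, -1).
n = 6
A = [[4.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(n)]
     for i in range(n)]
b = [float(i + 1) for i in range(n)]
# Assumed encoding: rows [1, i+1]; any two distinct rows are linearly independent.
E = [[1.0, float(i + 1)] for i in range(n)]

x_ref = solve(A, b)                                   # fault-free reference
x_rec = solve_with_erasures(A, b, E, failed={1, 4})   # unknowns 1 and 4 lost
err = max(abs(u - v) for u, v in zip(x_ref, x_rec))
print("max abs difference:", err)
```

The recovery step here is a single matrix-vector product, which is one way to read the abstract's "computationally inexpensive procedure"; a practical scheme would choose `E` with conditioning in mind and would not form Z^T A Z densely.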


1992 ◽  
Vol 11 (3) ◽  
pp. 141-145 ◽  
Author(s):  
Luís Alfredo V de Carvalho ◽  
Valmir C Barbosa
