High performance fault-tolerance for clouds

Author(s):  
Dimosthenis Kyriazis ◽  
Vasileios Anagnostopoulos ◽  
Andrea Arcangeli ◽  
David Gilbert ◽  
Dimitrios Kalogeras ◽  
...  
2017 ◽  
Vol 16 ◽  
pp. 1-10 ◽  
Author(s):  
Naresh Kumar Reddy Beechu ◽  
Vasantha Moodabettu Harishchandra ◽  
Nithin Kumar Yernad Balachandra

Author(s):  
Simon McIntosh–Smith ◽  
Rob Hunt ◽  
James Price ◽  
Alex Warwick Vesztrocy

High-performance computing systems continue to increase in size in the quest for ever higher performance. The resulting increased electronic component count, coupled with the decrease in feature sizes of the silicon manufacturing processes used to build these components, may result in future exascale systems being more susceptible to soft errors caused by cosmic radiation than in current high-performance computing systems. Through the use of techniques such as hardware-based error-correcting codes and checkpoint-restart, many of these faults can be mitigated at the cost of increased hardware overhead, run-time, and energy consumption that can be as much as 10–20%. Some predictions expect these overheads to continue to grow over time. For extreme scale systems, these overheads will represent megawatts of power consumption and millions of dollars of additional hardware costs, which could potentially be avoided with more sophisticated fault-tolerance techniques. In this paper we present new software-based fault tolerance techniques that can be applied to one of the most important classes of software in high-performance computing: iterative sparse matrix solvers. Our new techniques enables us to exploit knowledge of the structure of sparse matrices in such a way as to improve the performance, energy efficiency, and fault tolerance of the overall solution.


1992 ◽  
Vol 02 (03) ◽  
pp. 281-304
Author(s):  
SANJAY P. POPLI ◽  
MAGDY A. BAYOUMI ◽  
AKASH TYAGI

Real-time digital signal processing (DSP) applications require high performance parallel architectures that are also reliable. VLSI arrays are good candidates for providing the required high throughput for these applications. These arrays which consist of a number of regularly interconnected processing elements (PEs) will not function correctly in the presence of even a single fault in any of the PEs. Fault tolerance has therefore become a vital design criterion for VLSI arrays. In this paper, a fault tolerance strategy for VLSI arrays is proposed, which significantly improves the reliability of the system. The fault tolerance scheme is composed of two phases: testing and locating faults (fault detection and diagnosis), and reconfiguration. The first phase employs an on-line error detection technique which achieves a compromise between the space and time redundancy approaches. This concurrent error detection technique reduces the rollback time considerably. The reconfiguration phase is achieved by using a global control responsible for changing the states of the switches in the interconnection network. Backtracking is introduced into the algorithm for maximizing the processor utilization, at the same time keeping the complexity of the interconnection network as simple as possible. Finally, a reliability analysis of this scheme using a Markov model and a comparison with some previous schemes are given.


Author(s):  
Marc Casas ◽  
Wilfried N Gansterer ◽  
Elias Wimmer

We investigate the usefulness of gossip-based reduction algorithms in a high-performance computing (HPC) context. We compare them to state-of-the-art deterministic parallel reduction algorithms in terms of fault tolerance and resilience against silent data corruption (SDC) as well as in terms of performance and scalability. New gossip-based reduction algorithms are proposed, which significantly improve the state-of-the-art in terms of resilience against SDC. Moreover, a new gossip-inspired reduction algorithm is proposed, which promises a much more competitive runtime performance in an HPC context than classical gossip-based algorithms, in particular for low accuracy requirements.


Sign in / Sign up

Export Citation Format

Share Document