A Performance and Energy Comparison of Fault Tolerance Techniques for Exascale Computing Systems

High-performance computing systems continue to increase in size in the quest for ever higher performance. The resulting increase in electronic component count, coupled with the decrease in feature sizes of the silicon manufacturing processes used to build these components, may make future exascale systems more susceptible to soft errors caused by cosmic radiation than current high-performance computing systems. Through the use of techniques such as hardware-based error-correcting codes and checkpoint-restart, many of these faults can be mitigated at the cost of increased hardware overhead, run-time, and energy consumption that can be as much as 10–20%. Some predictions expect these overheads to continue to grow over time. For extreme-scale systems, these overheads will represent megawatts of power consumption and millions of dollars of additional hardware costs, which could potentially be avoided with more sophisticated fault-tolerance techniques. In this paper we present new software-based fault tolerance techniques that can be applied to one of the most important classes of software in high-performance computing: iterative sparse matrix solvers. Our new techniques enable us to exploit knowledge of the structure of sparse matrices in such a way as to improve the performance, energy efficiency, and fault tolerance of the overall solution.

Toward Exascale Computing Systems: An Energy Efficient Massive Parallel Computational Model

International Journal of Advanced Computer Science and Applications ◽

10.14569/ijacsa.2018.090217 ◽

2018 ◽

Vol 9 (2) ◽

Author(s):

Muhammad Usman ◽

Fathy Alburaei ◽

Aiiad Ahmad ◽

Abdullah

Keyword(s):

Computational Model ◽

Energy Efficient ◽

Computing Systems ◽

Exascale Computing

A General Framework of Algorithm-Based Fault Tolerance Technique for Computing Systems

Analyzing Security, Trust, and Crime in the Digital World - Advances in Information Security, Privacy, and Ethics ◽

10.4018/978-1-4666-4856-2.ch001 ◽

2014 ◽

pp. 1-21 ◽

Cited By ~ 1

Author(s):

Hodjatollah Hamidi

Keyword(s):

Fault Tolerance ◽

Error Correction ◽

General Framework ◽

Fault Tolerant ◽

Convolutional Code ◽

Numerical Algorithms ◽

Convolutional Codes ◽

Computing Systems ◽

Specific Level ◽

Computing Paradigm

The Algorithm-Based Fault Tolerance (ABFT) approach transforms a system that does not tolerate a specific type of fault, called the fault-intolerant system, into a system that provides a specific level of fault tolerance, namely recovery. The ABFT philosophy leads directly to a model from which error correction can be developed. By employing an ABFT scheme with an effective convolutional code, the design allows high throughput as well as high fault coverage. The ABFT techniques that detect errors rely on the comparison of parity values computed in two ways. The parallel processing of input parity values produces output parity values that can be compared with parity values regenerated from the original processed outputs, and convolutional codes can be applied to provide the redundancy. This method is a new approach to concurrent error correction in fault-tolerant computing systems. This chapter proposes a novel computing paradigm to provide fault tolerance for numerical algorithms. The authors also present, implement, and evaluate early detection in ABFT.
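
The idea of comparing parity values computed in two ways can be illustrated with a minimal sketch. It assumes a dense matrix-vector product and a plain sum checksum as the parity; it is not the convolutional-code scheme the chapter describes, and it only detects an error rather than correcting it. The function names (`matvec`, `column_checksum`, `check_matvec`) and the tolerance `tol` are hypothetical choices made for this illustration.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def matvec(a: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product y = A x (the computation being protected)."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in a]


def column_checksum(a: Matrix) -> Vector:
    """Parity row of A: the sum of each column (assumed sum checksum)."""
    return [sum(col) for col in zip(*a)]


def check_matvec(a: Matrix, x: Vector, y: Vector, tol: float = 1e-9) -> bool:
    """Return True if the two parity values agree within a relative tolerance."""
    # Path 1: parity derived from the inputs, (1^T A) x.
    input_parity = sum(c * x_j for c, x_j in zip(column_checksum(a), x))
    # Path 2: parity regenerated from the computed outputs, 1^T y.
    output_parity = sum(y)
    return abs(input_parity - output_parity) <= tol * max(1.0, abs(input_parity))


a = [[2.0, 0.0, 1.0],
     [0.0, 3.0, 0.0],
     [4.0, 0.0, 5.0]]
x = [1.0, 2.0, 3.0]

y = matvec(a, x)
fault_free_ok = check_matvec(a, x, y)

# Simulate a soft error by corrupting one element of the output.
y_faulty = list(y)
y_faulty[1] += 1.0
faulty_ok = check_matvec(a, x, y_faulty)

print(fault_free_ok, faulty_ok)
```

In this sketch the fault-free result passes the check and the corrupted result fails it. A single sum checksum cannot locate or repair the faulty element; schemes with more redundancy are needed for correction.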

A Method to Support Fault Tolerance Design in Service Oriented Computing Systems

Theoretical and Analytical Service-Focused Systems Design and Development ◽

10.4018/978-1-4666-1767-4.ch019 ◽

2012 ◽

pp. 362-376

Author(s):

Domenico Cotroneo ◽

Antonio Pecchia ◽

Roberto Pietrantuono ◽

Stefano Russo

Keyword(s):

Fault Tolerance ◽

Common Ground ◽

Fault Injection ◽

Failure Behavior ◽

Tolerance Design ◽

System Failure ◽

Computing Systems ◽

Service Oriented Computing ◽

Service Oriented ◽

Tailored Design

Service Oriented Computing relies on the integration of heterogeneous software technologies and infrastructures that provide developers with a common ground for composing services and producing applications flexibly. Although this approach eases software development, it makes dependability a big challenge. Integrating such diverse software items raises issues that traditional testing cannot cope with exhaustively. In this context, tolerating faults, rather than attempting to detect them solely by testing, is a more suitable solution. This paper proposes a method to support a tailored design of fault tolerance actions for the system being developed. The method characterizes system failure behavior through an extensive fault injection campaign in order to identify its criticalities and adopt the most appropriate countermeasures to tolerate operational faults. The proposed method is applied to two distinct SOC-enabling technologies. Results show how the findings allow designers to understand the system failure behavior and plan fault tolerance.

Graph-Based Load Balancing Model for Exascale Computing Systems

11th International Conference on Theory and Application of Soft Computing, Computing with Words and Perceptions and Artificial Intelligence - ICSCCW-2021 - Lecture Notes in Networks and Systems ◽

10.1007/978-3-030-92127-9_33 ◽

2022 ◽

pp. 229-236

Author(s):

Araz R. Aliev ◽

Nigar T. Ismayilova

Keyword(s):

Load Balancing ◽

Computing Systems ◽

Exascale Computing

A Method to Support Fault Tolerance Design in Service Oriented Computing Systems

International Journal of Systems and Service-Oriented Engineering ◽

10.4018/jssoe.2010070105 ◽

2010 ◽

Vol 1 (3) ◽

pp. 75-89

Author(s):

Domenico Cotroneo ◽

Antonio Pecchia ◽

Roberto Pietrantuono ◽

Stefano Russo

Keyword(s):

Fault Tolerance ◽

Common Ground ◽

Fault Injection ◽

Failure Behavior ◽

Tolerance Design ◽

System Failure ◽

Computing Systems ◽

Service Oriented Computing ◽

Service Oriented ◽

Tailored Design

Service Oriented Computing relies on the integration of heterogeneous software technologies and infrastructures that provide developers with a common ground for composing services and producing applications flexibly. Although this approach eases software development, it makes dependability a big challenge. Integrating such diverse software items raises issues that traditional testing cannot cope with exhaustively. In this context, tolerating faults, rather than attempting to detect them solely by testing, is a more suitable solution. This paper proposes a method to support a tailored design of fault tolerance actions for the system being developed. The method characterizes system failure behavior through an extensive fault injection campaign in order to identify its criticalities and adopt the most appropriate countermeasures to tolerate operational faults. The proposed method is applied to two distinct SOC-enabling technologies. Results show how the findings allow designers to understand the system failure behavior and plan fault tolerance.
