Towards Building Resilient Scientific Applications: Resilience Analysis on the Impact of Soft Error and Transient Error Tolerance with the CLAMR Hydrodynamics Mini-App

Author(s):  
Qiang Guan ◽  
Nathan DeBardeleben ◽  
Brian Atkinson ◽  
Robert Robey ◽  
William M. Jones
2020 ◽  
Vol 20 (2) ◽  
pp. e14
Author(s):  
Diego Montezanti

Reliability and fault tolerance have become increasingly relevant concerns in HPC, owing to the growing probability that faults of various kinds will occur in these systems. This is fundamentally due to the increasing complexity of processors in the pursuit of higher performance, which raises the scale of integration and the number of components operating near their technological limits, making them increasingly prone to failure. Another contributing factor is the growth in the size of parallel systems, in terms of the number of cores and processing nodes, to obtain greater computational power. As applications demand longer uninterrupted computation times, the impact of faults grows, because of the cost of relaunching an execution that was aborted by a fault or that completed with erroneous results. Consequently, these applications must run on highly available and reliable systems, requiring strategies capable of providing fault detection, protection, and recovery. In the coming years, exascale systems are expected: supercomputers with millions of processing cores, capable of performing on the order of 10^18 operations per second. This is a great window of opportunity for HPC applications, but it also increases the risk that their executions will not complete.

Recent studies show that, as systems continue to incorporate more processors, the Mean Time Between Errors (MTBE) decreases, resulting in higher failure rates and a greater risk of corrupted results; large parallel applications are expected to face errors every few minutes, requiring external help to make efficient progress. Silent Data Corruptions are the most dangerous errors that can occur, since they can produce incorrect results in programs that appear to execute correctly. Scientific applications and large-scale simulations are the most affected, making silent error handling the main challenge on the road to resilience in HPC. In message-passing applications, a silent error affecting a single task can produce a corruption pattern that spreads to all communicating processes; in the worst case, the erroneous final results cannot be detected at the end of the execution and are accepted as correct. Since scientific applications run for hours or even days, it is essential to find strategies that allow them to reach correct solutions in bounded time despite the underlying failures. Such strategies also keep energy consumption from skyrocketing, since without them, executions would have to be relaunched from the beginning. However, the most popular parallel programming models used on supercomputers lack support for fault tolerance.
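The inverse relationship between system size and MTBE noted above can be made concrete with a short calculation. Below is a minimal Python sketch (illustrative only; it assumes independent, identically distributed node failures, and the per-node MTBE figure is a hypothetical value, not one taken from the abstract):

def system_mtbe_hours(node_mtbe_hours: float, num_nodes: int) -> float:
    """Under independent node failures, system MTBE scales as MTBE_node / N."""
    return node_mtbe_hours / num_nodes

node_mtbe = 5 * 365 * 24  # hypothetical: one error per node every 5 years (~43,800 h)
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} nodes -> system MTBE ~ {system_mtbe_hours(node_mtbe, n):.2f} hours")

At a hundred thousand such nodes, the system-level MTBE already drops below half an hour, consistent with the expectation that applications at scale must cope with errors arriving every few minutes.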


Electronics ◽  
2021 ◽  
Vol 10 (13) ◽  
pp. 1572
Author(s):  
Ehab A. Hamed ◽  
Inhee Lee

Over the past three decades, many Radiation-Hardened-by-Design (RHBD) Flip-Flops (FFs) have been designed and refined to be immune to Single Event Upsets (SEUs), with successive improvements in soft error tolerance, area overhead, power consumption, and delay. In this review, previously presented RHBD FFs are classified into three categories, with an overview of each. Six well-known RHBD FF architectures are simulated in a 180 nm CMOS process to enable a fair comparison among them, with the conventional Transmission Gate Flip-Flop (TGFF) serving as the reference design. The comparison results are analyzed to highlight the key characteristics of each design.


Author(s):  
Aleksandr Zatsarinny ◽  
Yuri Stepchenkov ◽  
Yuri Diachenko ◽  
Yuri Rogdestvenski

The article considers the problem of developing synchronous and self-timed (ST) digital circuits that are tolerant to soft errors. Synchronous circuits traditionally use the 2-of-3 (majority) voting principle to tolerate a single failure, which triples the hardware cost. In ST circuits, thanks to dual-rail signal coding and two-phase control, mere duplication provides a soft error tolerance level 2.1 to 3.5 times higher than that of the triple modular redundant synchronous counterpart. The development of new high-precision software for simulating microelectronic failure mechanisms will provide more accurate estimates of electronic circuits' failure tolerance.
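To make the 2-of-3 voting principle concrete: in a triple modular redundant (TMR) design, three replicas compute the same signal and a majority voter masks any single upset. A minimal Python sketch of the voter (illustrative, not taken from the article):

def majority_vote(a: int, b: int, c: int) -> int:
    """Return the bit agreed on by at least two of the three replicas."""
    return (a & b) | (a & c) | (b & c)

# A soft error that flips one replica is masked by the other two.
assert majority_vote(1, 1, 1) == 1
assert majority_vote(1, 0, 1) == 1  # replica b upset, output unaffected
assert majority_vote(0, 0, 1) == 0  # replica c upset, output unaffected

This masking of a single upset is what triples the hardware cost of the synchronous approach, whereas, as the abstract reports, the dual-rail coding of ST circuits achieves a higher tolerance level with only duplication.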


Author(s):  
Qiang Guan ◽  
Nathan DeBardeleben ◽  
Sean Blanchard ◽  
Song Fu ◽  
Claude H. Davis IV ◽  
...  

As the high performance computing (HPC) community continues to push towards exascale computing, today's HPC applications are affected by soft errors only to a small degree, but we expect this to become a more serious issue as HPC systems grow. We propose F-SEFI, a Fine-grained Soft Error Fault Injector, as a tool for profiling software robustness against soft errors. We use soft error injection to mimic the impact of errors on logic circuit behavior. Leveraging the open-source virtual machine hypervisor QEMU, F-SEFI enables users to modify emulated machine instructions to introduce soft errors. F-SEFI can control which application and which sub-function to target, and when and how to inject soft errors at different granularities, without interfering with other applications that share the same environment. We demonstrate use cases of F-SEFI on several benchmark applications with different characteristics to show how data corruption can propagate to incorrect results. The findings from the fault injection campaign can be used to design robust software and power-efficient hardware.
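To illustrate the kind of corruption such an injector emulates: a soft error can be modeled as a single bit flip in a value's binary representation. The following Python sketch is a simplified stand-in, not F-SEFI itself (which operates on QEMU-emulated machine instructions); it flips one random bit of a 64-bit float:

import random
import struct

def flip_random_bit(value: float) -> float:
    """Return `value` with one randomly chosen bit of its IEEE-754 encoding flipped."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    bits ^= 1 << random.randrange(64)
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", bits))
    return corrupted

random.seed(42)
print(flip_random_bit(3.141592653589793))

Depending on whether the flipped bit lands in the sign, exponent, or mantissa, the corrupted value may differ from the original negligibly or by many orders of magnitude, which is one reason silent data corruption is so hard to detect.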

