Towards Building Resilient Scientific Applications: Resilience Analysis on the Impact of Soft Error and Transient Error Tolerance with the CLAMR Hydrodynamics Mini-App

Author(s):  
Qiang Guan ◽  
Nathan DeBardeleben ◽  
Brian Atkinson ◽  
Robert Robey ◽  
William M. Jones
2020 ◽  
Vol 20 (2) ◽  
pp. e14
Author(s):  
Diego Montezanti

Reliability and fault tolerance have become increasingly relevant concerns in HPC, owing to the growing probability that faults of various kinds will occur in these systems. This is fundamentally due to the increasing complexity of processors in the pursuit of higher performance, which raises the scale of integration and the number of components operating near their technological limits, making them increasingly prone to failure. Another contributing factor is the growth in the size of parallel systems, in terms of the number of cores and processing nodes, to obtain greater computational power. As applications demand longer uninterrupted computation times, the impact of faults grows, because of the cost of relaunching an execution that was aborted by a fault or that completed with erroneous results. Consequently, these applications must run on highly available and reliable systems, requiring strategies capable of providing fault detection, protection, and recovery. In the coming years, exascale systems are expected: supercomputers with millions of processing cores, capable of performing on the order of 10^18 operations per second. This is a great window of opportunity for HPC applications, but it also increases the risk that their executions will not complete.

Recent studies show that, as systems continue to incorporate more processors, the Mean Time Between Errors (MTBE) decreases, resulting in higher failure rates and a greater risk of corrupted results; large parallel applications are expected to face errors every few minutes, requiring external help to make efficient progress. Silent Data Corruptions are the most dangerous errors that can occur, since they can produce incorrect results in programs that appear to execute correctly. Scientific applications and large-scale simulations are the most affected, making silent error handling the main challenge on the road to resilience in HPC. In message-passing applications, a silent error affecting a single task can produce a corruption pattern that spreads to all communicating processes; in the worst case, the erroneous final results cannot be detected at the end of the execution and are accepted as correct. Since scientific applications run for hours or even days, it is essential to find strategies that allow them to reach correct solutions in bounded time despite the underlying failures. Such strategies also keep energy consumption from skyrocketing, since without them, executions would have to be relaunched from the beginning. However, the most popular parallel programming models used on supercomputers lack support for fault tolerance.
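The inverse relationship between system size and MTBE noted above can be made concrete with a short calculation. Below is a minimal Python sketch (illustrative only; it assumes independent, identically distributed node failures, and the per-node MTBE figure is a hypothetical value, not one taken from the abstract):

def system_mtbe_hours(node_mtbe_hours: float, num_nodes: int) -> float:
    """Under independent node failures, system MTBE scales as MTBE_node / N."""
    return node_mtbe_hours / num_nodes

node_mtbe = 5 * 365 * 24  # hypothetical: one error per node every 5 years (~43,800 h)
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} nodes -> system MTBE ~ {system_mtbe_hours(node_mtbe, n):.2f} hours")

At a hundred thousand such nodes, the system-level MTBE already drops below half an hour, consistent with the expectation that applications at scale must cope with errors arriving every few minutes.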


Electronics ◽  
2021 ◽  
Vol 10 (13) ◽  
pp. 1572
Author(s):  
Ehab A. Hamed ◽  
Inhee Lee

Over the past three decades, many Radiation-Hardened-by-Design (RHBD) Flip-Flops (FFs) have been designed and refined to be immune to Single Event Upsets (SEUs), with successive improvements in soft error tolerance, area overhead, power consumption, and delay. In this review, previously presented RHBD FFs are classified into three categories, with an overview of each. Six well-known RHBD FF architectures are simulated in a 180 nm CMOS process to enable a fair comparison among them, with the conventional Transmission Gate Flip-Flop (TGFF) serving as the reference design. The comparison results are analyzed to highlight the key characteristics of each design.


Author(s):  
Aleksandr Zatsarinny ◽  
Yuri Stepchenkov ◽  
Yuri Diachenko ◽  
Yuri Rogdestvenski

The article considers the problem of developing synchronous and self-timed (ST) digital circuits that are tolerant to soft errors. Synchronous circuits traditionally use the 2-of-3 (majority) voting principle to tolerate a single failure, which triples the hardware cost. In ST circuits, thanks to dual-rail signal coding and two-phase control, mere duplication provides a soft error tolerance level 2.1 to 3.5 times higher than that of the triple modular redundant synchronous counterpart. The development of new high-precision software for simulating microelectronic failure mechanisms will provide more accurate estimates of electronic circuits' failure tolerance.
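To make the 2-of-3 voting principle concrete: in a triple modular redundant (TMR) design, three replicas compute the same signal and a majority voter masks any single upset. A minimal Python sketch of the voter (illustrative, not taken from the article):

def majority_vote(a: int, b: int, c: int) -> int:
    """Return the bit agreed on by at least two of the three replicas."""
    return (a & b) | (a & c) | (b & c)

# A soft error that flips one replica is masked by the other two.
assert majority_vote(1, 1, 1) == 1
assert majority_vote(1, 0, 1) == 1  # replica b upset, output unaffected
assert majority_vote(0, 0, 1) == 0  # replica c upset, output unaffected

This masking of a single upset is what triples the hardware cost of the synchronous approach, whereas, as the abstract reports, the dual-rail coding of ST circuits achieves a higher tolerance level with only duplication.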


Author(s):  
Qiang Guan ◽  
Nathan DeBardeleben ◽  
Sean Blanchard ◽  
Song Fu ◽  
Claude H. Davis IV ◽  
...  

As the high performance computing (HPC) community continues to push towards exascale computing, today's HPC applications are affected by soft errors only to a small degree, but we expect this to become a more serious issue as HPC systems grow. We propose F-SEFI, a Fine-grained Soft Error Fault Injector, as a tool for profiling software robustness against soft errors. We use soft error injection to mimic the impact of errors on logic circuit behavior. Leveraging the open-source virtual machine hypervisor QEMU, F-SEFI enables users to modify emulated machine instructions to introduce soft errors. F-SEFI can control which application and which sub-function to target, and when and how to inject soft errors at different granularities, without interfering with other applications that share the same environment. We demonstrate use cases of F-SEFI on several benchmark applications with different characteristics to show how data corruption can propagate to incorrect results. The findings from the fault injection campaign can be used to design robust software and power-efficient hardware.
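To illustrate the kind of corruption such an injector emulates: a soft error can be modeled as a single bit flip in a value's binary representation. The following Python sketch is a simplified stand-in, not F-SEFI itself (which operates on QEMU-emulated machine instructions); it flips one random bit of a 64-bit float:

import random
import struct

def flip_random_bit(value: float) -> float:
    """Return `value` with one randomly chosen bit of its IEEE-754 encoding flipped."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", value))
    bits ^= 1 << random.randrange(64)
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", bits))
    return corrupted

random.seed(42)
print(flip_random_bit(3.141592653589793))

Depending on whether the flipped bit lands in the sign, exponent, or mantissa, the corrupted value may differ from the original negligibly or by many orders of magnitude, which is one reason silent data corruption is so hard to detect.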

