Combining architectural fault-injection and neutron beam testing approaches toward better understanding of GPU soft-error resilience

Author(s):  
Fritz G. Previlon ◽  
Babatunde Egbantan ◽  
Devesh Tiwari ◽  
Paolo Rech ◽  
David R. Kaeli
2021 ◽  
Vol 20 (5s) ◽  
pp. 1-22
Author(s):  
Uzair Sharif ◽  
Daniel Mueller-Gritschneder ◽  
Ulf Schlichtmann

Safety-critical embedded systems may either use specialized hardware or rely on Software-Implemented Hardware Fault Tolerance (SIHFT) to meet soft error resilience requirements. SIHFT has the advantage that it can be used with low-cost, off-the-shelf components such as standard Micro-Controller Units. To this end, SIHFT methods apply redundancy in software computation and special checker codes to detect transient errors, so-called soft errors, that corrupt either the data flow or the control flow of the software and may lead to Silent Data Corruption (SDC). So far, this has been done by applying separate SIHFT methods for data flow and control flow protection, which leads to large overheads in computation time. This work, in contrast, presents REPAIR, a method that exploits the checks of the SIHFT data flow protection to also detect control flow errors, thereby yielding higher SDC resilience with less computational overhead. The data flow protection methods duplicate the computation and place subsequent checks strategically throughout the program. These checks ensure that the two redundant computation paths, which work on two different parts of the register file, yield the same result. By updating the pairing between the registers used in the primary computation path and the registers in the duplicated computation path using the REPAIR method, these checks also fail with high coverage when a control flow error that leads to an illegal jump occurs. Extensive RTL fault injection simulations are carried out to accurately quantify soft error resilience, evaluating MiBench programs along with an embedded case study running on an OpenRISC processor. Our method performs slightly better on average in terms of soft error resilience than the best state-of-the-art method while requiring significantly lower overheads. These results show that REPAIR is a valuable addition to the set of known SIHFT methods.
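
The duplicate-and-check idea behind SIHFT data flow protection can be illustrated with a minimal Python sketch. This is not the REPAIR method and does not model its register re-pairing or control flow detection; it only shows a primary and a shadow computation compared by a check. All names (`checked_sum`, `flip_bit`, `SoftErrorDetected`) are hypothetical and chosen for illustration.

```python
class SoftErrorDetected(Exception):
    """Raised when the primary and shadow results disagree."""


def flip_bit(value: int, bit: int) -> int:
    """Model a transient fault by inverting one bit of an integer."""
    return value ^ (1 << bit)


def checked_sum(values, fault_at=None, fault_bit=0):
    """Sum `values` twice (primary and shadow) and compare the results.

    `fault_at` optionally names the loop index at which one bit of the
    primary accumulator is flipped, simulating a soft error. This is an
    illustrative assumption, not how the paper injects faults.
    """
    primary = 0
    shadow = 0
    for i, v in enumerate(values):
        primary += v
        shadow += v  # redundant computation kept in a separate variable
        if i == fault_at:
            primary = flip_bit(primary, fault_bit)
    if primary != shadow:  # the check: both paths must agree
        raise SoftErrorDetected(f"primary={primary}, shadow={shadow}")
    return primary
```

Calling `checked_sum([1, 2, 3])` returns 6, while `checked_sum([1, 2, 3], fault_at=1)` raises `SoftErrorDetected` because the corrupted primary accumulator no longer matches the shadow copy. In real SIHFT the two paths live in disjoint register sets and the checks are inserted by a compiler pass rather than written by hand.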


Author(s):  
Qiang Guan ◽  
Nathan DeBardeleben ◽  
Sean Blanchard ◽  
Song Fu ◽  
Claude H. Davis IV ◽  
...  

As the high-performance computing (HPC) community continues to push toward exascale computing, today's HPC applications are affected by soft errors only to a small degree, but we expect this to become a more serious issue as HPC systems grow. We propose F-SEFI, a Fine-grained Soft Error Fault Injector, as a tool for profiling software robustness against soft errors. We use soft error injection to mimic the impact of errors on logic circuit behavior. Leveraging the open-source virtual machine hypervisor QEMU, F-SEFI enables users to modify emulated machine instructions to introduce soft errors. F-SEFI can control which application and which sub-function to target, as well as when and how to inject soft errors, at different granularities and without interfering with other applications that share the same environment. We demonstrate use cases of F-SEFI on several benchmark applications with different characteristics to show how data corruption can propagate to incorrect results. The findings from the fault injection campaign can be used for designing robust software and power-efficient hardware.
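
To make the notion of an injected soft error concrete, the following Python sketch flips a single bit in the IEEE 754 representation of a floating-point value and shows how the corruption propagates to a result. It works at the application level and is not F-SEFI's interface, which operates on emulated instructions inside QEMU; the function names `flip_float_bit` and `dot_with_fault` are hypothetical.

```python
import struct


def flip_float_bit(x: float, bit: int) -> float:
    """Return `x` with one bit (0 = least significant, 63 = sign) inverted."""
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range 0..63")
    (raw,) = struct.unpack("<Q", struct.pack("<d", x))
    (corrupted,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return corrupted


def dot_with_fault(a, b, fault_index=None, fault_bit=0):
    """Dot product in which one intermediate product may be corrupted.

    `fault_index` selects which product receives the bit flip; None
    means a fault-free run. Both parameters are illustrative assumptions.
    """
    total = 0.0
    for i, (x, y) in enumerate(zip(a, b)):
        product = x * y
        if i == fault_index:
            product = flip_float_bit(product, fault_bit)
        total += product
    return total


golden = dot_with_fault([1.0, 2.0], [3.0, 4.0])        # 11.0
faulty = dot_with_fault([1.0, 2.0], [3.0, 4.0], 0, 63)  # sign flip: 5.0
```

Comparing `golden` with `faulty` is the basic measurement of a fault injection campaign: a flip in the sign or exponent bits changes the result drastically, whereas a flip in a low mantissa bit may be masked by rounding or fall within the application's tolerance.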


IEEE Access ◽  
2020 ◽  
Vol 8 ◽  
pp. 19490-19503 ◽  
Author(s):  
Younis Ibrahim ◽  
Haibin Wang ◽  
Man Bai ◽  
Zhi Liu ◽  
Jianan Wang ◽  
...  

2012 ◽  
Vol 61 (3) ◽  
pp. 313-322 ◽  
Author(s):  
Luis Entrena ◽  
Mario Garcia-Valderas ◽  
Raul Fernandez-Cardenal ◽  
Almudena Lindoso ◽  
Marta Portela ◽  
...  
