Analyzing the criticality of transient faults-induced SDCS on GPU applications

Author(s):  
Fernando Fernandes dos Santos ◽  
Paolo Rech
Keyword(s):  
2020 ◽  
pp. 1-1
Author(s):  
Moussa Kafal ◽  
Nicolas Gregis ◽  
Jaume Benoit ◽  
Nicolas Ravot ◽  
Clara Lagomarsini ◽  
...  

2021 ◽  
Vol 20 (3) ◽  
pp. 1-25
Author(s):  
James Marshall ◽  
Robert Gifford ◽  
Gedare Bloom ◽  
Gabriel Parmer ◽  
Rahul Simha

Increased access to space has led to an increase in the usage of commodity processors in radiation environments. These processors are vulnerable to transient faults such as single event upsets that may cause bit-flips in processor components. Caches in particular are vulnerable due to their relatively large area, yet are often omitted from fault injection testing because many processors do not provide direct access to cache contents and they are often not fully modeled by simulators. The performance benefits of caches make disabling them undesirable, and the presence of error correcting codes is insufficient to correct for increasingly common multiple bit upsets. This work explores building a program’s cache profile by collecting cache usage information at an instruction granularity via commonly available on-chip debugging interfaces. The profile provides a tighter bound than cache utilization for cache vulnerability estimates (50% for several benchmarks). This can be applied to reduce the number of fault injections required to characterize behavior by at least two-thirds for the benchmarks we examine. The profile enables future work in hardware fault injection for caches that avoids the biases of existing techniques.


2014 ◽  
pp. 26-30
Author(s):  
Goutam Kumar Saha

This paper examines a software implemented self-checking technique that is capable of detecting processorregisters' hardware-transient faults. The proposed approach is intended to detect run-time transient bit-errors in memory and processor status register. Error correction is not considered here. However, this low-cost approach is intended to be adopted in commodity systems that use ordinary off-the-shelf microprocessors, for the purpose of operational faults detection towards gaining fail-safe kind of fault tolerant system.


Sign in / Sign up

Export Citation Format

Share Document