Error resilience of three GMRES implementations under fault injection

Safety-critical embedded systems may either use specialized hardware or rely on Software-Implemented Hardware Fault Tolerance (SIHFT) to meet soft error resilience requirements. SIHFT has the advantage that it can be used with low-cost, off-the-shelf components such as standard Micro-Controller Units. To this end, SIHFT methods apply redundancy in software computation and special checker codes to detect transient errors, so-called soft errors, that corrupt either the data flow or the control flow of the software and may lead to Silent Data Corruption (SDC). So far, this has been done by applying separate SIHFT methods for data flow and control flow protection, which leads to large overheads in computation time. This work, in contrast, presents REPAIR, a method that exploits the checks of the SIHFT data flow protection to also detect control flow errors, thereby yielding higher SDC resilience with less computational overhead. The data flow protection methods duplicate the computation and place subsequent checks strategically throughout the program. These checks ensure that the two redundant computation paths, which work on two different parts of the register file, yield the same result. By updating the pairing between the registers used in the primary computation path and the registers in the duplicated computation path using the REPAIR method, these checks also fail with high coverage when a control flow error that leads to an illegal jump occurs. Extensive RTL fault injection simulations are carried out to accurately quantify soft error resilience while evaluating MiBench programs along with an embedded case study running on an OpenRISC processor. On average, our method performs slightly better in terms of soft error resilience than the best state-of-the-art method while requiring significantly lower overheads. These results show that REPAIR is a valuable addition to the set of known SIHFT methods.
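The idea of letting a data flow check also catch an illegal jump can be illustrated with a toy model. The sketch below is not the REPAIR algorithm from the paper; the register numbers, the `POISON` marker, the `remap` flag, and the two-block program are all hypothetical and chosen only to show the effect. With a fixed primary/shadow pairing, skipping a block changes both copies identically and the check passes; when the pairing is updated inside the block, skipping it leaves the expected shadow register unset and the check fails.

```python
POISON = 0xDEAD  # hypothetical marker for a shadow register that is not currently paired


class CheckFailed(Exception):
    """Raised when the primary and shadow registers disagree."""


def block_a(regs, shadow_in, shadow_out):
    regs[0] += 3                 # primary computation path
    regs[shadow_in] += 3         # duplicated computation path
    if shadow_out != shadow_in:  # update the pairing: move the shadow copy
        regs[shadow_out] = regs[shadow_in]
        regs[shadow_in] = POISON


def block_b(regs, shadow):
    regs[0] *= 2                 # primary computation path
    regs[shadow] *= 2            # duplicated computation path


def check(regs, shadow):
    if regs[0] != regs[shadow]:
        raise CheckFailed(f"primary={regs[0]} shadow={regs[shadow]}")


def run(x, remap, skip_a=False):
    """Run the toy program; skip_a=True models an illegal jump over block A."""
    regs = [POISON] * 8
    regs[0] = x                  # primary register
    regs[4] = x                  # initial shadow register
    shadow_b = 5 if remap else 4 # pairing that block B and the check expect
    if not skip_a:
        block_a(regs, 4, shadow_b)
    block_b(regs, shadow_b)
    check(regs, shadow_b)
    return regs[0]
```

In this model, `run(1, remap=False, skip_a=True)` returns a wrong result without any error being raised, which corresponds to silent data corruption, whereas `run(1, remap=True, skip_a=True)` raises `CheckFailed`.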

SUGAR

Proceedings of the ACM on Measurement and Analysis of Computing Systems ◽

10.1145/3447375 ◽

2021 ◽

Vol 5 (1) ◽

pp. 1-29

Author(s):

Lishan Yang ◽

Bin Nie ◽

Adwait Jog ◽

Evgenia Smirni

Keyword(s):

Graphics Processing Units ◽

Fault Injection ◽

Error Resilience ◽

Reliable Operation ◽

Estimation Errors ◽

Input Size ◽

Wide Range ◽

Established Fact ◽

Memory Resources ◽

Graphics Processing

As Graphics Processing Units (GPUs) are becoming a de facto solution for accelerating a wide range of applications, their reliable operation is becoming increasingly important. One of the major challenges in the domain of GPU reliability is to accurately measure GPGPU application error resilience. This challenge stems from the fact that a typical GPGPU application spawns a huge number of threads and then utilizes a large amount of potentially unreliable compute and memory resources available on the GPUs. As the number of possible fault locations can be in the billions, evaluating every fault and examining its effect on the application error resilience is impractical. Application resilience is therefore evaluated via extensive fault injection campaigns based on sampling of an extensive fault site space. Typically, the larger the input of the GPGPU application, the longer the experimental campaign. In this work, we devise a methodology, SUGAR (Speeding Up GPGPU Application Resilience Estimation with input sizing), that dramatically speeds up the evaluation of GPGPU application error resilience by judicious input sizing. We show how analyzing a small fraction of the input is sufficient to estimate the application resilience with high accuracy and dramatically reduce the duration of experimentation. Key to our estimation methodology is the discovery of repeating patterns as a function of the input size. Using the well-established fact that error resilience in GPGPU applications is mostly determined by the dynamic instruction count at the thread level, we identify the patterns that allow us to accurately predict application error resilience for arbitrarily large inputs. For the cases that we examine in this paper, this new resilience estimation mechanism provides significant speedups (up to 1336 times, and 97.0 times on average), while keeping estimation errors to less than 1%.
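The observation that resilience is mostly determined by the per-thread dynamic instruction count suggests grouping threads before injecting faults. The sketch below is only an illustration of that grouping idea, not the SUGAR methodology itself: the function name, the `masked_rate_for_icount` callback (a stand-in for a fault injection campaign on one representative thread), and the assumption that the number of fault sites is proportional to the instruction count are all hypothetical.

```python
from collections import Counter


def estimate_masked_rate(thread_icounts, masked_rate_for_icount):
    """Estimate the fraction of masked faults for a whole kernel.

    thread_icounts: dynamic instruction count of every thread.
    masked_rate_for_icount: hypothetical callback that returns the measured
        masked-fault rate of one representative thread with that count.
    """
    groups = Counter(thread_icounts)  # instruction count -> number of threads
    # Assumption: fault sites per thread scale with its instruction count.
    total_sites = sum(ic * n for ic, n in groups.items())
    weighted = sum(ic * n * masked_rate_for_icount(ic) for ic, n in groups.items())
    return weighted / total_sites, len(groups)


# Example: four threads, but only two distinct instruction counts,
# so only two representative campaigns are needed.
rates = {10: 0.5, 30: 0.9}  # made-up rates, for illustration only
estimate, campaigns = estimate_masked_rate([10, 10, 10, 30], rates.__getitem__)
```

Under these assumptions the number of campaigns grows with the number of distinct instruction counts rather than with the number of threads, which is the kind of saving the abstract describes.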

A Manual Diagnosis Approach Using Targeted Fault Injection and Fault Simulation to Extend ATPG Diagnostic Resolution in Localizing Faults

ISTFA 2019: Conference Proceedings from the 45th International Symposium for Testing and Failure Analysis ◽

10.31399/asm.cp.istfa2019p0419 ◽

2019 ◽

Author(s):

Rommel Estores ◽

Karo Vander Gucht

Keyword(s):

Fault Injection ◽

Fault Simulation ◽

Fault Localization ◽

Pattern Generation ◽

Test Coverage ◽

Actual Case ◽

Electrical Failure ◽

Starting Point ◽

Diagnostic Resolution ◽

Diagnosis Approach

This paper discusses a creative manual diagnosis approach, a complementary technique that makes it possible to extend Automatic Test Pattern Generation (ATPG) beyond its own limits. The authors discuss this approach in detail using an actual case: a test coverage issue where user-generated ATPG patterns and the resulting ATPG diagnosis isolated the fault to a small part of the digital core. However, traditional fault localization techniques were unable to isolate the fault further. Using the defect candidates from ATPG diagnosis as a starting point, manual diagnosis through fault injection and fault simulation was performed. Further fault localization was performed using the 'not detected' (ND) and/or 'detected' (DT) fault classes for each of the available patterns. This successfully narrowed down the defect candidates until the exact faulty net causing the electrical failure was identified. The ability of the FA lab to maximize the use of ATPG in combination with other tools and techniques to investigate failures in detail is crucial for fast root cause determination and, in the case of a test coverage issue, aids in implementing an effective test screen method.
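One way to picture the narrowing step is as a consistency filter over the simulated fault classes. The sketch below is a hypothetical illustration, not the authors' procedure: it assumes a single deterministic fault, so that a valid candidate must be 'DT' on every pattern that failed on the tester and 'ND' on every pattern that passed. The net names and pattern names are made up.

```python
def narrow_candidates(sim_results, observed_fail):
    """Keep the candidates whose simulated fault class matches the tester.

    sim_results: {candidate: {pattern: 'DT' or 'ND'}} from fault simulation.
    observed_fail: {pattern: True if the pattern failed on the tester}.
    """
    return [
        cand
        for cand, per_pattern in sim_results.items()
        if all(
            (per_pattern[pat] == "DT") == failed
            for pat, failed in observed_fail.items()
        )
    ]


# Made-up example: three candidate nets, three patterns.
sim = {
    "net_a": {"p1": "DT", "p2": "DT", "p3": "ND"},
    "net_b": {"p1": "DT", "p2": "ND", "p3": "ND"},
    "net_c": {"p1": "ND", "p2": "ND", "p3": "DT"},
}
tester = {"p1": True, "p2": False, "p3": False}
remaining = narrow_candidates(sim, tester)
```

Each additional pattern with a known pass or fail result can only shrink the list, so adding targeted patterns is how the candidate set is driven toward a single net in this model.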

Timing Sensitivity Analysis of Logical Nodes in Scan Design Integrated Circuits by Pulsed Diode Laser Stimulation

ISTFA 2008: Conference Proceedings from the 34th International Symposium for Testing and Failure Analysis ◽

10.31399/asm.cp.istfa2008p0180 ◽

2008 ◽

Author(s):

T. Kiyan ◽

C. Boit ◽

C. Brillert

Keyword(s):

Laser Pulse ◽

Integrated Circuits ◽

Continuous Wave ◽

Fault Injection ◽

Scan Design ◽

Flip Flop ◽

Laser Stimulation ◽

Scan Chain ◽

P Type ◽

Scan Pattern

In this paper, a methodology is presented that is based on laser stimulation and a comparison of continuous wave and pulsed laser operation, and that localizes the fault-relevant sites in a fully functional scan chain cell. The technique uses a laser incident from the backside to inject soft faults into internal nodes of a master-slave scan flip-flop as a consequence of localized photocurrent. Depending on the type of the illuminated transistors (n- or p-type), a logic '0' or '1' is injected into the master or the slave stage of a flip-flop. The laser pulse is externally triggered and can easily be shifted to various time slots in reference to the clock and scan pattern. This feature of the laser diode allows triggering the laser pulse on the rising or the falling edge of the clock. Therefore, it is possible to choose the stage of the flip-flop in which the fault injection should occur. It is also demonstrated that the technique is able to identify the most sensitive signal condition for fault injection with a better time resolution than the pulse width of the laser, a significant improvement for failure analysis of integrated circuits.

Migrating Electronic Systems from Fault Tolerant Computing to Error Resilience

2018 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA) ◽

10.23919/spa.2018.8563408 ◽

2018 ◽

Author(s):

Heinrich Theodor Vierhaus

Keyword(s):

Fault Tolerant ◽

Error Resilience ◽

Electronic Systems

Countermeasures Optimization in Multiple Fault-Injection Context

2020 Workshop on Fault Detection and Tolerance in Cryptography (FDTC) ◽

10.1109/fdtc51366.2020.00011 ◽

2020 ◽

Author(s):

Etienne Boespflug ◽

Cristian Ene ◽

Laurent Mounier ◽

Marie-Laure Potet

Keyword(s):

Fault Injection ◽

Multiple Fault

Security Threat Analyses and Attack Models for Approximate Computing Systems

ACM Transactions on Design Automation of Electronic Systems ◽

10.1145/3442380 ◽

2021 ◽

Vol 26 (4) ◽

pp. 1-31

Author(s):

Pruthvy Yellu ◽

Landon Buell ◽

Miguel Mark ◽

Michel A. Kinsy ◽

Dongpeng Xu ◽

...

Keyword(s):

Error Resilience ◽

Security Threats ◽

Approximate Computing ◽

Security Threat ◽

Security Vulnerabilities ◽

Computing Systems ◽

Quantitative Analyses ◽

Resilience Mechanisms ◽

The Impact ◽

Attack Models

Approximate computing (AC) represents a paradigm shift from conventional precise processing to inexact computation that still satisfies the system's accuracy requirements. The rapid progress in the development of diverse AC techniques allows us to apply approximate computing to many computation-intensive applications. However, the use of AC techniques could introduce new and unique security threats to computing systems. This work surveys existing circuit-, architecture-, and compiler-level approximate mechanisms and algorithms, with special emphasis on potential security vulnerabilities. Qualitative and quantitative analyses are performed to assess the impact of the new security threats on AC systems. Moreover, this work proposes four unique visionary attack models, which systematically cover the attacks that build covert channels, compensate approximation errors, terminate normal error resilience mechanisms, and propagate additional errors. To thwart those attacks, this work further offers guidelines for countermeasure design. Several case studies are provided to illustrate the implementation of the suggested countermeasures.
