Doubt and Redundancy Kill Soft Errors—Towards Detection and Correction of Silent Data Corruption in Task-based Numerical Software

Author(s):  
Philipp Samfass ◽  
Tobias Weinzierl ◽  
Anne Reinarz ◽  
Michael Bader


Author(s):  
David E Bernholdt ◽  
Wael R Elwasif ◽  
Christos Kartsaklis ◽  
Seyong Lee ◽  
Tiffany M Mintz

We present “programmer-guided reliability” (PGR) as a systematic conceptual approach to address the expected rise in soft errors in coming extreme-scale systems at the application level. The approach involves instrumentation of the application with code to detect data corruption errors. The location and nature of these error detectors are at the discretion of the programmer, who uses their knowledge and experience with the problem domain, the application, the solution algorithms, etc., to determine the most vulnerable areas of the code and the most appropriate ways to detect data corruption. To illustrate the approach, we provide examples of error detectors from four different benchmark-scale applications. We also describe a simple control framework that allows for runtime configuration of the error detectors without recompilation of the application, as well as dynamic reconfiguration during the execution of the application. Finally, we discuss a number of future directions building on the basic PGR approach, including the incorporation of some general error detectors into the programming environment in order to make them more easily usable by the programmer.


Electronics ◽  
2020 ◽  
Vol 10 (1) ◽  
pp. 61
Author(s):  
Na Yang ◽  
Yun Wang

Radiation-induced soft errors degrade the reliability of aerospace-based computing. Silent data corruption (SDC) is the most dangerous and insidious outcome of a soft error. To detect SDC, programs are hardened with program invariant assertions. However, hardened programs contain redundant assertions, which impair detection efficiency. Benign errors are another outcome of soft errors, and an assertion that detects a benign error incurs unnecessary recovery overhead. The detection degree of an assertion represents its detection capability: an assertion with a high detection degree can detect severe errors. To improve detection efficiency and detection degree while reducing the benign detection ratio, F_Radish is proposed in the present work to screen redundant assertions in a novel way. At each program point, the detection degree and benign detection ratio are used to evaluate the importance of the assertions at that point, and only the most important assertion is retained. Moreover, the redundancy degree is used to screen redundant assertions across neighbouring program points. Experimental results show that the detection efficiency of F_Radish is about two times greater than that of the Radish approach. F_Radish also reduces the benign detection ratio and improves the detection degree, so it avoids more unnecessary recovery overhead and detects more serious SDC than Radish.


2018 ◽  
Author(s):  
Oberon Dixon-Luinenburg ◽  
Jordan Fine

In this paper, we demonstrate a novel nanoprobing approach to establish cause-and-effect relationships between voltage stress and end-of-life performance loss and failure in SRAM cells. A Hyperion II Atomic Force nanoProber was used to examine degradation for five 6T cells on an Intel 14 nm processor. Ten minutes of asymmetrically applied stress at VDD=2 V was used to simulate a ‘0’ bit state held for a long period, subjecting each pullup and pulldown to either VDS or VGS stress. Resultant degradation caused read and hold margins to be reduced by 20% and 5% respectively for the ‘1’ state and 5% and 2% respectively for the ‘0’ state. ION was also reduced, for pulldown and pullup respectively, by 4.5% and 5.4% following VGS stress and 2.6% and 33.8% following VDS stress. Negative read margin failures, soft errors, and read time failures all become more prevalent with these aging symptoms whereas write stability is improved. This new approach enables highly specific root cause analysis and failure prediction for end-of-life in functional on-product SRAM.


2021 ◽  
Author(s):  
Alexandra Zimpeck ◽  
Cristina Meinhardt ◽  
Laurent Artola ◽  
Ricardo Reis
