Analyzing the Robustness of HPC Applications Using a Fine-Grained Soft Error Fault Injection Tool

Author(s):  
Qiang Guan ◽  
Nathan DeBardeleben ◽  
Sean Blanchard ◽  
Song Fu ◽  
Claude H. Davis IV ◽  
...  

As the high performance computing (HPC) community continues to push towards exascale computing, today's HPC applications are affected by soft errors only to a small degree, but we expect this to become a more serious issue as HPC systems grow. We propose F-SEFI, a Fine-grained Soft Error Fault Injector, as a tool for profiling software robustness against soft errors. We utilize soft error injection to mimic the impact of errors on logic circuit behavior. Leveraging the open source virtual machine hypervisor QEMU, F-SEFI enables users to modify emulated machine instructions to introduce soft errors. F-SEFI can control which application and which sub-function to target, as well as when and how to inject soft errors at different granularities, without interfering with other applications that share the same environment. We demonstrate use cases of F-SEFI on several benchmark applications with different characteristics to show how data corruption can propagate to incorrect results. The findings from the fault injection campaign can be used for designing robust software and power-efficient hardware.
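F-SEFI itself works inside QEMU by altering emulated machine instructions, and its actual interface is not shown in the abstract. The following minimal Python sketch only illustrates the kind of corruption such a tool injects: a single-event upset modeled as one flipped bit in the IEEE-754 representation of a 64-bit floating-point value. The function names are illustrative, not part of F-SEFI.

```python
import random
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit in the IEEE-754 representation of a 64-bit float."""
    packed = struct.unpack("<Q", struct.pack("<d", value))[0]
    return struct.unpack("<d", struct.pack("<Q", packed ^ (1 << bit)))[0]

def inject_soft_error(value: float) -> float:
    """Mimic a single-event upset by flipping a randomly chosen bit."""
    return flip_bit(value, random.randrange(64))

# Example: a single upset in a low-order mantissa bit barely perturbs the value,
# while an upset in the exponent or sign can change it drastically.
x = 3.141592653589793
corrupted = inject_soft_error(x)
print(f"original={x!r} corrupted={corrupted!r} delta={abs(x - corrupted):.3e}")
```

Whether such a corruption propagates to an incorrect result depends on how the application consumes the corrupted value, which is exactly what a fault injection campaign of this kind measures.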

2014 ◽  
Vol 17 (2) ◽  
Author(s):  
Germán Bianchini ◽  
Paola Caymes Scutari

Forest fires are a major hazard with a strong impact at the eco-environmental and socio-economic levels, which is why their study and modeling are very important. However, the models frequently carry a certain level of uncertainty in some input parameters, which must be approximated or estimated because of the difficulty of accurately measuring the conditions of the phenomenon in real time. This has led to the development of several uncertainty-reduction methods, whose trade-offs between accuracy and complexity can vary significantly. The system ESS (Evolutionary-Statistical System) is a method that aims to reduce this uncertainty by combining Statistical Analysis, High Performance Computing (HPC) and Parallel Evolutionary Algorithms (PEAs). PEAs depend on several parameters that require adjustment and that determine how well they perform. Calibrating these parameters is crucial for reaching good performance and improving the system output. This paper presents an empirical study of parameter tuning to evaluate the effectiveness of different configurations and their impact on forest fire prediction.
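The abstract does not specify ESS's actual parameters or evaluation procedure, so the sketch below is purely illustrative: a brute-force sweep over hypothetical PEA settings (population size, mutation rate, crossover rate), with repeated trials per configuration and a stand-in objective in place of a real fire-spread prediction error.

```python
import itertools
import random
import statistics

# Hypothetical PEA parameter grid; the real ESS parameters and ranges may differ.
population_sizes = [50, 100, 200]
mutation_rates = [0.01, 0.05, 0.10]
crossover_rates = [0.6, 0.8]

def run_pea(pop_size, p_mut, p_cross, seed):
    """Placeholder for one PEA run; returns a prediction error to be minimized."""
    rng = random.Random(seed)
    # Stand-in objective: small populations and extreme rates do worse on average.
    noise = rng.gauss(0.0, 0.02)
    return abs(p_mut - 0.05) + abs(p_cross - 0.8) + 10.0 / pop_size + noise

results = []
for pop, mut, cross in itertools.product(population_sizes, mutation_rates, crossover_rates):
    errors = [run_pea(pop, mut, cross, seed) for seed in range(5)]  # repeated trials
    results.append(((pop, mut, cross), statistics.mean(errors), statistics.stdev(errors)))

best = min(results, key=lambda r: r[1])
print("best configuration (pop, mutation, crossover):", best[0],
      "mean error:", round(best[1], 4))
```

Averaging over several seeded trials per configuration, as above, is one simple way to separate the effect of a parameter setting from the stochastic variation inherent in evolutionary runs.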


2004 ◽  
Vol 14 (02) ◽  
pp. 299-309 ◽  
Author(s):  
R. C. BAUMANN

The once-ephemeral soft error has recently caused considerable concern for manufacturers of advanced silicon technology, as this phenomenon now has the potential to induce a failure rate higher than that of all other reliability mechanisms combined. We briefly review the three radiation mechanisms responsible for causing soft errors in commercial electronics and the basic physical mechanism by which ionizing radiation can produce a soft error. We then focus on the soft error sensitivity trends in commercial DRAM, SRAM, and peripheral logic devices as a function of technology scaling and discuss some of the solutions used for mitigating the impact of soft errors in high reliability systems.


Author(s):  
Gordon Bell ◽  
David H Bailey ◽  
Jack Dongarra ◽  
Alan H Karp ◽  
Kevin Walsh

The Gordon Bell Prize is awarded each year by the Association for Computing Machinery to recognize outstanding achievement in high-performance computing (HPC). The purpose of the award is to track the progress of parallel computing with particular emphasis on rewarding innovation in applying HPC to applications in science, engineering, and large-scale data analytics. Prizes may be awarded for peak performance or special achievements in scalability and time-to-solution on important science and engineering problems. Financial support for the US$10,000 award is provided through an endowment by Gordon Bell, a pioneer in high-performance and parallel computing. This article examines the evolution of the Gordon Bell Prize and the impact it has had on the field.


2014 ◽  
Vol 22 (2) ◽  
pp. 141-155 ◽  
Author(s):  
Daniel Laney ◽  
Steven Langer ◽  
Christopher Weber ◽  
Peter Lindstrom ◽  
Al Wegener

This paper examines whether lossy compression can be used effectively in physics simulations as a possible strategy to combat the expected data-movement bottleneck in future high performance computing architectures. We show that, for the codes and simulations we tested, compression levels of 3–5X can be applied without causing significant changes to important physical quantities. Rather than applying signal processing error metrics, we utilize physics-based metrics appropriate for each code to assess the impact of compression. We evaluate three different simulation codes: a Lagrangian shock-hydrodynamics code, an Eulerian higher-order hydrodynamics turbulence modeling code, and an Eulerian coupled laser-plasma interaction code. We compress relevant quantities after each time-step to approximate the effects of tightly coupled compression and study the compression rates to estimate memory and disk-bandwidth reduction. We find that the error characteristics of compression algorithms must be carefully considered in the context of the underlying physics being modeled.
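The paper's codes and compressors are not reproduced here. The sketch below uses uniform quantization as a stand-in for a lossy compressor and a total-mass conservation check as an example of a physics-based metric, applied to a field after a timestep, in contrast to a generic signal-processing norm. The field and parameter choices are hypothetical.

```python
import numpy as np

def quantize(field, bits=10):
    """Crude stand-in for a lossy compressor: uniform quantization of a field."""
    lo, hi = field.min(), field.max()
    scale = (2**bits - 1) / (hi - lo) if hi > lo else 1.0
    codes = np.round((field - lo) * scale).astype(np.uint16)  # "compressed" payload
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Reconstruct the field from its quantized representation."""
    return codes.astype(np.float64) / scale + lo

# Hypothetical density field from one timestep of a hydrodynamics run.
rng = np.random.default_rng(0)
density = 1.0 + 0.1 * rng.standard_normal((64, 64, 64))

codes, lo, scale = quantize(density, bits=10)
density_lossy = dequantize(codes, lo, scale)

# Physics-based metric: relative change in total mass,
# rather than a signal-processing error norm such as RMS.
mass_error = abs(density_lossy.sum() - density.sum()) / density.sum()
print(f"relative total-mass error after compression: {mass_error:.2e}")
```

Applying such a check after every compressed timestep approximates the tightly coupled compression scenario the paper studies, where errors can accumulate as the simulation advances.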


2018 ◽  
Vol 27 (09) ◽  
pp. 1850144
Author(s):  
Bahman Arasteh

The shrinking scale of transistors and the exponential increase in transistor counts have made soft errors one of the major causes of software failures. Fault injection is a powerful method for assessing the dependability of a computer system against soft errors. However, a considerable number of the faults injected randomly by current methods and tools are effect-less or equivalent. To overcome this problem and reduce the cost of fault injection, this study presents a software-based fault-injection method that accurately evaluates the dependability of a computer system with a limited number of fault injections. Using a genetic algorithm (GA), the most vulnerable executable paths of an input program are identified; then only the basic blocks (BBs) on the identified vulnerable paths are considered as targets of fault injection. The results of fault injections on a set of 8 traditional benchmark programs show that the proposed method avoids about 20% of effect-less faults by not injecting faults into the error-derating blocks of a program. Furthermore, the number of injected faults is reduced to 60% of that required by random injection. The proposed method also provides more stable and accurate results than random injection.
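The abstract does not give the paper's GA encoding or program analysis, so the following is a toy sketch under stated assumptions: program paths are encoded as branch decisions over a synthetic control-flow model, each basic block carries a hypothetical sensitivity score, and the GA evolves toward the path whose blocks are most likely to turn an injected fault into an observable error.

```python
import random

random.seed(1)

# Hypothetical per-basic-block sensitivity scores (fraction of injected faults
# that propagate to an output error); error-derating blocks score near zero.
BLOCK_SENSITIVITY = {bb: random.random() for bb in range(32)}

def blocks_on_path(branch_bits):
    """Toy control-flow model: each branch decision picks one of two successor blocks."""
    return [2 * i + bit for i, bit in enumerate(branch_bits)]

def fitness(branch_bits):
    """Estimated vulnerability of a path: mean sensitivity of its basic blocks."""
    path = blocks_on_path(branch_bits)
    return sum(BLOCK_SENSITIVITY[bb] for bb in path) / len(path)

def evolve(path_len=16, pop_size=30, generations=40, p_mut=0.05):
    """Simple GA: truncation selection, one-point crossover, bit-flip mutation."""
    pop = [[random.randint(0, 1) for _ in range(path_len)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, path_len)  # one-point crossover
            child = a[:cut] + b[cut:]
            children.append([1 - g if random.random() < p_mut else g for g in child])
        pop = parents + children
    return max(pop, key=fitness)

best_path = evolve()
targets = blocks_on_path(best_path)
print("fault-injection targets (basic blocks):", targets,
      "estimated vulnerability:", round(fitness(best_path), 3))
```

Restricting injection to the blocks returned by such a search is what lets the method spend its limited fault budget on locations likely to matter, rather than on error-derating code.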

