Analyzing the Robustness of HPC Applications Using a Fine-Grained Soft Error Fault Injection Tool

Author(s):  
Qiang Guan ◽  
Nathan DeBardeleben ◽  
Sean Blanchard ◽  
Song Fu ◽  
Claude H. Davis IV ◽  
...  

As the high performance computing (HPC) community continues to push towards exascale computing, today's HPC applications are affected by soft errors only to a small degree, but we expect this to become a more serious issue as HPC systems grow. We propose F-SEFI, a Fine-grained Soft Error Fault Injector, as a tool for profiling software robustness against soft errors. We utilize soft error injection to mimic the impact of errors on logic circuit behavior. Leveraging the open source virtual machine hypervisor QEMU, F-SEFI enables users to modify emulated machine instructions to introduce soft errors. F-SEFI can control which application and which sub-function to target, as well as when and how to inject soft errors at different granularities, without interfering with other applications that share the same environment. We demonstrate use cases of F-SEFI on several benchmark applications with different characteristics to show how data corruption can propagate to incorrect results. The findings from the fault injection campaign can be used for designing robust software and power-efficient hardware.
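F-SEFI itself works inside QEMU by altering emulated machine instructions, and its actual interface is not shown in the abstract. The following minimal Python sketch only illustrates the kind of corruption such a tool injects: a single-event upset modeled as one flipped bit in the IEEE-754 representation of a 64-bit floating-point value. The function names are illustrative, not part of F-SEFI.

```python
import random
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit in the IEEE-754 representation of a 64-bit float."""
    packed = struct.unpack("<Q", struct.pack("<d", value))[0]
    return struct.unpack("<d", struct.pack("<Q", packed ^ (1 << bit)))[0]

def inject_soft_error(value: float) -> float:
    """Mimic a single-event upset by flipping a randomly chosen bit."""
    return flip_bit(value, random.randrange(64))

# Example: a single upset in a low-order mantissa bit barely perturbs the value,
# while an upset in the exponent or sign can change it drastically.
x = 3.141592653589793
corrupted = inject_soft_error(x)
print(f"original={x!r} corrupted={corrupted!r} delta={abs(x - corrupted):.3e}")
```

Whether such a corruption propagates to an incorrect result depends on how the application consumes the corrupted value, which is exactly what a fault injection campaign of this kind measures.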

2014 ◽  
Vol 17 (2) ◽  
Author(s):  
Germán Bianchini ◽  
Paola Caymes Scutari

Forest fires are a major hazard with a strong impact at the eco-environmental and socio-economic levels, which is why their study and modeling are very important. However, the models frequently carry a certain level of uncertainty in some input parameters, which must be approximated or estimated because of the difficulty of accurately measuring the conditions of the phenomenon in real time. This has led to the development of several uncertainty-reduction methods, whose trade-offs between accuracy and complexity can vary significantly. The system ESS (Evolutionary-Statistical System) is a method that aims to reduce this uncertainty by combining Statistical Analysis, High Performance Computing (HPC) and Parallel Evolutionary Algorithms (PEAs). PEAs depend on several parameters that require adjustment and that determine how well they perform. Calibrating these parameters is crucial for reaching good performance and improving the system output. This paper presents an empirical study of parameter tuning to evaluate the effectiveness of different configurations and their impact on forest fire prediction.
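The abstract does not specify ESS's actual parameters or evaluation procedure, so the sketch below is purely illustrative: a brute-force sweep over hypothetical PEA settings (population size, mutation rate, crossover rate), with repeated trials per configuration and a stand-in objective in place of a real fire-spread prediction error.

```python
import itertools
import random
import statistics

# Hypothetical PEA parameter grid; the real ESS parameters and ranges may differ.
population_sizes = [50, 100, 200]
mutation_rates = [0.01, 0.05, 0.10]
crossover_rates = [0.6, 0.8]

def run_pea(pop_size, p_mut, p_cross, seed):
    """Placeholder for one PEA run; returns a prediction error to be minimized."""
    rng = random.Random(seed)
    # Stand-in objective: small populations and extreme rates do worse on average.
    noise = rng.gauss(0.0, 0.02)
    return abs(p_mut - 0.05) + abs(p_cross - 0.8) + 10.0 / pop_size + noise

results = []
for pop, mut, cross in itertools.product(population_sizes, mutation_rates, crossover_rates):
    errors = [run_pea(pop, mut, cross, seed) for seed in range(5)]  # repeated trials
    results.append(((pop, mut, cross), statistics.mean(errors), statistics.stdev(errors)))

best = min(results, key=lambda r: r[1])
print("best configuration (pop, mutation, crossover):", best[0],
      "mean error:", round(best[1], 4))
```

Averaging over several seeded trials per configuration, as above, is one simple way to separate the effect of a parameter setting from the stochastic variation inherent in evolutionary runs.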


2004 ◽  
Vol 14 (02) ◽  
pp. 299-309 ◽  
Author(s):  
R. C. BAUMANN

The once-ephemeral soft error has recently caused considerable concern for manufacturers of advanced silicon technology, as this phenomenon now has the potential to induce a failure rate higher than that of all other reliability mechanisms combined. We briefly review the three radiation mechanisms responsible for causing soft errors in commercial electronics and the basic physical mechanism by which ionizing radiation can produce a soft error. We then focus on the soft error sensitivity trends in commercial DRAM, SRAM, and peripheral logic devices as a function of technology scaling and discuss some of the solutions used for mitigating the impact of soft errors in high reliability systems.


Author(s):  
Gordon Bell ◽  
David H Bailey ◽  
Jack Dongarra ◽  
Alan H Karp ◽  
Kevin Walsh

The Gordon Bell Prize is awarded each year by the Association for Computing Machinery to recognize outstanding achievement in high-performance computing (HPC). The purpose of the award is to track the progress of parallel computing with particular emphasis on rewarding innovation in applying HPC to applications in science, engineering, and large-scale data analytics. Prizes may be awarded for peak performance or special achievements in scalability and time-to-solution on important science and engineering problems. Financial support for the US$10,000 award is provided through an endowment by Gordon Bell, a pioneer in high-performance and parallel computing. This article examines the evolution of the Gordon Bell Prize and the impact it has had on the field.


2014 ◽  
Vol 22 (2) ◽  
pp. 141-155 ◽  
Author(s):  
Daniel Laney ◽  
Steven Langer ◽  
Christopher Weber ◽  
Peter Lindstrom ◽  
Al Wegener

This paper examines whether lossy compression can be used effectively in physics simulations as a possible strategy to combat the expected data-movement bottleneck in future high performance computing architectures. We show that, for the codes and simulations we tested, compression levels of 3–5X can be applied without causing significant changes to important physical quantities. Rather than applying signal processing error metrics, we utilize physics-based metrics appropriate for each code to assess the impact of compression. We evaluate three different simulation codes: a Lagrangian shock-hydrodynamics code, an Eulerian higher-order hydrodynamics turbulence modeling code, and an Eulerian coupled laser-plasma interaction code. We compress relevant quantities after each time-step to approximate the effects of tightly coupled compression and study the compression rates to estimate memory and disk-bandwidth reduction. We find that the error characteristics of compression algorithms must be carefully considered in the context of the underlying physics being modeled.
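The paper's codes and compressors are not reproduced here. The sketch below uses uniform quantization as a stand-in for a lossy compressor and a total-mass conservation check as an example of a physics-based metric, applied to a field after a timestep, in contrast to a generic signal-processing norm. The field and parameter choices are hypothetical.

```python
import numpy as np

def quantize(field, bits=10):
    """Crude stand-in for a lossy compressor: uniform quantization of a field."""
    lo, hi = field.min(), field.max()
    scale = (2**bits - 1) / (hi - lo) if hi > lo else 1.0
    codes = np.round((field - lo) * scale).astype(np.uint16)  # "compressed" payload
    return codes, lo, scale

def dequantize(codes, lo, scale):
    """Reconstruct the field from its quantized representation."""
    return codes.astype(np.float64) / scale + lo

# Hypothetical density field from one timestep of a hydrodynamics run.
rng = np.random.default_rng(0)
density = 1.0 + 0.1 * rng.standard_normal((64, 64, 64))

codes, lo, scale = quantize(density, bits=10)
density_lossy = dequantize(codes, lo, scale)

# Physics-based metric: relative change in total mass,
# rather than a signal-processing error norm such as RMS.
mass_error = abs(density_lossy.sum() - density.sum()) / density.sum()
print(f"relative total-mass error after compression: {mass_error:.2e}")
```

Applying such a check after every compressed timestep approximates the tightly coupled compression scenario the paper studies, where errors can accumulate as the simulation advances.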


2018 ◽  
Vol 27 (09) ◽  
pp. 1850144
Author(s):  
Bahman Arasteh

The shrinking scale of transistors and the exponential increase in transistor counts have made soft errors one of the major causes of software failures. Fault injection is a powerful method for assessing the dependability of a computer system against soft errors. However, a considerable number of the faults injected randomly by current methods and tools are effect-less or equivalent. To overcome this problem and reduce the cost of fault injection, this study presents a software-based fault-injection method that accurately evaluates the dependability of a computer system with a limited number of fault injections. Using a genetic algorithm (GA), the most vulnerable executable paths of an input program are identified; then only the basic blocks (BBs) on the identified vulnerable paths are considered as targets of fault injection. The results of fault injections on a set of 8 traditional benchmark programs show that the proposed method avoids about 20% of effect-less faults by not injecting faults into the error-derating blocks of a program. Furthermore, the number of injected faults is reduced to 60% of that required by random injection. The proposed method also provides more stable and accurate results than random injection.
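The abstract does not give the paper's GA encoding or program analysis, so the following is a toy sketch under stated assumptions: program paths are encoded as branch decisions over a synthetic control-flow model, each basic block carries a hypothetical sensitivity score, and the GA evolves toward the path whose blocks are most likely to turn an injected fault into an observable error.

```python
import random

random.seed(1)

# Hypothetical per-basic-block sensitivity scores (fraction of injected faults
# that propagate to an output error); error-derating blocks score near zero.
BLOCK_SENSITIVITY = {bb: random.random() for bb in range(32)}

def blocks_on_path(branch_bits):
    """Toy control-flow model: each branch decision picks one of two successor blocks."""
    return [2 * i + bit for i, bit in enumerate(branch_bits)]

def fitness(branch_bits):
    """Estimated vulnerability of a path: mean sensitivity of its basic blocks."""
    path = blocks_on_path(branch_bits)
    return sum(BLOCK_SENSITIVITY[bb] for bb in path) / len(path)

def evolve(path_len=16, pop_size=30, generations=40, p_mut=0.05):
    """Simple GA: truncation selection, one-point crossover, bit-flip mutation."""
    pop = [[random.randint(0, 1) for _ in range(path_len)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, path_len)  # one-point crossover
            child = a[:cut] + b[cut:]
            children.append([1 - g if random.random() < p_mut else g for g in child])
        pop = parents + children
    return max(pop, key=fitness)

best_path = evolve()
targets = blocks_on_path(best_path)
print("fault-injection targets (basic blocks):", targets,
      "estimated vulnerability:", round(fitness(best_path), 3))
```

Restricting injection to the blocks returned by such a search is what lets the method spend its limited fault budget on locations likely to matter, rather than on error-derating code.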

