Fault-injection-based testing of fault-tolerant algorithms in message-passing parallel computers

Author(s):  
D.M. Blough ◽  
T. Torii
Author(s):  
Gian Franco Sacco ◽  
Robert D. Ferraro ◽  
Paul von Allmen ◽  
Dave A. Rennels

2005 ◽  
Vol 18 (2) ◽  
pp. 219-224
Author(s):  
Emina Milovanovic ◽  
Natalija Stojanovic

Because many universities lack the funds to purchase expensive parallel computers, cost effective alternatives are needed to teach students about parallel processing. Free software is available to support the three major paradigms of parallel computing. Parallaxis is a sophisticated SIMD simulator which runs on a variety of platforms.jBACI shared memory simulator supports the MIMD model of computing with a common shared memory. PVM and MPI allow students to treat a network of workstations as a message passing MIMD multicomputer with distributed memory. Each of this software tools can be used in a variety of courses to give students experience with parallel algorithms.


Author(s):  
Gabriella Carrozza ◽  
Roberto Natella

This paper proposes an approach to software faults diagnosis in complex fault tolerant systems, encompassing the phases of error detection, fault location, and system recovery. Errors are detected in the first phase, exploiting the operating system support. Faults are identified during the location phase, through a machine learning based approach. Then, the best recovery action is triggered once the fault is located. Feedback actions are also used during the location phase to improve detection quality over time. A real world application from the Air Traffic Control field has been used as case study for evaluating the proposed approach. Experimental results, achieved by means of fault injection, show that the diagnosis engine is able to diagnose faults with high accuracy and at a low overhead.


Author(s):  
Gabriella Carrozza ◽  
Roberto Natella

This paper proposes an approach to software faults diagnosis in complex fault tolerant systems, encompassing the phases of error detection, fault location, and system recovery. Errors are detected in the first phase, exploiting the operating system support. Faults are identified during the location phase, through a machine learning based approach. Then, the best recovery action is triggered once the fault is located. Feedback actions are also used during the location phase to improve detection quality over time. A real world application from the Air Traffic Control field has been used as case study for evaluating the proposed approach. Experimental results, achieved by means of fault injection, show that the diagnosis engine is able to diagnose faults with high accuracy and at a low overhead.


1998 ◽  
Vol 09 (01) ◽  
pp. 25-37 ◽  
Author(s):  
THOMAS J. CORTINA ◽  
ZHIWEI XU

We present a family of interconnection networks named the Cube-Of-Rings (COR) networks along with their basic graph-theoretic properties. Aspects of group graph theory are used to show the COR networks are symmetric and optimally fault tolerant. We present a closed-form expression of the diameter and optimal one-to-one routing algorithm for any member of the COR family. We also discuss the suitability of the COR networks as the interconnection network of scalable parallel computers.


Sign in / Sign up

Export Citation Format

Share Document