scholarly journals CAN AGENT INTELLIGENCE BE USED TO ACHIEVE FAULT TOLERANT PARALLEL COMPUTING SYSTEMS?

2011 ◽  
Vol 21 (04) ◽  
pp. 379-396 ◽  
Author(s):  
BLESSON VARGHESE ◽  
GERARD MCKEE ◽  
VASSIL ALEXANDROV

The work reported in this paper is motivated towards validating an alternative approach for fault tolerance over traditional methods like checkpointing that constrain efficacious fault tolerance. Can agent intelligence be used to achieve fault tolerant parallel computing systems? If so, "What agent capabilities are required for fault tolerance?", "What parallel computational tasks can benefit from such agent capabilities?" and "How can agent capabilities be implemented for fault tolerance?" need to be addressed. Cognitive capabilities essential for achieving fault tolerance through agents are considered. Parallel reduction algorithms are identified as a class of algorithms that can benefit from cognitive agent capabilities. The Message Passing Interface is utilized for implementing an intelligent agent based approach. Preliminary results obtained from the experiments validate the feasibility of an agent based approach for achieving fault tolerance in parallel computing systems.

Author(s):  
Omer Subasi ◽  
Tatiana Martsinkevich ◽  
Ferad Zyulkyarov ◽  
Osman Unsal ◽  
Jesus Labarta ◽  
...  

We present a unified fault-tolerance framework for task-parallel message-passing applications to mitigate transient errors. First, we propose a fault-tolerant message-logging protocol that only requires the restart of the task that experienced the error and transparently handles any message passing interface calls inside the task. In our experiments we demonstrate that our fault-tolerant solution has a reasonable overhead, with a maximum observed overhead of 4.5%. We also show that fine-grained parallelization is important for hiding the overheads related to the protocol as well as the recovery of tasks. Secondly, we develop a mathematical model to unify task-level checkpointing and our protocol with system-wide checkpointing in order to provide complete failure coverage. We provide closed formulas for the optimal checkpointing interval and the performance score of the unified scheme. Experimental results show that the performance improvement can be as high as 98% with the unified scheme.


Author(s):  
Yu-Cheng Chou ◽  
David Ko ◽  
Harry H. Cheng

Parallel computing is widely adotped in scientific and engineering applications to enhance the efficiency. Moreover, there are increasing research interests focusing on utilizing distributed networked computers for parallel computing. The Message Passing Interface (MPI) standard was designed to support portability and platform independence of a developed parallel program. However, the procedure to start an MPI-based parallel computation among distributed computers lacks autonomicity and flexibility. This article presents an autonomic dynamic parallel computing framework that provides autonomicity and flexibility that are important and necessary to some parallel computing applications involving resource constrained and heterogeneous platforms. In this framework, an MPI parallel computing environment consisting of multiple computing entities is dynamically established through inter-agent communications using the IEEE Foundation for Intelligent Physical Agents (FIPA) compliant Agent Communication Language (ACL) messages. For each computing entity in the MPI parallel computing environment, a load-balanced MPI program C source code along with the MPI environment configuration statements are dynamically composed as a mobile agent code. A mobile agent wrapping the mobile agent code is created and sent to the computing entity where the mobile agent code is retrieved and interpretively executed. An example of autonomic parallel matrix multiplication is given to demonstrate the self-configuration and self-optimization properties of the presented framework.


Author(s):  
Hodjatollah Hamidi

The Algorithm-Based Fault Tolerance (ABFT) approach transforms a system that does not tolerate a specific type of faults, called the fault-intolerant system, to a system that provides a specific level of fault tolerance, namely recovery. The ABFT philosophy leads directly to a model from which error correction can be developed. By employing an ABFT scheme with effective convolutional code, the design allows high throughput as well as high fault coverage. The ABFT techniques that detect errors rely on the comparison of parity values computed in two ways. The parallel processing of input parity values produce output parity values comparable with parity values regenerated from the original processed outputs and can apply convolutional codes for the redundancy. This method is a new approach to concurrent error correction in fault-tolerant computing systems. This chapter proposes a novel computing paradigm to provide fault tolerance for numerical algorithms. The authors also present, implement, and evaluate early detection in ABFT.


Author(s):  
Ning Yang ◽  
Shiaaulir Wang ◽  
Paul Schonfeld

A Parallel Genetic Algorithm (PGA) is used for a simulation-based optimization of waterway project schedules. This PGA is designed to distribute a Genetic Algorithm application over multiple processors in order to speed up the solution search procedure for a very large combinational problem. The proposed PGA is based on a global parallel model, which is also called a master-slave model. A Message-Passing Interface (MPI) is used in developing the parallel computing program. A case study is presented, whose results show how the adaption of a simulation-based optimization algorithm to parallel computing can greatly reduce computation time. Additional techniques which are found to further improve the PGA performance include: (1) choosing an appropriate task distribution method, (2) distributing simulation replications instead of different solutions, (3) avoiding the simulation of duplicate solutions, (4) avoiding running multiple simulations simultaneously in shared-memory processors, and (5) avoiding using multiple processors which belong to different clusters (physical sub-networks).


Author(s):  
Yu-Cheng Chou ◽  
Harry H. Cheng

Message Passing Interface (MPI) is a standardized library specification designed for message-passing parallel programming on large-scale distributed systems. A number of MPI libraries have been implemented to allow users to develop portable programs using the scientific programming languages, Fortran, C and C++. Ch is an embeddable C/C++ interpreter that provides an interpretive environment for C/C++ based scripts and programs. Combining Ch with any MPI C/C++ library provides the functionality for rapid development of MPI C/C++ programs without compilation. In this article, the method of interfacing Ch scripts with MPI C implementations is introduced by using the MPICH2 C library as an example. The MPICH2-based Ch MPI package provides users with the ability to interpretively run MPI C program based on the MPICH2 C library. Running MPI programs through the MPICH2-based Ch MPI package across heterogeneous platforms consisting of Linux and Windows machines is illustrated. Comparisons for the bandwidth, latency, and parallel computation speedup between C MPI, Ch MPI, and MPI for Python in an Ethernet-based environment comprising identical Linux machines are presented. A Web-based example is given to demonstrate the use of Ch and MPICH2 in C based CGI scripting to facilitate the development of Web-based applications for parallel computing.


2012 ◽  
Vol 433-440 ◽  
pp. 2892-2898
Author(s):  
Guang Lei Fei ◽  
Jian Guo Ning ◽  
Tian Bao Ma

Parallel computing has been applied in many fields, and the parallel computing platform system, PC cluster based on MPI (Message Passing Interface) library under Linux operating system is a cost-effectiveness approach to parallel compute. In this paper, the key algorithm of parallel program of explosion and impact is presented. The techniques of solving data dependence and realizing communication between subdomain are proposed. From the test of program, the portability of MMIC-3D parallel program is satisfied, and compared with the single computer, PC cluster can improve the calculation speed and enlarge the scale greatly.


2015 ◽  
Vol 2015 ◽  
pp. 1-10 ◽  
Author(s):  
Daniel S. Abdi ◽  
Girma T. Bitsuamlak

A Navier-Stokes equations solver is parallelized to run on a cluster of computers using the domain decomposition method. Two approaches of communication and computation are investigated, namely, synchronous and asynchronous methods. Asynchronous communication between subdomains is not commonly used in CFD codes; however, it has a potential to alleviate scaling bottlenecks incurred due to processors having to wait for each other at designated synchronization points. A common way to avoid this idle time is to overlap asynchronous communication with computation. For this to work, however, there must be something useful and independent a processor can do while waiting for messages to arrive. We investigate an alternative approach of computation, namely, conducting asynchronous iterations to improve local subdomain solution while communication is in progress. An in-house CFD code is parallelized using message passing interface (MPI), and scalability tests are conducted that suggest asynchronous iterations are a viable way of parallelizing CFD code.


1997 ◽  
Vol 40 (1) ◽  
pp. 19-34 ◽  
Author(s):  
Jehoshua Bruck ◽  
Danny Dolev ◽  
Ching-Tien Ho ◽  
Marcel-Cătălin Roşu ◽  
Ray Strong

Author(s):  
Peng Wen ◽  
Wei Qiu

This paper presents the further development of numerical simulation method to solve 3-D highly non-linear slamming problems using parallel computing algorithms. The water entry problems are treated as multi-phase problems (solid, water and air) and governed by the Navier-Stokes (N-S) equations. They are solved by the three-dimensional constrained interpolation profile (CIP) method. The interfaces between different phases are captured using density functions. In the computation, the 3-D CIP method is employed for the advection phase of the N-S equations and a pressure-based algorithm is applied for the non-advection phase. The bi-conjugate gradient stabilized method (BiCGSTAB) is utilized to solve the linear equation systems. A Message Passing Interface (MPI) parallel computing scheme was implemented in the computations. For the parallel computations, the three-dimensional Cartesian decomposition of the computational domain was used. The speed-up performance of various decomposition schemes were studied. Validation studies were carried out for the water entry of a 3-D wedge and a 3-D ship section with prescribed velocities. The computed slamming force, pressure distribution and free-surface elevations are compared with experimental results and numerical results by other methods.


Sign in / Sign up

Export Citation Format

Share Document