MPI-FT: PORTABLE FAULT TOLERANCE SCHEME FOR MPI

2000 ◽  
Vol 10 (04) ◽  
pp. 371-382 ◽  
Author(s):  
SOULLA LOUCA ◽  
NEOPHYTOS NEOPHYTOU ◽  
ADRIANOS LACHANAS ◽  
PARASKEVAS EVRIPIDOU

In this paper, we propose the design and development of a fault tolerant and recovery scheme for the Message Passing Interface (MPI). The proposed scheme consists of a detection mechanism for detecting process failures, and a recovery mechanism. Two different cases are considered, both assuming the existence of a monitoring process, the Observer which triggers the recovery procedure in case of failure. In the first case, each process keeps a buffer with its own message traffic to be used in case of failure, while the implementor uses periodical tests for notification of failure by the Observer. The recovery function simulates all the communication of the processes with the dead one by re-sending to the replacement process all the messages destined for the dead one. In the second case, the Observer receives and stores all message traffic, and sends to the replacement all the buffered messages destined for the dead process. Solutions are provided to the dead communicator problem caused by the death of a process. A description of the prototype developed is provided along with the results of the experiments performed for efficiency and performance.

2015 ◽  
Vol 8 (3) ◽  
pp. 2369-2402
Author(s):  
W. He ◽  
C. Beyer ◽  
J. H. Fleckenstein ◽  
E. Jang ◽  
O. Kolditz ◽  
...  

Abstract. This technical paper presents an efficient and performance-oriented method to model reactive mass transport processes in environmental and geotechnical subsurface systems. The open source scientific software packages OpenGeoSys and IPhreeqc have been coupled, to combine their individual strengths and features to simulate thermo-hydro-mechanical-chemical coupled processes in porous and fractured media with simultaneous consideration of aqueous geochemical reactions. Furthermore, a flexible parallelization scheme using MPI (Message Passing Interface) grouping techniques has been implemented, which allows an optimized allocation of computer resources for the node-wise calculation of chemical reactions on the one hand, and the underlying processes such as for groundwater flow or solute transport on the other hand. The coupling interface and parallelization scheme have been tested and verified in terms of precision and performance.


2014 ◽  
Vol 571-572 ◽  
pp. 26-29
Author(s):  
Xiang Wei Duan ◽  
Wei Chang Shen ◽  
Jun Guo

The paper introduce the Mandelbrot Set and the message passing interface (MPI) and shared-memory (OpenMP), analyses the characteristic of algorithm design in the MPI and OpenMP environment, describes the implementation of parallel algorithm about Mandelbrot Set in the MPI environment and the OpenMP environment, conducted a series of evaluation and performance testing during the process of running, then the difference between the two system implementations is compared.


Author(s):  
Omer Subasi ◽  
Tatiana Martsinkevich ◽  
Ferad Zyulkyarov ◽  
Osman Unsal ◽  
Jesus Labarta ◽  
...  

We present a unified fault-tolerance framework for task-parallel message-passing applications to mitigate transient errors. First, we propose a fault-tolerant message-logging protocol that only requires the restart of the task that experienced the error and transparently handles any message passing interface calls inside the task. In our experiments we demonstrate that our fault-tolerant solution has a reasonable overhead, with a maximum observed overhead of 4.5%. We also show that fine-grained parallelization is important for hiding the overheads related to the protocol as well as the recovery of tasks. Secondly, we develop a mathematical model to unify task-level checkpointing and our protocol with system-wide checkpointing in order to provide complete failure coverage. We provide closed formulas for the optimal checkpointing interval and the performance score of the unified scheme. Experimental results show that the performance improvement can be as high as 98% with the unified scheme.


2017 ◽  
Vol 2017 ◽  
pp. 1-12 ◽  
Author(s):  
Anuj Sharma ◽  
Irene Moulitsas

High-resolution numerical methods and unstructured meshes are required in many applications of Computational Fluid Dynamics (CFD). These methods are quite computationally expensive and hence benefit from being parallelized. Message Passing Interface (MPI) has been utilized traditionally as a parallelization strategy. However, the inherent complexity of MPI contributes further to the existing complexity of the CFD scientific codes. The Partitioned Global Address Space (PGAS) parallelization paradigm was introduced in an attempt to improve the clarity of the parallel implementation. We present our experiences of converting an unstructured high-resolution compressible Navier-Stokes CFD solver from MPI to PGAS Coarray Fortran. We present the challenges, methodology, and performance measurements of our approach using Coarray Fortran. With the Cray compiler, we observe Coarray Fortran as a viable alternative to MPI. We are hopeful that Intel and open-source implementations could be utilized in the future.


Author(s):  
NENAD STANKOVIC ◽  
KANG ZHANG

The attractiveness of visual programming stems in large part from the direct interaction with program elements as if they were real objects, since people deal better with concrete objects than with the abstract. This paper describes a new graph based software visualization tool for parallel message-passing programming named Visper that combines the levels of abstraction at which message-passing parallel programs are expressed and makes use of compositional programming. Central to the tool is the Process Communication Graph that correlates both the control and data flow graphs into a single graph formalism, without a need for complex textual annotation. The graph can express static and runtime communication and replication structures, as found in Message Passing Interface (MPI) and Parallel Virtual Machine (PVM). It also forms the basis for visualizing parallel debugging and performance.


Author(s):  
George W. Leaver ◽  
Martin J. Turner ◽  
James S. Perrin ◽  
Paul M. Mummery ◽  
Philip J. Withers

Remote scientific visualization, where rendering services are provided by larger scale systems than are available on the desktop, is becoming increasingly important as dataset sizes increase beyond the capabilities of desktop workstations. Uptake of such services relies on access to suitable visualization applications and the ability to view the resulting visualization in a convenient form. We consider five rules from the e-Science community to meet these goals with the porting of a commercial visualization package to a large-scale system. The application uses message-passing interface (MPI) to distribute data among data processing and rendering processes. The use of MPI in such an interactive application is not compatible with restrictions imposed by the Cray system being considered. We present details, and performance analysis, of a new MPI proxy method that allows the application to run within the Cray environment yet still support MPI communication required by the application. Example use cases from materials science are considered.


2019 ◽  
Vol 214 ◽  
pp. 05029 ◽  
Author(s):  
Alexey Rybalchenko ◽  
Dennis Klein ◽  
Mohammad Al-Turany ◽  
Thorsten Kollegger

The high data rates expected for the next generation of particle physics experiments (e.g.: new experiments at FAIR/GSI and the upgrade of CERN experiments) call for dedicated attention with respect to design of the needed computing infrastructure. The common ALICE-FAIR framework ALFA is a modern software layer, that serves as a platform for simulation, reconstruction and analysis of particle physics experiments. Beside standard services needed for simulation and reconstruction of particle physics experiments, ALFA also provides tools for data transport, configuration and deployment. The FairMQ module in ALFA offers building blocks for creating distributed software components (processes) that communicate between each other via message passing. The abstract "message passing" interface in FairMQ has at the moment three implementations: ZeroMQ, nanomsg and shared memory. The newly developed shared memory transport will be presented, that provides significant per-formance benefits for transferring large data chunks between components on the same node. The implementation in FairMQ allows users to switch between the different transports via a trivial configuration change. The design decisions, im-plementation details and performance numbers of the shared memory transport in FairMQ/ALFA will be highlighted.


2011 ◽  
Vol 21 (04) ◽  
pp. 379-396 ◽  
Author(s):  
BLESSON VARGHESE ◽  
GERARD MCKEE ◽  
VASSIL ALEXANDROV

The work reported in this paper is motivated towards validating an alternative approach for fault tolerance over traditional methods like checkpointing that constrain efficacious fault tolerance. Can agent intelligence be used to achieve fault tolerant parallel computing systems? If so, "What agent capabilities are required for fault tolerance?", "What parallel computational tasks can benefit from such agent capabilities?" and "How can agent capabilities be implemented for fault tolerance?" need to be addressed. Cognitive capabilities essential for achieving fault tolerance through agents are considered. Parallel reduction algorithms are identified as a class of algorithms that can benefit from cognitive agent capabilities. The Message Passing Interface is utilized for implementing an intelligent agent based approach. Preliminary results obtained from the experiments validate the feasibility of an agent based approach for achieving fault tolerance in parallel computing systems.


Sign in / Sign up

Export Citation Format

Share Document