MPI Applications
Recently Published Documents


Total documents: 212 (five years: 24)
H-index: 18 (five years: 3)

2021 ◽ Vol 10 ◽ pp. 100151
Author(s): Patrick Bell ◽ Kae Suarez ◽ Dylan Chapp ◽ Nigel Tan ◽ Sanjukta Bhowmick ◽ ...

Author(s): Roberto Rocco ◽ Davide Gadioli ◽ Gianluca Palermo

Abstract: As HPC machines grow in size, faults become more frequent, and handling them becomes mandatory. Natively, MPI cannot handle faults: it terminates the execution prematurely when it encounters one. ULFM makes it possible to continue the execution, but it requires complex integration with the application. In this paper we propose Legio, a framework that introduces fault resiliency in embarrassingly parallel MPI applications. Legio exposes its features to the application transparently, removing any integration difficulty. After a fault, the execution continues with only the non-failed processes. We also propose a hierarchical alternative, which features lower repair costs on large communicators. We evaluated our solutions on the Marconi100 cluster at CINECA with benchmarks and real-world applications, showing that the overhead introduced by the library is negligible and that it does not limit the scalability properties of MPI.


Author(s): Konstantinos Parasyris ◽ Giorgis Georgakoudis ◽ Leonardo Bautista-Gomez ◽ Ignacio Laguna

Author(s): Kiril Dichev ◽ Daniele De Sensi ◽ Dimitrios S. Nikolopoulos ◽ Kirk W Cameron ◽ Ivor Spence

2021 ◽ pp. 19-34
Author(s): Peter Arzt ◽ Yannic Fischler ◽ Jan-Patrick Lehr ◽ Christian Bischof

2021 ◽ pp. 466-481
Author(s): Thomas Dionisi ◽ Stephane Bouhrour ◽ Julien Jaeger ◽ Patrick Carribault ◽ Marc Pérache

2020 ◽ Vol 31 (11) ◽ pp. 2696-2709
Author(s): Daniele Cesarini ◽ Andrea Bartolini ◽ Andrea Borghesi ◽ Carlo Cavazzoni ◽ Mathieu Luisier ◽ ...
