MPI Implementation
Recently Published Documents

TOTAL DOCUMENTS: 76 (last five years: 8)
H-INDEX: 15 (last five years: 1)

Author(s): Sergey Abramov, Vladimir Roganov, Valeriy Osipov, German Matveev

Supercomputer applications are usually implemented in the C, C++, and Fortran programming languages using different versions of the Message Passing Interface (MPI) library. The T-system project (OpenTS) studies the issues of automatic dynamic parallelization of programs. In practical terms, implementing applications in a mixed (hybrid) style is relevant, where one part of the application is written in the paradigm of automatic dynamic parallelization and does not use any primitives of the MPI library, while the other part is written using MPI. In this case, a library that is part of the T-system, called DMPI (Dynamic Message Passing Interface), is used. It is therefore necessary to evaluate the effectiveness of the MPI implementation available in the T-system. The purpose of this work is to examine the effectiveness of the DMPI implementation in the T-system. In a classic MPI application, 0% of the code uses automatic dynamic parallelization and 100% of the code is a regular MPI program. For the comparative analysis, the code is first executed on the standard MPI library for which it was originally written, and then using the DMPI library from the T-system. By comparing the effectiveness of the two approaches, the performance losses and the prospects for using a hybrid programming style are evaluated. Experimental studies on different types of computational problems showed that the efficiency losses are negligible. This allowed us to formulate directions for further work on the T-system and the most promising options for building hybrid applications. Thus, this article presents the results of comparative tests of the LAMMPS application using OpenMPI and OpenTS DMPI. The test results confirm the effectiveness of the DMPI implementation in the OpenTS parallel programming environment.
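For illustration, the following minimal MPI kernel is the kind of code that could be compiled unchanged against either a conventional MPI library or OpenTS DMPI to compare timings; it is a hypothetical sketch, not code from the OpenTS project or from LAMMPS.

```c
/* Minimal MPI kernel that could be linked against either a standard MPI
 * library or OpenTS DMPI for a timing comparison. Illustrative sketch only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;                    /* local problem size */
    double *local = malloc(n * sizeof *local);
    for (int i = 0; i < n; ++i)
        local[i] = rank + i * 1e-6;

    double t0 = MPI_Wtime();

    /* Local computation followed by a global reduction: the same pattern
     * exercises the collective path in both MPI and DMPI. */
    double local_sum = 0.0, global_sum = 0.0;
    for (int i = 0; i < n; ++i)
        local_sum += local[i];
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("ranks=%d sum=%.6e time=%.6f s\n", size, global_sum, t1 - t0);

    free(local);
    MPI_Finalize();
    return 0;
}
```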


2021
Author(s): Lukas Hübner, Alexey M. Kozlov, Demian Hespe, Peter Sanders, Alexandros Stamatakis

Phylogenetic trees are now routinely inferred on large-scale HPC systems with thousands of cores, as the parallel scalability of phylogenetic inference tools has improved over the past years to cope with the molecular data avalanche. Thus, the parallel fault tolerance of phylogenetic inference tools has become a relevant challenge. To this end, we explore parallel fault tolerance mechanisms and algorithms, the software modifications required, and the performance penalties induced by enabling parallel fault tolerance, using as an example RAxML-NG, the successor of the widely used RAxML tool for maximum-likelihood-based phylogenetic tree inference. We find that the slowdown induced by the necessary additional recovery mechanisms in RAxML-NG is on average 2%. The overall slowdown of using these recovery mechanisms in conjunction with a fault-tolerant MPI implementation amounts to 8% on average for large empirical datasets. Via failure simulations, we show that RAxML-NG can successfully recover from multiple simultaneous failures, subsequent failures, failures during recovery, and failures during checkpointing. Recoveries are automatic and transparent to the user. The modified fault-tolerant RAxML-NG code is available under GNU GPL at https://github.com/lukashuebner/ft-raxml-ng. Contact: lukas.huebner@{kit.edu,h-its.org}; [email protected], [email protected], [email protected], [email protected]. Supplementary information: Supplementary data are available at bioRxiv.
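The recovery pattern described above can be sketched, assuming an MPI library that exposes the ULFM fault-tolerance extensions (MPIX_Comm_shrink and related calls, e.g. Open MPI with ULFM), roughly as follows. This is an illustrative sketch, not the RAxML-NG implementation, and restore_from_checkpoint() is a hypothetical placeholder.

```c
/* ULFM-style failure detection and recovery, assuming an MPI implementation
 * that provides the ULFM extensions. Not taken from RAxML-NG. */
#include <mpi.h>
#include <mpi-ext.h>   /* ULFM: MPIX_ERR_PROC_FAILED, MPIX_Comm_shrink, ... */
#include <stdio.h>

static void restore_from_checkpoint(MPI_Comm comm)
{
    /* Placeholder: reload the last consistent checkpoint here. */
    int rank;
    MPI_Comm_rank(comm, &rank);
    if (rank == 0)
        printf("restored from last checkpoint on shrunken communicator\n");
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm comm = MPI_COMM_WORLD;

    /* Return errors instead of aborting, so failures can be handled. */
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

    double local = 1.0, sum = 0.0;
    int rc = MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, comm);
    if (rc != MPI_SUCCESS) {
        int eclass;
        MPI_Error_class(rc, &eclass);
        if (eclass == MPIX_ERR_PROC_FAILED || eclass == MPIX_ERR_REVOKED) {
            /* Propagate the failure, drop the failed ranks, and continue. */
            MPIX_Comm_revoke(comm);
            MPI_Comm shrunk;
            MPIX_Comm_shrink(comm, &shrunk);
            comm = shrunk;
            restore_from_checkpoint(comm);
        }
    }

    MPI_Finalize();
    return 0;
}
```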


2020, Vol 62 (6-7), pp. 1007-1033
Author(s): Noémie Debroux, Carole Le Guyader, Luminita A. Vese
Keyword(s):

Author(s): M.B. Kakenov, E.V. Zemlyanaya

An MPI implementation of the calculation of the microscopic optical potential of nucleon-nucleus scattering within the single-folding model has been developed. The folding potential and the corresponding differential cross section of the 11Li + p elastic scattering have been calculated at 62 MeV/nucleon on the heterogeneous cluster "HybriLIT" of the Multifunctional Information and Computational Complex (MICC) of the Laboratory of Information Technologies of JINR. Agreement between the experimental data and the numerical results is demonstrated for various models of the 11Li density distribution used in the construction of the folding potential.
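A generic pattern for this kind of parallelization, distributing a radial grid across MPI ranks and collecting the result on one rank, might look like the sketch below. It is illustrative only; fold_integrand() is a hypothetical placeholder for the actual density-folding integral, not the authors' code.

```c
/* Block-distributing a radial grid over MPI ranks and gathering the
 * computed potential values on rank 0. Illustrative sketch only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static double fold_integrand(double r)   /* placeholder for the physics */
{
    return exp(-r * r);                  /* stands in for the folding integral */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int npts = 4000;               /* radial grid points */
    const double rmax = 20.0, dr = rmax / npts;

    /* Block distribution of grid points over ranks. */
    int chunk = (npts + size - 1) / size;
    int lo = rank * chunk;
    int hi = (lo + chunk < npts) ? lo + chunk : npts;

    double *partial = calloc(chunk, sizeof *partial);
    for (int i = lo; i < hi; ++i)
        partial[i - lo] = fold_integrand((i + 0.5) * dr);

    /* Gather the distributed potential values on rank 0. */
    double *potential = NULL;
    if (rank == 0)
        potential = malloc((size_t)chunk * size * sizeof *potential);
    MPI_Gather(partial, chunk, MPI_DOUBLE,
               potential, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("U(r=%.3f) ~ %.6f\n", 0.5 * dr, potential[0]);

    free(partial);
    free(potential);
    MPI_Finalize();
    return 0;
}
```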


2019, Vol 34 (6), pp. 317-326
Author(s): Sergei A. Goreinov

We consider a method due to P. Vassilevski and Yu. A. Kuznetsov [4, 10] for solving linear systems with matrices of low Kronecker rank such that all factors in the Kronecker products are banded. The most important examples of such matrices arise from the discretized div K grad operator with diffusion term k1(x)k2(y)k3(z). Several practical issues are addressed: an MPI implementation with distribution of data along a processor grid inheriting the Cartesian 3D structure of the discretized problem; implicit deflation of the known nullspace of the system matrix; and links with the two-grid framework of the multigrid algorithm, which allow one to remove the requirement of Kronecker structure along one or two of the axes. Numerical experiments show the efficiency of the 3D data distribution, with scalability analogous to (structured) HYPRE solvers yet absolute timings an order of magnitude lower, over the range from 10 to 10^4 cores.
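The 3D Cartesian data distribution mentioned above maps naturally onto MPI's Cartesian topology routines; the following sketch, which is not the authors' solver, shows how such a processor grid might be set up.

```c
/* Setting up a 3D Cartesian processor grid that mirrors the x/y/z structure
 * of a discretized domain. Illustrative sketch only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Let MPI factor the ranks into a 3D grid matching the three axes. */
    int dims[3] = {0, 0, 0};
    MPI_Dims_create(size, 3, dims);

    int periods[3] = {0, 0, 0};           /* non-periodic domain */
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);

    int cart_rank, coords[3];
    MPI_Comm_rank(cart, &cart_rank);
    MPI_Cart_coords(cart, cart_rank, 3, coords);

    /* Neighbours along one axis, e.g. for per-direction banded solves. */
    int xlo, xhi;
    MPI_Cart_shift(cart, 0, 1, &xlo, &xhi);

    if (cart_rank == 0)
        printf("processor grid: %d x %d x %d\n", dims[0], dims[1], dims[2]);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```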


Author(s): Jean Luca Bez, André Ramos Carneiro, Pablo José Pavan, Valéria Soldera Girelli, Francieli Zanon Boito, ...

In this article, we study the I/O performance of the Santos Dumont supercomputer, since the gap between processing and data-access speeds causes many applications to spend a large portion of their execution on I/O operations. For a large-scale, expensive supercomputer, it is essential to ensure applications achieve the best I/O performance to promote efficient usage. We monitor a week of the machine's activity and present a detailed study on the obtained metrics, aiming at providing an understanding of its workload. From experience with one numerical simulation, we identified large I/O performance differences between the MPI implementations available to users. We investigated the phenomenon and narrowed it down to collective I/O operations with small request sizes. For these, we concluded that the customized MPI implementation by the machine's vendor (used by more than 20% of the jobs) presents the worst performance. By investigating the issue, we provide information to help improve future MPI-IO collective write implementations and practical guidelines to help users and steer future system upgrades. Finally, we discuss the challenge of describing applications' I/O behavior without depending on information from users. That allows for identifying an application's I/O bottlenecks and proposing ways of improving its I/O performance. We propose a methodology to do so, and use GROMACS, the application with the largest number of jobs in 2017, as a case study.
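For reference, a collective write with small per-rank requests, the case where the vendor MPI performed worst, can be expressed with MPI-IO roughly as in the sketch below; this is an illustrative example, not the benchmark or application code used in the study.

```c
/* Collective MPI-IO write: each rank writes a small contiguous block of a
 * shared file with MPI_File_write_at_all. Illustrative sketch only. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1024;               /* small request per rank */
    double *buf = malloc(count * sizeof *buf);
    for (int i = 0; i < count; ++i)
        buf[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: offsets are disjoint, so the MPI-IO layer can
     * aggregate the small requests (e.g. via two-phase I/O). */
    MPI_Offset offset = (MPI_Offset)rank * count * (MPI_Offset)sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```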


Author(s): Alexandre Denis, Julien Jaeger, Emmanuel Jeannot, Marc Pérache, Hugo Taboada

To amortize the cost of MPI collective operations, nonblocking collectives have been proposed so as to allow communications to be overlapped with computation. Unfortunately, collective communications are more CPU-hungry than point-to-point communications, and running them in a communication thread on a dedicated CPU core makes them slow. On the other hand, running collective communications on the application cores leads to no overlap. In this article, we propose placement algorithms for progress threads so that running them on cores dedicated to communication yields communication/computation overlap without degrading performance. We first show that even simple collective operations, such as those based on a chain topology, are not straightforward to progress in the background on a dedicated core. Then, we propose an algorithm for tree-based collective operations that splits the tree between communication cores and application cores. To get the best of both worlds, the algorithm runs the short but heavy part of the tree on application cores, and the long but narrow part of the tree on one or several communication cores, so as to reach a trade-off between overlap and absolute performance. We provide a model to study and predict its behavior and to tune its parameters. We implemented both algorithms in the MultiProcessor Computing framework, which is a thread-based MPI implementation. We have run benchmarks on manycore processors such as the KNL and Skylake and obtain good results for both performance and overlap.
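The basic overlap pattern under study, starting a nonblocking collective and computing while it progresses in the background, looks roughly like the following sketch; how the progress threads are placed is internal to the MPI library, and this example is not taken from the article's benchmarks.

```c
/* Overlapping a nonblocking collective with independent computation.
 * Illustrative sketch only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = rank, global = 0.0;

    /* Start the collective, then compute while it progresses. */
    MPI_Request req;
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    double work = 0.0;
    for (long i = 0; i < 50L * 1000 * 1000; ++i)   /* independent computation */
        work += 1e-9 * (double)i;

    /* Only now block until the reduction has completed. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("sum=%.1f work=%.3f\n", global, work);

    MPI_Finalize();
    return 0;
}
```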


Author(s): Christian Simmendinger, Roman Iakymchuk, Luis Cebamanos, Dana Akhmetova, Valeria Bartsch, ...

One of the main hurdles of partitioned global address space (PGAS) approaches is the dominance of the message passing interface (MPI), which as a de facto standard appears in the code base of many applications. To take advantage of PGAS APIs like the global address space programming interface (GASPI) without a major change in the code base, interoperability between MPI and PGAS approaches needs to be ensured. In this article, we consider an interoperable GASPI/MPI implementation for the communication- and performance-critical parts of the Ludwig and iPIC3D applications. To address the discovered performance limitations, we develop a novel strategy for significantly improved performance and interoperability between both APIs by leveraging GASPI shared windows and shared notifications. First results with a corresponding implementation in the MiniGhost proxy application and the Allreduce collective operation demonstrate the viability of this approach.
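As a point of reference for the shared-window idea, MPI-3 shared-memory windows let ranks on the same node access one another's window memory directly; the sketch below uses only standard MPI calls and is an MPI-side analogy, not the GASPI API or the authors' implementation.

```c
/* MPI-3 shared-memory window: ranks on the same node map one another's
 * window memory directly. MPI-side analogue of the shared-window idea;
 * it does not use the GASPI API. Illustrative sketch only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Communicator of the ranks that share a node. */
    MPI_Comm node;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node);
    int nrank, nsize;
    MPI_Comm_rank(node, &nrank);
    MPI_Comm_size(node, &nsize);

    /* Each rank contributes one double to a node-local shared segment. */
    double *base;
    MPI_Win win;
    MPI_Win_allocate_shared(sizeof(double), sizeof(double), MPI_INFO_NULL,
                            node, &base, &win);
    *base = nrank;

    MPI_Win_fence(0, win);                 /* synchronize the node */

    /* Rank 0 reads rank 1's value directly through the shared mapping. */
    if (nrank == 0 && nsize > 1) {
        MPI_Aint size;
        int disp;
        double *peer;
        MPI_Win_shared_query(win, 1, &size, &disp, &peer);
        printf("rank 1 wrote %.1f\n", *peer);
    }

    MPI_Win_free(&win);
    MPI_Comm_free(&node);
    MPI_Finalize();
    return 0;
}
```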


Geophysics, 2018, Vol 83 (2), pp. R159-R171
Author(s): Lei Fu, Bowen Guo, Gerard T. Schuster

We present a scheme for multiscale phase inversion (MPI) of seismic data that is less sensitive than full-waveform inversion (FWI) to the unmodeled physics of wave propagation and to a poor starting model. To avoid cycle skipping, the multiscale strategy temporally integrates the traces several times, i.e., high-order integration, to produce low-boost seismograms that are used as input data for the initial iterations of MPI. As the iterations proceed, lower frequencies in the data are boosted by using integrated traces of lower order as the input data. The input data are also filtered into different narrow frequency bands for the MPI implementation. Numerical results with synthetic acoustic data indicate that, for the Marmousi model, MPI is more robust than conventional multiscale FWI when the initial model is moderately far from the true model. Results from synthetic viscoacoustic and elastic data indicate that MPI is less sensitive than FWI to some of the unmodeled physics. Inversion of marine data indicates that MPI is more robust and produces modestly more accurate results than FWI for this data set.
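The trace-integration step described above can be illustrated with a short sketch: one pass of trapezoidal time integration applied repeatedly yields the higher-order integration that boosts low frequencies. This is a toy example, not the authors' code.

```c
/* Repeated temporal integration of a seismic trace ("high-order
 * integration") to emphasize low frequencies. Illustrative sketch only. */
#include <stdio.h>

#define NT 2000

/* One pass of trapezoidal time integration; applying it k times gives
 * order-k integration, which progressively boosts the low frequencies. */
static void integrate_trace(const double *in, double *out, int nt, double dt)
{
    out[0] = 0.0;
    for (int i = 1; i < nt; ++i)
        out[i] = out[i - 1] + 0.5 * dt * (in[i - 1] + in[i]);
}

int main(void)
{
    double trace[NT] = {0}, work[NT];
    trace[100] = 1.0;                 /* a spike as a toy input trace */
    const double dt = 0.004;          /* 4 ms sampling interval */

    /* Third-order integration: integrate the trace three times. */
    for (int k = 0; k < 3; ++k) {
        integrate_trace(trace, work, NT, dt);
        for (int i = 0; i < NT; ++i)
            trace[i] = work[i];
    }
    printf("integrated trace end value: %g\n", trace[NT - 1]);
    return 0;
}
```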

