MPI Implementation
Recently Published Documents

TOTAL DOCUMENTS: 76 (last five years: 8)
H-INDEX: 15 (last five years: 1)

Author(s): Sergey Abramov, Vladimir Roganov, Valeriy Osipov, German Matveev

Supercomputer applications are usually implemented in the C, C++, and Fortran programming languages using different versions of the Message Passing Interface (MPI) library. The T-system project (OpenTS) studies the issues of automatic dynamic parallelization of programs. In practical terms, implementing applications in a mixed (hybrid) style is relevant, where one part of the application is written in the paradigm of automatic dynamic parallelization and does not use any primitives of the MPI library, while the other part is written using MPI. In this case, a library that is part of the T-system, called DMPI (Dynamic Message Passing Interface), is used. It is therefore necessary to evaluate the effectiveness of the MPI implementation available in the T-system. The purpose of this work is to examine the effectiveness of the DMPI implementation in the T-system. In a classic MPI application, 0% of the code uses automatic dynamic parallelization and 100% of the code is a regular MPI program. For the comparative analysis, the code is first executed on the standard MPI library for which it was originally written, and then using the DMPI library from the T-system. By comparing the effectiveness of the two approaches, the performance losses and the prospects for using a hybrid programming style are evaluated. Experimental studies on different types of computational problems showed that the efficiency losses are negligible. This allowed us to formulate directions for further work on the T-system and the most promising options for building hybrid applications. Thus, this article presents the results of comparative tests of the LAMMPS application using OpenMPI and OpenTS DMPI. The test results confirm the effectiveness of the DMPI implementation in the OpenTS parallel programming environment.
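For illustration, the following minimal MPI kernel is the kind of code that could be compiled unchanged against either a conventional MPI library or OpenTS DMPI to compare timings; it is a hypothetical sketch, not code from the OpenTS project or from LAMMPS.

```c
/* Minimal MPI kernel that could be linked against either a standard MPI
 * library or OpenTS DMPI for a timing comparison. Illustrative sketch only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int n = 1 << 20;                    /* local problem size */
    double *local = malloc(n * sizeof *local);
    for (int i = 0; i < n; ++i)
        local[i] = rank + i * 1e-6;

    double t0 = MPI_Wtime();

    /* Local computation followed by a global reduction: the same pattern
     * exercises the collective path in both MPI and DMPI. */
    double local_sum = 0.0, global_sum = 0.0;
    for (int i = 0; i < n; ++i)
        local_sum += local[i];
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    double t1 = MPI_Wtime();
    if (rank == 0)
        printf("ranks=%d sum=%.6e time=%.6f s\n", size, global_sum, t1 - t0);

    free(local);
    MPI_Finalize();
    return 0;
}
```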


2021
Author(s): Lukas Hübner, Alexey M. Kozlov, Demian Hespe, Peter Sanders, Alexandros Stamatakis

Phylogenetic trees are now routinely inferred on large-scale HPC systems with thousands of cores, as the parallel scalability of phylogenetic inference tools has improved over the past years to cope with the molecular data avalanche. Thus, the parallel fault tolerance of phylogenetic inference tools has become a relevant challenge. To this end, we explore parallel fault tolerance mechanisms and algorithms, the software modifications required, and the performance penalties induced by enabling parallel fault tolerance, using as an example RAxML-NG, the successor of the widely used RAxML tool for maximum-likelihood-based phylogenetic tree inference. We find that the slowdown induced by the necessary additional recovery mechanisms in RAxML-NG is on average 2%. The overall slowdown of using these recovery mechanisms in conjunction with a fault-tolerant MPI implementation amounts to 8% on average for large empirical datasets. Via failure simulations, we show that RAxML-NG can successfully recover from multiple simultaneous failures, subsequent failures, failures during recovery, and failures during checkpointing. Recoveries are automatic and transparent to the user. The modified fault-tolerant RAxML-NG code is available under GNU GPL at https://github.com/lukashuebner/ft-raxml-ng. Contact: lukas.huebner@{kit.edu,h-its.org}; [email protected], [email protected], [email protected], [email protected]. Supplementary information: Supplementary data are available at bioRxiv.
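The recovery pattern described above can be sketched, assuming an MPI library that exposes the ULFM fault-tolerance extensions (MPIX_Comm_shrink and related calls, e.g. Open MPI with ULFM), roughly as follows. This is an illustrative sketch, not the RAxML-NG implementation, and restore_from_checkpoint() is a hypothetical placeholder.

```c
/* ULFM-style failure detection and recovery, assuming an MPI implementation
 * that provides the ULFM extensions. Not taken from RAxML-NG. */
#include <mpi.h>
#include <mpi-ext.h>   /* ULFM: MPIX_ERR_PROC_FAILED, MPIX_Comm_shrink, ... */
#include <stdio.h>

static void restore_from_checkpoint(MPI_Comm comm)
{
    /* Placeholder: reload the last consistent checkpoint here. */
    int rank;
    MPI_Comm_rank(comm, &rank);
    if (rank == 0)
        printf("restored from last checkpoint on shrunken communicator\n");
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    MPI_Comm comm = MPI_COMM_WORLD;

    /* Return errors instead of aborting, so failures can be handled. */
    MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);

    double local = 1.0, sum = 0.0;
    int rc = MPI_Allreduce(&local, &sum, 1, MPI_DOUBLE, MPI_SUM, comm);
    if (rc != MPI_SUCCESS) {
        int eclass;
        MPI_Error_class(rc, &eclass);
        if (eclass == MPIX_ERR_PROC_FAILED || eclass == MPIX_ERR_REVOKED) {
            /* Propagate the failure, drop the failed ranks, and continue. */
            MPIX_Comm_revoke(comm);
            MPI_Comm shrunk;
            MPIX_Comm_shrink(comm, &shrunk);
            comm = shrunk;
            restore_from_checkpoint(comm);
        }
    }

    MPI_Finalize();
    return 0;
}
```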


2020, Vol 62 (6-7), pp. 1007-1033
Author(s): Noémie Debroux, Carole Le Guyader, Luminita A. Vese
Keyword(s):

Author(s): M.B. Kakenov, E.V. Zemlyanaya

An MPI implementation of the calculation of the microscopic optical potential of nucleon-nucleus scattering within the single-folding model has been developed. The folding potential and the corresponding differential cross section of the 11Li + p elastic scattering have been calculated at 62 MeV/nucleon on the heterogeneous cluster "HybriLIT" of the Multifunctional Information and Computational Complex (MICC) of the Laboratory of Information Technologies of JINR. Agreement between the experimental data and the numerical results is demonstrated for various models of the 11Li density distribution used in the construction of the folding potential.
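A generic pattern for this kind of parallelization, distributing a radial grid across MPI ranks and collecting the result on one rank, might look like the sketch below. It is illustrative only; fold_integrand() is a hypothetical placeholder for the actual density-folding integral, not the authors' code.

```c
/* Block-distributing a radial grid over MPI ranks and gathering the
 * computed potential values on rank 0. Illustrative sketch only. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

static double fold_integrand(double r)   /* placeholder for the physics */
{
    return exp(-r * r);                  /* stands in for the folding integral */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int npts = 4000;               /* radial grid points */
    const double rmax = 20.0, dr = rmax / npts;

    /* Block distribution of grid points over ranks. */
    int chunk = (npts + size - 1) / size;
    int lo = rank * chunk;
    int hi = (lo + chunk < npts) ? lo + chunk : npts;

    double *partial = calloc(chunk, sizeof *partial);
    for (int i = lo; i < hi; ++i)
        partial[i - lo] = fold_integrand((i + 0.5) * dr);

    /* Gather the distributed potential values on rank 0. */
    double *potential = NULL;
    if (rank == 0)
        potential = malloc((size_t)chunk * size * sizeof *potential);
    MPI_Gather(partial, chunk, MPI_DOUBLE,
               potential, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("U(r=%.3f) ~ %.6f\n", 0.5 * dr, potential[0]);

    free(partial);
    free(potential);
    MPI_Finalize();
    return 0;
}
```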


2019, Vol 34 (6), pp. 317-326
Author(s): Sergei A. Goreinov

We consider a method due to P. Vassilevski and Yu. A. Kuznetsov [4, 10] for solving linear systems with matrices of low Kronecker rank such that all factors in the Kronecker products are banded. The most important examples of such matrices arise from the discretized div K grad operator with diffusion term k1(x)k2(y)k3(z). Several practical issues are addressed: an MPI implementation with distribution of data along a processor grid inheriting the Cartesian 3D structure of the discretized problem; implicit deflation of the known nullspace of the system matrix; and links with the two-grid framework of the multigrid algorithm, which allow one to remove the requirement of Kronecker structure along one or two of the axes. Numerical experiments show the efficiency of the 3D data distribution, with scalability analogous to (structured) HYPRE solvers yet absolute timings an order of magnitude lower, over the range from 10 to 10^4 cores.
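The 3D Cartesian data distribution mentioned above maps naturally onto MPI's Cartesian topology routines; the following sketch, which is not the authors' solver, shows how such a processor grid might be set up.

```c
/* Setting up a 3D Cartesian processor grid that mirrors the x/y/z structure
 * of a discretized domain. Illustrative sketch only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Let MPI factor the ranks into a 3D grid matching the three axes. */
    int dims[3] = {0, 0, 0};
    MPI_Dims_create(size, 3, dims);

    int periods[3] = {0, 0, 0};           /* non-periodic domain */
    MPI_Comm cart;
    MPI_Cart_create(MPI_COMM_WORLD, 3, dims, periods, 1, &cart);

    int cart_rank, coords[3];
    MPI_Comm_rank(cart, &cart_rank);
    MPI_Cart_coords(cart, cart_rank, 3, coords);

    /* Neighbours along one axis, e.g. for per-direction banded solves. */
    int xlo, xhi;
    MPI_Cart_shift(cart, 0, 1, &xlo, &xhi);

    if (cart_rank == 0)
        printf("processor grid: %d x %d x %d\n", dims[0], dims[1], dims[2]);

    MPI_Comm_free(&cart);
    MPI_Finalize();
    return 0;
}
```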


Author(s): Jean Luca Bez, André Ramos Carneiro, Pablo José Pavan, Valéria Soldera Girelli, Francieli Zanon Boito, ...

In this article, we study the I/O performance of the Santos Dumont supercomputer, since the gap between processing and data-access speeds causes many applications to spend a large portion of their execution on I/O operations. For a large-scale, expensive supercomputer, it is essential to ensure applications achieve the best I/O performance to promote efficient usage. We monitor a week of the machine's activity and present a detailed study on the obtained metrics, aiming at providing an understanding of its workload. From experience with one numerical simulation, we identified large I/O performance differences between the MPI implementations available to users. We investigated the phenomenon and narrowed it down to collective I/O operations with small request sizes. For these, we concluded that the customized MPI implementation by the machine's vendor (used by more than 20% of the jobs) presents the worst performance. By investigating the issue, we provide information to help improve future MPI-IO collective write implementations and practical guidelines to help users and steer future system upgrades. Finally, we discuss the challenge of describing applications' I/O behavior without depending on information from users. That allows for identifying an application's I/O bottlenecks and proposing ways of improving its I/O performance. We propose a methodology to do so, and use GROMACS, the application with the largest number of jobs in 2017, as a case study.
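For reference, a collective write with small per-rank requests, the case where the vendor MPI performed worst, can be expressed with MPI-IO roughly as in the sketch below; this is an illustrative example, not the benchmark or application code used in the study.

```c
/* Collective MPI-IO write: each rank writes a small contiguous block of a
 * shared file with MPI_File_write_at_all. Illustrative sketch only. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    const int count = 1024;               /* small request per rank */
    double *buf = malloc(count * sizeof *buf);
    for (int i = 0; i < count; ++i)
        buf[i] = rank;

    MPI_File fh;
    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

    /* Collective write: offsets are disjoint, so the MPI-IO layer can
     * aggregate the small requests (e.g. via two-phase I/O). */
    MPI_Offset offset = (MPI_Offset)rank * count * (MPI_Offset)sizeof(double);
    MPI_File_write_at_all(fh, offset, buf, count, MPI_DOUBLE,
                          MPI_STATUS_IGNORE);

    MPI_File_close(&fh);
    free(buf);
    MPI_Finalize();
    return 0;
}
```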


Author(s): Alexandre Denis, Julien Jaeger, Emmanuel Jeannot, Marc Pérache, Hugo Taboada

To amortize the cost of MPI collective operations, nonblocking collectives have been proposed so as to allow communications to be overlapped with computation. Unfortunately, collective communications are more CPU-hungry than point-to-point communications, and running them in a communication thread on a dedicated CPU core makes them slow. On the other hand, running collective communications on the application cores leads to no overlap. In this article, we propose placement algorithms for progress threads so that running them on cores dedicated to communication yields communication/computation overlap without degrading performance. We first show that even simple collective operations, such as those based on a chain topology, are not straightforward to progress in the background on a dedicated core. Then, we propose an algorithm for tree-based collective operations that splits the tree between communication cores and application cores. To get the best of both worlds, the algorithm runs the short but heavy part of the tree on application cores, and the long but narrow part of the tree on one or several communication cores, so as to reach a trade-off between overlap and absolute performance. We provide a model to study and predict its behavior and to tune its parameters. We implemented both algorithms in the MultiProcessor Computing framework, which is a thread-based MPI implementation. We have run benchmarks on manycore processors such as the KNL and Skylake and obtain good results for both performance and overlap.
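The basic overlap pattern under study, starting a nonblocking collective and computing while it progresses in the background, looks roughly like the following sketch; how the progress threads are placed is internal to the MPI library, and this example is not taken from the article's benchmarks.

```c
/* Overlapping a nonblocking collective with independent computation.
 * Illustrative sketch only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double local = rank, global = 0.0;

    /* Start the collective, then compute while it progresses. */
    MPI_Request req;
    MPI_Iallreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM,
                   MPI_COMM_WORLD, &req);

    double work = 0.0;
    for (long i = 0; i < 50L * 1000 * 1000; ++i)   /* independent computation */
        work += 1e-9 * (double)i;

    /* Only now block until the reduction has completed. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);

    if (rank == 0)
        printf("sum=%.1f work=%.3f\n", global, work);

    MPI_Finalize();
    return 0;
}
```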


Author(s): Christian Simmendinger, Roman Iakymchuk, Luis Cebamanos, Dana Akhmetova, Valeria Bartsch, ...

One of the main hurdles of partitioned global address space (PGAS) approaches is the dominance of the message passing interface (MPI), which as a de facto standard appears in the code base of many applications. To take advantage of PGAS APIs like the global address space programming interface (GASPI) without a major change in the code base, interoperability between MPI and PGAS approaches needs to be ensured. In this article, we consider an interoperable GASPI/MPI implementation for the communication- and performance-critical parts of the Ludwig and iPIC3D applications. To address the discovered performance limitations, we develop a novel strategy for significantly improved performance and interoperability between both APIs by leveraging GASPI shared windows and shared notifications. First results with a corresponding implementation in the MiniGhost proxy application and the Allreduce collective operation demonstrate the viability of this approach.
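As a point of reference for the shared-window idea, MPI-3 shared-memory windows let ranks on the same node access one another's window memory directly; the sketch below uses only standard MPI calls and is an MPI-side analogy, not the GASPI API or the authors' implementation.

```c
/* MPI-3 shared-memory window: ranks on the same node map one another's
 * window memory directly. MPI-side analogue of the shared-window idea;
 * it does not use the GASPI API. Illustrative sketch only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* Communicator of the ranks that share a node. */
    MPI_Comm node;
    MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                        MPI_INFO_NULL, &node);
    int nrank, nsize;
    MPI_Comm_rank(node, &nrank);
    MPI_Comm_size(node, &nsize);

    /* Each rank contributes one double to a node-local shared segment. */
    double *base;
    MPI_Win win;
    MPI_Win_allocate_shared(sizeof(double), sizeof(double), MPI_INFO_NULL,
                            node, &base, &win);
    *base = nrank;

    MPI_Win_fence(0, win);                 /* synchronize the node */

    /* Rank 0 reads rank 1's value directly through the shared mapping. */
    if (nrank == 0 && nsize > 1) {
        MPI_Aint size;
        int disp;
        double *peer;
        MPI_Win_shared_query(win, 1, &size, &disp, &peer);
        printf("rank 1 wrote %.1f\n", *peer);
    }

    MPI_Win_free(&win);
    MPI_Comm_free(&node);
    MPI_Finalize();
    return 0;
}
```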


Geophysics, 2018, Vol 83 (2), pp. R159-R171
Author(s): Lei Fu, Bowen Guo, Gerard T. Schuster

We present a scheme for multiscale phase inversion (MPI) of seismic data that is less sensitive than full-waveform inversion (FWI) to the unmodeled physics of wave propagation and to a poor starting model. To avoid cycle skipping, the multiscale strategy temporally integrates the traces several times, i.e., high-order integration, to produce low-boost seismograms that are used as input data for the initial iterations of MPI. As the iterations proceed, lower frequencies in the data are boosted by using integrated traces of lower order as the input data. The input data are also filtered into different narrow frequency bands for the MPI implementation. Numerical results with synthetic acoustic data indicate that, for the Marmousi model, MPI is more robust than conventional multiscale FWI when the initial model is moderately far from the true model. Results from synthetic viscoacoustic and elastic data indicate that MPI is less sensitive than FWI to some of the unmodeled physics. Inversion of marine data indicates that MPI is more robust and produces modestly more accurate results than FWI for this data set.
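The trace-integration step described above can be illustrated with a short sketch: one pass of trapezoidal time integration applied repeatedly yields the higher-order integration that boosts low frequencies. This is a toy example, not the authors' code.

```c
/* Repeated temporal integration of a seismic trace ("high-order
 * integration") to emphasize low frequencies. Illustrative sketch only. */
#include <stdio.h>

#define NT 2000

/* One pass of trapezoidal time integration; applying it k times gives
 * order-k integration, which progressively boosts the low frequencies. */
static void integrate_trace(const double *in, double *out, int nt, double dt)
{
    out[0] = 0.0;
    for (int i = 1; i < nt; ++i)
        out[i] = out[i - 1] + 0.5 * dt * (in[i - 1] + in[i]);
}

int main(void)
{
    double trace[NT] = {0}, work[NT];
    trace[100] = 1.0;                 /* a spike as a toy input trace */
    const double dt = 0.004;          /* 4 ms sampling interval */

    /* Third-order integration: integrate the trace three times. */
    for (int k = 0; k < 3; ++k) {
        integrate_trace(trace, work, NT, dt);
        for (int i = 0; i < NT; ++i)
            trace[i] = work[i];
    }
    printf("integrated trace end value: %g\n", trace[NT - 1]);
    return 0;
}
```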

