scholarly journals Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided

2014 ◽  
Vol 22 (2) ◽  
pp. 75-91 ◽  
Author(s):  
Robert Gerstenberger ◽  
Maciej Besta ◽  
Torsten Hoefler

Modern interconnects offer remote direct memory access (RDMA) features. Yet, most applications rely on explicit message passing for communications albeit their unwanted overheads. The MPI-3.0 standard defines a programming interface for exploiting RDMA networks directly, however, it's scalability and practicability has to be demonstrated in practice. In this work, we develop scalable bufferless protocols that implement the MPI-3.0 specification. Our protocols support scaling to millions of cores with negligible memory consumption while providing highest performance and minimal overheads. To arm programmers, we provide a spectrum of performance models for all critical functions and demonstrate the usability of our library and models with several application studies with up to half a million processes. We show that our design is comparable to, or better than UPC and Fortran Coarrays in terms of latency, bandwidth and message rate. We also demonstrate application performance improvements with comparable programming complexity.

2004 ◽  
Vol 12 (3) ◽  
pp. 169-183 ◽  
Author(s):  
Alexandros V. Gerbessiotis ◽  
Seung-Yeop Lee

In this work we make a strong case for remote memory access (RMA) as the effective way to program a parallel computer by proposing a framework that supports RMA in a library independent, simple and intuitive way. If one uses our approach the parallel code one writes will run transparently under MPI-2 enabled libraries but also bulk-synchronous parallel libraries. The advantage of using RMA is code simplicity, reduced programming complexity, and increased efficiency. We support the latter claims by implementing under this framework a collection of benchmark programs consisting of a communication and synchronization performance assessment program, a dense matrix multiplication algorithm, and two variants of a parallel radix-sort algorithm and examine their performance on a LINUX-based PC cluster under three different RMA enabled libraries: LAM MPI, BSPlib, and PUB. We conclude that implementations of such parallel algorithms using RMA communication primitives lead to code that is as efficient as the message-passing equivalent code and in the case of radix-sort substantially more efficient. In addition our work can be used as a comparative study of the relevant capabilities of the three libraries.


2002 ◽  
Vol 10 (1) ◽  
pp. 45-53 ◽  
Author(s):  
Jie Tao ◽  
Wolfgang Karl ◽  
Martin Schulz

Shared memory applications running transparently on top of NUMA architectures often face severe performance problems due to bad data locality and excessive remote memory accesses. Optimizations with respect to data locality are therefore necessary, but require a fundamental understanding of an application's memory access behavior. The information necessary for this cannot be obtained using simple code instrumentation due to the implicit nature of the communication handled by the NUMA hardware, the large amount of traffic produced at runtime, and the fine access granularity in shared memory codes. In this paper an approach to overcome these problems and thereby to enable an easy and efficient optimization process is presented. Based on a low-level hardware monitoring facility in coordination with a comprehensive visualization tool, it enables the generation of memory access histograms capable of showing all memory accesses across the complete address space of an application's working set. This information can be used to identify access hot spots, to understand the dynamic behavior of shared memory applications, and to optimize applications using an application specific data layout resulting in significant performance improvements.


2011 ◽  
Vol 374-377 ◽  
pp. 2078-2081
Author(s):  
Guo Fu Feng ◽  
Ming Wang ◽  
Ming Chen ◽  
Tao Chi

Heterogeneous multi-core processors are attractive for power efficient green computing because of their ability to meet varied resource requirements. The multi-level memory hierarchy of Cell Broadband Engine Architecture (CBEA) which requires explicit management by software poses significant challenges to performance increasing and programming. In this paper, with analysis of characteristic of the architecture, we implemented four access methods and a corresponding access library with a uniform memory access interface. Besides getting performance boosts beyond current level technology, the memory access library with uniform access interface could collect profile information of memory management for further performance optimization. Experimental results show the performance of proposed method is better than related works and profile information provided by the method is helpful for programmer to optimize application performance.


2021 ◽  
Vol 49 (4) ◽  
pp. 12-17
Author(s):  
Feilong Liu ◽  
Claude Barthels ◽  
Spyros Blanas ◽  
Hideaki Kimura ◽  
Garret Swart

Networkswith Remote DirectMemoryAccess (RDMA) support are becoming increasingly common. RDMA, however, offers a limited programming interface to remote memory that consists of read, write and atomic operations. With RDMA alone, completing the most basic operations on remote data structures often requires multiple round-trips over the network. Data-intensive systems strongly desire higher-level communication abstractions that supportmore complex interaction patterns. A natural candidate to consider is MPI, the de facto standard for developing high-performance applications in the HPC community. This paper critically evaluates the communication primitives of MPI and shows that using MPI in the context of a data processing system comes with its own set of insurmountable challenges. Based on this analysis, we propose a new communication abstraction named RDMO, or Remote DirectMemory Operation, that dispatches a short sequence of reads, writes and atomic operations to remote memory and executes them in a single round-trip.


2015 ◽  
Vol 25 (03) ◽  
pp. 1541002 ◽  
Author(s):  
Henri Casanova ◽  
Anshul Gupta ◽  
Frédéric Suter

The off-line (or post-mortem) analysis of execution event traces is a popular approach to understand the performance of HPC applications that use the message passing paradigm. Combining this analysis with simulation makes it possible to “replay” the application execution to explore “what if?” scenarios, e.g., assessing application performance in a range of (hypothetical) execution environments. However, such off-line analysis faces scalability issues for acquiring, storing, or replaying large event traces. We first present two previously proposed and complementary frameworks for off-line replaying of MPI application event traces, each with its own objectives and limitations. We then describe how these frameworks can be combined so as to capitalize on their respective strengths while alleviating several of their limitations. We claim that the combined framework affords levels of scalability that are beyond that achievable by either one of the two individual frameworks. We evaluate this framework to illustrate the benefits of the proposed combination for a more scalable off-line analysis of MPI applications.


Sign in / Sign up

Export Citation Format

Share Document