Enabling Highly-Scalable Remote Memory Access Programming with MPI-3 One Sided

Modern interconnects offer remote direct memory access (RDMA) features. Yet most applications rely on explicit message passing for communication despite its unwanted overheads. The MPI-3.0 standard defines a programming interface for exploiting RDMA networks directly; however, its scalability and practicability have to be demonstrated in practice. In this work, we develop scalable bufferless protocols that implement the MPI-3.0 specification. Our protocols support scaling to millions of cores with negligible memory consumption while providing highest performance and minimal overheads. To arm programmers, we provide a spectrum of performance models for all critical functions and demonstrate the usability of our library and models with several application studies with up to half a million processes. We show that our design is comparable to, or better than, UPC and Fortran Coarrays in terms of latency, bandwidth, and message rate. We also demonstrate application performance improvements with comparable programming complexity.

Remote Memory Access: A Case for Portable, Efficient and Library Independent Parallel Programming

Scientific Programming ◽

10.1155/2004/934718 ◽

2004 ◽

Vol 12 (3) ◽

pp. 169-183 ◽

Cited By ~ 6

Author(s):

Alexandros V. Gerbessiotis ◽

Seung-Yeop Lee

Keyword(s):

Message Passing ◽

Matrix Multiplication ◽

Memory Access ◽

Parallel Computer ◽

Remote Memory ◽

Dense Matrix ◽

Radix Sort ◽

Matrix Multiplication Algorithm ◽

Bulk Synchronous Parallel ◽

Remote Memory Access

In this work we make a strong case for remote memory access (RMA) as an effective way to program a parallel computer by proposing a framework that supports RMA in a library-independent, simple, and intuitive way. With our approach, the parallel code one writes will run transparently under MPI-2 enabled libraries and also under bulk-synchronous parallel libraries. The advantages of using RMA are code simplicity, reduced programming complexity, and increased efficiency. We support the latter claims by implementing under this framework a collection of benchmark programs, consisting of a communication and synchronization performance assessment program, a dense matrix multiplication algorithm, and two variants of a parallel radix-sort algorithm, and by examining their performance on a Linux-based PC cluster under three different RMA-enabled libraries: LAM MPI, BSPlib, and PUB. We conclude that implementations of such parallel algorithms using RMA communication primitives lead to code that is as efficient as the equivalent message-passing code and, in the case of radix-sort, substantially more efficient. In addition, our work can be used as a comparative study of the relevant capabilities of the three libraries.
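
As an illustration of the put-based style of parallel radix sort the abstract mentions, here is a minimal single-process simulation, not the authors' code or either of their two variants. The names `radix_pass` and `radix_sort`, the list-of-lists "windows", and the 4-bit digit width are all assumptions made for this sketch; a real implementation would replace the direct list writes with the put primitive of the RMA library and a barrier between supersteps.

```python
from bisect import bisect_right
from itertools import accumulate


def radix_pass(parts, shift, bits=4):
    """One stable counting pass on the digit at `shift`, over simulated processes.

    `parts[r]` is the list of non-negative integer keys held by process r.
    Returns new per-process lists with the same sizes as the input.
    """
    p = len(parts)
    radix = 1 << bits
    mask = radix - 1

    # Superstep 1: every process counts its own keys per digit value.
    counts = [[0] * radix for _ in range(p)]
    for r, keys in enumerate(parts):
        for k in keys:
            counts[r][(k >> shift) & mask] += 1

    # Global offset for (rank, digit): all smaller digits come first,
    # then keys with the same digit held by lower ranks.
    offset = [[0] * radix for _ in range(p)]
    running = 0
    for d in range(radix):
        for r in range(p):
            offset[r][d] = running
            running += counts[r][d]

    # Superstep 2: each process writes ("puts") every key straight into
    # the window of the process that owns its global destination index.
    sizes = [len(keys) for keys in parts]
    starts = [0] + list(accumulate(sizes))[:-1]
    windows = [[None] * n for n in sizes]
    for r, keys in enumerate(parts):
        for k in keys:
            d = (k >> shift) & mask
            g = offset[r][d]
            offset[r][d] += 1
            owner = bisect_right(starts, g) - 1
            windows[owner][g - starts[owner]] = k  # simulated remote put
    return windows


def radix_sort(parts, key_bits=16, bits=4):
    """Least-significant-digit radix sort; keys must fit in `key_bits` bits."""
    for shift in range(0, key_bits, bits):
        parts = radix_pass(parts, shift, bits)
    return parts


result = radix_sort([[170, 45, 75], [90, 802, 24], [2, 66]])
print(result)  # [[2, 24, 45], [66, 75, 90], [170, 802]]
```

The sender computes every destination itself, so the receiving process takes no part in the transfer. That is the property that separates a put from a matched send and receive.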

Memory Access Behavior Analysis of NUMA-Based Shared Memory Programs

Scientific Programming ◽

10.1155/2002/790749 ◽

2002 ◽

Vol 10 (1) ◽

pp. 45-53 ◽

Cited By ~ 3

Author(s):

Jie Tao ◽

Wolfgang Karl ◽

Martin Schulz

Keyword(s):

Shared Memory ◽

Data Locality ◽

Memory Access ◽

Remote Memory ◽

Data Layout ◽

Performance Improvements ◽

Significant Performance ◽

Working Set ◽

Memory Accesses ◽

Memory Applications

Shared memory applications running transparently on top of NUMA architectures often face severe performance problems due to bad data locality and excessive remote memory accesses. Optimizations with respect to data locality are therefore necessary, but require a fundamental understanding of an application's memory access behavior. The information necessary for this cannot be obtained using simple code instrumentation due to the implicit nature of the communication handled by the NUMA hardware, the large amount of traffic produced at runtime, and the fine access granularity in shared memory codes. In this paper an approach to overcome these problems and thereby to enable an easy and efficient optimization process is presented. Based on a low-level hardware monitoring facility in coordination with a comprehensive visualization tool, it enables the generation of memory access histograms capable of showing all memory accesses across the complete address space of an application's working set. This information can be used to identify access hot spots, to understand the dynamic behavior of shared memory applications, and to optimize applications using an application specific data layout resulting in significant performance improvements.
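
To make the idea of a memory access histogram concrete, the following is a small sketch that works on a software trace rather than on the hardware monitor the abstract describes. The trace format of `(address, home_node)` pairs, the 4096-byte page granularity, and the function names are assumptions for illustration only.

```python
from collections import Counter

PAGE_SIZE = 4096  # assumed histogram granularity, in bytes


def access_histogram(trace, local_node, page_size=PAGE_SIZE):
    """Count accesses per page, and separately those served by a remote node.

    `trace` is an iterable of (address, home_node) pairs, where home_node
    is the NUMA node whose memory holds the address.
    """
    total = Counter()
    remote = Counter()
    for address, home_node in trace:
        page = address // page_size
        total[page] += 1
        if home_node != local_node:
            remote[page] += 1
    return total, remote


def hot_spots(remote, top=3):
    """Pages with the most remote accesses, highest count first."""
    return remote.most_common(top)


trace = [
    (0x1000, 0), (0x1008, 0),               # local accesses to page 1
    (0x2000, 1), (0x2010, 1), (0x2020, 1),  # remote accesses to page 2
    (0x3000, 2),                            # remote access to page 3
]
total, remote = access_histogram(trace, local_node=0)
print(hot_spots(remote, top=1))  # [(2, 3)]
```

Pages that dominate the remote histogram are the natural candidates for relocation when choosing an application-specific data layout.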

A shared virtual memory network with fast remote direct memory access and message passing

2004 IEEE International Conference on Cluster Computing (IEEE Cat. No.04EX935) ◽

10.1109/clustr.2004.1392660 ◽

2005 ◽

Author(s):

Gang Shi ◽

Mingchang Hu ◽

Hongda Yin ◽

Weiwu Hu ◽

Zhimin Tang

Keyword(s):

Message Passing ◽

Direct Memory Access ◽

Virtual Memory ◽

Memory Access ◽

Memory Network ◽

Shared Virtual Memory

FRAMP: A fast remote direct memory access and message passing network

IEEE International Symposium on Communications and Information Technology, 2004. ISCIT 2004. ◽

10.1109/iscit.2004.1412914 ◽

2005 ◽

Author(s):

Gang Shi ◽

Mingchang Hu ◽

Hongda Yin ◽

Weiwu Hu ◽

Zhimin Tang

Keyword(s):

Message Passing ◽

Direct Memory Access ◽

Memory Access

Enabling Efficient Inter-Node Message Passing and Remote Memory Access Via a uGNI Based Light-Weight Network Substrate for Cray Interconnects

2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID) ◽

10.1109/ccgrid.2018.00006 ◽

2018 ◽

Cited By ~ 1

Author(s):

Udayanga Wickramasinghe ◽

Andrew Lumsdaine

Keyword(s):

Message Passing ◽

Memory Access ◽

Remote Memory ◽

Light Weight ◽

Remote Memory Access

Study on Explicit Memory Management for CBEA Green Computing Architecture

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.374-377.2078 ◽

2011 ◽

Vol 374-377 ◽

pp. 2078-2081

Author(s):

Guo Fu Feng ◽

Ming Wang ◽

Ming Chen ◽

Tao Chi

Keyword(s):

Performance Optimization ◽

Memory Management ◽

Green Computing ◽

Memory Access ◽

Application Performance ◽

Access Methods ◽

Power Efficient ◽

Profile Information ◽

Resource Requirements ◽

Better Than

Heterogeneous multi-core processors are attractive for power-efficient green computing because of their ability to meet varied resource requirements. The multi-level memory hierarchy of the Cell Broadband Engine Architecture (CBEA), which requires explicit management by software, poses significant challenges to both performance and programming. In this paper, based on an analysis of the characteristics of the architecture, we implement four access methods and a corresponding access library with a uniform memory access interface. Besides delivering performance gains over existing techniques, the memory access library with its uniform access interface can collect profile information on memory management for further performance optimization. Experimental results show that the performance of the proposed method is better than that of related work and that the profile information it provides helps programmers optimize application performance.

Beyond MPI

ACM SIGMOD Record ◽

10.1145/3456859.3456862 ◽

2021 ◽

Vol 49 (4) ◽

pp. 12-17

Author(s):

Feilong Liu ◽

Claude Barthels ◽

Spyros Blanas ◽

Hideaki Kimura ◽

Garret Swart

Keyword(s):

High Performance ◽

Processing System ◽

Complex Interaction ◽

Remote Memory ◽

Interaction Patterns ◽

Round Trip ◽

Data Processing System ◽

Data Intensive ◽

Multiple Round ◽

Programming Interface

Networks with Remote Direct Memory Access (RDMA) support are becoming increasingly common. RDMA, however, offers a limited programming interface to remote memory that consists of read, write and atomic operations. With RDMA alone, completing the most basic operations on remote data structures often requires multiple round-trips over the network. Data-intensive systems strongly desire higher-level communication abstractions that support more complex interaction patterns. A natural candidate to consider is MPI, the de facto standard for developing high-performance applications in the HPC community. This paper critically evaluates the communication primitives of MPI and shows that using MPI in the context of a data processing system comes with its own set of insurmountable challenges. Based on this analysis, we propose a new communication abstraction named RDMO, or Remote Direct Memory Operation, that dispatches a short sequence of reads, writes and atomic operations to remote memory and executes them in a single round-trip.
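
The round-trip argument can be illustrated with a toy model. This sketch does not reproduce the paper's RDMO interface: the `RemoteMemory` class, its `execute_sequence` method, and the append example are hypothetical, and using Python callables for the dispatched sequence is a simplification of whatever operation set a real system would allow.

```python
class RemoteMemory:
    """Toy stand-in for a remote node's memory that counts round trips."""

    def __init__(self, size):
        self.cells = [0] * size
        self.round_trips = 0

    # Plain one-sided primitives: each call costs one network round trip.
    def write(self, addr, value):
        self.round_trips += 1
        self.cells[addr] = value

    def fetch_add(self, addr, delta):
        self.round_trips += 1
        old = self.cells[addr]
        self.cells[addr] = old + delta
        return old

    # Hypothetical RDMO-style call: the whole sequence is shipped once and
    # executed on the remote side, each step seeing the previous result.
    def execute_sequence(self, ops):
        self.round_trips += 1
        result = None
        for op in ops:
            result = op(self.cells, result)
        return result


def append_rdma(mem, value):
    """Append to a remote array: cell 0 is the length, data starts at cell 1."""
    slot = mem.fetch_add(0, 1)   # round trip 1: claim a slot
    mem.write(1 + slot, value)   # round trip 2: fill the claimed slot
    return slot


def append_rdmo(mem, value):
    """Same append, dispatched as one two-step sequence."""
    def claim(cells, _previous):
        old = cells[0]
        cells[0] = old + 1
        return old

    def store(cells, slot):
        cells[1 + slot] = value
        return slot

    return mem.execute_sequence([claim, store])


a = RemoteMemory(8)
b = RemoteMemory(8)
append_rdma(a, 42)
append_rdmo(b, 42)
print(a.round_trips, b.round_trips)  # 2 1
```

Both variants leave remote memory in the same state. The difference is that the second step depends on the result of the first, which forces an extra round trip when only the basic primitives are available.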

sAXI: A High-Efficient Hardware Inter-Node Link in ARM Server for Remote Memory Access

2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid) ◽

10.1109/ccgrid.2016.66 ◽

2016 ◽

Cited By ~ 2

Author(s):

Ke Zhang ◽

Yisong Chang ◽

Lixin Zhang ◽

Mingyu Chen ◽

Lei Yu ◽

...

Keyword(s):

Memory Access ◽

Remote Memory ◽

High Efficient ◽

Remote Memory Access

Quantifying and resolving remote memory access contention on hardware DSM multiprocessors

Proceedings 16th International Parallel and Distributed Processing Symposium ◽

10.1109/ipdps.2002.1015503 ◽

2002 ◽

Cited By ~ 1

Author(s):

D.S. Nikolopoulos

Keyword(s):

Memory Access ◽

Remote Memory ◽

Remote Memory Access

Toward More Scalable Off-Line Simulations of MPI Applications

Parallel Processing Letters ◽

10.1142/s0129626415410029 ◽

2015 ◽

Vol 25 (03) ◽

pp. 1541002 ◽

Cited By ~ 2

Author(s):

Henri Casanova ◽

Anshul Gupta ◽

Frédéric Suter

Keyword(s):

Message Passing ◽

Post Mortem ◽

Application Performance ◽

Popular Approach ◽

MPI Applications ◽

Line Analysis ◽

Application Execution

The off-line (or post-mortem) analysis of execution event traces is a popular approach to understand the performance of HPC applications that use the message passing paradigm. Combining this analysis with simulation makes it possible to “replay” the application execution to explore “what if?” scenarios, e.g., assessing application performance in a range of (hypothetical) execution environments. However, such off-line analysis faces scalability issues for acquiring, storing, or replaying large event traces. We first present two previously proposed and complementary frameworks for off-line replaying of MPI application event traces, each with its own objectives and limitations. We then describe how these frameworks can be combined so as to capitalize on their respective strengths while alleviating several of their limitations. We claim that the combined framework affords levels of scalability that are beyond that achievable by either one of the two individual frameworks. We evaluate this framework to illustrate the benefits of the proposed combination for a more scalable off-line analysis of MPI applications.
