The Case for Message Passing on Many-Core Chips

2010, pp. 115-123
Author(s): Rakesh Kumar, Timothy G. Mattson, Gilles Pokam, Rob Van Der Wijngaart
2017, Vol 77, pp. 72-82
Author(s): Aurang Zaib, Thomas Wild, Andreas Herkersdorf, Jan Heisswolf, Jürgen Becker, ...

2014, Vol 4 (2), pp. 307-320
Author(s): Sumeet S. Kumar, Mitzi Tjin-A-Djie, Rene van Leuken

Author(s): Carsten Clauss, Simon Pickartz, Stefan Lankes, Thomas Bemmerl

Author(s): Jörg Mische, Martin Frieb, Alexander Stegmeier, Theo Ungerer

Abstract. To improve scalability, several many-core architectures use message passing instead of shared-memory accesses for communication. Unfortunately, message passing is usually emulated with Direct Memory Access (DMA) transfers in a shared address space, which adds considerable overhead and negates its advantages. Recently proposed register-level message passing alternatives use special instructions to send the contents of a single register to another core. The reduced communication overhead and architectural simplicity lead to good many-core scalability. After investigating several other approaches in terms of hardware complexity and throughput overhead, we recommend a small instruction set extension that enables register-level message passing at minimal hardware cost and describe its integration into a classical five-stage RISC-V pipeline.


2019, Vol 12 (4), pp. 1423-1441
Author(s): Luca Bertagna, Michael Deakin, Oksana Guba, Daniel Sunderland, Andrew M. Bradley, ...

Abstract. We present an architecture-portable and performant implementation of the atmospheric dynamical core (High-Order Methods Modeling Environment, HOMME) of the Energy Exascale Earth System Model (E3SM). The original Fortran implementation is highly performant and scalable on conventional architectures using the Message Passing Interface (MPI) and Open Multi-Processing (OpenMP) programming models. We rewrite the model in C++ and use the Kokkos library to express on-node parallelism in a largely architecture-independent implementation. Kokkos provides an abstraction of a compute node or device, layout-polymorphic multidimensional arrays, and parallel execution constructs. The new implementation achieves the same or better performance on conventional multicore computers and is portable to GPUs. We present performance data for the original and new implementations on multiple platforms, on up to 5400 compute nodes, and study several aspects of the single- and multi-node performance characteristics of the new implementation on conventional CPUs (e.g., Intel Xeon), many-core CPUs (e.g., Intel Xeon Phi Knights Landing), and the Nvidia V100 GPU.


2014, Vol 10 (4), pp. 531-549
Author(s): Andrea Bartolini, Can Hankendi, Ayse Kivilcim Coskun, Luca Benini
