collective communications Latest Research Papers

One of the main bottlenecks for NEMO scalability is the time spent performing communications. Two complementary strategies are here proposed to reduce the communication frequency and the communication time: the MPI3 neighbourhood collective communications instead of multiple point to point exchanges and the increasing of the halo region size.NEMO performs Lateral Boundaries Conditions update by using four point to point MPI communications at north, south, east and west for each MPI domain. The model completes east-west exchange before performing north-south communications. The order of the exchanges allows us to preserve both 5-point and 9-point stencils. MPI3 neighbourhood collectives provide a way to have sub-communicators used to perform collective communications. Two different sub-communicators can be defined in order to support the two different stencils. A single MPI message is needed to be built for all neighbours instead of 4 different messages before calling the collective communication, while the received message is used to update the halo region, following the order of the neighbours in the sub-communicator.The new communication strategy has been tested on two computational kernels (i.e. one for 5-point stencil and one for 9-point stencil), selected among the main relevant routines from the computational point of view. Preliminary tests, performed on a domain size of 3000x2000x31 grid points on the Zeus Intel Xeon Gold 6154 machine, available at CMCC, show a gain in communication time for the 5-point stencil use case up to 31% on 2016 cores. The improvement is reduced when communications with processes on the diagonal are activated. However, a modest gain is still achieved, depending on the number of cores.On the other side, the analysis of some NEMO routines shows how the exchange of more than one row/column of halo would allow to move communications outside the routine, preserving data dependencies. A wider halo size reduces the frequency of message exchanges whilst increases the message size at each exchange. It allows us to adopt some optimisation strategies (i.e. loop fusion, tiling, etc.) to improve the data locality. Nevertheless, the use of a wider halo introduces itself some improvements for some kernels like for the MUSCL advection scheme which shows a gain of ~23% in the execution time comparing the original version and the new one with halo extended to 2 lines and the communication moved outside the computing region.The current work has been performed according to the NEMO development strategy plan, defined by the NEMO Consortium, which establish the priorities of the design strategies to reduce the bottlenecks to the scalability and the time to solution.&#160;AcknowledgmentsThis work is co-funded by the EU H2020 IS-ENES project Phase 3 (ISENES3) under Grant Agreement number 824084.

Download Full-text

A new distributed algorithm for routing network generation in model coupling and its evaluation based on C-Coupler2

10.5194/gmd-2020-91 ◽

2020 ◽

Author(s):

Hao Yu ◽

Li Liu ◽

Chao Sun ◽

Ruizhe Li ◽

Xinzhu Yu ◽

...

Keyword(s):

Distributed Algorithm ◽

Data Transfer ◽

System Modeling ◽

Earth System ◽

Major Step ◽

Earth System Modeling ◽

Network Generation ◽

Collective Communications ◽

Component Models ◽

Processor Cores

Abstract. It is a fundamental functionality of a coupler for Earth system modeling to efficiently handle data transfer between component models. Routing network generation is a major step for initializing the data transfer functionality. Most existing couplers employ an inefficient and unscalable global implementation for routing network generation that relies on collective communications. That’s a main reason why the initialization cost of a coupler increases rapidly when using more processor cores. In this paper, we propose a new Distributed algorithm for Routing network generation (DaRong), which does not introduce any collective communication and achieves much lower complexities than the global implementation. DaRong is of course much more efficient and scalable than the global implementation, which has been further demonstrated via empirical evaluations. DaRong has already been implemented in C-Coupler2. We believe that existing and future couplers can also benefit from it.

Download Full-text

Decoding collective communications using information theory tools

Journal of The Royal Society Interface ◽

10.1098/rsif.2019.0563 ◽

2020 ◽

Vol 17 (164) ◽

pp. 20190563 ◽

Cited By ~ 3

Author(s):

K. R. Pilkiewicz ◽

B. H. Lemasson ◽

M. A. Rowland ◽

A. Hein ◽

J. Sun ◽

...

Keyword(s):

Situational Awareness ◽

Reaction Times ◽

Interaction Patterns ◽

Ecological Constraints ◽

Information Theoretic ◽

Movement Data ◽

Social Coordination ◽

Collective Communications ◽

Pertinent Information ◽

Benefits And Costs

Organisms have evolved sensory mechanisms to extract pertinent information from their environment, enabling them to assess their situation and act accordingly. For social organisms travelling in groups, like the fish in a school or the birds in a flock, sharing information can further improve their situational awareness and reaction times. Data on the benefits and costs of social coordination, however, have largely allowed our understanding of why collective behaviours have evolved to outpace our mechanistic knowledge of how they arise. Recent studies have begun to correct this imbalance through fine-scale analyses of group movement data. One approach that has received renewed attention is the use of information theoretic (IT) tools like mutual information , transfer entropy and causation entropy , which can help identify causal interactions in the type of complex, dynamical patterns often on display when organisms act collectively. Yet, there is a communications gap between studies focused on the ecological constraints and solutions of collective action with those demonstrating the promise of IT tools in this arena. We attempt to bridge this divide through a series of ecologically motivated examples designed to illustrate the benefits and challenges of using IT tools to extract deeper insights into the interaction patterns governing group-level dynamics. We summarize some of the approaches taken thus far to circumvent existing challenges in this area and we conclude with an optimistic, yet cautionary perspective.

Download Full-text

Genetic Node-Mapping Methods for Rapid Collective Communications

IEICE Transactions on Information and Systems ◽

10.1587/transinf.2018edp7386 ◽

2020 ◽

Vol E103.D (1) ◽

pp. 111-129

Author(s):

Takashi YOKOTA ◽

Kanemitsu OOTSU ◽

Takeshi OHKAWA

Keyword(s):

Collective Communications

Download Full-text

Impacts of Multi-GPU MPI Collective Communications on Large FFT Computation

2019 IEEE/ACM Workshop on Exascale MPI (ExaMPI) ◽

10.1109/exampi49596.2019.00007 ◽

2019 ◽

Cited By ~ 1

Author(s):

Alan Ayala ◽

Stanimire Tomov ◽

Xi Luo ◽

Hejer Shaeik ◽

Azzam Haidar ◽

...

Keyword(s):

Collective Communications

Download Full-text

Study on progress threads placement and dedicated cores for overlapping MPI nonblocking collectives on manycore processor

The International Journal of High Performance Computing Applications ◽

10.1177/1094342019860184 ◽

2019 ◽

Vol 33 (6) ◽

pp. 1240-1254 ◽

Cited By ~ 1

Author(s):

Alexandre Denis ◽

Julien Jaeger ◽

Emmanuel Jeannot ◽

Marc Pérache ◽

Hugo Taboada

Keyword(s):

The Other ◽

Trade Off ◽

Manycore Processors ◽

Narrow Part ◽

Collective Communications ◽

Computing Framework ◽

A Chain ◽

Point To Point ◽

Mpi Implementation ◽

The Cost

To amortize the cost of MPI collective operations, nonblocking collectives have been proposed so as to allow communications to be overlapped with computation. Unfortunately, collective communications are more CPU-hungry than point-to-point communications and running them in a communication thread on a dedicated CPU core makes them slow. On the other hand, running collective communications on the application cores leads to no overlap. In this article, we propose placement algorithms for progress threads that do not degrade performance when running on cores dedicated to communications to get communication/computation overlap. We first show that even simple collective operations, such as those based on a chain topology, are not straightforward to make progress in background on a dedicated core. Then, we propose an algorithm for tree-based collective operations that splits the tree between communication cores and application cores. To get the best of both worlds, the algorithm runs the short but heavy part of the tree on application cores, and the long but narrow part of the tree on one or several communication cores, so as to get a trade-off between overlap and absolute performance. We provide a model to study and predict its behavior and to tune its parameters. We implemented both algorithms in the multiprocessor computing framework, which is a thread-based MPI implementation. We have run benchmarks on manycore processors such as the KNL and Skylake and get good results for both performance and overlap.

Download Full-text