New communication strategies in NEMO

Author(s):  
Italo Epicoco ◽  
Silvia Mocavero ◽  
Francesca Mele ◽  
Alessandro D'Anca ◽  
Giovanni Aloisio

One of the main bottlenecks for NEMO scalability is the time spent performing communications. Two complementary strategies are proposed here to reduce both the communication frequency and the communication time: using MPI-3 neighbourhood collective communications instead of multiple point-to-point exchanges, and increasing the size of the halo region.

NEMO updates the lateral boundary conditions using four point-to-point MPI communications (north, south, east and west) for each MPI subdomain. The model completes the east-west exchange before performing the north-south communications. This ordering preserves both the 5-point and the 9-point stencils. MPI-3 neighbourhood collectives provide a way to perform collective communications over sub-communicators. Two different sub-communicators can be defined to support the two stencils. A single MPI message has to be built for all neighbours, instead of four separate messages, before calling the collective communication; the received message is then used to update the halo region, following the order of the neighbours in the sub-communicator.

The new communication strategy has been tested on two computational kernels (one for the 5-point stencil and one for the 9-point stencil), selected from among the most computationally relevant routines. Preliminary tests, performed on a domain of 3000x2000x31 grid points on the Zeus Intel Xeon Gold 6154 machine available at CMCC, show a gain in communication time of up to 31% on 2016 cores for the 5-point stencil use case. The improvement is smaller when communications with processes on the diagonal are activated. However, a modest gain is still achieved, depending on the number of cores.

On the other hand, the analysis of some NEMO routines shows that exchanging more than one row/column of halo would allow communications to be moved outside the routine while preserving data dependencies. A wider halo reduces the frequency of message exchanges while increasing the message size at each exchange. It also makes it possible to adopt optimisation strategies (e.g. loop fusion, tiling) that improve data locality. Moreover, the wider halo by itself improves some kernels, such as the MUSCL advection scheme, which shows a gain of ~23% in execution time when the original version is compared with the new one, in which the halo is extended to 2 lines and the communication is moved outside the computing region.

This work has been carried out in accordance with the NEMO development strategy plan, defined by the NEMO Consortium, which establishes the priorities of the design strategies for reducing the bottlenecks to scalability and time to solution.

Acknowledgments

This work is co-funded by the EU H2020 IS-ENES project Phase 3 (ISENES3) under Grant Agreement number 824084.
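
The exchange ordering described in the NEMO abstract above (east-west first, then north-south) can be illustrated with a small serial simulation. This is only a sketch under stated assumptions, not NEMO code and not MPI: the 3x3 decomposition, the sizes `P` and `N`, and all function names are hypothetical. It shows why four exchanges are enough for a 9-point stencil: when the north-south rows include the east-west halo columns that were just filled, corner values reach diagonal neighbours in two hops.

```python
P, N = 3, 2  # hypothetical sizes: 3x3 subdomains, each with N x N interior points


def global_value(i, j):
    """Unique value for global grid point (i, j), so halo contents can be checked."""
    return 100 * j + i


def build_domains():
    """Create one (N+2) x (N+2) local array per subdomain; halo cells start as None."""
    doms = {}
    for pj in range(P):
        for pi in range(P):
            loc = [[None] * (N + 2) for _ in range(N + 2)]
            for j in range(N):
                for i in range(N):
                    loc[j + 1][i + 1] = global_value(pi * N + i, pj * N + j)
            doms[(pi, pj)] = loc
    return doms


def exchange_east_west(doms):
    """Swap interior boundary columns between east-west neighbours."""
    for (pi, pj), loc in doms.items():
        if pi + 1 < P:
            east = doms[(pi + 1, pj)]
            for j in range(1, N + 1):
                east[j][0] = loc[j][N]      # my last interior column -> their west halo
                loc[j][N + 1] = east[j][1]  # their first interior column -> my east halo


def exchange_north_south(doms, full_rows=True):
    """Swap boundary rows between north-south neighbours.

    With full_rows=True the rows include the east-west halo columns filled by
    the previous step, so corner values travel in two hops.
    """
    cols = range(N + 2) if full_rows else range(1, N + 1)
    for (pi, pj), loc in doms.items():
        if pj + 1 < P:
            north = doms[(pi, pj + 1)]
            for i in cols:
                north[0][i] = loc[N][i]      # my last interior row -> their south halo
                loc[N + 1][i] = north[1][i]  # their first interior row -> my north halo


def halo_matches_global(doms, pi, pj):
    """True if every cell (interior, edges, corners) of a subdomain is correct."""
    loc = doms[(pi, pj)]
    return all(
        loc[j][i] == global_value(pi * N + i - 1, pj * N + j - 1)
        for j in range(N + 2)
        for i in range(N + 2)
    )
```

Calling `exchange_east_west` and then `exchange_north_south` on the result of `build_domains()` fills the whole halo of the central subdomain `(1, 1)`, corners included; with `full_rows=False` the corners stay empty. In an MPI-3 setting, a neighbourhood collective such as `MPI_Neighbor_alltoallv` on a graph topology would replace the individual swaps, and a 9-point stencil would then need the diagonal neighbours listed explicitly in the topology.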

Sensors ◽  
2021 ◽  
Vol 21 (4) ◽  
pp. 1477
Author(s):  
João Ramos ◽  
Roberto Ribeiro ◽  
David Safadinho ◽  
João Barroso ◽  
Carlos Rabadão ◽  
...  

The demand for online services is increasing. Services that would require a long time to understand, use and master are becoming as transparent as possible to users, who tend to focus only on their final goals. Combining this trend with the advantages of unmanned vehicles (UVs), from the unmanned factor to their reduced size and cost, we saw an opportunity to offer users a wide variety of UV-supported services through the Internet of Unmanned Vehicles (IoUV). We analyzed current solutions and identified scalability and genericity as the principal concerns. We then proposed a solution that combines several services and UVs, available from anywhere at any time, through a cloud platform. The solution uses a distributed cloud architecture composed of users, services, vehicles and a platform, interconnected through the Internet. Each vehicle provides the platform with an abstract and generic interface for the essential commands. This modular design makes it easier to create new services and to reuse the different vehicles. To confirm the feasibility of the solution, we implemented a prototype comprising a cloud-hosted platform and the integration of custom-built small-sized cars, a custom-built quadcopter, and a commercial Vertical Take-Off and Landing (VTOL) aircraft. To validate the prototype and the remote control of the vehicles, we created several services accessible via a web browser and controlled through a computer keyboard. We tested the solution on a local network, remote networks and mobile networks (i.e., 3G and Long-Term Evolution (LTE)) and demonstrated the benefits of decentralizing the communications into multiple point-to-point links for remote control. Consequently, the solution can provide scalable UV-based services, with low technical effort, for anyone, at any time and anywhere.
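
The idea of an abstract, generic vehicle interface driven from a keyboard can be sketched as follows. This is a minimal illustration, not the authors' platform: the class names, the command set (`move`, `stop`), the key bindings and the returned strings are all hypothetical.

```python
from abc import ABC, abstractmethod


class Vehicle(ABC):
    """Hypothetical generic interface; the command set is an assumption."""

    def __init__(self, vehicle_id):
        self.vehicle_id = vehicle_id

    @abstractmethod
    def move(self, direction):
        """Start moving in the given direction."""

    @abstractmethod
    def stop(self):
        """Bring the vehicle to a safe idle state."""


class Car(Vehicle):
    def move(self, direction):
        return f"{self.vehicle_id}: move {direction}"

    def stop(self):
        return f"{self.vehicle_id}: stop"


class Quadcopter(Vehicle):
    def move(self, direction):
        return f"{self.vehicle_id}: move {direction}"

    def stop(self):
        # An aircraft cannot simply halt, so "stop" maps to hovering here.
        return f"{self.vehicle_id}: hover"


# Hypothetical keyboard layout shared by every service.
KEY_BINDINGS = {"w": "forward", "s": "backward", "a": "left", "d": "right"}


def handle_key(vehicle, key):
    """Translate a key press into a generic command; unknown keys return None."""
    if key == "x":
        return vehicle.stop()
    direction = KEY_BINDINGS.get(key)
    return vehicle.move(direction) if direction else None
```

Because `handle_key` only depends on the `Vehicle` interface, the same service code can drive a `Car` or a `Quadcopter` without changes, which is the kind of reuse the modular design aims for.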


Author(s):  
Alexandre Denis ◽  
Julien Jaeger ◽  
Emmanuel Jeannot ◽  
Marc Pérache ◽  
Hugo Taboada

To amortize the cost of MPI collective operations, nonblocking collectives have been proposed so that communications can be overlapped with computation. Unfortunately, collective communications are more CPU-hungry than point-to-point communications, and running them in a communication thread on a dedicated CPU core makes them slow. On the other hand, running collective communications on the application cores leads to no overlap. In this article, we propose placement algorithms for progress threads that do not degrade performance when running on cores dedicated to communications, in order to obtain communication/computation overlap. We first show that even simple collective operations, such as those based on a chain topology, are not straightforward to progress in the background on a dedicated core. Then, we propose an algorithm for tree-based collective operations that splits the tree between communication cores and application cores. To get the best of both worlds, the algorithm runs the short but heavy part of the tree on application cores, and the long but narrow part of the tree on one or several communication cores, so as to reach a trade-off between overlap and absolute performance. We provide a model to study and predict its behavior and to tune its parameters. We implemented both algorithms in the multiprocessor computing framework, which is a thread-based MPI implementation. We ran benchmarks on manycore processors such as the KNL and Skylake and obtained good results for both performance and overlap.


2020 ◽  
Vol 6 (30) ◽  
pp. eaaw9975 ◽  
Author(s):  
Gerard A. Vliegenthart ◽  
Arvind Ravichandran ◽  
Marisol Ripoll ◽  
Thorsten Auth ◽  
Gerhard Gompper

Motor proteins drive persistent motion and self-organization of cytoskeletal filaments. However, state-of-the-art microscopy techniques and continuum modeling approaches focus on large length and time scales. Here, we perform component-based computer simulations of polar filaments and molecular motors linking microscopic interactions and activity to self-organization and dynamics from the filament level up to the mesoscopic domain level. Dynamic filament cross-linking and sliding and excluded-volume interactions promote formation of bundles at small densities and of active polar nematics at high densities. A buckling-type instability sets the size of polar domains and the density of topological defects. We predict a universal scaling of the active diffusion coefficient and the domain size with activity, and its dependence on parameters like motor concentration and filament persistence length. Our results provide a microscopic understanding of cytoplasmic streaming in cells and help to develop design strategies for novel engineered active materials.


2019 ◽  
Author(s):  
Dimitri Marques Abramov

Abstract

Background: Methods for p-value correction are criticized for either increasing Type II error or improperly reducing Type I error. This problem is worse when dealing with hundreds or thousands of paired comparisons between waves or images that are performed point-to-point. This text considers patterns in probability vectors resulting from multiple point-to-point comparisons between two ERP waves (mass univariate analysis) as a means to correct p-values. These patterns (probability waves) mirror ERP waveshapes and might be indicators of consistency in statistical differences.

New method: To compute and analyze these patterns, we convolved the decimal logarithm of the probability vector (p’) with a Gaussian vector whose size is compatible with the ERP periods observed. To verify the consistency of this method, we also calculated mean amplitudes of late ERPs from the Pz (P300 wave) and O1 electrodes in two samples, of typical and ADHD subjects, respectively.

Results: The present method reduces the range of p’-values that did not show covariance with their neighbors (that is, values that are likely random differences, type I errors), while preserving the amplitude of probability waves, in accordance with the difference between the respective mean amplitudes.

Comparison with existing methods: The positive-FDR resulted in a different profile of corrected p-values, which is not consistent with the expected results or with the differences between the mean amplitudes of the analyzed ERPs.

Conclusion: The present new method seems to be biologically and statistically more suitable for correcting p-values in mass univariate analysis of ERP waves.
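
The smoothing step described in the abstract above (convolving the decimal logarithm of the p-value vector with a Gaussian vector) can be sketched in plain Python. This is a minimal illustration under assumptions: the kernel size and `sigma`, the renormalisation at the edges, and the function names are hypothetical, and the abstract does not say how smoothed values are mapped back to corrected p-values, so the sketch stops at the smoothed log values.

```python
import math


def gaussian_kernel(size, sigma):
    """Normalised Gaussian weights; size is assumed to be odd."""
    half = size // 2
    weights = [math.exp(-0.5 * ((k - half) / sigma) ** 2) for k in range(size)]
    total = sum(weights)
    return [w / total for w in weights]


def smooth_log_p(p_values, size=5, sigma=1.0):
    """Convolve log10(p) with a Gaussian kernel, point by point.

    Near the ends of the vector the kernel is truncated and its remaining
    weights are renormalised (an assumption, not taken from the abstract).
    """
    log_p = [math.log10(p) for p in p_values]
    kernel = gaussian_kernel(size, sigma)
    half = size // 2
    smoothed = []
    for t in range(len(log_p)):
        acc = 0.0
        norm = 0.0
        for k, w in enumerate(kernel):
            idx = t + k - half
            if 0 <= idx < len(log_p):
                acc += w * log_p[idx]
                norm += w
        smoothed.append(acc / norm)
    return smoothed
```

With these settings, an isolated small p-value surrounded by large ones is pulled towards its neighbours, while a sustained run of small p-values keeps its magnitude, which matches the behaviour the abstract reports for values that do or do not covary with their neighbours.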

