PIMP My Many-Core: Pipeline-Integrated Message Passing

Author(s):  
Jörg Mische ◽  
Martin Frieb ◽  
Alexander Stegmeier ◽  
Theo Ungerer

Abstract To improve scalability, several many-core architectures use message passing instead of shared-memory accesses for communication. Unfortunately, message passing is usually emulated with Direct Memory Access (DMA) transfers in a shared address space, which entails considerable overhead and thwarts the advantages of message passing. Recently proposed register-level message passing alternatives use special instructions to send the contents of a single register to another core. The reduced communication overhead and architectural simplicity lead to good many-core scalability. After investigating several other approaches in terms of hardware complexity and throughput overhead, we recommend a small instruction set extension that enables register-level message passing at minimal hardware cost, and we describe its integration into a classical five-stage RISC-V pipeline.
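The idea of register-level message passing can be illustrated with a toy software model. The sketch below is not the instruction set extension proposed in the paper: the names `send` and `recv`, the per-core receive FIFO, and its depth are all assumptions chosen for illustration. It only shows the basic contract, in which one instruction moves a single register value to another core and a full or empty FIFO would stall the pipeline.

```python
from collections import deque


class Core:
    """One core with a register file and a receive FIFO (hypothetical model)."""

    def __init__(self, core_id, fifo_depth=4):
        self.core_id = core_id
        self.regs = [0] * 32          # 32 general-purpose registers
        self.rx_fifo = deque()        # messages waiting to be received
        self.fifo_depth = fifo_depth  # assumed capacity, not from the paper


class ManyCore:
    """A set of cores that exchange single register values."""

    def __init__(self, n_cores, fifo_depth=4):
        self.cores = [Core(i, fifo_depth) for i in range(n_cores)]

    def send(self, src, dest, reg):
        """Hypothetical 'send': copy register `reg` of core `src` to core `dest`.

        Returns False when the destination FIFO is full, which stands in
        for a pipeline stall in real hardware.
        """
        target = self.cores[dest]
        if len(target.rx_fifo) >= target.fifo_depth:
            return False
        target.rx_fifo.append((src, self.cores[src].regs[reg]))
        return True

    def recv(self, core, reg):
        """Hypothetical 'recv': pop the oldest message into register `reg`.

        Returns the sender id, or None when no message is available
        (again standing in for a stall).
        """
        c = self.cores[core]
        if not c.rx_fifo:
            return None
        sender, value = c.rx_fifo.popleft()
        c.regs[reg] = value
        return sender
```

In this model a message never touches shared memory: the payload goes from a source register to a destination register through a small buffer, which is the property the abstract credits for the low communication overhead.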

1995 ◽  
Vol 05 (04) ◽  
pp. 575-586
Author(s):  
BEN LEE ◽  
ALI R. HURSON

The issue of scalability is key to the success of massively parallel processing. Due to their distributed nature, message-passing multicomputers are appropriate for achieving scalar performance. However, the message-passing model lacks programmability because programmers find it difficult to partition and schedule the computation over the processors and to establish efficient inter-processor communication in the user code. Therefore, this paper presents a compile-time scheduling heuristic, called BLS, that maps programs onto the processors of a message-passing multicomputer. In contrast to other proposed methods, BLS takes a more global approach in an attempt to balance the tradeoff between exploiting parallelism and reducing communication overhead. To evaluate the effectiveness of BLS, simulation studies of scheduling SISAL programs are presented.


Author(s):  
Irfan Uddin

The microthreaded many-core architecture comprises multiple clusters of fine-grained multi-threaded cores. Concurrency management is supported in the instruction set architecture of the cores, and the computational work of an application is asynchronously delegated to different clusters of cores, where each cluster is allocated dynamically. Computer architects are always interested in analyzing the complex interactions among the dynamically allocated resources. Generally, a detailed cycle-accurate simulation of the execution time is used. However, the cycle-accurate simulator for the microthreaded architecture executes at a rate of 100,000 instructions per second, divided over the number of simulated cores. This means that evaluating a complex application on a contemporary multi-core machine can be very slow. To perform efficient design space exploration, we present a co-simulation environment in which the detailed execution of instructions in the pipeline of the microthreaded cores and the interactions among the hardware components are abstracted. We evaluate the high-level simulation framework against the cycle-accurate simulation framework. The results show that the high-level simulator is faster and less complicated than the cycle-accurate simulator, but at the cost of reduced accuracy.


2019 ◽  
Vol 6 (1) ◽  
pp. 1697555
Author(s):  
S. Tynymbayev ◽  
R. Berdibayev ◽  
T. Omar ◽  
Y. Aitkhozhayeva ◽  
A. Shaikulova ◽  
...  

2017 ◽  
Vol 59 (5) ◽  
Author(s):  
Robert Höttger ◽  
Harald Mackamul ◽  
Andreas Sailer ◽  
Jan-Philipp Steghöfer ◽  
Jörg Tessmer

Abstract Since the automotive domain in particular increasingly uses multi- and many-core systems, appropriate models, analyses, and tooling are required to address challenges that were almost nonexistent until now. APP4MC is an open-source Eclipse platform that provides an AUTOSAR-compliant common data model, namely AMALTHEA, basic parallelization features, visualizations, and the ability to add any existing tooling. For example, Eclipse Capra can be added to provide comprehensive traceability throughout the development process, but any proprietary, commercial, open-source, or prototypical implementation can be integrated. The platform enables the creation and management of complex tool chains, including performance simulation and validation. The entire community benefits from reduced hardware costs, faster time to market, higher-quality systems, and rapid adoption. APP4MC is not restricted to the automotive domain and can also be used in robotics or generic real-time systems.


2017 ◽  
Vol 77 ◽  
pp. 72-82 ◽  
Author(s):  
Aurang Zaib ◽  
Thomas Wild ◽  
Andreas Herkersdorf ◽  
Jan Heisswolf ◽  
Jürgen Becker ◽  
...  

Water ◽  
2018 ◽  
Vol 10 (10) ◽  
pp. 1461 ◽  
Author(s):  
Zhipeng Lin ◽  
Wenjing Yang ◽  
Houcun Zhou ◽  
Xinhai Xu ◽  
Liaoyuan Sun ◽  
...  

Multiphase flow solvers are widely used applications in OpenFOAM, but their scalability suffers from costly communication overhead. We therefore establish communication-optimized multiphase flow solvers in OpenFOAM. In this paper, we first run a scalability bottleneck test on the typical multiphase flow case damBreak and reveal that the Message Passing Interface (MPI) communication in the Multidimensional Universal Limiter for Explicit Solution (MULES) and the Preconditioned Conjugate Gradient (PCG) algorithm is the bottleneck of multiphase flow solvers. We then analyze the communication behavior and find that the redundant communication in MULES and the global synchronization in PCG are the performance-limiting factors. Based on this analysis, we propose our communication optimization algorithm. For MULES, we remove the redundant communication and obtain optMULES. For PCG, we introduce several intermediate variables and rearrange PCG to reduce the global communication. We also overlap the computation of the matrix-vector multiply and the vector update with non-blocking communication. The resulting algorithms are referred to as OFPiPePCG and OFRePiPePCG, respectively. Extensive experiments show that the proposed method can dramatically increase the parallel scalability and solving speed of multiphase flow solvers in OpenFOAM with almost no loss of accuracy.
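To see where the global synchronization in PCG comes from, it helps to look at the textbook algorithm. The sketch below is a plain, single-process preconditioned conjugate gradient with a Jacobi (diagonal) preconditioner, written for illustration only; it is not the OpenFOAM implementation and not the OFPiPePCG or OFRePiPePCG variants described above. The helper names `global_dot` and `local_matvec` and the `reductions` counter are assumptions added to mark which steps would require an all-to-all reduction in a distributed run.

```python
def global_dot(u, v):
    # In a distributed solver each inner product needs a global reduction
    # (for example an MPI all-reduce), so every call here marks a point
    # where all processes would have to synchronize.
    return sum(a * b for a, b in zip(u, v))


def local_matvec(A, x):
    # The matrix-vector product only needs neighbor data in a distributed
    # run, so it is not a global synchronization point.
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def pcg(A, b, tol=1e-10, max_iter=100):
    """Textbook PCG for a symmetric positive definite matrix A (list of rows).

    Returns the solution and the number of inner products (reductions) used.
    """
    n = len(b)
    x = [0.0] * n
    r = list(b)                                   # residual for x = 0
    inv_diag = [1.0 / A[i][i] for i in range(n)]  # Jacobi preconditioner
    z = [d * ri for d, ri in zip(inv_diag, r)]
    p = list(z)
    rz = global_dot(r, z)
    reductions = 1
    for _ in range(max_iter):
        Ap = local_matvec(A, p)
        alpha = rz / global_dot(p, Ap)            # reduction: step length
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        residual_norm = global_dot(r, r) ** 0.5   # reduction: convergence test
        reductions += 2
        if residual_norm < tol:
            break
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = global_dot(r, z)                 # reduction: new direction
        reductions += 1
        p = [zi + (rz_new / rz) * pi for zi, pi in zip(z, p)]
        rz = rz_new
    return x, reductions
```

Each iteration of this baseline issues several separate inner products, and each one depends on the result of the previous step. Rearranged and pipelined variants aim to merge these reductions or hide them behind the matrix-vector product, which is the kind of restructuring the abstract refers to.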


2014 ◽  
Vol 4 (2) ◽  
pp. 307-320
Author(s):  
Sumeet S. Kumar ◽  
Mitzi Tjin-A-Djie ◽  
Rene van Leuken
