PIMP My Many-Core: Pipeline-Integrated Message Passing

Abstract: To improve scalability, several many-core architectures use message passing instead of shared-memory accesses for communication. Unfortunately, message passing is usually emulated with Direct Memory Access (DMA) transfers in a shared address space, which entails considerable overhead and thwarts the advantages of message passing. Recently proposed register-level message passing alternatives use special instructions to send the contents of a single register to another core. The reduced communication overhead and architectural simplicity lead to good many-core scalability. After investigating several other approaches in terms of hardware complexity and throughput overhead, we recommend a small instruction set extension that enables register-level message passing at minimal hardware cost and describe its integration into a classical five-stage RISC-V pipeline.
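To make the idea of register-level message passing concrete, the following is a minimal software model, not the instruction set extension proposed in the paper. The class names, the `send`/`recv` operations, the 32-entry register file, and the receive FIFO depth are all assumptions chosen for illustration.

```python
from collections import deque


class Core:
    """Toy core: a register file plus a small receive FIFO (hypothetical)."""

    def __init__(self, core_id, fifo_depth=4):
        self.core_id = core_id
        self.regs = [0] * 32          # assumed 32 general-purpose registers
        self.rx = deque()             # incoming (source, value) pairs
        self.fifo_depth = fifo_depth  # assumed buffer size


class ManyCore:
    """Toy many-core in which messages are single register values."""

    def __init__(self, num_cores):
        self.cores = [Core(i) for i in range(num_cores)]

    def send(self, src, dst, reg):
        """Model of a hypothetical send: copy one register to core `dst`.

        Returns False when the receiver FIFO is full, which in hardware
        would correspond to stalling the sender.
        """
        target = self.cores[dst]
        if len(target.rx) >= target.fifo_depth:
            return False
        target.rx.append((src, self.cores[src].regs[reg]))
        return True

    def recv(self, dst, reg):
        """Model of a hypothetical receive: pop one value into register `reg`.

        Returns the sender id, or None when no message is waiting.
        """
        core = self.cores[dst]
        if not core.rx:
            return None
        src, value = core.rx.popleft()
        core.regs[reg] = value
        return src


chip = ManyCore(2)
chip.cores[0].regs[5] = 42
chip.send(src=0, dst=1, reg=5)   # core 0 sends the contents of register 5
sender = chip.recv(dst=1, reg=7)  # core 1 receives it into register 7
```

In this sketch no shared address space or DMA descriptor is involved: the payload is exactly one register, which is the property the abstract credits for the low communication overhead.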


A STRATEGY FOR SCHEDULING PARTIALLY ORDERED PROGRAM GRAPHS ONTO MULTICOMPUTERS

Parallel Processing Letters ◽

10.1142/s0129626495000515 ◽

1995 ◽

Vol 05 (04) ◽

pp. 575-586

Author(s):

BEN LEE ◽

ALI R. HURSON

Keyword(s):

Parallel Processing ◽

Message Passing ◽

Massively Parallel ◽

Communication Overhead ◽

Simulation Studies ◽

Global Approach ◽

Partially Ordered ◽

Massively Parallel Processing ◽

Time Scheduling ◽

Scheduling Heuristic

The issue of scalability is key to the success of massively parallel processing. Due to their distributed nature, message-passing multicomputers are appropriate for achieving scalable performance. However, the message-passing model is hard to program, because programmers must partition and schedule the computation over the processors and establish efficient inter-processor communication in the user code. Therefore, this paper presents a compile-time scheduling heuristic, called BLS, that maps programs onto the processors of a message-passing multicomputer. In contrast to other proposed methods, BLS takes a more global approach in an attempt to balance the tradeoff between exploiting parallelism and reducing communication overhead. To evaluate the effectiveness of BLS, simulation studies of scheduling SISAL programs are presented.
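The tradeoff between parallelism and communication overhead can be illustrated with a generic greedy list scheduler. This is not the BLS heuristic, whose details the abstract does not give; the function name, the task graph, and the cost values are hypothetical, and tasks are assumed to be supplied in topological order.

```python
def schedule(costs, edges, num_procs):
    """Greedy earliest-finish-time scheduling with communication costs.

    costs: dict task -> execution time, listed in topological order
    edges: dict (producer, consumer) -> communication time, paid only
           when the two tasks run on different processors
    Returns (placement, finish) dictionaries keyed by task.
    """
    preds = {task: [] for task in costs}
    for (u, v), comm in edges.items():
        preds[v].append((u, comm))

    proc_free = [0] * num_procs   # time at which each processor is idle
    placement, finish = {}, {}
    for task in costs:
        best = None
        for p in range(num_procs):
            # Data from a predecessor on another processor arrives late.
            ready = max(
                (finish[u] + (0 if placement[u] == p else comm)
                 for u, comm in preds[task]),
                default=0,
            )
            end = max(ready, proc_free[p]) + costs[task]
            if best is None or end < best[0]:
                best = (end, p)
        finish[task], placement[task] = best
        proc_free[best[1]] = best[0]
    return placement, finish


costs = {"a": 2, "b": 3, "c": 3, "d": 1}
pairs = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]

cheap = {pair: 1 for pair in pairs}    # cheap messages: spread the work
costly = {pair: 10 for pair in pairs}  # costly messages: stay on one processor

placement_cheap, finish_cheap = schedule(costs, cheap, num_procs=2)
placement_costly, finish_costly = schedule(costs, costly, num_procs=2)
```

With cheap messages the scheduler uses both processors and finishes task `d` at time 7; with costly messages it keeps every task on one processor and finishes at time 9, which is still better than paying for the transfers.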


One-IPC high-level simulation of microthreaded many-core architectures

The International Journal of High Performance Computing Applications ◽

10.1177/1094342015584495 ◽

2016 ◽

Vol 31 (2) ◽

pp. 152-162 ◽

Cited By ~ 3

Author(s):

Irfan Uddin

Keyword(s):

Design Space Exploration ◽

Instruction Set ◽

Efficient Design ◽

Simulation Framework ◽

Fine Grained ◽

Detailed Simulation ◽

High Level ◽

Many Core ◽

The Cost ◽

Multiple Clusters

The microthreaded many-core architecture comprises multiple clusters of fine-grained multi-threaded cores. The management of concurrency is supported in the instruction set architecture of the cores, and the computational work in an application is asynchronously delegated to different clusters of cores, where each cluster is allocated dynamically. Computer architects are interested in analyzing the complex interaction amongst the dynamically allocated resources. Generally, a detailed cycle-accurate simulation of the execution time is used for this purpose. However, the cycle-accurate simulator for the microthreaded architecture executes at a rate of 100,000 instructions per second, divided over the number of simulated cores. This means that evaluating a complex application executing on a contemporary multi-core machine can be very slow. To perform efficient design space exploration, we present a co-simulation environment in which the detailed execution of instructions in the pipeline of microthreaded cores and the interactions amongst the hardware components are abstracted. We evaluate the high-level simulation framework against the cycle-accurate simulation framework. The results show that the high-level simulator is faster and less complicated than the cycle-accurate simulator, at the cost of some accuracy.
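A short calculation shows what the quoted simulation rate implies. Only the figure of 100,000 instructions per second comes from the abstract; the core count of 128 and the workload of one billion instructions are assumed values picked for illustration.

```python
def per_core_rate(total_rate, num_cores):
    """Simulated instructions per second available to each core."""
    return total_rate / num_cores


def simulation_seconds(total_instructions, total_rate=100_000):
    """Wall-clock seconds needed to simulate a workload at the given rate."""
    return total_instructions / total_rate


rate_128 = per_core_rate(100_000, 128)      # assumed 128 simulated cores
seconds = simulation_seconds(1_000_000_000)  # assumed 1e9-instruction workload
hours = seconds / 3600
```

Under these assumptions each simulated core advances at about 781 instructions per second, and the whole workload takes 10,000 seconds, or roughly 2.8 hours, of wall-clock time.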


An Enhanced Variable Phase Accumulator with Minimal Hardware Complexity Dedicated to ADPLL Applications

2018 15th International Multi-Conference on Systems, Signals & Devices (SSD) ◽

10.1109/ssd.2018.8570405 ◽

2018 ◽

Author(s):

Sehmi Saad ◽

Mongia Mhiri ◽

Aymen Ben Hammadi ◽

Kamel Besbes

Keyword(s):

Hardware Complexity ◽

Variable Phase ◽

Minimal Hardware


High-speed devices for modular reduction with minimal hardware costs

Cogent Engineering ◽

10.1080/23311916.2019.1697555 ◽

2019 ◽

Vol 6 (1) ◽

pp. 1697555

Author(s):

S. Tynymbayev ◽

R. Berdibayev ◽

T. Omar ◽

Y. Aitkhozhayeva ◽

A. Shaikulova ◽

...

Keyword(s):

High Speed ◽

Hardware Costs ◽

Minimal Hardware


APP4MC: Application platform project for multi- and many-core systems

it - Information Technology ◽

10.1515/itit-2017-0019 ◽

2017 ◽

Vol 59 (5) ◽

Author(s):

Robert Höttger ◽

Harald Mackamul ◽

Andreas Sailer ◽

Jan-Philipp Steghöfer ◽

Jörg Tessmer

Keyword(s):

Open Source ◽

Real Time Systems ◽

Time To Market ◽

Performance Simulation ◽

Community Benefits ◽

Development Processes ◽

Great Possibility ◽

Hardware Costs ◽

Many Core ◽

Time Systems

Abstract: Since the automotive domain in particular increasingly utilizes multi- and many-core systems, appropriate models, analyses, and tooling are required to address challenges that were nearly nonexistent so far. APP4MC is an open source Eclipse platform that provides AUTOSAR-compliant common data models (namely AMALTHEA), basic parallelization features, visualizations, and the possibility to add any existing tooling. For example, Eclipse Capra can be added to provide comprehensive traceability throughout the development processes, but any proprietary, commercial, open-source, or prototypical implementation can be integrated. The platform enables the creation and management of complex tool chains, including performance simulation and validation. The entire community benefits from reduced hardware costs, faster time to market, higher quality systems, and rapid adoption. APP4MC is not restricted to the automotive domain and can be used in robotics or generic real-time systems as well.


Efficient task spawning for shared memory and message passing in many-core architectures

Journal of Systems Architecture ◽

10.1016/j.sysarc.2017.03.004 ◽

2017 ◽

Vol 77 ◽

pp. 72-82 ◽

Cited By ~ 3

Author(s):

Aurang Zaib ◽

Thomas Wild ◽

Andreas Herkersdorf ◽

Jan Heisswolf ◽

Jürgen Becker ◽

...

Keyword(s):

Shared Memory ◽

Message Passing ◽

Many Core


On the Minimal Hardware Complexity of Pseudorandom Function Generators

STACS 2001 - Lecture Notes in Computer Science ◽

10.1007/3-540-44693-1_37 ◽

2001 ◽

pp. 419-430 ◽

Cited By ~ 5

Author(s):

Matthias Krause ◽

Stefan Lucks

Keyword(s):

Hardware Complexity ◽

Function Generators ◽

Pseudorandom Function ◽

Minimal Hardware


Communication Optimization for Multiphase Flow Solver in the Library of OpenFOAM

Water ◽

10.3390/w10101461 ◽

2018 ◽

Vol 10 (10) ◽

pp. 1461 ◽

Cited By ~ 6

Author(s):

Zhipeng Lin ◽

Wenjing Yang ◽

Houcun Zhou ◽

Xinhai Xu ◽

Liaoyuan Sun ◽

...

Keyword(s):

Multiphase Flow ◽

Message Passing ◽

Message Passing Interface ◽

Limiting Factors ◽

Preconditioned Conjugate Gradient ◽

Communication Overhead ◽

Communication Optimization ◽

Flow Solver ◽

Costly Communication ◽

Intermediate Variables

Multiphase flow solvers are widely used applications in OpenFOAM, whose scalability suffers from costly communication overhead. Therefore, we establish communication-optimized multiphase flow solvers in OpenFOAM. In this paper, we first run a scalability bottleneck test on the typical multiphase flow case damBreak and reveal that the Message Passing Interface (MPI) communication in the Multidimensional Universal Limiter for Explicit Solution (MULES) and the Preconditioned Conjugate Gradient (PCG) algorithm is the bottleneck of multiphase flow solvers. Furthermore, an analysis of the communication behavior is carried out. We find that the redundant communication in MULES and the global synchronization in PCG are the performance-limiting factors. Based on this analysis, we propose our communication optimization algorithm. For MULES, we remove the redundant communication and obtain optMULES. For PCG, we introduce several intermediate variables and rearrange PCG to reduce the global communication. We also overlap the computation of the matrix-vector multiply and vector update with the non-blocking communication. The resulting algorithms are respectively referred to as OFPiPePCG and OFRePiPePCG. Extensive experiments show that our proposed method can dramatically increase the parallel scalability and solving speed of multiphase flow solvers in OpenFOAM with almost no loss of accuracy.
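The overlap of a global reduction with local computation can be sketched without MPI. The example below is not the OFPiPePCG or OFRePiPePCG algorithm: it uses a Python thread pool as a stand-in for a non-blocking all-reduce, and the helper names and the small matrix are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def dot(x, y):
    """Inner product; in a distributed solver this needs a global reduction."""
    return sum(a * b for a, b in zip(x, y))


def matvec(matrix, x):
    """Dense matrix-vector product standing in for the local sparse multiply."""
    return [dot(row, x) for row in matrix]


def overlapped_step(matrix, r, p, pool):
    """Start the reduction, do local work meanwhile, then wait for the result."""
    pending = pool.submit(dot, r, r)  # stand-in for a non-blocking all-reduce
    ap = matvec(matrix, p)            # local work proceeds during the reduction
    rr = pending.result()             # synchronize only when the value is needed
    return rr, ap


matrix = [[2, 0], [0, 3]]
with ThreadPoolExecutor(max_workers=1) as pool:
    rr, ap = overlapped_step(matrix, r=[1, 2], p=[1, 1], pool=pool)
```

The point of the pattern is the ordering: the reduction is issued before the matrix-vector product and collected after it, so the wait for the global result is hidden behind useful work instead of stalling every process.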
