A Cross-Core Performance Model for Heterogeneous Many-Core Architectures

Compiler-directed scratchpad memory data transfer optimization for multithreaded applications on a heterogeneous many-core architecture

The Journal of Supercomputing ◽

10.1007/s11227-021-03853-x ◽

2021 ◽

Author(s):

Xiaohan Tao ◽

Jianmin Pang ◽

Jinlong Xu ◽

Yu Zhu

Keyword(s):

Energy Consumption ◽

High Performance ◽

Scientific Computing ◽

Data Transfer ◽

Performance Model ◽

Experimental Result ◽

Transfer Model ◽

Scratchpad Memory ◽

On Chip ◽

Many Core

The heterogeneous many-core architecture plays an important role in the fields of high-performance computing and scientific computing. It uses accelerator cores with on-chip memories to improve performance and reduce energy consumption. Scratchpad memory (SPM) is a kind of fast on-chip memory with lower energy consumption compared with a hardware cache. However, data transfer between SPM and off-chip memory can be managed only by a programmer or compiler. In this paper, we propose a compiler-directed multithreaded SPM data transfer model (MSDTM) to optimize the process of data transfer in a heterogeneous many-core architecture. We use compile-time analysis to classify data accesses, check dependences and determine the allocation of data transfer operations. We further present the data transfer performance model to derive the optimal granularity of data transfer and select the most profitable data transfer strategy. We implement the proposed MSDTM on the GCC compiler and evaluate it on Sunway TaihuLight with selected test cases from benchmarks and scientific computing applications. The experimental result shows that the proposed MSDTM improves the application execution time by 5.49× and achieves an energy saving of 5.16× on average.
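To make the idea of deriving a transfer granularity and selecting a transfer strategy concrete, the following is a minimal Python sketch. It is not the MSDTM model from the paper: it assumes a generic cost of fixed startup latency plus size over bandwidth per transfer, and it compares only two hypothetical strategies, single buffering and double buffering. All function names, parameters, and numbers are illustrative assumptions.

```python
import math

def transfer_time(nbytes, latency_s, bandwidth_bps):
    """Assumed cost of one transfer: startup latency plus size / bandwidth."""
    return latency_s + nbytes / bandwidth_bps

def total_time(total_bytes, chunk_bytes, compute_s_per_byte,
               latency_s, bandwidth_bps, double_buffered):
    """Estimated time to stream total_bytes through the SPM in chunks.

    The last chunk is costed as a full chunk, so the estimate is slightly
    pessimistic when total_bytes is not a multiple of chunk_bytes.
    """
    n_chunks = math.ceil(total_bytes / chunk_bytes)
    t_xfer = transfer_time(chunk_bytes, latency_s, bandwidth_bps)
    t_comp = chunk_bytes * compute_s_per_byte
    if double_buffered:
        # The first transfer is exposed; after that, the transfer of the next
        # chunk overlaps with the computation on the current one.
        return t_xfer + (n_chunks - 1) * max(t_xfer, t_comp) + t_comp
    # Single buffer: every transfer and every computation is serialized.
    return n_chunks * (t_xfer + t_comp)

def best_plan(total_bytes, spm_bytes, compute_s_per_byte,
              latency_s, bandwidth_bps, candidates):
    """Return (time, chunk_bytes, double_buffered) with the lowest estimate."""
    plans = []
    for chunk in candidates:
        for double in (False, True):
            buffers = 2 if double else 1
            if chunk * buffers > spm_bytes:
                continue  # this plan does not fit in the scratchpad
            t = total_time(total_bytes, chunk, compute_s_per_byte,
                           latency_s, bandwidth_bps, double)
            plans.append((t, chunk, double))
    return min(plans)  # raises ValueError if no candidate fits

if __name__ == "__main__":
    # Hypothetical machine parameters, chosen only for illustration.
    plan = best_plan(total_bytes=1 << 20, spm_bytes=64 * 1024,
                     compute_s_per_byte=1e-9, latency_s=1e-6,
                     bandwidth_bps=1e10,
                     candidates=[1024, 4096, 16384, 32768, 65536])
    print(plan)
```

Even in this simplified model the preferred chunk size is not simply the largest one that fits: very small chunks are dominated by startup latency, while very large chunks lengthen the first, non-overlapped transfer.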


Multi-core and many-core shared-memory parallel raycasting volume rendering optimization and tuning

The International Journal of High Performance Computing Applications ◽

10.1177/1094342012440466 ◽

2012 ◽

Vol 26 (4) ◽

pp. 399-412 ◽

Cited By ~ 13

Author(s):

E Wes Bethel ◽

Mark Howison

Keyword(s):

Shared Memory ◽

Volume Rendering ◽

Performance Optimization ◽

Optimal Algorithm ◽

Performance Model ◽

Crossover Point ◽

Memory Cache ◽

And Performance ◽

Many Core ◽

Optimal Configurations

Given the computing industry trend of increasing processing capacity by adding more cores to a chip, the focus of this work is tuning the performance of a staple visualization algorithm, raycasting volume rendering, for shared-memory parallelism on multi-core CPUs and many-core GPUs. Our approach is to vary tunable algorithmic settings, along with known algorithmic optimizations and two different memory layouts, and measure performance in terms of absolute runtime and L2 memory cache misses. Our results indicate there is a wide variation in runtime performance on all platforms, as much as 254% for the tunable parameters we test on multi-core CPUs and 265% on many-core GPUs, and the optimal configurations vary across platforms, often in a non-obvious way. For example, our results indicate the optimal configurations on the GPU occur at a crossover point between those that maintain good cache utilization and those that saturate computational throughput. This result is likely to be extremely difficult to predict with an empirical performance model for this particular algorithm because it has an unstructured memory access pattern that varies locally for individual rays and globally for the selected viewpoint. Our results also show that optimal parameters on modern architectures are markedly different from those in previous studies run on older architectures. In addition, given the dramatic performance variation across platforms for both optimal algorithm settings and performance results, there is a clear benefit for production visualization and analysis codes to adopt a strategy for performance optimization through auto-tuning. These benefits will likely become more pronounced in the future as the number of cores per chip and the cost of moving data through the memory hierarchy both increase.


Transitioning Spiking Neural Network Simulators to Heterogeneous Hardware

ACM Transactions on Modeling and Computer Simulation ◽

10.1145/3422389 ◽

2021 ◽

Vol 31 (2) ◽

pp. 1-26

Author(s):

Quang Anh Pham Nguyen ◽

Philipp Andelfinger ◽

Wen Jun Tan ◽

Wentong Cai ◽

Alois Knoll

Keyword(s):

Large Scale ◽

Simulation Models ◽

Performance Model ◽

Model Code ◽

Code Base ◽

Heterogeneous Hardware ◽

Computationally Intensive ◽

The Many ◽

Many Core ◽

Hardware Platforms

Spiking neural networks (SNN) are among the most computationally intensive types of simulation models, with node counts on the order of up to 10^11. Currently, there is intensive research into hardware platforms suitable to support large-scale SNN simulations, whereas several of the most widely used simulators still rely purely on the execution on CPUs. Enabling the execution of these established simulators on heterogeneous hardware allows new studies to exploit the many-core hardware prevalent in modern supercomputing environments, while still being able to reproduce and compare with results from a vast body of existing literature. In this article, we propose a transition approach for CPU-based SNN simulators to enable the execution on heterogeneous hardware (e.g., CPUs, GPUs, and FPGAs), with only limited modifications to an existing simulator code base and without changes to model code. Our approach relies on manual porting of a small number of core simulator functionalities as found in common SNN simulators, whereas the unmodified model code is analyzed and transformed automatically. We apply our approach to the well-known simulator NEST and make a version executable on heterogeneous hardware available to the community. Our measurements show that at full utilization, a single GPU achieves the performance of about 9 CPU cores. A CPU-GPU co-execution with load balancing is also demonstrated, which shows better performance compared to CPU-only or GPU-only execution. Finally, an analytical performance model is proposed to heuristically determine the optimal parameters to execute the heterogeneous NEST.
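As a rough illustration of what CPU-GPU load balancing can look like, the sketch below splits work in proportion to estimated throughput, using the abstract's figure that one GPU at full utilization performs like about 9 CPU cores. The proportional rule, the function name `split_work`, and its parameters are assumptions made for this example; the abstract does not describe the article's actual load-balancing scheme or its analytical model.

```python
def split_work(total_items, cpu_cores, gpu_count, gpu_core_equiv=9.0):
    """Split items between CPUs and GPUs in proportion to estimated throughput.

    gpu_core_equiv is the assumed number of CPU cores one GPU is worth
    (about 9 according to the abstract, at full GPU utilization).
    Returns (cpu_items, gpu_items).
    """
    cpu_capacity = float(cpu_cores)
    gpu_capacity = gpu_count * gpu_core_equiv
    total_capacity = cpu_capacity + gpu_capacity
    if total_capacity <= 0:
        raise ValueError("need at least one CPU core or one GPU")
    gpu_items = round(total_items * gpu_capacity / total_capacity)
    return total_items - gpu_items, gpu_items

if __name__ == "__main__":
    # Hypothetical workload: 1000 work items, 9 CPU cores, 1 GPU.
    print(split_work(1000, cpu_cores=9, gpu_count=1))
```

Under this rule, a host with 9 CPU cores and one GPU would give each side about half of the work, which is one way to see why co-execution can outperform CPU-only or GPU-only runs.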


A Performance Model of Dense Matrix Operations on Many-Core Architectures

Lecture Notes in Computer Science - Euro-Par 2008 – Parallel Processing ◽

10.1007/978-3-540-85451-7_14 ◽

2008 ◽

pp. 120-129 ◽

Cited By ~ 4

Author(s):

Guoping Long ◽

Dongrui Fan ◽

Junchao Zhang ◽

Fenglong Song ◽

Nan Yuan ◽

...

Keyword(s):

Performance Model ◽

Dense Matrix ◽

Matrix Operations ◽

Many Core ◽

A Performance


Driver performance model: 1. Conceptual framework

PsycEXTRA Dataset ◽

10.1037/e447302006-001 ◽

2001 ◽

Author(s):

Joseph M. Heimerl

Keyword(s):

Conceptual Framework ◽

Performance Model ◽

Driver Performance


Family Identification with the Firm, Non-family Stakeholders Orientation, and Economic Performance--Model

PsycTESTS Dataset ◽

10.1037/t76582-000 ◽

2020 ◽

Author(s):

Mª de la Cruz Déniz‐Déniz ◽

Mª Katiuska Cabrera-Suárez ◽

Josefa D. Martín-Santana

Keyword(s):

Economic Performance ◽

Performance Model


Environment-Usage-Performance Model on Participating Firms of Electronic Marketplace in Export Marketing

International Commerce and Information Review ◽

10.15798/kaici.9.1.200703.119 ◽

2007 ◽

Vol 9 (1) ◽

pp. 119-148

Author(s):

정찬근 ◽

곽수영

Keyword(s):

Performance Model ◽

Electronic Marketplace ◽

Export Marketing


Autonomic Diffusive Load Balancing on Many-Core Architecture Using Simulated Annealing

IEICE Transactions on Fundamentals of Electronics Communications and Computer Sciences ◽

10.1587/transfun.e100.a.1640 ◽

2017 ◽

Vol E100.A (8) ◽

pp. 1640-1649

Author(s):

Hyunjik SONG ◽

Kiyoung CHOI

Keyword(s):

Simulated Annealing ◽

Load Balancing ◽

Many Core


Architecture and Evaluation of Low Power Many-Core SoC with Two 32-Core Clusters

IEICE Transactions on Electronics ◽

10.1587/transele.e97.c.360 ◽

2014 ◽

Vol E97.C (4) ◽

pp. 360-368

Author(s):

Takashi MIYAMORI ◽

Hui XU ◽

Hiroyuki USUI ◽

Soichiro HOSODA ◽

Toru SANO ◽

...

Keyword(s):

Low Power ◽

Many Core


Applied Research and Analysis of Construction of Foreign Language Learning Performance Model

2018 International Conference on Social Sciences, Education and Management (SOCSEM 2018) ◽

10.25236/socsem.2018.84 ◽

2018 ◽

Keyword(s):

Foreign Language ◽

Language Learning ◽

Applied Research ◽

Foreign Language Learning ◽

Performance Model ◽

Learning Performance
