A Cross-Core Performance Model for Heterogeneous Many-Core Architectures

Author(s):  
Rui Pinheiro ◽  
Nuno Roma ◽  
Pedro Tomás
Keyword(s):  
Author(s):  
Xiaohan Tao ◽  
Jianmin Pang ◽  
Jinlong Xu ◽  
Yu Zhu

AbstractThe heterogeneous many-core architecture plays an important role in the fields of high-performance computing and scientific computing. It uses accelerator cores with on-chip memories to improve performance and reduce energy consumption. Scratchpad memory (SPM) is a kind of fast on-chip memory with lower energy consumption compared with a hardware cache. However, data transfer between SPM and off-chip memory can be managed only by a programmer or compiler. In this paper, we propose a compiler-directed multithreaded SPM data transfer model (MSDTM) to optimize the process of data transfer in a heterogeneous many-core architecture. We use compile-time analysis to classify data accesses, check dependences and determine the allocation of data transfer operations. We further present the data transfer performance model to derive the optimal granularity of data transfer and select the most profitable data transfer strategy. We implement the proposed MSDTM on the GCC complier and evaluate it on Sunway TaihuLight with selected test cases from benchmarks and scientific computing applications. The experimental result shows that the proposed MSDTM improves the application execution time by 5.49$$\times$$ × and achieves an energy saving of 5.16$$\times$$ × on average.


Author(s):  
E Wes Bethel ◽  
Mark Howison

Given the computing industry trend of increasing processing capacity by adding more cores to a chip, the focus of this work is tuning the performance of a staple visualization algorithm, raycasting volume rendering, for shared-memory parallelism on multi-core CPUs and many-core GPUs. Our approach is to vary tunable algorithmic settings, along with known algorithmic optimizations and two different memory layouts, and measure performance in terms of absolute runtime and L2 memory cache misses. Our results indicate there is a wide variation in runtime performance on all platforms, as much as 254% for the tunable parameters we test on multi-core CPUs and 265% on many-core GPUs, and the optimal configurations vary across platforms, often in a non-obvious way. For example, our results indicate the optimal configurations on the GPU occur at a crossover point between those that maintain good cache utilization and those that saturate computational throughput. This result is likely to be extremely difficult to predict with an empirical performance model for this particular algorithm because it has an unstructured memory access pattern that varies locally for individual rays and globally for the selected viewpoint. Our results also show that optimal parameters on modern architectures are markedly different from those in previous studies run on older architectures. In addition, given the dramatic performance variation across platforms for both optimal algorithm settings and performance results, there is a clear benefit for production visualization and analysis codes to adopt a strategy for performance optimization through auto-tuning. These benefits will likely become more pronounced in the future as the number of cores per chip and the cost of moving data through the memory hierarchy both increase.


2021 ◽  
Vol 31 (2) ◽  
pp. 1-26
Author(s):  
Quang Anh Pham Nguyen ◽  
Philipp Andelfinger ◽  
Wen Jun Tan ◽  
Wentong Cai ◽  
Alois Knoll

Spiking neural networks (SNN) are among the most computationally intensive types of simulation models, with node counts on the order of up to 10 11 . Currently, there is intensive research into hardware platforms suitable to support large-scale SNN simulations, whereas several of the most widely used simulators still rely purely on the execution on CPUs. Enabling the execution of these established simulators on heterogeneous hardware allows new studies to exploit the many-core hardware prevalent in modern supercomputing environments, while still being able to reproduce and compare with results from a vast body of existing literature. In this article, we propose a transition approach for CPU-based SNN simulators to enable the execution on heterogeneous hardware (e.g., CPUs, GPUs, and FPGAs), with only limited modifications to an existing simulator code base and without changes to model code. Our approach relies on manual porting of a small number of core simulator functionalities as found in common SNN simulators, whereas the unmodified model code is analyzed and transformed automatically. We apply our approach to the well-known simulator NEST and make a version executable on heterogeneous hardware available to the community. Our measurements show that at full utilization, a single GPU achieves the performance of about 9 CPU cores. A CPU-GPU co-execution with load balancing is also demonstrated, which shows better performance compared to CPU-only or GPU-only execution. Finally, an analytical performance model is proposed to heuristically determine the optimal parameters to execute the heterogeneous NEST.


Author(s):  
Guoping Long ◽  
Dongrui Fan ◽  
Junchao Zhang ◽  
Fenglong Song ◽  
Nan Yuan ◽  
...  

2020 ◽  
Author(s):  
Mª de la Cruz Déniz‐Déniz ◽  
Mª Katiuska Cabrera-Suárez ◽  
Josefa D. Martín-Santana

2014 ◽  
Vol E97.C (4) ◽  
pp. 360-368
Author(s):  
Takashi MIYAMORI ◽  
Hui XU ◽  
Hiroyuki USUI ◽  
Soichiro HOSODA ◽  
Toru SANO ◽  
...  
Keyword(s):  

Sign in / Sign up

Export Citation Format

Share Document