Study of Automatic Offloading Method in Mixed Offloading Destination Environment

2021
Author(s): Yoji Yamato

IEICE Technical Report, IN2020-30. In recent years, the use of heterogeneous hardware beyond small-core CPUs, such as GPUs, FPGAs, and many-core CPUs, has been increasing. However, using heterogeneous hardware requires overcoming high technical-skill barriers such as OpenMP, CUDA, and OpenCL. Based on this, I have proposed environment-adaptive software that enables automatic conversion, configuration, and high-performance operation of once-written code according to the hardware where it is deployed. However, neither existing technologies nor prior research properly and automatically offload applications to an environment that mixes offloading destinations such as GPUs, FPGAs, and many-core CPUs. In this paper, as a new element of environment-adaptive software, I study a method for offloading applications properly and automatically in an environment where GPU, FPGA, and many-core CPU offloading destinations are mixed. I evaluate the effectiveness of the proposed method on multiple applications.
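The abstract gives no implementation details of the destination-selection step; the following Python sketch is only a rough illustration of the idea, assuming hypothetical per-device builds of the same application (app_openmp, app_cuda, app_opencl_fpga) whose measured run times decide the offload destination.

```python
import subprocess
import time

# Hypothetical back-end builds of the same application: each binary has the
# candidate loop offloaded to a different device. Names are illustrative only.
CANDIDATE_BINARIES = {
    "many_core_cpu": "./app_openmp",
    "gpu": "./app_cuda",
    "fpga": "./app_opencl_fpga",
}

def measure(binary, runs=3):
    """Return the best wall-clock time over a few runs of one candidate."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run([binary], check=True)
        best = min(best, time.perf_counter() - start)
    return best

def pick_offload_destination():
    """Measure every candidate and return the fastest offload destination."""
    timings = {dev: measure(binary) for dev, binary in CANDIDATE_BINARIES.items()}
    return min(timings, key=timings.get), timings

if __name__ == "__main__":
    destination, timings = pick_offload_destination()
    print("measured times:", timings)
    print("chosen offload destination:", destination)
```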

2021
Author(s): Yoji Yamato

IEICE Technical Workshop on Network Software, NWS-19-6. Recently, heterogeneous hardware such as GPUs and FPGAs has been used in many systems, and the number of IoT devices is also increasing rapidly. However, utilizing heterogeneous hardware involves high hurdles because it requires considerable technical skill. I have proposed environment-adaptive software that operates a once-written application with high performance by automatically converting its code and configuring its settings so that GPUs, FPGAs, and IoT devices can be utilized at the deployment location, and I have partly achieved automatic GPU offloading. In this paper, I study a method of FPGA offloading that automatically extracts appropriate loop statements from application software.
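As a loose illustration of loop extraction (not the paper's actual method), the sketch below filters hypothetical loop descriptors by total work and arithmetic intensity, two properties that commonly decide whether an FPGA offload can pay for its transfer and pipeline setup costs; all thresholds and loop data are made up.

```python
# Toy representation of loops found in application code (hypothetical data).
# In practice, loops would come from a real parser such as Clang's AST.
loops = [
    {"id": "loop_fft",  "trip_count": 1_000_000, "flops_per_iter": 40, "bytes_per_iter": 16},
    {"id": "loop_init", "trip_count": 1_000,     "flops_per_iter": 1,  "bytes_per_iter": 8},
]

def is_fpga_candidate(loop, min_work=1e6, min_intensity=1.0):
    """Keep loops with enough total work and arithmetic intensity to amortize
    PCIe transfer and FPGA pipeline setup costs (thresholds are invented)."""
    total_flops = loop["trip_count"] * loop["flops_per_iter"]
    intensity = loop["flops_per_iter"] / loop["bytes_per_iter"]
    return total_flops >= min_work and intensity >= min_intensity

candidates = [l["id"] for l in loops if is_fpga_candidate(l)]
print("FPGA offload candidates:", candidates)
```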


Author(s): Xiaohan Tao, Jianmin Pang, Jinlong Xu, Yu Zhu

The heterogeneous many-core architecture plays an important role in the fields of high-performance computing and scientific computing. It uses accelerator cores with on-chip memories to improve performance and reduce energy consumption. Scratchpad memory (SPM) is a kind of fast on-chip memory with lower energy consumption than a hardware cache. However, data transfer between SPM and off-chip memory can be managed only by a programmer or compiler. In this paper, we propose a compiler-directed multithreaded SPM data transfer model (MSDTM) to optimize the process of data transfer in a heterogeneous many-core architecture. We use compile-time analysis to classify data accesses, check dependences, and determine the allocation of data transfer operations. We further present a data transfer performance model to derive the optimal granularity of data transfer and select the most profitable data transfer strategy. We implement the proposed MSDTM in the GCC compiler and evaluate it on Sunway TaihuLight with selected test cases from benchmarks and scientific computing applications. The experimental results show that the proposed MSDTM improves application execution time by 5.49× and achieves an energy saving of 5.16× on average.
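The paper's performance model is not reproduced here; as a simplified illustration of deriving a transfer granularity, the sketch below minimizes an assumed cost model in which DMA startup, bandwidth-limited transfer, and per-element compute overlap under double buffering. All constants are invented.

```python
# Illustrative only: pick a DMA transfer granularity (elements per chunk) that
# minimizes an assumed cost model: per-transfer startup latency plus
# bandwidth-limited transfer time, overlapped with per-element compute time.
STARTUP_CYCLES = 500         # assumed DMA startup cost per transfer
CYCLES_PER_BYTE = 0.25       # assumed inverse bandwidth
COMPUTE_CYCLES_PER_ELEM = 4  # assumed compute cost per element
ELEM_BYTES = 8
TOTAL_ELEMS = 1 << 20
SPM_BYTES = 64 * 1024        # scratchpad capacity limits the chunk size

def cost(chunk_elems):
    """Total cycles with double buffering: compute and transfer overlap, so each
    chunk costs the max of its transfer time and its compute time, plus one
    initial transfer to fill the pipeline."""
    transfers = TOTAL_ELEMS // chunk_elems
    transfer = STARTUP_CYCLES + CYCLES_PER_BYTE * chunk_elems * ELEM_BYTES
    compute = COMPUTE_CYCLES_PER_ELEM * chunk_elems
    return transfers * max(transfer, compute) + transfer

# Double buffering needs two chunks resident in SPM at once.
max_chunk = SPM_BYTES // (2 * ELEM_BYTES)
candidates = [c for c in (64, 128, 256, 512, 1024, 2048, 4096) if c <= max_chunk]
best = min(candidates, key=cost)
print("chosen transfer granularity (elements):", best)
```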


Electronics, 2020, Vol 9 (3), pp. 449
Author(s): Mohammad Amir Mansoori, Mario R. Casu

Principal Component Analysis (PCA) is a technique for dimensionality reduction that is useful for removing redundant information from data in various applications such as Microwave Imaging (MI) and Hyperspectral Imaging (HI). The computational complexity of PCA has made its hardware acceleration an active research topic in recent years. Although the hardware design flow can be optimized using High-Level Synthesis (HLS) tools, efficient high-performance solutions for complex embedded systems still require careful design. In this paper, we propose a flexible PCA hardware accelerator for Field-Programmable Gate Arrays (FPGAs) designed entirely in HLS. To make the internal PCA computations more efficient, a new block-streaming method is also introduced. Several HLS optimization strategies are adopted to create efficient hardware. The flexibility of our design allows us to use it for different FPGA targets and with flexible input data dimensions, and it also lets us easily switch from a more accurate floating-point implementation to a higher-speed fixed-point solution. The results show the efficiency of our design compared to state-of-the-art implementations on GPUs, many-core CPUs, and other FPGA approaches in terms of resource usage, execution time, and power consumption.
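The block-streaming method itself is not detailed in the abstract; the following NumPy sketch only illustrates the general idea of processing data block by block, accumulating the mean and covariance incrementally so that the full data set never needs to be resident, before extracting the leading components. It is a generic streaming-covariance PCA, not the authors' HLS design.

```python
import numpy as np

def streaming_pca(block_iter, n_features, n_components):
    """Accumulate mean and covariance one block at a time, then take the
    leading eigenvectors of the resulting covariance matrix."""
    count = 0
    s = np.zeros(n_features)                 # running sum of samples
    g = np.zeros((n_features, n_features))   # running sum of outer products
    for block in block_iter:                 # block shape: (block_rows, n_features)
        count += block.shape[0]
        s += block.sum(axis=0)
        g += block.T @ block
    mean = s / count
    cov = g / count - np.outer(mean, mean)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, ::-1][:, :n_components]  # top components, largest first

# Usage with random data split into blocks of 256 rows.
data = np.random.rand(4096, 16)
blocks = (data[i:i + 256] for i in range(0, len(data), 256))
components = streaming_pca(blocks, n_features=16, n_components=3)
print(components.shape)  # (16, 3)
```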


Author(s): R. G. Beausoleil, M. Fiorentino, J. Ahn, N. Binkert, A. Davis, ...

2020, Vol 10 (4), pp. 37
Author(s): Habiba Lahdhiri, Jordane Lorandel, Salvatore Monteleone, Emmanuelle Bourdel, Maurizio Palesi

The Network-on-Chip (NoC) paradigm has been proposed as a promising solution for handling the high degree of integration in multi-/many-core architectures. Despite their advantages, wired NoC infrastructures face several performance issues regarding multi-hop long-distance communications. RF-NoC is an attractive solution offering high performance and multicast/broadcast capabilities. However, managing RF links is a critical aspect that depends on both application-dependent and architectural parameters. This paper proposes a design space exploration framework for an OFDMA-based RF-NoC architecture, which takes advantage of both real application benchmarks simulated using Sniper and an RF-NoC architecture modeled using Noxim. We adopted the proposed framework to finely configure a routing algorithm working with real traffic, achieving up to a 45% delay reduction compared to a wired NoC setup under similar conditions.
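The configured routing algorithm is not described in the abstract; as a generic illustration of the kind of decision such an algorithm makes, the sketch below routes a packet over the RF layer only when the wired Manhattan distance exceeds an assumed break-even hop count. All latency constants are hypothetical.

```python
# Generic illustration (not the paper's algorithm): prefer the wireless OFDMA
# layer only when the wired path is long enough that a single-hop RF link is
# expected to be faster.
HOP_LATENCY = 3        # assumed cycles per wired hop
RF_LATENCY = 12        # assumed cycles for one RF transmission (access + flight)
RF_THRESHOLD_HOPS = RF_LATENCY // HOP_LATENCY + 1

def choose_link(src, dst):
    """src, dst: (x, y) mesh coordinates. Return 'rf' or 'wired'."""
    hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    return "rf" if hops >= RF_THRESHOLD_HOPS else "wired"

print(choose_link((0, 0), (7, 7)))  # long path -> 'rf'
print(choose_link((0, 0), (1, 2)))  # short path -> 'wired'
```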


Author(s): Simon McIntosh-Smith, James Price, Richard B Sessions, Amaurys A Ibarra
