Study of Automatic Offloading Method in Mixed Offloading Destination Environment

2021
Author(s): Yoji Yamato

IEICE Technical Report, IN2020-30. In recent years, the use of heterogeneous hardware beyond small-core CPUs, such as GPUs, FPGAs, and many-core CPUs, has been increasing. However, using heterogeneous hardware requires overcoming high technical-skill barriers such as OpenMP, CUDA, and OpenCL. Based on this, I have proposed environment-adaptive software that enables automatic conversion, configuration, and high-performance operation of once-written code according to the hardware where it is deployed. However, neither existing technologies nor prior research properly and automatically offload applications to an environment that mixes offloading destinations such as GPUs, FPGAs, and many-core CPUs. In this paper, as a new element of environment-adaptive software, I study a method for offloading applications properly and automatically in an environment where GPU, FPGA, and many-core CPU offloading destinations are mixed. I evaluate the effectiveness of the proposed method on multiple applications.
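The abstract gives no implementation details of the destination-selection step; the following Python sketch is only a rough illustration of the idea, assuming hypothetical per-device builds of the same application (app_openmp, app_cuda, app_opencl_fpga) whose measured run times decide the offload destination.

```python
import subprocess
import time

# Hypothetical back-end builds of the same application: each binary has the
# candidate loop offloaded to a different device. Names are illustrative only.
CANDIDATE_BINARIES = {
    "many_core_cpu": "./app_openmp",
    "gpu": "./app_cuda",
    "fpga": "./app_opencl_fpga",
}

def measure(binary, runs=3):
    """Return the best wall-clock time over a few runs of one candidate."""
    best = float("inf")
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run([binary], check=True)
        best = min(best, time.perf_counter() - start)
    return best

def pick_offload_destination():
    """Measure every candidate and return the fastest offload destination."""
    timings = {dev: measure(binary) for dev, binary in CANDIDATE_BINARIES.items()}
    return min(timings, key=timings.get), timings

if __name__ == "__main__":
    destination, timings = pick_offload_destination()
    print("measured times:", timings)
    print("chosen offload destination:", destination)
```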

2021
Author(s): Yoji Yamato

IEICE Technical Workshop on Network Software, NWS-19-6. Recently, heterogeneous hardware such as GPUs and FPGAs has been used in many systems, and the number of IoT devices is also increasing rapidly. However, utilizing heterogeneous hardware involves high hurdles because it requires considerable technical skill. I have proposed environment-adaptive software that operates a once-written application with high performance by automatically converting its code and configuring its settings so that GPUs, FPGAs, and IoT devices can be utilized at the deployment location, and I have partly achieved automatic GPU offloading. In this paper, I study a method of FPGA offloading that automatically extracts appropriate loop statements from application software.
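As a loose illustration of loop extraction (not the paper's actual method), the sketch below filters hypothetical loop descriptors by total work and arithmetic intensity, two properties that commonly decide whether an FPGA offload can pay for its transfer and pipeline setup costs; all thresholds and loop data are made up.

```python
# Toy representation of loops found in application code (hypothetical data).
# In practice, loops would come from a real parser such as Clang's AST.
loops = [
    {"id": "loop_fft",  "trip_count": 1_000_000, "flops_per_iter": 40, "bytes_per_iter": 16},
    {"id": "loop_init", "trip_count": 1_000,     "flops_per_iter": 1,  "bytes_per_iter": 8},
]

def is_fpga_candidate(loop, min_work=1e6, min_intensity=1.0):
    """Keep loops with enough total work and arithmetic intensity to amortize
    PCIe transfer and FPGA pipeline setup costs (thresholds are invented)."""
    total_flops = loop["trip_count"] * loop["flops_per_iter"]
    intensity = loop["flops_per_iter"] / loop["bytes_per_iter"]
    return total_flops >= min_work and intensity >= min_intensity

candidates = [l["id"] for l in loops if is_fpga_candidate(l)]
print("FPGA offload candidates:", candidates)
```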


Author(s): Xiaohan Tao, Jianmin Pang, Jinlong Xu, Yu Zhu

The heterogeneous many-core architecture plays an important role in the fields of high-performance computing and scientific computing. It uses accelerator cores with on-chip memories to improve performance and reduce energy consumption. Scratchpad memory (SPM) is a kind of fast on-chip memory with lower energy consumption than a hardware cache. However, data transfer between SPM and off-chip memory can be managed only by a programmer or compiler. In this paper, we propose a compiler-directed multithreaded SPM data transfer model (MSDTM) to optimize the process of data transfer in a heterogeneous many-core architecture. We use compile-time analysis to classify data accesses, check dependences, and determine the allocation of data transfer operations. We further present a data transfer performance model to derive the optimal granularity of data transfer and select the most profitable data transfer strategy. We implement the proposed MSDTM in the GCC compiler and evaluate it on Sunway TaihuLight with selected test cases from benchmarks and scientific computing applications. The experimental results show that the proposed MSDTM improves application execution time by 5.49× and achieves an energy saving of 5.16× on average.
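The paper's performance model is not reproduced here; as a simplified illustration of deriving a transfer granularity, the sketch below minimizes an assumed cost model in which DMA startup, bandwidth-limited transfer, and per-element compute overlap under double buffering. All constants are invented.

```python
# Illustrative only: pick a DMA transfer granularity (elements per chunk) that
# minimizes an assumed cost model: per-transfer startup latency plus
# bandwidth-limited transfer time, overlapped with per-element compute time.
STARTUP_CYCLES = 500         # assumed DMA startup cost per transfer
CYCLES_PER_BYTE = 0.25       # assumed inverse bandwidth
COMPUTE_CYCLES_PER_ELEM = 4  # assumed compute cost per element
ELEM_BYTES = 8
TOTAL_ELEMS = 1 << 20
SPM_BYTES = 64 * 1024        # scratchpad capacity limits the chunk size

def cost(chunk_elems):
    """Total cycles with double buffering: compute and transfer overlap, so each
    chunk costs the max of its transfer time and its compute time, plus one
    initial transfer to fill the pipeline."""
    transfers = TOTAL_ELEMS // chunk_elems
    transfer = STARTUP_CYCLES + CYCLES_PER_BYTE * chunk_elems * ELEM_BYTES
    compute = COMPUTE_CYCLES_PER_ELEM * chunk_elems
    return transfers * max(transfer, compute) + transfer

# Double buffering needs two chunks resident in SPM at once.
max_chunk = SPM_BYTES // (2 * ELEM_BYTES)
candidates = [c for c in (64, 128, 256, 512, 1024, 2048, 4096) if c <= max_chunk]
best = min(candidates, key=cost)
print("chosen transfer granularity (elements):", best)
```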


Electronics, 2020, Vol 9 (3), pp. 449
Author(s): Mohammad Amir Mansoori, Mario R. Casu

Principal Component Analysis (PCA) is a technique for dimensionality reduction that is useful for removing redundant information from data in various applications such as Microwave Imaging (MI) and Hyperspectral Imaging (HI). The computational complexity of PCA has made its hardware acceleration an active research topic in recent years. Although the hardware design flow can be optimized using High-Level Synthesis (HLS) tools, efficient high-performance solutions for complex embedded systems still require careful design. In this paper, we propose a flexible PCA hardware accelerator for Field-Programmable Gate Arrays (FPGAs) designed entirely in HLS. To make the internal PCA computations more efficient, a new block-streaming method is also introduced. Several HLS optimization strategies are adopted to create efficient hardware. The flexibility of our design allows us to use it for different FPGA targets and with flexible input data dimensions, and it also lets us easily switch from a more accurate floating-point implementation to a higher-speed fixed-point solution. The results show the efficiency of our design compared to state-of-the-art implementations on GPUs, many-core CPUs, and other FPGA approaches in terms of resource usage, execution time, and power consumption.
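The block-streaming method itself is not detailed in the abstract; the following NumPy sketch only illustrates the general idea of processing data block by block, accumulating the mean and covariance incrementally so that the full data set never needs to be resident, before extracting the leading components. It is a generic streaming-covariance PCA, not the authors' HLS design.

```python
import numpy as np

def streaming_pca(block_iter, n_features, n_components):
    """Accumulate mean and covariance one block at a time, then take the
    leading eigenvectors of the resulting covariance matrix."""
    count = 0
    s = np.zeros(n_features)                 # running sum of samples
    g = np.zeros((n_features, n_features))   # running sum of outer products
    for block in block_iter:                 # block shape: (block_rows, n_features)
        count += block.shape[0]
        s += block.sum(axis=0)
        g += block.T @ block
    mean = s / count
    cov = g / count - np.outer(mean, mean)
    eigvals, eigvecs = np.linalg.eigh(cov)
    return eigvecs[:, ::-1][:, :n_components]  # top components, largest first

# Usage with random data split into blocks of 256 rows.
data = np.random.rand(4096, 16)
blocks = (data[i:i + 256] for i in range(0, len(data), 256))
components = streaming_pca(blocks, n_features=16, n_components=3)
print(components.shape)  # (16, 3)
```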


Author(s): R. G. Beausoleil, M. Fiorentino, J. Ahn, N. Binkert, A. Davis, ...

2020, Vol 10 (4), pp. 37
Author(s): Habiba Lahdhiri, Jordane Lorandel, Salvatore Monteleone, Emmanuelle Bourdel, Maurizio Palesi

The Network-on-Chip (NoC) paradigm has been proposed as a promising solution for handling the high degree of integration in multi-/many-core architectures. Despite their advantages, wired NoC infrastructures face several performance issues regarding multi-hop long-distance communications. RF-NoC is an attractive solution offering high performance and multicast/broadcast capabilities. However, managing RF links is a critical aspect that depends on both application-dependent and architectural parameters. This paper proposes a design space exploration framework for an OFDMA-based RF-NoC architecture, which takes advantage of both real application benchmarks simulated using Sniper and an RF-NoC architecture modeled using Noxim. We adopted the proposed framework to finely configure a routing algorithm working with real traffic, achieving up to a 45% delay reduction compared to a wired NoC setup under similar conditions.
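The configured routing algorithm is not described in the abstract; as a generic illustration of the kind of decision such an algorithm makes, the sketch below routes a packet over the RF layer only when the wired Manhattan distance exceeds an assumed break-even hop count. All latency constants are hypothetical.

```python
# Generic illustration (not the paper's algorithm): prefer the wireless OFDMA
# layer only when the wired path is long enough that a single-hop RF link is
# expected to be faster.
HOP_LATENCY = 3        # assumed cycles per wired hop
RF_LATENCY = 12        # assumed cycles for one RF transmission (access + flight)
RF_THRESHOLD_HOPS = RF_LATENCY // HOP_LATENCY + 1

def choose_link(src, dst):
    """src, dst: (x, y) mesh coordinates. Return 'rf' or 'wired'."""
    hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    return "rf" if hops >= RF_THRESHOLD_HOPS else "wired"

print(choose_link((0, 0), (7, 7)))  # long path -> 'rf'
print(choose_link((0, 0), (1, 2)))  # short path -> 'wired'
```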


Author(s): Simon McIntosh-Smith, James Price, Richard B Sessions, Amaurys A Ibarra
