Coordinated transformations for high-level synthesis of high performance microprocessor blocks

Due to performance and energy requirements, FPGA-based accelerators have become a promising solution for high-performance computations. Meanwhile, with the help of high-level synthesis (HLS) compilers, FPGAs can be programmed using common programming languages such as C, C++, or OpenCL, thereby improving design efficiency and portability. Stencil computations are significant kernels in various scientific applications. In this paper, we introduce an architecture design for implementing stencil kernels on state-of-the-art FPGAs with high bandwidth memory (HBM). Traditional FPGAs are usually equipped with external memory, e.g., DDR3 or DDR4, whose bandwidth limits the design space exploration in the spatial domain of stencil kernels. Therefore, many previous studies mainly relied on exploiting parallelism in the temporal domain to work around the bandwidth limitations. In our approach, we scale up the design performance by considering both the spatial and temporal parallelism of the stencil kernel equally. We also discuss the design portability among different HLS compilers. We use typical stencil kernels to evaluate our design on a Xilinx U280 FPGA board and compare the results with other existing studies. By adopting our method, developers can choose from a broad range of parallelization strategies based on specific FPGA resources to improve performance.
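
The two kinds of parallelism named in this abstract can be illustrated with a small software model. The sketch below is not the paper's architecture. It assumes a hypothetical 1D three-point averaging stencil with fixed boundary cells, and the names `stencil_step`, `stencil_step_lanes`, `run_stages`, `lanes`, and `stages` are illustrative only. Spatial parallelism is modeled as a group of `lanes` independent cell updates per iteration, and temporal parallelism as a chain of `stages` time steps applied back to back.

```python
def stencil_step(cells):
    """Reference: one time step of a 3-point averaging stencil.

    Boundary cells are held fixed (an assumption made for this sketch).
    """
    out = list(cells)
    for i in range(1, len(cells) - 1):
        out[i] = (cells[i - 1] + cells[i] + cells[i + 1]) / 3.0
    return out


def stencil_step_lanes(cells, lanes):
    """Same time step, with interior cells grouped into chunks of `lanes`.

    Every update inside a chunk reads only the old grid, so the updates are
    independent; hardware could compute them side by side (spatial parallelism).
    """
    out = list(cells)
    last = len(cells) - 1
    for base in range(1, last, lanes):
        for i in range(base, min(base + lanes, last)):
            out[i] = (cells[i - 1] + cells[i] + cells[i + 1]) / 3.0
    return out


def run_stages(cells, stages, lanes):
    """Chain `stages` time steps; in hardware each step could be its own
    pipeline stage fed by the previous one (temporal parallelism)."""
    for _ in range(stages):
        cells = stencil_step_lanes(cells, lanes)
    return cells


if __name__ == "__main__":
    grid = [0.0, 0.0, 3.0, 0.0, 0.0]
    print(run_stages(grid, stages=1, lanes=2))
```

In this model, `lanes` loosely corresponds to how many cells are fetched and updated per cycle, which is the dimension that memory bandwidth constrains, while `stages` reuses data already on chip.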


Buffer Placement and Sizing for High-Performance Dataflow Circuits

ACM Transactions on Reconfigurable Technology and Systems ◽ 10.1145/3477053 ◽ 2022 ◽ Vol 15 (1) ◽ pp. 1-32

Author(s): Lana Josipović ◽ Shabnam Sheikhha ◽ Andrea Guerrieri ◽ Paolo Ienne ◽ Jordi Cortadella

Keyword(s): Performance Optimization ◽ Optimization Model ◽ High Performance ◽ Control Flow ◽ High Level Synthesis ◽ Software Applications ◽ Marked Graphs ◽ Variable Latency ◽ High Level ◽ Strong Contrast

Commercial high-level synthesis tools typically produce statically scheduled circuits. Yet, effective C-to-circuit conversion of arbitrary software applications calls for dataflow circuits, as they can efficiently handle variable latencies (e.g., caches), unpredictable memory dependencies, and irregular control flow. Dataflow circuits exhibit an unconventional property: registers (usually referred to as “buffers”) can be placed anywhere in the circuit without changing its semantics, in strong contrast to what happens in traditional datapaths. Yet, although functionally irrelevant, this placement has a significant impact on the circuit’s timing and throughput. In this work, we show how to strategically place buffers into a dataflow circuit to optimize its performance. Our approach extracts a set of choice-free critical loops from arbitrary dataflow circuits and relies on the theory of marked graphs to optimize the buffer placement and sizing. Our performance optimization model supports important high-level synthesis features such as pipelined computational units, units with variable latency and throughput, and if-conversion. We demonstrate the performance benefits of our approach on a set of dataflow circuits obtained from imperative code.
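
The marked-graph reasoning mentioned in this abstract rests on a standard result: in a timed marked graph, the throughput of a loop is bounded by its token count divided by its total latency, and the slowest loop bounds the whole graph. The sketch below only evaluates that bound. It is not the paper's optimization model, and the function names and the `(tokens, latency)` loop representation are assumptions made for illustration.

```python
from fractions import Fraction


def loop_throughput(tokens, latency):
    """Upper bound on a loop's throughput in tokens per clock cycle.

    `tokens` is the number of tokens circulating in the loop and `latency`
    is the total number of cycles needed to traverse it once.
    """
    return Fraction(tokens, latency)


def circuit_throughput(loops):
    """The tightest loop determines the bound for the whole marked graph.

    `loops` is a list of (tokens, latency) pairs, one per critical loop.
    """
    return min(loop_throughput(t, l) for t, l in loops)


if __name__ == "__main__":
    loops = [(1, 4), (2, 5)]
    print(circuit_throughput(loops))
    # A buffer that adds one cycle of latency to the first loop, without
    # adding a token, lowers the bound for that loop.
    print(circuit_throughput([(1, 5), (2, 5)]))
```

Under this simplified view, a buffer that breaks a long combinational path helps the clock period but may lengthen a loop, which is why placement has to weigh timing against throughput.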


A methodology for high level synthesis of high performance DSP structures targetting FPGAs

10.1109/icasic.1996.562758 ◽ 2002

Author(s): S. Shehata ◽ B. Haroun ◽ A. Al-Khalili

Keyword(s): High Performance ◽ High Level Synthesis ◽ High Level


On the Design of High Performance HW Accelerator through High-level Synthesis Scheduling Approximations

2020 Design, Automation & Test in Europe Conference & Exhibition (DATE) ◽ 10.23919/date48585.2020.9116358 ◽ 2020

Author(s): Siyuan Xu ◽ Benjamin Carrion Schafer

Keyword(s): High Performance ◽ High Level Synthesis ◽ High Level


Efficient FPGA Implementation of OpenCL High-Performance Computing Applications via High-Level Synthesis

IEEE Access ◽ 10.1109/access.2017.2671881 ◽ 2017 ◽ Vol 5 ◽ pp. 2747-2762 ◽ Cited By ~ 29

Author(s): Fahad Bin Muslim ◽ Liang Ma ◽ Mehdi Roozmeh ◽ Luciano Lavagno

Keyword(s): High Performance Computing ◽ High Performance ◽ FPGA Implementation ◽ High Level Synthesis ◽ High Level ◽ Performance Computing


Architecture Exploration of High-Performance Floating-Point Fused Multiply-Add Units and their Automatic Use in High-Level Synthesis

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum ◽ 10.1109/ipdpsw.2013.106 ◽ 2013 ◽ Cited By ~ 2

Author(s): Bjorn Liebig ◽ Jens Huthmann ◽ Andreas Koch

Keyword(s): High Performance ◽ High Level Synthesis ◽ Floating Point ◽ Architecture Exploration ◽ High Level


Transformations of High-Level Synthesis Codes for High-Performance Computing

IEEE Transactions on Parallel and Distributed Systems ◽ 10.1109/tpds.2020.3039409 ◽ 2021 ◽ Vol 32 (5) ◽ pp. 1014-1029

Author(s): Johannes de Fine Licht ◽ Maciej Besta ◽ Simon Meierhans ◽ Torsten Hoefler

Keyword(s): High Performance Computing ◽ High Performance ◽ High Level Synthesis ◽ High Level ◽ Performance Computing


A Parametrizable High-Level Synthesis Library for Accelerating Neural Networks on FPGAs

Journal of Signal Processing Systems ◽ 10.1007/s11265-021-01651-5 ◽ 2021

Author(s): Lester Kalms ◽ Pedram Amini Rad ◽ Muhammad Ali ◽ Arsany Iskander ◽ Diana Göhringer

Keyword(s): Neural Networks ◽ High Performance ◽ Multimedia Retrieval ◽ High Level Synthesis ◽ Feature Maps ◽ Data Types ◽ Efficient System ◽ Computer Vision Applications ◽ On Chip ◽ High Level

Abstract: In recent years, Convolutional Neural Networks (CNNs) have been incorporated in a large number of applications, including multimedia retrieval and image classification. However, CNN-based algorithms are computationally and resource intensive and therefore difficult to use in embedded systems. FPGA-based accelerators are becoming more and more popular in research and industry due to their flexibility and energy efficiency. However, the available resources and the size of the on-chip memory can limit the performance of an FPGA accelerator for CNNs. This work proposes a High-Level Synthesis (HLS) library for CNN algorithms. It contains seven different streaming-capable CNN functions (plus two conversion functions) for creating large neural networks with deep pipelines. The different functions have many parameter settings (e.g., for resolution, feature maps, data types, kernel size, parallelization, accuracy, etc.), which also enable compile-time optimizations. Our functions are integrated into the HiFlipVX library, which is an open source HLS FPGA library for image processing and object detection. This offers the possibility to implement different types of computer vision applications with one library. Due to the various configuration and parallelization possibilities of the library functions, it is possible to implement a high-performance, scalable and resource-efficient system, as our evaluation of the MobileNets algorithm shows.
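
The idea of chaining streaming-capable functions into a deep pipeline can be approximated in software with generators. The sketch below is a loose analogy only: it does not use or reproduce the HiFlipVX API, and `conv1d_stream`, `relu_stream`, and `pipeline` are hypothetical names. Each stage consumes one sample at a time and keeps only a small window, which is the property that lets such stages be connected without storing whole feature maps.

```python
def conv1d_stream(samples, weights):
    """Yield sliding-window dot products while buffering only len(weights) samples."""
    window = []
    for s in samples:
        window.append(s)
        if len(window) > len(weights):
            window.pop(0)  # drop the oldest sample once the window is full
        if len(window) == len(weights):
            yield sum(w * x for w, x in zip(weights, window))


def relu_stream(samples):
    """Element-wise ReLU applied to a stream, one sample in and one sample out."""
    for s in samples:
        yield s if s > 0 else 0


def pipeline(samples, weights):
    """Connect the two stages; values flow through without a full intermediate buffer."""
    return relu_stream(conv1d_stream(samples, weights))


if __name__ == "__main__":
    print(list(pipeline([1, -2, 3, -4, 5], [1, 1])))
```

In an HLS setting the window size and data type would typically be fixed at compile time so the tool can specialize the hardware; Python resolves them at run time, so that aspect is not modeled here.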
