FPGA-Based Convolutional Neural Network Accelerator with Resource-Optimized Approximate Multiply-Accumulate Unit

Electronics ◽  
2021 ◽  
Vol 10 (22) ◽  
pp. 2859
Author(s):  
Mannhee Cho ◽  
Youngmin Kim

Convolutional neural networks (CNNs) are widely used in modern applications for their versatility and high classification accuracy. Field-programmable gate arrays (FPGAs) are considered to be suitable platforms for CNNs based on their high performance, rapid development, and reconfigurability. Although many studies have proposed methods for implementing high-performance CNN accelerators on FPGAs using optimized data types and algorithm transformations, accelerators can be optimized further by investigating more efficient uses of FPGA resources. In this paper, we propose an FPGA-based CNN accelerator using multiple approximate accumulation units based on a fixed-point data type. We implemented the LeNet-5 CNN architecture, which performs classification of handwritten digits using the MNIST handwritten digit dataset. The proposed accelerator was implemented using a high-level synthesis tool on a Xilinx FPGA. The proposed accelerator applies an optimized fixed-point data type and loop parallelization to improve performance. Approximate operation units are implemented using FPGA logic resources instead of high-precision digital signal processing (DSP) blocks, which are inefficient for low-precision data. Our accelerator model achieves 66% less memory usage and approximately 50% lower network latency than a floating-point design, and its resource utilization is optimized to use 78% fewer DSP blocks than general fixed-point designs.
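As an illustration of the idea, the C++ sketch below shows an approximate fixed-point multiply-accumulate; the Q4.4 format and the truncation-based approximation are assumptions for the example, not the authors' exact unit design.

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative fixed-point format: 8-bit signed values with 4 fractional bits (Q4.4).
// The exact widths and approximation scheme in the paper may differ.
using fix8_t  = int8_t;
using acc16_t = int16_t;

constexpr int FRAC_BITS = 4;

// Exact fixed-point MAC: full 16-bit product, then rescale to the accumulator format.
acc16_t mac_exact(acc16_t acc, fix8_t a, fix8_t b) {
    int16_t prod = static_cast<int16_t>(a) * static_cast<int16_t>(b);
    return acc + (prod >> FRAC_BITS);
}

// Approximate MAC: truncate the low product bits before accumulation, trading a
// small error for a narrower adder that maps to LUT logic instead of a DSP block.
acc16_t mac_approx(acc16_t acc, fix8_t a, fix8_t b) {
    int16_t prod = static_cast<int16_t>(a) * static_cast<int16_t>(b);
    int16_t truncated = prod & ~((1 << FRAC_BITS) - 1);  // drop the low product bits
    return acc + (truncated >> FRAC_BITS);
}

int main() {
    // Dot product of two small vectors with both MAC variants.
    fix8_t x[4] = {16, -8, 24, 4};   // 1.0, -0.5, 1.5, 0.25 in Q4.4
    fix8_t w[4] = {8, 8, -16, 32};   // 0.5, 0.5, -1.0, 2.0 in Q4.4
    acc16_t exact = 0, approx = 0;
    for (int i = 0; i < 4; ++i) {
        exact  = mac_exact(exact, x[i], w[i]);
        approx = mac_approx(approx, x[i], w[i]);
    }
    std::printf("exact=%d approx=%d (Q4.4 accumulator)\n", exact, approx);
    return 0;
}
```

In an HLS flow, a unit of this kind can be mapped to LUT logic rather than a DSP block, which is the resource trade-off the paper targets.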

2010 ◽  
Vol E93-C (3) ◽  
pp. 361-368
Author(s):  
Benjamin CARRION SCHAFER ◽  
Yusuke IGUCHI ◽  
Wataru TAKAHASHI ◽  
Shingo NAGATANI ◽  
Kazutoshi WAKABAYASHI

Electronics ◽  
2021 ◽  
Vol 10 (8) ◽  
pp. 884
Author(s):  
Stefano Rossi ◽  
Enrico Boni

Methods of increasing complexity are currently being proposed for ultrasound (US) echographic signal processing. Graphics Processing Unit (GPU) resources, which allow massive parallel computation, are ideal candidates for these tasks. Many high-performance US instruments, including open scanners like ULA-OP 256, have an architecture based only on Field-Programmable Gate Arrays (FPGAs) and/or Digital Signal Processors (DSPs). This paper proposes the integration of the embedded NVIDIA Jetson Xavier AGX module on board ULA-OP 256. The system architecture was revised to allow the introduction of a new Peripheral Component Interconnect Express (PCIe) communication channel, while maintaining backward compatibility with all other embedded computing resources already on board. Moreover, the Input/Output (I/O) peripherals of the module make the ultrasound system stand-alone, freeing the user from the need for an external controlling PC.


Author(s):  
Monika Dixit ◽  
Smita Shandilya

A modern AC motor drive is a complex, intelligent system that encompasses a wide range of electrotechnical apparatus and a broad scope of electrical engineering skills. A modern AC motor drive consists of four closely interacting main parts: the AC machine, the power electronics, the motor control algorithm, and the control hardware, i.e., the signal electronics. Advances in semiconductors and microelectronics have made the rapid development of AC motor drives possible. Semiconductors used in the switching converters provide the electric energy processing capability, while microcontrollers and digital signal processors provide the data-processing power for complex control algorithms.


Electronics ◽  
2020 ◽  
Vol 9 (3) ◽  
pp. 449
Author(s):  
Mohammad Amir Mansoori ◽  
Mario R. Casu

Principal Component Analysis (PCA) is a technique for dimensionality reduction that is useful in removing redundant information in data for various applications such as Microwave Imaging (MI) and Hyperspectral Imaging (HI). The computational complexity of PCA has made the hardware acceleration of PCA an active research topic in recent years. Although the hardware design flow can be optimized using High Level Synthesis (HLS) tools, efficient high-performance solutions for complex embedded systems still require careful design. In this paper we propose a flexible PCA hardware accelerator on Field-Programmable Gate Arrays (FPGAs) that we designed entirely in HLS. In order to make the internal PCA computations more efficient, a new block-streaming method is also introduced. Several HLS optimization strategies are adopted to create efficient hardware. The flexibility of our design allows us to use it for different FPGA targets, with flexible input data dimensions, and it also lets us easily switch from a more accurate floating-point implementation to a higher-speed fixed-point solution. The results show the efficiency of our design compared to state-of-the-art implementations on GPUs, many-core CPUs, and other FPGA approaches in terms of resource usage, execution time, and power consumption.
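As a rough illustration of the block-streaming idea, the C++ sketch below accumulates the statistics needed for the covariance matrix block by block; the dimensions, block size, and the omission of the eigendecomposition step are assumptions for the example, not the paper's actual design.

```cpp
#include <cstddef>

// Minimal sketch of block-streaming covariance accumulation for PCA.
// Samples arrive in blocks of BLOCK rows; we accumulate sum(x) and sum(x x^T)
// so the covariance matrix (and then the principal components) can be formed
// after the last block. Dimensions and block size are illustrative only.
constexpr std::size_t DIM = 8;     // feature dimension (hypothetical)
constexpr std::size_t BLOCK = 64;  // rows per streamed block (hypothetical)

struct PcaAccumulator {
    double sum[DIM] = {};            // running sum of features
    double cross[DIM][DIM] = {};     // running sum of outer products
    std::size_t count = 0;           // total samples seen

    // Consume one block of samples laid out row-major: block[i*DIM + j].
    void consume_block(const double* block, std::size_t rows) {
        for (std::size_t i = 0; i < rows; ++i) {
            const double* x = block + i * DIM;
            for (std::size_t j = 0; j < DIM; ++j) {
                sum[j] += x[j];
                for (std::size_t k = 0; k < DIM; ++k)
                    cross[j][k] += x[j] * x[k];
            }
        }
        count += rows;
    }

    // Form the covariance C = E[x x^T] - mean mean^T after streaming ends.
    void covariance(double C[DIM][DIM]) const {
        for (std::size_t j = 0; j < DIM; ++j)
            for (std::size_t k = 0; k < DIM; ++k)
                C[j][k] = cross[j][k] / count - (sum[j] / count) * (sum[k] / count);
    }
};
```

In an HLS implementation, the two inner loops over DIM are natural candidates for unrolling and pipelining, and only the small sum/cross arrays need to stay on-chip between blocks.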


Electronics ◽  
2019 ◽  
Vol 8 (5) ◽  
pp. 482
Author(s):  
Mangi Han ◽  
Youngmin Kim

In this study, we implemented a high-performance multichannel repeater for both FM and Terrestrial Digital Multimedia Broadcasting (DMB) signals using a Field Programmable Gate Array (FPGA). In a system for providing services using wireless communication, radio-shaded areas are inevitably generated by various obstacles. Thus, an electronic device that receives weak or low-level signals and retransmits them at a higher level is crucial. In addition, parallel implementation of digital filters and gain controllers is necessary for a multichannel repeater. When the power level is too low or too high, the repeater must compensate it to ensure a stable signal. However, analog- and software-based repeaters are expensive and difficult to install, and they cannot effectively process multiple channels in parallel. The proposed system exploits various digital signal-processing algorithms, which include modulation, demodulation, Cascaded Integrator Comb (CIC) filters, Finite Impulse Response (FIR) filters, Interpolated Second-Order Polynomial (ISOP) filters, and Automatic Gain Controllers (AGCs). The newly proposed AGC is more efficient than others in terms of computational cost and throughput. The designed digital circuit was implemented using Verilog HDL and tested using a Xilinx Kintex 7 device. As a result, the proposed repeater can simultaneously handle 40 FM channels and 6 DMB channels in parallel. The output power level is always maintained by the AGC.
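For illustration, the following C++ sketch shows a simple feedback AGC of the kind described; the target level, step size, and update rule are hypothetical and not the paper's proposed AGC.

```cpp
#include <cmath>
#include <vector>

// Minimal sketch of a feedback automatic gain controller (AGC): the gain is
// nudged each sample so the output magnitude tracks a target level.
// Target level and step size are illustrative, not the paper's values.
struct Agc {
    float gain = 1.0f;
    float target = 0.5f;   // desired output magnitude
    float step = 1e-3f;    // adaptation rate

    float process(float in) {
        float out = gain * in;
        // Increase the gain when the output is below target, decrease when above.
        gain += step * (target - std::fabs(out));
        if (gain < 0.0f) gain = 0.0f;  // keep the gain non-negative
        return out;
    }
};

// Apply the AGC to one channel; a multichannel repeater would instantiate
// one Agc per FM/DMB channel and run them in parallel in hardware.
std::vector<float> apply_agc(const std::vector<float>& samples) {
    Agc agc;
    std::vector<float> out;
    out.reserve(samples.size());
    for (float s : samples) out.push_back(agc.process(s));
    return out;
}
```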


Author(s):  
Axel G. Braun ◽  
Djones V. Lettnin ◽  
Joachim Gerlach ◽  
Wolfgang Rosenstiel

2022 ◽  
Vol 15 (1) ◽  
pp. 1-21
Author(s):  
Chen Wu ◽  
Mingyu Wang ◽  
Xinyuan Chu ◽  
Kun Wang ◽  
Lei He

Low-precision data representation is important to reduce storage size and memory access for convolutional neural networks (CNNs). Yet, existing methods have two major limitations: (1) requiring re-training to maintain accuracy for deep CNNs and (2) needing 16-bit floating-point or 8-bit fixed-point for a good accuracy. In this article, we propose a low-precision (8-bit) floating-point (LPFP) quantization method for FPGA-based acceleration to overcome the above limitations. Without any re-training, LPFP finds an optimal 8-bit data representation with negligible top-1/top-5 accuracy loss (within 0.5%/0.3% in our experiments, respectively, and significantly better than existing methods for deep CNNs). Furthermore, we implement one 8-bit LPFP multiplication by one 4-bit multiply-adder and one 3-bit adder, and therefore implement four 8-bit LPFP multiplications using one DSP48E1 of the Xilinx Kintex-7 family or one DSP48E2 of the Xilinx Ultrascale/Ultrascale+ family, whereas one DSP can implement only two 8-bit fixed-point multiplications. Experiments on six typical CNNs for inference show that, on average, we improve throughput over existing FPGA accelerators. In particular, for VGG16 and YOLO, compared to six recent FPGA accelerators, we improve average throughput by 3.5× and 27.5× and average throughput per DSP by 4.1× and 5×, respectively.
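To make the decomposition concrete, here is a hedged C++ sketch of an 8-bit LPFP value and its multiplication via a small significand multiply and a 3-bit exponent add; the 1-3-4 sign/exponent/mantissa split is an assumption for the example and may differ from the article's format.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

// Sketch of an 8-bit low-precision floating-point (LPFP) value with a 1-bit sign,
// 3-bit exponent, and 4-bit mantissa (implicit leading 1). This split is an
// assumption that matches the "4-bit multiplier + 3-bit adder" decomposition
// mentioned in the abstract; the article's exact format may differ.
struct Lpfp8 {
    uint8_t sign;      // 1 bit
    uint8_t exponent;  // 3 bits, biased by 3
    uint8_t mantissa;  // 4 bits, fraction of the implicit 1.xxxx significand
};

constexpr int BIAS = 3;

// Multiply two LPFP values: XOR the signs, add the exponents (3-bit adder),
// multiply the 5-bit significands (small integer multiplier), then rescale.
// The wider product is returned as a double here purely for checking.
double lpfp_mul(Lpfp8 a, Lpfp8 b) {
    int sign = a.sign ^ b.sign;
    int exp = (a.exponent - BIAS) + (b.exponent - BIAS);  // exponent addition
    int sig_a = 16 + a.mantissa;                          // 1.mmmm as 1.4 fixed point (x16)
    int sig_b = 16 + b.mantissa;
    int sig = sig_a * sig_b;                              // small-integer multiply
    double value = std::ldexp(static_cast<double>(sig), exp - 8);  // undo both x16 scalings
    return sign ? -value : value;
}

int main() {
    Lpfp8 a{0, 4, 8};   // +1.5 x 2^1 = 3.0
    Lpfp8 b{1, 3, 4};   // -1.25 x 2^0 = -1.25
    std::printf("product = %g (expect -3.75)\n", lpfp_mul(a, b));
    return 0;
}
```

Because the significand multiply and the exponent add are both tiny integer operations, several such multiplications can be packed into one wide DSP slice, which is the kind of packing the article exploits.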


The trend of increasing model size in Deep Neural Network (DNN) algorithms has boosted the performance of visual recognition tasks. These gains in performance have come at the cost of increased computational complexity and memory bandwidth. Recent studies have explored fixed-point implementations of DNN algorithms such as AlexNet and VGG on Field Programmable Gate Arrays (FPGAs) to facilitate their deployment on embedded systems. However, research on DNN object detection algorithms on FPGAs is still lacking. Consequently, we propose an implementation of Tiny-Yolo-v2 on a Cyclone V PCIe FPGA board using the High-Level Synthesis tool Intel FPGA Software Development Kit (SDK) for OpenCL. In this work, a systematic approach is proposed to convert the floating-point Tiny-Yolo-v2 algorithm into 8-bit fixed-point. Our experiments show that the 8-bit fixed-point Tiny-Yolo-v2 significantly reduces hardware consumption with only a 0.3% loss in accuracy. Finally, our implementation achieves a peak performance of 31.34 Giga Operations per Second (GOPS) and a performance density of 0.28 GOPS/DSP, comparable to prior works, at a 120 MHz working frequency.
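As a minimal illustration of float-to-fixed conversion, the C++ sketch below picks a per-tensor fractional-bit count from the largest weight magnitude and rounds to 8-bit values; the actual conversion flow in the work (per-layer analysis, OpenCL kernels) is more involved.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Sketch of a simple float-to-8-bit fixed-point conversion for a weight tensor:
// choose the number of fractional bits so the largest magnitude still fits in a
// signed 8-bit range, then round each weight. Illustrative only.
struct Quantized {
    std::vector<int8_t> data;
    int frac_bits;  // value = data[i] / 2^frac_bits
};

Quantized quantize_int8(const std::vector<float>& w) {
    float max_abs = 0.0f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));

    // Largest frac_bits such that max_abs * 2^frac_bits still fits below 127.
    int frac_bits = 0;
    while (frac_bits < 7 && max_abs * std::ldexp(1.0f, frac_bits + 1) <= 127.0f)
        ++frac_bits;

    Quantized q{{}, frac_bits};
    q.data.reserve(w.size());
    for (float v : w) {
        long r = std::lround(v * std::ldexp(1.0f, frac_bits));
        r = std::clamp(r, -128L, 127L);  // saturate to the int8 range
        q.data.push_back(static_cast<int8_t>(r));
    }
    return q;
}
```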


Author(s):  
Saber Krim ◽  
Mohamed Faouzi Mimouni

The conventional direct torque control (DTC) of induction motors has become one of the most widely used control strategies. This control method is known for its simplicity, fast torque response, and low dependence on machine parameters. Despite these advantages, conventional DTC suffers from several limitations, such as torque ripples, which depend on the hysteresis bandwidth of the torque controller and on the sampling frequency. This chapter aims to improve the performance of conventional DTC while keeping its advantages. The limitations of conventional DTC can be mitigated by increasing the sampling frequency. Nevertheless, operation at a higher sampling frequency is not feasible with software solutions such as the digital signal processor (DSP), due to the serial processing of the implemented algorithm. To overcome the DSP's limitations, the field-programmable gate array (FPGA) can be chosen as an alternative platform for implementing the DTC algorithm with a shorter execution time. In this chapter, the FPGA is chosen thanks to its parallel processing capability.
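For reference, the C++ sketch below shows the decision core of conventional DTC: a two-level flux hysteresis comparator, a three-level torque hysteresis comparator, and a switching-table lookup indexed by the stator-flux sector. The band widths are illustrative, and the table follows the standard layout rather than anything specific to this chapter.

```cpp
#include <cmath>
#include <cstdint>

// Minimal sketch of the decision core of conventional DTC. Band widths are
// illustrative; the switching table follows the standard layout.
struct DtcCore {
    double flux_band = 0.01;    // Wb, hysteresis half-band (illustrative)
    double torque_band = 0.1;   // Nm, hysteresis half-band (illustrative)
    int flux_state = 1;         // 1: increase flux, 0: decrease
    int torque_state = 0;       // +1: increase, 0: hold, -1: decrease

    // Returns an inverter voltage-vector index 0..7 for the current sample.
    int step(double flux_ref, double flux, double torque_ref, double torque,
             double flux_angle_rad) {
        const double kPi = 3.14159265358979323846;

        // Two-level flux hysteresis comparator.
        if (flux < flux_ref - flux_band) flux_state = 1;
        else if (flux > flux_ref + flux_band) flux_state = 0;

        // Three-level torque hysteresis comparator.
        if (torque < torque_ref - torque_band) torque_state = 1;
        else if (torque > torque_ref + torque_band) torque_state = -1;
        else torque_state = 0;

        // Sector 0..5 of the stator-flux vector (60 degrees each).
        int sector = static_cast<int>(std::floor(flux_angle_rad / (kPi / 3.0))) % 6;
        if (sector < 0) sector += 6;

        // Switching table indexed as [flux_state][torque_state+1][sector].
        static const uint8_t table[2][3][6] = {
            {{5, 6, 1, 2, 3, 4}, {0, 7, 0, 7, 0, 7}, {3, 4, 5, 6, 1, 2}},  // flux down
            {{6, 1, 2, 3, 4, 5}, {7, 0, 7, 0, 7, 0}, {2, 3, 4, 5, 6, 1}},  // flux up
        };
        return table[flux_state][torque_state + 1][sector];
    }
};
```

In an FPGA implementation, all of this logic evaluates in parallel every control cycle, which is what allows the much higher sampling frequency the chapter argues for.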

