FPGA-Based Convolutional Neural Network Accelerator with Resource-Optimized Approximate Multiply-Accumulate Unit

Electronics ◽  
2021 ◽  
Vol 10 (22) ◽  
pp. 2859
Author(s):  
Mannhee Cho ◽  
Youngmin Kim

Convolutional neural networks (CNNs) are widely used in modern applications for their versatility and high classification accuracy. Field-programmable gate arrays (FPGAs) are considered to be suitable platforms for CNNs based on their high performance, rapid development, and reconfigurability. Although many studies have proposed methods for implementing high-performance CNN accelerators on FPGAs using optimized data types and algorithm transformations, accelerators can be optimized further by investigating more efficient uses of FPGA resources. In this paper, we propose an FPGA-based CNN accelerator using multiple approximate accumulation units based on a fixed-point data type. We implemented the LeNet-5 CNN architecture, which performs classification of handwritten digits using the MNIST handwritten digit dataset. The proposed accelerator was implemented using a high-level synthesis tool on a Xilinx FPGA. The proposed accelerator applies an optimized fixed-point data type and loop parallelization to improve performance. Approximate operation units are implemented using FPGA logic resources instead of high-precision digital signal processing (DSP) blocks, which are inefficient for low-precision data. Our accelerator model achieves 66% less memory usage and approximately 50% lower network latency than a floating-point design, and its resource utilization is optimized to use 78% fewer DSP blocks than general fixed-point designs.
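As an illustration of the idea, the C++ sketch below shows an approximate fixed-point multiply-accumulate; the Q4.4 format and the truncation-based approximation are assumptions for the example, not the authors' exact unit design.

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative fixed-point format: 8-bit signed values with 4 fractional bits (Q4.4).
// The exact widths and approximation scheme in the paper may differ.
using fix8_t  = int8_t;
using acc16_t = int16_t;

constexpr int FRAC_BITS = 4;

// Exact fixed-point MAC: full 16-bit product, then rescale to the accumulator format.
acc16_t mac_exact(acc16_t acc, fix8_t a, fix8_t b) {
    int16_t prod = static_cast<int16_t>(a) * static_cast<int16_t>(b);
    return acc + (prod >> FRAC_BITS);
}

// Approximate MAC: truncate the low product bits before accumulation, trading a
// small error for a narrower adder that maps to LUT logic instead of a DSP block.
acc16_t mac_approx(acc16_t acc, fix8_t a, fix8_t b) {
    int16_t prod = static_cast<int16_t>(a) * static_cast<int16_t>(b);
    int16_t truncated = prod & ~((1 << FRAC_BITS) - 1);  // drop the low product bits
    return acc + (truncated >> FRAC_BITS);
}

int main() {
    // Dot product of two small vectors with both MAC variants.
    fix8_t x[4] = {16, -8, 24, 4};   // 1.0, -0.5, 1.5, 0.25 in Q4.4
    fix8_t w[4] = {8, 8, -16, 32};   // 0.5, 0.5, -1.0, 2.0 in Q4.4
    acc16_t exact = 0, approx = 0;
    for (int i = 0; i < 4; ++i) {
        exact  = mac_exact(exact, x[i], w[i]);
        approx = mac_approx(approx, x[i], w[i]);
    }
    std::printf("exact=%d approx=%d (Q4.4 accumulator)\n", exact, approx);
    return 0;
}
```

In an HLS flow, a unit of this kind can be mapped to LUT logic rather than a DSP block, which is the resource trade-off the paper targets.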

2010 ◽  
Vol E93-C (3) ◽  
pp. 361-368
Author(s):  
Benjamin CARRION SCHAFER ◽  
Yusuke IGUCHI ◽  
Wataru TAKAHASHI ◽  
Shingo NAGATANI ◽  
Kazutoshi WAKABAYASHI

Electronics ◽  
2021 ◽  
Vol 10 (8) ◽  
pp. 884
Author(s):  
Stefano Rossi ◽  
Enrico Boni

Methods of increasing complexity are currently being proposed for ultrasound (US) echographic signal processing. Graphics Processing Unit (GPU) resources, which allow massive parallel computation, are ideal candidates for these tasks. Many high-performance US instruments, including open scanners like ULA-OP 256, have an architecture based only on Field-Programmable Gate Arrays (FPGAs) and/or Digital Signal Processors (DSPs). This paper proposes the integration of the embedded NVIDIA Jetson Xavier AGX module on board ULA-OP 256. The system architecture was revised to allow the introduction of a new Peripheral Component Interconnect Express (PCIe) communication channel, while maintaining backward compatibility with all other embedded computing resources already on board. Moreover, the Input/Output (I/O) peripherals of the module make the ultrasound system stand-alone, freeing the user from the need for an external controlling PC.


Author(s):  
Monika Dixit ◽  
Smita Shandilya

A modern AC motor drive is a complex, intelligent system that encompasses a wide range of electrotechnical apparatus and a broad scope of electrical engineering skills. A modern AC motor drive consists of four closely interacting main parts: the AC machine, the power electronics, the motor control algorithm, and the control hardware, i.e., the signal electronics. Advances in semiconductors and microelectronics have made the rapid development of AC motor drives possible. Semiconductors used in the switching converters provide the electric energy processing capability, while microcontrollers and digital signal processors provide the data-processing power for complex control algorithms.


Electronics ◽  
2020 ◽  
Vol 9 (3) ◽  
pp. 449
Author(s):  
Mohammad Amir Mansoori ◽  
Mario R. Casu

Principal Component Analysis (PCA) is a technique for dimensionality reduction that is useful in removing redundant information in data for various applications such as Microwave Imaging (MI) and Hyperspectral Imaging (HI). The computational complexity of PCA has made the hardware acceleration of PCA an active research topic in recent years. Although the hardware design flow can be optimized using High Level Synthesis (HLS) tools, efficient high-performance solutions for complex embedded systems still require careful design. In this paper we propose a flexible PCA hardware accelerator on Field-Programmable Gate Arrays (FPGAs) that we designed entirely in HLS. In order to make the internal PCA computations more efficient, a new block-streaming method is also introduced. Several HLS optimization strategies are adopted to create efficient hardware. The flexibility of our design allows us to use it for different FPGA targets, with flexible input data dimensions, and it also lets us easily switch from a more accurate floating-point implementation to a higher-speed fixed-point solution. The results show the efficiency of our design compared to state-of-the-art implementations on GPUs, many-core CPUs, and other FPGA approaches in terms of resource usage, execution time, and power consumption.
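As a rough illustration of the block-streaming idea, the C++ sketch below accumulates the statistics needed for the covariance matrix block by block; the dimensions, block size, and the omission of the eigendecomposition step are assumptions for the example, not the paper's actual design.

```cpp
#include <cstddef>

// Minimal sketch of block-streaming covariance accumulation for PCA.
// Samples arrive in blocks of BLOCK rows; we accumulate sum(x) and sum(x x^T)
// so the covariance matrix (and then the principal components) can be formed
// after the last block. Dimensions and block size are illustrative only.
constexpr std::size_t DIM = 8;     // feature dimension (hypothetical)
constexpr std::size_t BLOCK = 64;  // rows per streamed block (hypothetical)

struct PcaAccumulator {
    double sum[DIM] = {};            // running sum of features
    double cross[DIM][DIM] = {};     // running sum of outer products
    std::size_t count = 0;           // total samples seen

    // Consume one block of samples laid out row-major: block[i*DIM + j].
    void consume_block(const double* block, std::size_t rows) {
        for (std::size_t i = 0; i < rows; ++i) {
            const double* x = block + i * DIM;
            for (std::size_t j = 0; j < DIM; ++j) {
                sum[j] += x[j];
                for (std::size_t k = 0; k < DIM; ++k)
                    cross[j][k] += x[j] * x[k];
            }
        }
        count += rows;
    }

    // Form the covariance C = E[x x^T] - mean mean^T after streaming ends.
    void covariance(double C[DIM][DIM]) const {
        for (std::size_t j = 0; j < DIM; ++j)
            for (std::size_t k = 0; k < DIM; ++k)
                C[j][k] = cross[j][k] / count - (sum[j] / count) * (sum[k] / count);
    }
};
```

In an HLS implementation, the two inner loops over DIM are natural candidates for unrolling and pipelining, and only the small sum/cross arrays need to stay on-chip between blocks.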


Electronics ◽  
2019 ◽  
Vol 8 (5) ◽  
pp. 482
Author(s):  
Mangi Han ◽  
Youngmin Kim

In this study, we implemented a high-performance multichannel repeater for both FM and Terrestrial Digital Multimedia Broadcasting (DMB) signals using a Field Programmable Gate Array (FPGA). In a system for providing services using wireless communication, radio-shaded areas are inevitably generated by various obstacles. Thus, an electronic device that receives weak or low-level signals and retransmits them at a higher level is crucial. In addition, parallel implementation of digital filters and gain controllers is necessary for a multichannel repeater. When the power level is too low or too high, the repeater must compensate it to ensure a stable signal. However, analog- and software-based repeaters are expensive and difficult to install, and they cannot effectively process multiple channels in parallel. The proposed system exploits various digital signal-processing algorithms, which include modulation, demodulation, Cascaded Integrator Comb (CIC) filters, Finite Impulse Response (FIR) filters, Interpolated Second-Order Polynomial (ISOP) filters, and Automatic Gain Controllers (AGCs). The newly proposed AGC is more efficient than others in terms of computational cost and throughput. The designed digital circuit was implemented using Verilog HDL and tested using a Xilinx Kintex 7 device. As a result, the proposed repeater can simultaneously handle 40 FM channels and 6 DMB channels in parallel. The output power level is always maintained by the AGC.
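For illustration, the following C++ sketch shows a simple feedback AGC of the kind described; the target level, step size, and update rule are hypothetical and not the paper's proposed AGC.

```cpp
#include <cmath>
#include <vector>

// Minimal sketch of a feedback automatic gain controller (AGC): the gain is
// nudged each sample so the output magnitude tracks a target level.
// Target level and step size are illustrative, not the paper's values.
struct Agc {
    float gain = 1.0f;
    float target = 0.5f;   // desired output magnitude
    float step = 1e-3f;    // adaptation rate

    float process(float in) {
        float out = gain * in;
        // Increase the gain when the output is below target, decrease when above.
        gain += step * (target - std::fabs(out));
        if (gain < 0.0f) gain = 0.0f;  // keep the gain non-negative
        return out;
    }
};

// Apply the AGC to one channel; a multichannel repeater would instantiate
// one Agc per FM/DMB channel and run them in parallel in hardware.
std::vector<float> apply_agc(const std::vector<float>& samples) {
    Agc agc;
    std::vector<float> out;
    out.reserve(samples.size());
    for (float s : samples) out.push_back(agc.process(s));
    return out;
}
```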


Author(s):  
Axel G. Braun ◽  
Djones V. Lettnin ◽  
Joachim Gerlach ◽  
Wolfgang Rosenstiel

2022 ◽  
Vol 15 (1) ◽  
pp. 1-21
Author(s):  
Chen Wu ◽  
Mingyu Wang ◽  
Xinyuan Chu ◽  
Kun Wang ◽  
Lei He

Low-precision data representation is important to reduce storage size and memory access for convolutional neural networks (CNNs). Yet, existing methods have two major limitations: (1) requiring re-training to maintain accuracy for deep CNNs and (2) needing 16-bit floating-point or 8-bit fixed-point for a good accuracy. In this article, we propose a low-precision (8-bit) floating-point (LPFP) quantization method for FPGA-based acceleration to overcome the above limitations. Without any re-training, LPFP finds an optimal 8-bit data representation with negligible top-1/top-5 accuracy loss (within 0.5%/0.3% in our experiments, respectively, and significantly better than existing methods for deep CNNs). Furthermore, we implement one 8-bit LPFP multiplication by one 4-bit multiply-adder and one 3-bit adder, and therefore implement four 8-bit LPFP multiplications using one DSP48E1 of the Xilinx Kintex-7 family or one DSP48E2 of the Xilinx Ultrascale/Ultrascale+ family, whereas one DSP can implement only two 8-bit fixed-point multiplications. Experiments on six typical CNNs for inference show that, on average, we improve throughput over existing FPGA accelerators. In particular, for VGG16 and YOLO, compared to six recent FPGA accelerators, we improve average throughput by 3.5× and 27.5× and average throughput per DSP by 4.1× and 5×, respectively.
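To make the decomposition concrete, here is a hedged C++ sketch of an 8-bit LPFP value and its multiplication via a small significand multiply and a 3-bit exponent add; the 1-3-4 sign/exponent/mantissa split is an assumption for the example and may differ from the article's format.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

// Sketch of an 8-bit low-precision floating-point (LPFP) value with a 1-bit sign,
// 3-bit exponent, and 4-bit mantissa (implicit leading 1). This split is an
// assumption that matches the "4-bit multiplier + 3-bit adder" decomposition
// mentioned in the abstract; the article's exact format may differ.
struct Lpfp8 {
    uint8_t sign;      // 1 bit
    uint8_t exponent;  // 3 bits, biased by 3
    uint8_t mantissa;  // 4 bits, fraction of the implicit 1.xxxx significand
};

constexpr int BIAS = 3;

// Multiply two LPFP values: XOR the signs, add the exponents (3-bit adder),
// multiply the 5-bit significands (small integer multiplier), then rescale.
// The wider product is returned as a double here purely for checking.
double lpfp_mul(Lpfp8 a, Lpfp8 b) {
    int sign = a.sign ^ b.sign;
    int exp = (a.exponent - BIAS) + (b.exponent - BIAS);  // exponent addition
    int sig_a = 16 + a.mantissa;                          // 1.mmmm as 1.4 fixed point (x16)
    int sig_b = 16 + b.mantissa;
    int sig = sig_a * sig_b;                              // small-integer multiply
    double value = std::ldexp(static_cast<double>(sig), exp - 8);  // undo both x16 scalings
    return sign ? -value : value;
}

int main() {
    Lpfp8 a{0, 4, 8};   // +1.5 x 2^1 = 3.0
    Lpfp8 b{1, 3, 4};   // -1.25 x 2^0 = -1.25
    std::printf("product = %g (expect -3.75)\n", lpfp_mul(a, b));
    return 0;
}
```

Because the significand multiply and the exponent add are both tiny integer operations, several such multiplications can be packed into one wide DSP slice, which is the kind of packing the article exploits.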


The trend of increasing model size in Deep Neural Network (DNN) algorithms has boosted the performance of visual recognition tasks. These gains in performance have come at the cost of increased computational complexity and memory bandwidth. Recent studies have explored fixed-point implementations of DNN algorithms such as AlexNet and VGG on Field Programmable Gate Arrays (FPGAs) to facilitate their deployment on embedded systems. However, research on DNN object detection algorithms on FPGAs is still lacking. Consequently, we propose an implementation of Tiny-Yolo-v2 on a Cyclone V PCIe FPGA board using the High-Level Synthesis tool Intel FPGA Software Development Kit (SDK) for OpenCL. In this work, a systematic approach is proposed to convert the floating-point Tiny-Yolo-v2 algorithm into 8-bit fixed-point. Our experiments show that the 8-bit fixed-point Tiny-Yolo-v2 significantly reduces hardware consumption with only a 0.3% loss in accuracy. Finally, our implementation achieves a peak performance of 31.34 Giga Operations per Second (GOPS) and a performance density of 0.28 GOPS/DSP, comparable to prior works, at a 120 MHz working frequency.
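As a minimal illustration of float-to-fixed conversion, the C++ sketch below picks a per-tensor fractional-bit count from the largest weight magnitude and rounds to 8-bit values; the actual conversion flow in the work (per-layer analysis, OpenCL kernels) is more involved.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Sketch of a simple float-to-8-bit fixed-point conversion for a weight tensor:
// choose the number of fractional bits so the largest magnitude still fits in a
// signed 8-bit range, then round each weight. Illustrative only.
struct Quantized {
    std::vector<int8_t> data;
    int frac_bits;  // value = data[i] / 2^frac_bits
};

Quantized quantize_int8(const std::vector<float>& w) {
    float max_abs = 0.0f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));

    // Largest frac_bits such that max_abs * 2^frac_bits still fits below 127.
    int frac_bits = 0;
    while (frac_bits < 7 && max_abs * std::ldexp(1.0f, frac_bits + 1) <= 127.0f)
        ++frac_bits;

    Quantized q{{}, frac_bits};
    q.data.reserve(w.size());
    for (float v : w) {
        long r = std::lround(v * std::ldexp(1.0f, frac_bits));
        r = std::clamp(r, -128L, 127L);  // saturate to the int8 range
        q.data.push_back(static_cast<int8_t>(r));
    }
    return q;
}
```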


Author(s):  
Saber Krim ◽  
Mohamed Faouzi Mimouni

The conventional direct torque control (DTC) of induction motors has become one of the most widely used control strategies. This control method is known for its simplicity, fast torque response, and low dependence on machine parameters. Despite these advantages, conventional DTC suffers from several limitations, such as torque ripples, which depend on the hysteresis bandwidth of the torque controller and on the sampling frequency. This chapter aims to improve the performance of conventional DTC while keeping its advantages. The limitations of conventional DTC can be mitigated by increasing the sampling frequency. Nevertheless, operation at a higher sampling frequency is not feasible with software solutions such as the digital signal processor (DSP), due to the serial processing of the implemented algorithm. To overcome the DSP's limitations, the field-programmable gate array (FPGA) can be chosen as an alternative platform for implementing the DTC algorithm with a shorter execution time. In this chapter, the FPGA is chosen thanks to its parallel processing capability.
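For reference, the C++ sketch below shows the decision core of conventional DTC: a two-level flux hysteresis comparator, a three-level torque hysteresis comparator, and a switching-table lookup indexed by the stator-flux sector. The band widths are illustrative, and the table follows the standard layout rather than anything specific to this chapter.

```cpp
#include <cmath>
#include <cstdint>

// Minimal sketch of the decision core of conventional DTC. Band widths are
// illustrative; the switching table follows the standard layout.
struct DtcCore {
    double flux_band = 0.01;    // Wb, hysteresis half-band (illustrative)
    double torque_band = 0.1;   // Nm, hysteresis half-band (illustrative)
    int flux_state = 1;         // 1: increase flux, 0: decrease
    int torque_state = 0;       // +1: increase, 0: hold, -1: decrease

    // Returns an inverter voltage-vector index 0..7 for the current sample.
    int step(double flux_ref, double flux, double torque_ref, double torque,
             double flux_angle_rad) {
        const double kPi = 3.14159265358979323846;

        // Two-level flux hysteresis comparator.
        if (flux < flux_ref - flux_band) flux_state = 1;
        else if (flux > flux_ref + flux_band) flux_state = 0;

        // Three-level torque hysteresis comparator.
        if (torque < torque_ref - torque_band) torque_state = 1;
        else if (torque > torque_ref + torque_band) torque_state = -1;
        else torque_state = 0;

        // Sector 0..5 of the stator-flux vector (60 degrees each).
        int sector = static_cast<int>(std::floor(flux_angle_rad / (kPi / 3.0))) % 6;
        if (sector < 0) sector += 6;

        // Switching table indexed as [flux_state][torque_state+1][sector].
        static const uint8_t table[2][3][6] = {
            {{5, 6, 1, 2, 3, 4}, {0, 7, 0, 7, 0, 7}, {3, 4, 5, 6, 1, 2}},  // flux down
            {{6, 1, 2, 3, 4, 5}, {7, 0, 7, 0, 7, 0}, {2, 3, 4, 5, 6, 1}},  // flux up
        };
        return table[flux_state][torque_state + 1][sector];
    }
};
```

In an FPGA implementation, all of this logic evaluates in parallel every control cycle, which is what allows the much higher sampling frequency the chapter argues for.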

