Argus CNN Accelerator Based on Kernel Clustering and Resource-Aware Pruning

2021 ◽  
Vol 27 (3) ◽  
pp. 57-70
Author(s):  
Damjan M. Rakanovic ◽  
Vuk Vranjkovic ◽  
Rastislav J. R. Struharik

This paper proposes a two-step Convolutional Neural Network (CNN) pruning algorithm and a resource-efficient Field-Programmable Gate Array (FPGA) CNN accelerator named “Argus”. The proposed pruning algorithm first combines similar kernels into clusters, which are then pruned using the same regular pruning pattern. The algorithm is carefully tailored for FPGAs, considering their resource characteristics. Regular sparsity results in high Multiply-Accumulate (MAC) efficiency and reduces the amount of logic required to balance workloads among different MAC units. As a result, the Argus accelerator requires about 170 Look-Up Tables (LUTs) per Digital Signal Processor (DSP) block. This number is close to the average LUT/DSP ratio for various FPGA families, enabling balanced resource utilization when implementing Argus. Benchmarks conducted on a Xilinx Zynq UltraScale+ Multi-Processor System-on-Chip (MPSoC) indicate that Argus achieves up to 25 times higher frames per second than NullHop, 2 and 2.5 times higher than NEURAghe and Snowflake, respectively, and 2 times higher than NVDLA. Argus shows comparable performance to MIT’s Eyeriss v2 and Caffeine, while requiring up to 3 times less memory bandwidth and utilizing 4 times fewer DSP blocks, respectively. Besides the absolute performance, Argus has at least 1.3 and 2 times better GOP/s/DSP and GOP/s/Block-RAM (BRAM) ratios, while being competitive in terms of GOP/s/LUT, compared to some of the state-of-the-art solutions.
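The two steps named in the abstract (cluster similar kernels, then prune each cluster with one shared regular pattern) can be illustrated with a small sketch. The abstract does not give the similarity measure, the clustering procedure, or the pattern selection rule, so everything below is an assumption made for illustration: a greedy clustering on the cosine similarity of weight magnitudes, and a shared mask that keeps the `keep` positions with the largest summed magnitude. The names `cluster_kernels`, `shared_mask`, and `prune` are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def cluster_kernels(kernels, threshold=0.9):
    """Greedy clustering (an assumption, not the paper's method).

    A kernel joins the first cluster whose first member has a similar
    magnitude profile; otherwise it starts a new cluster.
    Returns a list of clusters, each a list of kernel indices.
    """
    clusters = []
    for idx, kernel in enumerate(kernels):
        mag = [abs(w) for w in kernel]
        for members in clusters:
            rep = [abs(w) for w in kernels[members[0]]]
            if cosine(mag, rep) >= threshold:
                members.append(idx)
                break
        else:
            clusters.append([idx])
    return clusters


def shared_mask(kernels, members, keep):
    """One 0/1 mask per cluster: keep the positions with the largest summed |w|."""
    n = len(kernels[members[0]])
    score = [sum(abs(kernels[m][p]) for m in members) for p in range(n)]
    top = sorted(range(n), key=lambda p: score[p], reverse=True)[:keep]
    mask = [0] * n
    for p in top:
        mask[p] = 1
    return mask


def prune(kernels, threshold=0.9, keep=4):
    """Apply the same mask to every kernel in a cluster."""
    pruned = [list(k) for k in kernels]
    for members in cluster_kernels(kernels, threshold):
        mask = shared_mask(kernels, members, keep)
        for m in members:
            pruned[m] = [w * b for w, b in zip(kernels[m], mask)]
    return pruned
```

Because every kernel in a cluster ends up with zeros in the same positions, MAC units that process kernels from one cluster do the same amount of work, which is the property the abstract credits for the reduced workload-balancing logic.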

2013 ◽  
Vol 760-762 ◽  
pp. 70-75
Author(s):  
Xiao Qing Luo ◽  
Rong Hu ◽  
Bing Hui Zheng

Fiber Bragg grating sensors have become a research focus in sensing technology and are widely used in many applications. This paper proposes a novel fiber Bragg grating sensor analyzer based on an FPGA (Field Programmable Gate Array) and DSP (Digital Signal Processor) platform, in which external parameter changes are converted into wavelength shifts in the fiber Bragg gratings. The system can measure temperature, strain, pressure, displacement, and other quantities in real time. Wavelength demodulation is carried out through key steps including data acquisition, clutter filtering, signal peak detection, Gaussian curve fitting, and weighted wavelength calculation. Moreover, the system is able to achieve fault diagnosis and positioning of the fiber link. Experimental results show that the system offers low power consumption, good linearity, strong robustness, and high precision and resolution in wavelength demodulation. The system also remains stable and reliable after a long test under different conditions.
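Two of the demodulation steps listed above, Gaussian curve fitting and weighted wavelength calculation, can be sketched as follows. The abstract does not describe the actual fitting method, so this is only a minimal illustration under stated assumptions: a three-point fit of a parabola to the logarithm of the samples around the detected peak (exact for a noise-free Gaussian on a uniform grid), and an intensity-weighted centroid over samples above a relative threshold. The function names and the threshold parameter are hypothetical.

```python
import math


def gaussian_peak(wavelengths, intensities):
    """Estimate the peak wavelength from the three samples around the maximum.

    The log of a Gaussian is a parabola, so the vertex of the parabola through
    the three log-intensities gives the center. Assumes a uniform wavelength
    grid and strictly positive intensities near the peak.
    """
    i = max(range(1, len(intensities) - 1), key=lambda k: intensities[k])
    la = math.log(intensities[i - 1])
    lb = math.log(intensities[i])
    lc = math.log(intensities[i + 1])
    step = wavelengths[i + 1] - wavelengths[i]
    offset = 0.5 * (la - lc) / (la - 2.0 * lb + lc)  # in units of grid steps
    return wavelengths[i] + offset * step


def weighted_wavelength(wavelengths, intensities, rel_threshold=0.5):
    """Intensity-weighted mean wavelength of samples above rel_threshold * max."""
    limit = rel_threshold * max(intensities)
    num = 0.0
    den = 0.0
    for w, y in zip(wavelengths, intensities):
        if y >= limit:
            num += w * y
            den += y
    return num / den


# Synthetic reflection spectrum: Gaussian centered at 1550.123 nm, sigma 0.1 nm.
grid = [1549.0 + 0.01 * i for i in range(201)]
spectrum = [math.exp(-((w - 1550.123) ** 2) / (2 * 0.1 ** 2)) for w in grid]
print(gaussian_peak(grid, spectrum), weighted_wavelength(grid, spectrum))
```

On real spectra the samples would first pass through the clutter filtering step, since both estimators are sensitive to noise near the peak.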


2022 ◽  
Vol 15 (2) ◽  
pp. 1-29
Author(s):  
Paolo D'Alberto ◽  
Victor Wu ◽  
Aaron Ng ◽  
Rahul Nimaiyar ◽  
Elliott Delaye ◽  
...  

We present xDNN, an end-to-end system for deep-learning inference based on a family of specialized hardware processors synthesized on Field-Programmable Gate Arrays (FPGAs) for Convolutional Neural Networks (CNNs). We present a design optimized for low latency, high throughput, and high compute efficiency with no batching. The design is scalable and a parametric function of the number of multiply-accumulate units, on-chip memory hierarchy, and numerical precision. The design can be scaled down to produce a processor for embedded devices, replicated to produce more cores for larger devices, or resized to optimize efficiency. On a Xilinx Virtex UltraScale+ VU13P FPGA, we achieve 800 MHz, which is close to the maximum frequency of the Digital Signal Processing blocks, and above 80% efficiency of on-chip compute resources. On top of our processor family, we present a runtime system enabling the execution of different networks for different input sizes (from 224 × 224 to 2048 × 1024). We present a compiler that reads CNNs from native frameworks (i.e., MXNet, Caffe, Keras, and TensorFlow), optimizes them, generates code, and provides performance estimates. The compiler combines quantization information from the native environment and optimizations to feed the runtime with code as efficient as any hardware expert could write. We present tools that partition a CNN into subgraphs to divide the work between CPU cores and FPGAs. Notice that the software will not change when or if the FPGA design becomes an ASIC, making our work vertical and not just a proof-of-concept FPGA project. We show experimental results for accuracy, latency, and power for several networks. In summary, we can achieve up to 4 times higher throughput and 3 times better power efficiency than the GPUs, and up to 20 times higher throughput than the latest CPUs. To our knowledge, we provide solutions faster than any previous FPGA-based solutions and comparable to any other top-of-the-shelf solutions.


Algorithms ◽  
2019 ◽  
Vol 12 (5) ◽  
pp. 112 ◽  
Author(s):  
Yulin Zhao ◽  
Donghui Wang ◽  
Leiou Wang

Convolutional neural networks (CNNs) have achieved great success in image processing. However, the heavy computational burden they impose makes them difficult to use in embedded applications that have limited power consumption and performance. Although there are many fast convolution algorithms that can reduce the computational complexity, they increase the difficulty of practical implementation. To overcome these difficulties, this paper proposes several convolution accelerator designs using fast algorithms. The designs are based on the field programmable gate array (FPGA) and display a better balance between the digital signal processor (DSP) and the logic resource, while also requiring lower power consumption. The implementation results show that the power consumption of the accelerator design based on the Strassen–Winograd algorithm is 21.3% less than that of conventional accelerators.
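To make the idea of a "fast convolution algorithm" concrete, the sketch below shows the well-known one-dimensional Winograd minimal filtering case F(2,3), which produces two outputs of a 3-tap filter with 4 multiplications instead of 6. It is not the paper's accelerator design and it omits the Strassen part entirely; the function names are chosen here for illustration only.

```python
def direct_f23(d, g):
    """Reference: two outputs of a 3-tap filter g over 4 inputs d (6 multiplies)."""
    return [
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    ]


def winograd_f23(d, g):
    """Winograd F(2,3): the same two outputs with 4 data multiplies.

    The filter transform (u0..u3) depends only on the weights, so in an
    accelerator it can be precomputed once per kernel.
    """
    u0 = g[0]
    u1 = (g[0] + g[1] + g[2]) / 2
    u2 = (g[0] - g[1] + g[2]) / 2
    u3 = g[2]

    m1 = (d[0] - d[2]) * u0
    m2 = (d[1] + d[2]) * u1
    m3 = (d[2] - d[1]) * u2
    m4 = (d[1] - d[3]) * u3

    return [m1 + m2 + m3, m2 - m3 - m4]


d = [0.5, -1.0, 2.0, 3.0]
g = [0.25, 0.5, -1.0]
print(direct_f23(d, g), winograd_f23(d, g))
```

The trade is fewer multiplications for more additions, which is one way such algorithms shift work from DSP blocks toward logic resources, at the cost of a more complicated data path.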


2014 ◽  
Vol 556-562 ◽  
pp. 1741-1744
Author(s):  
Jun Deng ◽  
Hua Yong Tan ◽  
Lun Cai Liu ◽  
Lin Tao Liu

This paper presents a novel architecture for a mixed-signal SoC, which integrates a Field Programmable Analog Array (FPAA) into an SoC based on a 32-bit RISC CPU. The FPAA unit can be configured as a filter, comparator, gain amplifier, and so on. The proposed mixed-signal SoC can transform an intermediate frequency (IF) analog signal into a baseband digital signal and perform real-time baseband signal processing. In addition, it can transmit modulated IF signals that are converted from baseband signals by digital up-conversion (DUC). Thanks to its integrated IPs, such as ADC, DAC, DDC and DUC, the proposed mixed-signal SoC is effectively a transceiver on a chip, which offers smaller board area, lower power consumption, and lower system cost for transceiver product development. This design has good potential for wireless communication applications.


2016 ◽  
Vol 14 (2) ◽  
pp. 23
Author(s):  
Tibisay Sánchez ◽  
Alfredo David Redondo ◽  
Andrés Felipe García ◽  
Cristina Gómez ◽  
Leonardo Betancur ◽  
...  

This article presents a literature review and a comparative analysis of hardware implementations of radiogoniometry techniques, also known as Radio Direction Finding (RDF), which make it possible to identify the best option for implementing this functionality in spectrum management activities in developing countries. The implementations covered include classical techniques such as Pseudo-Doppler and advanced high-resolution techniques such as MUSIC. Different hardware alternatives for the implementations are presented, including SDR (Software Defined Radio), FPGA (Field Programmable Gate Array), and DSP (Digital Signal Processor), along with some hybrid configurations in which software and hardware are combined in order to optimize time and money. Additionally, some commercial applications are shown that employ geolocation techniques based on angle-of-arrival information, times of arrival, or other parameters that allow triangulation or trilateration to be performed, as the case may be.


Author(s):  
Saber Krim ◽  
Mohamed Faouzi Mimouni

The conventional direct torque control (DTC) of induction motors has become the most used control strategy. This control method is known for its simplicity, fast torque response, and lack of dependence on machine parameters. Despite these advantages, the conventional DTC suffers from several limitations, such as torque ripples. These ripples depend on the hysteresis bandwidth of the torque and on the sampling frequency. This chapter aims to improve the performance of the conventional DTC while keeping its advantages. The limitations of the conventional DTC can be reduced by increasing the sampling frequency. Nevertheless, operation at a higher sampling frequency is not possible with software solutions, such as the digital signal processor (DSP), due to the serial processing of the implemented algorithm. To overcome the DSP limitations, the field programmable gate array (FPGA) can be chosen as an alternative solution to implement the DTC algorithm with a shorter execution time. In this chapter, the FPGA is chosen thanks to its parallel processing.
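The role of the hysteresis bands can be seen in a sketch of one conventional DTC decision step. The chapter's own implementation is not shown in the abstract, so this follows the textbook switching-table scheme as an assumption: a two-level hysteresis comparator for flux, a three-level one for torque, and a voltage vector chosen relative to the stator-flux sector (1 to 6). All function names are hypothetical, and the zero vector is always returned as 0 rather than alternating between the two zero vectors.

```python
def flux_comparator(error, band, previous):
    """Two-level hysteresis: 1 = increase flux, 0 = decrease flux."""
    if error > band:
        return 1
    if error < -band:
        return 0
    return previous  # inside the band: keep the last decision


def torque_comparator(error, band, previous):
    """Three-level hysteresis: 1 = increase, -1 = decrease, 0 = hold torque."""
    if error > band:
        return 1
    if error < -band:
        return -1
    # Inside the band: release to 0 once the error has crossed zero.
    if (previous == 1 and error < 0) or (previous == -1 and error > 0):
        return 0
    return previous


def select_vector(flux_cmd, torque_cmd, sector):
    """Return the inverter voltage vector index (0 = zero vector, 1..6 active).

    For flux in sector k: V(k+1) raises flux and torque, V(k+2) lowers flux
    and raises torque, V(k-1) raises flux and lowers torque, V(k-2) lowers both.
    """
    if torque_cmd == 0:
        return 0
    if flux_cmd == 1:
        shift = 1 if torque_cmd == 1 else -1
    else:
        shift = 2 if torque_cmd == 1 else -2
    return (sector - 1 + shift) % 6 + 1


# One decision step: flux is too low, torque is too high, flux lies in sector 1.
f = flux_comparator(error=0.02, band=0.01, previous=0)
t = torque_comparator(error=-0.8, band=0.5, previous=0)
print(select_vector(f, t, sector=1))
```

A narrower band or a faster sampling loop makes the comparators switch sooner after the error leaves the band, which is why the ripple depends on both quantities. Each step is a handful of comparisons and a table lookup, which is the kind of logic that maps naturally onto parallel FPGA hardware.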

