xDNN: Inference for Deep Convolutional Neural Networks

2022 ◽  
Vol 15 (2) ◽  
pp. 1-29
Author(s):  
Paolo D'Alberto ◽  
Victor Wu ◽  
Aaron Ng ◽  
Rahul Nimaiyar ◽  
Elliott Delaye ◽  
...  

We present xDNN, an end-to-end system for deep-learning inference based on a family of specialized hardware processors synthesized on Field-Programmable Gate Arrays (FPGAs) and targeting Convolutional Neural Networks (CNNs). We present a design optimized for low latency, high throughput, and high compute efficiency with no batching. The design is scalable and a parametric function of the number of multiply-accumulate units, the on-chip memory hierarchy, and the numerical precision. The design can be scaled down to produce a processor for embedded devices, replicated to produce more cores for larger devices, or resized to optimize efficiency. On a Xilinx Virtex UltraScale+ VU13P FPGA, we achieve 800 MHz, close to the maximum frequency of the Digital Signal Processing blocks, and above 80% efficiency of on-chip compute resources. On top of our processor family, we present a runtime system enabling the execution of different networks for different input sizes (i.e., from 224×224 to 2048×1024). We present a compiler that reads CNNs from native frameworks (i.e., MXNet, Caffe, Keras, and TensorFlow), optimizes them, generates code, and provides performance estimates. The compiler combines quantization information from the native environment with optimizations to feed the runtime with code as efficient as any hardware expert could write. We present tools that partition a CNN into subgraphs to divide the work between CPU cores and FPGAs. Notice that the software will not change when or if the FPGA design becomes an ASIC, making our work vertical and not just a proof-of-concept FPGA project. We show experimental results for accuracy, latency, and power for several networks: in summary, we achieve up to 4 times higher throughput and 3 times better power efficiency than GPUs, and up to 20 times higher throughput than the latest CPUs. To our knowledge, our solutions are faster than any previous FPGA-based solution and comparable to other off-the-shelf solutions.
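To make the partitioning step concrete, below is a minimal Python sketch of how a compiler pass of this kind might split a topologically ordered operator list into FPGA and CPU subgraphs. The operator names and the supported-op set are illustrative assumptions, not xDNN's actual API.

```python
# Hypothetical sketch of the kind of graph partitioning the xDNN tools
# perform: operators the FPGA processor supports are grouped into
# contiguous subgraphs; everything else falls back to the CPU.
# Node names and the supported-op set are illustrative, not xDNN's API.

FPGA_SUPPORTED = {"conv2d", "relu", "maxpool", "batchnorm", "eltwise_add"}

def partition(nodes):
    """Split a topologically ordered op list into (device, subgraph) runs."""
    subgraphs, current, device = [], [], None
    for op in nodes:
        target = "fpga" if op["type"] in FPGA_SUPPORTED else "cpu"
        if target != device and current:
            subgraphs.append((device, current))
            current = []
        device = target
        current.append(op)
    if current:
        subgraphs.append((device, current))
    return subgraphs

net = [{"type": "conv2d"}, {"type": "relu"}, {"type": "maxpool"},
       {"type": "softmax"}]               # softmax stays on the CPU
for device, ops in partition(net):
    print(device, [op["type"] for op in ops])
```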

2021 ◽  
Vol 27 (3) ◽  
pp. 57-70
Author(s):  
Damjan M. Rakanovic ◽  
Vuk Vranjkovic ◽  
Rastislav J. R. Struharik

This paper proposes a two-step Convolutional Neural Network (CNN) pruning algorithm and a resource-efficient Field-Programmable Gate Array (FPGA) CNN accelerator named “Argus”. The proposed pruning algorithm first combines similar kernels into clusters, which are then pruned using the same regular pruning pattern. The algorithm is carefully tailored to FPGAs, considering their resource characteristics. Regular sparsity results in high multiply-accumulate (MAC) efficiency, reducing the amount of logic required to balance workloads among different MAC units. As a result, the Argus accelerator requires about 170 look-up tables (LUTs) per Digital Signal Processing (DSP) block. This number is close to the average LUT/DSP ratio across various FPGA families, enabling balanced resource utilization when implementing Argus. Benchmarks conducted on a Xilinx Zynq UltraScale+ Multi-Processor System-on-Chip (MPSoC) indicate that Argus achieves up to 25 times more frames per second than NullHop, 2 and 2.5 times more than NEURAghe and Snowflake, respectively, and 2 times more than NVDLA. Argus shows performance comparable to MIT’s Eyeriss v2 and Caffeine, while requiring up to 3 times less memory bandwidth and 4 times fewer DSP blocks, respectively. Beyond absolute performance, Argus has at least 1.3 and 2 times better GOP/s/DSP and GOP/s/Block-RAM (BRAM) ratios, while being competitive in terms of GOP/s/LUT, compared to some state-of-the-art solutions.
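As a rough illustration of the two-step idea (cluster similar kernels, then apply one shared regular pruning pattern per cluster), here is a hedged Python sketch. The cluster count, pruning ratio, and magnitude-based ranking are assumptions for illustration; the paper's exact similarity metric and pattern selection are not reproduced.

```python
# Sketch of cluster-then-prune: (1) cluster similar convolution kernels,
# (2) derive one regular pruning mask per cluster so every kernel in the
# cluster shares the same sparsity pattern (assumed criteria, not Argus's).
import numpy as np
from sklearn.cluster import KMeans

def cluster_prune(kernels, n_clusters=4, keep_ratio=0.5):
    """kernels: (N, k*k) flattened 2D kernels. Returns a pruned copy."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(kernels)
    pruned = kernels.copy()
    for c in range(n_clusters):
        members = np.where(labels == c)[0]
        # Rank weight positions by aggregate magnitude across the cluster,
        # then keep the same top positions in every member kernel.
        importance = np.abs(kernels[members]).sum(axis=0)
        keep = int(keep_ratio * kernels.shape[1])
        mask = np.zeros(kernels.shape[1])
        mask[np.argsort(importance)[-keep:]] = 1.0
        pruned[members] *= mask          # shared regular pattern
    return pruned

weights = np.random.randn(64, 9)         # 64 kernels of size 3x3
print(np.mean(cluster_prune(weights) == 0))  # ~5/9 of weights zeroed
```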


Author(s):  
David R. Selviah ◽  
Janti Shawash

This chapter celebrates 50 years of first- and higher-order neural network (HONN) implementations in terms of the physical layout and structure of electronic hardware, which offers high-speed, low-latency, compact, low-cost, low-power, mass-produced systems. Low latency is essential for practical applications in real-time control, for which software implementations running on CPUs are too slow. This literature review chapter traces the chronological development of electronic neural networks (ENN), discussing selected papers in detail, from analog electronic hardware through probabilistic RAM, generalizing RAM, custom-silicon Very Large Scale Integrated (VLSI) circuits, neuromorphic chips, and pulse-stream interconnected neurons to Application-Specific Integrated Circuits (ASICs) and Zero Instruction Set Chips (ZISCs). Reconfigurable Field-Programmable Gate Arrays (FPGAs) are given particular attention, as the most recent generation incorporates Digital Signal Processing (DSP) units to provide full System-on-Chip (SoC) capability, offering the possibility of real-time, online, and on-chip learning.


2016 ◽  
Vol 2 (1) ◽  
Author(s):  
Manish Sharma ◽  
Prof. Sonu Lal

Conventional distributed arithmetic (DA) is popular in Field-Programmable Gate Array (FPGA) design; it relies on on-chip ROM to achieve high speed and regularity. In this paper, we describe a high-speed, area-efficient 1-D discrete wavelet transform (DWT) using the 9/7 filter based on the New Efficient Distributed Arithmetic (NEDA) technique. Being an area-efficient architecture free of ROM, multiplication, and subtraction, NEDA can also exploit the redundancy in the adder array, whose entries consist of 0s and 1s. This architecture supports any image pixel size and any level of decomposition. The parallel structure has 100% hardware utilization efficiency.
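The following toy Python example illustrates the distributed-arithmetic idea underlying NEDA: with fixed integer coefficients, an inner product reduces to columns of an adder array (entries of 0 and 1) plus shifts, with no multipliers and no ROM. Plain nonnegative integer taps stand in for the scaled fixed-point 9/7 filter coefficients.

```python
# Toy illustration of NEDA-style distributed arithmetic: an inner
# product computed with only adds and shifts. Coefficients here are
# assumed to be nonnegative integers for clarity; real designs use
# scaled fixed-point 9/7 wavelet taps.

def neda_dot(coeffs, x, bits=8):
    """Inner product sum(c*x) using only adds and shifts."""
    acc = 0
    for j in range(bits):                      # one adder-array column per bit
        col = sum(xk for c, xk in zip(coeffs, x) if (c >> j) & 1)
        acc += col << j                        # weight the column by 2^j
    return acc

c = [3, 5, 7, 2]                               # fixed filter taps
x = [10, -4, 6, 9]                             # input samples
assert neda_dot(c, x) == sum(ci * xi for ci, xi in zip(c, x))
print(neda_dot(c, x))                          # -> 70
```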


2022 ◽  
Vol 15 (2) ◽  
pp. 1-31
Author(s):  
Joel Mandebi Mbongue ◽  
Danielle Tchuinkou Kwadjo ◽  
Alex Shuping ◽  
Christophe Bobda

Cloud deployments now increasingly exploit Field-Programmable Gate Array (FPGA) accelerators as part of virtual instances. While cloud FPGAs are still essentially single-tenant, the growing demand for efficient hardware acceleration paves the way to FPGA multi-tenancy. It then becomes necessary to explore architectures, design flows, and resource management features that aim at exposing multi-tenant FPGAs to cloud users. In this article, we discuss a hardware/software architecture that supports provisioning space-shared FPGAs in Kernel-based Virtual Machine (KVM) clouds. The proposed architecture introduces an FPGA organization that improves hardware consolidation and supports hardware elasticity with minimal data-movement overhead. It also relies on VirtIO to decrease communication latency between the hardware and software domains. Prototyping the proposed architecture on a Virtex UltraScale+ FPGA demonstrated near-specification maximum frequency for on-chip data movement and high throughput in virtual-instance access to hardware accelerators. We demonstrate performance similar to single-tenant deployment while increasing FPGA utilization, which is one of the goals of virtualization. Overall, our FPGA design achieved about 2× higher maximum frequency than the state of the art and a bandwidth of up to 28 Gbps at a 32-bit data width.


2019 ◽  
Vol 9 (22) ◽  
pp. 4810 ◽  
Author(s):  
Yeong-Hyeon Byeon ◽  
Keun-Chang Kwak

We evaluated electrocardiogram (ECG) biometrics using pre-configured models of convolutional neural networks (CNNs) with various time-frequency representations. Biometrics technology records a person’s physical or behavioral characteristics in a digital signal via a sensor and analyzes the signal to identify the person. An ECG signal is obtained by detecting and amplifying a minute electrical signal flowing on the skin, using a noninvasive electrode, when the heart muscle depolarizes at each heartbeat. In biometrics, the ECG is especially advantageous for security applications because the heart is located within the body and moves only while the subject is alive. However, certain body states introduce noise into the measurement, and analyzing signals in the frequency domain is robust to such noise. Because the ECG is noise-sensitive, various studies have applied time-frequency transformations that are robust to noise, and CNNs have achieved good performance in image classification. Studies have applied time-frequency representations of 1D ECG signals to 2D CNNs using transforms such as mel-frequency cepstral coefficients (MFCC), the spectrogram, log spectrogram, mel spectrogram, and scalogram. CNNs come in various pre-configured models such as VGGNet, GoogLeNet, ResNet, and DenseNet. Combinations of these time-frequency representations and pre-configured CNN models had not been investigated. In this study, we employed the PTB (Physikalisch-Technische Bundesanstalt)-ECG and CU (Chosun University)-ECG databases. The MFCC accuracies were 0.45%, 2.60%, 3.90%, and 0.25% higher than the spectrogram, log spectrogram, mel spectrogram, and scalogram accuracies, respectively. The Xception accuracies were 3.91%, 0.84%, and 1.14% higher than the VGGNet-19, ResNet-101, and DenseNet-201 accuracies, respectively.
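A hedged sketch of the kind of pipeline compared in the study: convert a 1D ECG segment to a 2D time-frequency image and feed it to a pre-configured CNN. The sampling rate, segment length, normalization, and model choice below are illustrative assumptions (torchvision >= 0.13 is assumed for the weights argument); the paper's exact preprocessing is not reproduced.

```python
# Illustrative 1D-ECG -> log-spectrogram -> pre-configured CNN pipeline.
# All parameter values here are assumptions, not the paper's settings.
import numpy as np
from scipy.signal import spectrogram
import torch
import torchvision.models as models

fs = 500                                   # assumed sampling rate (Hz)
ecg = np.random.randn(fs * 4)              # stand-in for a 4 s ECG segment

f, t, Sxx = spectrogram(ecg, fs=fs, nperseg=128, noverlap=96)
img = np.log(Sxx + 1e-10)                  # log spectrogram
img = (img - img.min()) / (img.max() - img.min())

# Tile the single channel to 3 and resize to the CNN's expected input.
x = torch.tensor(img, dtype=torch.float32)[None, None]
x = torch.nn.functional.interpolate(x, size=(224, 224), mode="bilinear")
x = x.repeat(1, 3, 1, 1)

model = models.resnet101(weights=None)     # pre-configured architecture
model.eval()
with torch.no_grad():
    logits = model(x)                      # would be fine-tuned on subject IDs
print(logits.shape)                        # -> torch.Size([1, 1000])
```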


2014 ◽  
Vol 556-562 ◽  
pp. 1741-1744
Author(s):  
Jun Deng ◽  
Hua Yong Tan ◽  
Lun Cai Liu ◽  
Lin Tao Liu

This paper presents a novel architecture for a mixed-signal SoC, which integrates a Field-Programmable Analog Array (FPAA) into a SoC based on a 32-bit RISC CPU. The FPAA unit can be configured as a filter, comparator, gain amplifier, and so on. The proposed mixed-signal SoC can convert the intermediate-frequency (IF) analog signal to a baseband digital signal and perform real-time baseband signal processing; in addition, it can transmit modulated IF signals converted from baseband signals by digital up-conversion (DUC). The proposed mixed-signal SoC is effectively a transceiver-on-chip, owing to the internally integrated IPs such as the ADC, DAC, DDC, and DUC, which reduce board area, power consumption, and system cost for transceiver product development. This design has good potential for wireless communication applications.


Electronics ◽  
2019 ◽  
Vol 8 (11) ◽  
pp. 1321 ◽  
Author(s):  
Mário P. Véstias ◽  
Rui Policarpo Duarte ◽  
José T. de Sousa ◽  
Horácio C. Neto

Edge devices are becoming smarter with the integration of machine learning methods, such as deep learning, and are therefore used in many application domains where decisions have to be made without human intervention. Deep learning and, in particular, convolutional neural networks (CNNs) are more accurate than previous algorithms for several computer vision applications, such as security and surveillance, where image and video analysis are required. This better accuracy comes at the cost of high computation and memory requirements. Hence, running CNNs on embedded computing devices is a challenge for both algorithm and hardware designers. New processing devices, dedicated system architectures, and network optimizations have been researched to deal with these computation requirements. In this paper, we improve the inference execution times of CNNs on low-density FPGAs (Field-Programmable Gate Arrays) using fixed-point arithmetic, zero-skipping, and weight pruning. The developed architecture supports the execution of large CNNs on FPGA devices with reduced on-chip memory and computing resources. With the proposed architecture, it is possible to infer an image with AlexNet in 2.9 ms on a ZYNQ7020 and 1.0 ms on a ZYNQ7045 with less than 1% accuracy degradation. These results improve on previous state-of-the-art architectures for CNN inference.
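To illustrate zero-skipping, the sketch below counts how many multiply-accumulates survive when zero activations (from ReLU) and pruned weights are skipped. The sparsity levels are illustrative; the paper's hardware scheduling is not modeled.

```python
# Minimal sketch of zero-skipping: with pruned weights and ReLU
# activations, many operands are zero, so only nonzero pairs issue a
# multiply-accumulate. The MAC counter shows the work saved.
import numpy as np

def zero_skip_dot(acts, weights):
    acc, macs = 0.0, 0
    for a, w in zip(acts, weights):
        if a == 0 or w == 0:               # skip instead of multiplying
            continue
        acc += a * w
        macs += 1
    return acc, macs

acts = np.maximum(np.random.randn(256), 0)                    # ReLU, ~50% zeros
weights = np.random.randn(256) * (np.random.rand(256) > 0.7)  # ~70% pruned
acc, macs = zero_skip_dot(acts, weights)
print(f"{macs}/256 MACs issued")           # roughly 0.5 * 0.3 * 256
```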


2003 ◽  
Vol 13 (04) ◽  
pp. 225-237 ◽  
Author(s):  
Raida Al-Alawi

A hardware architecture of a Probabilistic Logic Neuron (PLN) is presented. The suggested model facilitates on-chip learning of pyramidal Weightless Neural Networks using a modified probabilistic-search reward/penalty training algorithm. The penalization strategy of the training algorithm depends on a predefined parameter called the probabilistic search interval. A complete Weightless Neural Network (WNN) learning system is modeled and implemented on a Xilinx XC4005E Field-Programmable Gate Array (FPGA), making its architecture reconfigurable. Various experiments have been conducted to examine the feasibility and performance of the WNN learning system. Results show that the system has a fast convergence rate and good generalization ability.
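As an illustration of the PLN concept, here is a minimal Python sketch of a single RAM node storing 0, 1, or 'u' (undefined, emitting a random bit), trained with a reward/penalty rule. The fixed penalty probability below is a stand-in for the paper's probabilistic search interval parameter.

```python
# Illustrative Probabilistic Logic Neuron: a RAM node addressed by its
# binary inputs. Cells hold 0, 1, or 'u' (undefined -> random output);
# training latches correct outputs (reward) and probabilistically
# forgets on errors (penalty). Parameters are assumptions, not the
# paper's exact algorithm.
import random

class PLNNode:
    def __init__(self, n_inputs):
        self.mem = ['u'] * (2 ** n_inputs)   # one cell per input pattern

    def fire(self, bits):
        self.addr = sum(b << i for i, b in enumerate(bits))
        cell = self.mem[self.addr]
        self.out = random.randint(0, 1) if cell == 'u' else cell
        return self.out

    def train(self, correct, p_penalty=0.5):
        if correct:
            self.mem[self.addr] = self.out   # reward: latch the output
        elif random.random() < p_penalty:
            self.mem[self.addr] = 'u'        # penalty: forget the cell

node = PLNNode(2)
for _ in range(50):                          # learn 2-input AND
    x = [random.randint(0, 1), random.randint(0, 1)]
    y = node.fire(x)
    node.train(correct=(y == (x[0] & x[1])))
# -> [0, 0, 0, 1] with high probability after training
print([node.fire([a, b]) for a in (0, 1) for b in (0, 1)])
```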


2021 ◽  
Vol 20 (3) ◽  
pp. 1-23
Author(s):  
Vasileios Leon ◽  
George Lentaris ◽  
Evangelos Petrongonas ◽  
Dimitrios Soudris ◽  
Gianluca Furano ◽  
...  

The advent of powerful edge devices and AI algorithms has already revolutionized many terrestrial applications; however, for both technical and historical reasons, the space industry is still striving to adopt these key enabling technologies in new mission concepts. In this context, the current work evaluates a heterogeneous multi-core system-on-chip processor for use on board future spacecraft to support novel, computationally demanding digital signal processing and AI functionalities. Given the importance of low power consumption in satellites, we consider the Intel Movidius Myriad2 system-on-chip and focus on SW development and performance aspects. We design a methodology and framework to accommodate efficient partitioning, mapping, parallelization, code optimization, and tuning of complex algorithms. Furthermore, we propose an avionics architecture combining this commercial off-the-shelf chip with a field-programmable gate array device to facilitate, among other things, interfacing with traditional space instruments via SpaceWire transcoding. We prototype our architecture in the lab targeting vision-based navigation tasks. We implement a representative computer vision pipeline to track the 6D pose of ENVISAT using megapixel images during hypothetical spacecraft proximity operations. Overall, we achieve 2.6 to 4.9 FPS with only 0.8 to 1.1 W on Myriad2, i.e., a 10-fold acceleration versus modern rad-hard processors. Based on the results, we assess the various benefits of utilizing Myriad2 instead of conventional field-programmable gate arrays and CPUs.

