Platform Generation for Edge AI Devices with Custom Hardware Accelerators

This paper presents a platform that automatically generates custom hardware accelerators for convolutional neural networks (CNNs) implemented in field-programmable gate array (FPGA) devices. It includes a user interface for configuring and managing these accelerators. The platform performs all the processes necessary to design and test CNN accelerators, starting from the CNN architecture description at both the layer and internal-parameter levels: it trains the desired architecture with any dataset and generates the configuration files required by the platform. With these files, it can synthesize the register-transfer level (RTL) description and program the customized CNN accelerator into the FPGA device for testing, making it possible to generate custom CNN accelerators quickly and easily. All processes except the CNN architecture description are fully automated and carried out by the platform, which manages third-party software to train the CNN and to synthesize and program the generated RTL. The platform has been tested by implementing some of the state-of-the-art CNN architectures for freely available datasets such as MNIST, CIFAR-10, and STL-10.

Microthreading as a Novel Method for Close Coupling of Custom Hardware Accelerators to SVP Processors

2011 14th Euromicro Conference on Digital System Design ◽

10.1109/dsd.2011.73 ◽

2011 ◽

Author(s):

Jaroslav Sykora ◽

Leos Kafka ◽

Martin Danek ◽

Lukas Kohout

Keyword(s):

Hardware Accelerators ◽

Close Coupling ◽

Custom Hardware ◽

Novel Method

Cycle-time aware architecture synthesis of custom hardware accelerators

Proceedings of the international conference on Compilers, architecture, and synthesis for embedded systems - CASES '02 ◽

10.1145/581630.581637 ◽

2002 ◽

Cited By ~ 8

Author(s):

Mukund Sivaraman ◽

Shail Aditya

Keyword(s):

Cycle Time ◽

Hardware Accelerators ◽

Custom Hardware ◽

Time Aware

Bitwidth cognizant architecture synthesis of custom hardware accelerators

IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems ◽

10.1109/43.959864 ◽

2001 ◽

Vol 20 (11) ◽

pp. 1355-1371 ◽

Cited By ~ 39

Author(s):

S. Mahlke ◽

R. Ravindran ◽

M. Schlansker ◽

R. Schreiber ◽

T. Sherwood

Keyword(s):

Hardware Accelerators ◽

Custom Hardware

Numerical behavior of NVIDIA tensor cores

PeerJ Computer Science ◽

10.7717/peerj-cs.330 ◽

2021 ◽

Vol 7 ◽

pp. e330

Author(s):

Massimiliano Fasi ◽

Nicholas J. Higham ◽

Mantas Mikaitis ◽

Srikara Pranesh

Keyword(s):

Matrix Multiplication ◽

Floating Point ◽

Partial Sums ◽

Hardware Accelerators ◽

Floating Point Arithmetic ◽

Mixed Precision ◽

The Matrix ◽

Custom Hardware ◽

Intermediate Results ◽

Point Arithmetic

We explore the floating-point arithmetic implemented in the NVIDIA tensor cores, which are hardware accelerators for mixed-precision matrix multiplication available on the Volta, Turing, and Ampere microarchitectures. Using Volta V100, Turing T4, and Ampere A100 graphics cards, we determine what precision is used for the intermediate results, whether subnormal numbers are supported, what rounding mode is used, in which order the operations underlying the matrix multiplication are performed, and whether partial sums are normalized. These aspects are not documented by NVIDIA, and we gain insight by running carefully designed numerical experiments on these hardware units. Knowing the answers to these questions is important if one wishes to: (1) accurately simulate NVIDIA tensor cores on conventional hardware; (2) understand the differences between results produced by code that utilizes tensor cores and code that uses only IEEE 754-compliant arithmetic operations; and (3) build custom hardware whose behavior matches that of NVIDIA tensor cores. As part of this work we provide a test suite that can be easily adapted to test newer versions of the NVIDIA tensor cores as well as similar accelerators from other vendors, as they become available. Moreover, we identify a non-monotonicity issue affecting floating-point multi-operand adders if the intermediate results are not normalized after each step.
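
To see why the precision of intermediate results and the order of operations matter, the following minimal sketch accumulates a dot product of binary16 inputs in software. It does not reproduce tensor core behavior; the helper names (`round_fp16`, `round_fp32`, `dot`) and the input values are assumptions chosen for illustration, and rounding is done with Python's `struct` module.

```python
import struct

def round_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def round_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def dot(a, b, round_acc):
    """Dot product of binary16 inputs; the running sum is rounded by round_acc."""
    acc = 0.0
    for x, y in zip(a, b):
        # The product of two binary16 values is exact in double precision.
        prod = round_fp16(x) * round_fp16(y)
        acc = round_acc(acc + prod)
    return acc

# Illustrative inputs: one large term followed by four terms of half an ulp of 1.0.
a = [1.0, 2.0**-11, 2.0**-11, 2.0**-11, 2.0**-11]
b = [1.0] * 5

fp16_sum = dot(a, b, round_fp16)            # small terms are rounded away one by one
fp32_sum = dot(a, b, round_fp32)            # small terms survive in the wider accumulator
fp16_reordered = dot(a[::-1], b, round_fp16)  # small terms first, then the large one

print(fp16_sum, fp32_sum, fp16_reordered)
```

With a binary16 accumulator the four small terms are lost when they are added after the large one, whereas a binary32 accumulator, or a different summation order, retains them. This is the kind of discrepancy that numerical experiments on undocumented hardware are designed to expose.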

A Novel, Simulator for Heterogeneous Cloud Systems that Incorporate Custom Hardware Accelerators

IEEE Transactions on Multi-Scale Computing Systems ◽

10.1109/tmscs.2018.2879601 ◽

2018 ◽

Vol 4 (4) ◽

pp. 565-576 ◽

Cited By ~ 1

Author(s):

Nikolaos Tampouratzis ◽

Ioannis Papaefstathiou

Keyword(s):

Hardware Accelerators ◽

Cloud Systems ◽

Custom Hardware ◽

Heterogeneous Cloud

Hardware accelerators for CAD

Computer-Aided Engineering Journal ◽

10.1049/cae.1989.0020 ◽

1989 ◽

Vol 6 (3) ◽

pp. 77

Author(s):

A.P. Ambler

Keyword(s):

Hardware Accelerators

Leveraging Edge Intelligence for Video Analytics in Smart City Applications

Information ◽

10.3390/info12010014 ◽

2020 ◽

Vol 12 (1) ◽

pp. 14

Author(s):

Aluizio Rocha Neto ◽

Thiago P. Silva ◽

Thais Batista ◽

Flávia C. Delicato ◽

Paulo F. Pires ◽

...

Keyword(s):

Distributed System ◽

Smart City ◽

Large Scale ◽

Facial Recognition ◽

Public Spaces ◽

Hardware Accelerators ◽

Video Streams ◽

Video Analytics ◽

Learning Tasks ◽

Naive Approach

In smart city scenarios, the huge proliferation of monitoring cameras scattered in public spaces has posed many challenges to network and processing infrastructure. A few dozen cameras are enough to saturate the city’s backbone. In addition, most smart city applications require a real-time response from the system in charge of processing such large-scale video streams. Finding a missing person using facial recognition technology is one such application, requiring immediate action at the place where that person is. In this paper, we tackle these challenges by presenting a distributed system for video analytics designed to leverage edge computing capabilities. Our approach encompasses architecture, methods, and algorithms for: (i) dividing the burdensome processing of large-scale video streams into various machine learning tasks; and (ii) deploying these tasks as a workflow of data processing in edge devices equipped with hardware accelerators for neural networks. We also propose the reuse of nodes running tasks shared by multiple applications, e.g., facial recognition, thus improving the system’s processing throughput. Simulations showed that, with our algorithm to distribute the workload, a workflow is processed about 33% faster than with a naive approach.
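
The idea of reusing nodes that run tasks shared by multiple applications can be sketched as follows. This is only an illustration under stated assumptions, not the paper's distribution algorithm: the function `deploy`, the application names, and the task names are all hypothetical.

```python
from typing import Dict, List

def deploy(workflows: Dict[str, List[str]], reuse: bool) -> Dict[str, List[str]]:
    """Map each deployed task instance to the applications it serves.

    With reuse=True, a task needed by several applications is deployed once
    and shared; with reuse=False, every application gets its own instance.
    """
    deployments: Dict[str, List[str]] = {}
    for app, tasks in workflows.items():
        for task in tasks:
            key = task if reuse else f"{app}/{task}"
            deployments.setdefault(key, []).append(app)
    return deployments

# Hypothetical workflows: each application is a chain of machine learning tasks.
workflows = {
    "missing_person": ["decode", "face_detection", "face_recognition"],
    "access_control": ["decode", "face_detection", "face_recognition"],
    "traffic": ["decode", "vehicle_detection"],
}

shared = deploy(workflows, reuse=True)
separate = deploy(workflows, reuse=False)
print(len(shared), "task instances with reuse,", len(separate), "without")
```

In this toy setting, sharing reduces the number of task instances that edge nodes must host, which is one plausible reading of how reuse could improve throughput; the actual scheduling policy is not specified in the abstract.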

Exploring Fault-Energy Trade-offs in Approximate DNN Hardware Accelerators

2021 22nd International Symposium on Quality Electronic Design (ISQED) ◽

10.1109/isqed51717.2021.9424345 ◽

2021 ◽

Author(s):

Ayesha Siddique ◽

Kanad Basu ◽

Khaza Anuarul Hoque

Keyword(s):

Hardware Accelerators ◽

Trade Offs ◽

Energy Trade

Development of Non Expensive Technologies for Precise Maneuvering of Completely Autonomous Unmanned Aerial Vehicles

Sensors ◽

10.3390/s21020391 ◽

2021 ◽

Vol 21 (2) ◽

pp. 391

Author(s):

Luca Bigazzi ◽

Stefano Gherardini ◽

Giacomo Innocenti ◽

Michele Basso

Keyword(s):

Unmanned Aerial Vehicles ◽

Degrees Of Freedom ◽

Vision System ◽

Video Stream ◽

Measurement Unit ◽

Absolute Position ◽

Light Load ◽

Aerial Vehicles ◽

Angular Velocities ◽

Custom Hardware

In this paper, solutions for precise maneuvering of small autonomous (e.g., 350-class) Unmanned Aerial Vehicles (UAVs) are designed and implemented through smart modifications of inexpensive mass-market technologies. The considered class of vehicles can carry only a light load, and, therefore, only a limited number of sensors and computing devices can be installed on board. To make the prototype capable of moving autonomously along a fixed trajectory, a “cyber-pilot”, able to replace the human operator on demand, has been implemented on an embedded control board. This cyber-pilot overrides the commands by means of a custom hardware signal mixer. The drone is able to localize itself in the environment without ground assistance by using a camera possibly mounted on a 3 Degrees Of Freedom (DOF) gimbal suspension. A computer vision system processes the video stream, pointing out land markers with known absolute position and orientation. This information is fused with accelerations from a 6-DOF Inertial Measurement Unit (IMU) to generate a “virtual sensor” which provides refined estimates of the pose, the absolute position, the speed and the angular velocities of the drone. Due to the importance of this sensor, several fusion strategies have been investigated. Finally, the resulting data are fed to a control algorithm featuring a number of uncoupled digital PID controllers, which work to drive the displacement from the desired trajectory to zero.
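
A set of uncoupled digital PID controllers driving a displacement to zero can be sketched as below. This is a minimal illustration, not the controller from the paper: the gains, the sample time, the axis names, and the toy plant (each axis modeled as a pure integrator of its command) are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class PID:
    """Textbook discrete PID controller with a fixed sample time dt."""
    kp: float
    ki: float
    kd: float
    dt: float
    integral: float = 0.0
    prev_error: float = 0.0

    def update(self, error: float) -> float:
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

def simulate(initial: Dict[str, float], steps: int, dt: float = 0.01) -> Dict[str, float]:
    """Run one independent PID loop per axis on a toy integrator plant."""
    # Hypothetical gains; one controller per axis, with no coupling between axes.
    controllers = {axis: PID(kp=4.0, ki=4.0, kd=0.1, dt=dt) for axis in initial}
    displacement = dict(initial)
    for _ in range(steps):
        for axis, pid in controllers.items():
            error = 0.0 - displacement[axis]   # the target displacement is zero
            command = pid.update(error)
            displacement[axis] += command * dt  # assumed plant: dx/dt = command
    return displacement

# Hypothetical initial displacements from the desired trajectory, in meters.
final = simulate({"x": 1.0, "y": -0.5, "z": 0.25}, steps=1000)
print(final)
```

Because the loops are uncoupled, each axis can be tuned and analyzed on its own; on a real vehicle the plant dynamics, the gains, and any anti-windup or derivative filtering would differ from this sketch.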
