Deploying Multi-tenant FPGAs within Linux-based Cloud Infrastructure

2022 · Vol 15 (2) · pp. 1-31
Author(s): Joel Mandebi Mbongue, Danielle Tchuinkou Kwadjo, Alex Shuping, Christophe Bobda

Cloud deployments increasingly exploit Field-Programmable Gate Array (FPGA) accelerators as part of virtual instances. While cloud FPGAs are still essentially single-tenant, the growing demand for efficient hardware acceleration paves the way to FPGA multi-tenancy. It then becomes necessary to explore architectures, design flows, and resource management features that expose multi-tenant FPGAs to cloud users. In this article, we discuss a hardware/software architecture that supports provisioning space-shared FPGAs in Kernel-based Virtual Machine (KVM) clouds. The proposed architecture introduces an FPGA organization that improves hardware consolidation and supports hardware elasticity with minimal data-movement overhead. It also relies on VirtIO to decrease communication latency between the hardware and software domains. A prototype of the proposed architecture on a Virtex UltraScale+ FPGA demonstrated on-chip data movement near the specified maximum frequency and high throughput for virtual-instance access to hardware accelerators. We demonstrate performance similar to single-tenant deployment while increasing FPGA utilization, which is one of the goals of virtualization. Overall, our FPGA design achieved about 2× higher maximum frequency than the state of the art and a bandwidth of up to 28 Gbps on a 32-bit data width.
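As a rough sketch of the resource-management side only (the paper's actual KVM/VirtIO stack is far more involved), the snippet below packs tenant accelerators into hypothetical partially reconfigurable regions using a best-fit policy, the kind of consolidation-oriented placement the abstract alludes to. All names, region counts, and the policy itself are assumptions for illustration, not the authors' implementation.

```python
# Hypothetical sketch: a cloud scheduler placing tenant accelerators into
# space-shared FPGA regions. Names and sizes are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class FpgaCard:
    card_id: str
    regions: int                      # partially reconfigurable slots
    used: set = field(default_factory=set)

    def free_regions(self):
        return self.regions - len(self.used)

class Scheduler:
    def __init__(self, cards):
        self.cards = cards

    def place(self, tenant, slots_needed):
        """Best-fit: pick the card with the fewest free slots that still
        fits, to consolidate tenants and keep cards fully utilized."""
        candidates = [c for c in self.cards if c.free_regions() >= slots_needed]
        if not candidates:
            return None  # elasticity: caller may retry or scale out
        card = min(candidates, key=FpgaCard.free_regions)
        granted = []
        for r in range(card.regions):
            if r not in card.used and len(granted) < slots_needed:
                card.used.add(r)
                granted.append(r)
        return card.card_id, granted

sched = Scheduler([FpgaCard("fpga0", regions=4), FpgaCard("fpga1", regions=4)])
print(sched.place("tenantA", 2))   # e.g. ('fpga0', [0, 1])
```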

Electronics · 2021 · Vol 10 (18) · pp. 2272
Author(s): Safa Bouguezzi, Hana Ben Fredj, Tarek Belabed, Carlos Valderrama, Hassene Faiedh, ...

Convolutional Neural Networks (CNNs) continue to dominate research on hardware acceleration using Field-Programmable Gate Arrays (FPGAs), proving their effectiveness in a variety of computer vision applications such as object segmentation, image classification, face detection, and traffic sign recognition, among others. However, there are numerous constraints on deploying CNNs on FPGAs, including limited on-chip memory, CNN size, and configuration parameters. This paper introduces Ad-MobileNet, an advanced CNN model inspired by the baseline MobileNet model. The proposed model uses an Ad-depth engine, an improved version of the depth-wise separable convolution unit. Moreover, we propose an FPGA-based implementation model that supports the Mish, TanhExp, and ReLU activation functions. Experimental results on the CIFAR-10 dataset show that Ad-MobileNet reaches a classification accuracy of 88.76% while requiring few computational hardware resources. Compared to state-of-the-art methods, the proposed method attains a fairly high recognition rate while using fewer computational hardware resources; indeed, it reduces hardware resources by more than 41% compared to the baseline model.
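For reference, the three activation functions named above have simple closed forms. The NumPy sketch below is a software reference one might check a fixed-point FPGA implementation against, not the paper's hardware design:

```python
# Reference definitions of the three supported activation functions.
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def mish(x):
    # Mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + e^x))
    return x * np.tanh(np.log1p(np.exp(x)))

def tanh_exp(x):
    # TanhExp(x) = x * tanh(e^x)
    return x * np.tanh(np.exp(x))

x = np.linspace(-3, 3, 7)
print(relu(x), mish(x), tanh_exp(x), sep="\n")
```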


2016 · Vol 2 (1)
Author(s): Manish Sharma, Prof. Sonu Lal

Conventional distributed arithmetic (DA) is popular in field-programmable gate array (FPGA) design; it relies on on-chip ROM to achieve high speed and regularity. In this paper, we describe a high-speed, area-efficient 1-D discrete wavelet transform (DWT) using the 9/7 filter and the new efficient distributed arithmetic (NEDA) technique. Being an area-efficient architecture free of ROM, multiplication, and subtraction, NEDA can also expose the redundancy in the adder array, whose entries consist of 0s and 1s. The architecture supports any image pixel width and any level of decomposition. The parallel structure has 100% hardware utilization efficiency.
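The core NEDA idea, as we read it, can be sketched in a few lines: a fixed-coefficient inner product is computed without ROM or multipliers by distributing the coefficient bits into an adder array whose entries are 0s and 1s, then combining the column sums with shift-adds. The coefficients and bit width below are illustrative, not the actual 9/7 filter taps:

```python
# Minimal NEDA-style sketch: y = sum_k c_k * x_k with no multiplications.
COEFF_BITS = 8  # fixed-point precision of the coefficients (assumption)

def neda_inner_product(coeffs, xs):
    # Quantize coefficients to unsigned fixed point for clarity
    # (real 9/7 taps are signed; sign handling is omitted here).
    q = [round(c * (1 << COEFF_BITS)) for c in coeffs]
    total = 0
    for j in range(COEFF_BITS + 1):
        # Column j of the adder array: entries are bit j of each coefficient
        # (0 or 1), so each column is just a sum of selected inputs.
        column = sum(x for qk, x in zip(q, xs) if (qk >> j) & 1)
        total += column << j          # shift-add, no multiplication
    return total / (1 << COEFF_BITS)  # undo the fixed-point scaling

coeffs = [0.25, 0.5, 0.125]
xs = [10, 20, 30]
print(neda_inner_product(coeffs, xs),
      sum(c * x for c, x in zip(coeffs, xs)))  # both print 16.25
```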


2022 · Vol 15 (2) · pp. 1-29
Author(s): Paolo D'Alberto, Victor Wu, Aaron Ng, Rahul Nimaiyar, Elliott Delaye, ...

We present xDNN, an end-to-end system for deep-learning inference based on a family of specialized hardware processors for Convolutional Neural Networks (CNNs), synthesized on Field-Programmable Gate Arrays (FPGAs). We present a design optimized for low latency, high throughput, and high compute efficiency with no batching. The design is scalable and a parametric function of the number of multiply-accumulate units, the on-chip memory hierarchy, and the numerical precision. The design can be scaled down to produce a processor for embedded devices, replicated to produce more cores for larger devices, or resized to optimize efficiency. On the Xilinx Virtex UltraScale+ VU13P FPGA, we achieve an 800 MHz clock, close to the maximum frequency of the Digital Signal Processing blocks, and above 80% efficiency of the on-chip compute resources. On top of our processor family, we present a runtime system enabling the execution of different networks for different input sizes (i.e., from 224×224 to 2048×1024). We present a compiler that reads CNNs from native frameworks (i.e., MXNet, Caffe, Keras, and TensorFlow), optimizes them, generates code, and provides performance estimates. The compiler combines quantization information from the native environment with optimizations to feed the runtime with code as efficient as any hardware expert could write. We present tools that partition a CNN into subgraphs to divide the work between CPU cores and FPGAs. Notice that the software will not change when or if the FPGA design becomes an ASIC, making our work vertical and not just a proof-of-concept FPGA project. We show experimental results for accuracy, latency, and power for several networks. In summary, we achieve up to 4 times higher throughput and 3 times better power efficiency than GPUs, and up to 20 times higher throughput than the latest CPUs. To our knowledge, we provide solutions faster than any previous FPGA-based solution and comparable to other off-the-shelf solutions.
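To illustrate the subgraph-partitioning step mentioned above (a toy reconstruction, not the actual xDNN tooling), the sketch below walks a topologically ordered op list and groups maximal runs of FPGA-supported ops into one subgraph, leaving the rest to the CPU; the supported-op set is an assumption:

```python
# Toy CPU/FPGA work division by maximal runs of supported ops.
FPGA_OPS = {"conv2d", "relu", "maxpool", "eltwise_add"}  # assumed set

def partition(ops):
    """ops: topologically ordered list of (name, op_type) pairs.
    Returns a list of (device, [names]) subgraphs."""
    subgraphs, cur_dev, cur = [], None, []
    for name, op_type in ops:
        dev = "fpga" if op_type in FPGA_OPS else "cpu"
        if dev != cur_dev and cur:      # device change closes a subgraph
            subgraphs.append((cur_dev, cur))
            cur = []
        cur_dev = dev
        cur.append(name)
    if cur:
        subgraphs.append((cur_dev, cur))
    return subgraphs

net = [("c1", "conv2d"), ("r1", "relu"), ("p1", "maxpool"),
       ("sm", "softmax"),               # unsupported -> falls back to CPU
       ("c2", "conv2d"), ("r2", "relu")]
print(partition(net))
# [('fpga', ['c1', 'r1', 'p1']), ('cpu', ['sm']), ('fpga', ['c2', 'r2'])]
```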


Sensors · 2021 · Vol 21 (8) · pp. 2637
Author(s): Ignacio Pérez, Miguel Figueroa

Convolutional neural networks (CNNs) have been extensively employed for image classification due to their high accuracy. However, inference is a computationally intensive process that often requires hardware acceleration to operate in real time. For mobile devices, the power consumption of graphics processors (GPUs) is frequently prohibitive, and field-programmable gate arrays (FPGAs) become a solution for performing inference at high speed. Although previous works have implemented CNN inference on FPGAs, their high utilization of on-chip memory and arithmetic resources complicates their application on resource-constrained edge devices. In this paper, we present a scalable, low-power, low-resource-utilization accelerator architecture for inference on the MobileNet V2 CNN. The architecture uses a heterogeneous system with an embedded processor as the main controller, external memory to store network data, and dedicated hardware implemented on reconfigurable logic with a scalable number of processing elements (PEs). Implemented on an XCZU7EV FPGA running at 200 MHz and using four PEs, the accelerator infers with 87% top-5 accuracy and processes a 224×224-pixel image in 220 ms. It consumes 7.35 W of power and uses less than 30% of the logic and arithmetic resources used by other MobileNet FPGA accelerators.
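The scalability knob can be pictured with a toy model (our reading of the abstract, not the paper's RTL): output channels are distributed over a configurable number of PEs, so per-layer latency follows the most-loaded PE and shrinks roughly linearly as PEs are added:

```python
# Toy model of channel-parallel PE scaling; cycle counts are made up.
def schedule(channels, num_pes):
    """Map each output channel to a PE round-robin; returns per-PE work."""
    work = [[] for _ in range(num_pes)]
    for ch in range(channels):
        work[ch % num_pes].append(ch)
    return work

def est_latency(channels, num_pes, cycles_per_channel):
    # Layer latency is set by the most-loaded PE.
    per_pe = schedule(channels, num_pes)
    return max(len(w) for w in per_pe) * cycles_per_channel

for pes in (1, 2, 4, 8):
    print(pes, "PEs ->",
          est_latency(channels=96, num_pes=pes, cycles_per_channel=1000),
          "cycles")
```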


2008
Author(s): Michael Wirthlin, Brent Nelson, Brad Hutchings, Peter Athanas, Shawn Bohner

Author(s): Daqi Lin, Elena Vasiou, Cem Yuksel, Daniel Kopta, Erik Brunvand

Bounding volume hierarchies (BVHs) are the most widely used acceleration structures for ray tracing due to their high construction and traversal performance. However, the bounding planes shared between parent and child bounding boxes are an inherent storage redundancy that limits further performance improvement because of the memory cost of reading these redundant planes. Dual-split trees can create space partitioning identical to BVHs, but in a compact form that uses less memory by eliminating the redundancies of the BVH representation. This reduction in memory storage and data movement translates to faster ray traversal and better energy efficiency. Yet, the performance benefits of dual-split trees are undermined by the processing required to extract the necessary information from their compact representation, which involves bit manipulations and branching instructions that are inefficient in software. We introduce hardware acceleration for dual-split trees and show that their performance advantages over BVHs are amplified in a hardware ray-tracing context that can exploit such acceleration. We provide details on how the operations needed for decoding dual-split tree nodes can be implemented in hardware, and we present path-tracing experiments on a number of scenes of different sizes. In our experiments, we observed up to a 31% reduction in render time and 38% energy savings using dual-split trees compared to binary BVHs representing identical space partitioning.
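To make the decoding cost concrete, here is a hypothetical example of the kind of bit-level unpacking involved; the field layout is invented for illustration and is not the dual-split node format from the paper:

```python
# Hypothetical compact node: a 32-bit word packing a 2-bit split axis,
# two leaf flags, and two quantized split-plane offsets (14 bits each).
def decode_node(word):
    axis       = word & 0x3             # bits [1:0]
    left_leaf  = (word >> 2) & 0x1      # bit  [2]
    right_leaf = (word >> 3) & 0x1      # bit  [3]
    plane_lo   = (word >> 4)  & 0x3FFF  # bits [17:4]
    plane_hi   = (word >> 18) & 0x3FFF  # bits [31:18]
    return axis, bool(left_leaf), bool(right_leaf), plane_lo, plane_hi

def encode_node(axis, left_leaf, right_leaf, plane_lo, plane_hi):
    return (axis | (left_leaf << 2) | (right_leaf << 3)
            | (plane_lo << 4) | (plane_hi << 18))

w = encode_node(axis=1, left_leaf=0, right_leaf=1, plane_lo=5000, plane_hi=9000)
print(decode_node(w))  # (1, False, True, 5000, 9000)
# In hardware these extractions are free wire selections; in software each
# costs shift/mask instructions plus branches on the leaf flags.
```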


2016 · Vol 78 (7-5)
Author(s): Muhammad Amin Hashim, Yuan Wen Hau, Rabia Baktheri

This paper studies two different electrocardiography (ECG) preprocessing algorithms, namely the Pan and Tompkins (PT) and the Derivative-Based (DB) algorithms, which are crucial for QRS-complex detection in cardiovascular disease detection. Both algorithms are compared in terms of QRS detection accuracy and computation time, implemented as embedded software on a System-on-Chip (SoC) based embedded system prototyped on an Altera DE2-115 Field-Programmable Gate Array (FPGA) platform. Both algorithms are tested with 30 minutes of ECG data from each of the 48 patient records obtained from the MIT-BIH arrhythmia database. Results show that the PT algorithm achieves 98.15% accuracy with 56.33 seconds of computation, while the DB algorithm achieves 96.74% with only 22.14 seconds of processing time. Based on this study, an optimized PT algorithm with an improved Moving Window Integrator (MWI) is proposed to accelerate the computation. Results show that the optimized Moving Window Integrator achieves a 9.5× speedup over the original MWI while retaining its QRS detection accuracy.
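One plausible form of such an MWI optimization (the abstract does not spell out the exact modification) is replacing the per-sample window re-summation with a recursive running sum, turning O(N) work per sample into O(1):

```python
# Naive vs. running-sum Moving Window Integrator over an N-sample window.
def mwi_naive(x, n):
    # Re-sums the whole window at every sample: O(n) per sample.
    return [sum(x[max(0, i - n + 1): i + 1]) / n for i in range(len(x))]

def mwi_running(x, n):
    # Recursive form: add the incoming sample, drop the one leaving
    # the window. O(1) per sample.
    out, acc = [], 0.0
    for i, v in enumerate(x):
        acc += v
        if i >= n:
            acc -= x[i - n]
        out.append(acc / n)
    return out

sig = [0, 1, 4, 9, 4, 1, 0, 0]
assert mwi_naive(sig, 3) == mwi_running(sig, 3)  # identical outputs
```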

