ACM Transactions on Reconfigurable Technology and Systems

Elastic-DF: Scaling Performance of DNN Inference in FPGA Clouds through Automatic Partitioning

ACM Transactions on Reconfigurable Technology and Systems, DOI 10.1145/3470567, 2022, Vol 15 (2), pp. 1-34

Author(s): Tobias Alonso, Lucian Petrica, Mario Ruiz, Jakoba Petri-Koenig, Yaman Umuroglu, ...

Keyword(s): Neural Network, Field Programmable Gate Array, Resource Partitioning, Deep Neural Network, Peer To Peer, Performance Difference, Performance Portability, Field Programmable, Runtime Infrastructure, Roll Out

Customized compute acceleration in the datacenter is key to the wider roll-out of applications based on deep neural network (DNN) inference. In this article, we investigate how to maximize the performance and scalability of field-programmable gate array (FPGA)-based pipeline dataflow DNN inference accelerators (DFAs) automatically on computing infrastructures consisting of multi-die, network-connected FPGAs. We present Elastic-DF, a novel resource partitioning tool and associated FPGA runtime infrastructure that integrates with the DNN compiler FINN. Elastic-DF allocates FPGA resources to DNN layers and layers to individual FPGA dies to maximize the total performance of the multi-FPGA system. In the resulting Elastic-DF mapping, the accelerator may be instantiated multiple times, and each instance may be segmented across multiple FPGAs transparently, whereby the segments communicate peer-to-peer through 100 Gbps Ethernet FPGA infrastructure, without host involvement. When applied to ResNet-50, Elastic-DF provides a 44% latency decrease on Alveo U280. For MobileNetV1 on Alveo U200 and U280, Elastic-DF enables a 78% throughput increase, eliminating the performance difference between these cards and the larger Alveo U250. Elastic-DF also increases operating frequency in all our experiments, on average by over 20%. Elastic-DF therefore increases performance portability between different sizes of FPGA and increases the critical throughput per cost metric of datacenter inference.

The Impact of Terrestrial Radiation on FPGAs in Data Centers

ACM Transactions on Reconfigurable Technology and Systems, DOI 10.1145/3457198, 2022, Vol 15 (2), pp. 1-21

Author(s): Andrew M. Keller, Michael J. Wirthlin

Keyword(s): Data Centers, Fault Injection, Computer Networking, Gate Arrays, Large Numbers, Field Programmable, Programmable Gate Arrays, Radiation Induced, Radiation Testing, The Impact

Field programmable gate arrays (FPGAs) are used in large numbers in data centers around the world. They are used for cloud computing and computer networking. The most common type of FPGA used in data centers is the re-programmable SRAM-based FPGA. These devices offer potential performance and power consumption savings. A single device also carries a small susceptibility to radiation-induced soft errors, which can lead to unexpected behavior. This article examines the impact of terrestrial radiation on FPGAs in data centers. Results from artificial fault injection and accelerated radiation testing on several data-center-like FPGA applications are compared. A new fault injection scheme provides results that are more similar to radiation testing. Silent data corruption (SDC) is the most commonly observed failure mode, followed by FPGA unavailable and host unresponsive. A hypothetical deployment of 100,000 FPGAs in Denver, Colorado, will experience upsets in configuration memory every half-hour on average and SDC failures every 0.5–11 days on average.
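
The fleet-level figures at the end of this abstract can be converted into per-device rates with simple arithmetic. The sketch below assumes that all devices are identical and that events are independent, so the fleet-wide event rate is the per-device rate multiplied by the fleet size. The function names are illustrative and do not come from the article.

```python
HOURS_PER_DAY = 24.0
HOURS_PER_YEAR = 24.0 * 365.25


def per_device_mtbe_hours(fleet_size, fleet_interval_hours):
    """Mean time between events for ONE device, given the fleet-wide interval.

    Assumes identical devices and independent events, so rates add linearly.
    """
    return fleet_size * fleet_interval_hours


def fit_rate(mtbe_hours):
    """Failures in time (FIT): expected events per 10^9 device-hours."""
    return 1e9 / mtbe_hours


fleet = 100_000

# Configuration memory upsets: one every half-hour across the fleet.
upset_mtbe = per_device_mtbe_hours(fleet, 0.5)   # 50,000 hours per device
upset_years = upset_mtbe / HOURS_PER_YEAR        # roughly 5.7 years
upset_fit = fit_rate(upset_mtbe)                 # 20,000 FIT per device

# SDC failures: one every 0.5 to 11 days across the fleet.
sdc_mtbe_low = per_device_mtbe_hours(fleet, 0.5 * HOURS_PER_DAY)
sdc_mtbe_high = per_device_mtbe_hours(fleet, 11 * HOURS_PER_DAY)
sdc_fit_high = fit_rate(sdc_mtbe_low)            # about 833 FIT per device
sdc_fit_low = fit_rate(sdc_mtbe_high)            # about 38 FIT per device
```

Under these assumptions a single device sees a configuration upset only about once every 5.7 years, which is why the effect becomes visible mainly at fleet scale.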

Hardware Acceleration of High-Performance Computational Flow Dynamics Using High-Bandwidth Memory-Enabled Field-Programmable Gate Arrays

ACM Transactions on Reconfigurable Technology and Systems, DOI 10.1145/3476229, 2022, Vol 15 (2), pp. 1-35

Author(s): Tom Hogervorst, Răzvan Nane, Giacomo Marchiori, Tong Dong Qiu, Markus Blatt, ...

Keyword(s): High Performance, Scientific Computing, Hardware Acceleration, Field Programmable Gate Arrays, Gate Arrays, Computational Flow Dynamics, Field Programmable, Programmable Gate Arrays, High Bandwidth, Reservoir Simulator

Scientific computing is at the core of many High-Performance Computing applications, including computational flow dynamics. Because of the utmost importance to simulate increasingly larger computational models, hardware acceleration is receiving increased attention due to its potential to maximize the performance of scientific computing. Field-Programmable Gate Arrays could accelerate scientific computing because of the possibility to fully customize the memory hierarchy important in irregular applications such as iterative linear solvers. In this article, we study the potential of using Field-Programmable Gate Arrays in High-Performance Computing because of the rapid advances in reconfigurable hardware, such as the increase in on-chip memory size, increasing number of logic cells, and the integration of High-Bandwidth Memories on board. To perform this study, we propose a novel Sparse Matrix-Vector multiplication unit and an ILU0 preconditioner tightly integrated with a BiCGStab solver kernel. We integrate the developed preconditioned iterative solver in Flow from the Open Porous Media project, a state-of-the-art open source reservoir simulator. Finally, we perform a thorough evaluation of the FPGA solver kernel in both stand-alone mode and integrated in the reservoir simulator, using the NORNE field, a real-world case reservoir model using a grid with more than 10^5 cells and using three unknowns per cell.
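
To illustrate why sparse matrix-vector multiplication is an irregular, memory-bound kernel, here is a minimal software sketch using the common compressed sparse row (CSR) layout. The abstract does not state which storage format the proposed hardware unit uses, so CSR is an assumption here, and the function name is illustrative.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    values  : nonzero entries of A, row by row
    col_idx : column index of each entry in `values`
    row_ptr : row_ptr[i]..row_ptr[i+1] delimits the entries of row i
    """
    n_rows = len(row_ptr) - 1
    y = [0.0] * n_rows
    for row in range(n_rows):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # The access x[col_idx[k]] is data dependent: this indirect
            # read is what makes the memory access pattern irregular.
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y


# A = [[4, 0, 1],
#      [0, 3, 0],
#      [2, 0, 5]]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]
y = spmv_csr(values, col_idx, row_ptr, [1.0, 2.0, 3.0])  # [7.0, 6.0, 17.0]
```

Each nonzero is used once per multiplication, so performance depends on how quickly the matrix entries and the indirectly addressed vector elements can be fetched rather than on raw arithmetic throughput.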

Low-precision Floating-point Arithmetic for High-performance FPGA-based CNN Acceleration

ACM Transactions on Reconfigurable Technology and Systems, DOI 10.1145/3474597, 2022, Vol 15 (1), pp. 1-21

Author(s): Chen Wu, Mingyu Wang, Xinyuan Chu, Kun Wang, Lei He

Keyword(s): Fixed Point, High Performance, Good Accuracy, Data Representation, Floating Point, Average Throughput, Precision Data, Content Type, Point Arithmetic, Better Than

Low-precision data representation is important to reduce storage size and memory access for convolutional neural networks (CNNs). Yet, existing methods have two major limitations: (1) requiring re-training to maintain accuracy for deep CNNs and (2) needing 16-bit floating-point or 8-bit fixed-point for a good accuracy. In this article, we propose a low-precision (8-bit) floating-point (LPFP) quantization method for FPGA-based acceleration to overcome the above limitations. Without any re-training, LPFP finds an optimal 8-bit data representation with negligible top-1/top-5 accuracy loss (within 0.5%/0.3% in our experiments, respectively, and significantly better than existing methods for deep CNNs). Furthermore, we implement one 8-bit LPFP multiplication by one 4-bit multiply-adder and one 3-bit adder, and therefore implement four 8-bit LPFP multiplications using one DSP48E1 of Xilinx Kintex-7 family or DSP48E2 of Xilinx Ultrascale/Ultrascale+ family, whereas one DSP can implement only two 8-bit fixed-point multiplications. Experiments on six typical CNNs for inference show that on average, we improve throughput over existing FPGA accelerators. Particularly for VGG16 and YOLO, compared to six recent FPGA accelerators, we improve average throughput by 3.5× and 27.5× and average throughput per DSP by 4.1× and 5×, respectively.

A Real-Time Deep Learning OFDM Receiver

ACM Transactions on Reconfigurable Technology and Systems, DOI 10.1145/3494049, 2022, Vol 15 (3), pp. 1-25

Author(s): Stefan Brennsteiner, Tughrul Arslan, John Thompson, Andrew McCormick

Keyword(s): Neural Network, Neural Networks, Real Time, Field Programmable Gate Array, Orthogonal Frequency Division Multiplexing, Frequency Division Multiplexing, Frequency Division, Field Programmable, Gate Array, Fully Connected

Machine learning in the physical layer of communication systems holds the potential to improve performance and simplify design methodology. Many algorithms have been proposed; however, the model complexity is often unfeasible for real-time deployment. The real-time processing capability of these systems has not been proven yet. In this work, we propose a novel, less complex, fully connected neural network to perform channel estimation and signal detection in an orthogonal frequency division multiplexing system. The memory requirement, which is often the bottleneck for fully connected neural networks, is reduced by ≈ 27 times by applying known compression techniques in a three-step training process. Extensive experiments were performed for pruning and quantizing the weights of the neural network detector. Additionally, Huffman encoding was used on the weights to further reduce memory requirements. Based on this approach, we propose the first field-programmable gate array based, real-time capable neural network accelerator, specifically designed to accelerate the orthogonal frequency division multiplexing detector workload. The accelerator is synthesized for a Xilinx RFSoC field-programmable gate array, uses small-batch processing to increase throughput, efficiently supports branching neural networks, and implements superscalar Huffman decoders.
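
The memory saving from Huffman-encoding pruned and quantized weights can be illustrated with a small sketch. The weight values below are hypothetical and chosen only to mimic a pruned layer in which most weights are zero; the function name is illustrative, and the code is a textbook Huffman construction rather than the encoder described in the article.

```python
import heapq
from collections import Counter


def huffman_code_lengths(symbols):
    """Return {symbol: code length in bits} for a Huffman code over `symbols`."""
    counts = Counter(symbols)
    if len(counts) == 1:
        # A single distinct symbol still needs one bit per occurrence.
        return {next(iter(counts)): 1}
    # Heap entries: (frequency, tie-breaker id, symbols in this subtree).
    heap = [(c, i, [s]) for i, (s, c) in enumerate(counts.items())]
    heapq.heapify(heap)
    lengths = dict.fromkeys(counts, 0)
    next_id = len(heap)
    while len(heap) > 1:
        c1, _, s1 = heapq.heappop(heap)
        c2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper.
        for s in s1 + s2:
            lengths[s] += 1
        heapq.heappush(heap, (c1 + c2, next_id, s1 + s2))
        next_id += 1
    return lengths


# Hypothetical 2-bit quantized weights after pruning: mostly zeros.
weights = [0] * 12 + [1] * 2 + [-1] + [2]
lengths = huffman_code_lengths(weights)
encoded_bits = sum(lengths[w] for w in weights)  # 22 bits
fixed_bits = 2 * len(weights)                    # 32 bits at 2 bits per weight
```

The more skewed the weight distribution becomes after pruning, the shorter the code assigned to the dominant value, which is why pruning and Huffman coding combine well.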

Design and Evaluation of a Tunable PUF Architecture for FPGAs

ACM Transactions on Reconfigurable Technology and Systems, DOI 10.1145/3491237, 2022, Vol 15 (1), pp. 1-27

Author(s): Franz-Josef Streit, Paul Krüger, Andreas Becher, Stefan Wildermann, Jürgen Teich

Keyword(s): Signal Propagation, Error Rates, Operating Conditions, Fine Tuning, Worst Case, Physical Unclonable Functions, Reliability Characteristics, Propagation Delays, Temperature Impacts

FPGA-based Physical Unclonable Functions (PUF) have emerged as a viable alternative to permanent key storage by turning effects of inaccuracies during the manufacturing process of a chip into a unique, FPGA-intrinsic secret. However, many fixed PUF designs may suffer from unsatisfactory statistical properties in terms of uniqueness, uniformity, and robustness. Moreover, a PUF signature may alter over time due to aging or changing operating conditions, rendering a PUF insecure in the worst case. As a remedy, we propose CHOICE, a novel class of FPGA-based PUF designs with tunable uniqueness and reliability characteristics. By the use of addressable shift registers available on an FPGA, we show that a wide configuration space for adjusting a device-specific PUF response is obtained without any sacrifice of randomness. In particular, we demonstrate the concept of address-tunable propagation delays, whereby we are able to increase or decrease the probability of obtaining “1”s in the PUF response. Experimental evaluations on a group of six 28 nm Xilinx Artix-7 FPGAs show that CHOICE PUFs provide a large range of configurations to allow a fine-tuning to an average uniqueness between 49% and 51%, while simultaneously achieving bit error rates below 1.5%, thus outperforming state-of-the-art PUF designs. Moreover, with only a single FPGA slice per PUF bit, CHOICE is one of the smallest PUF designs currently available for FPGAs. It is well-known that signal propagation delays are affected by temperature, as the operating temperature impacts the internal currents of transistors that ultimately make up the circuit. We therefore comprehensively investigate how temperature variations affect the PUF response and demonstrate how the tunability of CHOICE enables us to determine configurations that show a high robustness to such variations. As a case study, we present a cryptographic key generation scheme based on CHOICE PUF responses as device-intrinsic secret and investigate the design objectives resource costs, performance, and temperature robustness to show the practicability of our approach.
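
The uniqueness and bit error rate figures quoted in this abstract are usually computed from Hamming distances between PUF responses. The sketch below uses the conventional definitions (mean pairwise inter-device distance for uniqueness, mean distance to a reference response for bit error rate), which is an assumption because the abstract does not define the metrics. The response strings are made up for illustration.

```python
from itertools import combinations


def hamming_fraction(a, b):
    """Fraction of bit positions in which two equal-length responses differ."""
    if len(a) != len(b):
        raise ValueError("responses must have the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def uniqueness(responses):
    """Mean pairwise inter-device Hamming distance; 0.5 is the ideal value."""
    pairs = list(combinations(responses, 2))
    return sum(hamming_fraction(a, b) for a, b in pairs) / len(pairs)


def bit_error_rate(reference, remeasurements):
    """Mean intra-device distance between a reference and repeated readouts."""
    total = sum(hamming_fraction(reference, r) for r in remeasurements)
    return total / len(remeasurements)


# Hypothetical 8-bit responses from three devices to the same challenge.
devices = ["10110010", "01100110", "11010101"]
u = uniqueness(devices)  # (4/8 + 5/8 + 5/8) / 3, about 0.583

# Hypothetical repeated readouts of the first device; one bit flips once.
ber = bit_error_rate("10110010", ["10110010", "10110011"])  # 0.0625
```

A uniqueness near 50% means responses from different devices look statistically independent, while a low bit error rate means one device reproduces its own response reliably.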

Note from the TRETS EiC about the new Journal-first track in FPT’21

ACM Transactions on Reconfigurable Technology and Systems, DOI 10.1145/3501280, 2022, Vol 15 (1), pp. 1-1

Author(s): Deming Chen

When Massive GPU Parallelism Ain’t Enough: A Novel Hardware Architecture of 2D-LSTM Neural Network

ACM Transactions on Reconfigurable Technology and Systems, DOI 10.1145/3469661, 2022, Vol 15 (1), pp. 1-35

Author(s): Vladimir Rybalkin, Jonas Ney, Menbere Kina Tekleyohannes, Norbert Wehn

Keyword(s): Neural Network, Neural Networks, Energy Efficiency, Semantic Segmentation, Hardware Architecture, Handwritten Digit Recognition, Content Type, One Dimensional, Digit Recognition, Handwritten Digit

Multidimensional Long Short-Term Memory (MD-LSTM) neural network is an extension of one-dimensional LSTM for data with more than one dimension. MD-LSTM achieves state-of-the-art results in various applications, including handwritten text recognition, medical imaging, and many more. However, its implementation suffers from the inherently sequential execution that tremendously slows down both training and inference compared to other neural networks. The main goal of the current research is to provide acceleration for inference of MD-LSTM. We advocate that Field-Programmable Gate Array (FPGA) is an alternative platform for deep learning that can offer a solution when the massive parallelism of GPUs does not provide the necessary performance required by the application. In this article, we present the first hardware architecture for MD-LSTM. We conduct a systematic exploration to analyze a tradeoff between precision and accuracy. We use a challenging dataset for semantic segmentation, namely historical document image binarization from the DIBCO 2017 contest and a well-known MNIST dataset for handwritten digit recognition. Based on our new architecture, we implement FPGA-based accelerators that outperform Nvidia Geforce RTX 2080 Ti with respect to throughput by up to 9.9× and Nvidia Jetson AGX Xavier with respect to energy efficiency by up to 48×. Our accelerators achieve higher throughput, energy efficiency, and resource efficiency than FPGA-based implementations of convolutional neural networks (CNNs) for semantic segmentation tasks. For the handwritten digit recognition task, our FPGA implementations provide higher accuracy and can be considered as a solution when accuracy is a priority. Furthermore, they outperform earlier FPGA implementations of one-dimensional LSTMs with respect to throughput, energy efficiency, and resource efficiency.

Buffer Placement and Sizing for High-Performance Dataflow Circuits

ACM Transactions on Reconfigurable Technology and Systems, DOI 10.1145/3477053, 2022, Vol 15 (1), pp. 1-32

Author(s): Lana Josipović, Shabnam Sheikhha, Andrea Guerrieri, Paolo Ienne, Jordi Cortadella

Keyword(s): Performance Optimization, Optimization Model, High Performance, Control Flow, High Level Synthesis, Software Applications, Marked Graphs, Variable Latency, High Level, Strong Contrast

Commercial high-level synthesis tools typically produce statically scheduled circuits. Yet, effective C-to-circuit conversion of arbitrary software applications calls for dataflow circuits, as they can handle efficiently variable latencies (e.g., caches), unpredictable memory dependencies, and irregular control flow. Dataflow circuits exhibit an unconventional property: registers (usually referred to as “buffers”) can be placed anywhere in the circuit without changing its semantics, in strong contrast to what happens in traditional datapaths. Yet, although functionally irrelevant, this placement has a significant impact on the circuit’s timing and throughput. In this work, we show how to strategically place buffers into a dataflow circuit to optimize its performance. Our approach extracts a set of choice-free critical loops from arbitrary dataflow circuits and relies on the theory of marked graphs to optimize the buffer placement and sizing. Our performance optimization model supports important high-level synthesis features such as pipelined computational units, units with variable latency and throughput, and if-conversion. We demonstrate the performance benefits of our approach on a set of dataflow circuits obtained from imperative code.
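
A classical result from marked graph theory helps explain why buffer placement affects throughput: in a strongly connected marked graph, steady-state throughput is bounded by the cycle with the smallest ratio of tokens to total latency. The sketch below evaluates that bound for a hand-written list of cycles. It is only an illustration of the underlying theory, not the optimization model of the article, and the cycle data are hypothetical.

```python
from fractions import Fraction


def cycle_throughput(tokens, latency):
    """Tokens circulating in a cycle divided by the cycle's total latency."""
    return Fraction(tokens, latency)


def graph_throughput(cycles):
    """Throughput bound of the graph: the slowest cycle dominates."""
    return min(cycle_throughput(t, lat) for t, lat in cycles)


# Hypothetical cycles, each given as (tokens, latency in clock cycles).
cycles = [(1, 3), (2, 4), (1, 2)]
bound = graph_throughput(cycles)  # Fraction(1, 3)

# Giving the critical cycle room for a second token raises its ratio to 2/3,
# so the bound moves to the next slowest cycles at 1/2.
improved = graph_throughput([(2, 3), (2, 4), (1, 2)])  # Fraction(1, 2)
```

This is why a register that is functionally irrelevant can still matter: it changes either the latency of a cycle or the number of tokens the cycle can hold.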

Hipernetch: High-Performance FPGA Network Switch

ACM Transactions on Reconfigurable Technology and Systems, DOI 10.1145/3477054, 2022, Vol 15 (1), pp. 1-31

Author(s): Philippos Papaphilippou, Jiuxi Meng, Nadeen Gebara, Wayne Luk

Keyword(s): High Frequency, High Performance, Data Centers, Round Robin, Switching Performance, Wide Range, Crossbar Switches, High Bandwidth, Network Switch, Network Switches

We present Hipernetch, a novel FPGA-based design for performing high-bandwidth network switching. FPGAs have recently become more popular in data centers due to their promising capabilities for a wide range of applications. With the recent surge in transceiver bandwidth, they could further benefit the implementation and refinement of network switches used in data centers. Hipernetch replaces the crossbar with a “combined parallel round-robin arbiter”. Unlike a crossbar, the combined parallel round-robin arbiter is easy to pipeline, and does not require centralised iterative scheduling algorithms that try to fit too many steps in a single or a few FPGA cycles. The result is a network switch implementation on FPGAs operating at a high frequency and with a low port-to-port latency. Our proposed Hipernetch architecture additionally provides a competitive switching performance approaching output-queued crossbar switches. Our implemented Hipernetch designs exhibit a throughput that exceeds 100 Gbps per port for switches of up to 16 ports, reaching an aggregate throughput of around 1.7 Tbps.
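
For readers unfamiliar with round-robin arbitration, the following is a minimal behavioural sketch of a plain round-robin arbiter for one output port. It is a generic textbook arbiter, not the "combined parallel round-robin arbiter" proposed in the article, whose internal structure the abstract does not describe. The function and variable names are illustrative.

```python
def round_robin_grant(requests, last_granted):
    """Pick the next requesting input after `last_granted`, wrapping around.

    requests     : list of booleans, one per input port
    last_granted : index of the input that won the previous arbitration
    Returns the granted input index, or None if nobody is requesting.
    """
    n = len(requests)
    for offset in range(1, n + 1):
        port = (last_granted + offset) % n
        if requests[port]:
            return port
    return None


# Inputs 0, 2 and 3 request the same output for four consecutive cycles.
requests = [True, False, True, True]
last = 0
grants = []
for _ in range(4):
    last = round_robin_grant(requests, last)
    grants.append(last)
# grants == [2, 3, 0, 2]: every requester is served in turn.
```

Because the winner of one cycle gets the lowest priority in the next, no requesting input can be starved. As a separate sanity check on the quoted figures, 1.7 Tbps spread over 16 ports is about 106 Gbps per port, which is consistent with the claim of more than 100 Gbps per port.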

Published By Association For Computing Machinery
