FPGAs in the Cloud

As cloud computing grows, the types of computational hardware available in the cloud are diversifying. Field Programmable Gate Arrays (FPGAs) are a relatively new addition to high-performance computing in the cloud, with the ability to accelerate a range of applications and the flexibility to support different cloud computing models. A new and growing configuration connects the FPGAs directly to the network, which reduces the latency of delivering data to the processing elements. We survey the state of the art in cloud FPGAs and present the Open Cloud Testbed (OCT), a testbed for research and experimentation into new cloud platforms that includes network-attached FPGAs.

A Fast Approach for Generating Efficient Parsers on FPGAs

Symmetry ◽

10.3390/sym11101265 ◽

2019 ◽

Vol 11 (10) ◽

pp. 1265 ◽

Cited By ~ 1

Author(s):

Zhuang Cao ◽

Huiguo Zhang ◽

Junnan Li ◽

Mei Wen ◽

Chunyuan Zhang

Keyword(s):

High Performance ◽

State Of The Art ◽

Field Programmable Gate Arrays ◽

Hardware Architecture ◽

Clock Rate ◽

Gate Arrays ◽

Fast Approach ◽

Field Programmable ◽

Programmable Gate Arrays ◽

Vhdl Code

The development of modern networking requires that high-performance network processors be designed quickly and efficiently to support new protocols. The parser is a critical part of such a processor: it parses the packet headers, which is the precondition for further processing and, finally, forwarding of the packets. This paper presents a framework that transforms P4 programs into VHDL and generates parsers on Field Programmable Gate Arrays (FPGAs). The framework includes a pipeline-based hardware architecture and a back-end compiler. The hardware architecture comprises many components with varying functionality, each of which has its own optimized VHDL template. Using the output of a standard front-end P4 compiler, the proposed compiler extracts the parameters of the components in use and the relationships among them, then maps them to the corresponding templates by configuring, optimizing, and instantiating those templates. Finally, the templates are connected to produce the output VHDL code. An evaluation of a prototype implementation shows that the generated parsers achieve a throughput of nearly 320 Gbps at a clock rate of around 300 MHz. Compared with state-of-the-art solutions, the proposed parsers achieve, on average, twice the throughput while using a similar amount of resources.
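As a rough sanity check on the figures quoted above, the sketch below computes how many bits a parser would have to consume per clock cycle to sustain a given throughput. It assumes the simple model "throughput = data-path width × clock rate", which the abstract does not state, and the function name `bits_per_cycle` is hypothetical.

```python
def bits_per_cycle(throughput_bps: float, clock_hz: float) -> float:
    """Bits consumed per clock cycle under the assumed model
    throughput = data-path width * clock rate."""
    if clock_hz <= 0:
        raise ValueError("clock_hz must be positive")
    return throughput_bps / clock_hz


# Figures quoted in the abstract: nearly 320 Gbps at around 300 MHz.
width = bits_per_cycle(320e9, 300e6)
print(f"{width:.0f} bits per cycle")
```

Under that assumption, the quoted figures imply a data path on the order of 1,000 bits wide. The width actually used by the authors is not given in the abstract.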

Field‐programmable gate arrays and quantum Monte Carlo: Power efficient coprocessing for scalable high‐performance computing

International Journal of Quantum Chemistry ◽

10.1002/qua.25853 ◽

2019 ◽

Vol 119 (12) ◽

pp. e25853 ◽

Cited By ~ 1

Author(s):

Salvatore Cardamone ◽

Jonathan R. R. Kimmitt ◽

Hugh G. A. Burton ◽

Timothy J. Todman ◽

Shurui Li ◽

...

Keyword(s):

Monte Carlo ◽

High Performance Computing ◽

Quantum Monte Carlo ◽

High Performance ◽

Field Programmable Gate Arrays ◽

Power Efficient ◽

Gate Arrays ◽

Field Programmable ◽

Programmable Gate Arrays ◽

Performance Computing

BPR-TCAM—Block and Partial Reconfiguration based TCAM on Xilinx FPGAs

Electronics ◽

10.3390/electronics9020353 ◽

2020 ◽

Vol 9 (2) ◽

pp. 353 ◽

Cited By ~ 1

Author(s):

Anees Ullah ◽

Ali Zahir ◽

Noaman A. Khan ◽

Waleed Ahmad ◽

Alexis Ramos ◽

...

Keyword(s):

Resource Utilization ◽

High Speed ◽

State Of The Art ◽

Field Programmable Gate Arrays ◽

Partial Reconfiguration ◽

Gate Arrays ◽

Content Addressable Memories ◽

Field Programmable ◽

Programmable Gate Arrays

Field Programmable Gate Array (FPGA)-based Ternary Content Addressable Memories (TCAMs) are widely used in high-speed networking applications. However, TCAMs are not present on state-of-the-art FPGAs and must be emulated using SRAM-based memories (i.e., LUTRAMs and Block RAMs), which requires a large amount of FPGA resources. In this paper, we present an efficient methodology to implement FPGA-based TCAMs with significant resource savings compared to existing schemes. The proposed methodology exploits the fracturable nature of Look Up Tables (LUTs) and the built-in slice carry-chains to map two rules and their matching logic simultaneously to a single FPGA slice. Multiple slices can be stacked together to build deeper and wider TCAMs in a modular way. The combination of these techniques results in significant savings in resource utilization compared to existing approaches.
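For readers unfamiliar with ternary matching, the following is a minimal software model of what a TCAM lookup computes. It is not the slice-level mapping proposed in the paper. Each rule is assumed to be a (value, mask) pair in which a mask bit of 1 means "care" and 0 means "don't care", and the lowest-index matching rule is assumed to win; the names `Rule` and `tcam_lookup` are hypothetical.

```python
from typing import List, Optional, Tuple

# Assumed rule encoding: (value, mask); mask bit 1 = care, 0 = don't care.
Rule = Tuple[int, int]


def tcam_lookup(rules: List[Rule], key: int) -> Optional[int]:
    """Return the index of the first rule matching `key`, or None."""
    for index, (value, mask) in enumerate(rules):
        # Compare only the bit positions the rule cares about.
        if (key & mask) == (value & mask):
            return index
    return None


rules = [
    (0b1010, 0b1111),  # exact match on 1010
    (0b1000, 0b1100),  # matches 10xx
    (0b0000, 0b0000),  # matches anything (all bits are don't care)
]
print(tcam_lookup(rules, 0b1011))
```

A hardware TCAM evaluates all rules in parallel and resolves priority in logic, whereas this loop checks them one at a time. The sketch only pins down the matching semantics.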

High-Performance Spectral Element Methods on Field-Programmable Gate Arrays: Implementation, Evaluation, and Future Projection

2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS) ◽

10.1109/ipdps49936.2021.00116 ◽

2021 ◽

Author(s):

Martin Karp ◽

Artur Podobas ◽

Niclas Jansson ◽

Tobias Kenter ◽

Christian Plessl ◽

...

Keyword(s):

High Performance ◽

Field Programmable Gate Arrays ◽

Spectral Element ◽

Future Projection ◽

Spectral Element Methods ◽

Gate Arrays ◽

Implementation Evaluation ◽

Field Programmable ◽

Programmable Gate Arrays

The use of field programmable gate arrays in high performance radar signal processing applications

Record of the IEEE 2000 International Radar Conference [Cat. No. 00CH37037] ◽

10.1109/radar.2000.851946 ◽

2002 ◽

Cited By ~ 7

Author(s):

R. Stapleton ◽

K. Merranko ◽

C. Parris ◽

J. Alter

Keyword(s):

Signal Processing ◽

High Performance ◽

Field Programmable Gate Arrays ◽

Radar Signal ◽

Radar Signal Processing ◽

Gate Arrays ◽

Field Programmable ◽

Programmable Gate Arrays

Design of high performance system-on-chips using Field Programmable Gate Arrays (FPGA)

2014 International Conference on Communication and Signal Processing ◽

10.1109/iccsp.2014.6949862 ◽

2014 ◽

Cited By ~ 3

Author(s):

M. Rubini ◽

C. Rajasekaran

Keyword(s):

High Performance ◽

Field Programmable Gate Arrays ◽

Gate Arrays ◽

Field Programmable ◽

Programmable Gate Arrays ◽

Performance System

A modular 0.8 μm technology for high performance dielectric antifuse field programmable gate arrays

1993 International Symposium on VLSI Technology, Systems, and Applications Proceedings of Technical Papers ◽

10.1109/vtsa.1993.263650 ◽

2002 ◽

Author(s):

J. Chen ◽

S. Eltoukhy ◽

S. Yen ◽

R. Wang ◽

F. Issaq ◽

...

Keyword(s):

High Performance ◽

Field Programmable Gate Arrays ◽

Gate Arrays ◽

Field Programmable ◽

Programmable Gate Arrays

RTN: Reparameterized Ternary Network

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.5912 ◽

2020 ◽

Vol 34 (04) ◽

pp. 4780-4787

Author(s):

Yuhang Li ◽

Xin Dong ◽

Sai Qian Zhang ◽

Haoli Bai ◽

Yuanpeng Chen ◽

...

Keyword(s):

Deep Neural Networks ◽

State Of The Art ◽

Hardware Acceleration ◽

Field Programmable Gate Arrays ◽

Accuracy Improvement ◽

Gate Arrays ◽

Resource Limited ◽

Field Programmable ◽

Programmable Gate Arrays ◽

Speed Up

To deploy deep neural networks on resource-limited devices, quantization has been widely explored. In this work, we study extremely low-bit networks, which offer tremendous speed-up and memory savings through quantized activations and weights. We first raise three overlooked issues in extremely low-bit networks: the squashed range of quantized values, vanishing gradients during backpropagation, and the unexploited hardware acceleration of ternary networks. By reparameterizing the quantized activation and weight vectors with a full-precision scale and offset applied to a fixed ternary vector, we decouple the range and magnitude from the direction to mitigate these problems. The learnable scale and offset automatically adjust the range and sparsity of the quantized values without vanishing gradients. A novel encoding and computation pattern is designed to support efficient computing for our reparameterized ternary network (RTN). Experiments on ResNet-18 for ImageNet demonstrate that the proposed RTN finds a much better trade-off between bitwidth and accuracy and achieves up to 26.76% relative accuracy improvement compared with state-of-the-art methods. Moreover, we validate the proposed computation pattern on Field Programmable Gate Arrays (FPGAs), where it brings 46.46× and 89.17× savings in power and area, respectively, compared with full-precision convolution.
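The idea of expressing a quantized vector as a scale and offset applied to a fixed ternary vector can be illustrated with a small sketch. This is only an illustration of the arithmetic, not the paper's training procedure: the thresholding rule used to obtain the ternary vector, the fixed (rather than learned) scale and offset, and the function names `ternarize` and `reparameterize` are all assumptions.

```python
from typing import List


def ternarize(values: List[float], threshold: float) -> List[int]:
    """Map each value to -1, 0, or +1 using an assumed magnitude threshold."""
    return [0 if abs(v) <= threshold else (1 if v > 0 else -1) for v in values]


def reparameterize(ternary: List[int], scale: float, offset: float) -> List[float]:
    """Reconstruct values as scale * t + offset for each ternary entry t."""
    return [scale * t + offset for t in ternary]


weights = [0.9, -0.05, -0.7, 0.2]
t = ternarize(weights, threshold=0.3)  # direction: entries in {-1, 0, +1}
approx = reparameterize(t, scale=2.0, offset=1.0)
print(t, approx)
```

Here the ternary vector carries only the direction, while the two full-precision scalars set the range and magnitude, so changing the scale or offset changes the representable values without touching the ternary entries.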

Hardware Acceleration of High-Performance Computational Flow Dynamics Using High-Bandwidth Memory-Enabled Field-Programmable Gate Arrays

ACM Transactions on Reconfigurable Technology and Systems ◽

10.1145/3476229 ◽

2022 ◽

Vol 15 (2) ◽

pp. 1-35

Author(s):

Tom Hogervorst ◽

Răzvan Nane ◽

Giacomo Marchiori ◽

Tong Dong Qiu ◽

Markus Blatt ◽

...

Keyword(s):

High Performance ◽

Scientific Computing ◽

Hardware Acceleration ◽

Field Programmable Gate Arrays ◽

Gate Arrays ◽

Computational Flow Dynamics ◽

Field Programmable ◽

Programmable Gate Arrays ◽

High Bandwidth ◽

Reservoir Simulator

Scientific computing is at the core of many High-Performance Computing applications, including computational flow dynamics. Because simulating increasingly large computational models is of the utmost importance, hardware acceleration is receiving increased attention for its potential to maximize the performance of scientific computing. Field-Programmable Gate Arrays could accelerate scientific computing because they make it possible to fully customize the memory hierarchy, which is important in irregular applications such as iterative linear solvers. In this article, we study the potential of Field-Programmable Gate Arrays in High-Performance Computing in light of rapid advances in reconfigurable hardware, such as larger on-chip memories, a growing number of logic cells, and the integration of High-Bandwidth Memory on board. To perform this study, we propose a novel Sparse Matrix-Vector multiplication unit and an ILU0 preconditioner tightly integrated with a BiCGStab solver kernel. We integrate the resulting preconditioned iterative solver into Flow from the Open Porous Media project, a state-of-the-art open-source reservoir simulator. Finally, we thoroughly evaluate the FPGA solver kernel, both stand-alone and integrated in the reservoir simulator, using the NORNE field, a real-world reservoir model with a grid of more than 10^5 cells and three unknowns per cell.
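Sparse matrix-vector multiplication is the kernel that iterative solvers such as BiCGStab call repeatedly, and its irregular memory accesses are what make a custom memory hierarchy attractive. The sketch below shows a plain software version, assuming the widely used compressed sparse row (CSR) layout. The abstract does not say which storage format the FPGA unit uses, and the function name `csr_spmv` is hypothetical.

```python
from typing import List


def csr_spmv(row_ptr: List[int], col_idx: List[int],
             values: List[float], x: List[float]) -> List[float]:
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        # Nonzeros of this row occupy values[row_ptr[row]:row_ptr[row + 1]].
        for k in range(row_ptr[row], row_ptr[row + 1]):
            # x is read at a data-dependent index: the irregular access.
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


# A = [[4, 0, 1],
#      [0, 3, 0],
#      [2, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
values = [4.0, 1.0, 3.0, 2.0, 5.0]
print(csr_spmv(row_ptr, col_idx, values, [1.0, 2.0, 3.0]))
```

The indirect read `x[col_idx[k]]` is the access pattern that defeats conventional caches on large matrices.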
