Enhanced Technology Mapping for FPGAs with Exploration of Cell Configurations

2015 ◽  
Vol 24 (03) ◽  
pp. 1550039 ◽  
Author(s):  
Grace Zgheib ◽  
Iyad Ouaiss

In state-of-the-art field-programmable gate arrays (FPGAs), logic circuits are synthesized and mapped onto clusters of look-up tables. However, arithmetic operations benefit from a dedicated adder and a carry chain that ensures fast carry propagation. This carry chain is a dedicated wire in the FPGA architecture and is therefore independent of the external programmable routing resources. In this paper, we propose a variable-structure Boolean matching technology mapper with embedded decomposition techniques to map nonarithmetic logic functions onto carry chains. Previously synthesized and mapped logic functions are adapted so that their outputs are routed through the dedicated carry chains instead of the external programmable interconnects. Experimental results show a reduction in both the routing resources used and the circuit area when this Boolean matching-based mapper is applied to the Altera Stratix-III FPGA.

2019 ◽  
Vol 28 (08) ◽  
pp. 1950131 ◽  
Author(s):  
Alexander Barkalov ◽  
Larysa Titarenko ◽  
Sławomir Chmielewski

A method is proposed for reducing the number of look-up tables (LUTs) in the logic circuits of field-programmable gate array (FPGA)-based Mealy finite state machines. The method is based on constructing a partition of the set of output variables, which reduces the number of additional variables that encode the collections of output variables (COVs). A formal method is proposed for finding the partition. An example of synthesis is given, along with the results of investigations conducted on standard benchmarks.


Electronics ◽  
2020 ◽  
Vol 9 (2) ◽  
pp. 353 ◽  
Author(s):  
Anees Ullah ◽  
Ali Zahir ◽  
Noaman A. Khan ◽  
Waleed Ahmad ◽  
Alexis Ramos ◽  
...  

Field Programmable Gate Array (FPGA)-based Ternary Content Addressable Memories (TCAMs) are widely used in high-speed networking applications. However, TCAMs are not present on state-of-the-art FPGAs and need to be emulated on SRAM-based memories (i.e., LUTRAMs and Block RAMs), which requires a large amount of FPGA resources. In this paper, we present an efficient methodology to implement FPGA-based TCAMs with significant resource savings compared to existing schemes. The proposed methodology exploits the fracturable nature of Look-Up Tables (LUTs) and the built-in slice carry chains to map two rules and their matching logic simultaneously to a single FPGA slice. Multiple slices can be stacked together to build deeper and wider TCAMs in a modular way. The combination of all these techniques results in significant savings in resource utilization compared to existing approaches.
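
As a rough illustration of what emulating a TCAM on SRAM-based memory involves, the sketch below splits the search key into small chunks, precomputes one table per chunk (playing the role of a LUTRAM), and ANDs the per-chunk match vectors. This is the generic chunked-table scheme, not the slice-level mapping proposed in the paper; the function names, the `(value, care_mask)` rule encoding, and the chunk width are all assumptions made for the example.

```python
def ternary_match(key, value, care_mask):
    """Reference TCAM semantics: a rule matches if key equals value on every cared bit."""
    return ((key ^ value) & care_mask) == 0


def build_chunk_tables(rules, key_bits, chunk_bits):
    """Precompute one table per key chunk (hypothetical stand-in for a LUTRAM).

    tables[c][chunk_value] is a bit vector whose bit i is set when rule i
    matches chunk_value on chunk c.
    """
    assert key_bits % chunk_bits == 0
    chunk_mask = (1 << chunk_bits) - 1
    tables = []
    for shift in range(0, key_bits, chunk_bits):
        table = []
        for chunk_value in range(1 << chunk_bits):
            hits = 0
            for idx, (value, care) in enumerate(rules):
                v = (value >> shift) & chunk_mask
                c = (care >> shift) & chunk_mask
                if ((chunk_value ^ v) & c) == 0:
                    hits |= 1 << idx
            table.append(hits)
        tables.append(table)
    return tables


def lookup(tables, key, chunk_bits, num_rules):
    """AND the per-chunk match vectors; bit i of the result means rule i matched."""
    chunk_mask = (1 << chunk_bits) - 1
    hits = (1 << num_rules) - 1
    for i, table in enumerate(tables):
        hits &= table[(key >> (i * chunk_bits)) & chunk_mask]
    return hits


# Two 8-bit rules: "1010xxxx" and "xxxxxxx1" (x = don't care).
rules = [(0b10100000, 0b11110000), (0b00000001, 0b00000001)]
tables = build_chunk_tables(rules, key_bits=8, chunk_bits=4)
match_vector = lookup(tables, 0b10100001, chunk_bits=4, num_rules=len(rules))
```

The AND across chunks is the step that a hardware design could place on a carry chain; in this software sketch it is simply an integer AND.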


2017 ◽  
Vol 2017 ◽  
pp. 1-9 ◽  
Author(s):  
Ali Asghar ◽  
Muhammad Mazher Iqbal ◽  
Waqar Ahmed ◽  
Mujahid Ali ◽  
Husain Parvez ◽  
...  

In modern SRAM-based Field Programmable Gate Arrays, a Look-Up Table (LUT) is the principal constituent logic element and can realize every possible Boolean function of its inputs. However, this flexibility comes with a heavy area penalty. Part of this area overhead comes from the amount of configuration memory, which rises exponentially as the LUT size increases. In this paper, we first present a detailed analysis of a previously proposed FPGA architecture that allows LUT memory (SRAM) tables to be shared among NPN-equivalent functions, reducing both the area and the number of configuration bits. We then propose several methods to improve the existing architecture. A new clustering technique is proposed that packs NPN-equivalent functions together inside a Configurable Logic Block (CLB). We also make use of a recently proposed high-performance Boolean matching algorithm to perform NPN classification. To enhance area savings further, we evaluate the feasibility of more than two LUTs sharing the same SRAM table. Consequently, this work explores the SRAM table sharing approach for a range of LUT sizes (4–7) and cluster sizes (4–16). Experimental results on the MCNC benchmark circuit set show an overall area reduction of ~7% while maintaining the same critical path delay.
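
To make the notion of NPN equivalence concrete: two functions are NPN-equivalent if one can be turned into the other by negating inputs, permuting inputs, and optionally negating the output. The sketch below classifies small truth tables by brute force, taking the smallest reachable truth table as a canonical representative. It is only an illustration for small input counts and is not the high-performance Boolean matching algorithm the abstract refers to; the function names and the integer truth-table encoding (bit x holds the output for input pattern x) are assumptions of the example.

```python
from itertools import permutations


def transform(tt, n, perm, neg_mask, neg_out):
    """Apply one NPN transform to an n-input truth table stored as an integer."""
    out = 0
    for x in range(1 << n):
        # Build the input pattern y fed to the original function:
        # input i is optionally negated, then routed to position perm[i].
        y = 0
        for i in range(n):
            bit = ((x >> i) & 1) ^ ((neg_mask >> i) & 1)
            y |= bit << perm[i]
        value = ((tt >> y) & 1) ^ neg_out
        out |= value << x
    return out


def npn_canonical(tt, n):
    """Smallest truth table reachable by any NPN transform (brute force)."""
    return min(
        transform(tt, n, perm, neg_mask, neg_out)
        for perm in permutations(range(n))
        for neg_mask in range(1 << n)
        for neg_out in (0, 1)
    )


AND2, OR2, XOR2 = 0b1000, 0b1110, 0b0110

# AND and OR fall in the same class (De Morgan), XOR does not.
same_class = npn_canonical(AND2, 2) == npn_canonical(OR2, 2)
different_class = npn_canonical(AND2, 2) != npn_canonical(XOR2, 2)
```

Functions that share a canonical form are the candidates for sharing one SRAM table, with the negations and permutations absorbed outside the table. The brute-force search grows as n!·2^(n+1) transforms per function, which is why a dedicated matching algorithm is needed for the larger LUT sizes.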


2017 ◽  
Vol 26 (07) ◽  
pp. 1750125 ◽  
Author(s):  
Małgorzata Kołopieńczyk ◽  
Larysa Titarenko ◽  
Alexander Barkalov

The complexity of algorithms implemented in digital systems keeps growing, and methods are being developed for the most effective use of both hardware resources and energy. For engineers, optimizing hardware resources in the design of control units remains an important problem. The standard way of implementing the control unit as a finite-state machine (FSM) is not satisfactory, as it consumes a considerable amount of field-programmable gate array (FPGA) resources. This paper is devoted to the design of a Moore FSM in an FPGA structure using look-up tables and embedded memory blocks (EMBs). The problem background is discussed. A method is presented for designing Moore FSM logic circuits with EMBs, based on splitting the set of logical conditions and encoding the logical conditions. Examples of design and research results are given.


Symmetry ◽  
2019 ◽  
Vol 11 (10) ◽  
pp. 1265 ◽  
Author(s):  
Zhuang Cao ◽  
Huiguo Zhang ◽  
Junnan Li ◽  
Mei Wen ◽  
Chunyuan Zhang

The development of modern networking requires that high-performance network processors be designed quickly and efficiently to support new protocols. As a very important part of the processor, the parser parses the headers of the packets—this is the precondition for further processing and finally forwarding these packets. This paper presents a framework designed to transform P4 programs to VHDL and to generate parsers on Field Programmable Gate Arrays (FPGAs). The framework includes a pipeline-based hardware architecture and a back-end compiler. The hardware architecture comprises many components with varying functionality, each of which has its own optimized VHDL template. By using the output of a standard frontend P4 compiler, our proposed compiler extracts the parameters and relationships from within the used components, which can then be mapped to corresponding templates by configuring, optimizing, and instantiating them. Finally, these templates are connected to output VHDL code. When a prototype of this framework is implemented and evaluated, the results demonstrate that the throughputs of the generated parsers achieve nearly 320 Gbps at a clock rate of around 300 MHz. Compared with state-of-the-art solutions, our proposed parsers achieve an average of twice the throughput when similar amounts of resources are being used.


2020 ◽  
Vol 34 (04) ◽  
pp. 4780-4787
Author(s):  
Yuhang Li ◽  
Xin Dong ◽  
Sai Qian Zhang ◽  
Haoli Bai ◽  
Yuanpeng Chen ◽  
...  

To deploy deep neural networks on resource-limited devices, quantization has been widely explored. In this work, we study extremely low-bit networks, which offer tremendous speed-up and memory savings through quantized activations and weights. We first raise three overlooked issues in extremely low-bit networks: the squashed range of quantized values, gradient vanishing during backpropagation, and the unexploited hardware acceleration of ternary networks. By reparameterizing the quantized activation and weight vectors with a full-precision scale and offset applied to a fixed ternary vector, we decouple the range and magnitude from the direction to mitigate the above problems. The learnable scale and offset can automatically adjust the range of quantized values and the sparsity without gradient vanishing. A novel encoding and computation pattern is designed to support efficient computing for our reparameterized ternary network (RTN). Experiments on ResNet-18 for ImageNet demonstrate that the proposed RTN finds a much better trade-off between bitwidth and accuracy and achieves up to 26.76% relative accuracy improvement compared with state-of-the-art methods. Moreover, we validate the proposed computation pattern on Field Programmable Gate Arrays (FPGAs), where it brings 46.46× and 89.17× savings in power and area compared with full-precision convolution.
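
The reparameterization described above can be pictured as approximating each weight vector by scale · t + offset, where t has entries in {-1, 0, +1}. The sketch below is a simplified stand-in: it ternarizes by a fixed threshold and fits the scale and offset in closed form by least squares, whereas the abstract describes them as learnable parameters trained with the network. All function names and the threshold value are assumptions of the example.

```python
def ternarize(weights, threshold):
    """Map each weight to -1, 0 or +1; values within the threshold become 0."""
    return [0 if abs(w) <= threshold else (1 if w > 0 else -1) for w in weights]


def fit_scale_offset(weights, ternary):
    """Least-squares fit of weights ~ scale * ternary + offset.

    Closed-form stand-in for the learnable scale and offset (an assumption
    of this sketch, not the training procedure from the abstract).
    """
    n = len(weights)
    mean_w = sum(weights) / n
    mean_t = sum(ternary) / n
    var_t = sum((t - mean_t) ** 2 for t in ternary)
    if var_t == 0:
        # Constant ternary vector: only the offset carries information.
        return 0.0, mean_w
    cov = sum((t - mean_t) * (w - mean_w) for t, w in zip(ternary, weights))
    scale = cov / var_t
    offset = mean_w - scale * mean_t
    return scale, offset


def dequantize(ternary, scale, offset):
    """Reconstruct approximate full-precision values from the ternary codes."""
    return [scale * t + offset for t in ternary]


weights = [-1.5, 0.5, 2.5, 2.5, -1.5, 0.5]
codes = ternarize(weights, threshold=1.0)
scale, offset = fit_scale_offset(weights, codes)
restored = dequantize(codes, scale, offset)
```

Because the offset shifts the whole range, the reconstructed values need not be symmetric around zero, which is the "range" freedom a plain ternary code lacks.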

