High Performance Low Cost Implementation of FPGA-Based Fractional-Order Operators

Author(s):  
Cindy X. Jiang ◽  
Tom T. Hartley ◽  
Joan E. Carletta

Hardware implementation of fractional-order differentiators and integrators requires careful consideration of system quality, hardware cost, and speed. This paper proposes using field programmable gate arrays (FPGAs) to implement fractional-order systems, and demonstrates the advantages that FPGAs provide. As an illustration, the fundamental operator raised to a real power is approximated via the binomial expansion of the backward difference. The resulting high-order FIR filter is implemented in a pipelined multiplierless architecture on a low-cost Spartan-3 FPGA. Unlike common digital implementations in which all filter coefficients have the same word length, this approach exploits variable word length for each coefficient. Our system requires twenty percent less hardware than a system of comparable quality generated by Xilinx’s System Generator on its most area-efficient multiplierless setting. The work shows an effective way to implement a high-quality, high-throughput approximation to a fractional-order system at lower cost than traditional FPGA-based designs.
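To make the binomial-expansion step concrete, here is a minimal Python sketch of how the FIR coefficients of (1 - z^-1)^alpha can be generated and applied. It is an illustration only, not the authors' multiplierless hardware design: the function names `gl_coefficients` and `fractional_fir`, the sample period `T`, and the floating-point arithmetic are all assumptions made for this example.

```python
def gl_coefficients(alpha, n_taps):
    """First n_taps coefficients of the binomial expansion of (1 - z^-1)^alpha.

    Uses the recurrence w_0 = 1, w_k = w_{k-1} * (k - 1 - alpha) / k,
    which equals (-1)^k * binom(alpha, k).
    """
    coeffs = [1.0]
    for k in range(1, n_taps):
        coeffs.append(coeffs[-1] * (k - 1 - alpha) / k)
    return coeffs


def fractional_fir(x, alpha, n_taps, T=1.0):
    """Apply the truncated expansion as an FIR filter to the sequence x.

    alpha > 0 approximates a fractional derivative, alpha < 0 a fractional
    integral. T is the (assumed) sample period; the output is scaled by
    T**(-alpha).
    """
    w = gl_coefficients(alpha, n_taps)
    scale = T ** (-alpha)
    y = []
    for n in range(len(x)):
        acc = 0.0
        # Convolve with the available history, up to n_taps samples.
        for k in range(min(n + 1, n_taps)):
            acc += w[k] * x[n - k]
        y.append(scale * acc)
    return y
```

For alpha = 1 the coefficients reduce to the ordinary backward difference (1, -1, 0, ...), and for alpha = -1 they are all ones, which is a running sum. For non-integer alpha the coefficients decay slowly, which is why a high-order filter is needed.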

2019 ◽  
Vol 214 ◽  
pp. 07029
Author(s):  
David Ojika ◽  
Ann Gordon-Ross ◽  
Herman Lam ◽  
Bhavesh Patel

Field-programmable gate arrays (FPGAs) have largely been used in communication and high-performance computing. Given the recent advances in big data and emerging trends in cloud computing (e.g., serverless [18]), FPGAs are increasingly being introduced into these domains (e.g., Microsoft’s datacenters [6] and Amazon Web Services [10]). To address these domains’ processing needs, recent research has focused on using FPGAs to accelerate workloads, ranging from analytics and machine learning to databases and network function virtualization. In this paper, we present an ongoing effort to realize a high-performance FPGA-as-a-microservice (FaaM) architecture for the cloud. We discuss some of the technical challenges and propose several solutions for efficiently integrating FPGAs into virtualized environments. Our case study, which deploys a multithreaded, multi-user compression microservice using the FaaM architecture, indicates that microservices-based FPGA acceleration can sustain high performance compared to a straightforward implementation, with minimal to no communication overhead despite the hardware abstraction.


Computers ◽  
2020 ◽  
Vol 9 (3) ◽  
pp. 70
Author(s):  
Carolina Fernández ◽  
Sergio Giménez ◽  
Eduard Grasa ◽  
Steve Bunch

The lack of high-performance RINA (Recursive InterNetwork Architecture) implementations to date makes it hard to experiment with RINA as an underlay networking fabric solution for different types of networks, and to assess RINA’s benefits in practice in scenarios with high traffic loads. High-performance router implementations typically require dedicated hardware support, such as FPGAs (Field Programmable Gate Arrays) or specialized ASICs (Application Specific Integrated Circuits). With the advance of hardware programmability in recent years, new possibilities unfold to prototype novel networking technologies. In particular, the use of the P4 programming language for programmable ASICs holds great promise for developing a RINA router. This paper details the design and part of the implementation of the first P4-based RINA interior router, which reuses the layer management components of the IRATI Linux-based RINA implementation and implements the data-transfer components using a P4 program. We also describe the configuration and testing of our initial deployment scenarios, using ancillary open-source tools such as the P4 reference test software switch (BMv2) and the P4Runtime API.


2017 ◽  
Vol 2017 ◽  
pp. 1-9 ◽  
Author(s):  
Ali Asghar ◽  
Muhammad Mazher Iqbal ◽  
Waqar Ahmed ◽  
Mujahid Ali ◽  
Husain Parvez ◽  
...  

In modern SRAM-based Field Programmable Gate Arrays, a Look-Up Table (LUT) is the principal constituent logic element, which can realize every possible Boolean function. However, this flexibility of LUTs comes with a heavy area penalty. A part of this area overhead comes from the increased amount of configuration memory, which rises exponentially as the LUT size increases. In this paper, we first present a detailed analysis of a previously proposed FPGA architecture which allows sharing of LUT memory (SRAM) tables among NPN-equivalent functions, to reduce the area as well as the number of configuration bits. We then propose several methods to improve the existing architecture. A new clustering technique is proposed which packs NPN-equivalent functions together inside a Configurable Logic Block (CLB). We also make use of a recently proposed high-performance Boolean matching algorithm to perform NPN classification. To enhance area savings further, we evaluate the feasibility of more than two LUTs sharing the same SRAM table. Consequently, this work explores the SRAM table sharing approach for a range of LUT sizes (4–7), while varying the cluster sizes (4–16). Experimental results on the MCNC benchmark circuit set show an overall area reduction of ~7% while maintaining the same critical path delay.
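Two functions are NPN-equivalent if one can be obtained from the other by negating inputs, permuting inputs, and optionally negating the output. The sketch below illustrates the idea with a brute-force canonical form for small truth tables. It is not the Boolean matching algorithm used in the paper; the function name `npn_canonical` and the encoding of a truth table as an integer (bit `r` holds the output for input row `r`) are assumptions for this example, and exhaustive search is only practical for small `k`.

```python
from itertools import permutations


def npn_canonical(tt, k):
    """Return the smallest truth table in the NPN orbit of a k-input function.

    tt encodes the function as an integer: bit r is the output for input
    row r, where bit i of r is the value of input i.
    """
    n_rows = 1 << k
    full = (1 << n_rows) - 1  # mask used to negate the output
    best = None
    for perm in permutations(range(k)):
        for neg in range(n_rows):  # each bit of neg negates one input
            t = 0
            for row in range(n_rows):
                flipped = row ^ neg
                # Route bit i of the (negated) row to input perm[i].
                src = 0
                for i in range(k):
                    if (flipped >> i) & 1:
                        src |= 1 << perm[i]
                if (tt >> src) & 1:
                    t |= 1 << row
            # Consider the transformed table with and without output negation.
            for cand in (t, t ^ full):
                if best is None or cand < best:
                    best = cand
    return best
```

Functions with the same canonical value fall in the same NPN class and could, in principle, share one SRAM table. For example, 2-input AND, OR, NAND, and NOR all map to the same class, while XOR maps to a different one.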


The paper concerns the construction of a Direct Digital Synthesis (DDS) generator based on widely developed Field Programmable Gate Array (FPGA) technology. A sine-wave generator whose frequency and phase are controllable is designed with DDS technology. It is shown that the FPGA-based DDS design is dependable and practicable. In testing, the output wave meets the essential aims of easy control and high performance. The sinusoidal signal produced by the DDS features a simple circuit, ease of measurement, stable performance, fast frequency switching, and fine accuracy. Its output frequency ranges from 0 Hz to 150 kHz in steps of 5 Hz.
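A DDS generator is commonly built from a phase accumulator that indexes a sine look-up table; the tuning word added on each clock sets the output frequency, and an initial offset sets the phase. The following Python sketch models that general structure in software. It is not the design from the paper: the names `tuning_word` and `dds_samples`, the accumulator width, and the table size are assumptions chosen for illustration.

```python
import math


def tuning_word(f_out, f_clk, acc_bits=32):
    """Frequency tuning word for a desired output frequency.

    With an acc_bits-wide accumulator, f_out = ftw * f_clk / 2**acc_bits,
    so the frequency step is f_clk / 2**acc_bits.
    """
    return round(f_out * (1 << acc_bits) / f_clk)


def dds_samples(ftw, n_samples, acc_bits=32, lut_bits=10, phase_offset=0):
    """Generate n_samples of a sine wave from a phase accumulator and LUT."""
    lut_size = 1 << lut_bits
    lut = [math.sin(2 * math.pi * i / lut_size) for i in range(lut_size)]
    mask = (1 << acc_bits) - 1
    acc = phase_offset & mask
    out = []
    for _ in range(n_samples):
        # The top lut_bits of the accumulator address the sine table.
        out.append(lut[acc >> (acc_bits - lut_bits)])
        # Advance the phase; wrap-around models accumulator overflow.
        acc = (acc + ftw) & mask
    return out
```

Because the frequency step is the clock frequency divided by 2 to the power of the accumulator width, a fixed step such as the 5 Hz quoted above follows from the choice of clock and accumulator width, which this sketch leaves as free parameters.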


2008 ◽  
Vol 6 ◽  
pp. 113-118 ◽  
Author(s):  
O. A. Pfänder ◽  
R. Nopper ◽  
H.-J. Pfleiderer ◽  
S. Zhou ◽  
A. Bermak

Abstract. Binary multiplication continues to be one of the essential arithmetic operations in digital circuits. Even though field-programmable gate arrays (FPGAs) are becoming more and more powerful these days, the vendors cannot avoid implementing multiplications with high word-lengths using embedded blocks instead of configurable logic. On the other hand, the circuit's efficiency decreases if the provided word-length of the hard-wired multipliers exceeds the precision requirements of the algorithm mapped into the FPGA. Thus, it is beneficial to use multiplier blocks with configurable word-length, optimized for area, speed and power dissipation, e.g. for digital signal processing (DSP) applications. In this contribution, we present different approaches and structures for the realization of a multiplication with variable precision and perform an objective comparison. This includes one approach based on a modified Baugh and Wooley algorithm and three structures using Booth's arithmetic operand recoding with different array structures. All modules have the option to compute signed two's complement fixed-point numbers either as an individual computing unit or interconnected to a superior array. Therefore, a high throughput at low precision through parallelism, or a high precision through concatenation, can be achieved.
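As a rough illustration of operand recoding, the sketch below implements the textbook radix-4 (modified) Booth recoding of a signed two's complement multiplier in Python. Whether the paper's structures use radix-4 or another variant is not stated in the abstract, so the radix, the function names `booth_radix4_digits` and `booth_multiply`, and the even-width restriction are assumptions for this example.

```python
def booth_radix4_digits(y, bits):
    """Recode a signed two's complement value into radix-4 Booth digits.

    y must fit in `bits` bits, and `bits` must be even. Returns digits in
    {-2, -1, 0, 1, 2}, least significant first, with sum(d * 4**i) == y.
    """
    if bits % 2 != 0:
        raise ValueError("bits must be even for radix-4 recoding")
    y &= (1 << bits) - 1  # view y as a raw bit pattern
    digits = []
    prev = 0  # implicit bit to the right of the least significant bit
    for i in range(0, bits, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        digits.append(-2 * b1 + b0 + prev)
        prev = b1
    return digits


def booth_multiply(x, y, bits):
    """Multiply x by y by summing Booth-recoded partial products."""
    return sum(d * x * 4 ** i for i, d in enumerate(booth_radix4_digits(y, bits)))
```

Recoding halves the number of partial products compared with a bit-by-bit scheme, and each partial product is only a shift and optional negation of `x`, which is what makes the technique attractive in hardware.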


2019 ◽  
Vol 9 (13) ◽  
pp. 2705
Author(s):  
Chenggang Yan ◽  
Chen Hu ◽  
Jianhui Wu

In this paper, a digital-to-time converter (DTC) based on the three delay lines (3D) Vernier principle is proposed and implemented with field programmable gate arrays (FPGAs). Based on the 3D Vernier principle, the DTC is realized by three period approximate phase locked loops (PLLs). The theoretical fine resolution of the proposed DTC is improved by calculating the period difference two times. The achieved resolution of the proposed DTC is 203 fs, realized with an Altera Stratix III FPGA chip, which is about tenfold higher than that of traditional FPGA-based DTCs implemented with FPGAs of the same series. The worst absolute differential nonlinearity (DNL) and integral nonlinearity (INL) are verified to be smaller than 0.88 least significant bit (LSB) and 4.4 LSB, respectively. With optimized computation logic, only 448 adaptive look-up tables (ALUTs), 237 registers, and three PLLs are used for the circuit implementation. Experimental results show that the proposed DTC features high resolution at low cost.

