A Reconfigurable System Approach to the Direct Kinematics of a 5 D.o.f Robotic Manipulator

Hardware acceleration in high-performance computer systems is of particular interest for many engineering and scientific applications in which a large number of arithmetic operations and transcendental functions must be computed. This paper presents a hardware architecture for computing the direct kinematics of robot manipulators with 5 degrees of freedom (5 D.o.f) using floating-point arithmetic in 32-, 43-, and 64-bit representations, implemented in Field Programmable Gate Arrays (FPGAs). The proposed architecture has been developed using several floating-point libraries for arithmetic and transcendental function operators, allowing the designer to select (pre-synthesis) a suitable bit-width representation according to the accuracy and dynamic range, as well as the area, elapsed time and power consumption requirements of the application. Synthesis results demonstrate the effectiveness and high performance of the implemented cores on commercial FPGAs. Simulation results were used to compute the Mean Square Error (MSE), with Matlab as the statistical estimator, validating the correct behavior of the implemented cores. Additionally, the processing time of the hardware architecture was compared with the same formulation implemented in software on the PowerPC (FPGA embedded processor), demonstrating that the hardware architecture speeds up the software implementation by a factor of 1298.
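The trade-off between bit width and accuracy described above can be illustrated in software. The following is a minimal sketch, assuming the standard Denavit-Hartenberg convention and a purely hypothetical 5-joint parameter table (`DH_TABLE` is not taken from the paper). It evaluates the direct kinematics in 32-bit and 64-bit floating point and reports the MSE between the two poses; the 43-bit format has no NumPy equivalent and is not modelled.

```python
import numpy as np

def dh_transform(theta, d, a, alpha, dtype):
    """Homogeneous transform of one joint (standard Denavit-Hartenberg form)."""
    # Cast inputs first so every intermediate value uses the chosen precision.
    theta, d, a, alpha = (dtype(v) for v in (theta, d, a, alpha))
    ct, st = np.cos(theta), np.sin(theta)
    ca, sa = np.cos(alpha), np.sin(alpha)
    return np.array([
        [ct, -st * ca,  st * sa, a * ct],
        [st,  ct * ca, -ct * sa, a * st],
        [0.0,      sa,       ca,      d],
        [0.0,     0.0,      0.0,    1.0],
    ], dtype=dtype)

# Hypothetical (d, a, alpha) rows for a 5-joint arm; illustrative values only.
DH_TABLE = [
    (0.30, 0.00, np.pi / 2),
    (0.00, 0.25, 0.0),
    (0.00, 0.20, 0.0),
    (0.00, 0.00, np.pi / 2),
    (0.10, 0.00, 0.0),
]

def direct_kinematics(joint_angles, dtype=np.float64):
    """Chain the five joint transforms to get the end-effector pose (4x4)."""
    pose = np.eye(4, dtype=dtype)
    for theta, (d, a, alpha) in zip(joint_angles, DH_TABLE):
        pose = pose @ dh_transform(theta, d, a, alpha, dtype)
    return pose

def mean_square_error(reference, approx):
    """MSE between two poses, accumulated in 64-bit precision."""
    diff = np.asarray(reference, dtype=np.float64) - np.asarray(approx, dtype=np.float64)
    return float(np.mean(diff ** 2))

if __name__ == "__main__":
    q = [0.1, -0.4, 0.7, 0.2, -0.3]  # arbitrary joint angles in radians
    pose64 = direct_kinematics(q, np.float64)
    pose32 = direct_kinematics(q, np.float32)
    print("MSE of float32 against float64:", mean_square_error(pose64, pose32))
```

Using the 64-bit result as the reference mirrors the idea of validating narrower formats against a software estimator, although the actual test procedure in the paper may differ.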

Implementation of Embedded Floating Point Arithmetic Units on FPGA

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.550.126 ◽

2014 ◽

Vol 550 ◽

pp. 126-136

Author(s):

N. Ramya Rani

Keyword(s):

High Speed ◽

High Performance ◽

Floating Point ◽

Double Precision ◽

Embedded Computing ◽

Floating Point Arithmetic ◽

Field Programmable ◽

Programmable Gate Arrays ◽

Arithmetic Units ◽

Point Arithmetic

Floating point arithmetic plays a major role in scientific and embedded computing applications. However, the performance of field programmable gate arrays (FPGAs) used for floating point applications is poor because of the complexity of floating point arithmetic. The implementation of floating point units on FPGAs consumes a large amount of resources, which has led to the development of embedded floating point units in FPGAs. Embedded applications such as multimedia, communication and DSP algorithms use floating point arithmetic in processing graphics, Fourier transformation, coding, etc. In this paper, methodologies are presented for the implementation of embedded floating point units on FPGAs. The work aims to achieve high computation speed and to reduce the power needed for evaluating expressions. An application that demands high-performance floating point computation can achieve better speed and density by incorporating embedded floating point units. Additionally, this paper describes a comparative study of the design of single precision and double precision pipelined floating point arithmetic units for evaluating expressions. The modules are designed and simulated in VHDL using Xilinx software and implemented on VIRTEX and SPARTAN FPGAs.

A Fast Approach for Generating Efficient Parsers on FPGAs

Symmetry ◽

10.3390/sym11101265 ◽

2019 ◽

Vol 11 (10) ◽

pp. 1265 ◽

Cited By ~ 1

Author(s):

Zhuang Cao ◽

Huiguo Zhang ◽

Junnan Li ◽

Mei Wen ◽

Chunyuan Zhang

Keyword(s):

High Performance ◽

State Of The Art ◽

Field Programmable Gate Arrays ◽

Hardware Architecture ◽

Clock Rate ◽

Gate Arrays ◽

Fast Approach ◽

Field Programmable ◽

Programmable Gate Arrays ◽

Vhdl Code

The development of modern networking requires that high-performance network processors be designed quickly and efficiently to support new protocols. As a very important part of the processor, the parser parses the headers of the packets—this is the precondition for further processing and finally forwarding these packets. This paper presents a framework designed to transform P4 programs to VHDL and to generate parsers on Field Programmable Gate Arrays (FPGAs). The framework includes a pipeline-based hardware architecture and a back-end compiler. The hardware architecture comprises many components with varying functionality, each of which has its own optimized VHDL template. By using the output of a standard frontend P4 compiler, our proposed compiler extracts the parameters and relationships from within the used components, which can then be mapped to corresponding templates by configuring, optimizing, and instantiating them. Finally, these templates are connected to output VHDL code. When a prototype of this framework is implemented and evaluated, the results demonstrate that the throughputs of the generated parsers achieve nearly 320 Gbps at a clock rate of around 300 MHz. Compared with state-of-the-art solutions, our proposed parsers achieve an average of twice the throughput when similar amounts of resources are being used.

Hardware Acceleration of High-Performance Computational Flow Dynamics Using High-Bandwidth Memory-Enabled Field-Programmable Gate Arrays

ACM Transactions on Reconfigurable Technology and Systems ◽

10.1145/3476229 ◽

2022 ◽

Vol 15 (2) ◽

pp. 1-35

Author(s):

Tom Hogervorst ◽

Răzvan Nane ◽

Giacomo Marchiori ◽

Tong Dong Qiu ◽

Markus Blatt ◽

...

Keyword(s):

High Performance ◽

Scientific Computing ◽

Hardware Acceleration ◽

Field Programmable Gate Arrays ◽

Gate Arrays ◽

Computational Flow Dynamics ◽

Field Programmable ◽

Programmable Gate Arrays ◽

High Bandwidth ◽

Reservoir Simulator

Scientific computing is at the core of many High-Performance Computing applications, including computational flow dynamics. Because simulating increasingly larger computational models is of the utmost importance, hardware acceleration is receiving increased attention due to its potential to maximize the performance of scientific computing. Field-Programmable Gate Arrays could accelerate scientific computing because they make it possible to fully customize the memory hierarchy, which is important in irregular applications such as iterative linear solvers. In this article, we study the potential of using Field-Programmable Gate Arrays in High-Performance Computing, motivated by the rapid advances in reconfigurable hardware, such as the increase in on-chip memory size, the increasing number of logic cells, and the integration of High-Bandwidth Memories on board. To perform this study, we propose a novel Sparse Matrix-Vector multiplication unit and an ILU0 preconditioner tightly integrated with a BiCGStab solver kernel. We integrate the developed preconditioned iterative solver in Flow from the Open Porous Media project, a state-of-the-art open source reservoir simulator. Finally, we perform a thorough evaluation of the FPGA solver kernel in both stand-alone mode and integrated in the reservoir simulator, using the NORNE field, a real-world reservoir model that uses a grid with more than 10^5 cells and three unknowns per cell.
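Sparse matrix-vector multiplication is the irregular kernel at the heart of iterative solvers such as BiCGStab. The sketch below shows the operation in plain Python, assuming the common compressed sparse row (CSR) layout; the abstract does not state which storage format the proposed unit uses, so CSR and the function name `csr_spmv` are assumptions made for illustration.

```python
def csr_spmv(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR form.

    values  : non-zero entries, row by row
    col_idx : column index of each entry in `values`
    row_ptr : row_ptr[i]..row_ptr[i+1] delimits the entries of row i
    """
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        acc = 0.0
        # The indirect read x[col_idx[k]] is what makes the access pattern irregular.
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += values[k] * x[col_idx[k]]
        y[row] = acc
    return y

if __name__ == "__main__":
    # A = [[4, 0, 1],
    #      [0, 3, 0],
    #      [2, 0, 5]]
    values = [4.0, 1.0, 3.0, 2.0, 5.0]
    col_idx = [0, 2, 1, 0, 2]
    row_ptr = [0, 2, 3, 5]
    print(csr_spmv(values, col_idx, row_ptr, [1.0, 2.0, 3.0]))  # [7.0, 6.0, 17.0]
```

The data-dependent reads of `x` are the part that benefits from a customized memory hierarchy, which is the motivation the abstract gives for using FPGAs.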

Hardware Acceleration for Finite Element Electromagnetics: Efficient Sparse Matrix Floating-Point Computations with Field Programmable Gate Arrays

2006 12th Biennial IEEE Conference on Electromagnetic Field Computation ◽

10.1109/cefc-06.2006.1633187 ◽

2006 ◽

Cited By ~ 1

Author(s):

Y.E. Kurdi ◽

W.J. Gross ◽

D. Giannacopoulos

Keyword(s):

Finite Element ◽

Sparse Matrix ◽

Hardware Acceleration ◽

Field Programmable Gate Arrays ◽

Floating Point ◽

Gate Arrays ◽

Field Programmable ◽

Programmable Gate Arrays

BLOCK FLOATING POINT FFT IMPLEMENTATION FOR DMT xDSL SYSTEMS

Journal of Circuits System and Computers ◽

10.1142/s0218126604001878 ◽

2004 ◽

Vol 13 (05) ◽

pp. 1147-1164

Author(s):

Th. ZAHARIADIS ◽

S. APOSTOLACOS ◽

I. GRAMMATIKAKIS ◽

D. MEXIS ◽

N. ZERVOS ◽

...

Keyword(s):

High Speed ◽

Dynamic Range ◽

Digital Signal ◽

Floating Point ◽

Digital Subscriber Line ◽

Discrete Multitone ◽

Field Programmable ◽

Signal Processing Algorithms ◽

Programmable Gate Arrays ◽

Signal Processors

The development of multiple Discrete Multitone (DMT) Digital Subscriber Line (DSL) flavors on a single platform can benefit considerably from a programmable architecture that features Digital Signal Processors (DSPs) and Field Programmable Gate Arrays (FPGAs), especially when fast prototyping is targeted. However, the flexibility assumed to be offered by algorithmic partitioning does not automatically and proportionally simplify the digital signal processing algorithms, unless the effects of overflow/saturation in intermediate processing stages are carefully studied. These effects are critical throughout the design process, since the operations involved are nonlinear in nature and affect the most significant bits of the computational process. This paper presents an efficient soft-core implementation of a Block Floating Point FFT (BLFP) algorithm, designed for Very high-speed DSL (VDSL) DMT systems and for the full variety of other xDSL DMT flavors, as the latter demand an extended dynamic range to achieve performance that may otherwise be only warranted by costly floating-point chip implementations.
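The idea behind block floating point can be shown in a few lines: a whole block of samples shares a single exponent, and each sample keeps only a fixed-point mantissa. The sketch below is a generic illustration under that assumption; the function names and the 16-bit mantissa width are hypothetical, and it does not reproduce the per-stage scaling of the FFT core described in the paper.

```python
import math

def to_block_floating_point(block, mantissa_bits=16):
    """Encode a block as signed integer mantissas plus one shared exponent."""
    peak = max(abs(v) for v in block)
    if peak == 0.0:
        return [0] * len(block), 0
    # frexp gives peak = m * 2**exponent with 0.5 <= m < 1,
    # so every sample divided by 2**exponent lies strictly inside (-1, 1).
    exponent = math.frexp(peak)[1]
    scale = 2 ** (mantissa_bits - 1)
    limit = scale - 1
    mantissas = [
        max(-limit, min(limit, round(v / 2 ** exponent * scale)))  # saturate
        for v in block
    ]
    return mantissas, exponent

def from_block_floating_point(mantissas, exponent, mantissa_bits=16):
    """Decode mantissas and the shared exponent back to floating point."""
    scale = 2 ** (mantissa_bits - 1)
    return [m / scale * 2 ** exponent for m in mantissas]

if __name__ == "__main__":
    block = [0.75, -0.5, 0.001, 3.0]
    mantissas, exponent = to_block_floating_point(block)
    print(mantissas, exponent)
    print(from_block_floating_point(mantissas, exponent))
```

Because the exponent follows the largest sample, small samples in the same block lose resolution (0.001 is stored as a mantissa of 8 here). This is the trade that buys extended dynamic range without the cost of a full floating-point datapath.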

High-performance magneto-rheological clutches for direct-drive actuation: Design and development

Journal of Intelligent Material Systems and Structures ◽

10.1177/1045389x211006902 ◽

2021 ◽

pp. 1045389X2110069

Author(s):

Sergey Pisetskiy ◽

Mehrdad Kermani

Keyword(s):

High Performance ◽

Degrees Of Freedom ◽

Dynamic Range ◽

Complete Analysis ◽

Mass Ratio ◽

Direct Drive ◽

Element Analysis ◽

Prototype Development ◽

Magneto Rheological ◽

Hall Sensors

This paper presents an improved design, complete analysis, and prototype development of high torque-to-mass ratio Magneto-Rheological (MR) clutches. The proposed MR clutches are intended as the main actuation mechanism of a robotic manipulator with five degrees of freedom. Multiple steps to increase the torque-to-mass ratio of the clutch are evaluated and implemented in one design. First, we focus on the Hall sensors’ configuration. Our proposed MR clutches feature embedded Hall sensors for indirect torque measurement. A new arrangement of the sensors with no effect on the magnetic reluctance of the clutch is presented. Second, we improve the magnetization of the MR clutch. We utilize a new hybrid design that features a combination of an electromagnetic coil and a permanent magnet for improved torque-to-mass ratio. Third, a gap size reduction in the hybrid MR clutch is introduced, and the effect of this reduction on the maximum torque and the dynamic range of the MR clutch is investigated. Finally, the design for a pair of MR clutches with a shared magnetic core for antagonistic actuation of the robot joint is presented and experimentally validated. The details of each approach are discussed and the results of the finite element analysis are used to highlight the required engineering steps and to demonstrate the improvements achieved. Using the proposed design, several prototypes of the MR clutch with various torque capacities ranging from 15 to 200 N·m are developed, assembled, and tested. The experimental results demonstrate the performance of the proposed design and validate the accuracy of the analysis used for the development.

High Performance Low Cost Implementation of FPGA-Based Fractional-Order Operators

Volume 6: 5th International Conference on Multibody Systems, Nonlinear Dynamics, and Control, Parts A, B, and C ◽

10.1115/detc2005-84796 ◽

2005 ◽

Cited By ~ 3

Author(s):

Cindy X. Jiang ◽

Tom T. Hartley ◽

Joan E. Carletta

Keyword(s):

Fractional Order ◽

Word Length ◽

High Performance ◽

Low Cost ◽

Careful Consideration ◽

Order System ◽

System Quality ◽

Gate Arrays ◽

Field Programmable ◽

Programmable Gate Arrays

Hardware implementation of fractional-order differentiators and integrators requires careful consideration of issues of system quality, hardware cost, and speed. This paper proposes using field programmable gate arrays (FPGAs) to implement fractional-order systems, and demonstrates the advantages that FPGAs provide. As an illustration, the fundamental operator s raised to a real power is approximated via the binomial expansion of the backward difference. The resulting high-order FIR filter is implemented in a pipelined multiplierless architecture on a low-cost Spartan-3 FPGA. Unlike common digital implementations in which all filter coefficients have the same word length, this approach exploits a variable word length for each coefficient. Our system requires twenty percent less hardware than a system of comparable quality generated by Xilinx’s System Generator on its most area-efficient multiplierless setting. The work shows an effective way to implement a high-quality, high-throughput approximation to a fractional-order system at lower cost than traditional FPGA-based designs.
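The binomial expansion of the backward difference mentioned above leads directly to FIR coefficients. A minimal sketch follows, assuming the textbook form in which a fractional power q of the backward difference (1 - z^-1), scaled by T^-q, has taps (-1)^k C(q, k). It uses ordinary floating point with a fixed number of taps, so it does not model the multiplierless, variable-word-length design of the paper, and the function names are hypothetical.

```python
def backward_difference_coefficients(order, taps):
    """Taps c_k = (-1)**k * C(order, k) of the expansion of (1 - z^-1)**order."""
    coeffs = [1.0]
    for k in range(1, taps):
        # Ratio of consecutive binomial terms: c_k / c_(k-1) = (k - 1 - order) / k
        coeffs.append(coeffs[-1] * (k - 1 - order) / k)
    return coeffs

def fractional_fir(signal, order, step, taps):
    """Apply the truncated expansion as an FIR filter with sample period `step`."""
    coeffs = backward_difference_coefficients(order, taps)
    gain = step ** (-order)
    out = []
    for n in range(len(signal)):
        acc = 0.0
        for k in range(min(taps, n + 1)):
            acc += coeffs[k] * signal[n - k]
        out.append(gain * acc)
    return out

if __name__ == "__main__":
    # Half-order differentiator: the taps decay slowly, hence the high filter order.
    print(backward_difference_coefficients(0.5, 6))
    # Sanity check with order 1: a ramp sampled every 0.5 s has unit slope.
    print(fractional_fir([0.0, 0.5, 1.0, 1.5], order=1.0, step=0.5, taps=4))
```

For an integer order the expansion terminates (order 1 gives the familiar first difference), while a fractional order gives an infinite, slowly decaying sequence that must be truncated. That is why the resulting filter is of high order and why coefficient word length matters for hardware cost.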

Embedded GPU Implementation for High-Performance Ultrasound Imaging

Electronics ◽

10.3390/electronics10080884 ◽

2021 ◽

Vol 10 (8) ◽

pp. 884

Author(s):

Stefano Rossi ◽

Enrico Boni

Keyword(s):

High Performance ◽

Graphics Processing Unit ◽

Digital Signal ◽

Processing Unit ◽

Embedded Computing ◽

Field Programmable ◽

Peripheral Component Interconnect ◽

Programmable Gate Arrays ◽

Graphics Processing ◽

Signal Processors

Methods of increasing complexity are currently being proposed for ultrasound (US) echographic signal processing. Graphics Processing Unit (GPU) resources allowing massive exploitation of parallel computing are ideal candidates for these tasks. Many high-performance US instruments, including open scanners like ULA-OP 256, have an architecture based only on Field-Programmable Gate Arrays (FPGAs) and/or Digital Signal Processors (DSPs). This paper proposes the implementation of the embedded NVIDIA Jetson Xavier AGX module on board ULA-OP 256. The system architecture was revised to allow the introduction of a new Peripheral Component Interconnect Express (PCIe) communication channel, while maintaining backward compatibility with all other embedded computing resources already on board. Moreover, the Input/Output (I/O) peripherals of the module make the ultrasound system independent, freeing the user from the need to use an external controlling PC.

An Efficient FPGA-Based Convolutional Neural Network for Classification: Ad-MobileNet

Electronics ◽

10.3390/electronics10182272 ◽

2021 ◽

Vol 10 (18) ◽

pp. 2272

Author(s):

Safa Bouguezzi ◽

Hana Ben Fredj ◽

Tarek Belabed ◽

Carlos Valderrama ◽

Hassene Faiedh ◽

...

Keyword(s):

Recognition Rate ◽

Hardware Acceleration ◽

Implementation Model ◽

Gate Arrays ◽

Proposed Model ◽

Field Programmable ◽

Programmable Gate Arrays ◽

Computer Vision Applications ◽

On Chip ◽

Segmentation Image

Convolutional Neural Networks (CNNs) continue to dominate research in the area of hardware acceleration using Field Programmable Gate Arrays (FPGAs), proving their effectiveness in a variety of computer vision applications such as object segmentation, image classification, face detection, and traffic sign recognition, among others. However, there are numerous constraints on deploying CNNs on FPGAs, including limited on-chip memory, CNN size, and configuration parameters. This paper introduces Ad-MobileNet, an advanced CNN model inspired by the baseline MobileNet model. The proposed model uses an Ad-depth engine, which is an improved version of the depth-wise separable convolution unit. Moreover, we propose an FPGA-based implementation model that supports the Mish, TanhExp, and ReLU activation functions. The experimental results using the CIFAR-10 dataset show that our Ad-MobileNet has a classification accuracy of 88.76% while requiring few computational hardware resources. Compared to state-of-the-art methods, our proposed method has a fairly high recognition rate while using fewer computational hardware resources. Indeed, the proposed model reduces hardware resources by more than 41% compared to the baseline model.
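For reference, the three activation functions named above have simple closed forms: ReLU(x) = max(0, x), Mish(x) = x·tanh(ln(1 + e^x)), and TanhExp(x) = x·tanh(e^x). The sketch below evaluates them in floating point. The saturation threshold of 20 is an assumption added here to avoid overflow in `math.exp`, and nothing in it reflects how the FPGA implementation approximates these functions.

```python
import math

SATURATION = 20.0  # assumed cut-off; beyond it tanh(...) is 1.0 to double precision

def relu(x):
    """Rectified linear unit."""
    return max(0.0, x)

def mish(x):
    """Mish: x * tanh(softplus(x)), with softplus(x) = ln(1 + e^x)."""
    if x > SATURATION:
        return x  # softplus(x) is about x and tanh(x) rounds to 1.0
    return x * math.tanh(math.log1p(math.exp(x)))

def tanhexp(x):
    """TanhExp: x * tanh(e^x)."""
    if x > SATURATION:
        return x  # tanh(e^x) rounds to 1.0
    return x * math.tanh(math.exp(x))

if __name__ == "__main__":
    for x in (-2.0, -0.5, 0.0, 1.0, 3.0):
        print(f"x={x:+.1f}  relu={relu(x):.4f}  mish={mish(x):.4f}  tanhexp={tanhexp(x):.4f}")
```

Unlike ReLU, both Mish and TanhExp need an exponential and a hyperbolic tangent, which is why supporting them in hardware is a design decision worth stating.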

A Quantized CNN-Based Microfluidic Lensless-Sensing Mobile Blood-Acquisition and Analysis System

Sensors ◽

10.3390/s19235103 ◽

2019 ◽

Vol 19 (23) ◽

pp. 5103 ◽

Cited By ~ 1

Author(s):

Liao ◽

Yu ◽

Tian ◽

Li ◽

Keyword(s):

Neural Network ◽

Group Structure ◽

Image Acquisition ◽

Cell Segmentation ◽

Floating Point ◽

Prototype System ◽

Processing Efficiency ◽

Field Programmable ◽

Programmable Gate Arrays ◽

Analysis System

This paper proposes a microfluidic lensless-sensing mobile blood-acquisition and analysis system. For a better tradeoff between accuracy and hardware cost, an integer-only quantization algorithm is proposed. Compared with floating-point inference, the proposed quantization algorithm makes a tradeoff that enables miniaturization while maintaining high accuracy. The quantization algorithm allows the convolutional neural network (CNN) inference to be carried out using integer arithmetic and facilitates hardware implementation with area and power savings. A dual configuration register group structure is also proposed to reduce the interval idle time between every neural network layer in order to improve the CNN processing efficiency. We designed a CNN accelerator architecture for the integer-only quantization algorithm and the dual configuration register group and implemented them in field-programmable gate arrays (FPGA). A microfluidic chip and mobile lensless sensing cell image acquisition device were also developed, then combined with the CNN accelerator to build the mobile lensless microfluidic blood image-acquisition and analysis prototype system. We applied the cell segmentation and cell classification CNN in the system and the classification accuracy reached 98.44%. Compared with the floating-point method, the accuracy dropped by only 0.56%, but the area decreased by 45%. When the system is implemented with the maximum frequency of 100 MHz in the FPGA, a classification speed of 17.9 frames per second (fps) can be obtained. The results show that the quantized CNN microfluidic lensless-sensing blood-acquisition and analysis system fully meets the needs of current portable medical devices, and is conducive to promoting the transformation of artificial intelligence (AI)-based blood cell acquisition and analysis work from large servers to portable cell analysis devices, facilitating rapid early analysis of diseases.
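The abstract does not spell out the integer-only quantization algorithm, so the following is only a generic sketch of the underlying idea, assuming symmetric 8-bit quantization with one scale per tensor. Values are mapped to small integers, the multiply-accumulate runs entirely on integers, and a single rescale recovers the real-valued result. All names are hypothetical, and the final rescale is done in floating point here purely for readability.

```python
def choose_scale(values, levels=127):
    """Pick a scale so the largest magnitude maps to the largest integer level."""
    peak = max(abs(v) for v in values)
    return peak / levels if peak else 1.0

def quantize(values, scale, levels=127):
    """Map real values to signed integers in [-levels, levels]."""
    return [max(-levels, min(levels, round(v / scale))) for v in values]

def integer_dot(q_a, q_b):
    """Multiply-accumulate using integer arithmetic only."""
    return sum(a * b for a, b in zip(q_a, q_b))

def quantized_dot(weights, activations):
    """Approximate a real dot product through 8-bit integer operands."""
    w_scale = choose_scale(weights)
    a_scale = choose_scale(activations)
    acc = integer_dot(quantize(weights, w_scale), quantize(activations, a_scale))
    return acc * w_scale * a_scale  # one rescale at the end

if __name__ == "__main__":
    weights = [0.5, -1.0, 0.25]
    activations = [1.0, 2.0, -4.0]
    exact = sum(w * a for w, a in zip(weights, activations))
    print("exact:", exact, "quantized:", quantized_dot(weights, activations))
```

The small gap between the exact and quantized results is the kind of accuracy cost that is traded against the area and power savings reported above.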
