Implementation of FFT on General-Purpose Architectures for FPGA

This paper describes two general-purpose architectures targeted to Field Programmable Gate Array (FPGA) implementation. The first architecture is based on the coupling of a coarse-grain reconfigurable array with a general-purpose processor core. The second architecture is a homogeneous multi-processor system-on-chip (MP-SoC). Both architectures have been mapped onto two different Altera FPGA devices, a StratixII and a StratixIV. Although mapping onto the StratixIV results in higher operating frequencies, the capabilities of the device are not fully exploited. The implementation of an FFT on the two platforms shows a considerable speed-up in comparison with a single-processor reference architecture. The speed-up is higher in the reconfigurable solution, but the MP-SoC provides an easier programming interface that is completely based on the C language. The authors’ approach proves that implementing a programmable architecture on FPGA and then programming it using a high-level software language is a viable alternative to designing a dedicated hardware block with a hardware description language (HDL) and mapping it on FPGA.
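For readers unfamiliar with the kernel being benchmarked, the following is a minimal sketch of a radix-2 decimation-in-time FFT. It is a generic textbook formulation, not the authors' implementation; the function name `fft_radix2` is chosen here for illustration, and the input length is assumed to be a power of two.

```python
import cmath


def fft_radix2(x):
    """Recursive radix-2 decimation-in-time FFT (illustrative sketch).

    Assumes len(x) is a power of two; returns a list of complex values.
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a non-zero power of two")
    if n == 1:
        return [complex(x[0])]
    # Split into even- and odd-indexed samples and transform each half.
    even = fft_radix2(x[0::2])
    odd = fft_radix2(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        # Butterfly: combine the two half-size results with a twiddle factor.
        t = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out
```

The butterflies inside the loop are independent of one another, which is the property that generally makes the FFT a good candidate for parallel execution on several processing elements.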

International Journal of Embedded and Real-Time Communication Systems ◽

10.4018/jertcs.2010070102 ◽

2010 ◽

Vol 1 (3) ◽

pp. 24-43

Author(s):

Fabio Garzia ◽

Roberto Airoldi ◽

Jari Nurmi

Keyword(s):

General Purpose ◽

Reference Architecture ◽

Processor Core ◽

General Purpose Processor ◽

Programmable Architecture ◽

Field Programmable ◽

Speed Up ◽

Hardware Description ◽

On Chip ◽

High Level

SoC-FPGA systems for the acquisition and processing of electroencephalographic signals

International Journal of Reconfigurable and Embedded Systems (IJRES) ◽

10.11591/ijres.v10.i3.pp237-248 ◽

2021 ◽

Vol 10 (3) ◽

pp. 237

Author(s):

Matias Javier Oliva ◽

Pablo Andrés García ◽

Enrique Mario Spinelli ◽

Alejandro Luis Veiga

Keyword(s):

Embedded System ◽

Real Time ◽

General Purpose ◽

System Response ◽

Single Chip ◽

Real Time Processing ◽

General Purpose Processor ◽

Time Operation ◽

Electroencephalographic Signals ◽

High Level

Real-time acquisition and processing of electroencephalographic signals have promising applications in the implementation of brain-computer interfaces. These devices allow the user to control a device without performing motor actions, and are usually made up of a biopotential acquisition stage and a personal computer (PC). This structure is very flexible and appropriate for research, but for final users it is necessary to migrate to an embedded system, eliminating the PC from the scheme. The strict real-time processing requirements of such systems justify the choice of a system-on-chip field-programmable gate array (SoC-FPGA) for their implementation. This article proposes a platform for the acquisition and processing of electroencephalographic signals using this type of device, which combines the parallelism and speed of an FPGA with the simplicity of a general-purpose processor on a single chip. In this scheme, the FPGA is in charge of the real-time operation, acquiring and processing the signals, while the processor handles the high-level tasks, and the processing elements are interconnected by buses integrated into the chip. The proposed scheme was used to implement a brain-computer interface based on steady-state visual evoked potentials, which was used to command a speller. The first tests of the system show that a selection time of 5 seconds per command can be achieved. The time delay between the user’s selection and the system response has been estimated at 343 µs.

An Empirical Investigation on System and Statement Level Parallelism Strategies for Accelerating Scatter Search Using Handel-C and Impulse-C

VLSI Design ◽

10.1155/2012/793196 ◽

2012 ◽

Vol 2012 ◽

pp. 1-11

Author(s):

M. Walton ◽

O. Ahmed ◽

G. Grewal ◽

S. Areibi

Keyword(s):

Optimization Problems ◽

Scatter Search ◽

Population Based ◽

Field Programmable ◽

Speed Up ◽

Time Required ◽

High Level ◽

Level Parallelism ◽

Code Optimizations ◽

Established Population

Scatter Search is an effective and established population-based metaheuristic that has been used to solve a variety of hard optimization problems. However, the time required to find high-quality solutions can become prohibitive as problem sizes grow. In this paper, we present a hardware implementation of Scatter Search on a field-programmable gate array (FPGA). Our objective is to improve the run time of Scatter Search by exploiting the potentially massive performance benefits that are available through the native parallelism in hardware. When implementing Scatter Search we employ two different high-level languages (HLLs): Handel-C and Impulse-C. Our empirical results show that by effectively exploiting source-code optimizations, data parallelism, and pipelining, a 28x speed-up over software can be achieved.

An Auto-Programming Approach to Vulkan

10.20948/graphicon-2021-3027-150-165 ◽

2021 ◽

Author(s):

Vladimir Alexandrovich Frolov ◽

Vadim Sanzharov ◽

Vladimir Alexandrovich Galaktionov ◽

Alexandr Scherbakov

Keyword(s):

Performance Studies ◽

General Purpose ◽

Software Implementation ◽

Programming Approach ◽

Fine Grained ◽

Speed Up ◽

Cross Platform ◽

Increase Productivity ◽

And Performance ◽

High Level

We propose a novel high-level approach for software development on GPU using the Vulkan API. Our goal is to speed up development and performance studies for complex algorithms on GPU, which is quite difficult and laborious with Vulkan due to the large number of hardware features and low-level details. The proposed approach uses auto programming to translate ordinary C++ into an optimized Vulkan implementation with automatic shader generation, resource binding and fine-grained barrier placement. Our model is not general-purpose programming, but it is extendible and customer-focused. For a single C++ input, our tool can generate multiple different implementations of an algorithm in Vulkan for different cases or types of hardware. For example, we automatically detect reduction in C++ source code and then generate several variants of parallel reduction on GPU: with optimization for different warp sizes, with or without atomics, with or without subgroup operations. Another example is GPU ray tracing applications, for which we can generate different variants: a pure software implementation in a compute shader, hardware-accelerated ray queries, or the full RTX pipeline. The goal of our work is to increase the productivity of developers who are forced to use Vulkan because of hardware features required by their software, but who still care about the cross-platform ability of the developed software and want to debug their algorithm logic on the CPU. Therefore, we assume that the user will take the generated code and integrate it with hand-written Vulkan code.

HLS Based Approach to Develop an Implementable HDR Algorithm

Electronics ◽

10.3390/electronics7110332 ◽

2018 ◽

Vol 7 (11) ◽

pp. 332 ◽

Cited By ~ 1

Author(s):

Rappy Saha ◽

Partha Banik ◽

Ki-Doo Kim

Keyword(s):

Dynamic Range ◽

Signal To Noise Ratio ◽

Simple Algorithm ◽

Structural Similarity ◽

High Dynamic Range ◽

Field Programmable ◽

Hardware Description ◽

On Chip ◽

High Level ◽

Removal Technique

The hardware suitability of an algorithm can only be verified when the algorithm is actually implemented in hardware. By hardware, we mean a system on chip (SoC) where both a processor and a field-programmable gate array (FPGA) are available. Our goal is to develop a simple algorithm that can be implemented on hardware, where high-level synthesis (HLS) reduces the tiresome work of manual hardware description language (HDL) optimization. We propose an algorithm to obtain a high dynamic range (HDR) image from a single low dynamic range (LDR) image, using a highlight removal technique for this purpose. Our target is to develop a parameter-free, simple algorithm that can be easily implemented on hardware. For this purpose, we use statistical information of the image. While the software development is verified against the state of the art, the HLS approach confirms that the proposed algorithm is implementable in hardware. The performance of the algorithm is measured using four no-reference metrics. According to the structural similarity (SSIM) index metric and the peak signal-to-noise ratio (PSNR), the hardware-simulated output is at least 98.87 percent and 39.90 dB similar to the software-simulated output. Our approach is novel and effective in the development of a hardware-implementable HDR algorithm from a single LDR image using the HLS tool.
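As an illustration of how a PSNR figure such as the one quoted above is typically computed, here is a minimal sketch using the standard definition, PSNR = 10 log10(peak² / MSE). It is not taken from the paper; the function name `psnr`, the flat-list image representation and the 8-bit peak value of 255 are assumptions made for this example.

```python
import math


def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists.

    `peak` is the maximum possible pixel value (255 assumed for 8-bit data).
    """
    if not reference or len(reference) != len(test):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error between the two images.
    mse = sum((a - b) ** 2 for a, b in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak ** 2 / mse)
```

In a hardware/software comparison of this kind, `reference` would be the software-simulated output and `test` the hardware-simulated output, both flattened to one value per pixel.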

Continuous Gravitational-Wave Data Analysis with General Purpose Computing on Graphic Processing Units

Universe ◽

10.3390/universe7070218 ◽

2021 ◽

Vol 7 (7) ◽

pp. 218

Author(s):

Iuri La Rosa ◽

Pia Astone ◽

Sabrina D’Antonio ◽

Sergio Frasca ◽

Paola Leaci ◽

...

Keyword(s):

Data Analysis ◽

General Purpose ◽

Gpu Programming ◽

Computational Power ◽

Graphic Processing Units ◽

New Approach ◽

Multicore System ◽

Speed Up ◽

High Level ◽

Graphic Processing

We present a new approach to searching for continuous gravitational waves (CWs) emitted by isolated rotating neutron stars, using the high parallel computing efficiency and computational power of modern Graphic Processing Units (GPUs). Specifically, this paper describes the porting of one of the algorithms used to search for CW signals, the so-called FrequencyHough transform, to the TensorFlow framework. The new code has been fully tested, and its performance on GPUs has been compared to that on a CPU multicore system of the same class, showing a factor of 10 speed-up. This demonstrates that GPU programming with general-purpose libraries (those of the TensorFlow framework) of a high-level programming language can provide a significant improvement in the performance of data analysis, opening new perspectives on wide-parameter searches for CWs.

FPGA–Based Efficient Hardware/Software Co–Design for Industrial Systems with Consideration of Output Selection

Journal of Electrical Engineering ◽

10.1515/jee-2016-0022 ◽

2016 ◽

Vol 67 (3) ◽

pp. 150-159 ◽

Cited By ~ 1

Author(s):

Kyriakos M. Deliparaschos ◽

Konstantinos Michail ◽

Argyrios C. Zolotas ◽

Spyros G. Tzafestas

Keyword(s):

System Modeling ◽

Robustness Analysis ◽

Sensor Selection ◽

Linear Quadratic ◽

Industrial Systems ◽

Field Programmable ◽

Speed Up ◽

Hardware Description ◽

Selection Framework ◽

High Level

This work presents a field programmable gate array (FPGA)-based embedded software platform coupled with a software-based plant, forming a hardware-in-the-loop (HIL) setup that is used to validate a systematic sensor selection framework. The systematic sensor selection framework combines multi-objective optimization, linear-quadratic-Gaussian (LQG)-type control, and the nonlinear model of a maglev suspension. A robustness analysis of the closed loop is carried out (prior to implementation), supporting the appropriateness of the solution under parametric variation. The analysis also shows that quantization is robust under different controller gains. While the LQG controller is implemented on an FPGA, the physical process is realized in a high-level system modeling environment. FPGA technology enables rapid evaluation of the algorithms and test designs under realistic scenarios, avoiding the heavy time penalty associated with hardware description language (HDL) simulators. The HIL technique facilitates a significant speed-up in the required execution time when compared to its software-based counterpart model.

An Efficient FPGA Implementation of Richardson-Lucy Deconvolution Algorithm for Hyperspectral Images

Electronics ◽

10.3390/electronics10040504 ◽

2021 ◽

Vol 10 (4) ◽

pp. 504

Author(s):

Karine Avagian ◽

Milica Orlandić

Keyword(s):

State Of The Art ◽

Hyperspectral Images ◽

Image Size ◽

Spectral Bands ◽

Deconvolution Algorithm ◽

Spread Function ◽

Field Programmable ◽

Speed Up ◽

On Chip ◽

The Individual

This paper proposes an implementation of a Richardson-Lucy (RL) deconvolution method to reduce the spatial degradation in hyperspectral images during the image acquisition process. The degradation, modeled by convolution with a point spread function (PSF), is reduced by applying both standard and accelerated RL deconvolution algorithms on the individual images in spectral bands. Boundary conditions are introduced to maintain a constant image size without distorting the estimated image boundaries. The RL deconvolution algorithm is implemented on a field-programmable gate array (FPGA)-based Xilinx Zynq-7020 System-on-Chip (SoC). The proposed architecture is parameterized with respect to the image size and configurable with respect to the algorithm variant, the number of iterations, and the kernel size by setting the dedicated configuration registers. Speed-ups by factors of 61 and 21 are reported compared to software-only and FPGA-based state-of-the-art implementations, respectively.
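The standard RL iteration referred to above can be sketched in one dimension as follows. This is a generic illustration, not the paper's architecture: the helper names, the flat initial estimate, the zero-padded boundary handling and the `eps` guard against division by zero are all assumptions made for this example, and the paper's own boundary conditions and accelerated variant are not reproduced.

```python
def convolve_same(signal, kernel):
    """Same-size 1D convolution with zero padding (odd kernel length assumed)."""
    half = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + half - j
            if 0 <= idx < n:
                acc += w * signal[idx]
        out.append(acc)
    return out


def richardson_lucy(observed, psf, iterations=20, eps=1e-12):
    """Standard multiplicative Richardson-Lucy update on a 1D signal."""
    # Start from a flat, strictly positive estimate (an assumption of this sketch).
    estimate = [max(sum(observed) / len(observed), eps)] * len(observed)
    psf_flipped = psf[::-1]
    for _ in range(iterations):
        # Re-blur the current estimate with the PSF.
        blurred = convolve_same(estimate, psf)
        # Ratio of the observation to the re-blurred estimate.
        ratio = [d / max(b, eps) for d, b in zip(observed, blurred)]
        # Correlate the ratio with the PSF and apply it multiplicatively.
        correction = convolve_same(ratio, psf_flipped)
        estimate = [u * c for u, c in zip(estimate, correction)]
    return estimate
```

Each iteration costs two convolutions plus element-wise operations, which is why the number of iterations and the kernel size are natural configuration parameters for a hardware implementation.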

Fast FPGA Implementation for Computing the Pixel Purity Index of Hyperspectral Images

Journal of Circuits System and Computers ◽

10.1142/s0218126618500457 ◽

2017 ◽

Vol 27 (03) ◽

pp. 1850045 ◽

Cited By ~ 3

Author(s):

Jie Guo ◽

Yunsong Li ◽

Kai Liu ◽

Jie Lei ◽

Keyan Wang

Keyword(s):

Hyperspectral Image ◽

Fpga Implementation ◽

Endmember Extraction ◽

Overall Design ◽

Real Time Analysis ◽

Sensing Applications ◽

Field Programmable ◽

On Chip ◽

High Level ◽

Fast Field

The pixel purity index (PPI) algorithm is one of the most popular endmember extraction algorithms employed in hyperspectral image unmixing, but it is too time-consuming for real-time analysis in remote sensing applications. A fast field-programmable gate array (FPGA) implementation for computing the PPI is proposed in this work. The parallel strategy by skewers consumes less I/O bandwidth and on-chip memory capacity, and the Xilinx Vivado high-level-synthesis (HLS) tool speeds up our architecture design and implementation. The overall design is simple to implement, and makes the FPGA hardware appealing for on-board hyperspectral unmixing.
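To make the role of the skewers concrete, the following is a minimal sketch of the commonly described PPI procedure: each pixel is projected onto many random directions (skewers), and a pixel's score counts how often it lands at an extreme of a projection. It is not the paper's design; the function name, the Gaussian skewer generation and the default parameters are assumptions made for this example.

```python
import random


def pixel_purity_index(pixels, num_skewers=1000, seed=0):
    """Count how often each pixel is an extreme point along a random skewer.

    `pixels` is a list of equal-length spectral vectors (one per pixel).
    """
    rng = random.Random(seed)
    bands = len(pixels[0])
    counts = [0] * len(pixels)
    for _ in range(num_skewers):
        # Random direction; normalization is skipped because only the
        # ordering of the projections matters.
        skewer = [rng.gauss(0.0, 1.0) for _ in range(bands)]
        projections = [sum(p * s for p, s in zip(pixel, skewer)) for pixel in pixels]
        # The pixels at both ends of the projection are scored.
        counts[projections.index(max(projections))] += 1
        counts[projections.index(min(projections))] += 1
    return counts
```

Every skewer is processed independently of the others, so the outer loop can be distributed across parallel units, which is consistent with the skewer-parallel strategy mentioned in the abstract.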
