Novel Dynamic Partial Reconfiguration Implementation of K-Means Clustering on FPGAs: Comparative Results with GPPs and GPUs

2012 ◽  
Vol 2012 ◽  
pp. 1-15 ◽  
Author(s):  
Hanaa M. Hussain ◽  
Khaled Benkrid ◽  
Ali Ebrahim ◽  
Ahmet T. Erdogan ◽  
Huseyin Seker

K-means clustering has been widely used in processing large datasets in many fields of study. Advances in data collection techniques have been generating enormous amounts of data, leaving scientists with the challenging task of processing them. Using General Purpose Processors (GPPs) to process large datasets may take a long time; therefore, many acceleration methods have been proposed in the literature to speed up the processing of such large datasets. In this work, a parameterized implementation of the K-means clustering algorithm on a Field Programmable Gate Array (FPGA) is presented and compared with previous FPGA implementations as well as recent implementations on Graphics Processing Units (GPUs) and GPPs. The proposed FPGA implementation achieves higher performance in terms of speedup over previous GPP and GPU implementations (two orders and one order of magnitude, respectively). In addition, the FPGA implementation is more energy efficient than the GPP and GPU implementations (615x and 31x, respectively). Furthermore, three novel implementations of K-means clustering based on dynamic partial reconfiguration (DPR) are presented, offering a high degree of flexibility to dynamically reconfigure the FPGA. The DPR implementations achieved speedups in reconfiguration time of between 4x and 15x.
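
For reference, the sketch below is a minimal NumPy rendering of the K-means iteration being accelerated: an assignment step (the distance kernel that dominates runtime and maps naturally onto parallel hardware) followed by a centroid update. It is an illustrative sketch only; the paper's fixed-point FPGA datapath and parameterization are not reproduced here.

```python
import numpy as np

def kmeans(points, k, iters=100, seed=0):
    # Illustrative software sketch, not the paper's hardware datapath.
    points = np.asarray(points, dtype=float)
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        # Assignment step: nearest centroid per point; this is the
        # distance kernel that FPGA/GPU implementations parallelize.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid becomes the mean of its cluster.
        for j in range(k):
            members = points[labels == j]
            if len(members):
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```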

Processes ◽  
2020 ◽  
Vol 8 (9) ◽  
pp. 1199
Author(s):  
Ravie Chandren Muniyandi ◽  
Ali Maroosi

Long-timescale simulations of biological processes such as photosynthesis, or attempts to solve NP-hard problems such as the traveling salesman, knapsack, Hamiltonian path, and satisfiability problems, using membrane systems without appropriate parallelization can take hours or days. Graphics processing units (GPUs) deliver an immensely parallel mechanism for general-purpose computation. Previous studies mapped one membrane to one thread block on the GPU. This is disadvantageous because when the number of objects per membrane is small, the number of active threads will also be small, decreasing performance. Moreover, when each membrane is assigned to one thread block, communication between membranes requires communication between thread blocks, which is a time-consuming process. Previous approaches have also not addressed the issue of GPU occupancy. This study presents a classification algorithm that manages dependent objects and membranes based on the communication rate associated with a defined weighted network and assigns them to sub-matrices. Thus, dependent objects and membranes are allocated to the same threads and thread blocks, decreasing communication between threads and thread blocks and allowing the GPU to maintain the highest occupancy possible. The experimental results indicate that, for 48 objects per membrane, the algorithm facilitates a 93-fold increase in processing speed compared to a 1.6-fold increase with previous algorithms.
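
To make the mapping idea concrete, the following is a hedged sketch of how membranes could be grouped by communication rate so that heavily communicating membranes land in the same thread block. The greedy merge below is illustrative, not the paper's exact classification algorithm, and all names are placeholders.

```python
def group_membranes(n, comm_rate, max_group):
    # comm_rate[(i, j)]: communication weight between membranes i and j
    # (the "weighted network" of the abstract, modelled as a dict).
    groups = [{i} for i in range(n)]
    # Visit the heaviest edges first so the strongest dependencies
    # are co-located before capacity runs out.
    for (i, j), _ in sorted(comm_rate.items(), key=lambda kv: -kv[1]):
        gi = next(g for g in groups if i in g)
        gj = next(g for g in groups if j in g)
        # Merge the two groups if a thread block can still hold them,
        # so their objects share threads and avoid inter-block traffic.
        if gi is not gj and len(gi) + len(gj) <= max_group:
            groups.remove(gj)
            gi |= gj
    return groups
```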


Computation ◽  
2020 ◽  
Vol 8 (2) ◽  
pp. 50
Author(s):  
Stephan Lenz ◽  
Martin Geier ◽  
Manfred Krafczyk

The simulation of fire is a challenging task due to its occurrence on multiple space-time scales and the non-linear interaction of multiple physical processes. Current state-of-the-art software such as the Fire Dynamics Simulator (FDS) implements most of the required physics, yet a significant drawback of this implementation is its limited scalability on modern massively parallel hardware. The current paper presents a massively parallel implementation of a Gas Kinetic Scheme (GKS) on General Purpose Graphics Processing Units (GPGPUs) as a potential alternative modeling and simulation approach. The implementation is validated for turbulent natural convection against experimental data. Subsequently, it is validated for two simulations of fire plumes, including a small-scale tabletop setup and a fire on the scale of a few meters. We show that the present GKS achieves accuracy comparable to the results obtained by FDS. Yet, due to the parallel efficiency on dedicated hardware, our GKS implementation delivers a reduction in wall-clock time of more than an order of magnitude. This paper demonstrates the potential of explicit local schemes in massively parallel environments for the simulation of fire.
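
As a purely generic illustration of why explicit local schemes such as the GKS parallelize so well, the sketch below implements a simple explicit 2-D diffusion step (not the GKS itself): each cell is updated from its immediate neighbours at the previous time step, with no global dependencies, so every cell can be computed concurrently on a GPGPU.

```python
import numpy as np

def explicit_step(u, alpha=0.2):
    # One explicit, local update: u_new depends only on the old values
    # of a cell and its four neighbours, so all cells are independent
    # within a time step -- the property exploited on GPGPUs.
    un = u.copy()
    un[1:-1, 1:-1] = u[1:-1, 1:-1] + alpha * (
        u[:-2, 1:-1] + u[2:, 1:-1] + u[1:-1, :-2] + u[1:-1, 2:]
        - 4 * u[1:-1, 1:-1]
    )
    return un
```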


Author(s):  
Tze Hon Tan ◽  
Chia Yee Ooi ◽  
Muhammad Nadzir Marsono

The recent emergence of 5G networks enables mass deployment of wireless sensors for internet-of-things (IoT) applications. In many cases, IoT sensors in monitoring and data collection applications are required to operate continuously and remain active at all times (24/7) to ensure all data are sampled without loss. Field-programmable gate array (FPGA)-based systems exhibit a balance of processing throughput and datapath flexibility. Specifically, datapath flexibility is acquired from an FPGA-based system architecture that supports dynamic partial reconfiguration. However, a functional update to the device can interrupt application servicing, especially in an FPGA-based system. This paper presents a standalone FPGA-based system architecture that allows remote functional updates without service interruption by adopting a redundancy mechanism in the application datapath. By utilizing dynamic partial reconfiguration, only the datapath being updated is temporarily inactive, while the rest of the circuitry, including the redundant datapath, remains active. Hence, there is no service interruption or downtime when a remote functional update takes place, thanks to the redundant application datapath, which is critical for network and communication systems. The proposed architecture is significant for FPGA-based systems that have little or no tolerance for service interruption.
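
A hypothetical software-side sketch of the hitless update sequence such a redundant datapath enables is shown below; all class and method names are placeholders, not the paper's interface, and the DPR call is a stand-in.

```python
class DatapathPair:
    """Toy model of one active and one redundant (hot standby) datapath."""

    def __init__(self):
        self.active = "A"    # datapath currently carrying traffic
        self.standby = "B"   # redundant copy kept ready

    def route_to(self, path):
        # Divert traffic; the other datapath becomes the standby.
        self.active, self.standby = path, self.active
        print(f"traffic now on datapath {path}")

    def reconfigure(self, path, bitstream):
        # Stand-in for a dynamic partial reconfiguration call: only
        # `path`'s region is rewritten; the rest of the fabric,
        # including the other datapath, keeps running.
        print(f"partially reconfiguring {path} with {bitstream}")

    def hitless_update(self, bitstream):
        idle = self.standby
        self.reconfigure(idle, bitstream)           # update the idle copy first
        self.route_to(idle)                         # swap traffic onto new logic
        self.reconfigure(self.standby, bitstream)   # then refresh the old copy
```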


2011 ◽  
Vol 2011 ◽  
pp. 1-25 ◽  
Author(s):  
R. Al-Haddad ◽  
R. Oreifej ◽  
R. A. Ashraf ◽  
R. F. DeMara

As reconfigurable devices' capacities and the complexity of applications that use them increase, the need for self-reliance of deployed systems becomes increasingly prominent. Organic computing paradigms have been proposed for fault-tolerant systems because they promote behaviors that allow complex digital systems to adapt and survive in demanding environments. In this paper, we develop a sustainable modular adaptive redundancy technique (SMART) composed of a two-layered organic system. The hardware layer is implemented on a Xilinx Virtex-4 Field Programmable Gate Array (FPGA) to provide self-repair using a novel approach called reconfigurable adaptive redundancy system (RARS). The software layer supervises the organic activities on the FPGA and extends the self-healing capabilities through application-independent, intrinsic, and evolutionary repair techniques that leverage the benefits of dynamic partial reconfiguration (PR). SMART was evaluated using a Sobel edge-detection application and was shown to tolerate stressful sequences of injected transient and permanent faults while reducing dynamic power consumption by 30% compared to conventional triple modular redundancy (TMR) techniques, with nominal impact on the fault-tolerance capabilities. Moreover, PR is employed to keep the system online while under repair and also to reduce repair time. Experiments have shown a 27.48% decrease in repair time when PR is employed compared to the full-bitstream configuration case.
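
For context, the classic building block of the TMR baseline that SMART is compared against is a majority voter; a minimal bitwise version is sketched below (illustrative only).

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    # Bitwise majority: each output bit agrees with at least two of the
    # three replicas, so a fault in any single module is masked.
    return (a & b) | (a & c) | (b & c)

# One replica returns a corrupted value; the vote still recovers 0b1010.
assert tmr_vote(0b1010, 0b1010, 0b0011) == 0b1010
```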


Author(s):  
Wei-Wen Lin ◽  
Jih-Sheng Shen ◽  
Pao-Ann Hsiung

With the progress of technology, more and more intellectual properties (IPs) can be integrated into a single chip. The performance bottleneck has shifted from the computation in individual IPs to the communication among IPs. The Network-on-Chip (NoC) was proposed to provide high scalability and parallel communication. An ASIC-implemented NoC lacks flexibility and has a high non-recurring engineering (NRE) cost. As an alternative, an NoC can be implemented in a Field Programmable Gate Array (FPGA). In addition, FPGA devices support dynamic partial reconfiguration, such that hardware circuits can be configured into the FPGA at run time when necessary, without interfering with hardware circuits that are already running. Such an FPGA-based NoC, namely a reconfigurable NoC (RNoC), is more flexible, and its NRE cost is also much lower than that of an ASIC-based NoC. Because of dynamic partial reconfiguration, several issues arise in RNoC design. We focus on how communication between hardware and software can be made efficient for the RNoC. We implement three communication architectures for the RNoC: a single-output FIFO-based architecture, a multiple-output FIFO-based architecture, and a shared-memory-based architecture. The average communication memory overhead is lower for the single-output FIFO-based and shared-memory-based architectures than for the multiple-output FIFO-based architecture when the lifetime interval is smaller than 0.5. In the performance analysis, some real applications are applied. These examples show that the multiple-output FIFO-based architecture outperforms the single-output FIFO-based architecture by as much as 1.789 times, and the shared-memory-based architecture outperforms the single-output FIFO-based architecture by as much as 1.748 times.
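
The following toy model illustrates the single-output-FIFO scheme's trade-off: all hardware tasks funnel into one software-visible queue, so each message carries a task tag that software must demultiplex. The class and method names are placeholders, not the paper's interface.

```python
from collections import deque

class SingleOutputFifo:
    """Toy model of hardware-to-software communication over one shared FIFO."""

    def __init__(self):
        self.fifo = deque()

    def hw_push(self, task_id, payload):
        # Every hardware task writes into the same queue, which keeps
        # memory overhead low but serializes all output traffic.
        self.fifo.append((task_id, payload))

    def sw_pop(self):
        # Software inspects the tag to learn which task produced the data;
        # a multiple-FIFO design avoids this demux at the cost of one
        # queue (and its memory) per task.
        return self.fifo.popleft() if self.fifo else None
```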


Author(s):  
Noopur Astik

Dynamic partial reconfiguration has evolved into a prominent state-of-the-art technique for efficient area utilization of Field Programmable Gate Arrays (FPGAs), as well as for significant reduction of their overall power consumption when properly used to reduce idle logic on the FPGA. It provides the desired results even as computational complexity increases in the field of Digital Signal Processing. This paper explains Dynamic Partial Reconfiguration (DPR) with the example of a Finite Impulse Response (FIR) filter of order 10. Initially, RTL code for the Direct-Form FIR structure is written in Verilog in fixed-point format for the low-pass and high-pass filter modules using the ISE Design Suite. The functioning of both modules is verified individually through hardware co-simulation on the ZYBO (Zynq Board) from Digilent using a Black Box from System Generator. Finally, a dynamically partially reconfigurable FIR filter with low-pass and high-pass reconfigurable modules is implemented on the ZYBO using the PlanAhead tool. A final comparison of resource utilization with and without DPR is presented.
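
For reference, a direct-form FIR filter of order 10 (11 taps) computes y[n] = sum_k h[k] x[n-k]. The sketch below is a floating-point software model with placeholder coefficients; it mirrors the structure coded in Verilog, not the paper's fixed-point implementation.

```python
import numpy as np

def fir_direct_form(x, h):
    # Direct-form convolution: y[n] = sum over k of h[k] * x[n - k].
    y = np.zeros(len(x))
    for n in range(len(x)):
        for k in range(len(h)):
            if n - k >= 0:
                y[n] += h[k] * x[n - k]
    return y

# Stand-in 11-tap low-pass response (a moving average), not the
# paper's coefficients; an order-10 filter has 11 taps.
h_lowpass = np.ones(11) / 11.0
```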


Author(s):  
S. M. Ord ◽  
B. Crosse ◽  
D. Emrich ◽  
D. Pallot ◽  
R. B. Wayth ◽  
...  

Abstract: The Murchison Widefield Array (MWA) is a Square Kilometre Array Precursor. The telescope is located at the Murchison Radio-astronomy Observatory (MRO) in Western Australia. The MWA consists of 4096 dipoles arranged into 128 dual-polarisation aperture arrays forming a connected-element interferometer that cross-correlates signals from all 256 inputs. A hybrid approach to the correlation task is employed, with some processing stages performed by bespoke hardware based on Field Programmable Gate Arrays, and others by Graphics Processing Units housed in general-purpose rack-mounted servers. The correlation capability required is approximately 8 tera floating-point operations per second. The MWA has commenced operations, and the correlator is generating 8.3 TB per day of correlation products, which are subsequently transferred 700 km from the MRO to Perth (WA) in real time for storage and offline processing. In this paper, we outline the correlator design, signal path, and processing elements, and present the data format for the internal and external interfaces.
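
As a toy illustration of the cross-correlation ("X") stage, the sketch below forms the 256 x 256 visibility matrix for one frequency channel by cross-multiplying and accumulating every input pair; shapes and names are illustrative and do not follow the MWA data format.

```python
import numpy as np

def correlate(voltages):
    # voltages: complex samples of shape (n_inputs, n_samples) for one channel.
    # visibility[i, j] is the time-averaged product v_i * conj(v_j);
    # this multiply-accumulate over all pairs is what drives the
    # ~8 TFLOP/s requirement quoted above.
    return voltages @ voltages.conj().T / voltages.shape[1]

v = np.random.randn(256, 1024) + 1j * np.random.randn(256, 1024)
vis = correlate(v)   # 256 x 256 Hermitian visibility matrix
```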


Author(s):  
Genoveva Vargas-Solar ◽  
Md Sahil Hassan ◽  
Ali Akoglu

This paper targets the execution of data science (DS) pipelines supported by data processing, transmission, and sharing across several resources executing greedy processes. Current data science pipeline environments provide various infrastructure services with computing resources such as general-purpose processors (GPPs), Graphics Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs), and Tensor Processing Units (TPUs), coupled with platform and software services to design, run, and maintain DS pipelines. These one-size-fits-all solutions impose the complete externalization of data pipeline tasks. However, some tasks can be executed at the edge, and the backend can provide just-in-time resources to ensure ad hoc and elastic execution environments. This paper introduces an innovative composable "Just in Time Architecture" for configuring data centers (DCs) for Data Science Pipelines (JITA-4DS) and associated resource management techniques. JITA-4DS is a cross-layer management system that is aware of both the application characteristics and the underlying infrastructure, breaking the barriers between the application, middleware/operating system, and hardware layers. Vertical integration of these layers is needed to build a customizable Virtual Data Center (VDC) that meets the dynamically changing requirements of data science pipelines, such as performance, availability, and energy consumption. Accordingly, the paper presents an experimental simulation devoted to running data science workloads and determining the best strategies for scheduling the allocation of resources implemented by JITA-4DS.
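
As a loose illustration of the kind of allocation decision a VDC scheduler faces, the sketch below greedily places each pipeline task on the resource type with the lowest estimated runtime. It is entirely hypothetical and does not reflect JITA-4DS's actual policies, which the paper evaluates by simulation.

```python
def schedule(tasks, runtime_estimate):
    # runtime_estimate[(task, resource)] -> estimated seconds; a stand-in
    # for whatever cost model the scheduler consults.
    placement = {}
    for task in tasks:
        placement[task] = min(
            ("GPP", "GPU", "FPGA", "TPU"),
            key=lambda r: runtime_estimate[(task, r)],
        )
    return placement
```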

