A GPU-Based Fault Simulator for Small-Delay Faults

In this paper, we explore the implementation of fault simulator for small-delay faults on Graphics Processing Unit (GPU). Nowadays the size of integrated circuit is getting smaller and smaller, the clock frequency has become faster and faster, which leads to the effects of small delay fault on chip and is also increasingly obvious. Small delay simulation has become highly important, it is directly related to the accuracy of product and its time to market. At the same time, small delay simulation is a very time consuming process, which requires constantly looking for ways to accelerate the simulation. In recent years, GPU has been used to accelerate the programs of intensive computation in many areas and has achieved very good results. Based on these two points, we consider combining the parallelism of small delay simulation with the high parallel computing ability of GPU to accelerate small delay simulation. Experimental results indicate that our approach is on average 42 when compared to the traditional fault simulation engine.

Download Full-text

A Tile-Based EGPU with a Fused Universal Processing Engine and Graphics Coprocessor Cluster

Journal of Sensors ◽

10.1155/2016/7281031 ◽

2016 ◽

Vol 2016 ◽

pp. 1-9

Author(s):

Yang Wang ◽

Li Zhou ◽

Tao Sun ◽

Yanhu Chen ◽

Lei Wang ◽

...

Keyword(s):

Power Consumption ◽

Graphics Processing Unit ◽

General Purpose ◽

Processing Unit ◽

Clock Frequency ◽

Embedded Devices ◽

On Chip ◽

Graphics Processing ◽

Graphics Rendering ◽

Processing Engine

As various applied sensors have been integrated into embedded devices, the Embedded Graphics Processing Unit (EGPU) has assumed more processing tasks, which requires an EGPU with higher performance. A tile-based EGPU is proposed that can be used in both general-purpose computing and 3D graphics rendering. With fused, scalable, and hierarchical parallelism architecture, the EGPU has the ability to address nearly 100 million vertices or fragments and achieves 1 GFLOPS per second at a clock frequency of 200 MHz. A fused and scalable architecture, constituted by Universal Processing Engine (UPE) and Graphics Coprocessor Cluster (GCC), ensures that the EGPU can adapt to various graphic processing scenes and situations, achieving more efficient rendering. Moreover, hierarchical parallelism is implemented via the UPE. Additionally, tiling brings a significant reduction in both system memory bandwidth and power consumption. A 0.18 µm technology library is used for timing and power analysis. The area of the proposed EGPU is 6.5 mm∗6.5 mm, and its power consumption is approximately 349.318 mW. Experimental results demonstrate that the proposed EGPU can be used in a System on Chip (SoC) configuration connected to sensors to accelerate its processing and create a proper balance between performance and cost.

Download Full-text

HIGH PRECISION INTEGER ADDITION, SUBTRACTION AND MULTIPLICATION WITH A GRAPHICS PROCESSING UNIT

Parallel Processing Letters ◽

10.1142/s0129626410000259 ◽

2010 ◽

Vol 20 (04) ◽

pp. 293-306 ◽

Cited By ~ 8

Author(s):

NIALL EMMART ◽

CHARLES WEEMS

Keyword(s):

High Precision ◽

Graphics Processing Unit ◽

Divide And Conquer ◽

Processing Unit ◽

Good Potential ◽

Addition And Subtraction ◽

Integer Multiplication ◽

On Chip ◽

Graphics Processing ◽

Gpu Architecture

In this paper we evaluate the potential for using an NVIDIA graphics processing unit (GPU) to accelerate high precision integer multiplication, addition, and subtraction. The reported peak vector performance for a typical GPU appears to offer good potential for accelerating such a computation. Because of limitations in the on-chip memory, the high cost of kernel launches, and the nature of the architecture's support for parallelism, we used a hybrid algorithmic approach to obtain good performance on multiplication. On the GPU itself we adapt the Strassen FFT algorithm to multiply 32KB chunks, while on the CPU we adapt the Karatsuba divide-and-conquer approach to optimize application of the GPU's partial multiplies, which are viewed as "digits" by our implementation of Karatsuba. Even with this approach, the result is at best a factor of three increase in performance, compared with using the GMP package on a 64-bit CPU at a comparable technology node. Our implementations of addition and subtraction achieve up to a factor of eight improvement. We identify the issues that limit performance and discuss the likely impact of planned advances in GPU architecture.

Download Full-text

An FPGA-Based CNN Accelerator Integrating Depthwise Separable Convolution

Electronics ◽

10.3390/electronics8030281 ◽

2019 ◽

Vol 8 (3) ◽

pp. 281 ◽

Cited By ~ 11

Author(s):

Bing Liu ◽

Danyin Zou ◽

Lei Feng ◽

Shou Feng ◽

Ping Fu ◽

...

Keyword(s):

Graphics Processing Unit ◽

Processing Unit ◽

Network Layers ◽

Ping Pong ◽

Parameter Configuration ◽

Field Programmable ◽

Hardware Resource ◽

Roofline Model ◽

On Chip ◽

Graphics Processing

The Convolutional Neural Network (CNN) has been used in many fields and has achieved remarkable results, such as image classification, face detection, and speech recognition. Compared to GPU (graphics processing unit) and ASIC, a FPGA (field programmable gate array)-based CNN accelerator has great advantages due to its low power consumption and reconfigurable property. However, FPGA’s extremely limited resources and CNN’s huge amount of parameters and computational complexity pose great challenges to the design. Based on the ZYNQ heterogeneous platform and the coordination of resource and bandwidth issues with the roofline model, the CNN accelerator we designed can accelerate both standard convolution and depthwise separable convolution with a high hardware resource rate. The accelerator can handle network layers of different scales through parameter configuration and maximizes bandwidth and achieves full pipelined by using a data stream interface and ping-pong on-chip cache. The experimental results show that the accelerator designed in this paper can achieve 17.11GOPS for 32bit floating point when it can also accelerate depthwise separable convolution, which has obvious advantages compared with other designs.

Download Full-text

Fast iterative solvers for large compressed-sparse row linear systems on graphics processing unit

Pollack Periodica ◽

10.1556/pollack.10.2015.1.1 ◽

2015 ◽

Vol 10 (1) ◽

pp. 3-18 ◽

Cited By ~ 1

Author(s):

Frédéric Magoulès ◽

Abal-Kassim Cheik Ahamed ◽

Roman Putanowicz

Keyword(s):

Linear Systems ◽

Graphics Processing Unit ◽

Iterative Solvers ◽

Processing Unit ◽

Compressed Sparse Row ◽

Graphics Processing

Download Full-text

Performance Analysis and Optimization of Graphics Processing Unit

SSRN Electronic Journal ◽

10.2139/ssrn.3350249 ◽

2019 ◽

Author(s):

Lokendra Singh Umrao ◽

Jay Prakash Pandey

Keyword(s):

Performance Analysis ◽

Graphics Processing Unit ◽

Processing Unit ◽

Graphics Processing

Download Full-text

Implementing wide baseline matching algorithms on a graphics processing unit.

10.2172/921737 ◽

2007 ◽

Author(s):

Fredrick H. Rothganger ◽

Kurt W. Larson ◽

Antonio Ignacio Gonzales ◽

Daniel S. Myers

Keyword(s):

Graphics Processing Unit ◽

Processing Unit ◽

Wide Baseline Matching ◽

Graphics Processing

Download Full-text

Two Decades of 4D-QSAR: A Dying Art or Staging a Comeback?

International Journal of Molecular Sciences ◽

10.3390/ijms22105212 ◽

2021 ◽

Vol 22 (10) ◽

pp. 5212

Author(s):

Andrzej Bak

Keyword(s):

Molecular Conformation ◽

Graphics Processing Unit ◽

Processing Unit ◽

Diverse Range ◽

Current State ◽

Gpu Clusters ◽

Pharmacophore Hypothesis ◽

Rising Power ◽

Graphics Processing ◽

Ligand Conformation

A key question confronting computational chemists concerns the preferable ligand geometry that fits complementarily into the receptor pocket. Typically, the postulated ‘bioactive’ 3D ligand conformation is constructed as a ‘sophisticated guess’ (unnecessarily geometry-optimized) mirroring the pharmacophore hypothesis—sometimes based on an erroneous prerequisite. Hence, 4D-QSAR scheme and its ‘dialects’ have been practically implemented as higher level of model abstraction that allows the examination of the multiple molecular conformation, orientation and protonation representation, respectively. Nearly a quarter of a century has passed since the eminent work of Hopfinger appeared on the stage; therefore the natural question occurs whether 4D-QSAR approach is still appealing to the scientific community? With no intention to be comprehensive, a review of the current state of art in the field of receptor-independent (RI) and receptor-dependent (RD) 4D-QSAR methodology is provided with a brief examination of the ‘mainstream’ algorithms. In fact, a myriad of 4D-QSAR methods have been implemented and applied practically for a diverse range of molecules. It seems that, 4D-QSAR approach has been experiencing a promising renaissance of interests that might be fuelled by the rising power of the graphics processing unit (GPU) clusters applied to full-atom MD-based simulations of the protein-ligand complexes.

Download Full-text

Parallelization of Global Sequence Alignment on Graphics Processing Unit

2020 International Conference on Communications, Computing, Cybersecurity, and Informatics (CCCI) ◽

10.1109/ccci49893.2020.9256747 ◽

2020 ◽

Author(s):

Kailash W. Kalare ◽

Mohammad S. Obaidat ◽

Jitendra V. Tembhurne ◽

Chandrashekhar Meshram ◽

Kuei-Fang Hsiao

Keyword(s):

Sequence Alignment ◽

Graphics Processing Unit ◽

Processing Unit ◽

Graphics Processing

Download Full-text

Graphics processing unit acceleration of the island model genetic algorithm using the CUDA programming platform

Concurrency and Computation Practice and Experience ◽

10.1002/cpe.6286 ◽

2021 ◽

Author(s):

Dylan M. Janssen ◽

Wayne Pullan ◽

Alan Wee‐Chung Liew

Keyword(s):

Genetic Algorithm ◽

Graphics Processing Unit ◽

Island Model ◽

Processing Unit ◽

Cuda Programming ◽

Graphics Processing

Download Full-text

Real-time, High-resolution Depth Upsampling on Embedded Accelerators

ACM Transactions on Embedded Computing Systems ◽

10.1145/3436878 ◽

2021 ◽

Vol 20 (3) ◽

pp. 1-22

Author(s):

David Langerman ◽

Alan George

Keyword(s):

High Resolution ◽

Low Power ◽

Real Time ◽

Mixed Reality ◽

Graphics Processing Unit ◽

Processing Unit ◽

Reconfigurable Logic ◽

Depth Sensors ◽

Time Requirements ◽

Graphics Processing

High-resolution, low-latency apps in computer vision are ubiquitous in today’s world of mixed-reality devices. These innovations provide a platform that can leverage the improving technology of depth sensors and embedded accelerators to enable higher-resolution, lower-latency processing for 3D scenes using depth-upsampling algorithms. This research demonstrates that filter-based upsampling algorithms are feasible for mixed-reality apps using low-power hardware accelerators. The authors parallelized and evaluated a depth-upsampling algorithm on two different devices: a reconfigurable-logic FPGA embedded within a low-power SoC; and a fixed-logic embedded graphics processing unit. We demonstrate that both accelerators can meet the real-time requirements of 11 ms latency for mixed-reality apps. 1

Download Full-text