HIGH PRECISION INTEGER ADDITION, SUBTRACTION AND MULTIPLICATION WITH A GRAPHICS PROCESSING UNIT

2010 ◽  
Vol 20 (04) ◽  
pp. 293-306 ◽  
Author(s):  
NIALL EMMART ◽  
CHARLES WEEMS

In this paper we evaluate the potential for using an NVIDIA graphics processing unit (GPU) to accelerate high precision integer multiplication, addition, and subtraction. The reported peak vector performance for a typical GPU appears to offer good potential for accelerating such a computation. Because of limitations in the on-chip memory, the high cost of kernel launches, and the nature of the architecture's support for parallelism, we used a hybrid algorithmic approach to obtain good performance on multiplication. On the GPU itself we adapt the Strassen FFT algorithm to multiply 32KB chunks, while on the CPU we adapt the Karatsuba divide-and-conquer approach to optimize application of the GPU's partial multiplies, which are viewed as "digits" by our implementation of Karatsuba. Even with this approach, the result is at best a factor of three increase in performance, compared with using the GMP package on a 64-bit CPU at a comparable technology node. Our implementations of addition and subtraction achieve up to a factor of eight improvement. We identify the issues that limit performance and discuss the likely impact of planned advances in GPU architecture.
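
For readers unfamiliar with the divide-and-conquer step mentioned above, the sketch below shows the Karatsuba identity at a single level, using plain 32-bit halves as the "digits"; in the paper the digits are much larger chunks whose partial products come from the GPU's FFT multiplier. This is an illustrative stand-in, not the authors' implementation, and `unsigned __int128` is a GCC/Clang extension.

```cuda
// Hedged, single-level sketch of the Karatsuba idea: one 64x64-bit product
// built from three 32x32-bit products. In the paper the "digits" are much
// larger chunks multiplied on the GPU; here they are 32-bit halves so the
// example stays self-contained.
#include <cstdint>
#include <cstdio>

static unsigned __int128 karatsuba64(uint64_t x, uint64_t y) {
    uint64_t x0 = x & 0xffffffffu, x1 = x >> 32;   // low/high "digits"
    uint64_t y0 = y & 0xffffffffu, y1 = y >> 32;

    uint64_t z0 = x0 * y0;                         // low partial product
    uint64_t z2 = x1 * y1;                         // high partial product
    // (x0+x1)(y0+y1) needs up to 66 bits, so use a 128-bit intermediate.
    unsigned __int128 z1 = (unsigned __int128)(x0 + x1) * (y0 + y1) - z0 - z2;

    return ((unsigned __int128)z2 << 64) + (z1 << 32) + z0;
}

int main() {
    uint64_t a = 0x1234567890abcdefULL, b = 0xfedcba0987654321ULL;
    unsigned __int128 p = karatsuba64(a, b);
    // Print as two 64-bit halves for checking against a reference multiply.
    printf("%016llx%016llx\n",
           (unsigned long long)(p >> 64), (unsigned long long)p);
    return 0;
}
```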

SPE Journal ◽  
2021 ◽  
pp. 1-20
Author(s):  
A. M. Manea ◽  
T. Almani

Summary: In this work, the scalability of two key multiscale solvers for the pressure equation arising from incompressible flow in heterogeneous porous media, namely the multiscale finite volume (MSFV) solver and the restriction-smoothed basis multiscale (MsRSB) solver, is investigated on the massively parallel graphics processing unit (GPU) architecture. The robustness and scalability of both solvers are compared against their corresponding carefully optimized implementations on a shared-memory multicore architecture in a structured problem setting. Although several components of the MSFV and MsRSB algorithms are directly parallelizable, their scalability on the GPU architecture depends heavily on the underlying algorithmic details and data-structure design of every step, where one needs to ensure favorable control and data flow on the GPU while extracting enough parallel work for a massively parallel environment. In addition, the type of algorithm chosen for each step greatly influences the overall robustness of the solver. Thus, we extend the work on the parallel multiscale methods of Manea et al. (2016) to map the MSFV and MsRSB special kernels to the massively parallel GPU architecture. The scalability of our optimized parallel MSFV and MsRSB GPU implementations is demonstrated using highly heterogeneous structured 3D problems derived from the SPE10 Benchmark (Christie and Blunt 2001). Those problems range in size from millions to tens of millions of cells. For both solvers, the multicore implementations are benchmarked on a shared-memory multicore architecture consisting of two Intel® Cascade Lake Xeon Gold 6246 central processing unit (CPU) packages, whereas the GPU implementations are benchmarked on a massively parallel architecture consisting of NVIDIA Volta V100 GPUs. We compare the multicore implementations to the GPU implementations for both the setup and solution stages. Finally, we compare the parallel MsRSB scalability to the scalability of MSFV on the multicore (Manea et al. 2016) and GPU architectures. To the best of our knowledge, this is the first parallel implementation and demonstration of these versatile multiscale solvers on the GPU architecture. NOTE: This paper is published as part of the 2021 SPE Reservoir Simulation Conference Special Issue.
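
As background for the MsRSB solver discussed above, the sketch below illustrates, in a deliberately simplified host-only form, the basic idea behind restriction-smoothed basis functions: block-indicator basis functions are improved by Jacobi smoothing restricted to local support regions, with rows rescaled to preserve the partition of unity. The 1-D Laplacian, dense storage, and all sizes are illustrative assumptions; this is not the authors' implementation.

```cuda
// Hedged sketch of the core MsRSB idea on a 1-D Laplacian with dense
// storage: basis functions start as coarse-block indicators and are
// improved by restricted Jacobi smoothing, rescaling each row so the
// basis functions keep summing to one.
#include <cstdio>
#include <vector>

int main() {
    const int n = 32, nc = 4, blk = n / nc, overlap = 4;
    const double omega = 2.0 / 3.0;
    const int iters = 20;

    // P(i,j): value of basis j at fine cell i; start as block indicators.
    std::vector<double> P(n * nc, 0.0);
    for (int i = 0; i < n; ++i) P[i * nc + i / blk] = 1.0;

    for (int it = 0; it < iters; ++it) {
        std::vector<double> Pnew = P;
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < nc; ++j) {
                // Support of basis j: its block plus `overlap` cells each side.
                int lo = j * blk - overlap, hi = (j + 1) * blk - 1 + overlap;
                if (i < lo || i > hi) continue;
                // (A*P)(i,j) for the 1-D Laplacian, diag(A) = 2.
                double Ap = 2.0 * P[i * nc + j];
                if (i > 0)     Ap -= P[(i - 1) * nc + j];
                if (i < n - 1) Ap -= P[(i + 1) * nc + j];
                Pnew[i * nc + j] -= omega * Ap / 2.0;   // Jacobi update
            }
            // Rescale the row so the basis functions still sum to one.
            double s = 0.0;
            for (int j = 0; j < nc; ++j) s += Pnew[i * nc + j];
            for (int j = 0; j < nc; ++j) Pnew[i * nc + j] /= s;
        }
        P = Pnew;
    }
    for (int i = 0; i < n; ++i)              // print the smoothed basis 0
        printf("%2d %.3f\n", i, P[i * nc + 0]);
    return 0;
}
```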


2013 ◽  
Vol 753-755 ◽  
pp. 2235-2242
Author(s):  
Ming Ming Peng ◽  
Ji Shun Kuang

In this paper, we explore the implementation of a fault simulator for small-delay faults on the Graphics Processing Unit (GPU). As integrated circuits keep shrinking and clock frequencies keep rising, the effects of small-delay faults on a chip become increasingly pronounced. Small-delay fault simulation has therefore become highly important: it is directly related to product accuracy and time to market. At the same time, it is a very time-consuming process, which calls for a constant search for ways to accelerate it. In recent years, GPUs have been used to accelerate computation-intensive programs in many areas and have achieved very good results. Based on these two points, we combine the parallelism of small-delay simulation with the high parallel computing ability of the GPU to accelerate the simulation. Experimental results indicate that our approach achieves an average speedup of 42× over the traditional fault simulation engine.
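
To make the pattern-level parallelism concrete, here is a deliberately simplified CUDA sketch of small-delay fault detection: one thread per test pattern, each checking whether an injected extra delay pushes its sensitized path past the clock period. All names, data layouts, and values are hypothetical and do not reflect the authors' simulator.

```cuda
// Schematic, hypothetical sketch: one thread per test pattern. A pattern
// detects the small-delay fault if it sensitizes the fault site and the
// injected extra delay makes the path violate the clock period.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void smallDelayDetect(const float* pathDelay,  // nominal path delay per pattern
                                 const int*   hitsFault,  // 1 if the path passes the fault site
                                 float extraDelay,        // injected small delay
                                 float clockPeriod,
                                 int*  detected,
                                 int   nPatterns) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nPatterns) return;
    float d = pathDelay[p] + (hitsFault[p] ? extraDelay : 0.0f);
    detected[p] = (hitsFault[p] && d > clockPeriod) ? 1 : 0;
}

int main() {
    const int n = 4;
    float hPath[n] = {0.9f, 1.1f, 1.15f, 0.7f};
    int   hHit[n]  = {1, 1, 0, 1};
    float *dPath; int *dHit, *dDet;
    cudaMalloc(&dPath, n * sizeof(float));
    cudaMalloc(&dHit,  n * sizeof(int));
    cudaMalloc(&dDet,  n * sizeof(int));
    cudaMemcpy(dPath, hPath, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dHit,  hHit,  n * sizeof(int),   cudaMemcpyHostToDevice);
    smallDelayDetect<<<1, 64>>>(dPath, dHit, 0.1f, 1.15f, dDet, n);
    int hDet[n];
    cudaMemcpy(hDet, dDet, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("pattern %d detects: %d\n", i, hDet[i]);
    cudaFree(dPath); cudaFree(dHit); cudaFree(dDet);
    return 0;
}
```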


Author(s):  
David Střelák ◽  
Carlos Óscar S Sorzano ◽  
José María Carazo ◽  
Jiří Filipovič

Cryo-electron microscopy is a popular method for macromolecular structure determination. Reconstruction of a 3-D volume from raw data obtained from a microscope is highly computationally demanding, so accelerating the reconstruction has great practical value. In this article, we introduce a novel graphics processing unit (GPU)-friendly algorithm for direct Fourier reconstruction, one of the main computational bottlenecks in the 3-D volume reconstruction pipeline for some experimental cases (particularly those with a large number of images and a high internal symmetry). Contrary to the state of the art, our algorithm uses a gather memory pattern, improving cache locality and removing race conditions in parallel writes into the 3-D volume. We also introduce a finely tuned CUDA implementation of our algorithm, using auto-tuning to search for the combination of optimization parameters that maximizes performance on a given GPU architecture. Our CUDA implementation is integrated into the widely used Xmipp software, version 3.19, reaching an 11.4× speedup over the original parallel CPU implementation when using a GPU with comparable power consumption. Moreover, we reach a 31.7× speedup using four GPUs, and a 2.14×–5.96× speedup compared to an optimized GPU implementation based on a scatter memory pattern.
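
The gather pattern the authors describe can be illustrated with a toy CUDA kernel: each thread exclusively owns one output cell and loops over the input samples, so no atomic writes or race conditions arise (a scatter version would instead need one atomicAdd per contribution). The 1-D grid, linear weighting, and all names below are simplifying assumptions, not the Xmipp implementation.

```cuda
// Schematic illustration of a gather memory pattern: each thread owns one
// output cell and accumulates every input sample within its window, so the
// single write per cell is race-free.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void gatherAccumulate(const float* samplePos,  // sample coordinates
                                 const float* sampleVal,  // sample values
                                 int nSamples,
                                 float* grid, int gridSize,
                                 float radius) {
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell >= gridSize) return;
    float acc = 0.0f;
    for (int s = 0; s < nSamples; ++s) {
        float d = fabsf(samplePos[s] - (float)cell);
        if (d <= radius) acc += sampleVal[s] * (1.0f - d / radius); // linear kernel
    }
    grid[cell] = acc;   // exclusive owner of this cell: no atomics needed
}

int main() {
    const int nSamples = 3, gridSize = 8;
    float hPos[nSamples] = {1.2f, 3.7f, 3.9f};
    float hVal[nSamples] = {1.0f, 2.0f, 0.5f};
    float *dPos, *dVal, *dGrid;
    cudaMalloc(&dPos, nSamples * sizeof(float));
    cudaMalloc(&dVal, nSamples * sizeof(float));
    cudaMalloc(&dGrid, gridSize * sizeof(float));
    cudaMemcpy(dPos, hPos, nSamples * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dVal, hVal, nSamples * sizeof(float), cudaMemcpyHostToDevice);
    gatherAccumulate<<<1, 32>>>(dPos, dVal, nSamples, dGrid, gridSize, 1.5f);
    float hGrid[gridSize];
    cudaMemcpy(hGrid, dGrid, gridSize * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < gridSize; ++i) printf("cell %d: %.3f\n", i, hGrid[i]);
    cudaFree(dPos); cudaFree(dVal); cudaFree(dGrid);
    return 0;
}
```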


2016 ◽  
Vol 2016 ◽  
pp. 1-9
Author(s):  
Yang Wang ◽  
Li Zhou ◽  
Tao Sun ◽  
Yanhu Chen ◽  
Lei Wang ◽  
...  

As various applied sensors have been integrated into embedded devices, the Embedded Graphics Processing Unit (EGPU) has taken on more processing tasks, which requires an EGPU with higher performance. A tile-based EGPU is proposed that can be used for both general-purpose computing and 3D graphics rendering. With its fused, scalable, and hierarchically parallel architecture, the EGPU has the ability to address nearly 100 million vertices or fragments and achieves 1 GFLOPS at a clock frequency of 200 MHz. The fused and scalable architecture, constituted by the Universal Processing Engine (UPE) and the Graphics Coprocessor Cluster (GCC), ensures that the EGPU can adapt to various graphics processing scenes and situations, achieving more efficient rendering. Moreover, hierarchical parallelism is implemented via the UPE. Additionally, tiling brings a significant reduction in both system memory bandwidth and power consumption. A 0.18 µm technology library is used for timing and power analysis. The area of the proposed EGPU is 6.5 mm × 6.5 mm, and its power consumption is approximately 349.318 mW. Experimental results demonstrate that the proposed EGPU can be used in a System on Chip (SoC) configuration connected to sensors to accelerate its processing and strike a proper balance between performance and cost.
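
To illustrate why tiling reduces bandwidth, the host-side sketch below bins primitives by bounding box into fixed-size screen tiles so that each tile can later be processed entirely from on-chip memory. It is a generic illustration of tile-based rendering under assumed sizes, not the proposed EGPU's hardware pipeline.

```cuda
// Hedged sketch of tile binning: each triangle's screen bounding box is
// mapped conservatively onto 32x32-pixel tiles, producing one primitive
// list per tile that can later be shaded from on-chip storage.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Tri { float x[3], y[3]; };

int main() {
    const int W = 256, H = 256, TILE = 32;
    const int tilesX = W / TILE, tilesY = H / TILE;
    std::vector<Tri> tris = {{{10, 40, 20}, {10, 15, 60}},
                             {{100, 200, 150}, {30, 40, 120}}};
    std::vector<std::vector<int>> bins(tilesX * tilesY);  // one list per tile

    for (int t = 0; t < (int)tris.size(); ++t) {
        const Tri& tr = tris[t];
        float minX = std::min({tr.x[0], tr.x[1], tr.x[2]});
        float maxX = std::max({tr.x[0], tr.x[1], tr.x[2]});
        float minY = std::min({tr.y[0], tr.y[1], tr.y[2]});
        float maxY = std::max({tr.y[0], tr.y[1], tr.y[2]});
        int tx0 = std::max(0, (int)minX / TILE), tx1 = std::min(tilesX - 1, (int)maxX / TILE);
        int ty0 = std::max(0, (int)minY / TILE), ty1 = std::min(tilesY - 1, (int)maxY / TILE);
        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                bins[ty * tilesX + tx].push_back(t);   // conservative binning
    }
    for (int i = 0; i < tilesX * tilesY; ++i)
        if (!bins[i].empty())
            printf("tile %d holds %zu primitive(s)\n", i, bins[i].size());
    return 0;
}
```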


2019 ◽  
Vol 135 ◽  
pp. 01082 ◽  
Author(s):  
Oleg Agibalov ◽  
Nikolay Ventsov

We consider the task of comparing fuzzy estimates of the execution parameters of genetic algorithms implemented on GPU (graphics processing unit) and CPU (central processing unit) architectures. The fuzzy estimates are calculated from the averaged dependences of the genetic algorithms' running time on the GPU and CPU architectures on the number of individuals in the populations processed by the algorithm. Analysis of these averaged dependences showed that a genetic algorithm can process 10,000 chromosomes on the GPU architecture, or 5,000 chromosomes on the CPU architecture, in approximately 2,500 ms. The following holds for the cases under consideration: “genetic algorithms (GA) are performed in approximately 2,500 ms (on average),” and the α-sections of the fuzzy sets, with α = 0.5, correspond to the interval [2000, 2399] ms for the GA executed on the GPU architecture and [1400, 1799] ms for the GA executed on the CPU architecture. Thereby, it can be said that, in this case, the actual execution time of the algorithm on the GPU architecture deviates from the average value to a lesser extent than on the CPU architecture.
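
As a minimal illustration of the fuzzy estimates used above, the sketch below computes an α-cut of a triangular fuzzy running-time estimate: for a membership level α, the cut is the interval of times whose membership is at least α. The membership-function parameters are invented for the example and are not the paper's data.

```cuda
// Minimal sketch of an alpha-cut of a triangular fuzzy number.
#include <cstdio>

struct TriangularFuzzy { double a, b, c; };   // support [a, c], peak at b

// Alpha-cut [lo, hi] of a triangular fuzzy number, 0 < alpha <= 1.
static void alphaCut(TriangularFuzzy f, double alpha, double* lo, double* hi) {
    *lo = f.a + alpha * (f.b - f.a);
    *hi = f.c - alpha * (f.c - f.b);
}

int main() {
    // Illustrative triangular running-time estimate (ms); invented values.
    TriangularFuzzy gpuTime = {1800.0, 2500.0, 3200.0};
    double lo, hi;
    alphaCut(gpuTime, 0.5, &lo, &hi);
    printf("alpha = 0.5 cut: [%.0f, %.0f] ms\n", lo, hi);
    return 0;
}
```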


Author(s):  
Bojan Novak

Random forest ensemble learning with a Graphics Processing Unit (GPU) version of the prefix scan method is presented. The efficiency of the random forest implementation depends critically on the scan (prefix sum) algorithm, which is used in the depth-first implementation of optimal split-point computation. Different implementations of the prefix scan algorithm are described. The speed of an implementation depends on three factors: the algorithm itself (which can be improved), programming skill, and the compiler. In parallel environments, things are even more complicated and depend on the programmer's knowledge of the Central Processing Unit (CPU) or GPU architecture. An efficient parallel scan algorithm that avoids bank conflicts is crucial for the prefix scan implementation. In our tests, a multicore CPU implementation and a GPU implementation based on NVIDIA's CUDA are compared.
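
Since the abstract centres on the prefix scan, here is a minimal single-block work-efficient (Blelloch) exclusive scan in CUDA. It deliberately omits the bank-conflict-avoiding index padding and the multi-block chaining that a production implementation would include, so it is a sketch of the primitive rather than the paper's code.

```cuda
// Hedged sketch: single-block work-efficient (Blelloch) exclusive scan.
#include <cstdio>
#include <cuda_runtime.h>

template <int N>                       // N = power-of-two block size
__global__ void blellochScan(const int* in, int* out) {
    __shared__ int temp[N];
    int tid = threadIdx.x;
    temp[tid] = in[tid];

    // Up-sweep (reduce) phase.
    for (int stride = 1; stride < N; stride *= 2) {
        __syncthreads();
        int idx = (tid + 1) * 2 * stride - 1;
        if (idx < N) temp[idx] += temp[idx - stride];
    }
    // Clear the last element, then down-sweep to build the exclusive scan.
    if (tid == 0) temp[N - 1] = 0;
    for (int stride = N / 2; stride >= 1; stride /= 2) {
        __syncthreads();
        int idx = (tid + 1) * 2 * stride - 1;
        if (idx < N) {
            int t = temp[idx - stride];
            temp[idx - stride] = temp[idx];
            temp[idx] += t;
        }
    }
    __syncthreads();
    out[tid] = temp[tid];
}

int main() {
    const int N = 8;
    int hIn[N] = {3, 1, 7, 0, 4, 1, 6, 3}, hOut[N];
    int *dIn, *dOut;
    cudaMalloc(&dIn, N * sizeof(int));
    cudaMalloc(&dOut, N * sizeof(int));
    cudaMemcpy(dIn, hIn, N * sizeof(int), cudaMemcpyHostToDevice);
    blellochScan<N><<<1, N>>>(dIn, dOut);
    cudaMemcpy(hOut, dOut, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; ++i) printf("%d ", hOut[i]);   // 0 3 4 11 11 15 16 22
    printf("\n");
    cudaFree(dIn); cudaFree(dOut);
    return 0;
}
```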


Author(s):  
Prashanta Kumar Das ◽  
Ganesh Chandra Deka

The Graphics Processing Unit (GPU) is a specialized and highly parallel microprocessor designed to offload 2D/3D image rendering from the Central Processing Unit (CPU) and expedite image processing. The modern GPU is not only a powerful graphics engine but also a parallel programmable processor with high precision and powerful features. It is forecast that a 48-core GPU will be available by 2020, while a GPU with 3,000 cores is likely to be available by 2030. This chapter describes the chronology of the evolution of GPU hardware architecture and the future ahead.

