HIGH PRECISION INTEGER ADDITION, SUBTRACTION AND MULTIPLICATION WITH A GRAPHICS PROCESSING UNIT

2010 ◽  
Vol 20 (04) ◽  
pp. 293-306 ◽  
Author(s):  
NIALL EMMART ◽  
CHARLES WEEMS

In this paper we evaluate the potential for using an NVIDIA graphics processing unit (GPU) to accelerate high precision integer multiplication, addition, and subtraction. The reported peak vector performance for a typical GPU appears to offer good potential for accelerating such a computation. Because of limitations in the on-chip memory, the high cost of kernel launches, and the nature of the architecture's support for parallelism, we used a hybrid algorithmic approach to obtain good performance on multiplication. On the GPU itself we adapt the Strassen FFT algorithm to multiply 32KB chunks, while on the CPU we adapt the Karatsuba divide-and-conquer approach to optimize application of the GPU's partial multiplies, which are viewed as "digits" by our implementation of Karatsuba. Even with this approach, the result is at best a factor of three increase in performance, compared with using the GMP package on a 64-bit CPU at a comparable technology node. Our implementations of addition and subtraction achieve up to a factor of eight improvement. We identify the issues that limit performance and discuss the likely impact of planned advances in GPU architecture.
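
For readers unfamiliar with the divide-and-conquer step mentioned above, the sketch below shows the Karatsuba identity at a single level, using plain 32-bit halves as the "digits"; in the paper the digits are much larger chunks whose partial products come from the GPU's FFT multiplier. This is an illustrative stand-in, not the authors' implementation, and `unsigned __int128` is a GCC/Clang extension.

```cuda
// Hedged, single-level sketch of the Karatsuba idea: one 64x64-bit product
// built from three 32x32-bit products. In the paper the "digits" are much
// larger chunks multiplied on the GPU; here they are 32-bit halves so the
// example stays self-contained.
#include <cstdint>
#include <cstdio>

static unsigned __int128 karatsuba64(uint64_t x, uint64_t y) {
    uint64_t x0 = x & 0xffffffffu, x1 = x >> 32;   // low/high "digits"
    uint64_t y0 = y & 0xffffffffu, y1 = y >> 32;

    uint64_t z0 = x0 * y0;                         // low partial product
    uint64_t z2 = x1 * y1;                         // high partial product
    // (x0+x1)(y0+y1) needs up to 66 bits, so use a 128-bit intermediate.
    unsigned __int128 z1 = (unsigned __int128)(x0 + x1) * (y0 + y1) - z0 - z2;

    return ((unsigned __int128)z2 << 64) + (z1 << 32) + z0;
}

int main() {
    uint64_t a = 0x1234567890abcdefULL, b = 0xfedcba0987654321ULL;
    unsigned __int128 p = karatsuba64(a, b);
    // Print as two 64-bit halves for checking against a reference multiply.
    printf("%016llx%016llx\n",
           (unsigned long long)(p >> 64), (unsigned long long)p);
    return 0;
}
```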

SPE Journal ◽  
2021 ◽  
pp. 1-20
Author(s):  
A. M. Manea ◽  
T. Almani

Summary: In this work, the scalability of two key multiscale solvers for the pressure equation arising from incompressible flow in heterogeneous porous media, namely the multiscale finite volume (MSFV) solver and the restriction-smoothed basis multiscale (MsRSB) solver, is investigated on the massively parallel graphics processing unit (GPU) architecture. The robustness and scalability of both solvers are compared against their corresponding carefully optimized implementations on a shared-memory multicore architecture in a structured problem setting. Although several components of the MSFV and MsRSB algorithms are directly parallelizable, their scalability on the GPU architecture depends heavily on the underlying algorithmic details and data-structure design of every step, where one needs to ensure favorable control and data flow on the GPU while extracting enough parallel work for a massively parallel environment. In addition, the type of algorithm chosen for each step greatly influences the overall robustness of the solver. Thus, we extend the work on the parallel multiscale methods of Manea et al. (2016) to map the MSFV and MsRSB special kernels to the massively parallel GPU architecture. The scalability of our optimized parallel MSFV and MsRSB GPU implementations is demonstrated using highly heterogeneous structured 3D problems derived from the SPE10 Benchmark (Christie and Blunt 2001). Those problems range in size from millions to tens of millions of cells. For both solvers, the multicore implementations are benchmarked on a shared-memory multicore architecture consisting of two Intel® Cascade Lake Xeon Gold 6246 central processing unit (CPU) packages, whereas the GPU implementations are benchmarked on a massively parallel architecture consisting of NVIDIA Volta V100 GPUs. We compare the multicore implementations to the GPU implementations for both the setup and solution stages. Finally, we compare the parallel MsRSB scalability to the scalability of MSFV on the multicore (Manea et al. 2016) and GPU architectures. To the best of our knowledge, this is the first parallel implementation and demonstration of these versatile multiscale solvers on the GPU architecture. NOTE: This paper is published as part of the 2021 SPE Reservoir Simulation Conference Special Issue.
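
As background for the MsRSB solver discussed above, the sketch below illustrates, in a deliberately simplified host-only form, the basic idea behind restriction-smoothed basis functions: block-indicator basis functions are improved by Jacobi smoothing restricted to local support regions, with rows rescaled to preserve the partition of unity. The 1-D Laplacian, dense storage, and all sizes are illustrative assumptions; this is not the authors' implementation.

```cuda
// Hedged sketch of the core MsRSB idea on a 1-D Laplacian with dense
// storage: basis functions start as coarse-block indicators and are
// improved by restricted Jacobi smoothing, rescaling each row so the
// basis functions keep summing to one.
#include <cstdio>
#include <vector>

int main() {
    const int n = 32, nc = 4, blk = n / nc, overlap = 4;
    const double omega = 2.0 / 3.0;
    const int iters = 20;

    // P(i,j): value of basis j at fine cell i; start as block indicators.
    std::vector<double> P(n * nc, 0.0);
    for (int i = 0; i < n; ++i) P[i * nc + i / blk] = 1.0;

    for (int it = 0; it < iters; ++it) {
        std::vector<double> Pnew = P;
        for (int i = 0; i < n; ++i) {
            for (int j = 0; j < nc; ++j) {
                // Support of basis j: its block plus `overlap` cells each side.
                int lo = j * blk - overlap, hi = (j + 1) * blk - 1 + overlap;
                if (i < lo || i > hi) continue;
                // (A*P)(i,j) for the 1-D Laplacian, diag(A) = 2.
                double Ap = 2.0 * P[i * nc + j];
                if (i > 0)     Ap -= P[(i - 1) * nc + j];
                if (i < n - 1) Ap -= P[(i + 1) * nc + j];
                Pnew[i * nc + j] -= omega * Ap / 2.0;   // Jacobi update
            }
            // Rescale the row so the basis functions still sum to one.
            double s = 0.0;
            for (int j = 0; j < nc; ++j) s += Pnew[i * nc + j];
            for (int j = 0; j < nc; ++j) Pnew[i * nc + j] /= s;
        }
        P = Pnew;
    }
    for (int i = 0; i < n; ++i)              // print the smoothed basis 0
        printf("%2d %.3f\n", i, P[i * nc + 0]);
    return 0;
}
```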


2013 ◽  
Vol 753-755 ◽  
pp. 2235-2242
Author(s):  
Ming Ming Peng ◽  
Ji Shun Kuang

In this paper, we explore the implementation of a fault simulator for small-delay faults on the Graphics Processing Unit (GPU). As integrated circuits keep shrinking and clock frequencies keep rising, the effects of small-delay faults on a chip become increasingly pronounced. Small-delay fault simulation has therefore become highly important: it is directly related to product accuracy and time to market. At the same time, it is a very time-consuming process, which calls for a constant search for ways to accelerate it. In recent years, GPUs have been used to accelerate computation-intensive programs in many areas and have achieved very good results. Based on these two points, we combine the parallelism of small-delay simulation with the high parallel computing ability of the GPU to accelerate the simulation. Experimental results indicate that our approach achieves an average speedup of 42× over the traditional fault simulation engine.
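
To make the pattern-level parallelism concrete, here is a deliberately simplified CUDA sketch of small-delay fault detection: one thread per test pattern, each checking whether an injected extra delay pushes its sensitized path past the clock period. All names, data layouts, and values are hypothetical and do not reflect the authors' simulator.

```cuda
// Schematic, hypothetical sketch: one thread per test pattern. A pattern
// detects the small-delay fault if it sensitizes the fault site and the
// injected extra delay makes the path violate the clock period.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void smallDelayDetect(const float* pathDelay,  // nominal path delay per pattern
                                 const int*   hitsFault,  // 1 if the path passes the fault site
                                 float extraDelay,        // injected small delay
                                 float clockPeriod,
                                 int*  detected,
                                 int   nPatterns) {
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= nPatterns) return;
    float d = pathDelay[p] + (hitsFault[p] ? extraDelay : 0.0f);
    detected[p] = (hitsFault[p] && d > clockPeriod) ? 1 : 0;
}

int main() {
    const int n = 4;
    float hPath[n] = {0.9f, 1.1f, 1.15f, 0.7f};
    int   hHit[n]  = {1, 1, 0, 1};
    float *dPath; int *dHit, *dDet;
    cudaMalloc(&dPath, n * sizeof(float));
    cudaMalloc(&dHit,  n * sizeof(int));
    cudaMalloc(&dDet,  n * sizeof(int));
    cudaMemcpy(dPath, hPath, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dHit,  hHit,  n * sizeof(int),   cudaMemcpyHostToDevice);
    smallDelayDetect<<<1, 64>>>(dPath, dHit, 0.1f, 1.15f, dDet, n);
    int hDet[n];
    cudaMemcpy(hDet, dDet, n * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i) printf("pattern %d detects: %d\n", i, hDet[i]);
    cudaFree(dPath); cudaFree(dHit); cudaFree(dDet);
    return 0;
}
```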


Author(s):  
David Střelák ◽  
Carlos Óscar S Sorzano ◽  
José María Carazo ◽  
Jiří Filipovič

Cryo-electron microscopy is a popular method for macromolecular structure determination. Reconstruction of a 3-D volume from raw data obtained from a microscope is highly computationally demanding, so accelerating the reconstruction has great practical value. In this article, we introduce a novel graphics processing unit (GPU)-friendly algorithm for direct Fourier reconstruction, one of the main computational bottlenecks in the 3-D volume reconstruction pipeline for some experimental cases (particularly those with a large number of images and a high internal symmetry). Contrary to the state of the art, our algorithm uses a gather memory pattern, improving cache locality and removing race conditions in parallel writes into the 3-D volume. We also introduce a finely tuned CUDA implementation of our algorithm, using auto-tuning to search for the combination of optimization parameters that maximizes performance on a given GPU architecture. Our CUDA implementation is integrated into the widely used Xmipp software, version 3.19, reaching an 11.4× speedup over the original parallel CPU implementation when using a GPU with comparable power consumption. Moreover, we reach a 31.7× speedup using four GPUs, and a 2.14×–5.96× speedup compared to an optimized GPU implementation based on a scatter memory pattern.
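
The gather pattern the authors describe can be illustrated with a toy CUDA kernel: each thread exclusively owns one output cell and loops over the input samples, so no atomic writes or race conditions arise (a scatter version would instead need one atomicAdd per contribution). The 1-D grid, linear weighting, and all names below are simplifying assumptions, not the Xmipp implementation.

```cuda
// Schematic illustration of a gather memory pattern: each thread owns one
// output cell and accumulates every input sample within its window, so the
// single write per cell is race-free.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void gatherAccumulate(const float* samplePos,  // sample coordinates
                                 const float* sampleVal,  // sample values
                                 int nSamples,
                                 float* grid, int gridSize,
                                 float radius) {
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell >= gridSize) return;
    float acc = 0.0f;
    for (int s = 0; s < nSamples; ++s) {
        float d = fabsf(samplePos[s] - (float)cell);
        if (d <= radius) acc += sampleVal[s] * (1.0f - d / radius); // linear kernel
    }
    grid[cell] = acc;   // exclusive owner of this cell: no atomics needed
}

int main() {
    const int nSamples = 3, gridSize = 8;
    float hPos[nSamples] = {1.2f, 3.7f, 3.9f};
    float hVal[nSamples] = {1.0f, 2.0f, 0.5f};
    float *dPos, *dVal, *dGrid;
    cudaMalloc(&dPos, nSamples * sizeof(float));
    cudaMalloc(&dVal, nSamples * sizeof(float));
    cudaMalloc(&dGrid, gridSize * sizeof(float));
    cudaMemcpy(dPos, hPos, nSamples * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dVal, hVal, nSamples * sizeof(float), cudaMemcpyHostToDevice);
    gatherAccumulate<<<1, 32>>>(dPos, dVal, nSamples, dGrid, gridSize, 1.5f);
    float hGrid[gridSize];
    cudaMemcpy(hGrid, dGrid, gridSize * sizeof(float), cudaMemcpyDeviceToHost);
    for (int i = 0; i < gridSize; ++i) printf("cell %d: %.3f\n", i, hGrid[i]);
    cudaFree(dPos); cudaFree(dVal); cudaFree(dGrid);
    return 0;
}
```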


2016 ◽  
Vol 2016 ◽  
pp. 1-9
Author(s):  
Yang Wang ◽  
Li Zhou ◽  
Tao Sun ◽  
Yanhu Chen ◽  
Lei Wang ◽  
...  

As various applied sensors have been integrated into embedded devices, the Embedded Graphics Processing Unit (EGPU) has taken on more processing tasks, which requires an EGPU with higher performance. A tile-based EGPU is proposed that can be used for both general-purpose computing and 3D graphics rendering. With its fused, scalable, and hierarchically parallel architecture, the EGPU has the ability to address nearly 100 million vertices or fragments and achieves 1 GFLOPS at a clock frequency of 200 MHz. The fused and scalable architecture, constituted by the Universal Processing Engine (UPE) and the Graphics Coprocessor Cluster (GCC), ensures that the EGPU can adapt to various graphics processing scenes and situations, achieving more efficient rendering. Moreover, hierarchical parallelism is implemented via the UPE. Additionally, tiling brings a significant reduction in both system memory bandwidth and power consumption. A 0.18 µm technology library is used for timing and power analysis. The area of the proposed EGPU is 6.5 mm × 6.5 mm, and its power consumption is approximately 349.318 mW. Experimental results demonstrate that the proposed EGPU can be used in a System on Chip (SoC) configuration connected to sensors to accelerate its processing and strike a proper balance between performance and cost.
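
To illustrate why tiling reduces bandwidth, the host-side sketch below bins primitives by bounding box into fixed-size screen tiles so that each tile can later be processed entirely from on-chip memory. It is a generic illustration of tile-based rendering under assumed sizes, not the proposed EGPU's hardware pipeline.

```cuda
// Hedged sketch of tile binning: each triangle's screen bounding box is
// mapped conservatively onto 32x32-pixel tiles, producing one primitive
// list per tile that can later be shaded from on-chip storage.
#include <algorithm>
#include <cstdio>
#include <vector>

struct Tri { float x[3], y[3]; };

int main() {
    const int W = 256, H = 256, TILE = 32;
    const int tilesX = W / TILE, tilesY = H / TILE;
    std::vector<Tri> tris = {{{10, 40, 20}, {10, 15, 60}},
                             {{100, 200, 150}, {30, 40, 120}}};
    std::vector<std::vector<int>> bins(tilesX * tilesY);  // one list per tile

    for (int t = 0; t < (int)tris.size(); ++t) {
        const Tri& tr = tris[t];
        float minX = std::min({tr.x[0], tr.x[1], tr.x[2]});
        float maxX = std::max({tr.x[0], tr.x[1], tr.x[2]});
        float minY = std::min({tr.y[0], tr.y[1], tr.y[2]});
        float maxY = std::max({tr.y[0], tr.y[1], tr.y[2]});
        int tx0 = std::max(0, (int)minX / TILE), tx1 = std::min(tilesX - 1, (int)maxX / TILE);
        int ty0 = std::max(0, (int)minY / TILE), ty1 = std::min(tilesY - 1, (int)maxY / TILE);
        for (int ty = ty0; ty <= ty1; ++ty)
            for (int tx = tx0; tx <= tx1; ++tx)
                bins[ty * tilesX + tx].push_back(t);   // conservative binning
    }
    for (int i = 0; i < tilesX * tilesY; ++i)
        if (!bins[i].empty())
            printf("tile %d holds %zu primitive(s)\n", i, bins[i].size());
    return 0;
}
```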


2019 ◽  
Vol 135 ◽  
pp. 01082 ◽  
Author(s):  
Oleg Agibalov ◽  
Nikolay Ventsov

We consider the task of comparing fuzzy estimates of the execution parameters of genetic algorithms implemented on GPU (graphics processing unit) and CPU (central processing unit) architectures. The fuzzy estimates are calculated from the averaged dependences of the genetic algorithms' running time on the GPU and CPU architectures on the number of individuals in the populations processed by the algorithm. Analysis of these averaged dependences showed that a genetic algorithm can process 10,000 chromosomes on the GPU architecture, or 5,000 chromosomes on the CPU architecture, in approximately 2,500 ms. The following holds for the cases under consideration: “genetic algorithms (GA) are performed in approximately 2,500 ms (on average),” and the α-sections of the fuzzy sets, with α = 0.5, correspond to the interval [2000, 2399] ms for the GA executed on the GPU architecture and [1400, 1799] ms for the GA executed on the CPU architecture. Thereby, it can be said that, in this case, the actual execution time of the algorithm on the GPU architecture deviates from the average value to a lesser extent than on the CPU architecture.
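
As a minimal illustration of the fuzzy estimates used above, the sketch below computes an α-cut of a triangular fuzzy running-time estimate: for a membership level α, the cut is the interval of times whose membership is at least α. The membership-function parameters are invented for the example and are not the paper's data.

```cuda
// Minimal sketch of an alpha-cut of a triangular fuzzy number.
#include <cstdio>

struct TriangularFuzzy { double a, b, c; };   // support [a, c], peak at b

// Alpha-cut [lo, hi] of a triangular fuzzy number, 0 < alpha <= 1.
static void alphaCut(TriangularFuzzy f, double alpha, double* lo, double* hi) {
    *lo = f.a + alpha * (f.b - f.a);
    *hi = f.c - alpha * (f.c - f.b);
}

int main() {
    // Illustrative triangular running-time estimate (ms); invented values.
    TriangularFuzzy gpuTime = {1800.0, 2500.0, 3200.0};
    double lo, hi;
    alphaCut(gpuTime, 0.5, &lo, &hi);
    printf("alpha = 0.5 cut: [%.0f, %.0f] ms\n", lo, hi);
    return 0;
}
```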


Author(s):  
Bojan Novak

Random forest ensemble learning with a Graphics Processing Unit (GPU) version of the prefix scan method is presented. The efficiency of the random forest implementation depends critically on the scan (prefix sum) algorithm, which is used in the depth-first implementation of optimal split-point computation. Different implementations of the prefix scan algorithm are described. The speed of an implementation depends on three factors: the algorithm itself (which can be improved), programming skill, and the compiler. In parallel environments, things are even more complicated and depend on the programmer's knowledge of the Central Processing Unit (CPU) or GPU architecture. An efficient parallel scan algorithm that avoids bank conflicts is crucial for the prefix scan implementation. In our tests, a multicore CPU implementation and a GPU implementation based on NVIDIA's CUDA are compared.
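
Since the abstract centres on the prefix scan, here is a minimal single-block work-efficient (Blelloch) exclusive scan in CUDA. It deliberately omits the bank-conflict-avoiding index padding and the multi-block chaining that a production implementation would include, so it is a sketch of the primitive rather than the paper's code.

```cuda
// Hedged sketch: single-block work-efficient (Blelloch) exclusive scan.
#include <cstdio>
#include <cuda_runtime.h>

template <int N>                       // N = power-of-two block size
__global__ void blellochScan(const int* in, int* out) {
    __shared__ int temp[N];
    int tid = threadIdx.x;
    temp[tid] = in[tid];

    // Up-sweep (reduce) phase.
    for (int stride = 1; stride < N; stride *= 2) {
        __syncthreads();
        int idx = (tid + 1) * 2 * stride - 1;
        if (idx < N) temp[idx] += temp[idx - stride];
    }
    // Clear the last element, then down-sweep to build the exclusive scan.
    if (tid == 0) temp[N - 1] = 0;
    for (int stride = N / 2; stride >= 1; stride /= 2) {
        __syncthreads();
        int idx = (tid + 1) * 2 * stride - 1;
        if (idx < N) {
            int t = temp[idx - stride];
            temp[idx - stride] = temp[idx];
            temp[idx] += t;
        }
    }
    __syncthreads();
    out[tid] = temp[tid];
}

int main() {
    const int N = 8;
    int hIn[N] = {3, 1, 7, 0, 4, 1, 6, 3}, hOut[N];
    int *dIn, *dOut;
    cudaMalloc(&dIn, N * sizeof(int));
    cudaMalloc(&dOut, N * sizeof(int));
    cudaMemcpy(dIn, hIn, N * sizeof(int), cudaMemcpyHostToDevice);
    blellochScan<N><<<1, N>>>(dIn, dOut);
    cudaMemcpy(hOut, dOut, N * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < N; ++i) printf("%d ", hOut[i]);   // 0 3 4 11 11 15 16 22
    printf("\n");
    cudaFree(dIn); cudaFree(dOut);
    return 0;
}
```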


Author(s):  
Prashanta Kumar Das ◽  
Ganesh Chandra Deka

The Graphics Processing Unit (GPU) is a specialized and highly parallel microprocessor designed to offload 2D/3D image rendering from the Central Processing Unit (CPU) and expedite image processing. The modern GPU is not only a powerful graphics engine but also a parallel programmable processor with high precision and powerful features. It is forecast that a 48-core GPU will be available by 2020, while a GPU with 3,000 cores is likely to be available by 2030. This chapter describes the chronology of the evolution of GPU hardware architecture and the future ahead.

