Scalable Graphics Processing Unit–Based Multiscale Linear Solvers for Reservoir Simulation

SPE Journal, 2021, pp. 1-20
Author(s): A. M. Manea, T. Almani

Summary: In this work, the scalability of two key multiscale solvers for the pressure equation arising from incompressible flow in heterogeneous porous media, namely the multiscale finite volume (MSFV) solver and the restriction-smoothed basis multiscale (MsRSB) solver, is investigated on the massively parallel graphics processing unit (GPU) architecture. The robustness and scalability of both solvers are compared against their corresponding carefully optimized implementations on the shared-memory multicore architecture in a structured problem setting. Although several components of the MSFV and MsRSB algorithms are directly parallelizable, their scalability on the GPU architecture depends heavily on the underlying algorithmic details and data-structure design of every step, where one needs to ensure favorable control and data flow on the GPU while extracting enough parallel work for a massively parallel environment. In addition, the type of algorithm chosen for each step greatly influences the overall robustness of the solver. Thus, we extend the work on parallel multiscale methods of Manea et al. (2016) to map the MSFV and MsRSB special kernels to the massively parallel GPU architecture. The scalability of our optimized parallel MSFV and MsRSB GPU implementations is demonstrated using highly heterogeneous structured 3D problems derived from the SPE10 benchmark (Christie and Blunt 2001). These problems range in size from millions to tens of millions of cells. For both solvers, the multicore implementations are benchmarked on a shared-memory multicore architecture consisting of two packages of the Intel® Cascade Lake Xeon Gold 6246 central processing unit (CPU), whereas the GPU implementations are benchmarked on a massively parallel architecture consisting of NVIDIA Volta V100 GPUs. We compare the multicore implementations to the GPU implementations for both the setup and solution stages. Finally, we compare the parallel MsRSB scalability to the scalability of MSFV on the multicore (Manea et al. 2016) and GPU architectures. To the best of our knowledge, this is the first parallel implementation and demonstration of these versatile multiscale solvers on the GPU architecture. NOTE: This paper is published as part of the 2021 SPE Reservoir Simulation Conference Special Issue.
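
To make the structure of these solvers concrete, below is a minimal serial C sketch of the algorithmic core of MsRSB: one damped-Jacobi smoothing step applied to a basis function, restricted to its support region. The CSR storage, the damping factor omega, and the support mask are illustrative assumptions rather than the authors' implementation; a GPU version would assign one thread (or warp) per row.

```c
#include <stddef.h>

/* One damped-Jacobi smoothing step for a single MsRSB basis function,
 * restricted to its support region. Serial sketch of the data
 * dependencies only; the CSR arrays (row_ptr, col_idx, val), the
 * damping factor omega, and the support mask are illustrative
 * assumptions, not the paper's implementation. */
void msrsb_smooth_step(size_t n,
                       const size_t *row_ptr, const size_t *col_idx,
                       const double *val,               /* CSR matrix A   */
                       const double *diag,              /* diagonal of A  */
                       const unsigned char *in_support, /* support mask   */
                       double omega,
                       const double *phi,               /* current basis  */
                       double *phi_new)                 /* smoothed basis */
{
    for (size_t i = 0; i < n; ++i) {
        if (!in_support[i]) {        /* freeze values outside the support */
            phi_new[i] = phi[i];
            continue;
        }
        double Ap = 0.0;             /* (A * phi)_i */
        for (size_t k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            Ap += val[k] * phi[col_idx[k]];
        phi_new[i] = phi[i] - omega * Ap / diag[i];
    }
}
```

In the full method, the smoothed basis functions are also renormalized so that they sum to one in every cell (partition of unity); it is the per-row independence of this update that maps naturally onto thousands of GPU threads.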

2010, Vol. 20 (04), pp. 293-306
Author(s): Niall Emmart, Charles Weems

In this paper, we evaluate the potential for using an NVIDIA graphics processing unit (GPU) to accelerate high-precision integer multiplication, addition, and subtraction. The reported peak vector performance for a typical GPU appears to offer good potential for accelerating such computations. Because of limitations in the on-chip memory, the high cost of kernel launches, and the nature of the architecture's support for parallelism, we used a hybrid algorithmic approach to obtain good performance on multiplication. On the GPU itself we adapt the Strassen FFT algorithm to multiply 32-KB chunks, while on the CPU we adapt the Karatsuba divide-and-conquer approach to optimize the application of the GPU's partial multiplies, which are viewed as "digits" by our implementation of Karatsuba. Even with this approach, the result is at best a factor-of-three increase in performance compared with using the GMP package on a 64-bit CPU at a comparable technology node. Our implementations of addition and subtraction achieve up to a factor-of-eight improvement. We identify the issues that limit performance and discuss the likely impact of planned advances in GPU architecture.
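
The data-parallel structure behind the addition result can be sketched as follows: every limb sum is independent, and carries are then swept until they settle. This is a hedged C sketch under assumed 32-bit limbs and an iterate-until-stable propagation loop, not the authors' kernel.

```c
#include <stdint.h>
#include <stdlib.h>

/* Add two n-limb little-endian numbers (32-bit limbs). Phase 1 is
 * fully parallel: every limb sum and carry-out is independent. Phase 2
 * sweeps carries until none remain; in practice very few sweeps are
 * needed, which is why the pattern suits wide GPU hardware.
 * Illustrative sketch only. Returns the final carry-out. */
uint32_t bignum_add(size_t n, const uint32_t *a, const uint32_t *b,
                    uint32_t *sum)
{
    uint32_t *carry = calloc(n + 1, sizeof *carry);

    /* Phase 1 (one thread per limb on a GPU): sums and carry-outs. */
    for (size_t i = 0; i < n; ++i) {
        uint64_t s = (uint64_t)a[i] + b[i];
        sum[i] = (uint32_t)s;
        carry[i + 1] = (uint32_t)(s >> 32);
    }

    /* Phase 2: propagate carries until the state is stable. */
    int again = 1;
    while (again) {
        again = 0;
        for (size_t i = 0; i < n; ++i) {
            if (carry[i]) {
                uint64_t s = (uint64_t)sum[i] + carry[i];
                carry[i] = 0;
                sum[i] = (uint32_t)s;
                if (s >> 32) { carry[i + 1] += 1; again = 1; }
            }
        }
    }
    uint32_t out = carry[n];
    free(carry);
    return out;
}
```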


Algorithms, 2018, Vol. 11 (12), pp. 213
Author(s): Zhijiang Kang, Ze Deng, Wei Han, Dongmei Zhang

Parallel reservoir simulation is an important approach to solving real-time reservoir management problems. Recently, there has been a trend toward using a graphics processing unit (GPU) to parallelize reservoir simulations. Current GPU-aided reservoir simulations focus on the compute unified device architecture (CUDA). However, CUDA is not functionally portable across devices and requires a large amount of code. Meanwhile, domain decomposition is not widely used in GPU-based reservoir simulations. To address these problems, we propose a parallel method based on OpenACC to accelerate serial code and reduce the time and effort of porting an application to the GPU. Furthermore, a GPU-aided domain decomposition method is developed to improve the efficiency of reservoir simulation. The experimental results indicate that (1) the proposed GPU-aided approach can outperform the CPU-based one by up to about two times, and, with the help of OpenACC, the porting workload was reduced significantly, by about 22 percent of the source code; and (2) the domain decomposition method can further improve execution efficiency by up to 1.7×. The proposed parallel reservoir simulation method is an efficient tool for accelerating reservoir simulation.
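
To illustrate why OpenACC reduces porting effort relative to hand-written CUDA, the following hedged C sketch offloads a Jacobi-style pressure sweep with a single directive; the 5-point stencil, grid sizes, and names are invented for illustration and are not taken from the paper.

```c
/* A 2-D Jacobi-style sweep, offloaded with OpenACC directives only
 * (compile with, e.g., nvc -acc). The stencil and names are
 * illustrative, not the paper's simulator code; boundary cells are
 * assumed to be handled elsewhere. */
#define NX 512
#define NY 512

void jacobi_sweep(const double p[NY][NX], double p_new[NY][NX])
{
    /* One directive turns the serial loop nest into a GPU kernel;
     * the copyin/copyout clauses manage host-device data movement. */
    #pragma acc parallel loop collapse(2) \
                copyin(p[0:NY][0:NX]) copyout(p_new[0:NY][0:NX])
    for (int j = 1; j < NY - 1; ++j)
        for (int i = 1; i < NX - 1; ++i)
            p_new[j][i] = 0.25 * (p[j][i - 1] + p[j][i + 1] +
                                  p[j - 1][i] + p[j + 1][i]);
}
```

The same loop nest compiles and runs unchanged as serial CPU code when the directive is ignored, which is the portability argument the abstract makes against maintaining a separate CUDA code path.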


Author(s): David Střelák, Carlos Óscar S. Sorzano, José María Carazo, Jiří Filipovič

Cryo-electron microscopy is a popular method for macromolecular structure determination. Reconstruction of a 3-D volume from the raw data obtained from a microscope is highly computationally demanding, so accelerating the reconstruction has great practical value. In this article, we introduce a novel graphics processing unit (GPU)-friendly algorithm for direct Fourier reconstruction, one of the main computational bottlenecks in the 3-D volume reconstruction pipeline for some experimental cases (particularly those with a large number of images and a high internal symmetry). Contrary to the state of the art, our algorithm uses a gather memory pattern, improving cache locality and removing race conditions in parallel writes into the 3-D volume. We also introduce a finely tuned CUDA implementation of our algorithm, using auto-tuning to search for the combination of optimization parameters that maximizes performance on a given GPU architecture. Our CUDA implementation is integrated into the widely used software Xmipp, version 3.19, reaching an 11.4× speedup over the original parallel CPU implementation when using a GPU with comparable power consumption. Moreover, we have reached a 31.7× speedup using four GPUs, and a 2.14×–5.96× speedup compared to an optimized GPU implementation based on a scatter memory pattern.
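
The gather-versus-scatter distinction is easiest to see in a stripped-down gridding loop. Below is a hedged C sketch of a 1-D analogue: the scatter form parallelizes over samples and would need atomic writes on a GPU, while the gather form parallelizes over grid points so each output is written exactly once. The linear interpolation kernel and all names are illustrative assumptions, not the paper's algorithm.

```c
#include <math.h>
#include <stddef.h>

/* Scatter form: parallelize over samples. Two samples may update the
 * same grid cell, so a GPU version needs atomic additions. */
void grid_scatter(size_t ns, const double *pos, const double *val,
                  size_t ng, double *grid)
{
    for (size_t s = 0; s < ns; ++s) {
        long   i0 = (long)floor(pos[s]);
        double f  = pos[s] - (double)i0;
        if (i0 >= 0 && (size_t)i0 < ng)
            grid[i0] += (1.0 - f) * val[s];
        if (i0 + 1 >= 0 && (size_t)(i0 + 1) < ng)
            grid[i0 + 1] += f * val[s];
    }
}

/* Gather form: parallelize over grid points. Each grid[i] is owned by
 * one thread, so there are no races and reads are cache-friendly. The
 * O(ns) inner scan is a simplification; real implementations index
 * samples spatially so each output visits only nearby samples. */
void grid_gather(size_t ns, const double *pos, const double *val,
                 size_t ng, double *grid)
{
    for (size_t i = 0; i < ng; ++i) {
        double acc = 0.0;
        for (size_t s = 0; s < ns; ++s) {
            double d = fabs(pos[s] - (double)i);
            if (d < 1.0)
                acc += (1.0 - d) * val[s];   /* linear kernel */
        }
        grid[i] = acc;
    }
}
```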


2017, Vol. 4 (9), pp. 170436
Author(s): Gang Mei, Liangliang Xu, Nengxiong Xu

This paper focuses on designing and implementing parallel adaptive inverse distance weighting (AIDW) interpolation algorithms using the graphics processing unit (GPU). AIDW is an improved version of standard IDW that adaptively determines the power parameter according to the spatial distribution pattern of the data points, achieving more accurate predictions than IDW. In this paper, we first present two versions of the GPU-accelerated AIDW: a naive version that does not use shared memory and a tiled version that takes advantage of it. We implement both versions using two data layouts, structure of arrays and array of aligned structures, in both single and double precision. We then evaluate the performance of the parallel AIDW by comparing it with its corresponding serial algorithm on three different machines equipped with GT730M, M5000, and K40c GPUs. The experimental results indicate that (i) there is no significant difference in computational efficiency between the two data layouts; (ii) the tiled version is always slightly faster than the naive version; and (iii) in single precision the achieved speed-up can be as high as 763 (on the M5000), while in double precision the highest obtained speed-up is 197 (on the K40c). To benefit the community, all source code and testing data related to the presented parallel AIDW algorithm are publicly available.
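
The adaptive idea can be sketched as follows: measure how clustered the data are around the prediction point, choose the power accordingly, then apply ordinary IDW. In this hedged C sketch, the density-to-power mapping is a simplified linear stand-in for the published fuzzy-membership rule (Lu and Wong 2008), and all names are illustrative.

```c
#include <math.h>
#include <stddef.h>

/* Simplified AIDW prediction at one point (x, y). On the GPU, one
 * thread evaluates one prediction point; the tiled variant stages the
 * data points through shared memory. The O(n) scans over all points
 * and the clamped linear density-to-power rule are illustrative
 * simplifications, not the paper's formulation. */
double aidw_predict(double x, double y,
                    size_t n, const double *px, const double *py,
                    const double *pv,
                    double expected_spacing)  /* e.g. sqrt(area / n) */
{
    /* 1) Local density statistic: mean distance to the data points,
     *    compared with the spacing expected for a random pattern. */
    double mean_d = 0.0;
    for (size_t i = 0; i < n; ++i)
        mean_d += hypot(px[i] - x, py[i] - y);
    mean_d /= (double)n;
    double r = mean_d / expected_spacing;  /* <1 clustered, >1 sparse */

    /* 2) Adaptive power: denser neighborhoods get a smaller power
     *    (smoother surface), sparser ones a larger power. The range
     *    [1, 5] and the linear map are illustrative. */
    double alpha = fmin(5.0, fmax(1.0, 1.0 + 2.0 * r));

    /* 3) Standard IDW weighted average with the adaptive power. */
    double wsum = 0.0, vsum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        double d = hypot(px[i] - x, py[i] - y);
        if (d < 1e-12)
            return pv[i];                  /* coincides with a sample */
        double w = 1.0 / pow(d, alpha);
        wsum += w;
        vsum += w * pv[i];
    }
    return vsum / wsum;
}
```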

