Industry-scale finite-difference elastic wave modeling on graphics processing units using the out-of-core technique

Geophysics ◽  
2016 ◽  
Vol 81 (2) ◽  
pp. T35-T43 ◽  
Author(s):  
Jon Marius Venstad

The difference in computational power between the few- and multicore architectures represented by central processing units (CPUs) and graphics processing units (GPUs) is significant today, and this difference is likely to increase in the years ahead. GPUs are, therefore, ever more popular for applications in computational physics, such as wave modeling. Finite-difference methods are popular for wave modeling and are well suited to the GPU architecture, but developing an efficient and capable GPU implementation is hindered by the limited size of GPU memory. I show how the out-of-core technique can be used to circumvent the memory limit on the GPU, increasing the available memory to that of the CPU (the main memory) instead, with no significant computational overhead. This approach has several advantages over a parallel scheme in terms of applicability, flexibility, and hardware requirements. Choices in the numerical scheme (the numerical differentiators in particular) also greatly affect computational efficiency. These factors are considered explicitly for GPU implementations of wave modeling because GPUs are special-purpose processors with a visible architecture.
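The out-of-core streaming pattern described above can be made concrete with a short CUDA sketch. This is a minimal illustration under assumptions of ours (the slab size, the update_slab kernel name, and a toy 3-point stencil are invented), not the paper's actual scheme; in particular, a real finite-difference code must also exchange halo regions between neighboring slabs, which is omitted here for brevity.

```cuda
// Sketch: the full wavefield lives in pinned host memory; slabs that fit in
// GPU memory are streamed through the device on two CUDA streams so that
// PCIe transfers overlap with compute.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void update_slab(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1)                 // toy stencil on interior points
        out[i] = 0.5f * in[i] + 0.25f * (in[i - 1] + in[i + 1]);
    else if (i < n)
        out[i] = in[i];
}

int main() {
    const size_t N = 1 << 26;               // full wavefield (in host memory)
    const size_t SLAB = 1 << 22;            // slab sized to fit on the GPU
    float *field;
    cudaMallocHost(&field, N * sizeof(float));   // pinned: enables async copies
    for (size_t i = 0; i < N; ++i) field[i] = 1.0f;

    float *d_in[2], *d_out[2];
    cudaStream_t stream[2];
    for (int s = 0; s < 2; ++s) {
        cudaMalloc(&d_in[s], SLAB * sizeof(float));
        cudaMalloc(&d_out[s], SLAB * sizeof(float));
        cudaStreamCreate(&stream[s]);
    }

    // Alternate between two streams so the copy of one slab overlaps the
    // compute of the previous one: this is what hides the transfer cost.
    int s = 0;
    for (size_t off = 0; off < N; off += SLAB, s ^= 1) {
        cudaMemcpyAsync(d_in[s], field + off, SLAB * sizeof(float),
                        cudaMemcpyHostToDevice, stream[s]);
        update_slab<<<(SLAB + 255) / 256, 256, 0, stream[s]>>>(d_in[s], d_out[s], SLAB);
        cudaMemcpyAsync(field + off, d_out[s], SLAB * sizeof(float),
                        cudaMemcpyDeviceToHost, stream[s]);
    }
    cudaDeviceSynchronize();
    printf("field[1] = %f\n", field[1]);    // 1.0: a constant field is preserved
    for (int t = 0; t < 2; ++t) {
        cudaFree(d_in[t]); cudaFree(d_out[t]); cudaStreamDestroy(stream[t]);
    }
    cudaFreeHost(field);
    return 0;
}
```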

2015 ◽  
Vol 138 (3) ◽  
Author(s):  
Javier Crespo ◽  
Roque Corral ◽  
Jesus Pueblas

An implicit harmonic balance (HB) method for modeling the unsteady nonlinear periodic flow about vibrating airfoils in turbomachinery is presented. An implicit edge-based three-dimensional Reynolds-averaged Navier–Stokes (RANS) solver for unstructured grids, which runs on both central processing units (CPUs) and graphics processing units (GPUs), is used. The HB method performs a spectral discretization of the time derivatives and marches in pseudotime a new system of equations in which the unknowns are the variables at different time samples. The application of the method to vibrating airfoils is discussed. It is shown that a time-spectral scheme may achieve the same temporal accuracy at a much lower computational cost than a backward finite-difference method, at the expense of using more memory. The performance of the implicit solver has been assessed with several application examples. A speed-up factor of 10 is obtained between the spectral and finite-difference versions of the code, and an additional speed-up factor of 10 is obtained when the code is ported to GPUs, for an overall speed-up factor of 100. The performance of the solver on GPUs has been assessed using the tenth standard aeroelastic configuration and a transonic compressor.
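The spectral discretization of the time derivative at the heart of the HB method can be sketched in a few lines. Assuming the standard time-spectral operator for an odd number of samples (the Gopinath–Jameson form; the variable names and the check against a sine wave below are our own, not this paper's code), the time derivative becomes a dense coupling between the time samples, and it is this coupled system that is then marched in pseudotime.

```cuda
// Build the time-spectral derivative matrix D for N (odd) samples over one
// period T, and verify that it differentiates a resolved sine wave exactly.
#include <cmath>
#include <cstdio>

int main() {
    const int N = 5;            // odd number of time samples
    const double T = 1.0;       // period of the blade vibration
    const double PI = 3.14159265358979323846;

    // Time-spectral operator (odd N):
    //   D[i][j] = (pi/T) * (-1)^(i-j) / sin(pi*(i-j)/N)  for i != j, else 0
    double D[N][N];
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            D[i][j] = (i == j) ? 0.0
                    : (PI / T) * ((i - j) % 2 ? -1.0 : 1.0)
                      / sin(PI * (i - j) / N);

    // Check: the operator is exact for u(t) = sin(2*pi*t/T).
    double u[N], dudt[N];
    for (int i = 0; i < N; ++i) u[i] = sin(2 * PI * i / N);
    for (int i = 0; i < N; ++i) {
        dudt[i] = 0.0;
        for (int j = 0; j < N; ++j) dudt[i] += D[i][j] * u[j];
        printf("dudt[%d] = %+.6f (exact %+.6f)\n",
               i, dudt[i], 2 * PI / T * cos(2 * PI * i / N));
    }
    return 0;
}
```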


2016 ◽  
Vol 850 ◽  
pp. 129-135
Author(s):  
Buğra Şimşek ◽  
Nursel Akçam

This study presents a parallelization of the Hamming distance algorithm, which is used for iris comparison in iris recognition systems, for heterogeneous systems that may include central processing units (CPUs), graphics processing units (GPUs), digital signal processing (DSP) boards, field-programmable gate arrays (FPGAs), and other mobile platforms, using OpenCL. OpenCL allows the same code to run on CPUs, GPUs, FPGAs, and DSP boards. Heterogeneous computing refers to systems that include different kinds of devices (CPUs, GPUs, FPGAs, and other accelerators); for suitable algorithms, it gains performance or reduces power consumption on these OpenCL-supported devices. In this study, the Hamming distance algorithm has been coded in C++ as sequential code and parallelized with OpenCL using a method we designed. Our OpenCL code has been executed on an Nvidia GT430 GPU and an Intel Xeon 5650 processor. The OpenCL implementation demonstrates a speed-up of up to 87 times over the sequential version. Our study also differs from other studies that accelerate iris matching in that it ensures heterogeneous computing by using OpenCL.
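The core computation parallelizes naturally, as a short sketch shows: each thread XORs one word of two bit-packed iris codes and counts differing bits with a hardware popcount, and a block-level reduction sums the partial counts. The study uses OpenCL; the sketch below uses CUDA for brevity, and the 2048-bit code length (as in Daugman's scheme), kernel name, and reduction layout are illustrative choices of ours, not the authors' implementation.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void hamming_distance(const unsigned *a, const unsigned *b,
                                 int nwords, unsigned *result) {
    __shared__ unsigned partial[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    partial[threadIdx.x] = (i < nwords) ? __popc(a[i] ^ b[i]) : 0;
    __syncthreads();
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {   // tree reduction
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0) atomicAdd(result, partial[0]);
}

int main() {
    const int nwords = 2048 / 32;      // one 2048-bit iris code per template
    unsigned h_a[nwords], h_b[nwords];
    for (int i = 0; i < nwords; ++i) { h_a[i] = 0xAAAAAAAAu; h_b[i] = 0u; }

    unsigned *d_a, *d_b, *d_res, h_res = 0;
    cudaMalloc(&d_a, nwords * sizeof(unsigned));
    cudaMalloc(&d_b, nwords * sizeof(unsigned));
    cudaMalloc(&d_res, sizeof(unsigned));
    cudaMemcpy(d_a, h_a, nwords * sizeof(unsigned), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b, nwords * sizeof(unsigned), cudaMemcpyHostToDevice);
    cudaMemcpy(d_res, &h_res, sizeof(unsigned), cudaMemcpyHostToDevice);

    hamming_distance<<<(nwords + 255) / 256, 256>>>(d_a, d_b, nwords, d_res);
    cudaMemcpy(&h_res, d_res, sizeof(unsigned), cudaMemcpyDeviceToHost);
    printf("Hamming distance: %u of %d bits\n", h_res, nwords * 32);  // 1024
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_res);
    return 0;
}
```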


2018 ◽  
Vol 11 (11) ◽  
pp. 4621-4635 ◽  
Author(s):  
Istvan Z. Reguly ◽  
Daniel Giles ◽  
Devaraj Gopinathan ◽  
Laure Quivy ◽  
Joakim H. Beck ◽  
...  

Abstract. In this paper, we present the VOLNA-OP2 tsunami model and its implementation: a finite-volume non-linear shallow-water equation (NSWE) solver built on the OP2 domain-specific language (DSL) for unstructured mesh computations. VOLNA-OP2 is unique among tsunami solvers in its support for several high-performance computing platforms: central processing units (CPUs), the Intel Xeon Phi, and graphics processing units (GPUs). This is achieved by keeping the scientific code separate from the various parallel implementations, which eases maintenance. The solver has already been used in production for several years; here we discuss how it can be integrated into various workflows, such as a statistical emulator. The scalability of the code is demonstrated on three supercomputers, built with classical Xeon CPUs, the Intel Xeon Phi, and NVIDIA P100 GPUs. VOLNA-OP2 delivers productivity as well as performance and portability to its users across a range of platforms.
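The unstructured-mesh pattern that OP2 abstracts away can be sketched directly in CUDA. The kernel below is our own toy illustration of an edge-based loop with indirect updates to the two neighboring cells; it is not OP2's API or VOLNA's flux function (the "flux" here is just an exchange of water depth h), and OP2 generates such code from a higher-level loop specification, using colored execution plans rather than the atomics shown here.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void edge_flux(const int *cell0, const int *cell1,
                          const float *h, float *dh, int nedges) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= nedges) return;
    int c0 = cell0[e], c1 = cell1[e];        // cells on either side of edge e
    float flux = 0.5f * (h[c0] - h[c1]);     // toy numerical flux
    atomicAdd(&dh[c0], -flux);               // several edges touch each cell,
    atomicAdd(&dh[c1], +flux);               // so the scatter must be atomic
}

int main() {
    // A 1D chain of 4 cells with 3 interior edges, standing in for a mesh.
    int h_c0[] = {0, 1, 2}, h_c1[] = {1, 2, 3};
    float h_h[] = {2.0f, 1.0f, 1.0f, 1.0f}, h_dh[4] = {0};

    int *d_c0, *d_c1; float *d_h, *d_dh;
    cudaMalloc(&d_c0, sizeof(h_c0)); cudaMalloc(&d_c1, sizeof(h_c1));
    cudaMalloc(&d_h, sizeof(h_h));   cudaMalloc(&d_dh, sizeof(h_dh));
    cudaMemcpy(d_c0, h_c0, sizeof(h_c0), cudaMemcpyHostToDevice);
    cudaMemcpy(d_c1, h_c1, sizeof(h_c1), cudaMemcpyHostToDevice);
    cudaMemcpy(d_h, h_h, sizeof(h_h), cudaMemcpyHostToDevice);
    cudaMemcpy(d_dh, h_dh, sizeof(h_dh), cudaMemcpyHostToDevice);

    edge_flux<<<1, 32>>>(d_c0, d_c1, d_h, d_dh, 3);
    cudaMemcpy(h_dh, d_dh, sizeof(h_dh), cudaMemcpyDeviceToHost);
    for (int c = 0; c < 4; ++c) printf("dh[%d] = %+.2f\n", c, h_dh[c]);
    return 0;
}
```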


2014 ◽  
Vol 27 (6) ◽  
pp. 1591-1602 ◽  
Author(s):  
J. Porter-Sobieraj ◽  
S. Cygert ◽  
D. Kikoła ◽  
J. Sikorski ◽  
M. Słodkowski


Author(s):  
Catherine Rucki ◽  
Abhilash J. Chandy

The accurate simulation of turbulence and the implementation of corresponding turbulence models are both critical to understanding the complex physics behind turbulent flows in a variety of science and engineering applications. Despite the tremendous increase in the computing power of central processing units (CPUs), direct numerical simulation (DNS) of highly turbulent flows is still not feasible due to the need to resolve the smallest length scales, and today's CPUs cannot keep pace with demand. The recent development of graphics processing units (GPUs) has led to a general improvement in the performance of various algorithms. This study investigates the applicability of GPU technology in the context of fast Fourier transform (FFT)-based pseudo-spectral methods for DNS of turbulent flows, applied to the Taylor–Green vortex problem. The methods are implemented on a single GPU, and a speed-up of up to 31× is obtained in comparison to a single CPU.
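The pseudo-spectral building block the study relies on can be illustrated with cuFFT: differentiate a periodic field by transforming to wavenumber space, multiplying by ik, and transforming back, all on the GPU. The sketch below is our own 1D example (grid size, names, and normalization choices are illustrative); a DNS would apply 3D transforms to the velocity fields instead.

```cuda
// Compile with: nvcc spectral.cu -lcufft
#include <cufft.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <cmath>

__global__ void multiply_by_ik(cufftComplex *u, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int k = (i <= n / 2) ? i : i - n;       // signed wavenumber
    if (i == n / 2) k = 0;                  // zero the Nyquist mode
    float re = u[i].x, im = u[i].y;
    u[i].x = -k * im / n;                   // (re + i*im) * (i*k), with the
    u[i].y =  k * re / n;                   // 1/n FFT normalization folded in
}

int main() {
    const int N = 64;
    const float PI = 3.14159265f;
    cufftComplex h_u[N];
    for (int i = 0; i < N; ++i) {           // u(x) = sin(x) on [0, 2*pi)
        h_u[i].x = sinf(2.0f * PI * i / N);
        h_u[i].y = 0.0f;
    }
    cufftComplex *d_u;
    cudaMalloc(&d_u, N * sizeof(cufftComplex));
    cudaMemcpy(d_u, h_u, N * sizeof(cufftComplex), cudaMemcpyHostToDevice);

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);
    cufftExecC2C(plan, d_u, d_u, CUFFT_FORWARD);
    multiply_by_ik<<<1, N>>>(d_u, N);
    cufftExecC2C(plan, d_u, d_u, CUFFT_INVERSE);

    cudaMemcpy(h_u, d_u, N * sizeof(cufftComplex), cudaMemcpyDeviceToHost);
    printf("du/dx at x=0: %f (exact 1.0)\n", h_u[0].x);  // d(sin)/dx = cos
    cufftDestroy(plan); cudaFree(d_u);
    return 0;
}
```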


SIMULATION ◽  
2016 ◽  
Vol 93 (1) ◽  
pp. 69-84 ◽  
Author(s):  
Shailesh Tamrakar ◽  
Paul Richmond ◽  
Roshan M D’Souza

Agent-based models (ABMs) are increasingly being used to study population dynamics in complex systems, such as the human immune system. Previously, Folcik et al. (The basic immune simulator: an agent-based model to study the interactions between innate and adaptive immunity. Theor Biol Med Model 2007; 4: 39) developed a Basic Immune Simulator (BIS) and implemented it using the Recursive Porous Agent Simulation Toolkit (RePast) ABM simulation framework. However, frameworks such as RePast are designed to execute serially on central processing units (CPUs) and therefore cannot efficiently handle large model sizes. In this paper, we report on our implementation of the BIS using FLAME GPU, a parallel-computing ABM simulator designed to execute on graphics processing units (GPUs). To benchmark our implementation, we simulate the response of the immune system to a viral infection of generic tissue cells. We compare our results with those obtained from the original RePast implementation for statistical accuracy. We observe that our implementation has a 13× performance advantage over the original RePast implementation.
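The source of this speed-up is the one-thread-per-agent mapping that GPU ABM frameworks exploit: within a simulation step, every agent's state transition is independent. The sketch below is a toy illustration of that mapping only; the Agent struct, the threshold infection rule, and all names are our inventions, not FLAME GPU's API or the BIS model's actual rules.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

struct Agent { float virus_load; int infected; };

__global__ void step_agents(Agent *agents, int n, float exposure) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    Agent a = agents[i];
    a.virus_load += exposure;                 // uptake from the environment
    if (a.virus_load > 1.0f) a.infected = 1;  // toy threshold infection rule
    agents[i] = a;                            // one thread updates one agent
}

int main() {
    const int n = 1 << 20;                    // a million tissue-cell agents
    Agent *d_agents;
    cudaMalloc(&d_agents, n * sizeof(Agent));
    cudaMemset(d_agents, 0, n * sizeof(Agent));
    for (int step = 0; step < 100; ++step)    // simulation loop
        step_agents<<<(n + 255) / 256, 256>>>(d_agents, n, 0.02f);
    cudaDeviceSynchronize();

    Agent first;
    cudaMemcpy(&first, d_agents, sizeof(Agent), cudaMemcpyDeviceToHost);
    printf("agent 0: load=%.2f infected=%d\n", first.virus_load, first.infected);
    cudaFree(d_agents);
    return 0;
}
```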


2010 ◽  
Vol 133 (2) ◽  
Author(s):  
Tobias Brandvik ◽  
Graham Pullan

A new three-dimensional Navier–Stokes solver for flows in turbomachines has been developed. The new solver is based on the latest version of the Denton codes but has been implemented to run on graphics processing units (GPUs) instead of the traditional central processing unit (CPU). The change in processor enables an order-of-magnitude reduction in run-time due to the higher performance of the GPU. Scaling results for a 16-node GPU cluster are also presented, showing almost linear scaling for typical turbomachinery cases. For validation purposes, a test case consisting of a three-stage turbine with complete hub and casing leakage paths is described. Good agreement is obtained with previously published experimental results. The simulation runs in less than 10 min on a cluster with four GPUs.


Author(s):  
Ana Moreton–Fernandez ◽  
Hector Ortega–Arranz ◽  
Arturo Gonzalez–Escribano

Nowadays, the use of hardware accelerators, such as graphics processing units (GPUs) or Xeon Phi coprocessors, is key to solving computationally costly problems that require high-performance computing. However, programming efficient deployments for these kinds of devices is a very complex task that relies on manual management of memory transfers and configuration parameters. The programmer has to study in depth the particular data that need to be computed at each moment, across different computing platforms, while also considering architectural details. We introduce the controller concept: an abstract entity that allows the programmer to easily manage communications and kernel-launching details on hardware accelerators in a transparent way. The model also makes it possible to define and launch central processing unit (CPU) kernels on multi-core processors with the same abstraction and methodology used for the accelerators. It internally combines different native programming models and technologies to exploit the potential of each kind of device. Additionally, the model helps the programmer select proper values for the several configuration parameters that must be chosen when a kernel is launched, through a qualitative characterization of the kernel code to be executed. Finally, we present an implementation of the controller model in a prototype library, together with its application in several case studies. Its use has led to reductions in development and porting costs, with significantly low overheads in execution times compared to manually programmed and optimized solutions that directly use CUDA and OpenMP.
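A minimal sketch of the controller idea, under our own simplifying assumptions: a host-side object owns a device buffer and a stream, and hides the upload, launch, and download sequence behind a single call. The prototype library described above does considerably more (CPU kernels via OpenMP, qualitative kernel characterization, parameter selection); the class and kernel names below are illustrative only.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale_kernel(float *x, int n, float s) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

class Controller {
    float *d_buf = nullptr;
    cudaStream_t stream;
    int n;
public:
    explicit Controller(int n_) : n(n_) {
        cudaMalloc(&d_buf, n * sizeof(float));
        cudaStreamCreate(&stream);
    }
    ~Controller() { cudaFree(d_buf); cudaStreamDestroy(stream); }
    // One call hides the transfers and a launch with derived parameters.
    void run_scale(float *host, float s) {
        cudaMemcpyAsync(d_buf, host, n * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
        int block = 256;                   // a tuned default, chosen here by hand
        scale_kernel<<<(n + block - 1) / block, block, 0, stream>>>(d_buf, n, s);
        cudaMemcpyAsync(host, d_buf, n * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);
        cudaStreamSynchronize(stream);
    }
};

int main() {
    float data[1024];
    for (int i = 0; i < 1024; ++i) data[i] = 1.0f;
    Controller ctrl(1024);
    ctrl.run_scale(data, 3.0f);            // no explicit cudaMemcpy at the call site
    printf("data[0] = %f\n", data[0]);     // 3.0
    return 0;
}
```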

