Acceleration of a two-dimensional Euler flow solver using commodity graphics hardware

Author(s):  
T Brandvik ◽  
G Pullan

The implementation of a two-dimensional Euler solver on graphics hardware is described. The graphics processing unit (GPU) is highly parallel and uses a programming model that is well suited to flow computation. Results for a transonic turbine cascade test case are presented. For large grids (10^6 nodes), a 40 times speed-up over a Fortran implementation on a contemporary CPU is observed.
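As a minimal sketch of how such a solver typically maps onto the GPU, the kernel below assigns one thread to each node of a structured grid in row-major storage. The kernel name, the placeholder averaging stencil, and the launch configuration are illustrative assumptions, not the paper's actual flux scheme.

    // Hypothetical sketch: one thread per grid node applying a simple
    // 2D stencil update. A real Euler solver would evaluate flux
    // residuals for mass, momentum, and energy here; this only
    // illustrates the thread-to-node mapping.
    __global__ void stencil_update(const float* u_old, float* u_new,
                                   int nx, int ny, float coeff)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i <= 0 || j <= 0 || i >= nx - 1 || j >= ny - 1) return; // skip boundary

        int idx = j * nx + i;
        u_new[idx] = u_old[idx] + coeff * (u_old[idx - 1] + u_old[idx + 1]
                   + u_old[idx - nx] + u_old[idx + nx] - 4.0f * u_old[idx]);
    }

    // Host side (sketch): time-march by launching the kernel repeatedly,
    // swapping u_old and u_new between iterations.
    //   dim3 block(16, 16);
    //   dim3 grid((nx + 15) / 16, (ny + 15) / 16);
    //   stencil_update<<<grid, block>>>(d_old, d_new, nx, ny, coeff);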


2011 ◽  
Vol 21 (01) ◽  
pp. 31-47 ◽  
Author(s):  
NOEL LOPES ◽  
BERNARDETE RIBEIRO

The Graphics Processing Unit (GPU), originally designed for rendering graphics and difficult to program for other tasks, has since evolved into a device suitable for general-purpose computation. As a result, graphics hardware has become progressively more attractive, yielding unprecedented performance at a relatively low cost. This makes it an ideal candidate for accelerating a wide variety of data-parallel tasks in many fields, such as Machine Learning (ML). As problems become increasingly demanding, parallel implementations of learning algorithms are crucial for practical applications. In particular, implementing Neural Networks (NNs) on GPUs can significantly reduce the long training times of the learning process. In this paper we present a GPU parallel implementation of the Back-Propagation (BP) and Multiple Back-Propagation (MBP) algorithms and describe the GPU kernels needed for this task. The results obtained on well-known benchmarks show faster training times and improved performance compared with a traditional CPU implementation, owing to maximized floating-point throughput and memory bandwidth. Moreover, a preliminary GPU-based Autonomous Training System (ATS) is developed that aims to automatically find high-quality NN-based solutions for a given problem.
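As a hedged sketch of the kind of kernel such an implementation needs (not the authors' actual BP/MBP kernels), the following computes the forward pass of one fully connected layer with one thread per output neuron; the names, row-major weight layout, and logistic activation are assumptions for illustration.

    // Hypothetical sketch: forward pass of one fully connected layer.
    // Back-propagation adds analogous kernels for the output deltas,
    // the back-propagated errors, and the weight updates.
    __global__ void forward_layer(const float* in, const float* W,
                                  const float* bias, float* out,
                                  int n_in, int n_out)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (j >= n_out) return;

        float sum = bias[j];
        for (int i = 0; i < n_in; ++i)
            sum += W[j * n_in + i] * in[i];   // row-major weight matrix

        out[j] = 1.0f / (1.0f + expf(-sum));  // logistic activation
    }

    // Host side (sketch): one launch per layer per training pattern.
    //   int block = 256;
    //   forward_layer<<<(n_out + block - 1) / block, block>>>(
    //       d_in, d_W, d_bias, d_out, n_in, n_out);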





Author(s):  
Mainak Adhikari ◽  
Sukhendu Kar

The graphics processing unit (GPU) traditionally handles computation only for computer graphics. However, any GPU providing a functionally complete set of operations on arbitrary bits can compute any computable value. Additionally, the use of multiple graphics cards in one computer, or of large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing. CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA and implemented by its graphics processing units (GPUs). CUDA gives program developers direct access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs. This chapter first discusses some features and challenges of GPU programming and the effort to address some of those challenges when building and running GPU programs in a high-performance computing (HPC) environment. Finally, the chapter points out the importance and standards of the CUDA architecture.
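The programming model the chapter describes is conventionally illustrated with the canonical CUDA vector-addition example: the host allocates device memory, copies inputs across, launches a kernel over a grid of threads, and copies the result back. This is a standard minimal example, not taken from the chapter.

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Each thread adds one pair of elements.
    __global__ void vec_add(const float* a, const float* b, float* c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main()
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes),
              *h_c = (float*)malloc(bytes);
        for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

        float *d_a, *d_b, *d_c;
        cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        int block = 256;
        vec_add<<<(n + block - 1) / block, block>>>(d_a, d_b, d_c, n);

        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        printf("c[0] = %f\n", h_c[0]);  // expect 3.0

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(h_a); free(h_b); free(h_c);
        return 0;
    }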



2014 ◽  
Vol 8 (5) ◽  
pp. 229-236 ◽  
Author(s):  
Changhe Song ◽  
Yunsong Li ◽  
Jie Guo ◽  
Jie Lei


2020 ◽  
Vol 45 (15) ◽  
pp. 4124
Author(s):  
Pin-Chieh Huang ◽  
Rishyashring R. Iyer ◽  
Yuan-Zhi Liu ◽  
Stephen A. Boppart




2017 ◽  
Vol 14 (1) ◽  
pp. 789-795
Author(s):  
V Saveetha ◽  
S Sophia

Parallel data clustering aims to use algorithms and methods to extract knowledge from large databases in reasonable time using high-performance architectures. The computational challenge that cluster analysis faces due to increasing data volumes can be overcome by exploiting the power of these architectures. Recent developments in the parallel power of the Graphics Processing Unit enable low-cost, high-performance solutions for general-purpose applications. The Compute Unified Device Architecture programming model provides application programming interface methods to handle data efficiently on the Graphics Processing Unit for iterative clustering algorithms like K-Means. Existing Graphics Processing Unit based K-Means algorithms focus heavily on improving the speedup of the algorithm and fall short of addressing the high time spent on transferring data between the Central Processing Unit and the Graphics Processing Unit. A competent K-Means algorithm is proposed in this paper to lessen the transfer time by introducing a novel approach to checking the convergence of the algorithm and by utilizing pinned memory for direct access. This algorithm outperforms the other algorithms by maximizing parallelism and exploiting the memory features. The relative speedups and the validity measure for the proposed algorithm are higher when compared with K-Means on the Graphics Processing Unit and K-Means using a flag on the Graphics Processing Unit. Thus the proposed approach shows that communication overhead can be reduced in K-Means clustering.
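The two transfer-saving mechanics the abstract describes, a device-side convergence flag and pinned host memory, might look like the following hypothetical sketch (not the authors' code). Only a single flag word crosses the CPU-GPU boundary each iteration, instead of the full assignment array.

    #include <cuda_runtime.h>

    // Hypothetical sketch: assign each point to its nearest centroid and
    // record in a device-side flag whether any assignment changed.
    __global__ void assign_points(const float* points, const float* centroids,
                                  int* labels, int* changed,
                                  int n, int k, int dim)
    {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= n) return;

        int best = 0; float best_d = 1e30f;
        for (int c = 0; c < k; ++c) {
            float d = 0.0f;
            for (int j = 0; j < dim; ++j) {
                float diff = points[p * dim + j] - centroids[c * dim + j];
                d += diff * diff;
            }
            if (d < best_d) { best_d = d; best = c; }
        }
        if (labels[p] != best) { labels[p] = best; atomicExch(changed, 1); }
    }

    // Host side (sketch): pinned memory makes the one-word readback cheap.
    //   int* h_changed;
    //   cudaHostAlloc(&h_changed, sizeof(int), cudaHostAllocDefault);
    //   Each iteration: zero the device flag, run assign_points and the
    //   centroid update, copy the flag back, stop when *h_changed == 0.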



2017 ◽  
Vol 06 (04) ◽  
pp. 1750009
Author(s):  
Jonathan Van Belle ◽  
Richard Armstrong ◽  
James Gain

Deconvolution of native radio interferometric images constitutes a major computational component of the imaging process. An efficient and robust deconvolution operation is essential for reconstructing the true sky signal from measured telescope data. The techniques of compressed sensing provide a mathematically rigorous framework within which to implement deconvolution of images formed from a sparse set of nearly random measurements. We present an accelerated implementation of the orthogonal matching pursuit (OMP) algorithm (a compressed sensing method) that makes use of graphics processing unit (GPU) hardware. We show that OMP correctly identifies more sources than CLEAN, identifying up to 82% of the sources in 100 test images, whereas CLEAN identifies only up to 61%. In addition, the residual after source extraction is [Formula: see text] times lower for OMP than for CLEAN. Furthermore, the GPU implementation of OMP performs around 23 times faster than a 4-core CPU.
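For reference, a compact host-side sketch of the OMP loop itself, written as plain C++ that compiles under CUDA; in an accelerated implementation such as the one described, the atom-residual correlations and the least-squares solve are the steps that would be offloaded to the GPU. All names here are illustrative assumptions, not the paper's code.

    #include <cmath>
    #include <vector>

    // Hypothetical OMP sketch. A is n x m, column-major, with unit-norm
    // columns (atoms); y has length n; k is the target sparsity.
    void omp(const std::vector<float>& A, const std::vector<float>& y,
             int n, int m, int k,
             std::vector<int>& support, std::vector<float>& coef)
    {
        std::vector<float> r(y);                       // residual
        for (int t = 0; t < k; ++t) {
            // 1. Correlate every atom with the residual; pick the best.
            int best = -1; float best_abs = -1.0f;
            for (int j = 0; j < m; ++j) {
                float dot = 0.0f;
                for (int i = 0; i < n; ++i) dot += A[j * n + i] * r[i];
                if (std::fabs(dot) > best_abs) { best_abs = std::fabs(dot); best = j; }
            }
            support.push_back(best);
            int s = (int)support.size();

            // 2. Least squares on the selected atoms via normal equations
            //    G x = b; G is SPD, so plain Gaussian elimination suffices.
            std::vector<float> G(s * s), b(s);
            for (int a = 0; a < s; ++a) {
                for (int c = 0; c < s; ++c) {
                    float g = 0.0f;
                    for (int i = 0; i < n; ++i)
                        g += A[support[a] * n + i] * A[support[c] * n + i];
                    G[a * s + c] = g;
                }
                float bb = 0.0f;
                for (int i = 0; i < n; ++i) bb += A[support[a] * n + i] * y[i];
                b[a] = bb;
            }
            coef.assign(s, 0.0f);
            for (int p = 0; p < s; ++p) {              // forward elimination
                for (int q = p + 1; q < s; ++q) {
                    float f = G[q * s + p] / G[p * s + p];
                    for (int c = p; c < s; ++c) G[q * s + c] -= f * G[p * s + c];
                    b[q] -= f * b[p];
                }
            }
            for (int p = s - 1; p >= 0; --p) {         // back substitution
                float sum = b[p];
                for (int c = p + 1; c < s; ++c) sum -= G[p * s + c] * coef[c];
                coef[p] = sum / G[p * s + p];
            }

            // 3. Update the residual: r = y - A_S x.
            r = y;
            for (int a = 0; a < s; ++a)
                for (int i = 0; i < n; ++i)
                    r[i] -= coef[a] * A[support[a] * n + i];
        }
    }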


