Computational Fluid Dynamics Computations Using a Preconditioned Krylov Solver on Graphical Processing Units

2015 ◽  
Vol 138 (1) ◽  
Author(s):  
Amit Amritkar ◽  
Danesh Tafti

Graphical processing unit (GPU) computation has seen extensive growth in recent years owing to advances in both hardware and the software stack. This has led to increased use of GPUs as accelerators across a broad spectrum of applications. This work deals with the use of general-purpose GPUs for performing computational fluid dynamics (CFD) computations. The paper discusses strategies and findings on porting a large multifunctional CFD code to the GPU architecture. Within this framework, the most compute-intensive segment of the software, the BiCGStab linear solver using additive Schwarz block preconditioners with point-Jacobi iterative smoothing, is optimized for the GPU platform using various techniques in CUDA Fortran. Representative turbulent channel and pipe flows are investigated for validation and benchmarking purposes. Both single and double precision calculations are examined. Precision is found to have a negligible effect on the accuracy of the predicted turbulent statistics; however, single precision calculations led to instabilities in the initial convergence of the pressure equation when the convergence criterion was set too low, which was remedied by limiting the number of iterations during the initial stages of the calculation. For a modest single-block grid of 64 × 64 × 64, the turbulent channel flow computations showed a speedup of about eightfold in double precision and more than 13-fold in single precision on the NVIDIA Tesla GPU over a serial run on an Intel central processing unit (CPU). For the pipe flow consisting of 1.78 × 10⁶ grid cells distributed over 36 mesh blocks, the gains were more modest at 4.5 and 6.5 for double and single precision, respectively.
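The solver stack named above is built from standard pieces: BiCGStab Krylov iterations wrapped around an additive Schwarz preconditioner that treats each mesh block independently, with point-Jacobi sweeps as the block-local smoother. As a rough illustration of the kind of kernel this involves (a sketch in CUDA C rather than the paper's CUDA Fortran; the 7-point stencil, unit coefficients, and all names are assumptions, not the authors' code), one Jacobi sweep over the interior of a single block might look like:

```cuda
// One point-Jacobi sweep over the interior of a single structured block.
// Under additive Schwarz, each mesh block is smoothed independently with
// its boundary data frozen, so a sweep like this runs per block.
// Illustrative sketch only: 7-point Poisson-like stencil, unit spacing.
__global__ void jacobiSweep(const double* __restrict__ x,   // current iterate
                            double* __restrict__ xNew,      // updated iterate
                            const double* __restrict__ b,   // right-hand side
                            int nx, int ny, int nz)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < 1 || i > nx - 2 || j < 1 || j > ny - 2 || k < 1 || k > nz - 2)
        return;                                    // interior cells only

    int id = (k * ny + j) * nx + i;                // linearised cell index
    double nb = x[id - 1]       + x[id + 1]        // x-neighbours
              + x[id - nx]      + x[id + nx]       // y-neighbours
              + x[id - nx * ny] + x[id + nx * ny]; // z-neighbours
    xNew[id] = (b[id] + nb) / 6.0;                 // new value at this cell
}
```

Launched with a three-dimensional thread grid covering a 64 × 64 × 64 block, such a sweep is embarrassingly parallel, which is why the smoother and the stencil products inside BiCGStab map well to the GPU, while the inter-block coupling in the 36-block pipe-flow case dilutes the gains.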


2013 ◽  
Vol 753-755 ◽  
pp. 2731-2735
Author(s):  
Wei Cao ◽  
Zheng Hua Wang ◽  
Chuan Fu Xu

The graphics processing unit (GPU) has evolved from a configurable graphics processor into a powerful engine for high-performance computing. In this paper, we describe the graphics pipeline of the GPU and review the history and evolution of GPU architecture. We also summarize the software environments used on GPUs, from graphics APIs to non-graphics APIs. Finally, we survey GPU computing in computational fluid dynamics applications, covering general-purpose GPU (GPGPU) computing both for Navier–Stokes methods and for the lattice Boltzmann method.
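Of the two method families named at the end, the lattice Boltzmann method maps to GPUs most naturally, because its collision step is purely local to each lattice cell. A minimal sketch of that step (D2Q9 lattice with BGK single-relaxation-time collision; the layout and names are illustrative, not taken from the survey):

```cuda
// BGK collision step for a D2Q9 lattice Boltzmann solver: each lattice
// cell relaxes its 9 distributions toward local equilibrium with no
// neighbour access at all, ideal for one-thread-per-cell GPU execution.
// Illustrative sketch; data layout and names are assumptions.
__constant__ double w[9]  = {4.0/9, 1.0/9, 1.0/9, 1.0/9, 1.0/9,
                             1.0/36, 1.0/36, 1.0/36, 1.0/36};
__constant__ int    cx[9] = {0, 1, 0, -1, 0, 1, -1, -1, 1};
__constant__ int    cy[9] = {0, 0, 1, 0, -1, 1, 1, -1, -1};

__global__ void bgkCollide(double* f, int ncells, double tau)
{
    int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell >= ncells) return;

    double rho = 0.0, ux = 0.0, uy = 0.0;
    for (int q = 0; q < 9; ++q) {                 // local macroscopic moments
        double fq = f[q * ncells + cell];         // structure-of-arrays layout
        rho += fq;  ux += fq * cx[q];  uy += fq * cy[q];
    }
    ux /= rho;  uy /= rho;

    double usq = ux * ux + uy * uy;
    for (int q = 0; q < 9; ++q) {                 // relax toward equilibrium
        double cu  = cx[q] * ux + cy[q] * uy;
        double feq = w[q] * rho * (1.0 + 3.0*cu + 4.5*cu*cu - 1.5*usq);
        f[q * ncells + cell] -= (f[q * ncells + cell] - feq) / tau;
    }
}
```

The companion streaming step is pure neighbour-to-neighbour memory movement, which is why GPU LBM work of this period concentrates on memory layout, such as the structure-of-arrays indexing above, and on coalesced access.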


2006 ◽  
Vol 129 (2) ◽  
pp. 221-231 ◽  
Author(s):  
André Burdet ◽  
Reza S. Abhari ◽  
Martin G. Rose

Computational fluid dynamics (CFD) has recently been used for the simulation of the aerothermodynamics of film cooling. The direct calculation of a single cooling hole requires substantial computational resources, and a parametric study for the optimization of the cooling system in real engines is much too time-consuming owing to the large number of grid nodes required to cover all injection holes and plenum chambers. For these reasons, a hybrid approach is proposed, based on modeling the near-hole film-cooling flow, tuned using experimental data, while computing the flow field in the blade-to-blade passage directly. A new injection film-cooling model is established that can be embedded in a CFD code to lower the central processing unit (CPU) cost and reduce the simulation turnover time. The goal is to simulate film-cooled turbine blades without having to explicitly mesh the inside of the holes and the plenum chamber. The stability, low CPU overhead (about 1%), and accuracy of the proposed CFD-embedded film-cooling model are demonstrated against the ETHZ steady film-cooled flat-plate experiment presented in Part I (Bernsdorf, Rose, and Abhari, 2006, ASME J. Turbomach., 128, pp. 141–149) of this two-part paper, and the prediction of film-cooling effectiveness using the embedded model is evaluated.
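The paper's injection model itself is tuned to the experimental data of Part I and is not reproduced here, but the general mechanism of a CFD-embedded film-cooling model is to replace the meshed hole and plenum with mass, momentum, and energy sources applied in the wall cells covering the hole footprint. A generic, hypothetical sketch of that mechanism (every field and name below is an assumption):

```cuda
// Generic source-term injection: instead of meshing the cooling hole and
// plenum, cells tagged as lying in the hole footprint receive mass,
// momentum, and energy sources matching the coolant jet. This shows the
// mechanism of CFD-embedded film-cooling models in general, not the
// specific tuned model of the paper; every name here is hypothetical.
struct Coolant { double mdot, u, v, w, T; };     // per-cell jet properties

__global__ void addFilmCoolingSources(const int* holeCells, int nHoleCells,
                                      const Coolant* jet,
                                      double* srcMass, double* srcMomX,
                                      double* srcMomY, double* srcMomZ,
                                      double* srcEnergy, double cp)
{
    int n = blockIdx.x * blockDim.x + threadIdx.x;
    if (n >= nHoleCells) return;

    int c = holeCells[n];                        // global index of tagged cell
    const Coolant& j = jet[n];
    srcMass[c]   += j.mdot;                      // coolant mass injection
    srcMomX[c]   += j.mdot * j.u;                // jet momentum components
    srcMomY[c]   += j.mdot * j.v;
    srcMomZ[c]   += j.mdot * j.w;
    srcEnergy[c] += j.mdot * cp * j.T;           // enthalpy carried by the jet
}
```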


Author(s):  
M Franchetta ◽  
K O Suen ◽  
T G Bancroft

Underbonnet simulations are proving to be crucially important within a vehicle development programme, reducing test work and time-to-market. While computational fluid dynamics (CFD) simulations of steady forced flows have been demonstrated to be reliable, studies of transient convective flows in engine compartments are not yet routinely carried out owing to high computing demands and a lack of validated work. The present work assesses the practical feasibility of applying CFD at the initial stage of a vehicle development programme to investigate the thermally driven flow in an engine bay under thermal soak. A computation procedure is proposed that enables pseudo time-marching CFD simulations to be performed with significantly reduced central processing unit (CPU) time. The methodology was initially tested on simple geometries and then applied to a simplified half-scale underbonnet compartment. The numerical results are compared with experimental data taken with thermocouples and with particle image velocimetry (PIV). The novel methodology succeeds in efficiently providing detailed, time-accurate thermal and flow predictions. Its application will extend the use of CFD to transient investigations, enabling improvements to the component packaging of engine bays and the refinement of thermal management strategies with reduced need for in-territory testing.
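The abstract does not spell out the computation procedure. One plausible reading of "pseudo time-marching with significantly reduced CPU time", common in conjugate thermal problems, is to exploit the gap between flow and thermal time scales by re-converging the buoyant flow field only intermittently; the skeleton below is a hypothetical sketch of that idea, not the authors' actual algorithm:

```cuda
#include <cstdio>

// Stubs standing in for the expensive solves; hypothetical names.
void convergeFlowField()                  { /* steady buoyant-flow solve */ }
void advanceSolidAndFluidTemps(double dt) { /* transient energy update */ (void)dt; }

// Hypothetical time-scale-split "pseudo time-marching" loop: the
// fast-settling buoyant flow field is re-converged only every flowEvery
// thermal steps and held frozen in between, while the slow thermal field
// marches in true time. A sketch of one plausible procedure only.
void thermalSoak(int nThermalSteps, int flowEvery, double dtThermal)
{
    for (int step = 0; step < nThermalSteps; ++step) {
        if (step % flowEvery == 0)
            convergeFlowField();                  // occasional flow update
        advanceSolidAndFluidTemps(dtThermal);     // cheap thermal step
    }
}
```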


2011 ◽  
Vol 14 (3) ◽  
pp. 603-612 ◽  
Author(s):  
P. A. Crous ◽  
J. E. van Zyl ◽  
Y. Roodt

The engineering discipline has relied on computers to perform numerical calculations in many of its sub-disciplines over recent decades. The advent of graphical processing units (GPUs), parallel stream processors, has the potential to speed up generic simulations for engineering applications beyond traditional computer graphics, using GPGPU (general-purpose programming on the GPU). Exploiting the GPU for general-purpose computation requires the program to be highly arithmetic-intensive and its operations largely data-independent. This paper looks at the specific application of the conjugate gradient method used in hydraulic network solvers on the GPU and compares the results to conventional central processing unit (CPU) implementations. The results indicate that the GPU becomes more efficient as the data set size increases. However, with current hardware and this implementation of the conjugate gradient algorithm, applying stream processing to hydraulic network solvers is faster and more efficient only for exceptionally large water distribution models, which are seldom found in practice.
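For reference, the algorithm being ported is compact: conjugate gradient iteration for the symmetric positive-definite system arising from a hydraulic network is one sparse matrix-vector product per iteration plus a few dot products and vector updates. A plain CPU sketch (CSR storage; all names are illustrative):

```cuda
#include <cmath>
#include <vector>

// Sparse matrix-vector product, y = A*x, with A in CSR storage.
void spmv(const std::vector<int>& rowPtr, const std::vector<int>& col,
          const std::vector<double>& val, const std::vector<double>& x,
          std::vector<double>& y)
{
    for (size_t r = 0; r + 1 < rowPtr.size(); ++r) {
        double s = 0.0;
        for (int n = rowPtr[r]; n < rowPtr[r + 1]; ++n) s += val[n] * x[col[n]];
        y[r] = s;
    }
}

// Conjugate gradient for a symmetric positive-definite system, shown as
// a CPU reference sketch of the algorithm benchmarked above. Assumes the
// initial guess x = 0, so the initial residual is simply b.
void conjugateGradient(const std::vector<int>& rowPtr,
                       const std::vector<int>& col,
                       const std::vector<double>& val,
                       const std::vector<double>& b,
                       std::vector<double>& x, double tol, int maxIter)
{
    size_t n = b.size();
    std::vector<double> r(b), p(b), Ap(n);
    double rr = 0.0;
    for (size_t i = 0; i < n; ++i) rr += r[i] * r[i];

    for (int it = 0; it < maxIter && std::sqrt(rr) > tol; ++it) {
        spmv(rowPtr, col, val, p, Ap);
        double pAp = 0.0;
        for (size_t i = 0; i < n; ++i) pAp += p[i] * Ap[i];
        double alpha = rr / pAp;                  // step length
        double rrNew = 0.0;
        for (size_t i = 0; i < n; ++i) {
            x[i] += alpha * p[i];                 // update solution
            r[i] -= alpha * Ap[i];                // update residual
            rrNew += r[i] * r[i];
        }
        for (size_t i = 0; i < n; ++i)
            p[i] = r[i] + (rrNew / rr) * p[i];    // new search direction
        rr = rrNew;
    }
}
```

Each iteration is dominated by the matrix-vector product; on small networks, GPU kernel-launch and transfer overheads swamp it, which is consistent with the finding that only exceptionally large models benefit.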


2021 ◽  
Vol 8 (2) ◽  
pp. 169-180
Author(s):  
Mark Lin ◽  
Periklis Papadopoulos

Computational methods such as computational fluid dynamics (CFD) traditionally yield a single output: a single number, much like the result of a theoretical hand calculation. This paper shows, however, that computational methods have inherent uncertainty, which can also be reported statistically. In numerical computation, because many factors affect the data collected, results can be quoted as a mean value with standard deviations (error bars) to make comparisons meaningful. Where two data sets are obscured by this uncertainty, they are said to be indistinguishable. A sample CFD problem pertaining to external aerodynamics was copied and run on 29 identical computers in a university computer lab. The expectation was that all 29 runs would return exactly the same result; in a few cases, however, the results differed. This is attributed to the parallelization scheme, which partitions the mesh to run in parallel on multiple cores of the computer. The distribution of the computational load is hardware-driven, depending on the resources available on each computer at the time. Details such as load balancing among multiple central processing unit (CPU) cores using the Message Passing Interface (MPI) are transparent to the user, and software such as METIS or JOSTLE automatically divides the load between processors. As such, the user has no control over the outcome of the CFD calculation even when the same problem is computed. Because of this, numerical uncertainty arises from parallel (multicore) computing. One way to resolve the issue is to compute problems on a single core, without mesh repartitioning; however, as this paper demonstrates, even this is not straightforward. Keywords: numerical uncertainty, parallelization, load-balancing, automotive aerodynamics
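The root cause the authors point to is easy to reproduce without any CFD: floating-point addition is not associative, so when repartitioning changes the order in which terms are grouped in a parallel reduction, the rounded result changes too. A self-contained demonstration:

```cuda
#include <cstdio>

// Floating-point addition is not associative: summing the same numbers
// in a different order (as happens when a mesh is repartitioned and a
// parallel reduction regroups the terms) gives a slightly different
// rounded result. This is the seed of the run-to-run differences above.
int main()
{
    const int n = 1000000;
    float fwd = 0.0f, rev = 0.0f;
    for (int i = 1; i <= n; ++i) fwd += 1.0f / i;   // ascending order
    for (int i = n; i >= 1; --i) rev += 1.0f / i;   // descending order
    std::printf("forward  sum = %.8f\n", fwd);
    std::printf("backward sum = %.8f\n", rev);
    std::printf("difference   = %.3e\n", fwd - rev);
    return 0;
}
```

Propagated through thousands of time steps of an unsteady, sensitive flow calculation, such last-bit differences grow into the visible run-to-run discrepancies described above.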


2013 ◽  
Vol 1 ◽  
pp. 151-163 ◽  
Author(s):  
Nikolay A. Simakov ◽  
Maria G. Kurnikova

The Poisson and Poisson–Boltzmann equations (PE and PBE) are widely used in molecular modeling to estimate the electrostatic contribution to the free energy of a system. In such applications, the PE often needs to be solved multiple times for a large number of system configurations, which can rapidly become a highly demanding computational task. To accelerate such calculations, we implemented the graphical processing unit (GPU) PE solver described in this work. The GPU solver's performance is compared to that of our central processing unit (CPU) implementation. The performance analysis examined three characteristics: (1) the precision associated with the discretization of the modeled system on the grid; (2) the numeric precision associated with the floating-point representation of real numbers, assessed by comparing single precision (SP) and double precision (DP) calculations; and (3) execution time. Two types of example calculations were carried out to evaluate the solver: (1) the solvation energy of a single ion and of a small protein (lysozyme), and (2) the potential of a single ion in a large ion channel (α-hemolysin). In addition, the influence of various boundary condition (BC) choices was analyzed to determine the most appropriate BC for systems that include a membrane, typically represented by a slab with a low dielectric constant. The implemented GPU PE solver is overall about 7 times faster than the CPU-based version (using all four cores), so a single computer equipped with multiple GPUs can offer computational power comparable to that of a small cluster. Our calculations showed that the DP versions of the CPU and GPU solvers provide nearly identical results. The SP versions behave very similarly: in the grid-scale range of 1–4 grids/Å, the difference between the SP and DP versions is smaller than the difference stemming from the system discretization. We found that, for the membrane protein, a focusing technique with periodic boundary conditions on the rough grid provides significantly better results than a focusing technique with the electric potential set to zero at the boundaries.
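The SP-versus-DP comparison amounts to running the same relaxation in both floating-point types and differencing the results. A minimal sketch of how a finite-difference PE kernel can be written once and instantiated in both precisions (CUDA C with illustrative names; the actual solver also carries the spatially varying dielectric map and its boundary-condition machinery):

```cuda
// One templated Jacobi relaxation sweep for a uniform-permittivity
// Poisson problem, instantiable as float or double so the single- and
// double-precision solvers share one source. Illustrative sketch only:
// the real PE solver also handles a spatially varying dielectric map.
template <typename Real>
__global__ void poissonSweep(const Real* __restrict__ phi,     // potential
                             Real* __restrict__ phiNew,
                             const Real* __restrict__ rho,     // scaled charge density
                             int nx, int ny, int nz, Real h2)  // h2 = grid spacing^2
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k = blockIdx.z * blockDim.z + threadIdx.z;
    if (i < 1 || i > nx - 2 || j < 1 || j > ny - 2 || k < 1 || k > nz - 2)
        return;                                   // boundary cells hold the BC values

    int id = (k * ny + j) * nx + i;
    Real nb = phi[id - 1]       + phi[id + 1]
            + phi[id - nx]      + phi[id + nx]
            + phi[id - nx * ny] + phi[id + nx * ny];
    phiNew[id] = (nb + h2 * rho[id]) / Real(6);   // discrete Poisson update
}

// Both precisions from the same source:
template __global__ void poissonSweep<float >(const float*,  float*,  const float*,  int, int, int, float);
template __global__ void poissonSweep<double>(const double*, double*, const double*, int, int, int, double);
```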


1980 ◽  
Vol 24 (02) ◽  
pp. 101-113 ◽  
Author(s):  
Owen F. Hughes ◽  
Farrokh Mistree ◽  
Vedran Žanic

A practical, rationally based method is presented for the automated optimum design of ship structures. The method required the development of (a) a rapid, design-oriented finite-element program for the analysis of ship structures; (b) a comprehensive mathematical model for the evaluation of the capability of the structure; and (c) a cost-effective optimization algorithm for the solution of a large, highly constrained, nonlinear redesign problem. These developments have been incorporated into a program called SHIPOPT. The efficiency and robustness of the method are illustrated by using it to determine the optimum design of a complete cargo hold of a general-purpose cargo ship. The overall dimensions and the design loads are the same as those used in the design of the very successful SD14 series of ships. The redesign problem contains 94 variables, a nonlinear objective function, and over 500 constraints, of which approximately half are nonlinear. Program SHIPOPT required approximately eight minutes of central processing unit time on a CDC CYBER 171 to determine the optimum design.


Processes ◽  
2020 ◽  
Vol 8 (9) ◽  
pp. 1199
Author(s):  
Ravie Chandren Muniyandi ◽  
Ali Maroosi

Long-timescale simulations of biological processes such as photosynthesis, or attempts to solve NP-hard problems such as the traveling salesman, knapsack, Hamiltonian path, and satisfiability problems, using membrane systems without appropriate parallelization can take hours or days. Graphics processing units (GPUs) deliver an immensely parallel mechanism for general-purpose computation. Previous studies mapped one membrane to one thread block on the GPU. This is disadvantageous because when the number of objects per membrane is small, the number of active threads is also small, decreasing performance. Moreover, when each membrane is assigned to one thread block, communication between membranes must be carried out as communication between thread blocks, which is a time-consuming process. Previous approaches have also not addressed the issue of GPU occupancy. This study presents a classification algorithm that manages dependent objects and membranes based on the communication rate associated with a defined weighted network and assigns them to sub-matrices. Dependent objects and membranes are thus allocated to the same threads and thread blocks, decreasing communication between threads and thread blocks and allowing the GPU to maintain the highest possible occupancy. The experimental results indicate that, for 48 objects per membrane, the algorithm achieves a 93-fold speedup, compared to a 1.6-fold speedup with previous algorithms.
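The heart of the classification algorithm, as described, is grouping objects and membranes that communicate heavily onto the same threads and thread blocks. A toy host-side sketch of that grouping idea (greedy merging by edge weight on the communication network, capped at a block capacity; the names and the greedy rule are assumptions, and the paper's actual classification into sub-matrices is more elaborate):

```cuda
#include <algorithm>
#include <vector>

// Toy version of the grouping idea: membranes are nodes in a weighted
// communication network, and heavily communicating pairs are greedily
// merged into the same group (a future thread block) until groups reach
// the block capacity. A sketch of the principle only; the paper's
// classification into sub-matrices is more elaborate.
struct Edge { int a, b; double commRate; };

std::vector<int> groupMembranes(int nMembranes, std::vector<Edge> edges,
                                int maxGroupSize)
{
    std::vector<int> group(nMembranes), size(nMembranes, 1);
    for (int m = 0; m < nMembranes; ++m) group[m] = m;  // each alone at first

    // Heaviest communication first, so the costliest links land in-block.
    std::sort(edges.begin(), edges.end(),
              [](const Edge& x, const Edge& y) { return x.commRate > y.commRate; });

    for (const Edge& e : edges) {
        int ga = group[e.a], gb = group[e.b];
        if (ga == gb || size[ga] + size[gb] > maxGroupSize) continue;
        for (int m = 0; m < nMembranes; ++m)    // merge group gb into ga
            if (group[m] == gb) group[m] = ga;
        size[ga] += size[gb];
    }
    return group;   // group[m] = thread block that membrane m maps to
}
```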

