Hybrid MPI and CUDA Parallelization for CFD Applications on Multi-GPU HPC Clusters

Graphics processing units (GPUs) offer strong floating-point performance and high memory bandwidth for data-parallel workloads and are widely used in high-performance computing (HPC). Compute unified device architecture (CUDA) serves as a parallel computing platform and programming model that reduces the complexity of GPU programming. Programmable GPUs are becoming popular in computational fluid dynamics (CFD) applications. In this work, we propose a hybrid parallel algorithm that combines the message passing interface (MPI) and CUDA for CFD applications on multi-GPU HPC clusters. The AUSM+-up upwind scheme and the three-step Runge–Kutta method are used for spatial and temporal discretization, respectively. Turbulence is modeled with the k-ω SST two-equation model. The CPU only manages GPU execution and communication, while the GPU is responsible for data processing. Parallel execution and memory access optimizations are applied to the GPU-based CFD codes. We propose a nonblocking communication method that fully overlaps GPU computing, CPU-CPU communication, and CPU-GPU data transfer by creating two CUDA streams. Furthermore, a one-dimensional domain decomposition method is used to balance the workload among GPUs. Finally, we evaluate the hybrid parallel algorithm on the compressible turbulent flow over a flat plate. The performance of a single-GPU implementation and the scalability on multi-GPU clusters are discussed. Performance measurements show that multi-GPU parallelization achieves a speedup of more than 36 times over CPU-based parallel computing and that the parallel algorithm scales well.
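
The one-dimensional domain decomposition mentioned in this abstract can be illustrated with a minimal Python sketch. It assumes a structured grid split along one axis into contiguous slabs, one per GPU, with one layer of ghost cells on each interior side. The function name `decompose_1d`, the ghost width, and the grid size are illustrative assumptions, not details taken from the paper.

```python
def decompose_1d(n_cells, n_ranks, ghost=1):
    """Split n_cells along one axis into n_ranks contiguous slabs.

    Returns a list of (start, stop, halo_start, halo_stop) index tuples.
    Slab sizes differ by at most one cell, so the workload is balanced.
    The halo range adds `ghost` cells on each side, clipped to the grid.
    """
    if n_ranks < 1 or n_cells < n_ranks:
        raise ValueError("need at least one cell per rank")
    base, extra = divmod(n_cells, n_ranks)
    slabs = []
    start = 0
    for rank in range(n_ranks):
        # The first `extra` ranks take one additional cell each.
        size = base + (1 if rank < extra else 0)
        stop = start + size
        halo_start = max(0, start - ghost)
        halo_stop = min(n_cells, stop + ghost)
        slabs.append((start, stop, halo_start, halo_stop))
        start = stop
    return slabs


# Hypothetical example: 10 cells shared among 4 GPUs.
slabs = decompose_1d(n_cells=10, n_ranks=4)
for rank, (start, stop, h0, h1) in enumerate(slabs):
    print(f"rank {rank}: owns [{start}, {stop}), with halo [{h0}, {h1})")
```

In a hybrid MPI and CUDA code, the halo cells are the data each rank would exchange with its neighbors, which is the traffic a nonblocking scheme tries to hide behind interior computation.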

Comparative study of the implementation of the Lagrange interpolation algorithm on GPU and CPU using CUDA to compute the density of a material at different temperatures

SHS Web of Conferences ◽

10.1051/shsconf/202111907002 ◽

2021 ◽

Vol 119 ◽

pp. 07002

Author(s):

Youness Rtal ◽

Abdelkader Hadjoudja

Keyword(s):

Parallel Computing ◽

Graphics Processing Units ◽

Lagrange Interpolation ◽

Polynomial Interpolation ◽

Programming Model ◽

Interpolation Method ◽

Processing Unit ◽

Central Processing ◽

Computational Performance ◽

Different Temperatures

Graphics Processing Units (GPUs) are microprocessors attached to graphics cards and dedicated to displaying and manipulating graphics data. Today, GPUs are found on all modern graphics cards. Within a few years, these microprocessors have become powerful tools for massively parallel computing. They serve many fields, such as image processing, video and audio encoding and decoding, and the solution of physical systems with one or more unknowns. Their advantages are faster processing and lower energy consumption than the central processing unit (CPU). In this paper, we define and implement the Lagrange polynomial interpolation method on GPU and CPU to calculate the density of sodium at different temperatures Ti using the NVIDIA CUDA C parallel programming model, which can increase computational performance by harnessing the power of the GPU. The objective of this study is to compare the performance of the Lagrange interpolation method on CPU and GPU processors and to assess the efficiency of GPUs for parallel computing.
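
As an illustration of the method named in this abstract, the following is a minimal sequential Python sketch of Lagrange polynomial interpolation. The function name `lagrange_interpolate` is an assumption, and the sample points are placeholder values taken from y = x^2, not sodium density data from the paper.

```python
def lagrange_interpolate(xs, ys, x):
    """Evaluate the Lagrange interpolating polynomial through (xs, ys) at x."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        # Basis polynomial L_i(x): equals 1 at xi and 0 at every other node.
        basis = 1.0
        for j, xj in enumerate(xs):
            if j != i:
                basis *= (x - xj) / (xi - xj)
        total += yi * basis
    return total


# Placeholder nodes sampled from y = x**2 (not measured data).
nodes = [0.0, 1.0, 2.0]
values = [0.0, 1.0, 4.0]
print(lagrange_interpolate(nodes, values, 1.5))
```

Each query point is evaluated independently of the others, so one plausible GPU mapping is one thread per temperature Ti. That mapping is an assumption here, not a description of the authors' kernel.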

A Simulation of Domain Decomposition Method for Smoothed Particle Hydrodynamics

Journal of Engineering Materials and Technology ◽

10.1115/1.4035486 ◽

2017 ◽

Vol 139 (2) ◽

Author(s):

Taehyo Park ◽

Shengjie Li ◽

Mina Lee ◽

Moonho Tak

Keyword(s):

Parallel Computing ◽

Smoothed Particle Hydrodynamics ◽

Message Passing ◽

Message Passing Interface ◽

Domain Decomposition Method ◽

Computational Domain ◽

SPH Method ◽

Multiple Data ◽

Particle Hydrodynamics ◽

Smoothed Particle

Numerical methods have become a very important approach for solving complex problems in engineering and science. Grid-based methods such as the finite difference method (FDM) and the finite element method (FEM) have been widely applied in various areas; however, they still suffer from inherent difficulties that limit their application to many problems. Meshfree methods such as smoothed particle hydrodynamics (SPH) have therefore recently attracted strong interest for simulating fluid flow because of their advantages in dealing with some complicated problems. The SPH method uses a great number of particles because the whole domain is represented by a set of arbitrarily distributed particles. To improve numerical efficiency, parallelization using the message-passing interface (MPI) is applied to problems with a large computational domain. In parallel computing, the whole domain is decomposed under the single instruction multiple data (SIMD) model so that continuity is maintained across subdomain boundaries, following the procedure of the SPH computations. In this work, a new parallel computing scheme is incorporated into the SPH method to analyze SPH particle fluids. In this scheme, the whole domain is decomposed into subdomains under the SIMD process, and boundary conditions are imposed on the interface particles, which improves the detection of neighbor particles near the boundary. With parallel computing, the SPH method becomes more flexible and performs better.
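
The role of interface particles in neighbor detection can be sketched in one dimension with plain Python. The sketch assumes equal-width subdomains and a fixed support radius `h`; the function names, particle positions, and radius are hypothetical and are not taken from the paper.

```python
def split_with_halo(positions, n_sub, x_min, x_max, h):
    """Assign 1D particles to n_sub equal-width subdomains.

    Returns (owned, halo): owned[k] lists the particle indices inside
    subdomain k, and halo[k] lists particles owned elsewhere that lie
    within distance h of subdomain k (the interface particles).
    """
    width = (x_max - x_min) / n_sub
    owned = [[] for _ in range(n_sub)]
    halo = [[] for _ in range(n_sub)]
    for idx, x in enumerate(positions):
        k = min(int((x - x_min) / width), n_sub - 1)
        owned[k].append(idx)
    for k in range(n_sub):
        lo = x_min + k * width
        hi = lo + width
        for idx, x in enumerate(positions):
            if idx not in owned[k] and lo - h <= x <= hi + h:
                halo[k].append(idx)
    return owned, halo


def neighbors(positions, candidates, i, h):
    """Indices in candidates (other than i) within distance h of particle i."""
    return sorted(j for j in candidates
                  if j != i and abs(positions[j] - positions[i]) <= h)


# Hypothetical particle positions on the unit interval.
positions = [0.05, 0.2, 0.45, 0.55, 0.8, 0.95]
owned, halo = split_with_halo(positions, n_sub=2, x_min=0.0, x_max=1.0, h=0.2)
print(owned, halo)
```

Without the halo list, particle 2 (at 0.45) would miss its neighbor at 0.55, which sits in the other subdomain. With it, the local search matches the global one.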

Accelerating Spark-Based Applications with MPI and OpenACC

Complexity ◽

10.1155/2021/9943289 ◽

2021 ◽

Vol 2021 ◽

pp. 1-17

Author(s):

Saeed Alshahrani ◽

Waleed Al Shehri ◽

Jameel Almalki ◽

Ahmed M. Alghamdi ◽

Abdullah M. Alammari

Keyword(s):

Big Data ◽

Power Consumption ◽

Parallel Programming ◽

Graphics Processing Units ◽

Message Passing Interface ◽

Programming Model ◽

Programming Models ◽

Mapping Technique ◽

Big Data Applications ◽

Parallel Programming Models

The amount of data produced in scientific and commercial fields is growing dramatically. Correspondingly, big data technologies, such as Hadoop and Spark, have emerged to tackle the challenges of collecting, processing, and storing such large-scale data. Unfortunately, big data applications usually have performance issues and do not fully exploit the hardware infrastructure. One reason is that applications are developed using high-level programming languages that do not provide the low-level system control and performance of highly parallel programming models such as the message passing interface (MPI). Moreover, big data is considered a barrier to the use of parallel programming models or accelerators (e.g., CUDA and OpenCL). Therefore, the aim of this study is to investigate how the performance of big data applications can be enhanced without sacrificing the power consumption of the hardware infrastructure. A Hybrid Spark MPI OpenACC (HSMO) system is proposed that integrates Spark as a big data programming model with MPI and OpenACC as parallel programming models. Such integration brings together the advantages of each programming model and provides greater effectiveness. To enhance performance without sacrificing power consumption, the integration approach needs to exploit the hardware infrastructure in an intelligent manner. To achieve this performance enhancement, a mapping technique is proposed that is based on the application's virtual topology as well as the physical topology of the underlying resources. To the best of our knowledge, there is no existing method in big data applications related to utilizing graphics processing units (GPUs), which are now an essential part of high-performance computing (HPC) as a powerful resource for fast computation.

Use of GPU Computing for Uncertainty Quantification in Computational Mechanics: A Case Study

Scientific Programming ◽

10.1155/2011/730213 ◽

2011 ◽

Vol 19 (4) ◽

pp. 199-212 ◽

Cited By ~ 3

Author(s):

Gaurav ◽

Steven F. Wojtkiewicz

Keyword(s):

Parallel Computing ◽

Uncertainty Quantification ◽

Graphics Processing Units ◽

Computational Mechanics ◽

GPU Computing ◽

Single Instruction Multiple Data ◽

Performance Constraints ◽

Multiple Data ◽

Graphics Processing

Graphics processing units (GPUs) are rapidly emerging as a more economical and highly competitive alternative to CPU-based parallel computing. As the degree of software control of GPUs has increased, many researchers have explored their use in non-gaming applications. Recent studies have shown that GPUs consistently outperform their best corresponding CPU-based parallel computing alternatives in single-instruction multiple-data (SIMD) strategies. This study explores the use of GPUs for uncertainty quantification in computational mechanics. Five types of analysis procedures that are frequently utilized for uncertainty quantification of mechanical and dynamical systems have been considered and their GPU implementations have been developed. The numerical examples presented in this study show that considerable gains in computational efficiency can be obtained for these procedures. It is expected that the GPU implementations presented in this study will serve as initial bases for further developments in the use of GPUs in the field of uncertainty quantification and will (i) aid the understanding of the performance constraints on the relevant GPU kernels and (ii) provide some guidance regarding the computational and the data structures to be utilized in these novel GPU implementations.

A Parallel Heuristic Method for Optimizing a Real Life Problem (Agricultural Land Investment Problem)

Academic Journal of Nawroz University ◽

10.25007/ajnu.v7n4a286 ◽

2018 ◽

Vol 7 (4) ◽

pp. 168

Author(s):

Sagvan A. Saleh

Keyword(s):

Parallel Computing ◽

Message Passing Interface ◽

Agricultural Land ◽

Programming Model ◽

Heuristic Method ◽

Real Life ◽

Neighborhood Search ◽

Parallel Method ◽

Real Life Problem ◽

Investment Problem

This paper proposes a parallel method for solving the Agricultural Land Investment Problem (ALIP), a problem that has an important impact on agriculture. The author first represents the problem mathematically by introducing a mathematical programming model. Then, a parallel method is proposed for optimizing the problem. The proposed method is based on the principles of parallel computing and neighborhood search methods. Neighborhood search techniques explore a series of solution spaces with the aim of finding the best one. This is exploited in parallel computing, where several search processes are performed simultaneously. The parallel computation is designed using the Message Passing Interface (MPI), which allows building a flexible parallel program that can be executed in multicore and/or distributed environments. The method is competitive since it is able to solve a real-life problem and yield high-quality results in a short runtime.

Accelerating Training Process in Logistic Regression Model using OpenCL Framework

International Journal of Grid and High Performance Computing ◽

10.4018/ijghpc.2017070103 ◽

2017 ◽

Vol 9 (3) ◽

pp. 34-45

Author(s):

Hamada M. Zahera ◽

Ashraf Bahgat El-Sisi

Keyword(s):

Logistic Regression ◽

Regression Model ◽

Graphics Processing Units ◽

High Performance ◽

Message Passing Interface ◽

Logistic Regression Model ◽

GPU Computing ◽

Large Datasets ◽

Training Process ◽

Training Time

In this paper, the authors propose a new parallel approach, implemented on graphics processing units (GPUs), for training logistic regression models. Logistic regression has been applied in many machine learning applications to build predictive models. However, training typically requires a long time to fit an accurate prediction model. Researchers have worked to reduce training time using different technologies such as multi-threading, multi-core CPUs, and the Message Passing Interface (MPI). In their study, the authors consider the high computation capabilities of GPUs and the ease of development with the Open Computing Language (OpenCL) framework to execute the logistic training process. The authors argue that GPUs and OpenCL offer a low-cost, high-performance choice for scaling up logistic regression models to handle large datasets. The proposed approach was implemented in OpenCL C/C++ and tested with datasets of different sizes on two GPU platforms. The experimental results showed a significant improvement in execution time with large datasets, which decreases as the number of available GPU compute units increases.
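
To make the training process concrete, here is a minimal sequential Python sketch of logistic regression trained by batch gradient descent. It is not the authors' OpenCL implementation; the function names, learning rate, epoch count, and toy dataset are all assumptions chosen for illustration.

```python
import math


def sigmoid(z):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights w and bias b by batch gradient descent on the log loss."""
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * d
        grad_b = 0.0
        # The per-sample terms below are independent of one another;
        # this loop is the part a GPU version would spread across threads.
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def predict(w, b, x):
    """Return the class label (0 or 1) for feature vector x."""
    return 1 if sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b) >= 0.5 else 0


# Toy, linearly separable data (illustrative only).
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
w, b = train_logistic(X, y)
print([predict(w, b, x) for x in X])
```

The gradient is a sum over samples, which is why the cost grows with dataset size and why the summation is a natural target for data-parallel hardware.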

New general mixed-integer linear programming model for mobile workforce management

Optimization and Engineering ◽

10.1007/s11081-021-09597-0 ◽

2021 ◽

Author(s):

András Éles ◽

István Heckl ◽

Heriberto Cabezas

Keyword(s):

Programming Model ◽

Time Windows ◽

Mutual Exclusion ◽

Parallel Execution ◽

Mixed Integer ◽

Management Problem ◽

Routing Problem ◽

Workforce Management ◽

Computational Performance ◽

Wide Range

A mathematical model is introduced to solve a mobile workforce management problem. In such a problem, a number of tasks must be executed at different locations by various teams. One example is an electric utility company that has to deal with planned system upgrades and damage caused by storms. The aim is to determine the schedule of the teams in such a way that the overall cost is minimal. The mobile workforce management problem involves scheduling. The following questions should be answered: when to perform a task, how to route vehicles (the vehicle routing problem), in what order the sites should be visited, and by which teams. These problems are already complex in themselves. This paper proposes an integrated mathematical programming model formulation, which, by the assignment of its binary variables, can be easily included in heuristic algorithmic frameworks. In the problem specification, a wide range of parameters can be set. This includes absolute and expected time windows for tasks, packing and unpacking in case of team movement, resource utilization, relations between tasks such as precedence, mutual exclusion or parallel execution, and team-dependent travelling and execution times and costs. To make the model able to solve larger problems, an algorithmic framework is also implemented which can be used to find heuristic solutions in acceptable time. This latter solution method can be used as an alternative. Computational performance is examined through a series of test cases in which the most important factors are scaled.

Parallel finite-element method using domain decomposition and Parareal for transient motor starting analysis

COMPEL The International Journal for Computation and Mathematics in Electrical and Electronic Engineering ◽

10.1108/compel-12-2018-0516 ◽

2019 ◽

Vol 38 (5) ◽

pp. 1507-1520 ◽

Cited By ~ 1

Author(s):

Yasuhito Takahashi ◽

Koji Fujiwara ◽

Takeshi Iwashita ◽

Hiroshi Nakashima

Keyword(s):

Finite Element Method ◽

Finite Element ◽

Parallel Computing ◽

Domain Decomposition ◽

Parallel Computation ◽

Domain Decomposition Method ◽

Space Time ◽

Content Type ◽

Parallel Performance ◽

Element Method

Purpose: This paper aims to propose a parallel-in-space-time finite-element method (FEM) for transient motor starting analyses. Although the domain decomposition method (DDM) is suitable for solving large-scale problems and parallel-in-time (PinT) integration methods such as Parareal and the time domain parallel FEM (TDPFEM) are effective for problems with a large number of time steps, their parallel performance saturates as the number of processes increases. To overcome this difficulty, a hybrid approach in which both the DDM and PinT integration methods are used is investigated in a highly parallel computing environment. Design/methodology/approach: First, the parallel performances of the DDM, Parareal and TDPFEM were compared because the scalability of these methods in highly parallel computation has not been deeply discussed. Then, the combination of the DDM and Parareal was investigated as a parallel-in-space-time FEM. The effectiveness of the developed method was demonstrated in transient starting analyses of induction motors. Findings: The combination of Parareal with the DDM can improve the parallel performance in the case where the parallel performance of the DDM, TDPFEM or Parareal is saturated in highly parallel computation. In the case where the number of unknowns is large and the number of available processes is limited, the use of the DDM is the most effective from the standpoint of computational cost. Originality/value: This paper newly develops the parallel-in-space-time FEM and demonstrates its effectiveness in nonlinear magnetoquasistatic field analyses of electric machines. This finding is important because it clarifies a new direction for parallel computing techniques and great potential for its further development.
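
The Parareal iteration referred to in this abstract can be sketched for a scalar ordinary differential equation in plain Python. This is a textbook-style illustration, not the authors' finite-element code: the explicit Euler propagators, the test equation y' = -y, and all names and parameter values are assumptions.

```python
def euler(f, y, t0, t1, steps):
    """Advance y from t0 to t1 with `steps` explicit Euler steps."""
    dt = (t1 - t0) / steps
    t = t0
    for _ in range(steps):
        y = y + dt * f(t, y)
        t += dt
    return y


def parareal(f, y0, t0, t1, n_slices, n_iter, fine_steps=100):
    """Return the Parareal approximations at the time-slice boundaries."""
    ts = [t0 + (t1 - t0) * n / n_slices for n in range(n_slices + 1)]

    def coarse(y, n):  # cheap propagator: one Euler step per slice
        return euler(f, y, ts[n], ts[n + 1], 1)

    def fine(y, n):  # accurate propagator: many Euler steps per slice
        return euler(f, y, ts[n], ts[n + 1], fine_steps)

    # Initial guess from a serial coarse sweep.
    u = [y0]
    for n in range(n_slices):
        u.append(coarse(u[n], n))

    for _ in range(n_iter):
        # Fine solves depend only on the previous iterate, so they could
        # run concurrently, one time slice per process.
        f_vals = [fine(u[n], n) for n in range(n_slices)]
        g_old = [coarse(u[n], n) for n in range(n_slices)]
        new_u = [y0]
        for n in range(n_slices):
            # Serial correction sweep: coarse prediction plus fine defect.
            new_u.append(coarse(new_u[n], n) + f_vals[n] - g_old[n])
        u = new_u
    return u


def decay(t, y):
    """Right-hand side of the test equation y' = -y."""
    return -y


# Serial fine reference: one accurate sweep over the whole interval.
reference = euler(decay, 1.0, 0.0, 2.0, 10 * 100)
for k in (0, 3, 10):
    approx = parareal(decay, 1.0, 0.0, 2.0, n_slices=10, n_iter=k)[-1]
    print(k, abs(approx - reference))
```

With as many iterations as time slices, the result reproduces the serial fine solution up to rounding, so a speedup is only possible when far fewer iterations suffice. That limit is one reason for combining the method with a spatial decomposition such as the DDM.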

PARALLEL COMPUTING OF NUMERICAL SCHEMES AND BIG DATA ANALYTIC FOR SOLVING REAL LIFE APPLICATIONS

Jurnal Teknologi ◽

10.11113/jt.v78.9552 ◽

2016 ◽

Vol 78 (8-2) ◽

Cited By ~ 2

Author(s):

Norma Alias ◽

Nadia Nofri Yeni Suhari ◽

Hafizah Farhah Saipan Saipol ◽

Abdullah Aysh Dahawi ◽

Masyitah Mohd Saidi ◽

...

Keyword(s):

Big Data ◽

Parallel Computing ◽

Parallel Algorithm ◽

Sparse Matrices ◽

Real Life ◽

Poor Performance ◽

Equation System ◽

Numerical Schemes ◽

Linear Equation System ◽

Data Analytic

This paper proposes several real-life applications of big data analytics using parallel computing software. The parallel computing software considered for simulating the big data problems comprises Parallel Virtual Machine, MATLAB Distributed Computing Server, and Compute Unified Device Architecture. Parallel computing can overcome the poor runtime, speedup, and efficiency of sequential computing. The mathematical models for the big data analytics are based on partial differential equations, whose discretization yields large sparse matrices and a linear system of equations. Iterative numerical schemes are used to solve the problems, and the computational process is summarized as a parallel algorithm. The parallel algorithm development is based on domain decomposition of the problems and on the architectures of the different parallel computing software. The parallel performance on distributed and shared memory architectures is evaluated in terms of speedup, efficiency, effectiveness, and temporal performance.
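
The chain from a partial differential equation to a sparse linear system solved iteratively can be illustrated with a small Python sketch. The abstract does not say which iterative schemes were used, so the Jacobi method and the one-dimensional Poisson problem below are assumptions chosen for brevity, as are all names and parameter values.

```python
def jacobi_poisson_1d(rhs, h, tol=1e-10, max_iter=100000):
    """Solve -u'' = rhs on a uniform grid with u = 0 at both ends.

    The second-order finite-difference discretization gives a tridiagonal
    (sparse) system; Jacobi updates every unknown from the previous iterate.
    Returns (solution, number_of_iterations).
    """
    n = len(rhs)
    u = [0.0] * n
    for it in range(1, max_iter + 1):
        new_u = [0.0] * n
        # Each entry depends only on the old iterate, so the grid can be
        # split into subdomains that are updated in parallel.
        for i in range(n):
            left = u[i - 1] if i > 0 else 0.0
            right = u[i + 1] if i < n - 1 else 0.0
            new_u[i] = 0.5 * (left + right + h * h * rhs[i])
        diff = max(abs(a - b) for a, b in zip(new_u, u))
        u = new_u
        if diff < tol:
            return u, it
    return u, max_iter


# -u'' = 2 on (0, 1) with zero boundary values has the solution u = x(1 - x).
h = 0.1
solution, iterations = jacobi_poisson_1d([2.0] * 9, h)
print(iterations, solution[4])
```

The middle unknown sits at x = 0.5, where the exact solution is 0.25. Because each sweep reads only the previous iterate, the update maps directly onto a domain decomposition with one halo value exchanged per subdomain boundary.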

A Fast and Rigorously Parallel Surface Voxelization Technique for GPU-Accelerated CFD Simulations

Communications in Computational Physics ◽

10.4208/cicp.2014.m414 ◽

2015 ◽

Vol 17 (5) ◽

pp. 1246-1270 ◽

Cited By ~ 8

Author(s):

C. F. Janßen ◽

N. Koliha ◽

T. Rung

Keyword(s):

Graphics Processing Units ◽

Parallel Implementation ◽

Parallel Execution ◽

Time Step ◽

CFD Simulations ◽

Performance Loss ◽

Neighbor Search ◽

Body Shell ◽

Normal Vectors ◽

Grid Nodes

This paper presents a fast surface voxelization technique for the mapping of tessellated triangular surface meshes to uniform and structured grids that provide a basis for CFD simulations with the lattice Boltzmann method (LBM). The core algorithm is optimized for massively parallel execution on graphics processing units (GPUs) and is based on a unique dissection of the inner body shell. This unique definition necessitates a topology-based neighbor search as a preprocessing step, but also enables parallel implementation. More specifically, normal vectors of adjacent triangular tessellations are used to construct half-angles that clearly separate the per-triangle regions. For each triangle, the grid nodes inside the axis-aligned bounding box (AABB) are tested for their distance to the triangle in question and for certain well-defined relative angles. The grid generation procedure is faster than the GPU-accelerated flow field computation per time step, which allows efficient fluid-structure interaction simulations without noticeable performance loss due to the dynamic grid update.
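
The first step of the per-triangle test, collecting the grid nodes inside the axis-aligned bounding box, can be sketched in Python as follows. The sketch assumes a uniform grid with nodes at integer multiples of `spacing`; the function name and the sample triangle are hypothetical, and the subsequent distance and angle tests from the paper are not reproduced.

```python
import math


def aabb_grid_nodes(triangle, spacing):
    """Return (i, j, k) indices of uniform-grid nodes inside the triangle's AABB.

    Grid node (i, j, k) is assumed to sit at (i, j, k) * spacing.
    """
    ranges = []
    for axis in range(3):
        lo = min(vertex[axis] for vertex in triangle)
        hi = max(vertex[axis] for vertex in triangle)
        # First node at or above lo, last node at or below hi.
        ranges.append(range(math.ceil(lo / spacing),
                            math.floor(hi / spacing) + 1))
    return [(i, j, k) for i in ranges[0] for j in ranges[1] for k in ranges[2]]


# Hypothetical triangle lying in the plane z = 0.
triangle = [(0.2, 0.2, 0.0), (2.6, 0.4, 0.0), (0.5, 1.7, 0.0)]
candidates = aabb_grid_nodes(triangle, spacing=1.0)
print(candidates)
```

Each triangle yields its own short candidate list, so triangles can be processed independently, which is consistent with the massively parallel execution the abstract describes.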
