Parallelizing Multiple Flow Accumulation Algorithm using CUDA and OpenACC

2019 ◽  
Vol 8 (9) ◽  
pp. 386 ◽  
Author(s):  
Natalija Stojanovic ◽  
Dragan Stojanovic

Watershed analysis, as a fundamental component of digital terrain analysis, is based on the Digital Elevation Model (DEM), a grid (raster) model of the Earth's surface and topography. Watershed analysis consists of computationally and data-intensive algorithms that need to be implemented by leveraging parallel and high-performance computing methods and techniques. In this paper, the Multiple Flow Direction (MFD) algorithm for watershed analysis is implemented and evaluated on multi-core Central Processing Units (CPUs) and many-core Graphics Processing Units (GPUs), yielding significant improvements in performance and energy usage. The implementation is based on NVIDIA CUDA (Compute Unified Device Architecture) for the GPU, as well as on OpenACC (Open ACCelerators), a directive-based parallel programming model and standard for parallel computing. Both phases of the MFD algorithm, (i) iterative DEM preprocessing and (ii) the iterative MFD computation itself, are parallelized and run on multi-core CPUs and GPUs. The evaluation of the proposed solutions is performed with respect to execution time, energy consumption, and the programming effort required for parallelization, for different sizes of input data. The experimental evaluation has shown not only the advantage of using OpenACC over CUDA for implementing watershed analysis on a GPU in terms of performance, energy consumption, and programming effort, but also significant benefits in implementing it on the multi-core CPU.
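
As a rough illustration of the per-cell kernel such a parallelization involves, here is a minimal CUDA sketch of one MFD flow-distribution step. The grid layout, the slope-proportional weighting, and all names are assumptions made for illustration; this is not the authors' code.

```cuda
// Minimal sketch of one MFD flow-distribution step over a row-major DEM.
// acc_out is assumed pre-initialized with each cell's own contribution.
__global__ void mfd_distribute(const float *dem, const float *acc_in,
                               float *acc_out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x <= 0 || y <= 0 || x >= width - 1 || y >= height - 1) return;

    int idx = y * width + x;
    float z = dem[idx];

    // Slopes to the 8 neighbors; only downslope neighbors receive flow.
    const int dx[8] = {-1, 0, 1, -1, 1, -1, 0, 1};
    const int dy[8] = {-1, -1, -1, 0, 0, 1, 1, 1};
    float slope[8], total = 0.0f;
    for (int k = 0; k < 8; ++k) {
        float dz = z - dem[(y + dy[k]) * width + (x + dx[k])];
        float dist = (dx[k] != 0 && dy[k] != 0) ? 1.41421356f : 1.0f;
        slope[k] = dz > 0.0f ? dz / dist : 0.0f;
        total += slope[k];
    }
    if (total <= 0.0f) return;   // pit or flat cell: handled by DEM preprocessing

    // Distribute this cell's accumulated flow proportionally to slope.
    for (int k = 0; k < 8; ++k) {
        if (slope[k] > 0.0f) {
            int n = (y + dy[k]) * width + (x + dx[k]);
            atomicAdd(&acc_out[n], acc_in[idx] * slope[k] / total);
        }
    }
}
```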

2020 ◽  
Vol 12 (12) ◽  
pp. 1918 ◽  
Author(s):  
Alina L. Machidon ◽  
Octavian M. Machidon ◽  
Cătălin B. Ciobanu ◽  
Petre L. Ogrutan

Remote sensing data has seen explosive growth in the past decade. This has led to the need for efficient dimensionality reduction techniques: mathematical procedures that transform high-dimensional data into a meaningful, reduced representation. Projection Pursuit (PP) based algorithms have been shown to be efficient solutions for performing dimensionality reduction on large datasets by searching for low-dimensional projections of the data in which meaningful structures are exposed. However, PP faces computational difficulties in dealing with very large datasets, which are common in hyperspectral imaging, raising the challenge of implementing such algorithms using the latest high-performance computing approaches. In this paper, a PP-based geometrical approximated Principal Component Analysis algorithm (gaPCA) for hyperspectral image analysis is implemented and assessed on multi-core Central Processing Units (CPUs), on Graphics Processing Units (GPUs), and on multi-core CPUs using Single Instruction, Multiple Data (SIMD) AVX2 (Advanced Vector eXtensions) intrinsics, all of which provide significant improvements in performance and energy usage over the single-core implementation. The paper thus presents a cross-platform and cross-language perspective, with implementations of the gaPCA algorithm in Matlab, Python, and C++, as well as GPU implementations based on NVIDIA Compute Unified Device Architecture (CUDA). The evaluation of the proposed solutions is performed with respect to execution time and energy consumption. The experimental evaluation has shown not only the advantage of using CUDA programming in implementing the gaPCA algorithm on a GPU in terms of performance and energy consumption, but also significant benefits in implementing it on the multi-core CPU using AVX2 intrinsics.
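
To make the AVX2 claim concrete, here is a sketch of how the kind of hot loop such an algorithm contains, a squared Euclidean distance between two n-dimensional points, can be vectorized with AVX2/FMA intrinsics (host-side code; compile with -mavx2 -mfma). This is a generic illustration, not the authors' code.

```cuda
#include <immintrin.h>

// Squared Euclidean distance between two n-dimensional float vectors,
// eight lanes at a time with a fused multiply-add accumulator.
float sq_dist_avx2(const float *a, const float *b, int n)
{
    __m256 acc = _mm256_setzero_ps();
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 d  = _mm256_sub_ps(va, vb);
        acc = _mm256_fmadd_ps(d, d, acc);      // acc += d * d
    }
    // Horizontal sum of the 8 partial sums.
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    float sum = _mm_cvtss_f32(s);
    // Scalar tail for dimensions not divisible by 8.
    for (; i < n; ++i) { float d = a[i] - b[i]; sum += d * d; }
    return sum;
}
```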


2021 ◽  
Vol 119 ◽  
pp. 07002
Author(s):  
Youness Rtal ◽  
Abdelkader Hadjoudja

Graphics Processing Units (GPUs) are microprocessors on graphics cards that are dedicated to displaying and manipulating graphics data; today they are found on all modern graphics cards. In just a few years, these microprocessors have become potent tools for massively parallel computing. They serve in many fields, such as image processing, video and audio encoding and decoding, and the solution of physical systems with one or more unknowns. Their advantages are faster processing and lower energy consumption than a central processing unit (CPU). In this paper, we define and implement the Lagrange polynomial interpolation method on GPU and CPU to calculate the sodium density at different temperatures Ti, using the NVIDIA CUDA C parallel programming model, which can increase computational performance by harnessing the power of the GPU. The objective of this study is to compare the performance of the Lagrange interpolation method implemented on CPU and GPU processors and to deduce the efficiency of using GPUs for parallel computing.
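
Lagrange interpolation evaluates P(x) = Σ_k y_k Π_{j≠k} (x − x_j)/(x_k − x_j), and every query point can be evaluated independently, which maps naturally onto one GPU thread per query. A minimal CUDA sketch follows; the array names and the density-versus-temperature framing follow the abstract's description, not the paper's actual code.

```cuda
// Each thread evaluates the Lagrange interpolating polynomial built from
// n samples (T[k], rho[k]) at one query temperature Tq[i].
__global__ void lagrange_interp(const double *T, const double *rho, int n,
                                const double *Tq, double *rho_q, int m)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= m) return;

    double x = Tq[i], sum = 0.0;
    for (int k = 0; k < n; ++k) {
        // Basis polynomial L_k(x) = prod_{j != k} (x - T_j) / (T_k - T_j)
        double Lk = 1.0;
        for (int j = 0; j < n; ++j)
            if (j != k) Lk *= (x - T[j]) / (T[k] - T[j]);
        sum += rho[k] * Lk;
    }
    rho_q[i] = sum;
}
```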


2018 ◽  
Vol 11 (11) ◽  
pp. 4621-4635 ◽  
Author(s):  
Istvan Z. Reguly ◽  
Daniel Giles ◽  
Devaraj Gopinathan ◽  
Laure Quivy ◽  
Joakim H. Beck ◽  
...  

Abstract. In this paper, we present the VOLNA-OP2 tsunami model and implementation: a finite-volume non-linear shallow-water equation (NSWE) solver built on the OP2 domain-specific language (DSL) for unstructured mesh computations. VOLNA-OP2 is unique among tsunami solvers in its support for several high-performance computing platforms: central processing units (CPUs), the Intel Xeon Phi, and graphics processing units (GPUs). This is achieved by keeping the scientific code separate from the various parallel implementations, enabling easy maintainability. The code has already been used in production for several years; here we discuss how it can be integrated into various workflows, such as a statistical emulator. The scalability of the code is demonstrated on three supercomputers, built with classical Xeon CPUs, the Intel Xeon Phi, and NVIDIA P100 GPUs. VOLNA-OP2 demonstrates an ability to deliver productivity as well as performance and portability to its users across a number of platforms.
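
The DSL approach works by having the scientist write plain scalar kernels and express mesh traversals through OP2's parallel-loop API, from which OP2's source-to-source generators produce CUDA, OpenMP, and MPI implementations. The schematic below follows the pattern of OP2's public examples; the dataset names are hypothetical and the exact arguments may differ from those used in VOLNA-OP2.

```cuda
// The numerical kernel is plain scalar C code, free of parallel details.
void update_cell(double *h, const double *flux) {
    h[0] += flux[0];   // apply the accumulated flux to the water height
}

// The loop over the unstructured mesh is expressed once through OP2;
// the framework decides how to run it on each target platform.
op_par_loop(update_cell, "update_cell", cells,
            op_arg_dat(d_height, -1, OP_ID, 1, "double", OP_RW),
            op_arg_dat(d_flux,   -1, OP_ID, 1, "double", OP_READ));
```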


Author(s):  
Ana Moreton–Fernandez ◽  
Hector Ortega–Arranz ◽  
Arturo Gonzalez–Escribano

Nowadays the use of hardware accelerators, such as graphics processing units or Xeon Phi coprocessors, is key to solving computationally costly problems that require high-performance computing. However, programming efficient deployments for these kinds of devices is a very complex task that relies on the manual management of memory transfers and configuration parameters. The programmer has to carry out a deep study of the particular data that needs to be computed at each moment, across different computing platforms, while also considering architectural details. We introduce the controller concept as an abstract entity that allows the programmer to easily manage communications and kernel-launching details on hardware accelerators in a transparent way. This model also provides the possibility of defining and launching central processing unit kernels on multi-core processors with the same abstraction and methodology used for the accelerators. It internally combines different native programming models and technologies to exploit the potential of each kind of device. Additionally, the model allows the programmer to simplify the proper selection of values for the several configuration parameters that can be chosen when a kernel is launched; this is done through a qualitative characterization of the kernel code to be executed. Finally, we present the implementation of the controller model in a prototype library, together with its application in several case studies. Its use has led to reductions in development and porting costs, with significantly low overheads in execution time when compared to manually programmed and optimized solutions that directly use CUDA and OpenMP.
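
The following self-contained toy conveys the controller idea: one entry point that hides transfers and launch geometry, dispatching to either an OpenMP CPU path or a CUDA GPU path. It is entirely hypothetical; the names and API are invented for this sketch and do not reflect the paper's prototype library. (Compile with nvcc -Xcompiler -fopenmp.)

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale_gpu(float *a, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) a[i] *= s;
}

struct Controller {
    enum Device { CPU, GPU } dev;
    explicit Controller(Device d) : dev(d) {}

    // One entry point; transfers and launch geometry are hidden inside.
    void scale(float *host_a, float s, int n) {
        if (dev == CPU) {
            #pragma omp parallel for      // multi-core CPU "kernel"
            for (int i = 0; i < n; ++i) host_a[i] *= s;
        } else {
            float *d_a;
            cudaMalloc(&d_a, n * sizeof(float));
            cudaMemcpy(d_a, host_a, n * sizeof(float), cudaMemcpyHostToDevice);
            scale_gpu<<<(n + 255) / 256, 256>>>(d_a, s, n);
            cudaMemcpy(host_a, d_a, n * sizeof(float), cudaMemcpyDeviceToHost);
            cudaFree(d_a);
        }
    }
};

int main() {
    float a[4] = {1, 2, 3, 4};
    Controller(Controller::GPU).scale(a, 2.0f, 4);
    printf("%f\n", a[3]);                 // expect 8.0
    return 0;
}
```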


Author(s):  
Mainak Adhikari ◽  
Sukhendu Kar

A graphics processing unit (GPU) typically handles computation only for computer graphics. However, any GPU providing a functionally complete set of operations on arbitrary bits can compute any computable value. Additionally, the use of multiple graphics cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing. CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA and implemented on its graphics processing units (GPUs). CUDA gives program developers direct access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs. This chapter first discusses some features and challenges of GPU programming and the effort to address some of those challenges when building and running GPU programs in a high-performance computing (HPC) environment. Finally, the chapter points out the importance and standards of the CUDA architecture.
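
The programming model described above is captured by the canonical vector-addition example: the host allocates device memory, launches a kernel over a grid of threads, and copies results back. This minimal complete program uses only standard CUDA runtime API calls.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void vec_add(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *a = (float *)malloc(bytes), *b = (float *)malloc(bytes),
          *c = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { a[i] = (float)i; b[i] = 2.0f * i; }

    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);
    cudaMemcpy(da, a, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, b, bytes, cudaMemcpyHostToDevice);

    vec_add<<<(n + 255) / 256, 256>>>(da, db, dc, n);  // one thread per element
    cudaMemcpy(c, dc, bytes, cudaMemcpyDeviceToHost);
    printf("c[100] = %f\n", c[100]);                   // expect 300.0

    cudaFree(da); cudaFree(db); cudaFree(dc);
    free(a); free(b); free(c);
    return 0;
}
```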


2015 ◽  
Vol 8 (9) ◽  
pp. 2815-2827 ◽  
Author(s):  
S. Xu ◽  
X. Huang ◽  
L.-Y. Oey ◽  
F. Xu ◽  
H. Fu ◽  
...  

Abstract. Graphics processing units (GPUs) are an attractive solution in many scientific applications due to their high performance. However, most existing GPU conversions of climate models use GPUs for only a few computationally intensive regions. In the present study, we redesign the mpiPOM (a parallel version of the Princeton Ocean Model) with GPUs. Specifically, we first convert the model from its original Fortran form to a new Compute Unified Device Architecture C (CUDA-C) code, then we optimize the code on each of the GPUs, the communications between the GPUs, and the I/O between the GPUs and the central processing units (CPUs). We show that the performance of the new model on a workstation containing four GPUs is comparable to that on a powerful cluster with 408 standard CPU cores, and that it reduces energy consumption by a factor of 6.8.
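
A standard pattern behind inter-GPU communication optimizations of this kind is to overlap interior computation with the MPI halo exchange using CUDA streams and pinned staging buffers. The following is a generic, self-contained sketch of that pattern with a toy one-dimensional stencil; it is not the mpiPOM-GPU code. (Compile with nvcc through an MPI wrapper, e.g. OMPI_CXX=nvcc.)

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

// Toy stencil restricted to cells not involved in the halo exchange.
__global__ void step_interior(float *f, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 1 && i < n - 2) f[i] = 0.5f * (f[i - 1] + f[i + 1]);
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    int up = (rank + 1) % size, down = (rank + size - 1) % size;

    const int n = 1 << 20;
    float *d_f, *h_halo;
    cudaMalloc(&d_f, n * sizeof(float));
    cudaMemset(d_f, 0, n * sizeof(float));
    cudaMallocHost(&h_halo, 2 * sizeof(float));   // pinned staging buffer

    cudaStream_t s_comp, s_comm;
    cudaStreamCreate(&s_comp);
    cudaStreamCreate(&s_comm);

    // The interior update runs while the boundary value is exchanged.
    step_interior<<<(n + 255) / 256, 256, 0, s_comp>>>(d_f, n);
    cudaMemcpyAsync(&h_halo[0], &d_f[n - 2], sizeof(float),
                    cudaMemcpyDeviceToHost, s_comm);
    cudaStreamSynchronize(s_comm);
    MPI_Sendrecv(&h_halo[0], 1, MPI_FLOAT, up,   0,
                 &h_halo[1], 1, MPI_FLOAT, down, 0,
                 MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    cudaMemcpyAsync(&d_f[0], &h_halo[1], sizeof(float),
                    cudaMemcpyHostToDevice, s_comm);
    cudaDeviceSynchronize();
    // A boundary-update kernel using the fresh halo would follow here.

    cudaFreeHost(h_halo); cudaFree(d_f);
    cudaStreamDestroy(s_comp); cudaStreamDestroy(s_comm);
    MPI_Finalize();
    return 0;
}
```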


2020 ◽  
Vol 22 (5) ◽  
pp. 1217-1235 ◽  
Author(s):  
M. Morales-Hernández ◽  
M. B. Sharif ◽  
S. Gangrade ◽  
T. T. Dullo ◽  
S.-C. Kao ◽  
...  

Abstract. This work presents a vision of future water resources hydrodynamics codes that can fully utilize the strengths of modern high-performance computing (HPC). Advances in computing power, formerly driven by the improvement of central processing unit processors, now focus on parallel computing and, in particular, the use of graphics processing units (GPUs). However, this shift to a parallel framework requires refactoring the code to make efficient use of the data, and even changing the nature of the algorithm that solves the system of equations. These concepts, along with other features such as the precision of the computations, the management of dry regions, and input/output data handling, are analyzed in this paper. A 2D multi-GPU flood code applied to a large-scale test case is used to corroborate our statements and to ascertain the new challenges for the next generation of parallel water resources codes.
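
One common device for dry-region management on a GPU is a wet/dry mask that lets threads covering dry cells exit immediately, so only wet cells pay for the full shallow-water update. The sketch below is a generic illustration of that idea, not the flood code used in the paper.

```cuda
// Threads covering dry cells return before touching any state, so warps
// made entirely of dry cells retire almost for free.
__global__ void update_wet_cells(const float *h, const int *wet_mask,
                                 float *h_new, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n || !wet_mask[i]) return;   // skip dry cells entirely
    // ... the full shallow-water update for wet cell i would go here ...
    h_new[i] = h[i];                      // placeholder update
}
```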


2018 ◽  
Vol 28 (04) ◽  
pp. 1850016 ◽  
Author(s):  
Christian Schmitt ◽  
Moritz Schmid ◽  
Sebastian Kuckuk ◽  
Harald Köstler ◽  
Jürgen Teich ◽  
...  

Field programmable gate arrays (FPGAs) are a soaringly popular accelerator technology, and not only in the field of high-performance computing (HPC). However, they use a completely different programming paradigm and tool set than central processing units (CPUs) or even graphics processing units (GPUs), adding extra development steps and requiring special knowledge, which hinders widespread use in scientific computing. To bridge this programmability gap, domain-specific languages (DSLs) are a popular choice for generating low-level implementations from an abstract algorithm description. In this work, we demonstrate our approach for generating numerical solver implementations based on the multigrid method for FPGAs from the same code base that is also used to generate code for CPUs using a hybrid parallelization of MPI and OpenMP. Our approach yields a hardware design that can compute up to 11 V-cycles per second for an input grid size of 4096×4096, solving on the coarsest grid with the conjugate gradient (CG) method, on a mid-range FPGA, beating vectorized, multi-threaded execution on an Intel Xeon processor.
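
The CPU path of such a generator emits stencil sweeps of the kind shown below: a weighted-Jacobi smoother for a 2D Poisson problem parallelized with OpenMP. This is a generic illustration of such a smoother under an assumed 5-point discretization, not the output of the authors' generator.

```cuda
#include <omp.h>

// One weighted-Jacobi sweep on an n-by-n grid: u_new = (1-omega)*u + omega*GS(u),
// where GS is the 5-point Gauss-Seidel-style average with right-hand side f.
void jacobi_sweep(const double *u, double *u_new, const double *f,
                  int n, double h2, double omega)
{
    #pragma omp parallel for collapse(2)
    for (int i = 1; i < n - 1; ++i)
        for (int j = 1; j < n - 1; ++j) {
            double avg = 0.25 * (u[(i - 1) * n + j] + u[(i + 1) * n + j] +
                                 u[i * n + j - 1] + u[i * n + j + 1] +
                                 h2 * f[i * n + j]);
            u_new[i * n + j] = (1.0 - omega) * u[i * n + j] + omega * avg;
        }
}
```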


2017 ◽  
Vol 14 (1) ◽  
pp. 789-795
Author(s):  
V Saveetha ◽  
S Sophia

Parallel data clustering aims to use algorithms and methods to extract knowledge from massive databases in reasonable time using high-performance architectures. The computational challenge that cluster analysis faces due to the increasing volume of data can be overcome by exploiting the power of these architectures. Recent developments in the parallel power of the Graphics Processing Unit enable low-cost, high-performance solutions for general-purpose applications. The Compute Unified Device Architecture programming model provides application programming interface methods to handle data proficiently on the Graphics Processing Unit for iterative clustering algorithms like K-Means. Existing Graphics Processing Unit based K-Means algorithms focus heavily on improving the speedup of the algorithm and fall short of handling the high time spent on transferring data between the Central Processing Unit and the Graphics Processing Unit. An efficient K-Means algorithm is proposed in this paper to lessen the transfer time by introducing a novel approach to checking the convergence of the algorithm and utilizing pinned memory for direct access. This algorithm outperforms the other algorithms by maximizing parallelism and exploiting the memory features. The relative speedups and the validity measure for the proposed algorithm are higher when compared with K-Means on the Graphics Processing Unit and K-Means using a flag on the Graphics Processing Unit. Thus the proposed approach shows that communication overhead can be reduced in K-Means clustering.
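
One way to realize a convergence check in pinned memory, consistent with (though not necessarily identical to) the approach described above, is to place the flag in page-locked host memory mapped into the GPU's address space, so the assignment kernel can set it without a separate cudaMemcpy each iteration. The following CUDA sketch is a generic illustration of that technique.

```cuda
#include <cuda_runtime.h>

// Assign each point to its nearest centroid; raise the mapped host flag
// whenever any label changes, signalling that another iteration is needed.
__global__ void assign_clusters(const float *pts, const float *centroids,
                                int *labels, int n, int k, int dim,
                                int *changed /* mapped host flag */)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    int best = 0; float best_d = 1e30f;
    for (int c = 0; c < k; ++c) {
        float d = 0.0f;
        for (int j = 0; j < dim; ++j) {
            float diff = pts[i * dim + j] - centroids[c * dim + j];
            d += diff * diff;
        }
        if (d < best_d) { best_d = d; best = c; }
    }
    if (labels[i] != best) { labels[i] = best; *changed = 1; }
}

// Host side: allocate the flag once, reset it per iteration, and read it
// directly after the kernel, with no explicit device-to-host copy:
//   int *h_flag, *d_flag;
//   cudaHostAlloc(&h_flag, sizeof(int), cudaHostAllocMapped);
//   cudaHostGetDevicePointer(&d_flag, h_flag, 0);
//   *h_flag = 0;
//   assign_clusters<<<blocks, threads>>>(pts, cent, labels, n, k, dim, d_flag);
//   cudaDeviceSynchronize();
//   if (*h_flag == 0) { /* converged */ }
```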


2020 ◽  
Vol 143 (1) ◽  
Author(s):  
Anil Yuksel ◽  
Vic Mahaney ◽  
Chris Marroquin ◽  
Shurong Tian ◽  
Mark Hoffmeyer ◽  
...  

Abstract. A new era of computing has begun with the development of high-performance computing (HPC), artificial intelligence (AI), machine learning (ML), and cognitive systems. Dramatic increases in the power density of electronic components have driven the design and architecture of efficient thermal management technologies for these systems. In 2018, IBM designed and delivered the world's most powerful and fastest supercomputers, known as Summit and Sierra, with 200 petaflops of peak computing performance on LINPACK benchmarks. These systems, built on the IBM POWER AC922 server, are both air- and liquid-cooled; in the liquid-cooled systems, water is employed to cool the high-power electronic components, including the IBM POWER9 processors and NVIDIA graphics processing units (GPUs). In this paper, we give an overview of the thermal and mechanical design strategies applied to these systems. Testing and experimental analysis, with comparison to computational modeling, are provided. Thermal control strategies are investigated for the optimization of overall system efficiency. For the air-cooled systems, we discuss the fan and heat sink designs, as well as the preheating effect on the PCIe section. For the liquid-cooled systems, which have a unique cold plate design cooling the processors and the GPUs with water, we examine the water flow path design for the central processing units (CPUs) and the GPUs, and the thermal performance of the cold plate. An overview of cooling assemblies such as thermal interface materials (TIMs) and air baffles in these systems is given. Unit and rack manifolds and the rear door heat exchanger (RDHx) are investigated. Water flow and pressure distributions at the node and rack levels are provided.

