A lightweight approach to performance portability with targetDP

The International Journal of High Performance Computing Applications ◽

10.1177/1094342016682071 ◽

2016 ◽

Vol 32 (2) ◽

pp. 288-301

Author(s):

Alan Gray ◽

Kevin Stratford

Keyword(s):

Particle Physics ◽

Message Passing ◽

Graphics Processing Units ◽

High Performance ◽

Large Scale ◽

Message Passing Interface ◽

Graphics Processing Unit ◽

Processing Unit ◽

Performance Portability ◽

Graphics Processing

Leading high performance computing systems achieve their status through use of highly parallel devices such as NVIDIA graphics processing units or Intel Xeon Phi many-core CPUs. The concept of performance portability across such architectures, as well as traditional CPUs, is vital for the application programmer. In this paper we describe targetDP, a lightweight abstraction layer which allows grid-based applications to target data parallel hardware in a platform agnostic manner. We demonstrate the effectiveness of our pragmatic approach by presenting performance results for a complex fluid application (with which the model was co-designed), plus a separate lattice quantum chromodynamics particle physics code. For each application, a single source code base is seen to achieve portable performance, as assessed within the context of the Roofline model. TargetDP can be combined with the Message Passing Interface (MPI) to allow use on systems containing multiple nodes: we demonstrate this through provision of scaling results on traditional and graphics processing unit-accelerated large scale supercomputers.
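The Roofline model mentioned in this abstract bounds attainable performance by the smaller of the peak compute rate and the product of memory bandwidth and arithmetic intensity. The sketch below illustrates that bound only; the function names and the hardware figures in the usage lines are hypothetical and are not taken from the paper.

```python
def roofline_attainable(peak_gflops, bandwidth_gbs, intensity_flops_per_byte):
    """Upper bound on performance (GFLOP/s) under the Roofline model.

    The bound is the lower of the compute ceiling and the memory ceiling,
    where the memory ceiling is bandwidth times arithmetic intensity.
    """
    memory_ceiling = bandwidth_gbs * intensity_flops_per_byte
    return min(peak_gflops, memory_ceiling)


def fraction_of_roofline(measured_gflops, peak_gflops, bandwidth_gbs,
                         intensity_flops_per_byte):
    """Measured performance as a fraction of the Roofline bound."""
    bound = roofline_attainable(peak_gflops, bandwidth_gbs,
                                intensity_flops_per_byte)
    return measured_gflops / bound


# Hypothetical device: 1000 GFLOP/s peak, 200 GB/s memory bandwidth.
# A kernel with 0.5 FLOP/byte is memory bound on such a device.
memory_bound = roofline_attainable(1000.0, 200.0, 0.5)
# A kernel with 10 FLOP/byte hits the compute ceiling instead.
compute_bound = roofline_attainable(1000.0, 200.0, 10.0)
```

Comparing a measured rate against this bound on each architecture is one way to judge whether a single code base performs comparably well across devices with very different peak figures.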

GPU Computation and Platforms

Advances in Systems Analysis, Software Engineering, and High Performance Computing - Emerging Research Surrounding Power Consumption and Performance Issues in Utility Computing ◽

10.4018/978-1-4666-8853-7.ch007 ◽

2016 ◽

pp. 136-174

Author(s):

K. Bhargavi ◽

Sathish Babu B.

Keyword(s):

Message Passing ◽

High Performance ◽

Message Passing Interface ◽

Gpu Computing ◽

Graphics Processing Unit ◽

General Purpose ◽

Processing Unit ◽

Computing Platforms ◽

Computationally Intensive ◽

Graphics Processing

Graphics processing units (GPUs) have mainly been used to speed up computation-intensive high performance computing applications. Several tools and technologies are available for running general-purpose, computationally intensive applications. This chapter primarily discusses GPU parallelism, applications, and probable challenges, and highlights some of the GPU computing platforms, including CUDA, OpenCL (Open Computing Language), OpenMPC (OpenMP extended for CUDA), MPI (Message Passing Interface), OpenACC (Open Accelerator), DirectCompute, and C++ AMP (C++ Accelerated Massive Parallelism). Each of these platforms is discussed briefly, along with its advantages and disadvantages.

HPC simulations of brownout: A noninteracting particles dynamic model

The International Journal of High Performance Computing Applications ◽

10.1177/1094342020905971 ◽

2020 ◽

Vol 34 (3) ◽

pp. 267-281

Author(s):

Roberto Porcù ◽

Edie Miglio ◽

Nicola Parolini ◽

Mattia Penati ◽

Noemi Vergopolan

Keyword(s):

Message Passing ◽

High Performance ◽

Message Passing Interface ◽

Graphics Processing Unit ◽

Time Integration ◽

Computational Cost ◽

Aircraft Design ◽

Euler Method ◽

Processing Unit ◽

Integration Algorithm

Helicopters can experience brownout when flying close to a dusty surface. The uplifting of dust in the air can remarkably restrict the pilot’s visibility area. Consequently, a brownout can disorient the pilot and cause the helicopter to collide with the ground. Given its risks, brownout has become a high-priority problem for civil and military operations. Proper helicopter design is thus critical, as it has a strong influence over the shape and density of the cloud of dust that forms when brownout occurs. A way forward to improve aircraft design against brownout is the use of particle simulations. For simulations to be accurate and comparable to the real phenomenon, billions of particles are required. However, with such a large number of particles, serial simulations can be too slow and computationally expensive to perform. In this work, we investigate a message passing interface (MPI) + multi-graphics processing unit (multi-GPU) approach to simulate brownout. Specifically, we use a semi-implicit Euler method to consider the particle dynamics in a Lagrangian way, and we adopt a precomputed aerodynamic field. Here, we do not include particle–particle collisions in the model; this allows for independent trajectories and effective model parallelization. To support our methodology, we provide a speedup analysis of the parallelization relative to the serial and pure-MPI simulations. The results show (i) very high speedups of the MPI + multi-GPU implementation with respect to the serial and pure-MPI ones, (ii) excellent weak and strong scalability properties of the implemented time-integration algorithm, and (iii) the possibility to run realistic simulations of brownout with billions of particles at a relatively small computational cost. This work paves the way toward more realistic brownout simulations, and it highlights the potential of high-performance computing for aiding and advancing aircraft design for brownout mitigation.
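As a rough illustration of why non-interacting particles parallelize so readily, the sketch below advances each particle independently with a semi-implicit Euler step. The abstract does not state which terms the authors treat implicitly, so this uses one common variant as an assumption: the velocity is updated first, and the position is then updated with the new velocity. The force model (`drag_plus_gravity`, with a uniform air velocity, relaxation time `tau`, and gravity `g`) is hypothetical and merely stands in for the precomputed aerodynamic field.

```python
def semi_implicit_euler_step(pos, vel, accel, dt):
    """One semi-implicit Euler step for a single particle.

    The velocity is advanced with the acceleration at the current state,
    then the position is advanced with the *updated* velocity.
    """
    a = accel(pos, vel)
    new_vel = tuple(v + dt * ai for v, ai in zip(vel, a))
    new_pos = tuple(x + dt * nv for x, nv in zip(pos, new_vel))
    return new_pos, new_vel


def advance_particles(particles, accel, dt, steps):
    """Advance a list of (pos, vel) pairs; no particle reads another's state."""
    for _ in range(steps):
        particles = [semi_implicit_euler_step(p, v, accel, dt)
                     for p, v in particles]
    return particles


def drag_plus_gravity(pos, vel, air_velocity=(1.0, 0.0, 0.0), tau=0.5, g=9.81):
    """Hypothetical force model: linear drag toward a uniform air velocity.

    A real simulation would look up the air velocity at `pos` in the
    precomputed aerodynamic field instead of using a constant.
    """
    ax, ay, az = ((u - v) / tau for u, v in zip(air_velocity, vel))
    return (ax, ay, az - g)


# Two particles starting at rest at different heights.
cloud = [((0.0, 0.0, 1.0), (0.0, 0.0, 0.0)),
         ((0.0, 0.0, 2.0), (0.0, 0.0, 0.0))]
cloud = advance_particles(cloud, drag_plus_gravity, dt=0.01, steps=10)
```

Because each update touches only one particle's own state, the list comprehension in `advance_particles` could be split across MPI ranks or GPU threads without any communication between particles, which is the property the abstract relies on.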

Splotch

The International Journal of High Performance Computing Applications ◽

10.1177/1094342016652713 ◽

2016 ◽

Vol 31 (6) ◽

pp. 550-563

Author(s):

Timothy Dykes ◽

Claudio Gheller ◽

Marzia Rivi ◽

Mel Krokos

Keyword(s):

High Performance ◽

Large Scale ◽

Graphics Processing Unit ◽

Processing Unit ◽

Xeon Phi ◽

The Many ◽

Many Core ◽

Performance Results ◽

Graphics Processing ◽

Performance Computing

With the increasing size and complexity of data produced by large-scale numerical simulations, it is of primary importance for scientists to be able to exploit all available hardware in heterogeneous high-performance computing environments for increased throughput and efficiency. We focus on the porting and optimization of Splotch, a scalable visualization algorithm, to utilize the Xeon Phi, Intel’s coprocessor based upon the new many integrated core architecture. We discuss steps taken to offload data to the coprocessor and algorithmic modifications to aid faster processing on the many-core architecture and make use of the uniquely wide vector capabilities of the device, with accompanying performance results using multiple Xeon Phi coprocessors. Finally, we compare performance against results achieved with the Graphics Processing Unit (GPU) based implementation of Splotch.

Advanced Topics GPU Programming and CUDA Architecture

Advances in Systems Analysis, Software Engineering, and High Performance Computing - Emerging Research Surrounding Power Consumption and Performance Issues in Utility Computing ◽

10.4018/978-1-4666-8853-7.ch008 ◽

2016 ◽

pp. 175-203

Author(s):

Mainak Adhikari ◽

Sukhendu Kar

Keyword(s):

Graphics Processing Units ◽

High Performance ◽

Programming Model ◽

Graphics Processing Unit ◽

Direct Access ◽

Gpu Programming ◽

Processing Unit ◽

Computing Platform ◽

Cuda Architecture ◽

Graphics Processing

A graphics processing unit (GPU) typically handles computation only for computer graphics. Any GPU providing a functionally complete set of operations performed on arbitrary bits can compute any computable value. Additionally, the use of multiple graphics cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing. CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA and implemented by its graphics processing units (GPUs). CUDA gives program developers direct access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs. This chapter first discusses some features and challenges of GPU programming and the efforts to address some of the challenges of building and running GPU programs in high performance computing (HPC) environments. Finally, this chapter points out the importance and standards of the CUDA architecture.

Data Streaming Processing Window Joined With Graphics Processing Units (GPUs)

Encyclopedia of Information Science and Technology, Fifth Edition - Advances in Information Quality and Management ◽

10.4018/978-1-7998-3479-3.ch043 ◽

2021 ◽

pp. 602-623

Author(s):

Shen Lu ◽

Richard S. Segall

Keyword(s):

Big Data ◽

Data Streams ◽

Graphics Processing Units ◽

Data Stream ◽

Large Scale ◽

Graphics Processing Unit ◽

Processing Unit ◽

Data Streaming ◽

Large Scale Data ◽

Graphics Processing

Big data is large-scale data and can be either discrete or continuous. This article entails research that discusses the continuous case of big data, often called “data streaming.” More and more businesses will depend on being able to process and make decisions on streams of data. This article utilizes the algorithmic side of data stream processing, often called “stream analytics” or “stream mining.” Data streaming Window Join can be improved by using a graphics processing unit (GPU) for higher performance computing. Data streams are generated by two independent threads: one thread can be used to generate Data Stream A, and the other thread can be used to generate Data Stream B. One would use a Window Join thread to merge the two data streams, which is also the process of “Data Stream Window Join.” The Window Join process can be implemented in parallel, which can efficiently improve the computing speed. Experiments are provided for Data Stream Window Joins using both static and dynamic data.
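The Window Join idea can be sketched in a few lines: each arriving tuple is matched against the tuples from the other stream that are still inside a time window. The version below is a single-threaded, CPU-only illustration under stated assumptions; the event layout `(timestamp, stream, key, value)`, the function name `window_join`, and the time-based eviction rule are all hypothetical and are not taken from the article, which generates the streams in separate threads and targets GPUs.

```python
from collections import deque


def window_join(events, window):
    """Join streams 'A' and 'B' on equal keys within a sliding time window.

    `events` is an iterable of (timestamp, stream, key, value) tuples,
    sorted by timestamp, where `stream` is either "A" or "B".
    Returns a list of (key, (a_value, b_value)) matches.
    """
    windows = {"A": deque(), "B": deque()}
    results = []
    for ts, stream, key, value in events:
        other = "B" if stream == "A" else "A"
        # Evict tuples that have fallen out of the window on both sides.
        for buffered in windows.values():
            while buffered and buffered[0][0] < ts - window:
                buffered.popleft()
        # Probe the other stream's window for matching keys.
        for _, other_key, other_value in windows[other]:
            if other_key == key:
                pair = ((value, other_value) if stream == "A"
                        else (other_value, value))
                results.append((key, pair))
        # Keep the new tuple so later arrivals on the other stream can match it.
        windows[stream].append((ts, key, value))
    return results


events = [
    (1, "A", "k", "a1"),
    (2, "B", "k", "b1"),   # within 5 time units of a1, so they join
    (10, "B", "k", "b2"),  # a1 has expired by now
    (11, "A", "k", "a2"),  # joins with b2 only
]
matches = window_join(events, window=5)
```

The probe loop over the other stream's window is the part that does independent comparisons per buffered tuple, which is why this step is a natural candidate for parallel execution.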

High-performance computing in water resources hydrodynamics

Journal of Hydroinformatics ◽

10.2166/hydro.2020.163 ◽

2020 ◽

Vol 22 (5) ◽

pp. 1217-1235 ◽

Cited By ~ 3

Author(s):

M. Morales-Hernández ◽

M. B. Sharif ◽

S. Gangrade ◽

T. T. Dullo ◽

S.-C. Kao ◽

...

Keyword(s):

Water Resources ◽

High Performance Computing ◽

Graphics Processing Units ◽

High Performance ◽

Large Scale ◽

Test Case ◽

Processing Unit ◽

Central Processing ◽

Graphics Processing ◽

Performance Computing

This work presents a vision of future water resources hydrodynamics codes that can fully utilize the strengths of modern high-performance computing (HPC). The advances to computing power, formerly driven by the improvement of central processing unit processors, now focus on parallel computing and, in particular, the use of graphics processing units (GPUs). However, this shift to a parallel framework requires refactoring the code to make efficient use of the data as well as changing even the nature of the algorithm that solves the system of equations. These concepts along with other features such as the precision for the computations, dry regions management, and input/output data are analyzed in this paper. A 2D multi-GPU flood code applied to a large-scale test case is used to corroborate our statements and ascertain the new challenges for the next-generation parallel water resources codes.

GPU-Accelerated Parallel FDTD on Distributed Heterogeneous Platform

International Journal of Antennas and Propagation ◽

10.1155/2014/321081 ◽

2014 ◽

Vol 2014 ◽

pp. 1-8 ◽

Cited By ~ 2

Author(s):

Ronglin Jiang ◽

Shugang Jiang ◽

Yu Zhang ◽

Ying Xu ◽

Lei Xu ◽

...

Keyword(s):

Message Passing ◽

Message Passing Interface ◽

Graphics Processing Unit ◽

Processing Unit ◽

Problem Size ◽

Central Processing ◽

Execution Speed ◽

Speedup Ratio ◽

Electromagnetic Calculations ◽

Graphics Processing

This paper introduces a finite difference time domain (FDTD) code written in Fortran and CUDA for realistic electromagnetic calculations with parallelization methods of Message Passing Interface (MPI) and Open Multiprocessing (OpenMP). Since both Central Processing Unit (CPU) and Graphics Processing Unit (GPU) resources are utilized, a faster execution speed can be reached compared to a traditional pure GPU code. In our experiments, 64 NVIDIA TESLA K20m GPUs and 64 INTEL XEON E5-2670 CPUs are used to carry out the pure CPU, pure GPU, and CPU + GPU tests. Relative to the pure CPU calculations for the same problems, the speedup ratio achieved by CPU + GPU calculations is around 14. Compared to the pure GPU calculations for the same problems, the CPU + GPU calculations have 7.6%–13.2% performance improvement. Because of the small memory size of GPUs, the FDTD problem size is usually very small. However, this code can enlarge the maximum problem size by 25% without reducing the performance of traditional pure GPU code. Finally, using this code, a microstrip antenna array with 16 × 18 elements is calculated and the radiation patterns are compared with those obtained with MoM. Results show that there is good agreement between them.

A GPU-Based Wide-Band Radio Spectrometer

Publications of the Astronomical Society of Australia ◽

10.1017/pasa.2014.43 ◽

2014 ◽

Vol 31 ◽

Cited By ~ 11

Author(s):

Jayanth Chennamangalam ◽

Simon Scott ◽

Glenn Jones ◽

Hong Chen ◽

John Ford ◽

...

Keyword(s):

Graphics Processing Units ◽

High Performance ◽

Graphics Processing Unit ◽

Wide Band ◽

Processing Unit ◽

Astronomical Instrumentation ◽

Online Data ◽

Polyphase Filter ◽

Polyphase Filter Bank ◽

Graphics Processing

The graphics processing unit has become an integral part of astronomical instrumentation, enabling high-performance online data reduction and accelerated online signal processing. In this paper, we describe a wide-band reconfigurable spectrometer built using an off-the-shelf graphics processing unit card. This spectrometer, when configured as a polyphase filter bank, supports a dual-polarisation bandwidth of up to 1.1 GHz (or a single-polarisation bandwidth of up to 2.2 GHz) on the latest generation of graphics processing units. On the other hand, when configured as a direct fast Fourier transform, the spectrometer supports a dual-polarisation bandwidth of up to 1.4 GHz (or a single-polarisation bandwidth of up to 2.8 GHz).

A Coupled Hydrologic–Hydraulic Model (XAJ–HiPIMS) for Flood Simulation

Water ◽

10.3390/w12051288 ◽

2020 ◽

Vol 12 (5) ◽

pp. 1288

Author(s):

Yueling Wang ◽

Xiaoliu Yang

Keyword(s):

High Performance ◽

Large Scale ◽

Graphics Processing Unit ◽

Hydrologic Model ◽

Hydraulic Model ◽

Small Scale ◽

Processing Unit ◽

Rainfall Runoff ◽

Graphics Processing ◽

The Impact

To protect ecologies and the environment by preventing floods, analysis of the impact of climate change on water requires a tool capable of considering the rainfall-runoff processes on a small scale, for example, 10 m. As has been shown previously, hydrologic models are good at simulating rainfall-runoff processes on a large scale, e.g., over several hundred km², while hydraulic models are more advantageous for applications on smaller scales. In order to take advantage of these two types of models, this paper coupled a hydrologic model, the Xinanjiang model (XAJ), with a hydraulic model, the Graphics Processing Unit (GPU)-accelerated high-performance integrated hydraulic modelling system (HiPIMS). The study was completed in the Misai basin (797 km²), located in Zhejiang Province, China. The coupled XAJ–HiPIMS model was validated against observed flood events. The simulated results agree well with the data observed at the basin outlet. The study shows that a coupled hydrologic and hydraulic model is capable of providing flood information on a small scale for a large basin and shows the potential of the research.

GRAPHICS PROCESSING UNIT CLUSTER ACCELERATED MONTE CARLO SIMULATION OF PHOTON TRANSPORT IN MULTI-LAYERED TISSUES

Journal of Innovative Optical Health Sciences ◽

10.1142/s1793545812500046 ◽

2012 ◽

Vol 05 (02) ◽

pp. 1250004 ◽

Cited By ~ 5

Author(s):

CHAO JIANG ◽

HENG HE ◽

PENGCHENG LI ◽

QINGMING LUO

Keyword(s):

Monte Carlo Simulation ◽

Monte Carlo ◽

Message Passing ◽

Message Passing Interface ◽

Local Area Network ◽

Graphics Processing Unit ◽

Area Network ◽

Processing Unit ◽

Photon Transport ◽

Graphics Processing

We present a graphics processing unit (GPU) cluster-based Monte Carlo simulation of photon transport in multi-layered tissues. The cluster is composed of multiple computing nodes in a local area network where each node is a personal computer equipped with one or several GPU(s) for parallel computing. In this study, the MPI (Message Passing Interface), the OpenMP (Open Multi-Processing) and the CUDA (Compute Unified Device Architecture) technologies are employed to develop the program. It is demonstrated that this design runs roughly N times faster than one using a single GPU when the GPUs within the cluster are of the same type, where N is the total number of GPUs within the cluster.
