Heterogeneous Multicore Parallel Programming for Graphics Processing Units

2009, Vol. 17(4), pp. 325-336
Author(s): Francois Bodin, Stephane Bihan

Hybrid parallel multicore architectures based on graphics processing units (GPUs) can provide tremendous computing power. Current hardware from NVIDIA and AMD's Graphics Product Group delivers a peak performance of hundreds of gigaflops. However, exploiting GPUs from existing applications is a difficult task that requires non-portable rewriting of the code. In this paper, we present HMPP, a Heterogeneous Multicore Parallel Programming workbench with compilers, developed by CAPS entreprise, that allows hardware accelerators to be integrated in an unintrusive manner while preserving the legacy code.
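
For illustration, directive-based offload in the spirit of HMPP looks roughly like the sketch below. The directive grammar (the `codelet`/`callsite` labels and the `args[...].io` clause) is reconstructed from memory of the CAPS documentation and may not match any shipping release exactly; the point is that the annotated function stays valid C/C++ if the pragmas are ignored, which is how the legacy code is preserved.

```cuda
/* Hypothetical HMPP-style annotation: the "codelet" is a pure function the
 * tool can translate for a GPU target; the "callsite" marks where the
 * generated accelerator version replaces the plain CPU call. */

#pragma hmpp saxpy codelet, target=CUDA, args[y].io=inout
void saxpy(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

int main(void) {
    enum { N = 1 << 20 };
    static float x[N], y[N];
    for (int i = 0; i < N; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

    #pragma hmpp saxpy callsite
    saxpy(N, 3.0f, x, y);   /* runs on the GPU if HMPP is active, CPU otherwise */
    return 0;
}
```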

2010, Vol. 18(1), pp. 1-33
Author(s): Andre R. Brodtkorb, Christopher Dyken, Trond R. Hagen, Jon M. Hjelmervik, Olaf O. Storaasli

Node-level heterogeneous architectures have become attractive during the last decade for several reasons: compared to traditional symmetric CPUs, they offer high peak performance and are energy- and/or cost-efficient. With the increase of fine-grained parallelism in high-performance computing, as well as the introduction of parallelism into workstations, there is an acute need for a good overview and understanding of these architectures. We give an overview of the state of the art in heterogeneous computing, focusing on three commonly found architectures: the Cell Broadband Engine Architecture, graphics processing units (GPUs), and field-programmable gate arrays (FPGAs). We present a review of hardware and available software tools, and an overview of state-of-the-art techniques and algorithms. Furthermore, we present a qualitative and quantitative comparison of the architectures and give our view on the future of heterogeneous computing.


Author(s): Catherine Rucki, Abhilash J. Chandy

The accurate simulation of turbulence and the implementation of corresponding turbulence models are both critical to understanding the complex physics behind turbulent flows in a variety of science and engineering applications. Despite the tremendous increase in the computing power of central processing units (CPUs), direct numerical simulation (DNS) of highly turbulent flows is still not feasible due to the need to resolve the smallest length scales; today's CPUs cannot keep pace with demand. The recent development of graphics processing units (GPUs) has led to general improvements in the performance of various algorithms. This study investigates the applicability of GPU technology to fast Fourier transform (FFT)-based pseudo-spectral methods for DNS of turbulent flows, using the Taylor–Green vortex problem as a test case. The methods are implemented on a single GPU, and a speedup of up to 31× is obtained in comparison to a single CPU.
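
The core GPU primitive in such a pseudo-spectral solver is the 3D FFT. A minimal sketch using NVIDIA's cuFFT library follows; the grid size and the surrounding time-stepping logic are assumptions for illustration, not details taken from the paper.

```cuda
#include <cufft.h>
#include <cuda_runtime.h>

#define N 256  // grid points per dimension (assumed resolution)

int main() {
    cufftComplex *d_u;
    size_t bytes = sizeof(cufftComplex) * N * N * N;
    cudaMalloc(&d_u, bytes);
    // ... fill d_u with one velocity component of the initial field ...

    cufftHandle plan;
    cufftPlan3d(&plan, N, N, N, CUFFT_C2C);

    // Forward transform to spectral space, where each spatial derivative
    // becomes a pointwise multiplication by i*k.
    cufftExecC2C(plan, d_u, d_u, CUFFT_FORWARD);
    // ... form nonlinear terms, dealias, advance the time step ...
    cufftExecC2C(plan, d_u, d_u, CUFFT_INVERSE);  // back to physical space
    // (cuFFT inverse transforms are unnormalized: divide by N*N*N.)

    cufftDestroy(plan);
    cudaFree(d_u);
    return 0;
}
```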


Author(s):  
Ana Moreton–Fernandez ◽  
Hector Ortega–Arranz ◽  
Arturo Gonzalez–Escribano

Nowadays the use of hardware accelerators, such as graphics processing units (GPUs) or Xeon Phi coprocessors, is key to solving computationally costly problems that require high-performance computing. However, programming efficient solutions for these kinds of devices is a complex task that relies on the manual management of memory transfers and configuration parameters. The programmer has to study in detail which data need to be computed at each moment, across different computing platforms, while also considering architectural details. We introduce the controller concept as an abstract entity that allows the programmer to easily manage communications and kernel-launching details on hardware accelerators in a transparent way. The model also makes it possible to define and launch central processing unit (CPU) kernels on multi-core processors with the same abstraction and methodology used for the accelerators. Internally, it combines different native programming models and technologies to exploit the potential of each kind of device. Additionally, the model helps the programmer select appropriate values for the configuration parameters available when a kernel is launched, through a qualitative characterization of the kernel code to be executed. Finally, we present the implementation of the controller model in a prototype library, together with its application in several case studies. Its use has reduced development and porting costs, with very low overheads in execution time compared to manually programmed and optimized solutions that directly use CUDA and OpenMP.
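
A minimal sketch of the controller idea is given below. All names here (`Controller`, `Mode`, the two `saxpy` variants) are invented for illustration and do not reproduce the prototype library's actual API; they only show how a single handle can hide both the memory transfers and the launch configuration behind one call site.

```cuda
#include <cuda_runtime.h>

enum class Mode { CpuOmp, GpuCuda };

__global__ void saxpy_gpu(int n, float a, const float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

static void saxpy_cpu(int n, float a, const float *x, float *y) {
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];
}

struct Controller {  // hides transfers and launch configuration per device
    Mode mode;
    void saxpy(int n, float a, const float *x, float *y) const {
        if (mode == Mode::GpuCuda) {
            float *dx, *dy;
            cudaMalloc(&dx, n * sizeof(float));
            cudaMalloc(&dy, n * sizeof(float));
            cudaMemcpy(dx, x, n * sizeof(float), cudaMemcpyHostToDevice);
            cudaMemcpy(dy, y, n * sizeof(float), cudaMemcpyHostToDevice);
            int block = 256;  // the configuration parameter being abstracted
            saxpy_gpu<<<(n + block - 1) / block, block>>>(n, a, dx, dy);
            cudaMemcpy(y, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
            cudaFree(dx); cudaFree(dy);
        } else {
            saxpy_cpu(n, a, x, y);
        }
    }
};

int main() {
    const int n = 1 << 20;
    float *x = new float[n], *y = new float[n];
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
    Controller ctrl{Mode::GpuCuda};  // or Mode::CpuOmp: the call is identical
    ctrl.saxpy(n, 3.0f, x, y);
    delete[] x; delete[] y;
    return 0;
}
```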


Geophysics, 2016, Vol. 81(2), pp. T35-T43
Author(s): Jon Marius Venstad

The difference in computational power between the few-core and many-core architectures represented by central processing units (CPUs) and graphics processing units (GPUs) is significant today, and this difference is likely to increase in the years ahead. GPUs are, therefore, ever more popular for applications in computational physics, such as wave modeling. Finite-difference methods are popular for wave modeling and are well suited to the GPU architecture, but developing an efficient and capable GPU implementation is hindered by the limited size of GPU memory. I show how the out-of-core technique can be used to circumvent the memory limit on the GPU, increasing the available memory to that of the CPU (the main memory) instead, with no significant computational overhead. This approach has several advantages over a parallel scheme in terms of applicability, flexibility, and hardware requirements. Choices in the numerical scheme, the numerical differentiators in particular, also greatly affect computational efficiency. These factors are considered explicitly for GPU implementations of wave modeling, because GPUs are special-purpose processors with an exposed architecture.
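
A minimal sketch of the out-of-core pattern, under assumed names and with a z-only stencil for brevity: the full wavefield stays in host memory (ideally pinned), and each time step streams slabs plus halo planes through a small GPU-resident buffer. A production code would double-buffer with `cudaMemcpyAsync` on two streams so that the upload of slab k+1 overlaps the compute of slab k, which is how the transfer cost is hidden.

```cuda
#include <cuda_runtime.h>

// Toy 1D-in-z Laplacian; a real wave kernel also differentiates in x and y.
// cur holds nzs interior planes plus one halo plane below and one above.
__global__ void stencil_z(float *next, const float *cur, long plane, int nzs) {
    long i = (long)blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= plane * nzs) return;
    next[i] = cur[i] + cur[i + 2 * plane] - 2.0f * cur[i + plane];
}

// One time step over an nx*ny*nz field that does not fit in GPU memory.
void step_out_of_core(const float *h_cur, float *h_next,
                      long nx, long ny, long nz, int slab_nz) {
    long plane = nx * ny;
    float *d_cur, *d_next;
    cudaMalloc(&d_cur, (slab_nz + 2) * plane * sizeof(float));
    cudaMalloc(&d_next, (long)slab_nz * plane * sizeof(float));

    for (long z0 = 1; z0 + 1 < nz; z0 += slab_nz) {  // interior planes only
        int nzs = (int)((z0 + slab_nz + 1 <= nz) ? slab_nz : nz - 1 - z0);
        // Upload the slab plus its two halo planes, compute, download.
        cudaMemcpy(d_cur, h_cur + (z0 - 1) * plane,
                   (nzs + 2) * plane * sizeof(float), cudaMemcpyHostToDevice);
        int threads = 256;
        long blocks = (plane * nzs + threads - 1) / threads;
        stencil_z<<<(unsigned)blocks, threads>>>(d_next, d_cur, plane, nzs);
        cudaMemcpy(h_next + z0 * plane, d_next,
                   (long)nzs * plane * sizeof(float), cudaMemcpyDeviceToHost);
    }
    cudaFree(d_cur);
    cudaFree(d_next);
}
```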


2015, Vol. 24(01), pp. 1540001
Author(s): Charalampos S. Kouzinopoulos, John-Alexander M. Assael, Themistoklis K. Pyrgiotis, Konstantinos G. Margaritis

Multiple-pattern matching algorithms are used to locate the occurrences of patterns from a finite pattern set in a large input string. Aho–Corasick and Wu–Manber, two of the best-known multiple-pattern matching algorithms, require increased computing power, particularly in cases where large datasets must be processed, as is common in computational biology applications. Over the past years, graphics processing units (GPUs) have evolved into powerful parallel processors that outperform central processing units (CPUs) in scientific calculations. Moreover, multiple GPUs can be used in parallel, forming hybrid computer-cluster configurations to achieve even higher processing throughput. This paper evaluates the speedup of parallel implementations of the Aho–Corasick and Wu–Manber algorithms on a hybrid GPU cluster, when used to process a snapshot of the expressed sequence tags (ESTs) of the human genome, for different problem parameters.
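
To fix ideas, a sketch of the usual data-parallel Aho–Corasick scan is given below. The dense DFA layout (`delta[state * 256 + byte]`), the per-state longest-match length, and the counting-only output are simplifying assumptions for this sketch; full implementations also handle several patterns ending at one state and record match positions rather than a bare count.

```cuda
#include <cuda_runtime.h>

// Each thread scans one chunk of the text plus `overlap` extra bytes
// (longest pattern length - 1), so matches that straddle a chunk boundary
// are still found. Counting only matches that START inside the thread's
// own chunk avoids double counting between neighboring threads.
__global__ void ac_scan(const unsigned char *text, long n,
                        const int *delta,      // dense DFA: delta[state*256 + byte]
                        const int *match_len,  // longest pattern ending at state, 0 if none
                        long chunk, int overlap,
                        unsigned long long *hits) {
    long start = (long)(blockIdx.x * blockDim.x + threadIdx.x) * chunk;
    if (start >= n) return;
    long end = start + chunk + overlap;
    if (end > n) end = n;

    int state = 0;  // root of the automaton; each thread starts cold
    for (long i = start; i < end; ++i) {
        state = delta[state * 256 + text[i]];
        int len = match_len[state];
        if (len && i - len + 1 < start + chunk)
            atomicAdd(hits, 1ULL);  // a real code records (position, pattern id)
    }
}
```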


2017, Vol. 2017, pp. 1-15
Author(s): Jiankuo Dong, Fangyu Zheng, Wuqiong Pan, Jingqiang Lin, Jiwu Jing, ...

Implementations of asymmetric cryptographic algorithms (e.g., RSA and elliptic curve cryptography) on Graphics Processing Units (GPUs) have been researched for over a decade. The basic idea of most previous contributions is to exploit the highly parallel GPU architecture and port the integer-based algorithms from general-purpose CPUs to GPUs to obtain high performance. However, the full cryptographic computing power of GPUs, especially that of their more powerful floating-point instructions, has not been comprehensively investigated. In this paper, we fully exploit the floating-point computing power of GPUs through several designs, including a floating-point-based Montgomery multiplication/exponentiation algorithm and a Chinese Remainder Theorem (CRT) implementation on the GPU. For practical use of the proposed algorithms, a new method is introduced to convert the input/output between octet strings and floating-point numbers, fully utilizing the GPU and further improving overall performance by about 5%. The performance of RSA-2048/3072/4096 decryption on an NVIDIA GeForce GTX TITAN reaches 42,211/12,151/5,790 operations per second, respectively, which is 13 times the performance of the previous fastest floating-point-based implementation (published at Eurocrypt 2009). The RSA-4096 decryption result exceeds the existing fastest integer-based result by 23%.
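
For orientation, the textbook integer form of Montgomery multiplication that such GPU implementations start from is sketched below, as a single-limb 64-bit toy in host code. The paper's contribution is to recast the multi-limb version of this arithmetic in floating point to reach the GPU's FP units, which this sketch does not attempt.

```cuda
#include <cstdint>

// Single-limb Montgomery multiplication with R = 2^64: given a, b < n
// (n odd, n < 2^63 so the 128-bit sum below cannot overflow), returns
// a*b*R^{-1} mod n. Operands are kept in "Montgomery form" aR mod n, so
// the R^{-1} factor cancels across a chain of multiplies.
using u64 = uint64_t;
using u128 = unsigned __int128;

static u64 mont_mul(u64 a, u64 b, u64 n, u64 n_prime) {
    u128 t = (u128)a * b;
    u64 m = (u64)t * n_prime;                 // m = (t mod R) * (-n^{-1}) mod R
    u64 r = (u64)((t + (u128)m * n) >> 64);   // t + m*n is exactly divisible by R
    return (r >= n) ? r - n : r;              // one conditional subtraction
}

// Precompute n_prime = -n^{-1} mod 2^64 by Newton iteration: for odd n,
// x = n is already correct to 3 bits, and each step doubles the correct bits.
static u64 mont_setup(u64 n) {
    u64 inv = n;
    for (int i = 0; i < 5; ++i) inv *= 2 - n * inv;  // 3 -> 6 -> 12 -> 24 -> 48 -> 96 bits
    return ~inv + 1;                                 // negate mod 2^64
}
```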


Author(s): Hsi-Yu Schive, Ui-Han Zhang, Tzihong Chiueh

We present the implementation and performance of a class of directionally unsplit Riemann-solver-based hydrodynamic schemes on graphics processing units (GPUs). These schemes, including the MUSCL-Hancock method, a variant of the MUSCL-Hancock method, and the corner-transport-upwind method, are embedded into the adaptive-mesh-refinement (AMR) code GAMER. Furthermore, a hybrid MPI/OpenMP model is investigated, which enables full exploitation of the computing power in a heterogeneous CPU/GPU cluster and significantly improves the overall performance. Performance benchmarks are conducted on the Dirac GPU cluster at NERSC/LBNL using up to 32 Tesla C2050 GPUs. A single GPU achieves speed-ups of 101 (25) and 84 (22) for uniform-mesh and AMR simulations, respectively, compared with the performance using one (four) CPU core(s), and this excellent performance persists in multi-GPU tests. In addition, we make a direct comparison between GAMER and the widely adopted CPU code Athena in adiabatic hydrodynamic tests and demonstrate that, at the same accuracy, GAMER achieves a performance speed-up of two orders of magnitude.
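
A skeletal sketch of that hybrid layout follows, with invented function names (in comments) standing in for GAMER's actual solver calls: one MPI rank drives one GPU, while OpenMP threads prepare AMR patch data on the CPU side so that host work overlaps the asynchronous GPU solve.

```cuda
#include <mpi.h>
#include <cuda_runtime.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int n_gpu = 0;
    cudaGetDeviceCount(&n_gpu);
    if (n_gpu > 0)
        cudaSetDevice(rank % n_gpu);   // bind each MPI rank to one local GPU

    const int n_patch_groups = 128;    // per-step work-unit count (made up)
    #pragma omp parallel for schedule(dynamic)
    for (int p = 0; p < n_patch_groups; ++p) {
        // prepare_patch_group(p);  // CPU side: gather ghost zones, pack
        //                          // conserved variables into pinned buffers
    }
    // launch_solver_async(...);    // GPU side: MUSCL-Hancock / CTU kernels,
    //                              // overlapped with the next CPU batch
    // MPI_Sendrecv(...);           // exchange coarse-fine boundary data

    MPI_Finalize();
    return 0;
}
```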

