Collaborative Parallel Hybrid Metaheuristics on Graphics Processing Unit

Metaheuristics are nondeterministic optimization algorithms used to solve complex problems for which classic approaches are unsuitable. Despite their effectiveness, metaheuristics require considerable computational power and cannot easily be used in time-critical applications. Fortunately, those algorithms are intrinsically parallel and have been implemented on shared memory systems and, more recently, on graphics processing units (GPUs). In this paper, we present highly efficient parallel implementations of particle swarm optimization (PSO), the genetic algorithm (GA) and the simulated annealing (SA) algorithm on GPU using CUDA. Our approach exploits parallelism at the solution level, follows an island model and achieves speedups of up to 346× for different benchmark functions. Most importantly, we also present a strategy that uses the generalized island model to integrate multiple metaheuristics into a parallel hybrid solution adapted to the GPU. Our proposed solution uses OpenMP to heavily exploit the concurrent kernel execution feature of recent NVIDIA GPUs, allowing for the parallel execution of the different metaheuristics in an asynchronous manner. Asynchronous hybrid metaheuristics have been developed for multicore CPUs, but never for GPUs. The speedup offered by the GPU is far superior and key to the optimization of solutions to complex engineering problems.
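The island model mentioned in this abstract can be illustrated with a minimal sketch. This is a sequential CPU illustration in Python, not the authors' CUDA implementation: the sphere benchmark, the mutate-and-keep-if-better search step, the ring topology, and every parameter value below are assumptions chosen for brevity.

```python
import random

def sphere(x):
    """Benchmark objective (an assumption): sum of squares, minimum 0 at the origin."""
    return sum(v * v for v in x)

def evolve_island(population, rng, step=0.1):
    """One generation on one island: mutate each solution, keep the better of the two."""
    next_pop = []
    for sol in population:
        candidate = [v + rng.gauss(0.0, step) for v in sol]
        next_pop.append(candidate if sphere(candidate) < sphere(sol) else sol)
    return next_pop

def migrate_ring(islands):
    """Ring migration: each island's best solution replaces the worst one on the next island."""
    bests = [min(pop, key=sphere) for pop in islands]
    for i, pop in enumerate(islands):
        incoming = bests[(i - 1) % len(islands)]
        worst = max(range(len(pop)), key=lambda j: sphere(pop[j]))
        pop[worst] = list(incoming)

def run_island_model(n_islands=4, pop_size=8, dim=3, generations=200,
                     migration_interval=20, seed=0):
    """Return (best initial fitness, best final fitness) over all islands."""
    rng = random.Random(seed)
    islands = [[[rng.uniform(-5.0, 5.0) for _ in range(dim)]
                for _ in range(pop_size)]
               for _ in range(n_islands)]
    initial_best = min(sphere(s) for pop in islands for s in pop)
    for gen in range(1, generations + 1):
        # Islands evolve independently; on a GPU each island could map to its own kernel or block.
        islands = [evolve_island(pop, rng) for pop in islands]
        if gen % migration_interval == 0:
            migrate_ring(islands)
    final_best = min(sphere(s) for pop in islands for s in pop)
    return initial_best, final_best

if __name__ == "__main__":
    print(run_island_model())
```

Because the islands only interact at migration time, the per-island loop is the natural unit to run in parallel, which is the property the abstract exploits on the GPU.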

Graphics processing unit acceleration of the island model genetic algorithm using the CUDA programming platform

Concurrency and Computation Practice and Experience ◽ 10.1002/cpe.6286 ◽ 2021

Author(s): Dylan M. Janssen ◽ Wayne Pullan ◽ Alan Wee‐Chung Liew

Keyword(s): Genetic Algorithm ◽ Graphics Processing Unit ◽ Island Model ◽ Processing Unit ◽ CUDA Programming ◽ Graphics Processing

A lightweight approach to performance portability with targetDP

The International Journal of High Performance Computing Applications ◽ 10.1177/1094342016682071 ◽ 2016 ◽ Vol 32 (2) ◽ pp. 288-301

Author(s): Alan Gray ◽ Kevin Stratford

Keyword(s): Particle Physics ◽ Message Passing ◽ Graphics Processing Units ◽ High Performance ◽ Large Scale ◽ Message Passing Interface ◽ Graphics Processing Unit ◽ Processing Unit ◽ Performance Portability ◽ Graphics Processing

Leading high performance computing systems achieve their status through use of highly parallel devices such as NVIDIA graphics processing units or Intel Xeon Phi many-core CPUs. The concept of performance portability across such architectures, as well as traditional CPUs, is vital for the application programmer. In this paper we describe targetDP, a lightweight abstraction layer which allows grid-based applications to target data parallel hardware in a platform-agnostic manner. We demonstrate the effectiveness of our pragmatic approach by presenting performance results for a complex fluid application (with which the model was co-designed), plus a separate lattice quantum chromodynamics particle physics code. For each application, a single source code base is seen to achieve portable performance, as assessed within the context of the Roofline model. TargetDP can be combined with Message Passing Interface (MPI) to allow use on systems containing multiple nodes: we demonstrate this through provision of scaling results on traditional and graphics processing unit-accelerated large scale supercomputers.
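The Roofline model used as the yardstick in this abstract bounds attainable performance by the smaller of peak compute throughput and memory bandwidth multiplied by arithmetic intensity. The sketch below shows that bound only; the function names and the hardware figures in the usage lines are hypothetical and are not taken from the paper.

```python
def roofline_attainable(peak_gflops, bandwidth_gbs, arithmetic_intensity):
    """Attainable GFLOP/s under the Roofline bound.

    peak_gflops: peak compute throughput (GFLOP/s)
    bandwidth_gbs: peak memory bandwidth (GB/s)
    arithmetic_intensity: floating-point operations per byte moved (FLOP/byte)
    """
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)

def ridge_point(peak_gflops, bandwidth_gbs):
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_gflops / bandwidth_gbs

if __name__ == "__main__":
    # Hypothetical device: 1000 GFLOP/s peak, 200 GB/s bandwidth.
    print(roofline_attainable(1000.0, 200.0, 0.5))   # memory-bound kernel
    print(roofline_attainable(1000.0, 200.0, 10.0))  # compute-bound kernel
    print(ridge_point(1000.0, 200.0))
```

Comparing a kernel's measured throughput against this bound on each platform is one way to judge whether a single source base achieves portable performance.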

Graphics processing unit implementation of the F-statistic for continuous gravitational wave searches

Classical and Quantum Gravity ◽ 10.1088/1361-6382/ac4616 ◽ 2021

Author(s): Liam Dunn ◽ Patrick Clearwater ◽ Andrew Melatos ◽ Karl Wette

Keyword(s): Gravitational Wave ◽ Graphics Processing Units ◽ Graphics Processing Unit ◽ Computational Cost ◽ Processing Unit ◽ Central Processing ◽ Long Baseline ◽ Using Data ◽ Graphics Processing ◽ GPU Implementation

The F-statistic is a detection statistic used widely in searches for continuous gravitational waves with terrestrial, long-baseline interferometers. A new implementation of the F-statistic is presented which accelerates the existing "resampling" algorithm using graphics processing units (GPUs). The new implementation runs between 10 and 100 times faster than the existing implementation on central processing units without sacrificing numerical accuracy. The utility of the GPU implementation is demonstrated on a pilot narrowband search for four newly discovered millisecond pulsars in the globular cluster Omega Centauri using data from the second Laser Interferometer Gravitational-Wave Observatory observing run. The computational cost is 17.2 GPU-hours using the new implementation, compared to 1092 core-hours with the existing implementation.

Massively parallel hybrid algorithm on embedded graphics processing unit for unmanned aerial vehicle path planning

International Journal of Digital Signals and Smart Systems ◽ 10.1504/ijdsss.2018.090875 ◽ 2018 ◽ Vol 2 (1) ◽ pp. 68 ◽ Cited By ~ 1

Author(s): Vincent Roberge ◽ Mohammed Tarbouchi

Keyword(s): Path Planning ◽ Unmanned Aerial Vehicle ◽ Hybrid Algorithm ◽ Graphics Processing Unit ◽ Massively Parallel ◽ Processing Unit ◽ Parallel Hybrid ◽ Aerial Vehicle ◽ Vehicle Path ◽ Graphics Processing

Advanced Topics GPU Programming and CUDA Architecture

Advances in Systems Analysis, Software Engineering, and High Performance Computing - Emerging Research Surrounding Power Consumption and Performance Issues in Utility Computing ◽ 10.4018/978-1-4666-8853-7.ch008 ◽ 2016 ◽ pp. 175-203

Author(s): Mainak Adhikari ◽ Sukhendu Kar

Keyword(s): Graphics Processing Units ◽ High Performance ◽ Programming Model ◽ Graphics Processing Unit ◽ Direct Access ◽ GPU Programming ◽ Processing Unit ◽ Computing Platform ◽ CUDA Architecture ◽ Graphics Processing

A graphics processing unit (GPU) typically handles computation only for computer graphics. Any GPU providing a functionally complete set of operations performed on arbitrary bits can compute any computable value. Additionally, the use of multiple graphics cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing. CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA and implemented by its graphics processing units (GPUs). CUDA gives program developers direct access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs. This chapter first discusses some features and challenges of GPU programming and the efforts to address some of the challenges of building and running GPU programs in a high performance computing (HPC) environment. Finally, this chapter points out the importance and standards of the CUDA architecture.

Data Streaming Processing Window Joined With Graphics Processing Units (GPUs)

Encyclopedia of Information Science and Technology, Fifth Edition - Advances in Information Quality and Management ◽ 10.4018/978-1-7998-3479-3.ch043 ◽ 2021 ◽ pp. 602-623

Author(s): Shen Lu ◽ Richard S. Segall

Keyword(s): Big Data ◽ Data Streams ◽ Graphics Processing Units ◽ Data Stream ◽ Large Scale ◽ Graphics Processing Unit ◽ Processing Unit ◽ Data Streaming ◽ Large Scale Data ◽ Graphics Processing

Big data is large-scale data and can be either discrete or continuous. This article discusses the continuous case of big data, often called “data streaming.” More and more businesses will depend on being able to process and make decisions on streams of data. This article draws on the algorithmic side of data stream processing, often called “stream analytics” or “stream mining.” Data stream Window Join can be improved by using a graphics processing unit (GPU) for higher-performance computing. Data streams are generated by two independent threads: one thread can be used to generate Data Stream A, and the other thread can be used to generate Data Stream B. A Window Join thread then merges the two data streams, which is the process of “Data Stream Window Join.” The Window Join process can be implemented in parallel, which can efficiently improve the computing speed. Experiments are provided for Data Stream Window Joins using both static and dynamic data.
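The window join described in this abstract can be sketched as a join of two timestamped streams in which tuples match when their keys are equal and their timestamps lie within a fixed window. The version below is a single-threaded Python illustration, not the article's GPU implementation; the `(timestamp, key, value)` tuple layout, the function name `window_join`, and the time-based window are all assumptions.

```python
from collections import deque

def window_join(stream_a, stream_b, window):
    """Join two streams of (timestamp, key, value) tuples.

    A pair is emitted as (key, value_a, value_b) when the keys are equal and
    the timestamps differ by at most `window`.
    """
    # Merge both streams into one timeline, remembering which side each tuple came from.
    events = [(ts, "A", key, val) for ts, key, val in stream_a]
    events += [(ts, "B", key, val) for ts, key, val in stream_b]
    events.sort(key=lambda e: e[0])  # stable sort: ties keep A before B

    buffers = {"A": deque(), "B": deque()}
    results = []
    for ts, side, key, val in events:
        other = "B" if side == "A" else "A"
        # Expire tuples on the opposite side that have fallen out of the window.
        while buffers[other] and buffers[other][0][0] < ts - window:
            buffers[other].popleft()
        # Probe the opposite window for matching keys.
        for _, other_key, other_val in buffers[other]:
            if other_key == key:
                pair = (val, other_val) if side == "A" else (other_val, val)
                results.append((key,) + pair)
        buffers[side].append((ts, key, val))
    return results

if __name__ == "__main__":
    a = [(1, "x", "a1"), (5, "y", "a2")]
    b = [(2, "x", "b1"), (9, "y", "b2")]
    print(window_join(a, b, window=2))
```

The probe step for each arriving tuple is independent of the probes for other tuples, which is the part that lends itself to parallel execution.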

An Interval Type 2 Fuzzy Logic Framework for Faster Evolutionary Design

Journal of Computational and Theoretical Nanoscience ◽ 10.1166/jctn.2019.8576 ◽ 2019 ◽ Vol 16 (12) ◽ pp. 5140-5148

Author(s): Sarabjeet Singh ◽ Satvir Singh ◽ Vijay Kumar Banga

Keyword(s): Fuzzy Logic ◽ Graphics Processing Unit ◽ Noisy Data ◽ Parallel Execution ◽ Rule Base ◽ Processing Unit ◽ Data Set ◽ Interval Type ◽ Graphics Processing

In this paper, a fast and efficient framework is proposed to obtain an optimum output from a noisy data set of a system by using an interval type-2 fuzzy logic system. Further, the concept of GPGPU (General Purpose Computing on Graphics Processing Unit) is used for fast execution of the fuzzy rule base on a Graphics Processing Unit (GPU). The Whale Optimization Algorithm (WOA) is used to ascertain the optimum output from the noisy data set; it is further integrated with the Interval Type-2 (IT2) fuzzy logic system and executed on the GPU for faster execution. The proposed framework is also designed for parallel execution using the GPU, and the results are compared with serial program execution. Further, it is clearly observed that the rule base evolved through parallel execution provides better accuracy in less time. The proposed framework (IT2 FLS) has been validated with the classical benchmark problem of the Mackey-Glass time series. Non-stationary time-series data with additive Gaussian noise has also been processed with both the proposed framework and a T1 FLS. Further, it is observed that the IT2 FLS provides a better rule base for noisy data sets.

dadi.CUDA: Accelerating population genetic inference with Graphics Processing Units

10.1101/2020.07.30.229336 ◽ 2020

Author(s): Ryan N Gutenkunst

Keyword(s): Graphics Processing Units ◽ Population Genetic ◽ GPU Computing ◽ Graphics Processing Unit ◽ Demographic History ◽ Processing Unit ◽ Speed Increase ◽ Population Genetic Inference ◽ Computationally Intensive ◽ Graphics Processing

Extracting insight from population genetic data often demands computationally intensive modeling. dadi is a popular program for fitting models of demographic history and natural selection to such data. Here, I show that running dadi on a Graphics Processing Unit (GPU) can speed computation by orders of magnitude compared to the CPU implementation, with minimal user burden. This speed increase enables the analysis of more complex models, which motivated the extension of dadi to four- and five-population models. Remarkably, dadi performs almost as well on inexpensive consumer-grade GPUs as on expensive server-grade GPUs. GPU computing thus offers large and accessible benefits to the community of dadi users. This functionality is available in dadi version 2.1.0.

Accelerating the RTTOV-7 IASI and AMSU-A radiative transfer models on graphics processing units: evaluating central processing unit/graphics processing unit-hybrid and pure-graphics processing unit approaches

Journal of Applied Remote Sensing ◽ 10.1117/1.3658028 ◽ 2011 ◽ Vol 5 (1) ◽ pp. 051503 ◽ Cited By ~ 4

Author(s): Jarno Mielikainen

Keyword(s): Radiative Transfer ◽ Graphics Processing Units ◽ Graphics Processing Unit ◽ Central Processing Unit ◽ Processing Unit ◽ Central Processing ◽ Radiative Transfer Models ◽ Graphics Processing ◽ Transfer Models

Parallel SVD Algorithm for a Three-Diagonal Matrix on a Video Card Using the Nvidia CUDA Architecture

NaUKMA Research Papers Computer Science ◽ 10.18523/2617-3808.2021.4.16-22 ◽ 2021 ◽ Vol 4 ◽ pp. 16-22

Author(s): Mykola Semylitko ◽ Gennadii Malaschonok

Keyword(s): Graphics Processing Units ◽ Graphics Processing Unit ◽ Computation Time ◽ Application Programming Interface ◽ General Purpose ◽ Diagonal Matrix ◽ Free Access ◽ Processing Unit ◽ CUDA Architecture ◽ Graphics Processing

The SVD (Singular Value Decomposition) algorithm is used in recommendation systems, machine learning, image processing, and various algorithms that work with very large matrices and Big Data. Given the peculiarities of this algorithm, it can be performed on the large number of computing threads that only video cards provide. CUDA is a parallel computing platform and application programming interface model created by Nvidia. It allows software developers and software engineers to use a CUDA-enabled graphics processing unit for general purpose processing, an approach termed GPGPU (general-purpose computing on graphics processing units). The GPU provides much higher instruction throughput and memory bandwidth than the CPU within a similar price and power envelope. Many applications leverage these higher capabilities to run faster on the GPU than on the CPU. Other computing devices, like FPGAs, are also very energy efficient, but they offer much less programming flexibility than GPUs. The developed modification uses the CUDA architecture, which is intended for a large number of simultaneous calculations and allows matrices of very large sizes to be processed quickly. The parallel SVD algorithm for a three-diagonal matrix, based on the Givens rotation, provides high accuracy of calculations. The algorithm also has a number of optimizations for memory handling and for the multiplication algorithms that can significantly reduce the computation time by discarding empty iterations. This article proposes an approach that will reduce the computation time and, consequently, resources and costs. The developed algorithm can be used through a simple and convenient API in C++ and Java, and can be improved further by using dynamic parallelism or parallelization of multiplication operations. The obtained results can also be used by other developers for comparison, as all conditions of the research are described in detail and the code is freely accessible.
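The Givens rotation that this abstract names as the basis of the algorithm is a plane rotation chosen to zero one selected matrix entry. The sketch below shows only that building block in plain Python; it is not the authors' parallel SVD, and the helper names `givens` and `apply_givens_rows` are hypothetical.

```python
import math

def givens(a, b):
    """Return (c, s) such that [[c, s], [-s, c]] applied to (a, b) gives (r, 0)."""
    if b == 0.0:
        return 1.0, 0.0  # nothing to eliminate
    r = math.hypot(a, b)
    return a / r, b / r

def apply_givens_rows(m, i, k, c, s):
    """Rotate rows i and k of matrix m (a list of lists) in place."""
    for j in range(len(m[i])):
        x, y = m[i][j], m[k][j]
        m[i][j] = c * x + s * y
        m[k][j] = -s * x + c * y

if __name__ == "__main__":
    m = [[3.0, 1.0],
         [4.0, 2.0]]
    c, s = givens(m[0][0], m[1][0])
    apply_givens_rows(m, 0, 1, c, s)
    print(m)  # the entry m[1][0] is now (numerically) zero
```

Each rotation touches only two rows, so the update of every column within those rows is independent, which is one reason such rotations map well onto many GPU threads.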
