Optimization of K-Means Clustering on Graphics Processing Unit Using Compute Unified Device Architecture

Parallel data clustering aims to use algorithms and methods to extract knowledge from large databases in reasonable time using high-performance architectures. The computational challenge that growing data volumes pose for cluster analysis can be overcome by exploiting the power of these architectures. Recent developments in the parallel power of the Graphics Processing Unit enable low-cost, high-performance solutions for general-purpose applications. The Compute Unified Device Architecture programming model provides application programming interface methods to handle data efficiently on the Graphics Processing Unit for iterative clustering algorithms such as K-Means. Existing Graphics Processing Unit based K-Means algorithms focus heavily on improving speedup and fall short in handling the large amount of time spent transferring data between the Central Processing Unit and the Graphics Processing Unit. An efficient K-Means algorithm is proposed in this paper to reduce the transfer time by introducing a novel approach to checking the convergence of the algorithm and by utilizing pinned memory for direct access. This algorithm outperforms the other algorithms by maximizing parallelism and utilizing the memory features. The relative speedups and the validity measure for the proposed algorithm are higher than those of K-Means on Graphics Processing Unit and K-Means using Flag on Graphics Processing Unit. Thus the proposed approach shows that communication overhead can be reduced in K-Means clustering.
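The abstract does not say how the convergence check works, so the following is only a minimal CPU sketch of one common idea, under the assumption that convergence is detected by counting how many points changed cluster in an iteration. In a GPU version, such a counter would be the only value copied back to the host per iteration, instead of the full label array. The function name `kmeans` and its parameters are hypothetical and are not taken from the paper.

```python
def kmeans(points, init_centroids, max_iter=100):
    """Plain K-Means that stops as soon as no point changes cluster.

    points         -- list of equal-length numeric tuples
    init_centroids -- list of k starting centroids (same dimension)
    Returns (centroids, labels, iterations_run).
    """
    centroids = [list(c) for c in init_centroids]
    k = len(centroids)
    labels = [-1] * len(points)
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # Assumption: a single counter stands in for the convergence check.
        changed = 0
        for i, p in enumerate(points):
            # Assignment step: nearest centroid by squared Euclidean distance.
            best = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            if best != labels[i]:
                labels[i] = best
                changed += 1
        if changed == 0:
            break  # converged: no assignment moved in this pass
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, labels, iteration


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels, iterations = kmeans(points, [(0, 0), (10, 10)])
```

With two well-separated groups and one starting centroid in each, the first pass assigns every point and the second pass confirms that nothing moved.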

Advanced Topics GPU Programming and CUDA Architecture

Advances in Systems Analysis, Software Engineering, and High Performance Computing - Emerging Research Surrounding Power Consumption and Performance Issues in Utility Computing ◽

10.4018/978-1-4666-8853-7.ch008 ◽

2016 ◽

pp. 175-203

Author(s):

Mainak Adhikari ◽

Sukhendu Kar

Keyword(s):

Graphics Processing Units ◽

High Performance ◽

Programming Model ◽

Graphics Processing Unit ◽

Direct Access ◽

GPU Programming ◽

Processing Unit ◽

Computing Platform ◽

CUDA Architecture ◽

Graphics Processing

A graphics processing unit (GPU) typically handles computation only for computer graphics. Any GPU providing a functionally complete set of operations performed on arbitrary bits can compute any computable value. Additionally, the use of multiple graphics cards in one computer, or large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing. CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA and implemented on its graphics processing units (GPUs). CUDA gives program developers direct access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs. This chapter first discusses some features and challenges of GPU programming and the efforts to address the challenges of building and running GPU programs in a high performance computing (HPC) environment. Finally, this chapter points out the importance and standards of the CUDA architecture.

Parallel data mining techniques on Graphics Processing Unit with Compute Unified Device Architecture (CUDA)

The Journal of Supercomputing ◽

10.1007/s11227-011-0672-7 ◽

2011 ◽

Vol 64 (3) ◽

pp. 942-967 ◽

Cited By ~ 44

Author(s):

Liheng Jian ◽

Cheng Wang ◽

Ying Liu ◽

Shenshen Liang ◽

Weidong Yi ◽

...

Keyword(s):

Data Mining ◽

Graphics Processing Unit ◽

Processing Unit ◽

Compute Unified Device Architecture ◽

Data Mining Techniques ◽

Device Architecture ◽

Parallel Data ◽

Parallel Data Mining ◽

Graphics Processing

POM.gpu-v1.0: a GPU-based Princeton Ocean Model

Geoscientific Model Development ◽

10.5194/gmd-8-2815-2015 ◽

2015 ◽

Vol 8 (9) ◽

pp. 2815-2827 ◽

Cited By ~ 13

Author(s):

S. Xu ◽

X. Huang ◽

L.-Y. Oey ◽

F. Xu ◽

H. Fu ◽

...

Keyword(s):

Graphics Processing Units ◽

High Performance ◽

Climate Models ◽

Ocean Model ◽

Compute Unified Device Architecture ◽

Princeton Ocean Model ◽

Central Processing ◽

Device Architecture ◽

Computationally Intensive ◽

Graphics Processing

Abstract. Graphics processing units (GPUs) are an attractive solution in many scientific applications due to their high performance. However, most existing GPU conversions of climate models use GPUs for only a few computationally intensive regions. In the present study, we redesign the mpiPOM (a parallel version of the Princeton Ocean Model) with GPUs. Specifically, we first convert the model from its original Fortran form to a new Compute Unified Device Architecture C (CUDA-C) code, then we optimize the code on each of the GPUs, the communications between the GPUs, and the I/O between the GPUs and the central processing units (CPUs). We show that the performance of the new model on a workstation containing four GPUs is comparable to that on a powerful cluster with 408 standard CPU cores, and it reduces the energy consumption by a factor of 6.8.

CUDA-ACCELERATED FEATURE SELECTION

Proceedings of the International Conference on Emerging Trends in Engineering & Technology (IConETech-2020) ◽

10.47412/juqg5057 ◽

2020 ◽

Author(s):

Sterling Ramroach ◽

Jonathan Herbert ◽

Ajay Joshi

Keyword(s):

High Performance ◽

Graphics Processing Unit ◽

Pearson Correlation ◽

High Dimensional ◽

Processing Unit ◽

Device Architecture ◽

Importance Ranking ◽

Using Data ◽

Graphics Processing ◽

Performance Computing

Identifying important features from high-dimensional data is usually done using one-dimensional filtering techniques. These techniques discard noisy attributes and those that are constant throughout the data. This is a time-consuming task that has scope for acceleration via high performance computing techniques involving the graphics processing unit (GPU). The proposed algorithm involves acceleration via the Compute Unified Device Architecture (CUDA) framework developed by Nvidia. This framework facilitates the seamless scaling of computation on any CUDA-enabled GPU. Thus, the Pearson Correlation Coefficient can be applied in parallel on each feature with respect to the response variable. The ranks obtained for each feature can be used to determine the most relevant features to select. Using data from the UCI Machine Learning Repository, our results show an increase in efficiency for multi-dimensional analysis with a more reliable feature importance ranking. When tested on a high-dimensional dataset of 1000 samples and 10,000 features, we achieved a 1,230-fold speedup using CUDA. This acceleration grows exponentially, as with any embarrassingly parallel task.
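The per-feature filtering described in this abstract can be illustrated with a small sequential sketch. It computes the Pearson correlation coefficient of each feature column against the response and ranks features by absolute correlation. Each column is scored independently, which is what makes the task parallelizable. The function names `pearson` and `rank_features` are hypothetical, and scoring constant columns as zero is an assumption made here, not a detail stated by the authors.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Assumption: a constant column carries no information, so score it 0.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def rank_features(rows, response):
    """Return feature indices ordered from most to least correlated."""
    columns = list(zip(*rows))  # one tuple per feature
    scores = [abs(pearson(col, response)) for col in columns]
    return sorted(range(len(columns)), key=lambda j: scores[j], reverse=True)


rows = [(1, 5, 2), (2, 5, 1), (3, 5, 4), (4, 5, 3)]
response = [2, 4, 6, 8]
ranking = rank_features(rows, response)
```

In this toy data the first feature is perfectly correlated with the response, the second is constant, and the third is only partly correlated, so the ranking places them in the order first, third, second.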

DEVELOPING PARALLEL COMPUTING ALGORITHMS USING GPU’S TO DETERMINE OIL AND GAS RESERVES PRESENTED IN THE UPSTREAM (EXPLORATION) SECTOR

Proceedings of the International Conference on Emerging Trends in Engineering & Technology (IConETech-2020) ◽

10.47412/mruu5197 ◽

2020 ◽

Author(s):

Stefan Boodoo ◽

Ajay Joshi

Keyword(s):

High Performance ◽

Oil And Gas ◽

GPU Computing ◽

Graphics Processing Unit ◽

Reservoir Rock ◽

Processing Unit ◽

Potential Wells ◽

Central Processing ◽

Rock Formations ◽

Graphics Processing

Oil and gas companies keep exploring every possible new method to increase the likelihood of finding a commercial hydrocarbon-bearing prospect. Well logging generates gigabytes of data from various probes and sensors. After processing, a prospective reservoir will indicate areas of oil, gas, water and reservoir rock. Incorporating High Performance Computing (HPC) methodologies will allow thousands of potential wells to be assessed for their hydrocarbon-bearing potential. This study presents the use of Graphics Processing Unit (GPU) computing as another method of analyzing probable reserves. Raw well log data from the Kansas Geological Society (1999-2018) forms the basis of the data analysis. Parallel algorithms are developed that make use of Nvidia’s Compute Unified Device Architecture (CUDA). The results gathered highlight a 5 times speedup using an Nvidia GeForce GT 330M GPU as compared to an Intel Core i7 740QM Central Processing Unit (CPU). The processed results display depth-wise areas of shale and rock formations as well as water, oil and/or gas reserves.

HARNESSING THE POWER OF IDLE GPUS FOR ACCELERATION OF BIOLOGICAL SEQUENCE ALIGNMENT

Parallel Processing Letters ◽

10.1142/s0129626409000390 ◽

2009 ◽

Vol 19 (04) ◽

pp. 513-533 ◽

Cited By ~ 7

Author(s):

FUMIHIKO INO ◽

YUKI KOTANI ◽

YUMA MUNEKAWA ◽

KENICHI HAGIHARA

Keyword(s):

Sequence Alignment ◽

Graphics Processing Unit ◽

Parallel Implementation ◽

Processing Unit ◽

Compute Unified Device Architecture ◽

Grid System ◽

Biological Sequence ◽

Device Architecture ◽

Linear Speedup ◽

Graphics Processing

This paper presents a parallel system capable of accelerating biological sequence alignment on the graphics processing unit (GPU) grid. The GPU grid in this paper is a desktop grid system that utilizes idle GPUs and CPUs in the office and home. Our parallel implementation employs a master-worker paradigm to accelerate an OpenGL-based algorithm that runs on a single GPU. We integrate this implementation into a screensaver-based grid system that detects idle resources on which the alignment code can run. We also show some experimental results comparing our implementation with three different implementations running on a single GPU, a single CPU, or multiple CPUs. As a result, we find that a single non-dedicated GPU can provide us almost the same throughput as two dedicated CPUs in our laboratory environment, where GPU-equipped machines are ordinarily used to develop GPU applications. In a dedicated environment, the GPU-accelerated code achieves five times higher throughput than the CPU-based code. Furthermore, a linear speedup of 30.7X is observed on a 32-node cluster of dedicated GPUs. We also implement a compute unified device architecture (CUDA) based algorithm to demonstrate further acceleration.

GPU Computing with Python: Performance, Energy Efficiency and Usability

Computation ◽

10.3390/computation8010004 ◽

2020 ◽

Vol 8 (1) ◽

pp. 4 ◽

Cited By ~ 1

Author(s):

Håvard H. Holm ◽

André R. Brodtkorb ◽

Martin L. Sætra

Keyword(s):

Energy Efficiency ◽

High Performance ◽

GPU Computing ◽

Graphics Processing Unit ◽

Processing Unit ◽

Device Architecture ◽

Computational Performance ◽

Graphics Processing ◽

The Impact ◽

Performance Computing

In this work, we examine the performance, energy efficiency, and usability when using Python for developing high-performance computing codes running on the graphics processing unit (GPU). We investigate the portability of performance and energy efficiency between Compute Unified Device Architecture (CUDA) and Open Compute Language (OpenCL); between GPU generations; and between low-end, mid-range, and high-end GPUs. Our findings showed that the impact of using Python is negligible for our applications, and furthermore, CUDA and OpenCL applications tuned to an equivalent level can in many cases obtain the same computational performance. Our experiments showed that performance in general varies more between different GPUs than between using CUDA and OpenCL. We also show that tuning for performance is a good way of tuning for energy efficiency, but that specific tuning is needed to obtain optimal energy efficiency.

HYPERDOCK: Improving virtual screening through parallel hyperheuristics

The International Journal of High Performance Computing Applications ◽

10.1177/1094342019847732 ◽

2019 ◽

Vol 34 (1) ◽

pp. 30-41

Author(s):

Baldomero Imbernón ◽

Antonio Llanes ◽

José-Matías Cutillas-Lozano ◽

Domingo Giménez

Keyword(s):

Virtual Screening ◽

High Performance ◽

Graphics Processing Unit ◽

Processing Unit ◽

Central Processing ◽

Pharmacological Targets ◽

Graphics Processing ◽

Computational Systems ◽

Performance Computing ◽

Different Levels

Virtual screening (VS) methods aid clinical research by predicting the interaction of ligands with pharmacological targets. The computational requirements of VS, along with the size of the databases, favor the use of high-performance computing. METADOCK is a tool for the application of metaheuristics to VS in heterogeneous clusters of computers based on central processing unit (CPU) and graphics processing unit (GPU). HYPERDOCK represents a step forward; the search for satisfactory metaheuristics is systematically approached by means of hyperheuristics working on top of the metaheuristic schema of METADOCK. Multiple metaheuristics are explored, so the process is computationally demanding. HYPERDOCK exploits the parallelism of METADOCK and includes parallelism at its own level. The different levels of parallelism can be used to exploit the parallelism offered by computational systems composed of multicore CPU + multi-GPUs. The efficient exploitation of these systems enables HYPERDOCK to improve ligand-receptor binding.

GPU-accelerated Double-stage Delay-multiply-and-sum Algorithm for Fast Photoacoustic Tomography Using LED Excitation and Linear Arrays

Ultrasonic Imaging ◽

10.1177/0161734619862488 ◽

2019 ◽

Vol 41 (5) ◽

pp. 301-316 ◽

Cited By ~ 9

Author(s):

Seyyed Reza Miri Rostami ◽

Moein Mozaffarzadeh ◽

Mohsen Ghaffari-Miab ◽

Ali Hariri ◽

Jesse Jokerst

Keyword(s):

Image Reconstruction ◽

Parallel Computation ◽

Graphics Processing Unit ◽

Light Emitting Diode ◽

Processing Unit ◽

Light Emitting ◽

Central Processing ◽

Device Architecture ◽

Pixel Image ◽

Graphics Processing

Double-stage delay-multiply-and-sum (DS-DMAS) is an algorithm proposed for photoacoustic image reconstruction. The DS-DMAS algorithm offers a higher contrast than conventional delay-and-sum and delay-multiply-and-sum but at the expense of higher computational complexity. Here, we utilized a compute unified device architecture (CUDA) graphics processing unit (GPU) parallel computation approach to address the high complexity of the DS-DMAS for photoacoustic image reconstruction generated from a commercial light-emitting diode (LED)-based photoacoustic scanner. In comparison with a single-threaded central processing unit (CPU), the GPU approach increased speeds by nearly 140-fold for a 1024 × 1024 pixel image; there was no decrease in accuracy. The proposed implementation makes it possible to reconstruct photoacoustic images with frame rates of 250, 125, and 83.3 when the images are 64 × 64, 128 × 128, and 256 × 256, respectively. Thus, DS-DMAS can be efficiently used in clinical devices when coupled with CUDA GPU parallel computation.

A GPU/CPU Programming Model for CFD Simulation

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.712-715.2538 ◽

2013 ◽

Vol 712-715 ◽

pp. 2538-2541

Author(s):

Cao Wei ◽

Zheng Hua Wang ◽

Chuan Fu Xu

Keyword(s):

High Performance ◽

CFD Simulation ◽

Programming Model ◽

Graphics Processing Unit ◽

Processing Unit ◽

Computational Capability ◽

Parallel Programming Model ◽

Parallel Graphics ◽

Performance Results ◽

Graphics Processing

In recent years, the highly parallel graphics processing unit (GPU) has rapidly matured as a powerful engine for high-performance computing. More and more researchers are trying to port computational fluid dynamics (CFD) simulations to heterogeneous computers. However, most researchers focus on exploring the computational capability of the GPU while ignoring the computational capability of the CPU. In order to utilize the computational capability of both the CPU and the GPU, we propose a hybrid CUDA/OpenMP parallel programming model. We also propose an adaptive load balancing scheme to distribute the workload among CPUs and GPUs. With this programming model, we implement a high-order CFD program on the “Tianhe-1A” supercomputer system. The performance results validate the workload distribution scheme.
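The abstract does not describe the adaptive load balancing scheme itself. As a purely illustrative sketch, one simple way to distribute work among devices of different speeds is to split the next step's workload in proportion to the throughput each device achieved on the previous step. The function `rebalance` and its parameters are hypothetical and should not be read as the authors' method.

```python
def rebalance(total_cells, prev_cells, prev_seconds):
    """Split total_cells among devices in proportion to measured throughput.

    prev_cells   -- cells each device processed in the previous step
    prev_seconds -- wall time each device needed for those cells
    Returns a list of integer cell counts that sums to total_cells.
    """
    # Throughput in cells per second for each device (CPU, GPU, ...).
    rates = [cells / secs for cells, secs in zip(prev_cells, prev_seconds)]
    total_rate = sum(rates)
    # Proportional share, rounded down to whole cells.
    shares = [int(total_cells * rate / total_rate) for rate in rates]
    # Hand any rounding remainder to the fastest device.
    shares[rates.index(max(rates))] += total_cells - sum(shares)
    return shares


# Device 0 finished 500 cells in 1 s, device 1 needed 4 s for the same amount.
new_split = rebalance(1000, [500, 500], [1.0, 4.0])
```

Here the device that was four times faster receives four times as many cells in the next step, so both devices would be expected to finish at roughly the same time.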
