Parallel Shellsort Algorithm for Many-Core GPUs with CUDA

Author(s): Chun-Yuan Lin, Wei Sheng Lee, Chuan Yi Tang

Sorting is a classic algorithmic problem, and its importance has led to the design and implementation of various sorting algorithms on many-core graphics processing units (GPUs). CUDPP radix sort is the most efficient sorting implementation on GPUs, and GPU sample sort is the best comparison-based sort. Although these implementations are efficient, they require either extra space for data rearrangement or atomic operations for acceleration. Sorting applications usually deal with large amounts of data, so memory utilization is an important consideration. Furthermore, on GPUs without atomic operation support, such sorting algorithms can suffer performance degradation or fail to work at all. In this paper, an efficient implementation of a parallel shellsort algorithm, CUDA shellsort, is proposed for many-core GPUs with CUDA. Experimental results show that, on average, CUDA shellsort is nearly twice as fast as GPU quicksort and 37% faster than Thrust mergesort under a uniform distribution. Moreover, its performance matches GPU sample sort for up to 32 million data elements while requiring only constant extra space. CUDA shellsort is also robust across various data distributions and could be suitable for other many-core architectures.
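As a rough illustration of how a shellsort pass maps onto a GPU, the sketch below assigns one CUDA thread to each gap-separated subsequence and insertion-sorts it in place. This is an assumed, simplified decomposition, not necessarily the kernel structure used by CUDA shellsort, but it shows why the algorithm needs no auxiliary buffer.

    // Sketch: one shellsort pass on the GPU. Each thread insertion-sorts the
    // interleaved subsequence that starts at its index and steps by `gap`.
    __global__ void shellsort_gap_pass(int *data, int n, int gap)
    {
        int start = blockIdx.x * blockDim.x + threadIdx.x;
        if (start >= gap) return;                  // one thread per subsequence

        for (int i = start + gap; i < n; i += gap) {
            int key = data[i];
            int j = i - gap;
            while (j >= start && data[j] > key) {  // in-place, no extra buffer
                data[j + gap] = data[j];
                j -= gap;
            }
            data[j + gap] = key;
        }
    }

    // Host side (sketch): launch one pass per gap, largest gap first.
    // for (int gap = n / 2; gap > 0; gap /= 2)
    //     shellsort_gap_pass<<<(gap + 255) / 256, 256>>>(d_data, n, gap);

In practice the small-gap passes would be handled differently (a single thread per subsequence is inefficient once gap becomes small), but the per-pass structure is the same.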

Author(s): Yuji Sato, Mikiko Sato

Purpose – The purpose of this paper is to propose a fault-tolerant technology for increasing the durability of application programs when evolutionary computation is performed by fast parallel processing on many-core processors such as graphics processing units (GPUs) and multi-core processors (MCPs). Design/methodology/approach – For distributed genetic algorithm (GA) models, the paper proposes a method in which an island's ID number is added to the header of the data transferred by that island, for use in fault detection. Findings – The paper shows that the processing time of the proposed idea is practically negligible in applications, that an optimal solution can be obtained even with a single stuck-at fault or a transient fault, and that increasing the number of parallel threads makes the system less susceptible to faults. Originality/value – The study described in this paper is a new approach to increasing the sustainability of application programs using distributed GAs on GPUs and MCPs.
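A minimal sketch of the island-ID idea described above, assuming a hypothetical packet layout and helper names (none of these identifiers come from the paper): the sending island tags each migration packet with its ID, and the receiver checks the tag against the sender it expects from the migration topology.

    // Illustrative migration packet: the sender's island ID acts as a
    // fault-detection tag on the transferred individuals.
    struct MigrationPacket {
        int island_id;            // ID of the sending island
        int generation;           // generation at which the migrants were emitted
        float individuals[64];    // flattened genomes of the migrating individuals
    };

    // A mismatching island ID suggests a stuck-at or transient fault in the
    // thread/core that produced the packet.
    bool packet_is_valid(const MigrationPacket &p, int expected_sender)
    {
        return p.island_id == expected_sender;
    }

    // Receiver-side use (sketch): drop faulty packets and keep evolving with the
    // local population, so a single faulty island does not poison the others.
    // if (!packet_is_valid(pkt, neighbour_id)) { /* discard packet, log fault */ }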


2012, Vol. 8 (1), pp. 159-174
Author(s): Sang-Pil Lee, Deok-Ho Kim, Jae-Young Yi, Won-Woo Ro

2018, Vol. 10 (10), pp. 168781401880471
Author(s): Nenzi Wang, Hsin-Yi Chen, Yu-Wen Chen

The advancement of modern processors with many cores and large caches offers little computational advantage if only serial computing is employed. In this study, several parallel computing approaches, using devices with multiple or many processor cores and graphics processing units, are applied and compared to illustrate potential applications in fluid-film lubrication studies. Two Reynolds equations and an air-bearing optimum design problem are solved using three parallel computing paradigms, OpenMP, Compute Unified Device Architecture (CUDA), and OpenACC, on standalone shared-memory computers. The newly developed many-integrated-core processors also use OpenMP to unlock their computing potential. The results show that OpenACC computing can outperform OpenMP computing for the discretized Reynolds equation with a large gridwork, mainly because of the larger cache sizes available in the tested graphics processing units. The bearing design benefits most when a system with a many-integrated-core processor is used, because such a system can parallelize at the optimization-algorithm level and use its many processor cores effectively. A proper combination of parallel computing devices and programming models can complement efficient numerical methods or optimization algorithms to accelerate many tribological simulations or engineering designs.
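To make the parallelization concrete, here is a hedged sketch of a Jacobi-style relaxation sweep for a discretized Reynolds equation, written so that either an OpenMP or an OpenACC directive can drive it. The coefficient arrays and grid layout are assumptions for illustration, not the paper's actual discretization.

    // One relaxation sweep over an NX x NZ pressure grid. The coefficients
    // aE, aW, aN, aS and source term b are assumed to have been assembled from
    // the film-thickness field beforehand.
    void jacobi_sweep(int NX, int NZ,
                      const double *aE, const double *aW,
                      const double *aN, const double *aS,
                      const double *b, const double *p_old, double *p_new)
    {
        #pragma omp parallel for collapse(2)          // OpenMP version
        // #pragma acc parallel loop collapse(2)      // OpenACC alternative
        for (int i = 1; i < NX - 1; ++i) {
            for (int j = 1; j < NZ - 1; ++j) {
                int k = i * NZ + j;                   // row-major index
                double ap = aE[k] + aW[k] + aN[k] + aS[k];
                p_new[k] = (aE[k] * p_old[k + NZ] + aW[k] * p_old[k - NZ] +
                            aN[k] * p_old[k + 1]  + aS[k] * p_old[k - 1] + b[k]) / ap;
            }
        }
    }

The same loop body serves both programming models; only the directive changes, which is what makes the OpenMP/OpenACC comparison in the study straightforward.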


2014, Vol. 39 (4), pp. 233-248
Author(s): Milosz Ciznicki, Krzysztof Kurowski, Jan Węglarz

Abstract Heterogeneous many-core computing resources are increasingly popular among users due to their improved performance over homogeneous systems. Many developers have realized that heterogeneous systems, e.g. a shared-memory multi-core CPU machine combined with massively parallel Graphics Processing Units (GPUs), can provide significant performance opportunities for a wide range of applications. However, the best overall performance can only be achieved if application tasks are efficiently assigned to the different types of processing units over time, taking into account their specific resource requirements. Additionally, one should note that available heterogeneous resources have been designed as general-purpose units, yet with many built-in features that accelerate specific application operations. In other words, the same algorithm or application functionality can be implemented as a different task for the CPU or the GPU; nevertheless, from the perspective of various evaluation criteria, e.g. total execution time or energy consumption, we may observe completely different results. Therefore, as tasks can be scheduled and managed in many alternative ways on many-core CPUs and GPUs, with a huge impact on overall resource performance, new and improved resource management techniques are needed. In this paper we discuss results achieved during experimental performance studies of selected task scheduling methods in heterogeneous computing systems. Additionally, we present a new architecture for a resource allocation and task scheduling library which provides a generic application programming interface at the operating system level for improving scheduling policies, taking into account the diversity of tasks and the characteristics of heterogeneous computing resources.
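The sketch below illustrates, with assumed data structures and a hypothetical interface, the kind of placement decision such a scheduling layer makes: for each task, pick the device whose expected completion time is earliest given per-device cost estimates and the current queue load. It is a simple heuristic stand-in, not the library's actual API or policy.

    #include <vector>

    struct Device {
        enum Kind { CPU, GPU } kind;
        double queued_time;                 // work already queued on this device (s)
    };

    struct Task {
        double cpu_cost;                    // estimated run time on a CPU core (s)
        double gpu_cost;                    // estimated run time on a GPU (s)
    };

    // Pick the device with the earliest expected completion time for this task.
    Device *schedule(Task &t, std::vector<Device> &devices)
    {
        Device *best = nullptr;
        double best_finish = 1e300;
        for (Device &d : devices) {
            double cost = (d.kind == Device::CPU) ? t.cpu_cost : t.gpu_cost;
            double finish = d.queued_time + cost;
            if (finish < best_finish) { best_finish = finish; best = &d; }
        }
        if (best) best->queued_time = best_finish;   // reserve the slot
        return best;
    }

A real policy would also weigh criteria such as energy consumption, which is exactly why the paper argues for a pluggable, criteria-aware scheduling layer rather than a fixed heuristic.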


Author(s): Amitava Datta, Amardeep Kaur, Tobias Lauer, Sami Chabbouh

Abstract Finding clusters in high-dimensional data is a challenging research problem. Subspace clustering algorithms aim to find clusters in all possible subspaces of the dataset, where a subspace is a subset of the data's dimensions. However, the exponential growth in the number of subspaces with the dimensionality of the data renders most of these algorithms inefficient as well as ineffective. Moreover, these algorithms have data dependencies ingrained in the clustering process, which makes parallelization difficult and inefficient. SUBSCALE is a recent subspace clustering algorithm which scales with the number of dimensions and contains independent processing steps that can be exploited through parallelism. In this paper, we aim to leverage the computational power of widely available multi-core processors to improve the runtime performance of the SUBSCALE algorithm. The experimental evaluation shows linear speedup. Moreover, we develop an approach using graphics processing units (GPUs) for fine-grained data parallelism to accelerate the computation further. First tests of the GPU implementation show very promising results.
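A minimal sketch of the coarse-grained parallelism referred to above, assuming a placeholder per-dimension routine (find_dense_units_1d and DenseUnits are illustrative names, not SUBSCALE's real implementation): because the per-dimension steps are independent, they can be distributed across cores with a single OpenMP loop.

    #include <vector>

    // Placeholder for the per-dimension result and step.
    struct DenseUnits { std::vector<std::vector<int>> point_sets; };

    DenseUnits find_dense_units_1d(const std::vector<std::vector<double>> &data, int dim)
    {
        DenseUnits out;
        // ... per-dimension density computation would go here ...
        (void)data; (void)dim;
        return out;
    }

    std::vector<DenseUnits> process_all_dimensions(
            const std::vector<std::vector<double>> &data, int num_dims)
    {
        std::vector<DenseUnits> results(num_dims);
        #pragma omp parallel for schedule(dynamic)     // per-dimension cost varies
        for (int d = 0; d < num_dims; ++d)
            results[d] = find_dense_units_1d(data, d); // dimensions are independent
        return results;
    }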


Proceedings, 2019, Vol. 21 (1), pp. 44
Author(s): Emmanuel Gobet, José Germán López Salas, Carlos Vázquez

In this work we design a novel and efficient quasi-regression Monte Carlo algorithm to approximate the solution of discrete-time backward stochastic differential equations (BSDEs), and we analyze the convergence of the proposed method. To tackle problems in high dimensions, we propose suitable projections of the solution and efficient parallelizations of the algorithm that take advantage of powerful many-core processors such as graphics processing units (GPUs).
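For orientation, a standard backward dynamic-programming scheme for a discrete-time BSDE is sketched below in LaTeX; the paper's specific quasi-regression estimator and choice of basis functions may differ, so this is only the general setting being approximated.

    % Time grid 0 = t_0 < ... < t_N = T, terminal condition g, driver f:
    \begin{align*}
      Y_{t_N} &= g(X_{t_N}),\\
      Z_{t_i} &= \tfrac{1}{\Delta t_i}\,
                 \mathbb{E}\!\left[\, Y_{t_{i+1}}\,\Delta W_i \;\middle|\; X_{t_i} \right],\\
      Y_{t_i} &= \mathbb{E}\!\left[\, Y_{t_{i+1}}
                 + f\!\left(t_i, X_{t_i}, Y_{t_{i+1}}, Z_{t_i}\right)\Delta t_i
                 \;\middle|\; X_{t_i} \right].
    \end{align*}
    % Each conditional expectation is replaced by a projection (regression) onto a
    % finite basis $\{\psi_k\}$ fitted over M simulated paths; the independent path
    % simulations and per-step least-squares fits are what map naturally onto GPUs.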


Computation, 2020, Vol. 8 (4), pp. 103
Author(s): Stefano Quer, Andrea Calabrese

Many modern applications are modeled using graphs of some kind. Given a graph, reachability, that is, discovering whether there is a path between two given nodes, is a fundamental problem as well as one of the most important steps of many other algorithms. The rapid accumulation of very large graphs (up to tens of millions of vertices and edges) from a diversity of disciplines demands efficient and scalable solutions to the reachability problem. General-purpose computing has been successfully used on Graphics Processing Units (GPUs) to parallelize algorithms that present a high degree of regularity. In this paper, we extend the applicability of GPU processing to graph-based manipulation by re-designing a simple but efficient state-of-the-art graph-labeling method, namely the GRAIL (Graph Reachability Indexing via RAndomized Interval Labeling) algorithm, for many-core CUDA-based GPUs. This algorithm first generates a label for each vertex of the graph and then exploits these labels to answer reachability queries. Unfortunately, the original algorithm executes a sequence of depth-first visits which are intrinsically recursive and cannot be efficiently implemented on parallel systems. For that reason, we design an alternative approach in which a sequence of breadth-first visits substitutes for the original depth-first traversal to generate the labeling, and in which a high number of concurrent visits is exploited during query evaluation. The paper describes our strategy to re-design these steps, the difficulties we encountered in implementing them, and the solutions adopted to overcome the main inefficiencies. To prove the validity of our approach, we compare (in terms of time and memory requirements) our GPU-based approach with the original sequential CPU-based tool. Finally, we report some hints on how to conduct further research in the area.
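As a hedged illustration of why such labels are useful, the snippet below shows the usual GRAIL-style containment test (field names are assumptions): each vertex carries one interval per randomized traversal, and if the target's interval is not contained in the source's interval for some traversal, non-reachability is certain; otherwise a graph visit, which the paper performs with many concurrent GPU visits, must confirm the answer.

    // d = 4 intervals per vertex, one per randomized labeling traversal.
    struct Label { int lo[4]; int hi[4]; };

    // Returns false when the labels alone prove that u cannot reach v.
    bool may_reach(const Label &u, const Label &v, int d = 4)
    {
        for (int i = 0; i < d; ++i)
            if (v.lo[i] < u.lo[i] || v.hi[i] > u.hi[i])
                return false;             // containment violated: definite "no"
        return true;                      // inconclusive: fall back to a visit
    }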


2013, Vol. 479-480, pp. 805-809
Author(s): Chih Sheng Lin, Po Ting Liu, Chih Wei Hsieh, Hsi Ya Chang, Pao Ann Hsiung

Recently, heterogeneous system architectures have become mainstream for achieving high performance and power efficiency. In particular, many-core graphics processing units (GPUs) have started to play an important role in computing on heterogeneous architectures. However, application designers still have to distribute the computational workload among heterogeneous GPUs manually, which remains inefficient. In this work, we propose a mixed-integer nonlinear programming (MINLP)-based method for efficient workload distribution among GPUs that considers the capabilities of the GPUs for various applications. Experimental results demonstrate the performance of our proposed method.
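The abstract formulates workload distribution as a MINLP; as a much simpler, assumed baseline that only illustrates "distribute by capability", the sketch below splits work items in proportion to each GPU's measured throughput.

    #include <vector>

    // Proportional split of n_items among GPUs by measured throughput
    // (assumes at least one GPU). This is a baseline illustration, not the
    // MINLP formulation proposed in the paper.
    std::vector<long> split_workload(long n_items, const std::vector<double> &throughput)
    {
        double total = 0.0;
        for (double t : throughput) total += t;

        std::vector<long> share(throughput.size());
        long assigned = 0;
        for (size_t g = 0; g + 1 < throughput.size(); ++g) {
            share[g] = static_cast<long>(n_items * (throughput[g] / total));
            assigned += share[g];
        }
        share.back() = n_items - assigned;    // last GPU absorbs rounding remainder
        return share;
    }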


2015, Vol. 2015, pp. 1-13
Author(s): Milosz Ciznicki, Michal Kulczewski, Piotr Kopta, Krzysztof Kurowski

The recent advent of novel multi- and many-core architectures forces application programmers to deal with hardware-specific implementation details and to be familiar with software optimisation techniques in order to benefit from new high-performance computing machines. Extra care must be taken with communication-intensive algorithms, which may become a bottleneck in the forthcoming era of exascale computing. This paper presents a high-level stencil framework, implemented for the EULerian or LAGrangian model (EULAG), that efficiently utilises multi- and many-core architectures. Only the efficient use of both many-core processors (CPUs) and graphics processing units (GPUs), together with a flexible data decomposition method, can deliver maximum performance and scale the communication-intensive, preconditioned Generalized Conjugate Residual (GCR) elliptic solver.
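To give a flavour of what such a framework generates, below is a hedged sketch of a 7-point stencil kernel in CUDA of the sort applied inside an iterative elliptic solver; the coefficients, boundary handling, and data layout are simplified assumptions rather than EULAG's actual GCR operator.

    // One thread per (i, j) column, marching along k through a 3-D field.
    __global__ void stencil7(const double *u, double *out,
                             int nx, int ny, int nz, double c0, double c1)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i <= 0 || i >= nx - 1 || j <= 0 || j >= ny - 1) return;

        for (int k = 1; k < nz - 1; ++k) {
            long idx = (long)k * nx * ny + (long)j * nx + i;
            out[idx] = c0 * u[idx]
                     + c1 * (u[idx - 1] + u[idx + 1]               // x neighbours
                           + u[idx - nx] + u[idx + nx]             // y neighbours
                           + u[idx - (long)nx * ny] + u[idx + (long)nx * ny]); // z
        }
    }

In a distributed run, kernels like this operate on each subdomain produced by the data decomposition, with halo exchanges between CPU and GPU partitions supplying the boundary values.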

