Parallel Shellsort Algorithm for Many-Core GPUs with CUDA

Author(s): Chun-Yuan Lin, Wei Sheng Lee, Chuan Yi Tang

Sorting is a classic algorithmic problem, and its importance has led to the design and implementation of various sorting algorithms on many-core graphics processing units (GPUs). CUDPP radix sort is the most efficient sorting implementation on GPUs, and GPU sample sort is the best comparison-based sort. Although these implementations are efficient, they require either extra space for data rearrangement or atomic operations for acceleration. Sorting applications usually deal with large amounts of data, so memory utilization is an important consideration. Furthermore, on GPUs without atomic operation support, such sorting algorithms can suffer performance degradation or fail to work at all. In this paper, an efficient implementation of a parallel shellsort algorithm, CUDA shellsort, is proposed for many-core GPUs with CUDA. Experimental results show that, on average, CUDA shellsort is nearly twice as fast as GPU quicksort and 37% faster than Thrust mergesort under a uniform distribution. Moreover, its performance matches GPU sample sort for up to 32 million data elements while requiring only constant extra space. CUDA shellsort is also robust across various data distributions and could be suitable for other many-core architectures.
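As a rough illustration of how a shellsort pass maps onto a GPU, the sketch below assigns one CUDA thread to each gap-separated subsequence and insertion-sorts it in place. This is an assumed, simplified decomposition, not necessarily the kernel structure used by CUDA shellsort, but it shows why the algorithm needs no auxiliary buffer.

    // Sketch: one shellsort pass on the GPU. Each thread insertion-sorts the
    // interleaved subsequence that starts at its index and steps by `gap`.
    __global__ void shellsort_gap_pass(int *data, int n, int gap)
    {
        int start = blockIdx.x * blockDim.x + threadIdx.x;
        if (start >= gap) return;                  // one thread per subsequence

        for (int i = start + gap; i < n; i += gap) {
            int key = data[i];
            int j = i - gap;
            while (j >= start && data[j] > key) {  // in-place, no extra buffer
                data[j + gap] = data[j];
                j -= gap;
            }
            data[j + gap] = key;
        }
    }

    // Host side (sketch): launch one pass per gap, largest gap first.
    // for (int gap = n / 2; gap > 0; gap /= 2)
    //     shellsort_gap_pass<<<(gap + 255) / 256, 256>>>(d_data, n, gap);

In practice the small-gap passes would be handled differently (a single thread per subsequence is inefficient once gap becomes small), but the per-pass structure is the same.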

Author(s): Yuji Sato, Mikiko Sato

Purpose – The purpose of this paper is to propose a fault-tolerant technology for increasing the durability of application programs when evolutionary computation is performed by fast parallel processing on many-core processors such as graphics processing units (GPUs) and multi-core processors (MCPs). Design/methodology/approach – For distributed genetic algorithm (GA) models, the paper proposes a method in which an island's ID number is added to the header of the data transferred by that island, for use in fault detection. Findings – The paper shows that the processing time of the proposed idea is practically negligible in applications, that an optimal solution can be obtained even with a single stuck-at fault or a transient fault, and that increasing the number of parallel threads makes the system less susceptible to faults. Originality/value – The study described in this paper is a new approach to increasing the sustainability of application programs using distributed GAs on GPUs and MCPs.
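A minimal sketch of the island-ID idea described above, assuming a hypothetical packet layout and helper names (none of these identifiers come from the paper): the sending island tags each migration packet with its ID, and the receiver checks the tag against the sender it expects from the migration topology.

    // Illustrative migration packet: the sender's island ID acts as a
    // fault-detection tag on the transferred individuals.
    struct MigrationPacket {
        int island_id;            // ID of the sending island
        int generation;           // generation at which the migrants were emitted
        float individuals[64];    // flattened genomes of the migrating individuals
    };

    // A mismatching island ID suggests a stuck-at or transient fault in the
    // thread/core that produced the packet.
    bool packet_is_valid(const MigrationPacket &p, int expected_sender)
    {
        return p.island_id == expected_sender;
    }

    // Receiver-side use (sketch): drop faulty packets and keep evolving with the
    // local population, so a single faulty island does not poison the others.
    // if (!packet_is_valid(pkt, neighbour_id)) { /* discard packet, log fault */ }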


2012, Vol. 8 (1), pp. 159-174
Author(s): Sang-Pil Lee, Deok-Ho Kim, Jae-Young Yi, Won-Woo Ro

2018, Vol. 10 (10), pp. 168781401880471
Author(s): Nenzi Wang, Hsin-Yi Chen, Yu-Wen Chen

The advancement of modern processors with many cores and large caches offers little computational advantage if only serial computing is employed. In this study, several parallel computing approaches, using devices with multiple or many processor cores and graphics processing units, are applied and compared to illustrate potential applications in fluid-film lubrication studies. Two Reynolds equations and an air-bearing optimum design problem are solved using three parallel computing paradigms, OpenMP, Compute Unified Device Architecture (CUDA), and OpenACC, on standalone shared-memory computers. The newly developed many-integrated-core processors also use OpenMP to unlock their computing potential. The results show that OpenACC computing can outperform OpenMP computing for the discretized Reynolds equation with a large gridwork, mainly because of the larger cache sizes available in the tested graphics processing units. The bearing design benefits most when a system with a many-integrated-core processor is used, because such a system can parallelize at the optimization-algorithm level and use its many processor cores effectively. A proper combination of parallel computing devices and programming models can complement efficient numerical methods or optimization algorithms to accelerate many tribological simulations or engineering designs.
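To make the parallelization concrete, here is a hedged sketch of a Jacobi-style relaxation sweep for a discretized Reynolds equation, written so that either an OpenMP or an OpenACC directive can drive it. The coefficient arrays and grid layout are assumptions for illustration, not the paper's actual discretization.

    // One relaxation sweep over an NX x NZ pressure grid. The coefficients
    // aE, aW, aN, aS and source term b are assumed to have been assembled from
    // the film-thickness field beforehand.
    void jacobi_sweep(int NX, int NZ,
                      const double *aE, const double *aW,
                      const double *aN, const double *aS,
                      const double *b, const double *p_old, double *p_new)
    {
        #pragma omp parallel for collapse(2)          // OpenMP version
        // #pragma acc parallel loop collapse(2)      // OpenACC alternative
        for (int i = 1; i < NX - 1; ++i) {
            for (int j = 1; j < NZ - 1; ++j) {
                int k = i * NZ + j;                   // row-major index
                double ap = aE[k] + aW[k] + aN[k] + aS[k];
                p_new[k] = (aE[k] * p_old[k + NZ] + aW[k] * p_old[k - NZ] +
                            aN[k] * p_old[k + 1]  + aS[k] * p_old[k - 1] + b[k]) / ap;
            }
        }
    }

The same loop body serves both programming models; only the directive changes, which is what makes the OpenMP/OpenACC comparison in the study straightforward.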


2014, Vol. 39 (4), pp. 233-248
Author(s): Milosz Ciznicki, Krzysztof Kurowski, Jan Węglarz

Abstract Heterogeneous many-core computing resources are increasingly popular among users due to their improved performance over homogeneous systems. Many developers have realized that heterogeneous systems, e.g. a shared-memory multi-core CPU machine combined with massively parallel Graphics Processing Units (GPUs), can provide significant performance opportunities for a wide range of applications. However, the best overall performance can only be achieved if application tasks are efficiently assigned to the different types of processing units over time, taking into account their specific resource requirements. Additionally, one should note that available heterogeneous resources have been designed as general-purpose units, yet with many built-in features that accelerate specific application operations. In other words, the same algorithm or application functionality can be implemented as a different task for the CPU or the GPU; nevertheless, from the perspective of various evaluation criteria, e.g. total execution time or energy consumption, we may observe completely different results. Therefore, as tasks can be scheduled and managed in many alternative ways on many-core CPUs and GPUs, with a huge impact on overall resource performance, new and improved resource management techniques are needed. In this paper we discuss results achieved during experimental performance studies of selected task scheduling methods in heterogeneous computing systems. Additionally, we present a new architecture for a resource allocation and task scheduling library which provides a generic application programming interface at the operating system level for improving scheduling policies, taking into account the diversity of tasks and the characteristics of heterogeneous computing resources.
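The sketch below illustrates, with assumed data structures and a hypothetical interface, the kind of placement decision such a scheduling layer makes: for each task, pick the device whose expected completion time is earliest given per-device cost estimates and the current queue load. It is a simple heuristic stand-in, not the library's actual API or policy.

    #include <vector>

    struct Device {
        enum Kind { CPU, GPU } kind;
        double queued_time;                 // work already queued on this device (s)
    };

    struct Task {
        double cpu_cost;                    // estimated run time on a CPU core (s)
        double gpu_cost;                    // estimated run time on a GPU (s)
    };

    // Pick the device with the earliest expected completion time for this task.
    Device *schedule(Task &t, std::vector<Device> &devices)
    {
        Device *best = nullptr;
        double best_finish = 1e300;
        for (Device &d : devices) {
            double cost = (d.kind == Device::CPU) ? t.cpu_cost : t.gpu_cost;
            double finish = d.queued_time + cost;
            if (finish < best_finish) { best_finish = finish; best = &d; }
        }
        if (best) best->queued_time = best_finish;   // reserve the slot
        return best;
    }

A real policy would also weigh criteria such as energy consumption, which is exactly why the paper argues for a pluggable, criteria-aware scheduling layer rather than a fixed heuristic.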


Author(s): Amitava Datta, Amardeep Kaur, Tobias Lauer, Sami Chabbouh

Abstract Finding clusters in high-dimensional data is a challenging research problem. Subspace clustering algorithms aim to find clusters in all possible subspaces of the dataset, where a subspace is a subset of the data's dimensions. However, the exponential growth in the number of subspaces with the dimensionality of the data renders most of these algorithms inefficient as well as ineffective. Moreover, these algorithms have data dependencies ingrained in the clustering process, which makes parallelization difficult and inefficient. SUBSCALE is a recent subspace clustering algorithm which scales with the number of dimensions and contains independent processing steps that can be exploited through parallelism. In this paper, we aim to leverage the computational power of widely available multi-core processors to improve the runtime performance of the SUBSCALE algorithm. The experimental evaluation shows linear speedup. Moreover, we develop an approach using graphics processing units (GPUs) for fine-grained data parallelism to accelerate the computation further. First tests of the GPU implementation show very promising results.
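A minimal sketch of the coarse-grained parallelism referred to above, assuming a placeholder per-dimension routine (find_dense_units_1d and DenseUnits are illustrative names, not SUBSCALE's real implementation): because the per-dimension steps are independent, they can be distributed across cores with a single OpenMP loop.

    #include <vector>

    // Placeholder for the per-dimension result and step.
    struct DenseUnits { std::vector<std::vector<int>> point_sets; };

    DenseUnits find_dense_units_1d(const std::vector<std::vector<double>> &data, int dim)
    {
        DenseUnits out;
        // ... per-dimension density computation would go here ...
        (void)data; (void)dim;
        return out;
    }

    std::vector<DenseUnits> process_all_dimensions(
            const std::vector<std::vector<double>> &data, int num_dims)
    {
        std::vector<DenseUnits> results(num_dims);
        #pragma omp parallel for schedule(dynamic)     // per-dimension cost varies
        for (int d = 0; d < num_dims; ++d)
            results[d] = find_dense_units_1d(data, d); // dimensions are independent
        return results;
    }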


Proceedings, 2019, Vol. 21 (1), pp. 44
Author(s): Emmanuel Gobet, José Germán López Salas, Carlos Vázquez

In this work we design a novel and efficient quasi-regression Monte Carlo algorithm to approximate the solution of discrete-time backward stochastic differential equations (BSDEs), and we analyze the convergence of the proposed method. To tackle problems in high dimensions, we propose suitable projections of the solution and efficient parallelizations of the algorithm that take advantage of powerful many-core processors such as graphics processing units (GPUs).
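For orientation, a standard backward dynamic-programming scheme for a discrete-time BSDE is sketched below in LaTeX; the paper's specific quasi-regression estimator and choice of basis functions may differ, so this is only the general setting being approximated.

    % Time grid 0 = t_0 < ... < t_N = T, terminal condition g, driver f:
    \begin{align*}
      Y_{t_N} &= g(X_{t_N}),\\
      Z_{t_i} &= \tfrac{1}{\Delta t_i}\,
                 \mathbb{E}\!\left[\, Y_{t_{i+1}}\,\Delta W_i \;\middle|\; X_{t_i} \right],\\
      Y_{t_i} &= \mathbb{E}\!\left[\, Y_{t_{i+1}}
                 + f\!\left(t_i, X_{t_i}, Y_{t_{i+1}}, Z_{t_i}\right)\Delta t_i
                 \;\middle|\; X_{t_i} \right].
    \end{align*}
    % Each conditional expectation is replaced by a projection (regression) onto a
    % finite basis $\{\psi_k\}$ fitted over M simulated paths; the independent path
    % simulations and per-step least-squares fits are what map naturally onto GPUs.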


Computation, 2020, Vol. 8 (4), pp. 103
Author(s): Stefano Quer, Andrea Calabrese

Many modern applications are modeled using graphs of some kind. Given a graph, reachability, that is, discovering whether there is a path between two given nodes, is a fundamental problem as well as one of the most important steps of many other algorithms. The rapid accumulation of very large graphs (up to tens of millions of vertices and edges) from a diversity of disciplines demands efficient and scalable solutions to the reachability problem. General-purpose computing has been successfully used on Graphics Processing Units (GPUs) to parallelize algorithms that present a high degree of regularity. In this paper, we extend the applicability of GPU processing to graph-based manipulation by re-designing a simple but efficient state-of-the-art graph-labeling method, namely the GRAIL (Graph Reachability Indexing via RAndomized Interval Labeling) algorithm, for many-core CUDA-based GPUs. This algorithm first generates a label for each vertex of the graph and then exploits these labels to answer reachability queries. Unfortunately, the original algorithm executes a sequence of depth-first visits which are intrinsically recursive and cannot be efficiently implemented on parallel systems. For that reason, we design an alternative approach in which a sequence of breadth-first visits substitutes for the original depth-first traversal to generate the labeling, and in which a high number of concurrent visits is exploited during query evaluation. The paper describes our strategy to re-design these steps, the difficulties we encountered in implementing them, and the solutions adopted to overcome the main inefficiencies. To prove the validity of our approach, we compare (in terms of time and memory requirements) our GPU-based approach with the original sequential CPU-based tool. Finally, we report some hints on how to conduct further research in the area.
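As a hedged illustration of why such labels are useful, the snippet below shows the usual GRAIL-style containment test (field names are assumptions): each vertex carries one interval per randomized traversal, and if the target's interval is not contained in the source's interval for some traversal, non-reachability is certain; otherwise a graph visit, which the paper performs with many concurrent GPU visits, must confirm the answer.

    // d = 4 intervals per vertex, one per randomized labeling traversal.
    struct Label { int lo[4]; int hi[4]; };

    // Returns false when the labels alone prove that u cannot reach v.
    bool may_reach(const Label &u, const Label &v, int d = 4)
    {
        for (int i = 0; i < d; ++i)
            if (v.lo[i] < u.lo[i] || v.hi[i] > u.hi[i])
                return false;             // containment violated: definite "no"
        return true;                      // inconclusive: fall back to a visit
    }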


2013, Vol. 479-480, pp. 805-809
Author(s): Chih Sheng Lin, Po Ting Liu, Chih Wei Hsieh, Hsi Ya Chang, Pao Ann Hsiung

Recently, heterogeneous system architectures have become mainstream for achieving high performance and power efficiency. In particular, many-core graphics processing units (GPUs) have started to play an important role in computing on heterogeneous architectures. However, application designers still have to distribute the computational workload among heterogeneous GPUs manually, which remains inefficient. In this work, we propose a mixed-integer nonlinear programming (MINLP)-based method for efficient workload distribution among GPUs that considers the capabilities of the GPUs for various applications. Experimental results demonstrate the performance of our proposed method.
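The abstract formulates workload distribution as a MINLP; as a much simpler, assumed baseline that only illustrates "distribute by capability", the sketch below splits work items in proportion to each GPU's measured throughput.

    #include <vector>

    // Proportional split of n_items among GPUs by measured throughput
    // (assumes at least one GPU). This is a baseline illustration, not the
    // MINLP formulation proposed in the paper.
    std::vector<long> split_workload(long n_items, const std::vector<double> &throughput)
    {
        double total = 0.0;
        for (double t : throughput) total += t;

        std::vector<long> share(throughput.size());
        long assigned = 0;
        for (size_t g = 0; g + 1 < throughput.size(); ++g) {
            share[g] = static_cast<long>(n_items * (throughput[g] / total));
            assigned += share[g];
        }
        share.back() = n_items - assigned;    // last GPU absorbs rounding remainder
        return share;
    }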


2015, Vol. 2015, pp. 1-13
Author(s): Milosz Ciznicki, Michal Kulczewski, Piotr Kopta, Krzysztof Kurowski

The recent advent of novel multi- and many-core architectures forces application programmers to deal with hardware-specific implementation details and to be familiar with software optimisation techniques in order to benefit from new high-performance computing machines. Extra care must be taken with communication-intensive algorithms, which may become a bottleneck in the forthcoming era of exascale computing. This paper presents a high-level stencil framework, implemented for the EULerian or LAGrangian model (EULAG), that efficiently utilises multi- and many-core architectures. Only the efficient use of both many-core processors (CPUs) and graphics processing units (GPUs), together with a flexible data decomposition method, can deliver maximum performance and scale the communication-intensive, preconditioned Generalized Conjugate Residual (GCR) elliptic solver.
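To give a flavour of what such a framework generates, below is a hedged sketch of a 7-point stencil kernel in CUDA of the sort applied inside an iterative elliptic solver; the coefficients, boundary handling, and data layout are simplified assumptions rather than EULAG's actual GCR operator.

    // One thread per (i, j) column, marching along k through a 3-D field.
    __global__ void stencil7(const double *u, double *out,
                             int nx, int ny, int nz, double c0, double c1)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i <= 0 || i >= nx - 1 || j <= 0 || j >= ny - 1) return;

        for (int k = 1; k < nz - 1; ++k) {
            long idx = (long)k * nx * ny + (long)j * nx + i;
            out[idx] = c0 * u[idx]
                     + c1 * (u[idx - 1] + u[idx + 1]               // x neighbours
                           + u[idx - nx] + u[idx + nx]             // y neighbours
                           + u[idx - (long)nx * ny] + u[idx + (long)nx * ny]); // z
        }
    }

In a distributed run, kernels like this operate on each subdomain produced by the data decomposition, with halo exchanges between CPU and GPU partitions supplying the boundary values.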

