Faster optimal parallel prefix sums and list ranking

We prove that prefix sums of n integers of at most b bits can be found on a COMMON CRCW PRAM in [Formula: see text] time with a linear time-processor product. The algorithm is optimally fast, for any polynomial number of processors. In particular, if [Formula: see text] the time taken is [Formula: see text]. This is a generalisation of previous result. The previous [Formula: see text] time algorithm was valid only for O(log n)-bit numbers. Application of this algorithm to r-way parallel merge sort algorithm is also considered. We also consider a more realistic PRAM variant, in which the word size, m, may be smaller than b (m≥log n). On this model, prefix sums can be found in [Formula: see text] optimal time.

Download Full-text

Energy Characterization and Optimization of Parallel Prefix-Sums Kernels

Euro-Par 2015: Parallel Processing Workshops - Lecture Notes in Computer Science ◽

10.1007/978-3-319-27308-2_55 ◽

2015 ◽

pp. 685-696 ◽

Cited By ~ 1

Author(s):

Angelos Papatriantafyllou

Keyword(s):

Parallel Prefix ◽

Prefix Sums

Download Full-text

Effectiveness of GPU Realizations of Parallel Prefix-Sums Computation Algorithms

2018 IEEE 13th International Scientific and Technical Conference on Computer Sciences and Information Technologies (CSIT) ◽

10.1109/stc-csit.2018.8526674 ◽

2018 ◽

Author(s):

Kamil Stokfiszewski ◽

Dariusz Puchala ◽

Mykhaylo Yatsymirskyy

Keyword(s):

Parallel Prefix ◽

Computation Algorithms ◽

Prefix Sums

Download Full-text

Optimal deterministic approximate parallel prefix sums and their applications

Proceedings Third Israel Symposium on the Theory of Computing and Systems ◽

10.1109/istcs.1995.377028 ◽

2002 ◽

Cited By ~ 7

Author(s):

T. Goldberg ◽

U. Zwick

Keyword(s):

Parallel Prefix ◽

Prefix Sums

Download Full-text

Parallel Prefix Algorithms On Linked Lists

Parallel Computing Using the Prefix Problem ◽

10.1093/oso/9780195088496.003.0010 ◽

1994 ◽

Author(s):

S. Lakshmivarahan ◽

Sudarshan K. Dhall

Keyword(s):

Input Data ◽

Ranking Problem ◽

List Ranking ◽

Labeled Graph ◽

Linked List ◽

Parallel Prefix ◽

Types Of Information ◽

Linked Lists

This Chapter describes algorithms for computing prefixes/suffixes in parallel when the input data is in the form of a linked list. Developments in this Chapter complement those in Chapter 3. We begin by defining a version of the prefix problem called the list ranking problem. Let < N > = {1,2, • • • , N} and L be a list of size N. For each i ∈ < N >, the node i in L contains two types of information: the value v(i) of node i, and the successor s(i) of node i. Clearly, s(N) = 0. A linked list may conveniently be represented as a directed, labeled graph G(V, E), where V = <N > and . . . E = { (i, j ) | j = s(i), i, j ∈ V }, and v (i) denotes the value for node i.

Download Full-text

OPTIMIZATION OF LINKED LIST PREFIX COMPUTATIONS ON MULTITHREADED GPUS USING CUDA

Parallel Processing Letters ◽

10.1142/s0129626412500120 ◽

2012 ◽

Vol 22 (04) ◽

pp. 1250012 ◽

Cited By ~ 4

Author(s):

ZHENG WEI ◽

JOSEPH JAJA

Keyword(s):

Optimization Techniques ◽

List Ranking ◽

Data Parallel ◽

Fine Grain ◽

Processing Cost ◽

High Bandwidth ◽

Memory Accesses ◽

Prefix Sums ◽

Randomization Process ◽

Linked Lists

We present a number of optimization techniques to compute prefix sums on linked lists and implement them on the multithreaded GPUs Tesla C1060, Tesla C2050, and GTX480 using CUDA. Prefix computations on linked structures involve in general highly irregular fine grain memory accesses that are typical of many computations on linked lists, trees, and graphs. While the current generation of GPUs provides substantial computational power and extremely high bandwidth memory accesses, they may appear at first to be primarily geared toward streamed, highly data parallel computations. In this paper, we introduce an optimized multithreaded GPU algorithm for prefix computations through a randomization process that reduces the problem to a large number of fine-grain computations. We map these fine-grain computations onto multithreaded GPUs in such a way that the processing cost per element is shown to be close to the best possible. Our experimental results show scalability for list sizes ranging from 1M nodes to 256M nodes, and significantly improve on the recently published parallel implementations of list ranking, including implementations on the Cell Processor, the MTA-8, and the NVIDIA GT200 and Fermi series. They also compare favorably to the performance of the best known CUDA algorithm for the scan operation on the Tesla C1060 and GTX480.

Download Full-text