Experiences of the GPU Thread Configuration and Shared Memory

2018 ◽  
Vol 3 (7) ◽  
pp. 12
Author(s):  
DaeHwan Kim

Nowadays, GPU processors are widely used for general-purpose parallel computation. In GPU programming, the thread and block configuration is one of the most important decisions, since it increases parallelism and hides instruction latency. In many cases, however, there is not enough parallelism to hide all latencies, and the highest latencies are usually caused by global memory accesses. To reduce the number of such accesses, shared memory, which resides on chip and is much faster than global memory, is used instead. The performance of the proposed thread configuration is evaluated on the GPU 960 processor. The experimental results show that the best configuration improves performance by a factor of 7.3 over the worst configuration in the experiment. Experiences with shared memory performance, compared to that of global memory, are also discussed.
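As a minimal, hedged illustration of the two levers the abstract discusses, the launch-time thread/block configuration and the staging of global-memory data in on-chip shared memory, here is a generic CUDA sketch (not the paper's code; the kernel name, tile size, and stencil radius are assumptions chosen for the example):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE   256   // threads per block: the launch configuration being tuned
#define RADIUS 3     // illustrative stencil radius

// Each block stages TILE + 2*RADIUS input elements into on-chip shared memory
// once, so every thread reads its neighbours from shared memory instead of
// issuing (2*RADIUS + 1) separate global-memory loads.
__global__ void stencil1d(const float* in, float* out, int n) {
    __shared__ float tile[TILE + 2 * RADIUS];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int lid = threadIdx.x + RADIUS;                   // index inside the tile

    // Stage the centre element (zero-padded past the array bounds).
    tile[lid] = (gid < n) ? in[gid] : 0.0f;

    // The first RADIUS threads also stage the halo on both sides of the tile.
    if (threadIdx.x < RADIUS) {
        int left  = gid - RADIUS;
        int right = gid + TILE;
        tile[lid - RADIUS] = (left >= 0) ? in[left]  : 0.0f;
        tile[lid + TILE]   = (right < n) ? in[right] : 0.0f;
    }
    __syncthreads();  // all shared-memory loads must finish before any reads

    if (gid < n) {
        float acc = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            acc += tile[lid + k];                     // shared-memory reads only
        out[gid] = acc / (2 * RADIUS + 1);
    }
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    // The thread/block configuration under study: TILE threads per block.
    int blocks = (n + TILE - 1) / TILE;
    stencil1d<<<blocks, TILE>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("out[0] = %f\n", out[0]);
    cudaFree(in); cudaFree(out);
    return 0;
}
```

Varying TILE (e.g. 64, 128, 256, 512) while keeping the total work fixed is the kind of configuration sweep whose performance spread the abstract reports.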

1997 ◽  
Vol 32 (7) ◽  
pp. 240-251 ◽  
Author(s):  
Zhichen Xu ◽  
James R. Larus ◽  
Barton P. Miller

1992 ◽  
Vol 6 (1) ◽  
pp. 98-111 ◽  
Author(s):  
S. K. Kim ◽  
A. T. Chronopoulos

Main memory accesses in shared-memory systems and global communications (synchronizations) in message-passing systems decrease the computation speed. In this paper, the standard Arnoldi algorithm for approximating a small number of eigenvalues with largest (or smallest) real parts of nonsymmetric large sparse matrices is restructured so that only one synchronization point per iteration is required: one global communication on a message-passing distributed-memory machine, or one global memory sweep on a shared-memory machine. We also introduce an s-step Arnoldi method for finding a few eigenvalues of nonsymmetric large sparse matrices. This method generates reduction matrices that are similar to those generated by the standard method; one iteration of the s-step Arnoldi algorithm corresponds to s iterations of the standard Arnoldi algorithm. The s-step method has improved data locality, minimized global communication, and superior parallel properties. These algorithms are implemented on a 64-node NCUBE/7 hypercube and a CRAY-2, and performance results are presented.
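For context, the relation that both the restructured and the s-step variants preserve can be stated as follows (generic notation, not taken from the paper):

```latex
% After k steps, Arnoldi produces an orthonormal basis Q_k = [q_1, ..., q_k]
% of the Krylov subspace K_k(A, q_1) and an upper Hessenberg matrix H_k, the
% "reduction matrix" whose eigenvalues (Ritz values) approximate extremal
% eigenvalues of A:
\[
  A Q_k \;=\; Q_k H_k \;+\; h_{k+1,k}\, q_{k+1}\, e_k^{\mathsf{T}},
  \qquad Q_k^{\mathsf{T}} Q_k = I_k .
\]
% In an s-step variant, s Krylov vectors q, Aq, ..., A^{s-1}q are generated
% before being orthogonalized as a block, which is what reduces the number of
% synchronization points (inner-product reductions) per s standard iterations.
```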


2011 ◽  
Vol 10 (4) ◽  
pp. 295-306 ◽  
Author(s):  
Justin C. Park ◽  
Sung Ho Park ◽  
Jin Sung Kim ◽  
Youngyih Han ◽  
Min Kook Cho ◽  
...  

2014 ◽  
Vol 596 ◽  
pp. 276-279
Author(s):  
Xiao Hui Pan

Graph component labeling, which is a subset of the general graph coloring problem, is a computationally expensive operation in many important applications and simulations. A number of data-parallel algorithmic variations on the component labeling problem are possible, and we explore their use on general-purpose graphics processing units (GPGPUs) with the CUDA programming model. We discuss implementation issues and performance results on CPUs and GPUs using CUDA, and evaluate our system on real-world graphs. We show how to take the different architectural features of the GPU and the host CPU into account to achieve high performance.
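One common data-parallel formulation of component labeling is iterative label propagation over the edge list. The CUDA sketch below is a generic illustration of that idea on a toy graph with assumed names; it is not the implementation evaluated in the paper:

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One propagation step: every edge (u, v) pulls the smaller label across.
// `changed` is set whenever any label shrinks (benign race: all writers store 1),
// so the host knows another pass is needed.
__global__ void propagate(const int* src, const int* dst, int numEdges,
                          int* label, int* changed) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= numEdges) return;
    int u = src[e], v = dst[e];
    int lu = label[u], lv = label[v];
    if      (lu < lv) { atomicMin(&label[v], lu); *changed = 1; }
    else if (lv < lu) { atomicMin(&label[u], lv); *changed = 1; }
}

int main() {
    // Tiny example graph: two components, {0,1,2} and {3,4}.
    std::vector<int> hSrc = {0, 1, 3};
    std::vector<int> hDst = {1, 2, 4};
    const int numVertices = 5, numEdges = (int)hSrc.size();

    int *src, *dst, *label, *changed;
    cudaMallocManaged(&src, numEdges * sizeof(int));
    cudaMallocManaged(&dst, numEdges * sizeof(int));
    cudaMallocManaged(&label, numVertices * sizeof(int));
    cudaMallocManaged(&changed, sizeof(int));
    for (int e = 0; e < numEdges; ++e) { src[e] = hSrc[e]; dst[e] = hDst[e]; }
    for (int v = 0; v < numVertices; ++v) label[v] = v;  // each vertex starts alone

    int threads = 256, blocks = (numEdges + threads - 1) / threads;
    do {
        *changed = 0;
        propagate<<<blocks, threads>>>(src, dst, numEdges, label, changed);
        cudaDeviceSynchronize();
    } while (*changed);  // iterate until labels stop shrinking

    for (int v = 0; v < numVertices; ++v)
        printf("vertex %d -> component %d\n", v, label[v]);
    cudaFree(src); cudaFree(dst); cudaFree(label); cudaFree(changed);
    return 0;
}
```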


2009 ◽  
Vol 20 (01) ◽  
pp. 167-183 ◽  
Author(s):  
WOLFGANG BEIN ◽  
LAWRENCE L. LARMORE ◽  
RÜDIGER REISCHUK

Multiprocessor systems with a global shared memory provide logically uniform data access. To hide latencies when accessing global memory, each processor makes use of a private cache. Several copies of a data item may therefore exist concurrently in the system. To guarantee consistency when updating an item, a processor must invalidate copies of that item in the other private caches. To exclude the effect of classical paging faults, one assumes that each processor knows its own data access sequence but does not know the sequence of future invalidations requested by other processors. The performance of a processor under this restriction can be measured against the optimal behavior of a theoretical omniscient processor, using competitive analysis. We present a [Formula: see text]-competitive randomized online algorithm for this problem for a cache size of 2, and prove a matching lower bound on the competitiveness. The algorithm is derived with the help of a new concept we call knowledge states. Finally, we show a lower bound of [Formula: see text] on the competitiveness for larger cache sizes.
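For readers less familiar with the framework, the standard definition of c-competitiveness used in results of this kind is the following (generic notation, not the paper's):

```latex
% A randomized online algorithm ALG is c-competitive if there is a constant b
% such that, for every request sequence \sigma (here: the interleaving of the
% processor's own accesses and the invalidations issued by other processors),
\[
  \mathbb{E}\!\left[\mathrm{cost}_{\mathrm{ALG}}(\sigma)\right]
  \;\le\; c \cdot \mathrm{cost}_{\mathrm{OPT}}(\sigma) \;+\; b
  \quad\text{for all } \sigma,
\]
% where OPT is the optimal offline (omniscient) strategy and the expectation is
% over the algorithm's random choices.
```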


Author(s):  
Parosh Aziz Abdulla ◽  
Mohamed Faouzi Atig ◽  
Adwait Godbole ◽  
S. Krishna ◽  
Viktor Vafeiadis

We consider the reachability problem for finite-state multi-threaded programs under the promising semantics (PS 2.0) of Lee et al., which captures most common program transformations. Since reachability is already known to be undecidable in the fragment of PS 2.0 with only release-acquire accesses (PS 2.0-ra), we consider the fragment with only relaxed accesses and promises (PS 2.0-rlx). We show that reachability under PS 2.0-rlx is undecidable in general and that it becomes decidable, albeit non-primitive recursive, if we bound the number of promises. Given these results, we consider a bounded version of the reachability problem. To this end, we bound both the number of promises and the number of "view-switches", i.e., the number of times the processes may switch their local views of the global memory. We provide a code-to-code translation from an input program under PS 2.0 (with relaxed and release-acquire memory accesses along with promises) to a program under SC, thereby reducing the bounded reachability problem under PS 2.0 to the bounded context-switching problem under SC. We have implemented a tool and tested it on a set of benchmarks, demonstrating that typical bugs in programs can be found with a small bound.
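To make "relaxed accesses" concrete, the classic load-buffering litmus test below, written here with C++ relaxed atomics, shows the kind of weak behaviour that promises are meant to justify. This is an illustrative sketch, not code from the paper or its tool:

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// Load-buffering (LB) litmus test. With memory_order_relaxed, the outcome
// r1 == 1 && r2 == 1 is permitted by the C/C++ memory model: each thread's
// store may become visible before its own load completes. The promising
// semantics explains such outcomes by letting a thread "promise" its future
// store, while still ruling out out-of-thin-air values.
std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void thread1() {
    r1 = x.load(std::memory_order_relaxed);
    y.store(1, std::memory_order_relaxed);
}

void thread2() {
    r2 = y.load(std::memory_order_relaxed);
    x.store(1, std::memory_order_relaxed);
}

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join(); t2.join();
    // Depending on the hardware and compiler you may never observe
    // r1 == r2 == 1 in practice, but the formal model allows it, so a
    // verifier for these semantics must account for it.
    printf("r1 = %d, r2 = %d\n", r1, r2);
    return 0;
}
```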

