Experiences of the GPU Thread Configuration and Shared Memory

2018 ◽  
Vol 3 (7) ◽  
pp. 12
Author(s):  
DaeHwan Kim

Nowadays, GPU processors are widely used for general-purpose parallel computation. In GPU programming, the thread and block configuration is one of the most important decisions, since it increases parallelism and hides instruction latency. In many cases, however, there is not enough parallelism to hide all latencies, and the highest latencies are usually caused by global memory accesses. To reduce the number of such accesses, shared memory, which resides on chip and is much faster than global memory, is used instead. The performance of the proposed thread configuration is evaluated on the GPU 960 processor. The experimental results show that the best configuration improves performance by a factor of 7.3 over the worst configuration in the experiment. Experiences with shared memory performance, compared to that of global memory, are also discussed.
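As a minimal, hedged illustration of the two levers the abstract discusses, the launch-time thread/block configuration and the staging of global-memory data in on-chip shared memory, here is a generic CUDA sketch (not the paper's code; the kernel name, tile size, and stencil radius are assumptions chosen for the example):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define TILE   256   // threads per block: the launch configuration being tuned
#define RADIUS 3     // illustrative stencil radius

// Each block stages TILE + 2*RADIUS input elements into on-chip shared memory
// once, so every thread reads its neighbours from shared memory instead of
// issuing (2*RADIUS + 1) separate global-memory loads.
__global__ void stencil1d(const float* in, float* out, int n) {
    __shared__ float tile[TILE + 2 * RADIUS];

    int gid = blockIdx.x * blockDim.x + threadIdx.x;  // global index
    int lid = threadIdx.x + RADIUS;                   // index inside the tile

    // Stage the centre element (zero-padded past the array bounds).
    tile[lid] = (gid < n) ? in[gid] : 0.0f;

    // The first RADIUS threads also stage the halo on both sides of the tile.
    if (threadIdx.x < RADIUS) {
        int left  = gid - RADIUS;
        int right = gid + TILE;
        tile[lid - RADIUS] = (left >= 0) ? in[left]  : 0.0f;
        tile[lid + TILE]   = (right < n) ? in[right] : 0.0f;
    }
    __syncthreads();  // all shared-memory loads must finish before any reads

    if (gid < n) {
        float acc = 0.0f;
        for (int k = -RADIUS; k <= RADIUS; ++k)
            acc += tile[lid + k];                     // shared-memory reads only
        out[gid] = acc / (2 * RADIUS + 1);
    }
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMallocManaged(&in,  n * sizeof(float));
    cudaMallocManaged(&out, n * sizeof(float));
    for (int i = 0; i < n; ++i) in[i] = 1.0f;

    // The thread/block configuration under study: TILE threads per block.
    int blocks = (n + TILE - 1) / TILE;
    stencil1d<<<blocks, TILE>>>(in, out, n);
    cudaDeviceSynchronize();

    printf("out[0] = %f\n", out[0]);
    cudaFree(in); cudaFree(out);
    return 0;
}
```

Varying TILE (e.g. 64, 128, 256, 512) while keeping the total work fixed is the kind of configuration sweep whose performance spread the abstract reports.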

1997 ◽  
Vol 32 (7) ◽  
pp. 240-251 ◽  
Author(s):  
Zhichen Xu ◽  
James R. Larus ◽  
Barton P. Miller

1992 ◽  
Vol 6 (1) ◽  
pp. 98-111 ◽  
Author(s):  
S. K. Kim ◽  
A. T. Chronopoulos

Main memory accesses in shared-memory systems and global communications (synchronizations) in message-passing systems decrease the computation speed. In this paper, the standard Arnoldi algorithm for approximating a small number of eigenvalues with largest (or smallest) real parts of nonsymmetric large sparse matrices is restructured so that only one synchronization point per iteration is required: one global communication on a message-passing distributed-memory machine, or one global memory sweep on a shared-memory machine. We also introduce an s-step Arnoldi method for finding a few eigenvalues of nonsymmetric large sparse matrices. This method generates reduction matrices that are similar to those generated by the standard method; one iteration of the s-step Arnoldi algorithm corresponds to s iterations of the standard Arnoldi algorithm. The s-step method has improved data locality, minimized global communication, and superior parallel properties. These algorithms are implemented on a 64-node NCUBE/7 hypercube and a CRAY-2, and performance results are presented.
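For context, the relation that both the restructured and the s-step variants preserve can be stated as follows (generic notation, not taken from the paper):

```latex
% After k steps, Arnoldi produces an orthonormal basis Q_k = [q_1, ..., q_k]
% of the Krylov subspace K_k(A, q_1) and an upper Hessenberg matrix H_k, the
% "reduction matrix" whose eigenvalues (Ritz values) approximate extremal
% eigenvalues of A:
\[
  A Q_k \;=\; Q_k H_k \;+\; h_{k+1,k}\, q_{k+1}\, e_k^{\mathsf{T}},
  \qquad Q_k^{\mathsf{T}} Q_k = I_k .
\]
% In an s-step variant, s Krylov vectors q, Aq, ..., A^{s-1}q are generated
% before being orthogonalized as a block, which is what reduces the number of
% synchronization points (inner-product reductions) per s standard iterations.
```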


2011 ◽  
Vol 10 (4) ◽  
pp. 295-306 ◽  
Author(s):  
Justin C. Park ◽  
Sung Ho Park ◽  
Jin Sung Kim ◽  
Youngyih Han ◽  
Min Kook Cho ◽  
...  

2014 ◽  
Vol 596 ◽  
pp. 276-279
Author(s):  
Xiao Hui Pan

Graph component labeling, which is a subset of the general graph coloring problem, is a computationally expensive operation in many important applications and simulations. A number of data-parallel algorithmic variations on the component labeling problem are possible, and we explore their use on general-purpose graphics processing units (GPGPUs) with the CUDA programming model. We discuss implementation issues and performance results on CPUs and GPUs using CUDA, and evaluate our system on real-world graphs. We show how to take the different architectural features of the GPU and the host CPU into account to achieve high performance.
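One common data-parallel formulation of component labeling is iterative label propagation over the edge list. The CUDA sketch below is a generic illustration of that idea on a toy graph with assumed names; it is not the implementation evaluated in the paper:

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One propagation step: every edge (u, v) pulls the smaller label across.
// `changed` is set whenever any label shrinks (benign race: all writers store 1),
// so the host knows another pass is needed.
__global__ void propagate(const int* src, const int* dst, int numEdges,
                          int* label, int* changed) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= numEdges) return;
    int u = src[e], v = dst[e];
    int lu = label[u], lv = label[v];
    if      (lu < lv) { atomicMin(&label[v], lu); *changed = 1; }
    else if (lv < lu) { atomicMin(&label[u], lv); *changed = 1; }
}

int main() {
    // Tiny example graph: two components, {0,1,2} and {3,4}.
    std::vector<int> hSrc = {0, 1, 3};
    std::vector<int> hDst = {1, 2, 4};
    const int numVertices = 5, numEdges = (int)hSrc.size();

    int *src, *dst, *label, *changed;
    cudaMallocManaged(&src, numEdges * sizeof(int));
    cudaMallocManaged(&dst, numEdges * sizeof(int));
    cudaMallocManaged(&label, numVertices * sizeof(int));
    cudaMallocManaged(&changed, sizeof(int));
    for (int e = 0; e < numEdges; ++e) { src[e] = hSrc[e]; dst[e] = hDst[e]; }
    for (int v = 0; v < numVertices; ++v) label[v] = v;  // each vertex starts alone

    int threads = 256, blocks = (numEdges + threads - 1) / threads;
    do {
        *changed = 0;
        propagate<<<blocks, threads>>>(src, dst, numEdges, label, changed);
        cudaDeviceSynchronize();
    } while (*changed);  // iterate until labels stop shrinking

    for (int v = 0; v < numVertices; ++v)
        printf("vertex %d -> component %d\n", v, label[v]);
    cudaFree(src); cudaFree(dst); cudaFree(label); cudaFree(changed);
    return 0;
}
```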


2009 ◽  
Vol 20 (01) ◽  
pp. 167-183 ◽  
Author(s):  
WOLFGANG BEIN ◽  
LAWRENCE L. LARMORE ◽  
RÜDIGER REISCHUK

Multiprocessor systems with a global shared memory provide logically uniform data access. To hide latencies when accessing global memory, each processor makes use of a private cache. Several copies of a data item may therefore exist concurrently in the system. To guarantee consistency when updating an item, a processor must invalidate copies of that item in the other private caches. To exclude the effect of classical paging faults, one assumes that each processor knows its own data access sequence but does not know the sequence of future invalidations requested by other processors. The performance of a processor under this restriction can be measured against the optimal behavior of a theoretical omniscient processor, using competitive analysis. We present a [Formula: see text]-competitive randomized online algorithm for this problem for a cache size of 2, and prove a matching lower bound on the competitiveness. The algorithm is derived with the help of a new concept we call knowledge states. Finally, we show a lower bound of [Formula: see text] on the competitiveness for larger cache sizes.
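For readers less familiar with the framework, the standard definition of c-competitiveness used in results of this kind is the following (generic notation, not the paper's):

```latex
% A randomized online algorithm ALG is c-competitive if there is a constant b
% such that, for every request sequence \sigma (here: the interleaving of the
% processor's own accesses and the invalidations issued by other processors),
\[
  \mathbb{E}\!\left[\mathrm{cost}_{\mathrm{ALG}}(\sigma)\right]
  \;\le\; c \cdot \mathrm{cost}_{\mathrm{OPT}}(\sigma) \;+\; b
  \quad\text{for all } \sigma,
\]
% where OPT is the optimal offline (omniscient) strategy and the expectation is
% over the algorithm's random choices.
```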


Author(s):  
Parosh Aziz Abdulla ◽  
Mohamed Faouzi Atig ◽  
Adwait Godbole ◽  
S. Krishna ◽  
Viktor Vafeiadis

We consider the reachability problem for finite-state multi-threaded programs under the promising semantics (PS 2.0) of Lee et al., which captures most common program transformations. Since reachability is already known to be undecidable in the fragment of PS 2.0 with only release-acquire accesses (PS 2.0-ra), we consider the fragment with only relaxed accesses and promises (PS 2.0-rlx). We show that reachability under PS 2.0-rlx is undecidable in general and that it becomes decidable, albeit non-primitive recursive, if we bound the number of promises. Given these results, we consider a bounded version of the reachability problem. To this end, we bound both the number of promises and the number of "view-switches", i.e., the number of times the processes may switch their local views of the global memory. We provide a code-to-code translation from an input program under PS 2.0 (with relaxed and release-acquire memory accesses along with promises) to a program under SC, thereby reducing the bounded reachability problem under PS 2.0 to the bounded context-switching problem under SC. We have implemented a tool and tested it on a set of benchmarks, demonstrating that typical bugs in programs can be found with a small bound.
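To make "relaxed accesses" concrete, the classic load-buffering litmus test below, written here with C++ relaxed atomics, shows the kind of weak behaviour that promises are meant to justify. This is an illustrative sketch, not code from the paper or its tool:

```cpp
#include <atomic>
#include <cstdio>
#include <thread>

// Load-buffering (LB) litmus test. With memory_order_relaxed, the outcome
// r1 == 1 && r2 == 1 is permitted by the C/C++ memory model: each thread's
// store may become visible before its own load completes. The promising
// semantics explains such outcomes by letting a thread "promise" its future
// store, while still ruling out out-of-thin-air values.
std::atomic<int> x{0}, y{0};
int r1 = 0, r2 = 0;

void thread1() {
    r1 = x.load(std::memory_order_relaxed);
    y.store(1, std::memory_order_relaxed);
}

void thread2() {
    r2 = y.load(std::memory_order_relaxed);
    x.store(1, std::memory_order_relaxed);
}

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join(); t2.join();
    // Depending on the hardware and compiler you may never observe
    // r1 == r2 == 1 in practice, but the formal model allows it, so a
    // verifier for these semantics must account for it.
    printf("r1 = %d, r2 = %d\n", r1, r2);
    return 0;
}
```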

