cuda kernel
Recently Published Documents


TOTAL DOCUMENTS

12
(FIVE YEARS 4)

H-INDEX

4
(FIVE YEARS 1)

2020 ◽  
Vol 21 (2) ◽  
pp. 373-380
Author(s):  
Jung Ah Yang ◽  
Taejung Park

2019 ◽  
Author(s):  
Mahmoud Khairy ◽  
Mengchi Zhang ◽  
Roland Green ◽  
Simon David Hammond ◽  
Robert J. Hoekstra ◽  
...  

2016 ◽  
Vol 27 (08) ◽  
pp. 981-1003 ◽  
Author(s):  
Toru Fujita ◽  
Koji Nakano ◽  
Yasuaki Ito

Conway’s Game of Life is the best-known cellular automaton. The universe of the Game of Life is a 2-dimensional array of cells, each of which is in one of two states, alive or dead. The state of every cell is repeatedly updated according to the states of its eight neighbors: a cell will be alive if exactly three neighbors are alive, or if it is alive and exactly two neighbors are alive. The main contribution of this paper is to develop several acceleration techniques for simulating the Game of Life on a GPU: (1) the states of 32/64 cells are stored in 32/64-bit words (integers) and the next states are computed by the Bitwise Parallel Bulk Computation (BPBC) technique, (2) the states of cells stored in 2 words are updated at the same time by a thread, (3) the warp shuffle instruction is used to directly transfer the current states stored in registers, and (4) multi-step simulation is performed to reduce the overhead of data transfer and of invoking the CUDA kernel. The experimental results show that the performance of our GPU implementation on a GeForce GTX TITAN X is 1350 × 10^9 updates per second for a 16K-step simulation of 512K × 512K cells stored on the SSD. Since an Intel Core i7 CPU using the same technique performs 13.4 × 10^9 updates per second, our GPU implementation of the Game of Life achieves a speedup factor of about 100. Thus, these techniques work very efficiently on a GPU.
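
The bit-packed update in technique (1) can be illustrated on a CPU. The following is a minimal Python sketch, not the authors' GPU code: it assumes each grid row is packed into one integer, treats cells outside the grid as dead, and counts neighbors with a bit-sliced 3-bit counter so that every cell in a row is updated by the same few bitwise operations. The function name `life_step` and its parameters are hypothetical.

```python
def life_step(rows, width):
    """Advance a Game of Life grid by one generation.

    rows  : list of ints, one per grid row; bit i of a row is cell i (1 = alive).
    width : number of cells per row. Cells outside the grid count as dead.
    """
    mask = (1 << width) - 1
    padded = [0] + list(rows) + [0]  # dead rows above and below the grid
    result = []
    for y in range(1, len(padded) - 1):
        up, mid, down = padded[y - 1], padded[y], padded[y + 1]
        # Eight bitboards: bit i of each one is a distinct neighbor of cell i.
        neighbors = [
            (up << 1) & mask, up, up >> 1,
            (mid << 1) & mask, mid >> 1,
            (down << 1) & mask, down, down >> 1,
        ]
        # Bit-sliced counter: (fours, twos, ones) hold each cell's neighbor
        # count modulo 8. A count of 8 wraps to 0, which is neither 2 nor 3.
        ones = twos = fours = 0
        for n in neighbors:
            carry1 = ones & n
            ones ^= n
            carry2 = twos & carry1
            twos ^= carry1
            fours ^= carry2
        two_or_three = twos & ~fours  # count is exactly 2 or 3
        # Count 3 (ones bit set): alive. Count 2 (ones bit clear): keep state.
        result.append(two_or_three & (ones | mid) & mask)
    return result


# A horizontal blinker in a 5 x 5 grid becomes vertical after one step.
blinker = [0b00000, 0b00000, 0b01110, 0b00000, 0b00000]
print([format(r, "05b") for r in life_step(blinker, 5)])
```

Because Python integers are arbitrary-precision, one integer can hold a whole row here; a GPU version would instead use fixed 32- or 64-bit words and would need to exchange boundary bits between adjacent words.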


Author(s):  
Ching-Hsiang Chu ◽  
Khaled Hamidouche ◽  
Akshay Venkatesh ◽  
Ammar Ahmad Awan ◽  
Dhabaleswar K. Panda

2012 ◽  
Vol 2012 ◽  
pp. 1-8 ◽  
Author(s):  
Federico Raimondo ◽  
Juan E. Kamienkowski ◽  
Mariano Sigman ◽  
Diego Fernandez Slezak

In recent years, Independent Component Analysis (ICA) has become a standard method for identifying relevant dimensions of the data in neuroscience. ICA is a very reliable method of analyzing data, but it is computationally very costly, which makes its use for online analysis of the data, as in brain-computer interfaces, almost completely prohibitive. We show that the speed of ICA can be increased about 25-fold at almost no cost (a fast video card). EEG data, which is a repetition of many independent signals in multiple channels, is very suitable for processing with the vector processors included in graphics units. We profiled the implementation of this algorithm and detected two main types of operations that are responsible for the processing bottleneck and take almost 80% of computing time: vector-matrix and matrix-matrix multiplications. Replacing calls to basic linear algebra functions with the standard CUBLAS routines provided by GPU manufacturers does not increase performance, due to CUDA kernel launch overhead. Instead, we developed a GPU-based solution that, compared with the original BLAS and CUBLAS versions, obtains a 25x increase in performance for the ICA calculation.
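
The remark about kernel launch overhead can be made concrete with a toy cost model. This is only a sketch: the function name `total_time_us` and every number below are hypothetical and are not measurements from the paper. It shows how a fixed per-call overhead can cancel a large per-call compute speedup when each call does little work.

```python
def total_time_us(n_calls, compute_us_per_call, overhead_us_per_call=0):
    """Total time in microseconds for n_calls calls with a fixed per-call overhead."""
    return n_calls * (compute_us_per_call + overhead_us_per_call)


# Hypothetical figures, chosen only to illustrate the effect.
n_calls = 100_000       # many small vector-matrix products
cpu_compute_us = 10     # CPU time per small product
gpu_compute_us = 1      # GPU compute is 10x faster per product...
launch_overhead_us = 10  # ...but each call pays a fixed launch cost

cpu_total = total_time_us(n_calls, cpu_compute_us)
gpu_total = total_time_us(n_calls, gpu_compute_us, launch_overhead_us)
print(cpu_total, gpu_total)  # the GPU total is not smaller despite faster compute
```

Under these assumed numbers the overhead term dominates, so the per-call library replacement yields no gain; any real comparison would need measured values.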

