cuda kernel Latest Research Papers

Conway’s Game of Life is the most well-known cellular automaton. The universe of the Game of Life is a 2-dimensional array of cells, each of which takes one of two possible states, alive or dead. The state of every cell is repeatedly updated according to the states of its eight neighbors. A cell will be alive if exactly three neighbors are alive, or if it is alive and exactly two neighbors are alive. The main contribution of this paper is to develop several acceleration techniques for simulating the Game of Life using a GPU, as follows: (1) the states of 32/64 cells are stored in 32/64-bit words (integers) and the next states are computed by the Bitwise Parallel Bulk Computation (BPBC) technique, (2) the states of cells stored in 2 words are updated at the same time by a thread, (3) the warp shuffle instruction is used to directly transfer the current states stored in registers, and (4) multi-step simulation is performed to reduce the overhead of data transfer and of invoking the CUDA kernel. The experimental results show that the performance of our GPU implementation using a GeForce GTX TITAN X is 1350 × 10^9 updates per second for a 16K-step simulation of 512K × 512K cells stored on the SSD. Since an Intel Core i7 CPU using the same technique performs 13.4 × 10^9 updates per second, our GPU implementation for the Game of Life achieves a speedup factor of about 100. Thus, these techniques work very efficiently on a GPU.

CUDA Kernel Based Collective Reduction Operations on Large-scale GPU Clusters

2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), 2016. DOI: 10.1109/ccgrid.2016.111. Cited by ~4.

Author(s): Ching-Hsiang Chu, Khaled Hamidouche, Akshay Venkatesh, Ammar Ahmad Awan, Dhabaleswar K. Panda

Keyword(s): Large Scale, GPU Clusters, CUDA Kernel

Preemption of a CUDA Kernel Function

2012 13th ACIS International Conference on Software Engineering, Artificial Intelligence, Networking and Parallel/Distributed Computing, 2012. DOI: 10.1109/snpd.2012.53. Cited by ~5.

Author(s): Jon Calhoun, Hai Jiang

Keyword(s): Kernel Function, CUDA Kernel

CUDAICA: GPU Optimization of Infomax-ICA EEG Analysis

Computational Intelligence and Neuroscience, Vol 2012, pp. 1-8, 2012. DOI: 10.1155/2012/206972. Cited by ~35.

Author(s): Federico Raimondo, Juan E. Kamienkowski, Mariano Sigman, Diego Fernandez Slezak

Keyword(s): Computing Time, Multiple Channels, Function Calls, EEG Data, Video Card, GPU Optimization, Vector Processors, Vector Matrix, CUDA Kernel, Analyze Data

In recent years, Independent Component Analysis (ICA) has become a standard method for identifying relevant dimensions of the data in neuroscience. ICA is a very reliable method for analyzing data, but it is computationally very costly. This cost makes the use of ICA for online analysis of data, as in brain-computer interfaces, almost completely prohibitive. We show that the speed of ICA can be increased about 25-fold at almost no cost (a fast video card). EEG data, which consist of many independent signals repeated across multiple channels, are very well suited to processing on the vector processors included in graphics units. We profiled the implementation of this algorithm and detected two main types of operations that are responsible for the processing bottleneck and take almost 80% of computing time: vector-matrix and matrix-matrix multiplications. Replacing calls to basic linear algebra functions with the standard CUBLAS routines provided by GPU manufacturers does not increase performance, because of CUDA kernel launch overhead. Instead, we developed a GPU-based solution that, compared with the original BLAS and CUBLAS versions, obtains a 25x increase in performance for the ICA calculation.

cuda kernel
Recently Published Documents

An Approach to Estimate Power Consumption of a CUDA Kernel

A fast and efficient integration of boundary conditions into a unified CUDA Kernel for a shallow water solver lattice Boltzmann Method

Software Tool to Separate CUDA Kernel for Effective CUDA Debugging

SST_GPU: An Execution-Driven CUDA Kernel Scheduler and Streaming-Multiprocessor Compute Model

Predicting Execution Time of CUDA Kernel Using Static Analysis

Towards automatic restrictification of CUDA kernel arguments

Fast Simulation of Conway’s Game of Life Using Bitwise Parallel Bulk Computation on a GPU

CUDA Kernel Based Collective Reduction Operations on Large-scale GPU Clusters

Preemption of a CUDA Kernel Function

CUDAICA: GPU Optimization of Infomax-ICA EEG Analysis
