thread block
Recently Published Documents


TOTAL DOCUMENTS

27
(FIVE YEARS 10)

H-INDEX

6
(FIVE YEARS 1)

Author(s):  
Antonio Fuentes-Alventosa ◽  
Juan Gómez-Luna ◽  
José Maria González-Linares ◽  
Nicolás Guil ◽  
R. Medina-Carnicer

Abstract: CAVLC (Context-Adaptive Variable Length Coding) is a high-performance entropy coding method for video and image compression, and the entropy method most commonly used in the H.264 video standard. In recent years, several hardware accelerators for CAVLC have been designed; in contrast, high-performance software implementations of CAVLC (e.g., GPU-based) are scarce. A high-performance GPU-based implementation of CAVLC is desirable in several scenarios. On the one hand, it can serve as the entropy component of GPU-based H.264 encoders, which are a suitable solution when a GPU's built-in H.264 hardware encoder lacks necessary functionality such as data encryption or information hiding. On the other hand, a GPU-based implementation of CAVLC can be reused in a wide variety of GPU-based compression systems that encode images and videos in formats other than H.264, such as medical images. This is not possible with hardware implementations of CAVLC, since they are inseparable components of hardware H.264 encoders. In this paper, we present CAVLCU, an efficient implementation of CAVLC on GPU, based on four key ideas. First, we use only one kernel, avoiding both the long-latency global memory accesses required to pass intermediate results between kernels and the costly launches and terminations of additional kernels. Second, we apply an efficient synchronization mechanism for thread-blocks (in this paper, to prevent confusion, a block of pixels of a frame is referred to simply as a block, and a GPU thread block as a thread-block) that process adjacent frame regions (in the horizontal and vertical dimensions) so they can share results in global memory. Third, we fully exploit the available global memory bandwidth by using vectorized loads to move the quantized transform coefficients directly into registers. Fourth, we use register tiling to implement the zigzag sorting, obtaining high instruction-level parallelism. An exhaustive experimental evaluation showed that our approach is between 2.5× and 5.4× faster than the only state-of-the-art GPU-based implementation of CAVLC.
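A minimal CUDA sketch of the vectorized-load idea described in the abstract: each thread brings four quantized coefficients from global memory into registers with a single 128-bit int4 load, with no shared-memory staging. The kernel name, the placeholder per-thread work, and the array layout are illustrative assumptions, not taken from CAVLCU itself.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void cavlc_load_coeffs(const int *__restrict__ coeffs,
                                  int *__restrict__ nz_counts, int n_vec)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx >= n_vec) return;

    // One 128-bit vectorized load: four quantized coefficients go straight
    // from global memory into registers. cudaMalloc guarantees the 16-byte
    // alignment that int4 loads require.
    int4 c = reinterpret_cast<const int4 *>(coeffs)[idx];

    // Placeholder per-thread work on the register-resident coefficients
    // (a real CAVLC encoder would do zigzag reordering and run/level coding).
    nz_counts[idx] = (c.x != 0) + (c.y != 0) + (c.z != 0) + (c.w != 0);
}

int main()
{
    const int n_vec = 256;  // 256 groups of 4 coefficients
    int *d_coeffs, *d_nz;
    cudaMalloc(&d_coeffs, n_vec * 4 * sizeof(int));
    cudaMalloc(&d_nz, n_vec * sizeof(int));
    cudaMemset(d_coeffs, 0, n_vec * 4 * sizeof(int));
    cavlc_load_coeffs<<<(n_vec + 127) / 128, 128>>>(d_coeffs, d_nz, n_vec);
    cudaDeviceSynchronize();
    cudaFree(d_coeffs);
    cudaFree(d_nz);
    return 0;
}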


2021 ◽  
Author(s):  
Devashree Tripathy ◽  
AmirAli Abdolrashidi ◽  
Quan Fan ◽  
Daniel Wong ◽  
Manoranjan Satpathy
Keyword(s):  

2021 ◽  
Vol 48 (3) ◽  
pp. 81-88
Author(s):  
Guin Gilman ◽  
Samuel S. Ogden ◽  
Tian Guo ◽  
Robert J. Walls

In this work, we empirically derive the scheduler's behavior under concurrent workloads for NVIDIA's Pascal, Volta, and Turing microarchitectures. In contrast to past studies that suggest the scheduler uses a round-robin policy to assign thread blocks to streaming multiprocessors (SMs), we instead find that the scheduler chooses the next SM based on the SM's local resource availability. We show how this scheduling policy can lead to significant, and seemingly counter-intuitive, performance degradation; for example, a decrease of one thread per block resulted in a 3.58X increase in execution time for one kernel in our experiments. We hope that our work will be useful for improving the accuracy of GPU simulators and aid in the development of novel scheduling algorithms.
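A minimal sketch of the kind of probe such empirical scheduler studies rely on: reading the %smid special register tells each thread block which SM it was placed on, so block-to-SM assignments can be logged and the placement policy inferred. The kernel name and output format here are illustrative, not from the paper.

#include <cstdio>
#include <cuda_runtime.h>

__global__ void log_sm_assignment(unsigned int *block_to_sm)
{
    if (threadIdx.x == 0) {
        unsigned int smid;
        // %smid holds the ID of the SM this thread block is resident on.
        asm("mov.u32 %0, %%smid;" : "=r"(smid));
        block_to_sm[blockIdx.x] = smid;
    }
}

int main()
{
    const int num_blocks = 64;
    unsigned int *d_map, h_map[num_blocks];
    cudaMalloc(&d_map, num_blocks * sizeof(unsigned int));
    log_sm_assignment<<<num_blocks, 128>>>(d_map);
    cudaMemcpy(h_map, d_map, sizeof(h_map), cudaMemcpyDeviceToHost);
    for (int b = 0; b < num_blocks; ++b)
        printf("block %2d -> SM %u\n", b, h_map[b]);
    cudaFree(d_map);
    return 0;
}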


2020 ◽  
Vol 10 (24) ◽  
pp. 9121
Author(s):  
KyungWoon Cho ◽  
Hyokyung Bahn

A GPGPU (General-Purpose Graphics Processing Unit) provides hardware resources that can execute tens of thousands of threads simultaneously. In reality, however, parallelism is limited because resources are allocated at the granularity of the thread block, which current GPGPU systems do not manage judiciously. To schedule threads, a specialized hardware scheduler allocates thread blocks to the computing units called SMs (Streaming Multiprocessors) in a Round-Robin manner. Although scheduling in hardware is simple and fast, we observe that Round-Robin scheduling is not efficient in GPGPU, as it considers neither the workload characteristics of threads nor the resource balance among SMs. In this article, we present a new thread block scheduling model that can analyze and quantify the performance of thread block scheduling. We implement our model as a GPGPU scheduling simulator and show that the conventional thread block scheduling provided by GPGPU hardware performs poorly as the workload becomes heavy. Specifically, we observe that the performance degradation of Round-Robin can be eliminated by adopting DFA (Depth First Allocation), which is simple but scalable. Moreover, as our simulator is modular and publicly available, other researchers can incorporate various scheduling policies into it to evaluate the performance of GPGPU schedulers.
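A toy host-side sketch (plain C++ in a .cu file, no actual GPU launch) contrasting the two placement policies discussed above: Round-Robin cycles across SMs, while Depth First Allocation fills one SM before moving to the next. The one-slot-per-block resource model and all names here are deliberate simplifications, not the paper's simulator.

#include <cstdio>
#include <vector>

struct SM { int used = 0, capacity = 4; };  // resident thread blocks per SM

// Round-Robin: place each block on the next SM in cyclic order.
int place_rr(std::vector<SM> &sms, int &cursor)
{
    for (size_t tries = 0; tries < sms.size(); ++tries) {
        int sm = cursor;
        cursor = (cursor + 1) % (int)sms.size();
        if (sms[sm].used < sms[sm].capacity) { sms[sm].used++; return sm; }
    }
    return -1;  // all SMs full
}

// DFA: keep filling the lowest-numbered SM that still has room.
int place_dfa(std::vector<SM> &sms)
{
    for (size_t sm = 0; sm < sms.size(); ++sm)
        if (sms[sm].used < sms[sm].capacity) { sms[sm].used++; return (int)sm; }
    return -1;
}

int main()
{
    std::vector<SM> rr_sms(4), dfa_sms(4);
    int cursor = 0;
    for (int b = 0; b < 6; ++b)
        printf("block %d: RR -> SM %d, DFA -> SM %d\n",
               b, place_rr(rr_sms, cursor), place_dfa(dfa_sms));
    return 0;
}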


Processes ◽  
2020 ◽  
Vol 8 (9) ◽  
pp. 1199
Author(s):  
Ravie Chandren Muniyandi ◽  
Ali Maroosi

Long-timescale simulations of biological processes such as photosynthesis, or attempts to solve NP-hard problems such as traveling salesman, knapsack, Hamiltonian path, and satisfiability, using membrane systems without appropriate parallelization can take hours or days. Graphics processing units (GPUs) provide a massively parallel mechanism for general-purpose computation. Previous studies mapped one membrane to one thread block on the GPU. This is disadvantageous because when the number of objects per membrane is small, the number of active threads is also small, decreasing performance. Moreover, when each membrane is assigned to one thread block, communication between membranes must be carried out as communication between thread blocks, which is a time-consuming process. Previous approaches have also not addressed the issue of GPU occupancy. This study presents a classification algorithm that manages dependent objects and membranes based on the communication rate associated with a defined weighted network and assigns them to sub-matrices. Dependent objects and membranes are thus allocated to the same threads and thread blocks, decreasing communication between threads and thread blocks and allowing the GPU to maintain the highest possible occupancy. The experimental results indicate that, for 48 objects per membrane, the algorithm yields a 93-fold speedup compared to a 1.6-fold speedup with previous algorithms.
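A hedged kernel sketch of the mapping idea above: instead of one membrane per thread block, a precomputed clustering places communication-heavy membranes in the same block, so their object exchanges stay in fast shared memory rather than crossing blocks through global memory. The names (evolve_clusters, cluster_offset), the CSR-style layout, and the placeholder evolution rule are all assumptions for illustration, not the paper's algorithm.

#include <cuda_runtime.h>

#define MAX_OBJS_PER_CLUSTER 1024  // assumption: one cluster fits in shared memory

__global__ void evolve_clusters(const int *__restrict__ cluster_offset,
                                const int *__restrict__ objects,
                                int *__restrict__ objects_next)
{
    __shared__ int local[MAX_OBJS_PER_CLUSTER];

    // Each block owns one cluster of dependent membranes; cluster_offset is a
    // CSR-style index into the flattened object array.
    int begin = cluster_offset[blockIdx.x];
    int end   = cluster_offset[blockIdx.x + 1];

    // Stage the cluster's objects in shared memory.
    for (int i = begin + threadIdx.x; i < end; i += blockDim.x)
        local[i - begin] = objects[i];
    __syncthreads();

    // Placeholder evolution step: all intra-cluster communication touches only
    // shared memory; a real membrane system would apply its rewriting rules.
    for (int i = begin + threadIdx.x; i < end; i += blockDim.x)
        objects_next[i] = local[i - begin] + 1;
}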


2020 ◽  
Vol 17 (3) ◽  
pp. 1-27
Author(s):  
Muhammad Huzaifa ◽  
Johnathan Alsop ◽  
Abdulrahman Mahmoud ◽  
Giordano Salvador ◽  
Matthew D. Sinclair ◽  
...  
