GPU Programming
Recently Published Documents


TOTAL DOCUMENTS: 115 (FIVE YEARS: 16)
H-INDEX: 17 (FIVE YEARS: 1)

2022 ◽ pp. 291-360
Author(s): Peter S. Pacheco ◽ Matthew Malensek

2021 ◽ Vol 5 (OOPSLA) ◽ pp. 1-30
Author(s): Tyler Sorensen ◽ Lucas F. Salvador ◽ Harmit Raval ◽ Hugues Evrard ◽ John Wickerson ◽ ...

As GPU availability has increased and programming support has matured, a wider variety of applications are being ported to these platforms. Many parallel applications contain fine-grained synchronization idioms; as such, their correct execution depends on a degree of relative forward progress between threads (or thread groups). Unfortunately, many GPU programming specifications (e.g. Vulkan and Metal) say almost nothing about relative forward progress guarantees between workgroups. Although prior work has proposed a spectrum of plausible progress models for GPUs, cross-vendor specifications have yet to commit to any model. This work is a collection of tools and experimental data to aid specification designers when considering forward progress guarantees in programming frameworks. As a foundation, we formalize a small parallel programming language that captures the essence of fine-grained synchronization. We then provide a means of formally specifying a progress model, and develop a termination oracle that decides whether a given program is guaranteed to eventually terminate with respect to a given progress model. Next, we formalize a set of constraints that describe concurrent programs that require forward progress to terminate. This allows us to synthesize a large set of 483 progress litmus tests. Combined with the termination oracle, we can determine the expected status of each litmus test -- i.e. whether it is guaranteed to eventually terminate -- under various progress models. We present a large experimental campaign running the litmus tests across 8 GPUs from 5 different vendors. Our results highlight that GPUs have significantly different termination behaviors under our test suite. Most notably, we find that Apple and ARM GPUs do not support the linear occupancy-bound model, as was hypothesized by prior work.
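
The "fine-grained synchronization idioms" mentioned above are patterns such as one workgroup spin-waiting on a flag that another workgroup sets; whether such a program terminates depends entirely on the progress model. As a concrete illustration (not the paper's formal language or its synthesized tests), a minimal two-workgroup producer-consumer litmus test might look like the following Python/Numba sketch, with all names our own:

    import numpy as np
    from numba import cuda

    @cuda.jit
    def progress_litmus(flag, out):
        # Block 0 (consumer) spins until block 1 (producer) raises the
        # flag. The kernel terminates only if the scheduler eventually
        # runs the producer while the consumer spins, i.e. only under a
        # sufficiently strong relative forward-progress guarantee.
        if cuda.blockIdx.x == 0:
            # atomic no-op read so the load is not hoisted or cached away
            while cuda.atomic.add(flag, 0, 0) == 0:
                pass
            out[0] = 1  # consumer observed the signal
        else:
            cuda.atomic.add(flag, 0, 1)  # producer signals

    flag = cuda.to_device(np.zeros(1, dtype=np.int32))
    out = cuda.to_device(np.zeros(1, dtype=np.int32))
    progress_litmus[2, 1](flag, out)  # two workgroups of one thread each
    cuda.synchronize()

On a device that eventually co-schedules both workgroups this returns promptly; on one that runs workgroups to completion one at a time, it can hang, which is exactly the behavioral difference such litmus tests probe.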


2021 ◽ Vol 14 (2) ◽ pp. 1-23
Author(s): Rui Ma ◽ Jia-Ching Hsu ◽ Tian Tan ◽ Eriko Nurvitadhi ◽ David Sheffield ◽ ...

Overlay architectures are a good way to enable fast development and debugging on FPGAs, at the expense of potentially limited performance compared to fully customized FPGA designs. When used in concert with hand-tuned FPGA solutions, performant overlay architectures can improve time-to-solution and thus the overall productivity of FPGA development. This work tunes and specializes FGPU, an open-source OpenCL-programmable GPU overlay for FPGAs. We demonstrate that our persistent deep learning (PDL)-FGPU architecture maintains the ease of programming and generality of GPU programming while achieving high performance through specialization for the persistent deep learning domain. We also propose an easy method for specializing the overlay to other domains. PDL-FGPU includes new instructions, along with micro-architecture and compiler enhancements. We evaluate both the FGPU baseline and the proposed PDL-FGPU in simulation on a modern high-end Intel Stratix 10 2800 FPGA, running persistent DL applications (RNN, GRU, LSTM) as well as non-DL applications to demonstrate generality. PDL-FGPU requires 1.4–3× more ALMs, 4.4–6.4× more M20Ks, and 1–9.5× more DSPs than the baseline, but improves performance by 56–693× for PDL applications with an average 23.1% degradation on non-PDL applications. We also integrated the PDL-FGPU overlay into Intel OPAE to measure real-world performance/power, and demonstrate that PDL-FGPU is only 4.0–10.4× slower than the Nvidia V100.


Universe ◽ 2021 ◽ Vol 7 (7) ◽ pp. 218
Author(s): Iuri La Rosa ◽ Pia Astone ◽ Sabrina D’Antonio ◽ Sergio Frasca ◽ Paola Leaci ◽ ...

We present a new approach to searching for Continuous gravitational Waves (CWs) emitted by isolated rotating neutron stars, exploiting the high parallel computing efficiency and computational power of modern Graphics Processing Units (GPUs). Specifically, this paper describes the porting of the FrequencyHough transform, one of the algorithms used to search for CW signals, to the TensorFlow framework. The new code has been fully tested, and its performance on GPUs has been compared to that on a multicore CPU system of the same class, showing a factor-of-10 speed-up. This demonstrates that GPU programming with the general-purpose libraries of a high-level framework such as TensorFlow can significantly improve data-analysis performance, opening new perspectives on wide-parameter searches for CWs.
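
The abstract gives no code, but the general idea, Hough-style accumulation expressed entirely with stock TensorFlow ops so it runs on a GPU without custom kernels, can be sketched as below. The grids, the names, and the simplified frequency model f(t) = f0 + fdot·t are our assumptions for illustration, not the actual FrequencyHough implementation:

    import tensorflow as tf

    def hough_map(peak_freqs, peak_times, f0_lo, f0_hi, n_f0, fdot_grid):
        # Toy Hough-style accumulation: for each trial spin-down fdot,
        # shift every time-frequency peak back to the reference time and
        # histogram the corrected frequencies. With a visible GPU,
        # TensorFlow places these ops there automatically.
        rows = []
        for fdot in fdot_grid:
            corrected = peak_freqs - fdot * peak_times  # undo frequency drift
            rows.append(tf.histogram_fixed_width(
                corrected, value_range=[f0_lo, f0_hi], nbins=n_f0))
        return tf.stack(rows)  # shape (n_fdot, n_f0): counts per candidate

    # toy data: a signal at 100 Hz spinning down at -1e-9 Hz/s
    t = tf.linspace(0.0, 1.0e6, 4096)
    f = 100.0 - 1e-9 * t
    hmap = hough_map(f, t, 99.9, 100.1, 200, tf.linspace(-2e-9, 0.0, 50))

A true signal concentrates counts in one (f0, fdot) cell of the returned map, which is what makes the transform well suited to massively parallel accumulation.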


2020 ◽ Vol 20 (4) ◽ pp. 1-27
Author(s): Patrick Daleiden ◽ Andreas Stefik ◽ Philip Merlin Uesbeck

Author(s): Gabriel Camporredondo ◽ Lourdes Muñoz ◽ Mathieu Legrand ◽ Ramon Barber

Sensors ◽ 2020 ◽ Vol 20 (8) ◽ pp. 2309
Author(s): Yifei Tian ◽ Wei Song ◽ Long Chen ◽ Yunsick Sung ◽ Jeonghoon Kwak ◽ ...

Fast and accurate obstacle detection is essential for accurate perception of a mobile vehicle's environment. Because point clouds sensed by light detection and ranging (LiDAR) sensors are sparse and unstructured, traditional obstacle clustering on raw point clouds is inaccurate and time-consuming. Thus, to achieve fast obstacle clustering in unknown terrain, this paper proposes an elevation-reference connected component labeling (ER-CCL) algorithm using graphics processing unit (GPU) programming. LiDAR points are first projected onto a rasterized x–z plane so that the sparse points are mapped into a series of regularly arranged small cells. Based on the height distribution of the LiDAR points, the ground cells are filtered out and a flag map is generated. Next, the ER-CCL algorithm is applied to the label map generated from the flag map to mark individual clusters with unique labels. Finally, the obstacle labeling results are inverse-transformed from the x–z plane to 3D points to provide the clustering results. For real-time 3D point cloud clustering, ER-CCL is accelerated by running it in parallel with the aid of GPU programming technology.
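
The abstract does not spell out the ER-CCL update rule. A common GPU-friendly formulation of connected component labeling is iterative minimum-label propagation, where each cell repeatedly adopts the smallest label among itself and its neighbours and one GPU thread handles one cell. The NumPy sketch below (our illustration, not the paper's code) expresses that per-cell update with whole-array operations over the flag map:

    import numpy as np

    def label_flag_map(flag_map, max_iters=10000):
        # flag_map: 2D bool array, True on obstacle cells (ground cells
        # from the elevation-based filtering step already removed).
        # Give every obstacle cell a unique positive label, then repeatedly
        # replace each label by the minimum over the 4-neighbourhood. Each
        # per-cell update is independent, so on a GPU it maps to one thread
        # per cell; here NumPy applies it to all cells at once.
        h, w = flag_map.shape
        labels = np.where(flag_map,
                          np.arange(1, h * w + 1, dtype=np.int64).reshape(h, w),
                          0)
        big = np.iinfo(np.int64).max
        for _ in range(max_iters):
            prev = labels
            p = np.pad(labels, 1)  # zero border: no wrap-around merges
            neigh = np.stack([p[:-2, 1:-1], p[2:, 1:-1],   # up, down
                              p[1:-1, :-2], p[1:-1, 2:]])  # left, right
            neigh_min = np.where(neigh > 0, neigh, big).min(axis=0)
            labels = np.where(flag_map & (neigh_min < labels), neigh_min, labels)
            if np.array_equal(labels, prev):
                break  # converged: each cluster carries one unique label
        return labels

Labels spread one cell per sweep, so the iteration count is bounded by the diameter of the largest cluster; the per-cell independence of each sweep is what makes this style of CCL amenable to GPU acceleration.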

