GPU Programming
Recently Published Documents


TOTAL DOCUMENTS: 115 (FIVE YEARS: 16)
H-INDEX: 17 (FIVE YEARS: 1)

2022 ◽ pp. 291-360
Author(s): Peter S. Pacheco ◽ Matthew Malensek

2021 ◽ Vol 5 (OOPSLA) ◽ pp. 1-30
Author(s): Tyler Sorensen ◽ Lucas F. Salvador ◽ Harmit Raval ◽ Hugues Evrard ◽ John Wickerson ◽ ...

As GPU availability has increased and programming support has matured, a wider variety of applications are being ported to these platforms. Many parallel applications contain fine-grained synchronization idioms; as such, their correct execution depends on a degree of relative forward progress between threads (or thread groups). Unfortunately, many GPU programming specifications (e.g. Vulkan and Metal) say almost nothing about relative forward progress guarantees between workgroups. Although prior work has proposed a spectrum of plausible progress models for GPUs, cross-vendor specifications have yet to commit to any model. This work is a collection of tools and experimental data to aid specification designers when considering forward progress guarantees in programming frameworks. As a foundation, we formalize a small parallel programming language that captures the essence of fine-grained synchronization. We then provide a means of formally specifying a progress model, and develop a termination oracle that decides whether a given program is guaranteed to eventually terminate with respect to a given progress model. Next, we formalize a set of constraints that describe concurrent programs that require forward progress to terminate. This allows us to synthesize a large set of 483 progress litmus tests. Combined with the termination oracle, we can determine the expected status of each litmus test -- i.e. whether it is guaranteed to eventually terminate -- under various progress models. We present a large experimental campaign running the litmus tests across 8 GPUs from 5 different vendors. Our results highlight that GPUs have significantly different termination behaviors under our test suite. Most notably, we find that Apple and ARM GPUs do not support the linear occupancy-bound model, as was hypothesized by prior work.
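
The "fine-grained synchronization idioms" mentioned above are patterns such as one workgroup spin-waiting on a flag that another workgroup sets; whether such a program terminates depends entirely on the progress model. As a concrete illustration (not the paper's formal language or its synthesized tests), a minimal two-workgroup producer-consumer litmus test might look like the following Python/Numba sketch, with all names our own:

    import numpy as np
    from numba import cuda

    @cuda.jit
    def progress_litmus(flag, out):
        # Block 0 (consumer) spins until block 1 (producer) raises the
        # flag. The kernel terminates only if the scheduler eventually
        # runs the producer while the consumer spins, i.e. only under a
        # sufficiently strong relative forward-progress guarantee.
        if cuda.blockIdx.x == 0:
            # atomic no-op read so the load is not hoisted or cached away
            while cuda.atomic.add(flag, 0, 0) == 0:
                pass
            out[0] = 1  # consumer observed the signal
        else:
            cuda.atomic.add(flag, 0, 1)  # producer signals

    flag = cuda.to_device(np.zeros(1, dtype=np.int32))
    out = cuda.to_device(np.zeros(1, dtype=np.int32))
    progress_litmus[2, 1](flag, out)  # two workgroups of one thread each
    cuda.synchronize()

On a device that eventually co-schedules both workgroups this returns promptly; on one that runs workgroups to completion one at a time, it can hang, which is exactly the behavioral difference such litmus tests probe.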


2021 ◽ Vol 14 (2) ◽ pp. 1-23
Author(s): Rui Ma ◽ Jia-Ching Hsu ◽ Tian Tan ◽ Eriko Nurvitadhi ◽ David Sheffield ◽ ...

Overlay architectures are a good way to enable fast development and debugging on FPGAs, at the expense of potentially limited performance compared to fully customized FPGA designs. When used in concert with hand-tuned FPGA solutions, performant overlay architectures can improve time-to-solution and thus the overall productivity of FPGA development. This work tunes and specializes FGPU, an open-source OpenCL-programmable GPU overlay for FPGAs. We demonstrate that our persistent deep learning (PDL)-FGPU architecture maintains the ease of programming and generality of GPU programming while achieving high performance through specialization for the persistent deep learning domain. We also propose an easy method for specializing the overlay to other domains. PDL-FGPU includes new instructions, along with micro-architecture and compiler enhancements. We evaluate both the FGPU baseline and the proposed PDL-FGPU in simulation on a modern high-end Intel Stratix 10 2800 FPGA, running persistent DL applications (RNN, GRU, LSTM) as well as non-DL applications to demonstrate generality. PDL-FGPU requires 1.4–3× more ALMs, 4.4–6.4× more M20Ks, and 1–9.5× more DSPs than the baseline, but improves performance by 56–693× for PDL applications with an average 23.1% degradation on non-PDL applications. We also integrated the PDL-FGPU overlay into Intel OPAE to measure real-world performance/power, and demonstrate that PDL-FGPU is only 4.0–10.4× slower than the Nvidia V100.


Universe ◽ 2021 ◽ Vol 7 (7) ◽ pp. 218
Author(s): Iuri La Rosa ◽ Pia Astone ◽ Sabrina D’Antonio ◽ Sergio Frasca ◽ Paola Leaci ◽ ...

We present a new approach to searching for Continuous gravitational Waves (CWs) emitted by isolated rotating neutron stars, exploiting the high parallel computing efficiency and computational power of modern Graphics Processing Units (GPUs). Specifically, this paper describes the porting of the FrequencyHough transform, one of the algorithms used to search for CW signals, to the TensorFlow framework. The new code has been fully tested, and its performance on GPUs has been compared to that on a multicore CPU system of the same class, showing a factor-of-10 speed-up. This demonstrates that GPU programming with the general-purpose libraries of a high-level framework such as TensorFlow can significantly improve data-analysis performance, opening new perspectives on wide-parameter searches for CWs.
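
The abstract gives no code, but the general idea, Hough-style accumulation expressed entirely with stock TensorFlow ops so it runs on a GPU without custom kernels, can be sketched as below. The grids, the names, and the simplified frequency model f(t) = f0 + fdot·t are our assumptions for illustration, not the actual FrequencyHough implementation:

    import tensorflow as tf

    def hough_map(peak_freqs, peak_times, f0_lo, f0_hi, n_f0, fdot_grid):
        # Toy Hough-style accumulation: for each trial spin-down fdot,
        # shift every time-frequency peak back to the reference time and
        # histogram the corrected frequencies. With a visible GPU,
        # TensorFlow places these ops there automatically.
        rows = []
        for fdot in fdot_grid:
            corrected = peak_freqs - fdot * peak_times  # undo frequency drift
            rows.append(tf.histogram_fixed_width(
                corrected, value_range=[f0_lo, f0_hi], nbins=n_f0))
        return tf.stack(rows)  # shape (n_fdot, n_f0): counts per candidate

    # toy data: a signal at 100 Hz spinning down at -1e-9 Hz/s
    t = tf.linspace(0.0, 1.0e6, 4096)
    f = 100.0 - 1e-9 * t
    hmap = hough_map(f, t, 99.9, 100.1, 200, tf.linspace(-2e-9, 0.0, 50))

A true signal concentrates counts in one (f0, fdot) cell of the returned map, which is what makes the transform well suited to massively parallel accumulation.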


2020 ◽ Vol 20 (4) ◽ pp. 1-27
Author(s): Patrick Daleiden ◽ Andreas Stefik ◽ Philip Merlin Uesbeck

Author(s): Gabriel Camporredondo ◽ Lourdes Muñoz ◽ Mathieu Legrand ◽ Ramon Barber

Sensors ◽ 2020 ◽ Vol 20 (8) ◽ pp. 2309
Author(s): Yifei Tian ◽ Wei Song ◽ Long Chen ◽ Yunsick Sung ◽ Jeonghoon Kwak ◽ ...

Fast and accurate obstacle detection is essential for accurate perception of a mobile vehicle's environment. Because point clouds sensed by light detection and ranging (LiDAR) sensors are sparse and unstructured, traditional obstacle clustering on raw point clouds is inaccurate and time-consuming. Thus, to achieve fast obstacle clustering in unknown terrain, this paper proposes an elevation-reference connected component labeling (ER-CCL) algorithm using graphics processing unit (GPU) programming. LiDAR points are first projected onto a rasterized x–z plane so that the sparse points are mapped into a series of regularly arranged small cells. Based on the height distribution of the LiDAR points, the ground cells are filtered out and a flag map is generated. Next, the ER-CCL algorithm is applied to the label map generated from the flag map to mark individual clusters with unique labels. Finally, the obstacle labeling results are inverse-transformed from the x–z plane to 3D points to provide the clustering results. For real-time 3D point cloud clustering, ER-CCL is accelerated by running it in parallel with the aid of GPU programming technology.
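
The abstract does not spell out the ER-CCL update rule. A common GPU-friendly formulation of connected component labeling is iterative minimum-label propagation, where each cell repeatedly adopts the smallest label among itself and its neighbours and one GPU thread handles one cell. The NumPy sketch below (our illustration, not the paper's code) expresses that per-cell update with whole-array operations over the flag map:

    import numpy as np

    def label_flag_map(flag_map, max_iters=10000):
        # flag_map: 2D bool array, True on obstacle cells (ground cells
        # from the elevation-based filtering step already removed).
        # Give every obstacle cell a unique positive label, then repeatedly
        # replace each label by the minimum over the 4-neighbourhood. Each
        # per-cell update is independent, so on a GPU it maps to one thread
        # per cell; here NumPy applies it to all cells at once.
        h, w = flag_map.shape
        labels = np.where(flag_map,
                          np.arange(1, h * w + 1, dtype=np.int64).reshape(h, w),
                          0)
        big = np.iinfo(np.int64).max
        for _ in range(max_iters):
            prev = labels
            p = np.pad(labels, 1)  # zero border: no wrap-around merges
            neigh = np.stack([p[:-2, 1:-1], p[2:, 1:-1],   # up, down
                              p[1:-1, :-2], p[1:-1, 2:]])  # left, right
            neigh_min = np.where(neigh > 0, neigh, big).min(axis=0)
            labels = np.where(flag_map & (neigh_min < labels), neigh_min, labels)
            if np.array_equal(labels, prev):
                break  # converged: each cluster carries one unique label
        return labels

Labels spread one cell per sweep, so the iteration count is bounded by the diameter of the largest cluster; the per-cell independence of each sweep is what makes this style of CCL amenable to GPU acceleration.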

