GPU Implementation
Recently Published Documents


TOTAL DOCUMENTS: 522 (FIVE YEARS: 95)

H-INDEX: 28 (FIVE YEARS: 4)

Author(s):  
Liam Dunn ◽  
Patrick Clearwater ◽  
Andrew Melatos ◽  
Karl Wette

Abstract The F-statistic is a detection statistic used widely in searches for continuous gravitational waves with terrestrial, long-baseline interferometers. A new implementation of the F-statistic is presented which accelerates the existing "resampling" algorithm using graphics processing units (GPUs). The new implementation runs between 10 and 100 times faster than the existing implementation on central processing units without sacrificing numerical accuracy. The utility of the GPU implementation is demonstrated on a pilot narrowband search for four newly discovered millisecond pulsars in the globular cluster Omega Centauri using data from the second Laser Interferometer Gravitational-Wave Observatory observing run. The computational cost is 17.2 GPU-hours using the new implementation, compared to 1092 core-hours with the existing implementation.
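The resampling algorithm gains its speed by interpolating the detector data onto a grid that is uniform in the solar-system barycentre's time frame, after which a single FFT evaluates the statistic across many frequency bins at once. The sketch below shows only that interpolation step as a CUDA kernel, one thread per output sample; the toy signal, the linear interpolant, and all names are illustrative assumptions, and a production implementation additionally handles complex heterodyned data, spin-down corrections, and the final Fourier transform (e.g. via cuFFT).

```cuda
// Hedged sketch: resample a uniformly sampled detector-frame series onto
// barycentric sample times by linear interpolation, one thread per output.
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>
#include <vector>

__global__ void resample_kernel(const float* __restrict__ x,   // detector-frame samples
                                const double* __restrict__ tb, // barycentric time of each output sample
                                float* __restrict__ y,         // resampled output
                                int n_in, int n_out, double t0, double dt)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_out) return;
    double u = (tb[i] - t0) / dt;          // fractional index in the uniform input grid
    int k = (int)floor(u);
    if (k < 0 || k + 1 >= n_in) { y[i] = 0.0f; return; }  // outside the data span
    float w = (float)(u - k);
    y[i] = (1.0f - w) * x[k] + w * x[k + 1];              // linear interpolation
}

int main() {
    const int n = 1 << 20;
    const double t0 = 0.0, dt = 1.0 / 4096.0;             // illustrative sample rate
    std::vector<float> x(n);
    std::vector<double> tb(n);
    for (int i = 0; i < n; ++i) {
        x[i] = sinf(2.0f * (float)M_PI * 100.0f * (float)(i * dt));  // toy 100 Hz signal
        // Toy barycentric times: detector time plus a slowly varying Doppler delay.
        tb[i] = i * dt + 1e-4 * sin(2.0 * M_PI * i / n);
    }
    float *dx, *dy; double *dtb;
    cudaMalloc(&dx, n * sizeof(float));
    cudaMalloc(&dy, n * sizeof(float));
    cudaMalloc(&dtb, n * sizeof(double));
    cudaMemcpy(dx, x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dtb, tb.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    resample_kernel<<<(n + 255) / 256, 256>>>(dx, dtb, dy, n, n, t0, dt);
    cudaDeviceSynchronize();
    std::vector<float> y(n);
    cudaMemcpy(y.data(), dy, n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("y[12345] = %f\n", y[12345]);
    cudaFree(dx); cudaFree(dy); cudaFree(dtb);
    return 0;
}
```

Because every output sample is independent, the kernel is embarrassingly parallel and memory-bound, which is why this stage maps so well to GPUs.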


Electronics ◽  
2021 ◽  
Vol 11 (1) ◽  
pp. 11
Author(s):  
Xing Xie ◽  
Lin Bai ◽  
Xinming Huang

LiDAR has been widely used in autonomous driving systems to provide high-precision 3D geometric information about the vehicle's surroundings for perception, localization, and path planning. LiDAR-based point cloud semantic segmentation is an important task with a critical real-time requirement. However, most existing convolutional neural network (CNN) models for 3D point cloud semantic segmentation are very complex and can hardly be processed in real time on an embedded platform. In this study, a lightweight CNN structure was proposed for projection-based LiDAR point cloud semantic segmentation with only 1.9 M parameters, an 87% reduction compared to state-of-the-art networks. When evaluated on a GPU, the processing time was 38.5 ms per frame, and it achieved a 47.9% mIoU score on the Semantic-KITTI dataset. In addition, the proposed CNN was mapped onto an FPGA using the NVDLA architecture, resulting in a 2.74x speedup over the GPU implementation and a 46x improvement in power efficiency.
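Projection-based methods of this kind first rasterise each LiDAR sweep into a 2D "range image" and then run an ordinary 2D CNN on it. Below is a minimal CUDA sketch of that spherical projection step; the 64x2048 image size and the +3/-25 degree vertical field of view are common conventions for the HDL-64E sensor behind Semantic-KITTI, not details taken from this paper, and all names are illustrative.

```cuda
// Hedged sketch: project (x, y, z) points into a range image; the nearest
// point wins each pixel via atomicMin on the range's bit pattern (valid
// because positive IEEE-754 floats order the same as their unsigned bits).
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>
#include <vector>

__global__ void project_kernel(const float4* pts,        // (x, y, z, intensity)
                               unsigned int* depth_bits, // per-pixel encoded range
                               int n_pts, int H, int W)
{
    const float fov_up   =   3.0f * (float)M_PI / 180.0f;  // HDL-64E-style FOV (assumed)
    const float fov_down = -25.0f * (float)M_PI / 180.0f;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n_pts) return;
    float4 p = pts[i];
    float r = sqrtf(p.x * p.x + p.y * p.y + p.z * p.z);
    if (r < 1e-3f) return;
    float yaw   = atan2f(p.y, p.x);   // azimuth in [-pi, pi]
    float pitch = asinf(p.z / r);     // elevation
    int col = (int)(0.5f * (1.0f - yaw / (float)M_PI) * W);
    int row = (int)((1.0f - (pitch - fov_down) / (fov_up - fov_down)) * H);
    if (col < 0 || col >= W || row < 0 || row >= H) return;
    atomicMin(&depth_bits[row * W + col], __float_as_uint(r));
}

int main() {
    const int H = 64, W = 2048, n = 100000;
    std::vector<float4> pts(n);
    for (int i = 0; i < n; ++i) {     // synthetic ring of points for the demo
        float a = 2.0f * (float)M_PI * i / n;
        pts[i] = make_float4(10.0f * cosf(a), 10.0f * sinf(a), -1.0f, 0.5f);
    }
    float4* dp; unsigned int* dd;
    cudaMalloc(&dp, n * sizeof(float4));
    cudaMalloc(&dd, H * W * sizeof(unsigned int));
    cudaMemcpy(dp, pts.data(), n * sizeof(float4), cudaMemcpyHostToDevice);
    cudaMemset(dd, 0xFF, H * W * sizeof(unsigned int));  // "empty" = max bits
    project_kernel<<<(n + 255) / 256, 256>>>(dp, dd, n, H, W);
    cudaDeviceSynchronize();
    std::vector<unsigned int> img(H * W);
    cudaMemcpy(img.data(), dd, H * W * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    int filled = 0;
    for (unsigned int v : img) filled += (v != 0xFFFFFFFFu);
    printf("filled pixels: %d / %d\n", filled, H * W);
    cudaFree(dp); cudaFree(dd);
    return 0;
}
```

Once the sweep is in image form, the segmentation network itself is plain 2D convolution, which is what makes both the GPU and the NVDLA FPGA targets practical.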


Author(s):  
Omer Anjum ◽  
Mohammad Almasri ◽  
Simon Garcia de Gonzalo ◽  
Wen-mei Hwu

2021 ◽  
Author(s):  
Jiamian Huang ◽  
Yasuaki Ito ◽  
Koji Nakano

2021 ◽  
Author(s):  
Abdulrahman Manea

Abstract Due to its simplicity, adaptability, and applicability to various grid formats, the restriction-smoothed basis multiscale method (MsRSB) (Møyner and Lie 2016) has received wide attention and has been extended to various flow problems in porous media. Unlike standard multiscale methods, MsRSB relies on iterative smoothing to find the multiscale basis functions in an adaptive manner, giving it the ability to adjust naturally to the complex grid orientations often encountered in real-life industrial applications. In this work, we investigate the scalability of MsRSB on various state-of-the-art parallel architectures, including multi-core systems and GPUs. While MsRSB, like most other multiscale methods, is directly amenable to parallelization, its dependence on a smoother to find the basis functions creates unique control- and data-flow patterns that require careful design and implementation to achieve good scalability in parallel environments. We extend the work on parallel multiscale methods in Manea et al. (2016) and Manea and Almani (2019) to map the special kernels of MsRSB to shared-memory multi-core and GPU architectures. The scalability of our optimized parallel MsRSB implementation is demonstrated using highly heterogeneous 3D problems derived from the SPE10 benchmark (Christie and Blunt 2001), ranging in size from millions to tens of millions of cells. The multi-core implementation is benchmarked on a shared-memory architecture with two Intel Cascade Lake Xeon Gold 6246 CPUs, while the GPU implementation is benchmarked on a massively parallel architecture consisting of NVIDIA Volta V100 GPUs. We compare the multi-core and GPU implementations for both the setup and solution stages. To the best of our knowledge, this is the first parallel implementation and demonstration of the versatile MsRSB method on the GPU architecture.
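The "iterative smoothing" at the heart of MsRSB is a damped Jacobi step, p <- p - omega * D^{-1} A p, applied to each basis function but restricted to that function's support region. The CUDA sketch below shows that kernel pattern under stated simplifications: a constant-coefficient five-point Laplacian stands in for the heterogeneous transmissibility-based system matrix, a single basis function is smoothed for a fixed number of sweeps, and the partition-of-unity renormalisation across neighbouring basis functions that full MsRSB performs after each sweep is omitted. All names are illustrative.

```cuda
// Hedged sketch: restricted damped-Jacobi smoothing of one MsRSB-style
// basis function on a 2D grid, with the update masked to a support region.
#include <cuda_runtime.h>
#include <cstdio>
#include <utility>
#include <vector>

__global__ void jacobi_smooth(const float* p_in, float* p_out,
                              const unsigned char* support,
                              int nx, int ny, float omega)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    if (i >= nx || j >= ny) return;
    int idx = j * nx + i;
    if (!support[idx]) { p_out[idx] = 0.0f; return; }  // update only inside the support
    // Five-point Laplacian A*p (constant coefficients stand in for transmissibilities).
    float c = p_in[idx], ap = 0.0f;
    int diag = 0;
    if (i > 0)      { ap += c - p_in[idx - 1];  ++diag; }
    if (i < nx - 1) { ap += c - p_in[idx + 1];  ++diag; }
    if (j > 0)      { ap += c - p_in[idx - nx]; ++diag; }
    if (j < ny - 1) { ap += c - p_in[idx + nx]; ++diag; }
    p_out[idx] = c - omega * ap / (float)diag;  // damped Jacobi: p - omega * D^{-1} A p
}

int main() {
    const int nx = 64, ny = 64, N = nx * ny;
    std::vector<float> p(N, 0.0f);
    std::vector<unsigned char> sup(N, 0);
    for (int j = 0; j < ny; ++j)
        for (int i = 0; i < nx; ++i) {
            int idx = j * nx + i;
            sup[idx] = (i >= 8 && i < 56 && j >= 8 && j < 56);    // support region
            p[idx]   = (i >= 24 && i < 40 && j >= 24 && j < 40);  // indicator of the coarse block
        }
    float *d_a, *d_b; unsigned char* d_s;
    cudaMalloc(&d_a, N * sizeof(float));
    cudaMalloc(&d_b, N * sizeof(float));
    cudaMalloc(&d_s, N);
    cudaMemcpy(d_a, p.data(), N * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_s, sup.data(), N, cudaMemcpyHostToDevice);
    dim3 block(16, 16), grid((nx + 15) / 16, (ny + 15) / 16);
    for (int it = 0; it < 100; ++it) {            // fixed sweep count for the sketch
        jacobi_smooth<<<grid, block>>>(d_a, d_b, d_s, nx, ny, 0.66f);
        std::swap(d_a, d_b);                      // ping-pong buffers between sweeps
    }
    cudaMemcpy(p.data(), d_a, N * sizeof(float), cudaMemcpyDeviceToHost);
    printf("basis at block centre: %f\n", p[32 * nx + 32]);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_s);
    return 0;
}
```

Ping-ponging two device buffers avoids a race between reading p_in and writing p_out within a sweep; the support mask is what gives MsRSB its distinctive control- and data-flow pattern compared with a plain smoother.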


2021 ◽  
Author(s):  
John D. Bartlett ◽  
Duane Storti

Abstract The rapid development of parallelization technology over recent decades has provided a promising avenue for accelerating meshfree simulation methods. One such method, peridynamics, is particularly well suited to parallelization due to the simplicity of the operations that must occur at each material point. However, while MPI-based (Message Passing Interface; CPU-based) parallelization of peridynamic problems is commonplace, GPU parallelization of peridynamics has received far less attention. While GPU technology may once have been an inferior option to MPI parallelization for peridynamics, modern GPU cards are more than capable of handling substantial peridynamics problems. This paper presents a parallelization of the peridynamic method for single-card GPU computing, providing a schematic for a compact parallel approach. The resulting method is tested with CUDA on an NVIDIA Tesla P100 card with 16 GB of memory. The per-node memory requirements of each data structure are evaluated, as are the per-node execution times of each operation in a million-node benchmark test. This setup is shown to provide speedup factors of over 200 for problems sized up to several million nodes, indicating that such a GPU is more than adequate for single-card parallelization of the peridynamic method.
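The per-point simplicity the abstract refers to is visible in the force loop of bond-based peridynamics: each material point just accumulates pairwise bond forces over its neighbours within the horizon, so one thread per node is a natural GPU mapping. A hedged CUDA sketch follows; the CSR-style neighbour list, the PMB-style constant micromodulus c (with nodal volumes folded in for brevity), and all names are illustrative assumptions, not the paper's data structures.

```cuda
// Hedged sketch: bond-based peridynamic force accumulation, one thread per
// material point, neighbours stored as a flat CSR-style list.
#include <cuda_runtime.h>
#include <cmath>
#include <cstdio>
#include <vector>

__global__ void bond_forces(const float3* __restrict__ x0,  // reference positions
                            const float3* __restrict__ u,   // displacements
                            const int* __restrict__ nbr,    // flat neighbour indices
                            const int* __restrict__ off,    // CSR offsets, length n+1
                            float3* __restrict__ f, int n, float c)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float3 fi = make_float3(0.f, 0.f, 0.f);
    for (int k = off[i]; k < off[i + 1]; ++k) {
        int j = nbr[k];
        // Reference bond and deformed bond vectors.
        float bx = x0[j].x - x0[i].x, by = x0[j].y - x0[i].y, bz = x0[j].z - x0[i].z;
        float ex = bx + u[j].x - u[i].x, ey = by + u[j].y - u[i].y, ez = bz + u[j].z - u[i].z;
        float L0 = sqrtf(bx * bx + by * by + bz * bz);
        float L  = sqrtf(ex * ex + ey * ey + ez * ez);
        float s  = (L - L0) / L0;   // bond stretch
        float m  = c * s / L;       // PMB-style force density along the deformed bond
        fi.x += m * ex; fi.y += m * ey; fi.z += m * ez;
    }
    f[i] = fi;
}

int main() {
    const int n = 1000;
    const float dx = 1e-3f, horizon = 3.015f * dx, c = 1.0e9f;  // illustrative constants
    std::vector<float3> x0(n), u(n);
    for (int i = 0; i < n; ++i) {
        x0[i] = make_float3(i * dx, 0.f, 0.f);          // 1D chain of nodes
        u[i]  = make_float3(1e-4f * i * dx, 0.f, 0.f);  // small uniform stretch
    }
    // Build the CSR neighbour list on the host (brute force is fine at this size).
    std::vector<int> off(n + 1, 0), nbr;
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j)
            if (j != i && fabsf(x0[j].x - x0[i].x) <= horizon) nbr.push_back(j);
        off[i + 1] = (int)nbr.size();
    }
    float3 *dx0, *du, *df; int *dn, *doff;
    cudaMalloc(&dx0, n * sizeof(float3)); cudaMalloc(&du, n * sizeof(float3));
    cudaMalloc(&df, n * sizeof(float3));
    cudaMalloc(&dn, nbr.size() * sizeof(int)); cudaMalloc(&doff, (n + 1) * sizeof(int));
    cudaMemcpy(dx0, x0.data(), n * sizeof(float3), cudaMemcpyHostToDevice);
    cudaMemcpy(du, u.data(), n * sizeof(float3), cudaMemcpyHostToDevice);
    cudaMemcpy(dn, nbr.data(), nbr.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(doff, off.data(), (n + 1) * sizeof(int), cudaMemcpyHostToDevice);
    bond_forces<<<(n + 255) / 256, 256>>>(dx0, du, dn, doff, df, n, c);
    cudaDeviceSynchronize();
    std::vector<float3> f(n);
    cudaMemcpy(f.data(), df, n * sizeof(float3), cudaMemcpyDeviceToHost);
    printf("f[500].x = %e\n", f[500].x);
    cudaFree(dx0); cudaFree(du); cudaFree(df); cudaFree(dn); cudaFree(doff);
    return 0;
}
```

Because each thread writes only its own node's force, no atomics are needed; schemes that evaluate each bond once to halve the work would instead require atomic accumulation into both endpoints.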

