NVIDIA GPU
Recently Published Documents


TOTAL DOCUMENTS: 51 (five years: 25)

H-INDEX: 6 (five years: 1)

2022 ◽  
Vol 19 (1) ◽  
pp. 1-26
Author(s):  
Aditya Ukarande ◽  
Suryakant Patidar ◽  
Ram Rangan

The compute work rasterizer, or GigaThread Engine, of a modern NVIDIA GPU focuses on maximizing compute work occupancy across all streaming multiprocessors in a GPU while retaining design simplicity. In this article, we identify the operational aspects of the GigaThread Engine that help it meet those goals but also lead to less-than-ideal cache locality for texture accesses in 2D compute shaders, which are an important optimization target for gaming applications. We develop three software techniques, namely LargeCTAs, Swizzle, and Agents, to show that it is possible to effectively exploit the texture data working-set overlap intrinsic to 2D compute shaders. We evaluate these techniques on gaming applications across two generations of NVIDIA GPUs, the RTX 2080 and RTX 3080, and find that they are effective on both. The bandwidth savings from all our software techniques on the RTX 2080 are much greater than the savings that baseline execution gains from the inter-generational cache capacity increase between the RTX 2080 and the RTX 3080. Our best-performing technique, Agents, records up to a 4.7% average full-frame speedup by reducing the bandwidth demand of targeted shaders at the L1-L2 and L2-DRAM interfaces by 23% and 32%, respectively, on the latest-generation RTX 3080. These results acutely highlight the sensitivity of cache locality to compute work rasterization order and the importance of locality-aware cooperative thread array (CTA) scheduling for gaming applications.
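
The paper's techniques are not spelled out in the abstract, but the general idea behind a block-index swizzle can be sketched as follows: remap the linear CTA launch order into small 2D supertiles so that consecutively scheduled blocks touch neighboring image regions and share cache lines. This is a minimal sketch of the concept only; the tile sizes, kernel body, and all names are illustrative assumptions, not the authors' implementation.

```cuda
// Hypothetical sketch: remap linearly launched CTAs into 8x8 "supertiles"
// so that consecutively scheduled blocks access neighboring image regions.
// Assumes gridDim.x % TILE_W == 0 and gridDim.y % TILE_H == 0.
#define TILE_W 8
#define TILE_H 8

__global__ void blur2D_swizzled(const float* __restrict__ in,
                                float* __restrict__ out,
                                int width, int height)
{
    // Flatten the launch grid, then re-tile the linear index.
    int linear = blockIdx.y * gridDim.x + blockIdx.x;
    int tilesX = gridDim.x / TILE_W;
    int tile   = linear / (TILE_W * TILE_H);   // which supertile
    int inTile = linear % (TILE_W * TILE_H);   // position within it
    int bx = (tile % tilesX) * TILE_W + inTile % TILE_W;
    int by = (tile / tilesX) * TILE_H + inTile / TILE_W;

    int x = bx * blockDim.x + threadIdx.x;
    int y = by * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Simple 3x3 box filter: neighboring CTAs now share texture cache lines.
    float acc = 0.0f;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            int xx = min(max(x + dx, 0), width  - 1);
            int yy = min(max(y + dy, 0), height - 1);
            acc += in[yy * width + xx];
        }
    out[y * width + x] = acc / 9.0f;
}
```

The remapping is a bijection over the launch grid, so every output pixel is still written exactly once; only the order in which CTAs (and hence their texture footprints) reach the SMs changes.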


Computation ◽  
2021 ◽  
Vol 9 (12) ◽  
pp. 142
Author(s):  
Tair Askar ◽  
Bekdaulet Shukirgaliyev ◽  
Martin Lukac ◽  
Ernazar Abdikamalov

Monte Carlo methods rely on sequences of random numbers to obtain solutions to many problems in science and engineering. In this work, we evaluate the performance of different pseudo-random number generators (PRNGs) from the cuRAND library on a number of modern NVIDIA GPU cards. As a numerical test, we generate pseudo-random number (PRN) sequences and obtain non-uniform distributions using the acceptance-rejection method. We consider GPU, CPU, and hybrid CPU/GPU implementations. For the GPU, we additionally consider two different implementations using the host and device application programming interfaces (APIs). We study how performance depends on implementation parameters, including the number of threads per block and the number of blocks per streaming multiprocessor. To achieve the fastest performance, one has to minimize the time consumed by PRNG seed setup and state updates: seed setup time grows with the number of threads, while state update time shrinks. The fastest performance is therefore achieved by optimally balancing these opposing effects.
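
A minimal device-API sketch of the measured pattern: each thread pays the seed-setup cost once via curand_init, then performs state updates via curand_uniform draws inside an acceptance-rejection loop. The target density f(x) = 3x² on [0,1] and all identifiers are our own illustrative choices, not the paper's benchmark.

```cuda
#include <curand_kernel.h>

// Sketch: per-thread seed setup (curand_init) followed by state updates
// (curand_uniform) to sample f(x) = 3x^2 on [0,1] by acceptance-rejection.
__global__ void sample_ar(curandState* states, float* out, int n,
                          unsigned long long seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n) return;

    // Seed setup: this is the cost that grows with the thread count.
    curand_init(seed, /*subsequence=*/tid, /*offset=*/0, &states[tid]);
    curandState local = states[tid];   // work on a register copy

    float x, u;
    do {                               // acceptance-rejection loop
        x = curand_uniform(&local);    // proposal ~ U(0,1]
        u = curand_uniform(&local);    // acceptance draw
    } while (u > x * x);               // accept with prob f(x)/M, M = 3
    out[tid] = x;

    states[tid] = local;               // persist the updated state
}
```

With uniform proposals and envelope constant M = 3, the acceptance probability is f(x)/M = x², which the loop condition implements directly; each extra rejection costs two further state updates.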


Webology ◽  
2021 ◽  
Vol 18 (Special Issue 04) ◽  
pp. 341-353
Author(s):  
Huda Khurshed Shawkat Aljader

In 5G massive MIMO, the main obstacle to better video communication is limited transmission bandwidth, which in turn affects video conferencing, mobile learning, gaming, graphics applications, and so on. For a versatile system such as a digital camera with web and communication capability, the bandwidth required for communication and storage is a serious concern: large amounts of valuable data must be stored and retrieved efficiently for practical purposes. This work proposes a novel scaling method for 2D Haar wavelet transform based image compression, evaluated by compression ratio (CR) and PSNR. It maintains picture quality and reduces bandwidth usage while achieving a better compression ratio, with the computation accelerated on an NVIDIA GPU.
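
For context, one level of the 2D Haar transform maps each 2x2 pixel block to one average and three detail coefficients, which is what makes the transform embarrassingly parallel on a GPU. The sketch below shows a single analysis level; the subband layout, normalization, and names are our illustrative assumptions, not the paper's implementation.

```cuda
// Illustrative single-level 2D Haar analysis: each thread consumes a 2x2
// pixel block and writes one coefficient into each of the four subbands
// (LL average, HL/LH/HH details), laid out in quadrants of the output.
__global__ void haar2d_level(const float* __restrict__ in,
                             float* __restrict__ out,
                             int width, int height)  // both assumed even
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // subband column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // subband row
    int hw = width / 2, hh = height / 2;
    if (x >= hw || y >= hh) return;

    float a = in[(2 * y)     * width + 2 * x];
    float b = in[(2 * y)     * width + 2 * x + 1];
    float c = in[(2 * y + 1) * width + 2 * x];
    float d = in[(2 * y + 1) * width + 2 * x + 1];

    out[y * width + x]             = (a + b + c + d) * 0.25f; // LL
    out[y * width + x + hw]        = (a - b + c - d) * 0.25f; // HL
    out[(y + hh) * width + x]      = (a + b - c - d) * 0.25f; // LH
    out[(y + hh) * width + x + hw] = (a - b - c + d) * 0.25f; // HH
}
```

Compression then comes from quantizing or discarding small detail coefficients; repeating the kernel on the LL quadrant yields a multi-level decomposition.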


Author(s):  
Martin Svedin ◽  
Steven W. D. Chien ◽  
Gibson Chikafa ◽  
Niclas Jansson ◽  
Artur Podobas
Keyword(s):  

2021 ◽  
Vol 14 (6) ◽  
pp. 3577-3602
Author(s):  
James Shaw ◽  
Georges Kesserwani ◽  
Jeffrey Neal ◽  
Paul Bates ◽  
Mohammad Kazem Sharifian

Abstract. LISFLOOD-FP 8.0 includes second-order discontinuous Galerkin (DG2) and first-order finite-volume (FV1) solvers of the two-dimensional shallow-water equations for modelling a wide range of flows, including rapidly propagating, supercritical flows, shock waves or flows over very smooth surfaces. The solvers are parallelised on multi-core CPU and Nvidia GPU architectures and run existing LISFLOOD-FP modelling scenarios without modification. These new, fully two-dimensional solvers are available alongside the existing local inertia solver (called ACC), which is optimised for multi-core CPUs and integrates with the LISFLOOD-FP sub-grid channel model. The predictive capabilities and computational scalability of the new DG2 and FV1 solvers are studied for two Environment Agency benchmark tests and a real-world fluvial flood simulation driven by rainfall across a 2500 km2 catchment. DG2's second-order-accurate, piecewise-planar representation of topography and flow variables enables predictions on coarse grids that are competitive with FV1 and ACC predictions on 2–4 times finer grids, particularly where river channels are wider than half the grid spacing. Despite the simplified formulation of the local inertia solver, ACC is shown to be spatially second-order-accurate and yields predictions that are close to DG2. The DG2-CPU and FV1-CPU solvers achieve near-optimal scalability up to 16 CPU cores and achieve greater efficiency on grids with fewer than 0.1 million elements. The DG2-GPU and FV1-GPU solvers are most efficient on grids with more than 1 million elements, where the GPU solvers are 2.5–4 times faster than the corresponding 16-core CPU solvers. LISFLOOD-FP 8.0 therefore marks a new step towards operational DG2 flood inundation modelling at the catchment scale. LISFLOOD-FP 8.0 is freely available under the GPL v3 license, with additional documentation and case studies at https://www.seamlesswave.com/LISFLOOD8.0 (last access: 2 June 2021).
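
The ACC solver referenced above is LISFLOOD-FP's local inertia scheme, which follows the local inertial formulation of Bates et al. (2010). As a rough illustration of why it maps well to GPUs, here is a minimal sketch of the x-direction flux update on a structured grid; the array layout, wetting threshold, and all names are our assumptions, not the LISFLOOD-FP 8.0 source.

```cuda
// Sketch of a local-inertia ("ACC"-style) x-flux update after Bates et
// al. (2010). eta = water surface elevation, h = depth, qx = unit-width
// discharge at x-interfaces, n = Manning roughness coefficient.
__global__ void acc_flux_x(const float* eta, const float* h, float* qx,
                           int nx, int ny, float dx, float dt, float n)
{
    const float g = 9.81f;
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // interface column
    int j = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (i >= nx - 1 || j >= ny) return;

    int c = j * nx + i;
    // Depth available to convey flow across the interface.
    float hflow = fmaxf(eta[c], eta[c + 1])
                - fmaxf(eta[c] - h[c], eta[c + 1] - h[c + 1]);
    if (hflow <= 1e-6f) { qx[c] = 0.0f; return; }    // dry interface

    float slope = (eta[c + 1] - eta[c]) / dx;
    float q     = qx[c];
    // Semi-implicit friction keeps the update stable over rough beds.
    qx[c] = (q - g * hflow * dt * slope)
          / (1.0f + g * dt * n * n * fabsf(q) / powf(hflow, 7.0f / 3.0f));
}
```

Every interface update depends only on its two neighboring cells, so the whole flux sweep is one embarrassingly parallel kernel launch per direction; this independence is what lets the GPU solvers outpace the 16-core CPU versions on large grids.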


Sensors ◽  
2021 ◽  
Vol 21 (8) ◽  
pp. 2595
Author(s):  
Balakrishnan Ramalingam ◽  
Abdullah Aamir Hayat ◽  
Mohan Rajesh Elara ◽  
Braulio Félix Gómez ◽  
Lim Yi ◽  
...  

The pavement inspection task, which mainly includes crack and garbage detection, is essential and carried out frequently. Inspection, whether performed by humans or by a dedicated system, can readily be integrated with pavement sweeping machines. This work proposes a deep learning-based pavement inspection framework for a self-reconfigurable robot named Panthera. The semantic segmentation framework SegNet was adopted to separate the pavement region from other objects, and Deep Convolutional Neural Network (DCNN) based object detection was used to detect and localize pavement defects and garbage. Furthermore, a Mobile Mapping System (MMS) was adopted for geotagging the defects. The proposed system was implemented and tested on the Panthera robot, which is equipped with NVIDIA GPU cards. The experimental results on crack and garbage detection show that the proposed technique identifies pavement defects and litter with high accuracy and is suitable for real-time deployment for garbage detection and, eventually, sweeping or cleaning tasks.


2021 ◽  
Vol 13 (6) ◽  
pp. 1077
Author(s):  
Oscar Ferraz ◽  
Vitor Silva ◽  
Gabriel Falcao

Edge applications have evolved into a variety of scenarios that include the acquisition and compression of immense numbers of images captured in remote environments such as satellites and drones, where characteristics such as power have to be properly balanced against constrained memory and parallel computational resources. CCSDS-123 is a standard for lossless compression of multispectral and hyperspectral images used on board satellites and military drones. This work explores the performance and power of three families of low-power heterogeneous NVIDIA Jetson GPU architectures, namely the 128-core Nano, the 256-core TX2, and the 512-core Xavier AGX, by proposing a parallel solution to the CCSDS-123 compressor on embedded systems, reducing development effort compared to the production of dedicated circuits while maintaining low power. This solution parallelizes the predictor on the low-power GPU, while the entropy encoders exploit the heterogeneous multiple CPU cores and the GPU concurrently. We report more than 4.4 GSamples/s for the predictor and up to 6.7 Gb/s for the complete system, requiring less than 11 W and providing an efficiency of 611 Mb/s/W.
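
To give a feel for the per-sample parallelism the predictor exposes, here is a heavily simplified sketch of band-wise prediction. The real CCSDS-123 predictor uses adaptively weighted local differences over several previous bands; this version with a crude fixed local mean, and all identifiers in it, are our illustrative assumptions.

```cuda
// Heavily simplified stand-in for a CCSDS-123-style predictor: each
// thread predicts one sample from causal in-band neighbours plus the
// co-located sample of the previous band, and emits the residual.
// (The standard's adaptive weights and sign/offset handling are omitted.)
__global__ void predict_band(const unsigned short* cur,   // band z
                             const unsigned short* prev,  // band z-1
                             int* residual, int nx, int ny)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= nx || y >= ny) return;

    int idx = y * nx + x;
    // Causal neighbours within the current band (clamped at edges).
    int w  = (x > 0)          ? cur[idx - 1]      : cur[idx];
    int n  = (y > 0)          ? cur[idx - nx]     : cur[idx];
    int nw = (x > 0 && y > 0) ? cur[idx - nx - 1] : n;
    int localMean = (w + n + nw + prev[idx]) / 4;  // crude inter-band term

    // Residuals are small when bands are correlated, so they entropy-code
    // well; the encoder stage consumes this array on the CPU or GPU.
    residual[idx] = (int)cur[idx] - localMean;
}
```

Because lossless prediction reads only original sample values, every residual can be computed independently, which is what the paper exploits by mapping the predictor to the GPU while the entropy encoders run concurrently on the CPU cores.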


2021 ◽  
Author(s):  
Xing Hu ◽  
Ling Liang ◽  
Xiaobing Chen ◽  
Lei Deng ◽  
Yu Ji ◽  
...  

As deep neural networks (DNNs) continue their reach into a wide range of application domains, the architecture of DNN models becomes an increasingly sensitive subject, due to either intellectual property protection or the risk of adversarial attacks. Observing the large gap between architectural exploration and model integrity studies, this paper first presents a formulated schema of model leakage risks. We then propose DeepSniffer, a learning-based model extraction framework that obtains complete model architecture information without any prior knowledge of the victim model. It is robust to architectural and system noise introduced by the complex memory hierarchy and diverse run-time system optimizations. Taking GPU platforms as a showcase, DeepSniffer performs model extraction by learning both the architecture-level execution features of kernels and the inter-layer temporal association information introduced by common DNN design practice. We demonstrate experimentally that DeepSniffer works on an off-the-shelf NVIDIA GPU platform running a variety of DNN models, and that the extracted models directly aid the crafting of adversarial inputs. The DeepSniffer project has been released at https://github.com/xinghu7788/DeepSniffer.
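
As a loose illustration of the kind of per-kernel feature trace such an attack consumes, the sketch below logs one simple architecture-level feature, kernel latency, using CUDA events. This is a cooperative host-side stand-in of our own construction; DeepSniffer itself gathers its signals through side channels rather than instrumentation, and the helper name and interface here are hypothetical.

```cuda
#include <cuda_runtime.h>

// Illustrative helper: time one (victim) kernel launch with CUDA events.
// Each returned latency would be one entry in the feature sequence that a
// learned model maps back to a layer type; real attacks also use signals
// such as read/write volumes, obtained without the victim's cooperation.
float time_kernel(void (*launch)())
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    launch();                          // victim kernel runs here
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;                         // one feature-trace entry
}
```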


2021 ◽  
Vol 48 (3) ◽  
pp. 81-88
Author(s):  
Guin Gilman ◽  
Samuel S. Ogden ◽  
Tian Guo ◽  
Robert J. Walls

In this work, we empirically derive the scheduler's behavior under concurrent workloads for NVIDIA's Pascal, Volta, and Turing microarchitectures. In contrast to past studies that suggest the scheduler uses a round-robin policy to assign thread blocks to streaming multiprocessors (SMs), we instead find that the scheduler chooses the next SM based on the SM's local resource availability. We show how this scheduling policy can lead to significant, and seemingly counter-intuitive, performance degradation; for example, a decrease of one thread per block resulted in a 3.58X increase in execution time for one kernel in our experiments. We hope that our work will be useful for improving the accuracy of GPU simulators and aid in the development of novel scheduling algorithms.
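
One direct way to observe block-to-SM assignment (our illustrative probe, not the authors' harness) is to have each thread block record the SM it runs on by reading the %smid PTX special register; comparing launch order against the recorded SM IDs distinguishes round-robin assignment from availability-driven choices.

```cuda
// Read the ID of the SM executing the current thread via inline PTX.
__device__ unsigned int smid()
{
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));
    return id;
}

// One entry per thread block: which SM did the scheduler pick for it?
__global__ void record_placement(unsigned int* blockToSM)
{
    if (threadIdx.x == 0)
        blockToSM[blockIdx.x] = smid();
}
```

Launching this kernel alongside a concurrent workload and dumping blockToSM exposes the scheduler's choices; a strict round-robin policy would produce a cyclic SM-ID pattern, while resource-driven assignment shows clustering on the least-loaded SMs.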

