SUGAR

Author(s):  
Lishan Yang ◽  
Bin Nie ◽  
Adwait Jog ◽  
Evgenia Smirni

As Graphics Processing Units (GPUs) are becoming a de facto solution for accelerating a wide range of applications, their reliable operation is becoming increasingly important. One of the major challenges in the domain of GPU reliability is to accurately measure GPGPU application error resilience. This challenge stems from the fact that a typical GPGPU application spawns a huge number of threads and then utilizes a large amount of potentially unreliable compute and memory resources available on the GPUs. As the number of possible fault locations can be in the billions, evaluating every fault and examining its effect on the application error resilience is impractical. Instead, application resilience is evaluated via extensive fault injection campaigns that sample the vast fault site space. Typically, the larger the input of the GPGPU application, the longer the experimental campaign. In this work, we devise a methodology, SUGAR (Speeding Up GPGPU Application Resilience Estimation with input sizing), that dramatically speeds up the evaluation of GPGPU application error resilience by judicious input sizing. We show how analyzing a small fraction of the input is sufficient to estimate the application resilience with high accuracy and dramatically reduce the duration of experimentation. Key to our estimation methodology is the discovery of repeating patterns as a function of the input size. Using the well-established fact that error resilience in GPGPU applications is mostly determined by the dynamic instruction count at the thread level, we identify the patterns that allow us to accurately predict application error resilience for arbitrarily large inputs. For the cases that we examine in this paper, this new resilience estimation mechanism provides significant speedups (up to 1336 times, and 97.0 times on average) while keeping estimation errors below 1%.
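The host-side sketch below illustrates the flavor of this idea, not the SUGAR tooling itself: threads are grouped by their dynamic instruction count, a small number of fault-injection trials is run per group, and the per-group outcomes are weighted by group size. The thread profiles, the injectFaultAndCheck stand-in, and the sample size are all hypothetical.

```cuda
// Illustrative sketch only: estimate application resilience from a small
// fault-injection sample per thread group, where threads are grouped by
// their dynamic instruction count. All data and helper names are hypothetical.
#include <cstdio>
#include <cstdlib>
#include <map>
#include <vector>

struct ThreadProfile {
    int threadId;
    long dynInstCount;   // dynamic instructions executed by this thread
};

// Hypothetical stand-in for one fault-injection experiment: returns true if
// the injected fault was masked (application output stayed correct).
bool injectFaultAndCheck(int threadId) {
    return (std::rand() % 100) < 90;   // placeholder outcome
}

int main() {
    // Hypothetical per-thread profiles; in practice these come from profiling.
    std::vector<ThreadProfile> profiles = {
        {0, 1200}, {1, 1200}, {2, 1200}, {3, 4800}, {4, 4800}, {5, 9600}};

    // Group threads by dynamic instruction count: threads in the same group
    // are assumed to exhibit the same (repeating) resilience pattern.
    std::map<long, std::vector<int>> groups;
    for (const auto& p : profiles) groups[p.dynInstCount].push_back(p.threadId);

    const int samplesPerGroup = 2;   // tiny sample instead of billions of sites
    double weightedMasked = 0.0, totalThreads = profiles.size();

    for (const auto& g : groups) {
        const std::vector<int>& threads = g.second;
        int masked = 0, trials = 0;
        for (int i = 0; i < samplesPerGroup && i < (int)threads.size(); ++i, ++trials)
            masked += injectFaultAndCheck(threads[i]) ? 1 : 0;
        double groupRate = trials ? (double)masked / trials : 0.0;
        weightedMasked += groupRate * threads.size();   // weight by group size
    }
    std::printf("estimated masked-fault rate: %.3f\n", weightedMasked / totalThreads);
    return 0;
}
```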

Author(s):  
Michael Commer ◽  
Filipe RNC Maia ◽  
Gregory A Newman

Many geo-scientific applications involve boundary value problems arising in simulating electrostatic and electromagnetic fields for geophysical prospecting and subsurface imaging of electrical resistivity. Modeling complex geological media with three-dimensional finite-difference grids gives rise to large sparse linear systems of equations. For such systems, we have implemented three common iterative Krylov solution methods on graphics processing units and compared their performance with parallel host-based versions. The benchmarks show that the device efficiency improves with increasing grid sizes. The main limitation is currently the available device memory.
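As a minimal sketch of the structure of one such Krylov method (unpreconditioned conjugate gradients), the CUDA program below solves a tiny dense symmetric positive-definite system; the paper's solvers operate on large sparse finite-difference systems, and all kernel and variable names here are illustrative assumptions.

```cuda
// Minimal, illustrative conjugate-gradient solver for a small dense SPD
// system on the GPU; it only shows the Krylov iteration structure.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__global__ void matvec(const float* A, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float s = 0.0f;
        for (int j = 0; j < n; ++j) s += A[i * n + j] * x[j];
        y[i] = s;
    }
}
__global__ void axpy(float a, const float* x, float* y, int n) {  // y += a*x
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += a * x[i];
}
__global__ void xpay(const float* x, float b, float* y, int n) {  // y = x + b*y
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = x[i] + b * y[i];
}
__global__ void dot(const float* x, const float* y, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(out, x[i] * y[i]);
}

static float dotHost(const float* dx, const float* dy, float* dtmp, int n) {
    cudaMemset(dtmp, 0, sizeof(float));
    dot<<<(n + 255) / 256, 256>>>(dx, dy, dtmp, n);
    float h; cudaMemcpy(&h, dtmp, sizeof(float), cudaMemcpyDeviceToHost);
    return h;
}

int main() {
    const int n = 3;
    std::vector<float> A = {4, 1, 0, 1, 3, 0, 0, 0, 2};   // small SPD test matrix
    std::vector<float> b = {1, 2, 3};
    float *dA, *db, *dx, *dr, *dp, *dAp, *dtmp;
    cudaMalloc(&dA, n * n * sizeof(float)); cudaMalloc(&db, n * sizeof(float));
    cudaMalloc(&dx, n * sizeof(float));     cudaMalloc(&dr, n * sizeof(float));
    cudaMalloc(&dp, n * sizeof(float));     cudaMalloc(&dAp, n * sizeof(float));
    cudaMalloc(&dtmp, sizeof(float));
    cudaMemcpy(dA, A.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, b.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dx, 0, n * sizeof(float));                              // x0 = 0
    cudaMemcpy(dr, db, n * sizeof(float), cudaMemcpyDeviceToDevice);   // r = b - A*x0
    cudaMemcpy(dp, dr, n * sizeof(float), cudaMemcpyDeviceToDevice);   // p = r

    float rsold = dotHost(dr, dr, dtmp, n);
    for (int k = 0; k < 100 && rsold > 1e-10f; ++k) {
        matvec<<<(n + 255) / 256, 256>>>(dA, dp, dAp, n);
        float alpha = rsold / dotHost(dp, dAp, dtmp, n);
        axpy<<<(n + 255) / 256, 256>>>(alpha, dp, dx, n);     // x += alpha*p
        axpy<<<(n + 255) / 256, 256>>>(-alpha, dAp, dr, n);   // r -= alpha*A*p
        float rsnew = dotHost(dr, dr, dtmp, n);
        xpay<<<(n + 255) / 256, 256>>>(dr, rsnew / rsold, dp, n);  // p = r + beta*p
        rsold = rsnew;
    }
    std::vector<float> x(n);
    cudaMemcpy(x.data(), dx, n * sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("x = %.4f %.4f %.4f\n", x[0], x[1], x[2]);
    cudaFree(dA); cudaFree(db); cudaFree(dx); cudaFree(dr); cudaFree(dp); cudaFree(dAp); cudaFree(dtmp);
    return 0;
}
```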


2014 ◽  
Vol 1077 ◽  
pp. 118-123 ◽  
Author(s):  
Lubomír Klimeš ◽  
Pavel Charvát ◽  
Milan Ostrý ◽  
Josef Stetina

Phase change materials have a wide range of applications, including thermal energy storage in building structures, solar air collectors, heat storage units and exchangers. Such applications often utilize a commercially produced phase change material enclosed in a thin panel (container) made of aluminum. A parallel 1D heat transfer model of a container with phase change material was developed by means of the control volume and effective heat capacity methods. The parallel implementation in the CUDA computing architecture allows the model to run on graphics processing units, which makes it very fast in comparison to traditional models computed on a single CPU. The paper presents the model implementation and the results of computational benchmarking carried out with high-end and low-end NVIDIA GPUs.
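A minimal sketch of the effective heat capacity idea on the GPU is shown below: an explicit control-volume update on a uniform 1D grid, where the heat capacity is artificially enlarged over the melting range to absorb the latent heat. The material parameters, melting range, and boundary treatment are assumptions for illustration, not the paper's values or implementation.

```cuda
// Illustrative explicit control-volume step for 1D heat conduction with a
// phase change handled by the effective heat capacity method.
#include <cstdio>
#include <utility>
#include <vector>
#include <cuda_runtime.h>

__device__ float effectiveHeatCapacity(float T) {
    const float cSolid = 2000.0f, cLiquid = 2200.0f;         // J/(kg*K), assumed
    const float Tm = 300.0f, dT = 2.0f, latent = 180000.0f;  // melting range, J/kg
    if (T < Tm - dT) return cSolid;
    if (T > Tm + dT) return cLiquid;
    return 0.5f * (cSolid + cLiquid) + latent / (2.0f * dT); // spread latent heat
}

__global__ void step(const float* Told, float* Tnew, int n,
                     float k, float rho, float dx, float dt) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < n - 1) {
        float c = effectiveHeatCapacity(Told[i]);
        float flux = k * (Told[i - 1] - 2.0f * Told[i] + Told[i + 1]) / (dx * dx);
        Tnew[i] = Told[i] + dt * flux / (rho * c);
    } else if (i == 0 || i == n - 1) {
        Tnew[i] = Told[i];   // fixed-temperature boundaries for the sketch
    }
}

int main() {
    const int n = 256;
    const float k = 0.2f, rho = 800.0f, dx = 0.001f, dt = 0.05f;  // assumed values
    std::vector<float> T(n, 290.0f);
    T[0] = 320.0f;                     // hot boundary drives melting
    float *dA, *dB;
    cudaMalloc(&dA, n * sizeof(float)); cudaMalloc(&dB, n * sizeof(float));
    cudaMemcpy(dA, T.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    for (int s = 0; s < 10000; ++s) {
        step<<<(n + 127) / 128, 128>>>(dA, dB, n, k, rho, dx, dt);
        std::swap(dA, dB);             // ping-pong buffers between time steps
    }
    cudaMemcpy(T.data(), dA, n * sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("T[mid] after 10000 steps: %.2f K\n", T[n / 2]);
    cudaFree(dA); cudaFree(dB);
    return 0;
}
```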


Author(s):  
Khalid Adam ◽  
Izzeldin I. Mohd ◽  
Younis Ibrahim

Recently, deep neural networks (DNNs) have been increasingly deployed in various healthcare applications, which are considered safety-critical. Thus, the reliability of these DNN models should be remarkably high, because even a small error in healthcare applications can lead to injury or death. Due to their high computational demands, DNNs are often executed on graphics processing units (GPUs). However, GPUs are reportedly susceptible to soft errors, which is an extremely serious issue in healthcare applications. In this paper, we show how fault injection can provide a deeper understanding of the instruction-level vulnerability of the DenseNet201 model on the GPU, and we analyze the model's most vulnerable instructions. Our results show that the share of injected faults affecting the most vulnerable instruction types (PR, STORE, FADD, FFMA, SETP and LD) can be reduced from 4.42% to 0.14% after applying our mitigation strategy.
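To give intuition for the kind of event such a fault-injection campaign emulates, the sketch below flips a single bit in the result of a floating-point addition on the GPU and prints the corrupted values. It is a toy mechanism for illustration only, not the injection tool used in the paper.

```cuda
// Tiny illustration of a single-bit fault: flip one bit of the destination of
// a hypothetical FADD and observe how the result changes.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void faultyFadd(const float* a, const float* b, float* out, int bit) {
    float r = a[0] + b[0];                 // the instruction under study (FADD)
    unsigned int raw = __float_as_uint(r); // reinterpret the destination register
    raw ^= (1u << bit);                    // inject a single-bit flip
    out[0] = __uint_as_float(raw);
}

int main() {
    float ha = 1.5f, hb = 2.25f, hout;
    float *da, *db, *dout;
    cudaMalloc(&da, sizeof(float)); cudaMalloc(&db, sizeof(float)); cudaMalloc(&dout, sizeof(float));
    cudaMemcpy(da, &ha, sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, &hb, sizeof(float), cudaMemcpyHostToDevice);
    for (int bit = 0; bit < 32; bit += 8) {       // sample a few bit positions
        faultyFadd<<<1, 1>>>(da, db, dout, bit);
        cudaMemcpy(&hout, dout, sizeof(float), cudaMemcpyDeviceToHost);
        std::printf("bit %2d flipped: 1.5 + 2.25 -> %g\n", bit, hout);
    }
    cudaFree(da); cudaFree(db); cudaFree(dout);
    return 0;
}
```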


2018 ◽  
Vol 9 (2) ◽  
pp. 1
Author(s):  
André Luiz Buarque Vieira-e-Silva ◽  
Caio Brito ◽  
Mozart William Almeida ◽  
Veronica Teichrieb

Meshless methods to simulate fluid flows have been steadily evolving through the years, since they are a great alternative for dealing with large deformations, which is where mesh-based methods fail to perform efficiently. A well-known meshless method is the Moving Particle Semi-implicit (MPS) method, which was designed to simulate free-surface, truly incompressible fluid flows. Many variations and refinements of the method's accuracy and precision have been proposed through the years and, in this paper, a reasonably wide literature review is presented together with their theoretical and mathematical explanations. Thanks to these works, the method has proved to be very useful in a wide range of naval and mechanical engineering problems. However, one of its drawbacks is a high computational load with some quite time-consuming functions, which prevents it from being more widely used in Computer Graphics and Virtual Reality applications. Graphics Processing Units (GPUs) provide unprecedented capabilities for scientific computations. To promote GPU acceleration, the solution of the Poisson Pressure Equation was brought into focus. This work benefits from some of the techniques presented in the related work, as well as from the CUDA language, in order to obtain a stable, accurate and GPU-accelerated MPS-based method, which is this work's main contribution. It is shown that the GPU version of the method can perform approximately 6 to 10 times faster than the CPU version with the same reliability, both extended to three dimensions. Lastly, a simulation containing a total of 62,600 particles is fully rendered in 3D.
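The sketch below shows one simple way the Poisson Pressure Equation (PPE) step can be mapped onto the GPU: a Jacobi relaxation over particles whose Laplacian coefficients are stored in a CSR-like neighbor list, with the free-surface condition imposed by pinning surface particles to zero pressure. The tiny particle chain, unit weights, and source term are placeholders; a real MPS implementation would use kernel-weighted coefficients obtained from the neighbor search.

```cuda
// Illustrative Jacobi relaxation for a discretized Poisson Pressure Equation
// over particles. All data below are placeholders for the sake of the sketch.
#include <cstdio>
#include <utility>
#include <vector>
#include <cuda_runtime.h>

__global__ void jacobiStep(const int* nbrStart, const int* nbrIdx, const float* nbrW,
                           const int* isSurface, const float* rhs,
                           const float* pOld, float* pNew, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (isSurface[i]) { pNew[i] = 0.0f; return; }   // free-surface condition: p = 0
    float diag = 0.0f, offSum = 0.0f;
    for (int k = nbrStart[i]; k < nbrStart[i + 1]; ++k) {   // neighbors of particle i
        diag   += nbrW[k];                // Laplacian diagonal: sum of weights
        offSum += nbrW[k] * pOld[nbrIdx[k]];
    }
    // From sum_j w_ij (p_j - p_i) = rhs_i  =>  p_i = (sum_j w_ij p_j - rhs_i) / sum_j w_ij
    pNew[i] = (offSum - rhs[i]) / diag;
}

int main() {
    const int n = 5;                                  // 5 particles in a chain
    std::vector<int>   start = {0, 1, 3, 5, 7, 8};    // CSR-style neighbor offsets
    std::vector<int>   idx   = {1, 0, 2, 1, 3, 2, 4, 3};
    std::vector<float> w(idx.size(), 1.0f);           // unit Laplacian weights
    std::vector<int>   surf  = {1, 0, 0, 0, 1};       // end particles are "free surface"
    std::vector<float> rhs   = {0.0f, -1.0f, -1.0f, -1.0f, 0.0f};
    int *dStart, *dIdx, *dSurf; float *dW, *dRhs, *dOld, *dNew;
    cudaMalloc(&dStart, start.size() * sizeof(int));
    cudaMalloc(&dIdx, idx.size() * sizeof(int));
    cudaMalloc(&dSurf, n * sizeof(int));
    cudaMalloc(&dW, w.size() * sizeof(float));
    cudaMalloc(&dRhs, n * sizeof(float));
    cudaMalloc(&dOld, n * sizeof(float));
    cudaMalloc(&dNew, n * sizeof(float));
    cudaMemcpy(dStart, start.data(), start.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dIdx, idx.data(), idx.size() * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dSurf, surf.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dW, w.data(), w.size() * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dRhs, rhs.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dOld, 0, n * sizeof(float));
    for (int it = 0; it < 200; ++it) {                // fixed iteration count for the sketch
        jacobiStep<<<1, 32>>>(dStart, dIdx, dW, dSurf, dRhs, dOld, dNew, n);
        std::swap(dOld, dNew);
    }
    std::vector<float> p(n);
    cudaMemcpy(p.data(), dOld, n * sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("pressure: %.2f %.2f %.2f %.2f %.2f\n", p[0], p[1], p[2], p[3], p[4]);
    cudaFree(dStart); cudaFree(dIdx); cudaFree(dSurf);
    cudaFree(dW); cudaFree(dRhs); cudaFree(dOld); cudaFree(dNew);
    return 0;
}
```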


2014 ◽  
Vol 23 (08) ◽  
pp. 1430002 ◽  
Author(s):  
SPARSH MITTAL

Initially introduced as special-purpose accelerators for graphics applications, graphics processing units (GPUs) have now emerged as general-purpose computing platforms for a wide range of applications. To address the requirements of these applications, modern GPUs include sizable hardware-managed caches. However, several factors, such as the unique architecture of GPUs and the rise of CPU–GPU heterogeneous computing, demand effective management of caches to achieve high performance and energy efficiency. Recently, several techniques have been proposed for this purpose. In this paper, we survey architectural and system-level techniques proposed for managing and leveraging GPU caches. We also discuss the importance and challenges of cache management in GPUs. The aim of this paper is to provide readers with insights into cache management techniques for GPUs and motivate them to propose even better techniques for leveraging the full potential of caches in the GPUs of tomorrow.


2019 ◽  
Author(s):  
Qianqian Fang ◽  
Shijie Yan

The mesh-based Monte Carlo (MMC) algorithm is increasingly used as the gold standard for developing new biophotonics modeling techniques in 3-D complex tissues, including both diffusion-based and various Monte Carlo (MC) based methods. Compared to multi-layered and voxel-based MCs, MMC can utilize tetrahedral meshes to gain improved anatomical accuracy, but this also results in higher computational and memory demands. Previous attempts at accelerating MMC using graphics processing units (GPUs) have yielded limited performance improvement and are not publicly available. Here we report a highly efficient MMC – MMCL – built on the OpenCL heterogeneous computing framework, and demonstrate a speedup ratio of up to 420× compared to state-of-the-art single-threaded CPU simulations. The MMCL simulator supports almost all advanced features found in our widely disseminated MMC software, such as support for a dozen complex source forms, wide-field detectors, boundary reflection, photon replay, and storing a rich set of detected photon information. Furthermore, this tool supports a wide range of GPUs/CPUs across vendors and is freely available with full source codes and benchmark suites at http://mcx.space/#mmc.


2021 ◽  
Vol 10 (2) ◽  
pp. 917-926
Author(s):  
Viet Tan Vo ◽  
Cheol Hong Kim

This study analyzes the efficiency of parallel computational applications on recent graphics processing units (GPUs). We investigate the impact of the additional resources in the recent architecture on popular benchmarks compared with the previous architecture. Our simulation results demonstrate that the Pascal GPU architecture improves performance by 273% on average compared to the older Fermi architecture. To evaluate the performance improvement depending on specific hardware resources, we divide the hardware resources into two types: computing resources and memory resources. Computing resources have a bigger impact on performance improvement than memory resources in most of the benchmarks. For Hotspot and B+ tree, the architecture adopting only enhanced computing resources achieves performance gains similar to the architecture adopting both computing and memory resources. We also evaluate the influence of the number of warp schedulers in the SM (Streaming Multiprocessor) on GPU performance, in relation to barrier waiting time. Based on these analyses, we propose a development direction for future generations of GPUs.


2012 ◽  
Vol 23 (08) ◽  
pp. 1240002 ◽  
Author(s):  
MARTIN WEIGEL

The use of graphics processing units (GPUs) in scientific computing has gathered considerable momentum in the past five years. While GPUs in general promise high performance and excellent performance-per-Watt ratios, not every class of problems is equally well suited to exploiting the massively parallel architecture they provide. Lattice spin models appear to be prototypic examples of problems suited to this architecture, at least as long as local update algorithms are employed. In this review, I summarize our recent experience with the simulation of a wide range of spin models on GPUs, employing an equally wide range of update algorithms, ranging from Metropolis and heat-bath updates, through cluster algorithms, to generalized-ensemble simulations.
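As a concrete example of the local-update case, the sketch below implements a checkerboard Metropolis sweep for the 2D Ising model on the GPU, updating one sublattice per kernel launch so that no two neighboring spins are modified concurrently. The lattice size, temperature, and the simple per-site linear congruential generator are illustrative choices, not the setup used in the review.

```cuda
// Minimal checkerboard Metropolis sweep for the 2D Ising model (J = 1).
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

__device__ unsigned int lcg(unsigned int& s) {       // tiny per-site RNG (illustrative)
    s = 1664525u * s + 1013904223u;
    return s;
}

__global__ void metropolisSweep(int* spin, unsigned int* rng, int L, float beta, int parity) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= L || y >= L || ((x + y) & 1) != parity) return;  // update one sublattice
    int i = y * L + x;
    int right = y * L + (x + 1) % L, left = y * L + (x + L - 1) % L;
    int up = ((y + 1) % L) * L + x, down = ((y + L - 1) % L) * L + x;
    int nn = spin[right] + spin[left] + spin[up] + spin[down];
    float dE = 2.0f * spin[i] * nn;                  // energy change of flipping spin i
    unsigned int s = rng[i];
    float u = (lcg(s) & 0xFFFFFF) / 16777216.0f;     // uniform in [0,1)
    rng[i] = s;
    if (dE <= 0.0f || u < __expf(-beta * dE)) spin[i] = -spin[i];  // Metropolis acceptance
}

int main() {
    const int L = 64;
    const float beta = 0.44f;                        // near the 2D Ising critical point
    std::vector<int> spin(L * L, 1);
    std::vector<unsigned int> seed(L * L);
    for (int i = 0; i < L * L; ++i) seed[i] = 1234u + i;
    int* dSpin; unsigned int* dRng;
    cudaMalloc(&dSpin, L * L * sizeof(int));
    cudaMalloc(&dRng, L * L * sizeof(unsigned int));
    cudaMemcpy(dSpin, spin.data(), L * L * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(dRng, seed.data(), L * L * sizeof(unsigned int), cudaMemcpyHostToDevice);
    dim3 block(16, 16), grid((L + 15) / 16, (L + 15) / 16);
    for (int sweep = 0; sweep < 1000; ++sweep) {     // checkerboard: two half-sweeps
        metropolisSweep<<<grid, block>>>(dSpin, dRng, L, beta, 0);
        metropolisSweep<<<grid, block>>>(dSpin, dRng, L, beta, 1);
    }
    cudaMemcpy(spin.data(), dSpin, L * L * sizeof(int), cudaMemcpyDeviceToHost);
    long m = 0; for (int s : spin) m += s;
    std::printf("magnetization per spin: %.3f\n", (double)m / (L * L));
    cudaFree(dSpin); cudaFree(dRng);
    return 0;
}
```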


2020 ◽  
Vol 12 (8) ◽  
pp. 1257 ◽  
Author(s):  
Mercedes E. Paoletti ◽  
Juan M. Haut ◽  
Xuanwen Tao ◽  
Javier Plaza Miguel ◽  
Antonio Plaza

The storage and processing of remotely sensed hyperspectral images (HSIs) is facing important challenges due to the computational requirements involved in the analysis of these images, characterized by continuous and narrow spectral channels. Although HSIs offer many opportunities for accurately modeling and mapping the surface of the Earth in a wide range of applications, they comprise massive data cubes. These huge amounts of data impose important requirements from the storage and processing points of view. The support vector machine (SVM) has been one of the most powerful machine learning classifiers, able to process HSI data without applying previous feature extraction steps, exhibiting robust behaviour with high-dimensional data and obtaining high classification accuracies. Nevertheless, the training and prediction stages of this supervised classifier are very time-consuming, especially for large and complex problems that require an intensive use of memory and computational resources. This paper develops a new, highly efficient implementation of SVMs that exploits the high computational power of graphics processing units (GPUs) to reduce the execution time by massively parallelizing the operations of the algorithm while performing efficient memory management during data reading and writing operations. Our experiments, conducted over different HSI benchmarks, demonstrate the efficiency of our GPU implementation.


2015 ◽  
Vol 04 (01n02) ◽  
pp. 1550002
Author(s):  
A. Magro ◽  
K. Zarb Adami ◽  
J. Hickish

Graphics processing units (GPU)-based beamforming is a relatively unexplored area in radio astronomy, possibly due to the assumption that any such system will be severely limited by the PCIe bandwidth required to transfer data to the GPU. We have developed a CUDA-based GPU implementation of a coherent beamformer, specifically designed and optimized for deployment at the BEST-2 array, which can generate an arbitrary number of synthesized beams for a wide range of parameters. It achieves [Formula: see text] TFLOPs on an NVIDIA Tesla K20, approximately 10x faster than an optimized, multithreaded CPU implementation. This kernel has been integrated into two real-time, GPU-based time-domain software pipelines deployed at the BEST-2 array in Medicina: a standalone beamforming pipeline and a transient detection pipeline. We present performance benchmarks for the beamforming kernel and for the transient detection pipeline with beamforming capabilities, as well as results of test observations.
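The core of a coherent beamformer is a weighted sum over antennas for every beam and time sample. The skeleton below shows that structure in CUDA using cuComplex arithmetic; the array sizes, weight values, and data layout are assumptions for illustration and do not reflect the BEST-2 deployment.

```cuda
// Skeleton of a coherent beamforming kernel: for each beam and time sample,
// sum the antenna voltages after applying per-antenna complex weights.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cuComplex.h>

__global__ void beamform(const cuFloatComplex* samples,   // [antenna][time]
                         const cuFloatComplex* weights,   // [beam][antenna]
                         cuFloatComplex* beams,            // [beam][time]
                         int nAnt, int nTime, int nBeam) {
    int t = blockIdx.x * blockDim.x + threadIdx.x;   // time sample
    int b = blockIdx.y;                              // beam index
    if (t >= nTime || b >= nBeam) return;
    cuFloatComplex acc = make_cuFloatComplex(0.0f, 0.0f);
    for (int a = 0; a < nAnt; ++a)                   // weighted sum over antennas
        acc = cuCaddf(acc, cuCmulf(weights[b * nAnt + a], samples[a * nTime + t]));
    beams[b * nTime + t] = acc;
}

int main() {
    const int nAnt = 32, nTime = 1024, nBeam = 4;    // illustrative sizes
    std::vector<cuFloatComplex> samples(nAnt * nTime, make_cuFloatComplex(1.0f, 0.0f));
    std::vector<cuFloatComplex> weights(nBeam * nAnt, make_cuFloatComplex(1.0f / nAnt, 0.0f));
    cuFloatComplex *dS, *dW, *dB;
    cudaMalloc(&dS, samples.size() * sizeof(cuFloatComplex));
    cudaMalloc(&dW, weights.size() * sizeof(cuFloatComplex));
    cudaMalloc(&dB, nBeam * nTime * sizeof(cuFloatComplex));
    cudaMemcpy(dS, samples.data(), samples.size() * sizeof(cuFloatComplex), cudaMemcpyHostToDevice);
    cudaMemcpy(dW, weights.data(), weights.size() * sizeof(cuFloatComplex), cudaMemcpyHostToDevice);
    dim3 block(256), grid((nTime + 255) / 256, nBeam);
    beamform<<<grid, block>>>(dS, dW, dB, nAnt, nTime, nBeam);
    std::vector<cuFloatComplex> out(nBeam * nTime);
    cudaMemcpy(out.data(), dB, out.size() * sizeof(cuFloatComplex), cudaMemcpyDeviceToHost);
    std::printf("beam 0, sample 0: (%.2f, %.2f)\n", cuCrealf(out[0]), cuCimagf(out[0]));
    cudaFree(dS); cudaFree(dW); cudaFree(dB);
    return 0;
}
```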

