Architecture exploration of recent GPUs to analyze the efficiency of hardware resources

This study analyzes the efficiency of parallel computational applications with the adoption of recent graphics processing units (GPUs). We investigate the impacts of the additional resources of recent architecture on the popular benchmarks compared with previous architecture. Our simulation results demonstrate that Pascal GPU architecture improves the performance by 273% on average compared to old-fashioned Fermi architecture. To evaluate the performance improvement depending on specific hardware resources, we divide the hardware resources into two types: computing and memory resources. Computing resources have bigger impact on performance improvement than memory resources in most of benchmarks. For Hotspot and B+ tree, the architecture adopting only enhanced computing resources can achieve similar performance gains of the architecture adopting both computing and memory resources. We also evaluate the influence of the number of warp schedulers in the SM (Streaming Multiprocessor) to the GPU performance in relationship with barrier waiting time. Based on these analyses, we propose the development direction for the future generation of GPUs.

Download Full-text

Iterative Krylov solution methods for geophysical electromagnetic simulations on throughput-oriented processing units

The International Journal of High Performance Computing Applications ◽

10.1177/1094342011428145 ◽

2011 ◽

Vol 26 (4) ◽

pp. 378-385 ◽

Cited By ~ 4

Author(s):

Michael Commer ◽

Filipe RNC Maia ◽

Gregory A Newman

Keyword(s):

Graphics Processing Units ◽

Three Dimensional ◽

Subsurface Imaging ◽

Systems Of Equations ◽

Sparse Linear Systems ◽

Electromagnetic Simulations ◽

Solution Methods ◽

Memory Resources ◽

Graphics Processing ◽

Linear Systems Of Equations

Many geo-scientific applications involve boundary value problems arising in simulating electrostatic and electromagnetic fields for geophysical prospecting and subsurface imaging of electrical resistivity. Modeling complex geological media with three-dimensional finite-difference grids gives rise to large sparse linear systems of equations. For such systems, we have implemented three common iterative Krylov solution methods on graphics processing units and compared their performance with parallel host-based versions. The benchmarks show that the device efficiency improves with increasing grid sizes. Limitations are currently given by the device memory resources.

Download Full-text

PERFORMANCE IMPROVEMENT OF MULTICHANNEL AUDIO BY GRAPHICS PROCESSING UNITS

10.4995/thesis/10251/40651 ◽

2014 ◽

Author(s):

José Antonio Belloch Rodríguez

Keyword(s):

Performance Improvement ◽

Graphics Processing Units ◽

Multichannel Audio ◽

Graphics Processing

Download Full-text

MadFlow: automating Monte Carlo simulation on GPU for particle physics processes

The European Physical Journal C ◽

10.1140/epjc/s10052-021-09443-8 ◽

2021 ◽

Vol 81 (7) ◽

Author(s):

Stefano Carrazza ◽

Juan Cruz-Martinez ◽

Marco Rossi ◽

Marco Zaro

Keyword(s):

Monte Carlo ◽

Particle Physics ◽

Graphics Processing Units ◽

Hardware Acceleration ◽

Leading Order ◽

Event Simulation ◽

Multiple Processes ◽

Analytic Expressions ◽

Simulation Results ◽

Graphics Processing

AbstractWe present , a first general multi-purpose framework for Monte Carlo (MC) event simulation of particle physics processes designed to take full advantage of hardware accelerators, in particular, graphics processing units (GPUs). The automation process of generating all the required components for MC simulation of a generic physics process and its deployment on hardware accelerator is still a big challenge nowadays. In order to solve this challenge, we design a workflow and code library which provides to the user the possibility to simulate custom processes through the MadGraph5_aMC@NLO framework and a plugin for the generation and exporting of specialized code in a GPU-like format. The exported code includes analytic expressions for matrix elements and phase space. The simulation is performed using the VegasFlow and PDFFlow libraries which deploy automatically the full simulation on systems with different hardware acceleration capabilities, such as multi-threading CPU, single-GPU and multi-GPU setups. The package also provides an asynchronous unweighted events procedure to store simulation results. Crucially, although only Leading Order is automatized, the library provides all ingredients necessary to build full complex Monte Carlo simulators in a modern, extensible and maintainable way. We show simulation results at leading-order for multiple processes on different hardware configurations.

Download Full-text

Method for Adaptation of Algorithms to GPU Architecture

10.20948/graphicon-2021-3027-930-941 ◽

2021 ◽

Author(s):

Vadim Bulavintsev ◽

Dmitry Zhdanov

Keyword(s):

Graphics Processing Units ◽

Search Algorithm ◽

Boolean Satisfiability ◽

Control Flow ◽

Code Optimization ◽

Search Performance ◽

Backtracking Search ◽

Boolean Satisfiability Problem ◽

Graphics Processing ◽

Gpu Architecture

We propose a generalized method for adapting and optimizing algorithms for efficient execution on modern graphics processing units (GPU). The method consists of several steps. First, build a control flow graph (CFG) of the algorithm. Next, transform the CFG into a tree of loops and merge non-parallelizable loops into parallelizable ones. Finally, map the resulting loops tree to the tree of GPU computational units, unrolling the algorithm’s loops as necessary for the match. The mapping should be performed bottom-up, from the lowest GPU architecture levels to the highest ones, to minimize off-chip memory access and maximize register file usage. The method provides programmer with a convenient and robust mental framework and strategy for GPU code optimization. We demonstrate the method by adapting to a GPU the DPLL backtracking search algorithm for solving the Boolean satisfiability problem (SAT). The resulting GPU version of DPLL outperforms the CPU version in raw tree search performance sixfold for regular Boolean satisfiability problems and twofold for irregular ones.

Download Full-text

Utilizing the Double-Precision Floating-Point Computing Power of GPUs for RSA Acceleration

Security and Communication Networks ◽

10.1155/2017/3508786 ◽

2017 ◽

Vol 2017 ◽

pp. 1-15 ◽

Cited By ~ 1

Author(s):

Jiankuo Dong ◽

Fangyu Zheng ◽

Wuqiong Pan ◽

Jingqiang Lin ◽

Jiwu Jing ◽

...

Keyword(s):

Graphics Processing Units ◽

High Performance ◽

Chinese Remainder Theorem ◽

General Purpose ◽

Floating Point ◽

Double Precision ◽

Computing Power ◽

Cryptographic Algorithm ◽

Graphics Processing ◽

Gpu Architecture

Asymmetric cryptographic algorithm (e.g., RSA and Elliptic Curve Cryptography) implementations on Graphics Processing Units (GPUs) have been researched for over a decade. The basic idea of most previous contributions is exploiting the highly parallel GPU architecture and porting the integer-based algorithms from general-purpose CPUs to GPUs, to offer high performance. However, the great potential cryptographic computing power of GPUs, especially by the more powerful floating-point instructions, has not been comprehensively investigated in fact. In this paper, we fully exploit the floating-point computing power of GPUs, by various designs, including the floating-point-based Montgomery multiplication/exponentiation algorithm and Chinese Remainder Theorem (CRT) implementation in GPU. And for practical usage of the proposed algorithm, a new method is performed to convert the input/output between octet strings and floating-point numbers, fully utilizing GPUs and further promoting the overall performance by about 5%. The performance of RSA-2048/3072/4096 decryption on NVIDIA GeForce GTX TITAN reaches 42,211/12,151/5,790 operations per second, respectively, which achieves 13 times the performance of the previous fastest floating-point-based implementation (published in Eurocrypt 2009). The RSA-4096 decryption precedes the existing fastest integer-based result by 23%.

Download Full-text

SUGAR

Proceedings of the ACM on Measurement and Analysis of Computing Systems ◽

10.1145/3447375 ◽

2021 ◽

Vol 5 (1) ◽

pp. 1-29

Author(s):

Lishan Yang ◽

Bin Nie ◽

Adwait Jog ◽

Evgenia Smirni

Keyword(s):

Graphics Processing Units ◽

Fault Injection ◽

Error Resilience ◽

Reliable Operation ◽

Estimation Errors ◽

Input Size ◽

Wide Range ◽

Established Fact ◽

Memory Resources ◽

Graphics Processing

As Graphics Processing Units (GPUs) are becoming a de facto solution for accelerating a wide range of applications, their reliable operation is becoming increasingly important. One of the major challenges in the domain of GPU reliability is to accurately measure GPGPU application error resilience. This challenge stems from the fact that a typical GPGPU application spawns a huge number of threads and then utilizes a large amount of potentially unreliable compute and memory resources available on the GPUs. As the number of possible fault locations can be in the billions, evaluating every fault and examining its effect on theapplication error resilience is impractical. Application resilience is evaluated via extensive fault injection campaigns based on sampling of an extensive fault site space. Typically, the larger the input of the GPGPU application, the longer the experimental campaign. In this work, we devise a methodology, SUGAR (Speeding Up GPGPU Application Resilience Estimation with input sizing), that dramatically speeds up the evaluation of GPGPU application error resilience by judicious input sizing. We show how analyzing a small fraction of the input is sufficient to estimate the application resilience with high accuracy and dramatically reduce the duration of experimentation. Key of our estimation methodology is the discovery of repeating patterns as a function of the input size. Using the well-established fact that error resilience in GPGPU applications is mostly determined by the dynamic instruction count at the thread level, we identify the patterns that allow us to accurately predict application error resilience for arbitrarily large inputs. For the cases that we examine in this paper, this new resilience estimation mechanism provides significant speedups (up to 1336 times) and 97.0 on the average, while keeping estimation errors to less than 1%.

Download Full-text