Method for Adaptation of Algorithms to GPU Architecture

Author(s):  
Vadim Bulavintsev ◽  
Dmitry Zhdanov

We propose a generalized method for adapting and optimizing algorithms for efficient execution on modern graphics processing units (GPUs). The method consists of several steps. First, build a control flow graph (CFG) of the algorithm. Next, transform the CFG into a tree of loops and merge non-parallelizable loops into parallelizable ones. Finally, map the resulting loop tree onto the tree of GPU computational units, unrolling the algorithm's loops as necessary to make the two match. The mapping should be performed bottom-up, from the lowest levels of the GPU architecture to the highest, to minimize off-chip memory access and maximize register file usage. The method provides the programmer with a convenient and robust mental framework and strategy for GPU code optimization. We demonstrate the method by adapting the DPLL backtracking search algorithm for the Boolean satisfiability problem (SAT) to a GPU. The resulting GPU version of DPLL outperforms the CPU version in raw tree search performance sixfold for regular Boolean satisfiability problems and twofold for irregular ones.
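For readers unfamiliar with DPLL, the following is a minimal sequential sketch of the backtracking search the abstract refers to. It is a plain CPU illustration and not the authors' GPU implementation; the names `simplify` and `dpll` and the DIMACS-style integer encoding of literals are assumptions made for this example.

```python
from typing import Dict, List, Optional

Clause = List[int]  # assumed encoding: 3 means x3, -3 means NOT x3


def simplify(clauses: List[Clause], lit: int) -> Optional[List[Clause]]:
    """Set `lit` to true; return the reduced formula, or None on a conflict."""
    reduced = []
    for clause in clauses:
        if lit in clause:
            continue  # clause is satisfied, drop it
        rest = [l for l in clause if l != -lit]
        if not rest:
            return None  # every literal is false: conflict
        reduced.append(rest)
    return reduced


def dpll(clauses: List[Clause],
         assignment: Optional[Dict[int, bool]] = None) -> Optional[Dict[int, bool]]:
    """Return a satisfying (possibly partial) assignment, or None if unsatisfiable."""
    assignment = dict(assignment or {})
    if any(not c for c in clauses):
        return None  # an empty clause in the input can never be satisfied

    # Unit propagation: a one-literal clause forces that literal.
    while True:
        unit = next((c[0] for c in clauses if len(c) == 1), None)
        if unit is None:
            break
        assignment[abs(unit)] = unit > 0
        clauses = simplify(clauses, unit)
        if clauses is None:
            return None

    if not clauses:
        return assignment  # all clauses satisfied

    # Branch on the first literal of the first clause, then on its negation.
    lit = clauses[0][0]
    for choice in (lit, -lit):
        reduced = simplify(clauses, choice)
        if reduced is not None:
            result = dpll(reduced, {**assignment, abs(choice): choice > 0})
            if result is not None:
                return result
    return None  # both branches failed: backtrack
```

Calling `dpll([[1, 2], [-1, 3], [-2, -3]])` returns a dictionary mapping variable numbers to truth values, and `dpll([[1], [-1]])` returns `None`. The inner loops over clauses and literals are the kind of loops that the method described above would place in a loop tree and map onto GPU computational units.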

2007 ◽  
Vol 4 (2) ◽  
pp. 2-26
Author(s):  
Gernot Gebhard ◽  
Philipp Lucas

Retargeting a compiler's back end to a new architecture is a time-consuming process. This becomes an evident problem in the area of programmable graphics hardware (graphics processing units, GPUs) or embedded processors, where architectural changes are faster than elsewhere. We propose the object-oriented rewrite system OORS to overcome this problem. Using the OORS language, a compiler developer can express the code generation and optimization phase in terms of cost-annotated rewrite rules supporting complex non-linear matching and replacing patterns. Retargetability is achieved by organizing rules into profiles, one for each supported target architecture. Featuring a rule and profile inheritance mechanism, OORS makes the reuse of existing specifications possible. This is an improvement over traditional approaches. Altogether, OORS increases the maintainability of the compiler's back end and thus reduces both the complexity and the effort of the retargeting process. To show the potential of this approach, we have implemented a code generation pattern matcher and a code optimization pattern matcher supporting different target architectures using the OORS language and integrated them into a compiler for a programming language targeting CPUs and GPUs.
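The idea of cost-annotated rewrite rules grouped into inheritable profiles can be illustrated with a small Python analogy. This is not OORS syntax and does not reproduce the OORS matcher; the classes `Rule` and `Profile`, the tuple-based expression trees, the greedy cheapest-rule strategy, and the instruction names `ADD`, `MUL`, and `MAD` are all hypothetical choices made for this sketch.

```python
class Rule:
    """A rewrite rule with a cost; `match` returns a replacement node or None."""
    def __init__(self, name, cost, match):
        self.name, self.cost, self.match = name, cost, match


class Profile:
    """A set of rules for one target; a child profile inherits its parent's rules."""
    def __init__(self, name, rules, parent=None):
        self.name, self.rules, self.parent = name, rules, parent

    def all_rules(self):
        inherited = self.parent.all_rules() if self.parent else []
        return inherited + self.rules


def rewrite(expr, profile):
    """Rewrite bottom-up, applying the cheapest matching rule at each node."""
    if not isinstance(expr, tuple):
        return expr  # leaf operand such as "a"
    expr = (expr[0],) + tuple(rewrite(child, profile) for child in expr[1:])
    candidates = []
    for rule in profile.all_rules():
        replacement = rule.match(expr)
        if replacement is not None:
            candidates.append((rule.cost, replacement))
    if not candidates:
        return expr
    return min(candidates, key=lambda c: c[0])[1]


def match_add(e):
    return ("ADD",) + e[1:] if e[0] == "add" and len(e) == 3 else None


def match_mul(e):
    return ("MUL",) + e[1:] if e[0] == "mul" and len(e) == 3 else None


def match_mad(e):
    # Fuse an add whose first operand is an already-selected multiply.
    if e[0] == "add" and len(e) == 3 and isinstance(e[1], tuple) and e[1][0] == "MUL":
        return ("MAD", e[1][1], e[1][2], e[2])
    return None


# Costs here are the extra cost paid at the node (a simplification for this sketch).
generic = Profile("generic", [Rule("add", 1, match_add), Rule("mul", 1, match_mul)])
gpu = Profile("gpu", [Rule("mad", 0, match_mad)], parent=generic)

tree = ("add", ("mul", "a", "b"), "c")
cpu_code = rewrite(tree, generic)  # ("ADD", ("MUL", "a", "b"), "c")
gpu_code = rewrite(tree, gpu)      # ("MAD", "a", "b", "c")
```

The `gpu` profile only declares the one rule that differs from the generic target and inherits the rest, which mirrors the reuse of existing specifications that the abstract attributes to profile inheritance.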


Energies ◽  
2020 ◽  
Vol 13 (8) ◽  
pp. 2083 ◽  
Author(s):  
Wangqi Xiong ◽  
Jiandong Wang

This paper proposes a parallel grid search algorithm to find an optimal operating point that minimizes the power consumption of an experimental heating, ventilating and air conditioning (HVAC) system. First, a multidimensional, nonlinear and non-convex optimization problem subject to constraints is formulated based on a semi-physical model of the experimental HVAC system. Second, the optimization problem is parallelized on graphics processing units (GPUs) to compute the loss function for all candidate solutions in a search grid simultaneously and to select the solution with the minimum loss as the optimum. The proposed algorithm has the advantage that the returned solution is demonstrably the best one at the current resolution of the search grid. Experimental studies are provided to support the proposed algorithm.
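The structure of an exhaustive grid search can be sketched as follows. This is a sequential stand-in written under stated assumptions and not the paper's GPU code: the functions `linspace`, `toy_loss`, and `grid_search`, the two-variable loss, and the constraint `x + y >= 1` are hypothetical and unrelated to the HVAC model.

```python
import itertools
import math


def linspace(lo, hi, n):
    """Return n evenly spaced values from lo to hi inclusive (n >= 2)."""
    return [lo + (hi - lo) * i / (n - 1) for i in range(n)]


def toy_loss(x, y):
    """Hypothetical non-convex loss; infeasible points get an infinite loss."""
    if x + y < 1.0:  # assumed constraint, for illustration only
        return math.inf
    return (x - 1.0) ** 2 + (y - 2.0) ** 2 + math.sin(3 * x) * math.cos(2 * y)


def grid_search(axes, loss_fn):
    """Evaluate loss_fn at every grid point and return (best_point, best_loss).

    Each evaluation is independent of the others, so a GPU version could
    assign one grid point per thread and finish with a minimum reduction.
    """
    best_point, best_loss = None, math.inf
    for point in itertools.product(*axes):
        value = loss_fn(*point)
        if value < best_loss:
            best_point, best_loss = point, value
    return best_point, best_loss


axes = [linspace(0.0, 3.0, 31), linspace(0.0, 3.0, 31)]
point, value = grid_search(axes, toy_loss)
```

Because every grid point is visited, the returned point is the best one at the chosen resolution, which is the guarantee the abstract emphasizes. Refining the grid improves the resolution at the cost of more loss evaluations.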


2016 ◽  
Vol 41 ◽  
pp. 290-304 ◽  
Author(s):  
Adriane B.S. Serapião ◽  
Guilherme S. Corrêa ◽  
Felipe B. Gonçalves ◽  
Veronica O. Carvalho

2021 ◽  
Vol 10 (2) ◽  
pp. 917-926
Author(s):  
Viet Tan Vo ◽  
Cheol Hong Kim

This study analyzes the efficiency of parallel computational applications on recent graphics processing units (GPUs). We investigate the impact of the additional resources of a recent architecture on popular benchmarks in comparison with a previous architecture. Our simulation results demonstrate that the Pascal GPU architecture improves performance by 273% on average compared to the older Fermi architecture. To evaluate how the performance improvement depends on specific hardware resources, we divide the hardware resources into two types: computing resources and memory resources. Computing resources have a bigger impact on performance improvement than memory resources in most benchmarks. For Hotspot and B+ tree, an architecture adopting only enhanced computing resources can achieve performance gains similar to those of an architecture adopting both enhanced computing and memory resources. We also evaluate the influence of the number of warp schedulers in the SM (Streaming Multiprocessor) on GPU performance in relation to barrier waiting time. Based on these analyses, we propose a development direction for future generations of GPUs.


2017 ◽  
Vol 2017 ◽  
pp. 1-15 ◽  
Author(s):  
Jiankuo Dong ◽  
Fangyu Zheng ◽  
Wuqiong Pan ◽  
Jingqiang Lin ◽  
Jiwu Jing ◽  
...  

Implementations of asymmetric cryptographic algorithms (e.g., RSA and Elliptic Curve Cryptography) on Graphics Processing Units (GPUs) have been researched for over a decade. The basic idea of most previous contributions is to exploit the highly parallel GPU architecture and port the integer-based algorithms from general-purpose CPUs to GPUs to offer high performance. However, the great potential cryptographic computing power of GPUs, especially that of the more powerful floating-point instructions, has not been comprehensively investigated. In this paper, we fully exploit the floating-point computing power of GPUs through various designs, including a floating-point-based Montgomery multiplication/exponentiation algorithm and a Chinese Remainder Theorem (CRT) implementation on the GPU. For practical usage of the proposed algorithm, a new method is used to convert the input/output between octet strings and floating-point numbers, fully utilizing GPUs and further improving the overall performance by about 5%. The performance of RSA-2048/3072/4096 decryption on an NVIDIA GeForce GTX TITAN reaches 42,211/12,151/5,790 operations per second, respectively, which is 13 times the performance of the previous fastest floating-point-based implementation (published in Eurocrypt 2009). The RSA-4096 decryption surpasses the existing fastest integer-based result by 23%.
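As background for the two building blocks named in the abstract, the following sketch shows textbook Montgomery multiplication and CRT-based RSA decryption using ordinary Python integers. It does not reproduce the paper's floating-point design or its GPU mapping; the function names and the tiny key (p = 61, q = 53, e = 17, d = 2753) are assumptions chosen only to make the example checkable. It requires Python 3.8 or later for `pow(x, -1, m)`.

```python
def montgomery_setup(n, bits):
    """Precompute n' = -n^(-1) mod R for R = 2**bits; needs odd n < R."""
    r = 1 << bits
    return (-pow(n, -1, r)) % r


def mont_mul(a, b, n, bits, n_prime):
    """Montgomery product: returns a * b * R^(-1) mod n for a, b < n."""
    mask = (1 << bits) - 1
    t = a * b
    m = ((t & mask) * n_prime) & mask  # makes t + m*n divisible by R
    u = (t + m * n) >> bits            # exact division by R
    return u - n if u >= n else u


def rsa_crt_decrypt(c, p, q, d):
    """RSA decryption with two half-size exponentiations combined by CRT."""
    dp, dq = d % (p - 1), d % (q - 1)
    q_inv = pow(q, -1, p)
    m1 = pow(c, dp, p)
    m2 = pow(c, dq, q)
    h = (q_inv * (m1 - m2)) % p        # Garner's recombination step
    return m2 + h * q


# Toy parameters for illustration only; real keys are 2048 bits or more.
p, q, e, d = 61, 53, 17, 2753
n = p * q
message = 65
cipher = pow(message, e, n)
recovered = rsa_crt_decrypt(cipher, p, q, d)

# Multiply 1234 * 2345 mod n through the Montgomery domain.
bits = 12                              # R = 4096 > n
n_prime = montgomery_setup(n, bits)
a_m = (1234 << bits) % n               # into Montgomery form
b_m = (2345 << bits) % n
prod_m = mont_mul(a_m, b_m, n, bits, n_prime)
product = mont_mul(prod_m, 1, n, bits, n_prime)  # back to normal form
```

Here `recovered` equals `message` and `product` equals `1234 * 2345 % n`. Montgomery multiplication replaces the division by the modulus with shifts and masks, and the CRT split replaces one full-size exponentiation with two half-size ones, which is why both appear in high-throughput RSA implementations.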

