Method for Adaptation of Algorithms to GPU Architecture

Mapping Intimacies ◽

10.20948/graphicon-2021-3027-930-941 ◽

2021 ◽

Author(s):

Vadim Bulavintsev ◽

Dmitry Zhdanov

Keyword(s):

Graphics Processing Units ◽

Search Algorithm ◽

Boolean Satisfiability ◽

Control Flow ◽

Code Optimization ◽

Search Performance ◽

Backtracking Search ◽

Boolean Satisfiability Problem ◽

Graphics Processing ◽

Gpu Architecture

We propose a generalized method for adapting and optimizing algorithms for efficient execution on modern graphics processing units (GPUs). The method consists of several steps. First, build a control flow graph (CFG) of the algorithm. Next, transform the CFG into a tree of loops and merge non-parallelizable loops into parallelizable ones. Finally, map the resulting loop tree to the tree of GPU computational units, unrolling the algorithm’s loops as necessary for the match. The mapping should be performed bottom-up, from the lowest GPU architecture levels to the highest ones, to minimize off-chip memory access and maximize register file usage. The method provides the programmer with a convenient and robust mental framework and strategy for GPU code optimization. We demonstrate the method by adapting the DPLL backtracking search algorithm for solving the Boolean satisfiability problem (SAT) to a GPU. The resulting GPU version of DPLL outperforms the CPU version in raw tree search performance sixfold for regular Boolean satisfiability problems and twofold for irregular ones.
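For readers unfamiliar with DPLL, the following is a minimal textbook-style sketch of the backtracking search on a CPU. It is not the authors' GPU adaptation, which the abstract does not show; the names `simplify` and `dpll` and the DIMACS-style integer encoding of literals are assumptions made for illustration. The recursion and the unit-propagation loop are the kind of control flow that the method above would first turn into a CFG and then into a loop tree.

```python
from typing import Dict, List, Optional

Clause = List[int]  # DIMACS-style literals: 3 means x3, -3 means NOT x3


def simplify(clauses: List[Clause], lit: int) -> Optional[List[Clause]]:
    """Set `lit` to True; return the reduced clause list, or None on conflict."""
    reduced: List[Clause] = []
    for clause in clauses:
        if lit in clause:
            continue  # clause is satisfied, drop it
        rest = [l for l in clause if l != -lit]
        if not rest:
            return None  # every literal is false: conflict
        reduced.append(rest)
    return reduced


def dpll(clauses: List[Clause],
         assignment: Optional[Dict[int, bool]] = None) -> Optional[Dict[int, bool]]:
    """Return a satisfying assignment, or None if the formula is unsatisfiable.

    Assumes the input contains no empty clauses.
    """
    assignment = dict(assignment or {})
    # Unit propagation: a one-literal clause forces that literal.
    while True:
        unit = next((c[0] for c in clauses if len(c) == 1), None)
        if unit is None:
            break
        assignment[abs(unit)] = unit > 0
        reduced = simplify(clauses, unit)
        if reduced is None:
            return None
        clauses = reduced
    if not clauses:
        return assignment  # all clauses satisfied
    # Branch on the first literal of the first clause, then on its negation.
    lit = clauses[0][0]
    for choice in (lit, -lit):
        reduced = simplify(clauses, choice)
        if reduced is not None:
            result = dpll(reduced, {**assignment, abs(choice): choice > 0})
            if result is not None:
                return result
    return None  # both branches failed: backtrack
```

Variables that never needed a value are simply absent from the returned dictionary.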


OORS: An object-oriented rewrite system

Computer Science and Information Systems ◽

10.2298/csis0702002g ◽

2007 ◽

Vol 4 (2) ◽

pp. 2-26

Author(s):

Gernot Gebhard ◽

Philipp Lucas

Keyword(s):

Code Generation ◽

Graphics Processing Units ◽

Object Oriented ◽

Graphics Hardware ◽

Code Optimization ◽

Target Architecture ◽

Rewrite Rules ◽

Graphics Processing ◽

Traditional Approaches ◽

Rewrite System

Retargeting a compiler's back end to a new architecture is a time-consuming process. This becomes an evident problem in the area of programmable graphics hardware (graphics processing units, GPUs) or embedded processors, where architectural changes are faster than elsewhere. We propose the object-oriented rewrite system OORS to overcome this problem. Using the OORS language, a compiler developer can express the code generation and optimization phase in terms of cost-annotated rewrite rules supporting complex non-linear matching and replacing patterns. Retargetability is achieved by organizing rules into profiles, one for each supported target architecture. Featuring a rule and profile inheritance mechanism, OORS makes the reuse of existing specifications possible. This is an improvement over traditional approaches. Altogether, OORS increases the maintainability of the compiler's back end and thus both decreases the complexity and reduces the effort of the retargeting process. To show the potential of this approach, we have implemented a code generation and a code optimization pattern matcher supporting different target architectures using the OORS language and integrated them into a compiler of a programming language for CPUs and GPUs.
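The abstract does not show OORS syntax, so the sketch below only illustrates the three ideas it names, in plain Python: cost-annotated rules, a non-linear pattern (the same subterm must appear twice), and profiles that inherit rules from a parent. The classes `Rule` and `Profile`, the tuple encoding of expressions, and the `base` and `gpu` profiles with their costs are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional


@dataclass
class Rule:
    name: str
    cost: int
    # Returns the rewritten expression, or None if the pattern does not match.
    apply: Callable[[object], Optional[object]]


@dataclass
class Profile:
    name: str
    parent: Optional["Profile"] = None
    rules: Dict[str, Rule] = field(default_factory=dict)

    def add(self, rule: Rule) -> None:
        self.rules[rule.name] = rule

    def all_rules(self) -> List[Rule]:
        """Inherited rules first; same-named rules in this profile override them."""
        merged: Dict[str, Rule] = {}
        if self.parent is not None:
            merged.update({r.name: r for r in self.parent.all_rules()})
        merged.update(self.rules)
        return list(merged.values())

    def rewrite(self, expr: object) -> object:
        """One bottom-up pass, applying the cheapest matching rule at each node."""
        if isinstance(expr, tuple):
            expr = (expr[0],) + tuple(self.rewrite(e) for e in expr[1:])
        best = None
        for rule in self.all_rules():
            out = rule.apply(expr)
            if out is not None and (best is None or rule.cost < best[0]):
                best = (rule.cost, out)
        return expr if best is None else best[1]


def is_double(e: object) -> bool:
    # Non-linear pattern add(x, x): both operands must be the same subterm.
    return isinstance(e, tuple) and len(e) == 3 and e[0] == "add" and e[1] == e[2]


base = Profile("base")
base.add(Rule("double_to_mul", 3, lambda e: ("mul", 2, e[1]) if is_double(e) else None))

gpu = Profile("gpu", parent=base)  # inherits double_to_mul, adds a cheaper rule
gpu.add(Rule("double_to_shift", 1, lambda e: ("shl", e[1], 1) if is_double(e) else None))
```

With these definitions, `base.rewrite(("add", "x", "x"))` selects the multiplication rule, while `gpu.rewrite` on the same input prefers the cheaper shift rule; an expression such as `("add", "x", "y")` matches neither rule and is returned unchanged.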


Parallelization of a Self-adaptive Harmony Search Algorithm on Graphics Processing Units

2019 Eleventh International Conference on Advanced Computational Intelligence (ICACI) ◽

10.1109/icaci.2019.8778491 ◽

2019 ◽

Author(s):

Yin-Fu Huang ◽

Sun-Ho Chou

Keyword(s):

Graphics Processing Units ◽

Search Algorithm ◽

Harmony Search ◽

Harmony Search Algorithm ◽

Graphics Processing ◽

Self Adaptive


Minimizing Power Consumption of an Experimental HVAC System Based on Parallel Grid Searching

Energies ◽

10.3390/en13082083 ◽

2020 ◽

Vol 13 (8) ◽

pp. 2083 ◽

Cited By ~ 1

Author(s):

Wangqi Xiong ◽

Jiandong Wang

Keyword(s):

Power Consumption ◽

Graphics Processing Units ◽

Optimization Problem ◽

Search Algorithm ◽

Experimental Studies ◽

Optimal Solution ◽

Hvac System ◽

Convex Optimization Problem ◽

The One ◽

Graphics Processing

This paper proposes a parallel grid search algorithm to find an optimal operating point for minimizing the power consumption of an experimental heating, ventilating and air conditioning (HVAC) system. First, a multidimensional, nonlinear and non-convex optimization problem subject to constraints is formulated based on a semi-physical model of the experimental HVAC system. Second, the optimization problem is parallelized on graphics processing units to simultaneously compute the optimization loss function for different solutions in a search grid, and to find the optimal solution as the one with the minimum loss. An advantage of the proposed algorithm is that the returned solution is demonstrably the best one at the current resolution of the search grid. Experimental studies are provided to support the proposed algorithm.
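The grid search described above can be sketched as follows. This is a CPU stand-in using a thread pool rather than the paper's GPU implementation, and the names `linspace` and `grid_search`, the `feasible` constraint callback, and the toy loss in the usage note are assumptions for illustration; the HVAC model itself is not given in the abstract. For a CPU-bound pure-Python loss, threads show the structure rather than a real speedup.

```python
import itertools
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, ...]


def linspace(lo: float, hi: float, n: int) -> List[float]:
    """n evenly spaced values from lo to hi inclusive (requires n >= 2)."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)]


def grid_search(loss: Callable[[Point], float],
                feasible: Callable[[Point], bool],
                axes: Sequence[Sequence[float]],
                workers: int = 4) -> Tuple[Point, float]:
    """Evaluate `loss` at every feasible grid point and return the minimizer."""
    # Enumerate the Cartesian grid and keep only points satisfying the constraints.
    points = [p for p in itertools.product(*axes) if feasible(p)]
    if not points:
        raise ValueError("no feasible grid point")
    # Each evaluation is independent, which is what makes the search parallelizable.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        losses = list(pool.map(loss, points))
    # Reduction step: index of the smallest loss (first one wins on ties).
    best = min(range(len(points)), key=losses.__getitem__)
    return points[best], losses[best]
```

A call such as `grid_search(lambda p: (p[0] - 1) ** 2 + (p[1] - 2) ** 2, lambda p: p[0] + p[1] <= 2, [linspace(0, 3, 4), linspace(0, 3, 4)])` searches a 4 x 4 grid under a linear constraint. Because every feasible point is evaluated, the result is the best point at that grid resolution, which is the guarantee the abstract refers to.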


Combining K-Means and K-Harmonic with Fish School Search Algorithm for data clustering task on graphics processing units

Applied Soft Computing ◽

10.1016/j.asoc.2015.12.032 ◽

2016 ◽

Vol 41 ◽

pp. 290-304 ◽

Cited By ~ 38

Author(s):

Adriane B.S. Serapião ◽

Guilherme S. Corrêa ◽

Felipe B. Gonçalves ◽

Veronica O. Carvalho

Keyword(s):

Data Clustering ◽

Graphics Processing Units ◽

Search Algorithm ◽

Fish School ◽

Graphics Processing


Bypass-Enabled Thread Compaction for Divergent Control Flow in Graphics Processing Units

Journal of Shanghai Jiaotong University (Science) ◽

10.1007/s12204-020-2240-x ◽

2020 ◽

Author(s):

Bingchao Li ◽

Jizeng Wei ◽

Wei Guo ◽

Jizhou Sun

Keyword(s):

Graphics Processing Units ◽

Control Flow ◽

Graphics Processing


Architecture exploration of recent GPUs to analyze the efficiency of hardware resources

Bulletin of Electrical Engineering and Informatics ◽

10.11591/eei.v10i2.2736 ◽

2021 ◽

Vol 10 (2) ◽

pp. 917-926

Author(s):

Viet Tan Vo ◽

Cheol Hong Kim

Keyword(s):

Performance Improvement ◽

Graphics Processing Units ◽

Future Generation ◽

Architecture Exploration ◽

Development Direction ◽

Simulation Results ◽

Memory Resources ◽

Graphics Processing ◽

Performance Gains ◽

Gpu Architecture

This study analyzes the efficiency of parallel computational applications on recent graphics processing units (GPUs). We investigate the impact of the additional resources of a recent architecture on popular benchmarks compared with a previous architecture. Our simulation results demonstrate that the Pascal GPU architecture improves performance by 273% on average compared to the older Fermi architecture. To evaluate how the performance improvement depends on specific hardware resources, we divide the hardware resources into two types: computing and memory resources. Computing resources have a bigger impact on performance improvement than memory resources in most of the benchmarks. For Hotspot and B+ tree, the architecture adopting only enhanced computing resources can achieve performance gains similar to those of the architecture adopting both computing and memory resources. We also evaluate the influence of the number of warp schedulers in the SM (Streaming Multiprocessor) on GPU performance in relation to barrier waiting time. Based on these analyses, we propose a development direction for the future generation of GPUs.


Utilizing the Double-Precision Floating-Point Computing Power of GPUs for RSA Acceleration

Security and Communication Networks ◽

10.1155/2017/3508786 ◽

2017 ◽

Vol 2017 ◽

pp. 1-15 ◽

Cited By ~ 1

Author(s):

Jiankuo Dong ◽

Fangyu Zheng ◽

Wuqiong Pan ◽

Jingqiang Lin ◽

Jiwu Jing ◽

...

Keyword(s):

Graphics Processing Units ◽

High Performance ◽

Chinese Remainder Theorem ◽

General Purpose ◽

Floating Point ◽

Double Precision ◽

Computing Power ◽

Cryptographic Algorithm ◽

Graphics Processing ◽

Gpu Architecture

Implementations of asymmetric cryptographic algorithms (e.g., RSA and Elliptic Curve Cryptography) on Graphics Processing Units (GPUs) have been researched for over a decade. The basic idea of most previous contributions is to exploit the highly parallel GPU architecture by porting the integer-based algorithms from general-purpose CPUs to GPUs to offer high performance. However, the great potential cryptographic computing power of GPUs, especially that of the more powerful floating-point instructions, has not been comprehensively investigated. In this paper, we fully exploit the floating-point computing power of GPUs through various designs, including a floating-point-based Montgomery multiplication/exponentiation algorithm and a Chinese Remainder Theorem (CRT) implementation on the GPU. For practical use of the proposed algorithm, a new method converts the input/output between octet strings and floating-point numbers, fully utilizing GPUs and further improving overall performance by about 5%. The performance of RSA-2048/3072/4096 decryption on NVIDIA GeForce GTX TITAN reaches 42,211/12,151/5,790 operations per second, respectively, which is 13 times the performance of the previous fastest floating-point-based implementation (published in Eurocrypt 2009). The RSA-4096 decryption outperforms the existing fastest integer-based result by 23%.
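As background for the CRT step mentioned above, here is a minimal sketch of CRT-based RSA decryption using Python's arbitrary-precision integers. It is the standard integer formulation (Garner's recombination), not the paper's floating-point GPU design, and the tiny key (p = 61, q = 53, e = 17, d = 2753) and the function names are illustrative assumptions only; a key this small offers no security. The modular inverse via `pow(q, -1, p)` requires Python 3.8 or later.

```python
# Toy RSA key for illustration only.
P, Q = 61, 53
N = P * Q          # modulus, 3233
E = 17             # public exponent
D = 2753           # private exponent: (E * D) % ((P - 1) * (Q - 1)) == 1

# Values precomputed once per key for CRT decryption.
DP = D % (P - 1)          # exponent used modulo p
DQ = D % (Q - 1)          # exponent used modulo q
Q_INV = pow(Q, -1, P)     # q^(-1) mod p


def encrypt(m: int) -> int:
    """Textbook RSA encryption (no padding): c = m^e mod n."""
    return pow(m, E, N)


def decrypt_plain(c: int) -> int:
    """Reference decryption with one full-size exponentiation."""
    return pow(c, D, N)


def decrypt_crt(c: int) -> int:
    """CRT decryption: two half-size exponentiations, then recombination."""
    m1 = pow(c, DP, P)            # result modulo p
    m2 = pow(c, DQ, Q)            # result modulo q
    h = (Q_INV * (m1 - m2)) % P   # Garner's formula
    return m2 + h * Q
```

The two half-size exponentiations are independent of each other, which is one reason CRT decryption is attractive for parallel hardware.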
