Implementation and Analysis of Fractals Shapes using GPU-CUDA Model

2021 ◽  
Vol 10 (2) ◽  
pp. 1
Author(s):  
Amira Bibo Sallow

The rapid growth of floating-point computing capacity and memory in recent years has made graphics processing units (GPUs) an increasingly attractive platform for accelerating scientific applications, and their popularity is growing quickly because of the large volumes of data they can process in acceptable time. Fractals have many applications that demand fast computation and massive amounts of floating-point arithmetic. In this paper, the fractal image construction algorithm has been implemented in both sequential and parallel versions using the Mandelbrot and Julia sets. The sequential version was executed on the CPU, while the parallel version used GPUArray and a CUDA kernel. The performance of the constructed algorithms was evaluated for the sequential implementation on CPUs (2.20 GHz and 2.60 GHz) and for the parallel implementation on different GPU models (GeForce GTX 1060 and GeForce GTX 1660 Ti), measured in terms of execution time and speedup, to compare the maximum capability of the CPU and the GPU. The results show that execution on the GPU, using either GPUArray or a CUDA kernel, is faster than the sequential CPU implementation; that execution with a CUDA kernel is faster than with GPUArray; and that execution times differ between GPU devices, with the Ti-series GPU executing faster than the other model.
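The abstract does not reproduce the implementation, but the escape-time construction it refers to is standard. Below is a minimal CUDA C/C++ sketch (hypothetical names, not the paper's code) in which each thread colours one pixel; this per-pixel independence is exactly what makes the workload map so well to the GPU. A Julia-set kernel differs only in that c is a fixed constant and z starts at the pixel's coordinates.

```cuda
// Minimal escape-time Mandelbrot sketch (illustrative; not the paper's code).
#include <cstdio>
#include <cuda_runtime.h>

__global__ void mandelbrot(unsigned char *out, int width, int height,
                           float xmin, float xmax, float ymin, float ymax,
                           int maxIter)
{
    int px = blockIdx.x * blockDim.x + threadIdx.x;
    int py = blockIdx.y * blockDim.y + threadIdx.y;
    if (px >= width || py >= height) return;

    // Map the pixel to a point c in the complex plane.
    float cx = xmin + (xmax - xmin) * px / (width - 1);
    float cy = ymin + (ymax - ymin) * py / (height - 1);

    // Iterate z <- z^2 + c until |z| > 2 or the iteration cap is reached.
    float zx = 0.0f, zy = 0.0f;
    int it = 0;
    while (zx * zx + zy * zy <= 4.0f && it < maxIter) {
        float t = zx * zx - zy * zy + cx;
        zy = 2.0f * zx * zy + cy;
        zx = t;
        ++it;
    }
    out[py * width + px] = (unsigned char)(255 * it / maxIter);
}

int main() {
    const int w = 1024, h = 768, maxIter = 256;
    unsigned char *d_img;
    cudaMalloc(&d_img, w * h);
    dim3 block(16, 16), grid((w + 15) / 16, (h + 15) / 16);
    mandelbrot<<<grid, block>>>(d_img, w, h, -2.5f, 1.0f, -1.0f, 1.0f, maxIter);
    cudaDeviceSynchronize();

    unsigned char pixel;
    cudaMemcpy(&pixel, d_img, 1, cudaMemcpyDeviceToHost);
    printf("escape value of the top-left pixel: %d\n", pixel);
    cudaFree(d_img);
    return 0;
}
```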

2020 ◽  
Author(s):  
Konstantin Isupov ◽  
Vladimir Knyazkov

The binary32 and binary64 floating-point formats provide good performance on current hardware, but also introduce a rounding error in almost every arithmetic operation. Consequently, the accumulation of rounding errors in large computations can cause accuracy issues. One way to prevent these issues is to use multiple-precision floating-point arithmetic. This preprint, submitted to Russian Supercomputing Days 2020, presents a new library of basic linear algebra operations with multiple precision for graphics processing units. The library is written in CUDA C/C++ and uses the residue number system to represent multiple-precision significands of floating-point numbers. The supported data types, memory layout, and main features of the library are considered. Experimental results are presented showing the performance of the library.
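The abstract describes the design only at a high level. As a rough illustration of the residue number system (RNS) idea the library builds on — not the library's actual data layout or API — the sketch below stores an integer as its residues modulo a set of pairwise coprime moduli, so that addition and multiplication decompose into independent, carry-free channels, which is what makes RNS attractive for GPUs:

```cpp
// Toy RNS sketch (hypothetical; integers only, small moduli).
#include <cstdio>

#define RNS_SIZE 4
// Pairwise coprime moduli; their product bounds the representable range.
static const int MODULI[RNS_SIZE] = {1021, 1019, 1013, 1009};

struct RnsNumber {
    int residues[RNS_SIZE];
};

// Convert a small non-negative integer to RNS form.
RnsNumber to_rns(long long x) {
    RnsNumber r;
    for (int i = 0; i < RNS_SIZE; ++i)
        r.residues[i] = (int)(x % MODULI[i]);
    return r;
}

// Multiplication acts independently on each residue channel, with no carries
// between channels.
RnsNumber rns_mul(const RnsNumber &a, const RnsNumber &b) {
    RnsNumber r;
    for (int i = 0; i < RNS_SIZE; ++i)
        r.residues[i] =
            (int)(((long long)a.residues[i] * b.residues[i]) % MODULI[i]);
    return r;
}

int main() {
    RnsNumber a = to_rns(123456), b = to_rns(789);
    RnsNumber c = rns_mul(a, b);  // residues of 123456 * 789
    for (int i = 0; i < RNS_SIZE; ++i)
        printf("residue mod %d = %d\n", MODULI[i], c.residues[i]);
    return 0;
}
```

Recovering a conventional positional representation from the residues requires a more expensive Chinese-remainder-style conversion, which is where much of the design effort in such libraries typically goes.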


2014 ◽  
Vol 24 (01) ◽  
pp. 1450003 ◽  
Author(s):  
Xavier Lapillonne ◽  
Oliver Fuhrer

For many scientific applications, Graphics Processing Units (GPUs) can be an interesting alternative to conventional CPUs, as they can deliver higher memory bandwidth and computing power. While it is conceivable to re-write the most time-consuming parts of an application using a low-level API for accelerator programming, it may not be feasible to do so for the entire application. However, having only selected parts of the application running on the GPU requires repeatedly transferring data between the GPU and the host CPU, which may lead to a serious performance penalty. In this paper we assess the potential of compiler directives, based on the OpenACC standard, for porting large parts of code and thus achieving a full GPU implementation. As an illustrative and relevant example, we consider the climate and numerical weather prediction code COSMO (Consortium for Small Scale Modeling) and focus on the physical parametrizations, the part of the code that describes all physical processes not accounted for by the fundamental equations of atmospheric motion. We show, by porting three of the dominant parametrization schemes (radiation, microphysics, and turbulence), that compiler directives are an efficient tool in terms of both final execution time and implementation effort. Compiler directives make it possible to port large sections of the existing code with minor modifications while still allowing further optimization of the most performance-critical parts. Using the example of the radiation parametrization, which contains the solution of a block tri-diagonal linear system, the required code modifications and key optimizations are discussed in detail. Performance tests for the three physical parametrizations show a speedup of between 3× and 7× when comparing the execution time on a GPU with that on a multi-core CPU of an equivalent generation.
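As a concrete, if simplified, illustration of the directive-based approach (a sketch, not COSMO code, which assumes an OpenACC compiler such as nvc++ with -acc), the snippet below ports a physics-style update loop with two pragmas. The enclosing `acc data` region keeps the fields resident on the GPU across all time steps, avoiding exactly the repeated host-device transfers the abstract identifies as the penalty of partial ports:

```cpp
// Illustrative OpenACC sketch (not COSMO code).
#include <cstdio>

int main() {
    const int n = 1 << 20;
    double *t  = new double[n];   // e.g., a temperature field
    double *dt = new double[n];   // tendency produced by a parametrization
    for (int i = 0; i < n; ++i) { t[i] = 280.0; dt[i] = 0.001; }

    // Arrays stay on the device for the whole region; only t is copied back.
    #pragma acc data copy(t[0:n]) copyin(dt[0:n])
    {
        for (int step = 0; step < 100; ++step) {
            #pragma acc parallel loop
            for (int i = 0; i < n; ++i)
                t[i] += dt[i];    // apply the tendency each time step
        }
    }
    printf("t[0] after 100 steps = %f\n", t[0]);  // 280 + 100*0.001 = 280.1
    delete[] t; delete[] dt;
    return 0;
}
```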


2017 ◽  
Vol 2017 ◽  
pp. 1-15 ◽  
Author(s):  
Jiankuo Dong ◽  
Fangyu Zheng ◽  
Wuqiong Pan ◽  
Jingqiang Lin ◽  
Jiwu Jing ◽  
...  

Asymmetric cryptographic algorithms (e.g., RSA and Elliptic Curve Cryptography) have been implemented on Graphics Processing Units (GPUs) for over a decade. The basic idea of most previous contributions is to exploit the highly parallel GPU architecture and port the integer-based algorithms from general-purpose CPUs to GPUs to achieve high performance. However, the great cryptographic computing potential of GPUs, especially of their more powerful floating-point instructions, has in fact not been comprehensively investigated. In this paper, we fully exploit the floating-point computing power of GPUs through several designs, including a floating-point-based Montgomery multiplication/exponentiation algorithm and a Chinese Remainder Theorem (CRT) implementation on the GPU. For practical use of the proposed algorithm, a new method is introduced to convert the input/output between octet strings and floating-point numbers, fully utilizing the GPU and further improving overall performance by about 5%. The performance of RSA-2048/3072/4096 decryption on an NVIDIA GeForce GTX TITAN reaches 42,211/12,151/5,790 operations per second, respectively, which is 13 times the performance of the previously fastest floating-point-based implementation (published at Eurocrypt 2009). The RSA-4096 decryption result exceeds the fastest existing integer-based result by 23%.
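The abstract gives only the outline of the floating-point approach. A hypothetical sketch of the underlying representation trick (not the paper's code): a large integer is split into base-2^24 limbs stored in doubles, so a single limb product (below 2^48) remains exactly representable in a double's 53-bit significand and can be handled by the GPU's floating-point units. The conversion from an octet string into such limbs, a step the paper optimizes, might look like this:

```cpp
// Hypothetical octet-string to floating-point-limb conversion sketch.
#include <cstdio>

#define LIMB_BITS 24
#define NUM_LIMBS 8   // 192-bit toy example; real RSA uses many more limbs

// Convert a big-endian octet string into base-2^24 limbs held in doubles.
// Since LIMB_BITS is a multiple of 8, a byte never straddles a limb
// boundary. len must be at most NUM_LIMBS * 3 bytes for this toy.
void octets_to_fp_limbs(const unsigned char *in, int len, double *limbs) {
    for (int i = 0; i < NUM_LIMBS; ++i) limbs[i] = 0.0;
    for (int i = 0; i < len; ++i) {
        int bit  = 8 * (len - 1 - i);   // bit position of this byte's LSB
        int limb = bit / LIMB_BITS;
        int off  = bit % LIMB_BITS;
        limbs[limb] += (double)in[i] * (double)(1u << off);
    }
}

int main() {
    unsigned char msg[4] = {0x01, 0x23, 0x45, 0x67};  // integer 0x01234567
    double limbs[NUM_LIMBS];
    octets_to_fp_limbs(msg, 4, limbs);
    printf("limb0 = %.0f, limb1 = %.0f\n", limbs[0], limbs[1]);  // 2311527, 1
    return 0;
}
```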


2014 ◽  
Vol 23 (06) ◽  
pp. 1450080
Author(s):  
E. I. MILOVANOVIĆ ◽  
I. Ž. MILOVANOVIĆ ◽  
M. K. STOJČEV

This paper presents the design, implementation, and performance evaluation of a linear processor array accelerator for matrix multiplication, which we call the matrix multiplication processor (MMP). The MMP is composed of n processing elements (PEs) connected in a chain, distributed memory, and a dedicated address generator unit (AGU) that generates memory addresses. With this approach, address generation does not increase the processing time. The AGU is one major difference between the proposed architecture and graphics processing units (GPUs), which use their ALUs to compute addresses. The MMP is based on FPGA technology, since these circuits offer an extreme degree of parallelism and the ability to customize the RAM and datapath architecture to the computation. We evaluate the proposed architecture in terms of the number of PEs, execution time, speedup, efficiency, and gain factor. We implemented the AGU and PE on Xilinx Spartan 2E FPGAs using ISE 9.01 as the software tool, and we compare our design with other solutions proposed in the literature with respect to execution time, number of PEs, AT measure, speedup, and efficiency.
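As a rough software model of the dataflow only (the MMP itself is an RTL design with a dedicated AGU, and the abstract does not include its code), one can picture the chain as follows: PE j holds column j of B, each row of A streams through the chain, and every PE accumulates its own dot product, producing one row of C per pass:

```cpp
// Toy software model of a linear PE chain for C = A*B (illustrative only).
#include <cstdio>
#define N 3

struct PE { int col[N]; int acc; };  // stored column of B plus accumulator

int main() {
    int A[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    int B[N][N] = {{9,8,7},{6,5,4},{3,2,1}};
    PE pe[N];
    for (int j = 0; j < N; ++j)                  // load column j of B into PE j
        for (int k = 0; k < N; ++k) pe[j].col[k] = B[k][j];

    for (int i = 0; i < N; ++i) {                // stream row i of A through
        for (int j = 0; j < N; ++j) pe[j].acc = 0;
        for (int k = 0; k < N; ++k)              // a[i][k] visits every PE
            for (int j = 0; j < N; ++j)
                pe[j].acc += A[i][k] * pe[j].col[k];
        printf("row %d of C: %d %d %d\n", i, pe[0].acc, pe[1].acc, pe[2].acc);
    }
    return 0;
}
```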


2011 ◽  
Vol 2011 ◽  
pp. 1-15 ◽  
Author(s):  
Jeffrey Kingyens ◽  
J. Gregory Steffan

We propose a soft processor programming model and architecture inspired by graphics processing units (GPUs) that are well-matched to the strengths of FPGAs, namely, highly parallel and pipelinable computation. In particular, our soft processor architecture exploits multithreading, vector operations, and predication to supply a floating-point pipeline of 64 stages via hardware support for up to 256 concurrent thread contexts. The key new contributions of our architecture are mechanisms for managing threads and register files that maximize data-level and instruction-level parallelism while overcoming the challenges of port limitations of FPGA block memories as well as memory and pipeline latency. Through simulation of a system that (i) is programmable via NVIDIA's high-level Cg language, (ii) supports AMD's CTM r5xx GPU ISA, and (iii) is realizable on an XtremeData XD1000 FPGA-based accelerator system, we demonstrate the potential for such a system to achieve 100% utilization of a deeply pipelined floating-point datapath.
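A back-of-the-envelope model (our illustration, not the paper's analysis) of why hundreds of thread contexts are needed: with round-robin issue, a thread cannot issue again until its previous instruction leaves the 64-stage pipeline, so utilization is capped at min(1, threads/stages):

```cpp
// Toy utilization model for a deeply pipelined datapath with round-robin
// thread issue (illustrative only).
#include <cstdio>

int main() {
    const int stages = 64;   // pipeline depth from the abstract
    for (int threads = 16; threads <= 256; threads *= 2) {
        // With fewer ready threads than pipeline stages, the issue slot
        // sits idle part of the time while threads wait for completion.
        double util = threads >= stages ? 1.0 : (double)threads / stages;
        printf("%3d threads -> %3.0f%% pipeline utilization\n",
               threads, 100.0 * util);
    }
    return 0;
}
```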



