Analysis of Fast Fourier Transformations algorithm for CUDA Architecture

2012 ◽  
Vol 53 ◽  
Author(s):  
Beatričė Andziulienė ◽  
Evaldas Žulkas ◽  
Audrius Kuprinavičius

In this work, a Fast Fourier transformation algorithm for general-purpose graphics processing unit (GPGPU) processing is discussed. The algorithm structure and the performance of its individual stages were analysed. Using this performance analysis, the possibilities for algorithm distribution and data allocation were determined, depending on the execution speed of the algorithm stages and on the algorithm structure. The ratio between CPU and GPU execution during Fast Fourier transform signal processing was determined using computer-generated frequency data. When CPU code is adapted for CUDA execution, it does not become significantly more complex, even when stream processor parallelization and data transfer stages are taken into account, compared with serial execution on the central processing unit.
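As an illustration of how such a transform is typically launched on the GPU, the sketch below performs a 1D complex-to-complex FFT with NVIDIA's cuFFT library; the signal length and input data are assumptions for the example and are not taken from the paper.

```cpp
// Minimal sketch: 1D complex-to-complex FFT on the GPU with cuFFT.
// The signal length N and the input data are illustrative assumptions.
#include <cufft.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
    const int N = 1 << 20;                 // assumed signal length
    cufftComplex one; one.x = 1.0f; one.y = 0.0f;
    std::vector<cufftComplex> h_signal(N, one);

    cufftComplex* d_signal = nullptr;
    cudaMalloc(&d_signal, N * sizeof(cufftComplex));
    cudaMemcpy(d_signal, h_signal.data(), N * sizeof(cufftComplex),
               cudaMemcpyHostToDevice);

    cufftHandle plan;
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);   // one 1D transform of length N
    cufftExecC2C(plan, d_signal, d_signal, CUFFT_FORWARD);  // in-place FFT
    cudaDeviceSynchronize();

    cudaMemcpy(h_signal.data(), d_signal, N * sizeof(cufftComplex),
               cudaMemcpyDeviceToHost);

    cufftDestroy(plan);
    cudaFree(d_signal);
    return 0;
}
```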

2014 ◽  
Vol 2014 ◽  
pp. 1-8 ◽  
Author(s):  
Ronglin Jiang ◽  
Shugang Jiang ◽  
Yu Zhang ◽  
Ying Xu ◽  
Lei Xu ◽  
...  

This paper introduces a finite-difference time-domain (FDTD) code written in Fortran and CUDA for realistic electromagnetic calculations, with parallelization based on the Message Passing Interface (MPI) and Open Multiprocessing (OpenMP). Since both Central Processing Unit (CPU) and Graphics Processing Unit (GPU) resources are utilized, a faster execution speed can be reached compared to a traditional pure GPU code. In our experiments, 64 NVIDIA TESLA K20m GPUs and 64 INTEL XEON E5-2670 CPUs are used to carry out the pure CPU, pure GPU, and CPU + GPU tests. Relative to the pure CPU calculations for the same problems, the speedup ratio achieved by the CPU + GPU calculations is around 14. Compared to the pure GPU calculations for the same problems, the CPU + GPU calculations show a 7.6%–13.2% performance improvement. Because of the small memory size of GPUs, the FDTD problem size is usually very limited; however, this code can enlarge the maximum problem size by 25% without reducing the performance of a traditional pure GPU code. Finally, using this code, a microstrip antenna array with 16×18 elements is calculated and the radiation patterns are compared with those obtained by MoM. The results show good agreement between them.
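For readers unfamiliar with how an FDTD update maps onto CUDA threads, the following sketch shows a 1D E-field update kernel; the field names, update coefficient, and launch configuration are illustrative assumptions, whereas the paper's code operates on a full 3D grid partitioned across MPI ranks and OpenMP threads.

```cpp
// Minimal sketch: 1D FDTD E-field update on the GPU.
// The field names, the coefficient ce, and the grid size are assumptions;
// the paper's code works on a 3D Yee grid distributed over MPI ranks.
#include <cuda_runtime.h>

__global__ void update_ez(float* ez, const float* hy, float ce, int nx) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i > 0 && i < nx) {
        // Standard Yee update: the spatial difference of H drives E.
        ez[i] += ce * (hy[i] - hy[i - 1]);
    }
}

void step_e(float* d_ez, const float* d_hy, float ce, int nx) {
    int block = 256;
    int grid = (nx + block - 1) / block;
    update_ez<<<grid, block>>>(d_ez, d_hy, ce, nx);
}
```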


2021 ◽  
Vol 4 ◽  
pp. 16-22
Author(s):  
Mykola Semylitko ◽  
Gennadii Malaschonok

The SVD (Singular Value Decomposition) algorithm is used in recommendation systems, machine learning, image processing, and various algorithms for working with matrices, which can be very large (Big Data). Given the peculiarities of this algorithm, it can be executed on the large number of computing threads that only video cards provide. CUDA is a parallel computing platform and application programming interface model created by Nvidia. It allows software developers and software engineers to use a CUDA-enabled graphics processing unit for general-purpose processing, an approach termed GPGPU (general-purpose computing on graphics processing units). The GPU provides much higher instruction throughput and memory bandwidth than the CPU within a similar price and power envelope, and many applications leverage these capabilities to run faster on the GPU than on the CPU. Other computing devices, like FPGAs, are also very energy efficient, but they offer much less programming flexibility than GPUs. The developed modification uses the CUDA architecture, which is intended for a large number of simultaneous calculations and thus allows matrices of very large sizes to be processed quickly. The parallel SVD algorithm for a tridiagonal matrix based on Givens rotations provides high accuracy of calculations. The algorithm also includes a number of optimizations for memory handling and for the multiplication routines, which can significantly reduce the computation time by discarding empty iterations. This article proposes an approach that reduces the computation time and, consequently, resources and costs. The developed algorithm can be used through a simple and convenient API in C++ and Java, and can be further improved by using dynamic parallelism or by parallelizing the multiplication operations. The obtained results can also be used by other developers for comparison, as all conditions of the research are described in detail and the code is freely available.
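To illustrate the kind of operation a Givens-rotation-based SVD performs in parallel, the sketch below applies a single Givens rotation to two rows of a dense matrix, one thread per column; the row indices, rotation parameters, and row-major layout are assumptions for the example and not the paper's actual kernels.

```cpp
// Minimal sketch: apply one Givens rotation in the (p, q) plane to rows
// p and q of an n-by-n row-major matrix on the device. The indices and
// the rotation parameters c = cos(theta), s = sin(theta) are assumptions.
#include <cuda_runtime.h>

__global__ void apply_givens_rows(double* a, int n, int p, int q,
                                  double c, double s) {
    int j = blockIdx.x * blockDim.x + threadIdx.x;
    if (j < n) {
        double apj = a[p * n + j];
        double aqj = a[q * n + j];
        // Each thread rotates one column of the row pair independently.
        a[p * n + j] =  c * apj + s * aqj;
        a[q * n + j] = -s * apj + c * aqj;
    }
}
```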


2014 ◽  
Vol 12 (3) ◽  
pp. 3338-3346
Author(s):  
Abu Asaduzzaman ◽  
Anindya Maiti ◽  
Chok Meng Yip

There is great interest in understanding the manner in which prime numbers are distributed throughout the integers. Prime numbers have been used in secret codes for more than 60 years. Computer security authorities use extremely large prime numbers when they devise cryptosystems, such as the RSA (Rivest, Shamir, and Adleman) algorithm, for protecting vital information transmitted between computers. There are many primality testing algorithms, including mathematical models and computer programs; however, they are very time consuming when the given number n is very large or n→∞. In this paper, we propose a novel parallel computing model based on a deterministic algorithm for central processing unit (CPU) / general-purpose graphics processing unit (GPGPU) systems, which determines much faster whether an input number is prime or composite. We develop and implement the proposed algorithm on a system with an 8-core CPU and a 448-core GPGPU. Experimental results indicate that up to a 94.35x speedup can be achieved for 21-digit decimal numbers.
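A hedged sketch of the general idea, parallel deterministic trial division with one strided divisor range per thread, is given below; it only handles inputs that fit in 64 bits, so the multi-word arithmetic needed for 21-digit numbers is omitted, and the launch configuration is an assumption.

```cpp
// Minimal sketch: parallel trial division on the GPU. Each thread tests a
// strided subset of odd divisors up to sqrt(n) and raises a flag when a
// factor is found. Inputs are limited to 64 bits in this illustration.
#include <cuda_runtime.h>
#include <cmath>

__global__ void trial_division(unsigned long long n, unsigned long long limit,
                               int* composite) {
    unsigned long long stride = (unsigned long long)gridDim.x * blockDim.x;
    unsigned long long tid =
        (unsigned long long)blockIdx.x * blockDim.x + threadIdx.x;
    for (unsigned long long d = 3ULL + 2ULL * tid;
         d <= limit && !*composite; d += 2ULL * stride) {
        if (n % d == 0ULL) atomicExch(composite, 1);  // found a factor
    }
}

bool is_prime(unsigned long long n) {
    if (n < 2ULL) return false;
    if (n % 2ULL == 0ULL) return n == 2ULL;
    unsigned long long limit = (unsigned long long)std::sqrt((double)n) + 1ULL;
    int* d_flag = nullptr;
    cudaMalloc(&d_flag, sizeof(int));
    cudaMemset(d_flag, 0, sizeof(int));
    trial_division<<<64, 256>>>(n, limit, d_flag);   // assumed launch size
    int flag = 0;
    cudaMemcpy(&flag, d_flag, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_flag);
    return flag == 0;
}
```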


Author(s):  
Wisoot Sanhan ◽  
Kambiz Vafai ◽  
Niti Kammuang-Lue ◽  
Pradit Terdtoon ◽  
Phrut Sakulchangsatjatai

Abstract An investigation of the effect of flattening on the thermal performance of a heat pipe with double heat sources, acting as the central processing unit and graphics processing unit in laptop computers, is presented in this work. A finite element method is used to predict the effect of flattening the heat pipe. A cylindrical heat pipe with a diameter of 6 mm and a total length of 200 mm is flattened into three final thicknesses of 2, 3, and 4 mm. The heat pipe is placed in a horizontal configuration and heated with heaters 1 and 2 supplying 40 W in combination. The numerical model shows good agreement with the experimental data, with a standard deviation of 1.85%. The results also show that flattening the cylindrical heat pipe to 66.7% and 41.7% of its original diameter could reduce its normalized thermal resistance by 5.2%. The optimal final thickness for the heat pipe is found to be 2.5 mm.


2019 ◽  
Vol 23 (2) ◽  
pp. 1505-1516 ◽  
Author(s):  
Mohammad Hossein Shafiabadi ◽  
Hossein Pedram ◽  
Midia Reshadi ◽  
Akram Reza

2011 ◽  
Vol 21 (01) ◽  
pp. 31-47 ◽  
Author(s):  
NOEL LOPES ◽  
BERNARDETE RIBEIRO

The Graphics Processing Unit (GPU), originally designed for rendering graphics and difficult to program for other tasks, has since evolved into a device suitable for general-purpose computations. As a result, graphics hardware has become progressively more attractive, yielding unprecedented performance at a relatively low cost. Thus, it is the ideal candidate to accelerate a wide variety of data-parallel tasks in many fields, such as Machine Learning (ML). As problems become more and more demanding, parallel implementations of learning algorithms are crucial for useful applications. In particular, implementing Neural Networks (NNs) on GPUs can significantly reduce the long training times of the learning process. In this paper we present a GPU parallel implementation of the Back-Propagation (BP) and Multiple Back-Propagation (MBP) algorithms, and describe the GPU kernels needed for this task. The results obtained on well-known benchmarks show faster training times and improved performance compared to the implementation on traditional hardware, due to maximized floating-point throughput and memory bandwidth. Moreover, a preliminary GPU-based Autonomous Training System (ATS) is developed, which aims at automatically finding high-quality NN-based solutions for a given problem.
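As a minimal illustration of the sort of kernel involved, the sketch below performs the gradient-descent weight update of one fully connected layer, one thread per weight; the array names, layer sizes, and learning rate are assumptions, and the paper's BP/MBP kernels also cover the forward pass and error back-propagation.

```cpp
// Minimal sketch: gradient-descent weight update for a fully connected
// layer, one CUDA thread per weight. Names and sizes are assumptions.
#include <cuda_runtime.h>

__global__ void update_weights(float* w, const float* input,
                               const float* delta, float lr,
                               int n_in, int n_out) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n_in * n_out) {
        int i = idx / n_out;   // index of the input neuron
        int j = idx % n_out;   // index of the output neuron
        // Gradient of the error w.r.t. w[i][j] is input[i] * delta[j].
        w[idx] -= lr * input[i] * delta[j];
    }
}
```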


2016 ◽  
Vol 6 (1) ◽  
pp. 79-90
Author(s):  
Łukasz Syrocki ◽  
Grzegorz Pestka

Abstract A ready-to-use set of functions is provided to facilitate solving a generalized eigenvalue problem for symmetric matrices, in order to efficiently calculate eigenvalues and eigenvectors using Compute Unified Device Architecture (CUDA) technology from NVIDIA. An integral part of CUDA is the high-level programming environment, which enables tracking of code executed both on the Central Processing Unit and on the Graphics Processing Unit. The presented matrix structures allow for an analysis of the advantages of using graphics processors in such calculations.
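For comparison, the same class of problem can be solved on the GPU with NVIDIA's cuSOLVER library, as sketched below; this is not the function set described in the paper, and the matrix size and device buffers are assumptions for the example.

```cpp
// Minimal sketch: symmetric generalized eigenproblem A*x = lambda*B*x on
// the GPU via cuSOLVER's sygvd routine. Not the paper's own library; the
// device pointers d_A, d_B, d_W and the size n are assumptions.
#include <cusolverDn.h>
#include <cuda_runtime.h>

void solve_generalized(double* d_A, double* d_B, double* d_W, int n) {
    cusolverDnHandle_t handle;
    cusolverDnCreate(&handle);

    int lwork = 0;
    cusolverDnDsygvd_bufferSize(handle, CUSOLVER_EIG_TYPE_1,
                                CUSOLVER_EIG_MODE_VECTOR,
                                CUBLAS_FILL_MODE_LOWER,
                                n, d_A, n, d_B, n, d_W, &lwork);

    double* d_work = nullptr;
    int* d_info = nullptr;
    cudaMalloc(&d_work, lwork * sizeof(double));
    cudaMalloc(&d_info, sizeof(int));

    // On exit, d_W holds the eigenvalues and d_A the eigenvectors.
    cusolverDnDsygvd(handle, CUSOLVER_EIG_TYPE_1, CUSOLVER_EIG_MODE_VECTOR,
                     CUBLAS_FILL_MODE_LOWER, n, d_A, n, d_B, n, d_W,
                     d_work, lwork, d_info);

    cudaFree(d_work);
    cudaFree(d_info);
    cusolverDnDestroy(handle);
}
```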


Author(s):  
D. A. Kalina ◽  
R. V. Golovanov ◽  
D. V. Vorotnev

We present a monocular-camera approach to static hand gesture recognition based on skeletonization. The problem of creating a skeleton of the human hand, as well as of the body, became solvable a few years ago with the invention of so-called convolutional pose machines, a novel artificial neural network architecture. Our solution uses such a pretrained convolutional network to extract hand joint keypoints and then reconstruct the skeleton. In this work we also propose a special skeleton descriptor and prove its stability and distinguishability for classification. We considered several widespread machine learning algorithms to build and verify different classifiers. The recognition quality of the classifiers is estimated using the well-known accuracy metric, which shows that a classical SVM (Support Vector Machine) with a radial basis function kernel gives the best results. The whole system was tested on public databases containing about 3000 test images covering more than 10 types of gestures. The results of a comparative analysis of the proposed system with existing approaches are presented, showing that our gesture recognition system provides better quality than existing solutions. The performance of the proposed system was estimated for two configurations of a standard personal computer: with a CPU (Central Processing Unit) only and with a GPU (Graphics Processing Unit) in addition, where the latter provides real-time processing at up to 60 frames per second. Thus we demonstrate that the proposed approach can find practical application.
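Since the abstract does not specify the descriptor, the sketch below shows one hypothetical choice in the same spirit: normalized pairwise distances between detected keypoints, which gives a scale-invariant feature vector that could be fed to an SVM; the joint indexing convention is an assumption.

```cpp
// Hypothetical skeleton descriptor: normalized pairwise distances between
// hand keypoints. The paper's actual descriptor is not given in the
// abstract; this only illustrates the general idea of scale-invariant
// features for an SVM classifier.
#include <cmath>
#include <vector>

struct Keypoint { float x, y; };

std::vector<float> skeleton_descriptor(const std::vector<Keypoint>& joints) {
    std::vector<float> features;
    if (joints.size() < 10) return features;  // assumed 21-joint hand model

    // Reference length: wrist (joint 0) to middle-finger base (joint 9),
    // an assumed indexing convention of the pose model.
    float rdx = joints[9].x - joints[0].x;
    float rdy = joints[9].y - joints[0].y;
    float ref = std::sqrt(rdx * rdx + rdy * rdy);

    for (size_t i = 0; i < joints.size(); ++i) {
        for (size_t j = i + 1; j < joints.size(); ++j) {
            float dx = joints[i].x - joints[j].x;
            float dy = joints[i].y - joints[j].y;
            // Dividing by the reference length removes dependence on hand
            // scale and camera distance.
            features.push_back(std::sqrt(dx * dx + dy * dy) / ref);
        }
    }
    return features;
}
```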

