Acceleration of a two-dimensional Euler flow solver using commodity graphics hardware

Author(s):  
T Brandvik ◽  
G Pullan

The implementation of a two-dimensional Euler solver on graphics hardware is described. The graphics processing unit (GPU) is highly parallel and uses a programming model that is well suited to flow computation. Results for a transonic turbine cascade test case are presented. For large grids (10^6 nodes), a 40 times speed-up over a Fortran implementation on a contemporary CPU is observed.
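As a minimal sketch of how such a solver typically maps onto the GPU, the kernel below assigns one thread to each node of a structured grid in row-major storage. The kernel name, the placeholder averaging stencil, and the launch configuration are illustrative assumptions, not the paper's actual flux scheme.

    // Hypothetical sketch: one thread per grid node applying a simple
    // 2D stencil update. A real Euler solver would evaluate flux
    // residuals for mass, momentum, and energy here; this only
    // illustrates the thread-to-node mapping.
    __global__ void stencil_update(const float* u_old, float* u_new,
                                   int nx, int ny, float coeff)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        int j = blockIdx.y * blockDim.y + threadIdx.y;
        if (i <= 0 || j <= 0 || i >= nx - 1 || j >= ny - 1) return; // skip boundary

        int idx = j * nx + i;
        u_new[idx] = u_old[idx] + coeff * (u_old[idx - 1] + u_old[idx + 1]
                   + u_old[idx - nx] + u_old[idx + nx] - 4.0f * u_old[idx]);
    }

    // Host side (sketch): time-march by launching the kernel repeatedly,
    // swapping u_old and u_new between iterations.
    //   dim3 block(16, 16);
    //   dim3 grid((nx + 15) / 16, (ny + 15) / 16);
    //   stencil_update<<<grid, block>>>(d_old, d_new, nx, ny, coeff);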


2011 ◽  
Vol 21 (01) ◽  
pp. 31-47 ◽  
Author(s):  
NOEL LOPES ◽  
BERNARDETE RIBEIRO

The Graphics Processing Unit (GPU), originally designed for rendering graphics and difficult to program for other tasks, has since evolved into a device suitable for general-purpose computation. As a result, graphics hardware has become progressively more attractive, yielding unprecedented performance at a relatively low cost. This makes it an ideal candidate for accelerating a wide variety of data-parallel tasks in many fields, such as Machine Learning (ML). As problems become increasingly demanding, parallel implementations of learning algorithms are crucial for practical applications. In particular, implementing Neural Networks (NNs) on GPUs can significantly reduce the long training times of the learning process. In this paper we present a GPU parallel implementation of the Back-Propagation (BP) and Multiple Back-Propagation (MBP) algorithms and describe the GPU kernels needed for this task. The results obtained on well-known benchmarks show faster training times and improved performance compared with a traditional CPU implementation, owing to maximized floating-point throughput and memory bandwidth. Moreover, a preliminary GPU-based Autonomous Training System (ATS) is developed that aims to automatically find high-quality NN-based solutions for a given problem.
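As a hedged sketch of the kind of kernel such an implementation needs (not the authors' actual BP/MBP kernels), the following computes the forward pass of one fully connected layer with one thread per output neuron; the names, row-major weight layout, and logistic activation are assumptions for illustration.

    // Hypothetical sketch: forward pass of one fully connected layer.
    // Back-propagation adds analogous kernels for the output deltas,
    // the back-propagated errors, and the weight updates.
    __global__ void forward_layer(const float* in, const float* W,
                                  const float* bias, float* out,
                                  int n_in, int n_out)
    {
        int j = blockIdx.x * blockDim.x + threadIdx.x;
        if (j >= n_out) return;

        float sum = bias[j];
        for (int i = 0; i < n_in; ++i)
            sum += W[j * n_in + i] * in[i];   // row-major weight matrix

        out[j] = 1.0f / (1.0f + expf(-sum));  // logistic activation
    }

    // Host side (sketch): one launch per layer per training pattern.
    //   int block = 256;
    //   forward_layer<<<(n_out + block - 1) / block, block>>>(
    //       d_in, d_W, d_bias, d_out, n_in, n_out);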





Author(s):  
Mainak Adhikari ◽  
Sukhendu Kar

The graphics processing unit (GPU) traditionally handles computation only for computer graphics. However, any GPU providing a functionally complete set of operations on arbitrary bits can compute any computable value. Additionally, the use of multiple graphics cards in one computer, or of large numbers of graphics chips, further parallelizes the already parallel nature of graphics processing. CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA and implemented by its graphics processing units (GPUs). CUDA gives program developers direct access to the virtual instruction set and memory of the parallel computational elements in CUDA GPUs. This chapter first discusses some features and challenges of GPU programming and the effort to address some of those challenges when building and running GPU programs in a high-performance computing (HPC) environment. Finally, the chapter points out the importance and standards of the CUDA architecture.
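The programming model the chapter describes is conventionally illustrated with the canonical CUDA vector-addition example: the host allocates device memory, copies inputs across, launches a kernel over a grid of threads, and copies the result back. This is a standard minimal example, not taken from the chapter.

    #include <cstdio>
    #include <cstdlib>
    #include <cuda_runtime.h>

    // Each thread adds one pair of elements.
    __global__ void vec_add(const float* a, const float* b, float* c, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) c[i] = a[i] + b[i];
    }

    int main()
    {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *h_a = (float*)malloc(bytes), *h_b = (float*)malloc(bytes),
              *h_c = (float*)malloc(bytes);
        for (int i = 0; i < n; ++i) { h_a[i] = 1.0f; h_b[i] = 2.0f; }

        float *d_a, *d_b, *d_c;
        cudaMalloc(&d_a, bytes); cudaMalloc(&d_b, bytes); cudaMalloc(&d_c, bytes);
        cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);

        int block = 256;
        vec_add<<<(n + block - 1) / block, block>>>(d_a, d_b, d_c, n);

        cudaMemcpy(h_c, d_c, bytes, cudaMemcpyDeviceToHost);
        printf("c[0] = %f\n", h_c[0]);  // expect 3.0

        cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
        free(h_a); free(h_b); free(h_c);
        return 0;
    }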



2014 ◽  
Vol 8 (5) ◽  
pp. 229-236 ◽  
Author(s):  
Changhe Song ◽  
Yunsong Li ◽  
Jie Guo ◽  
Jie Lei


2020 ◽  
Vol 45 (15) ◽  
pp. 4124
Author(s):  
Pin-Chieh Huang ◽  
Rishyashring R. Iyer ◽  
Yuan-Zhi Liu ◽  
Stephen A. Boppart




2017 ◽  
Vol 14 (1) ◽  
pp. 789-795
Author(s):  
V Saveetha ◽  
S Sophia

Parallel data clustering aims to use algorithms and methods to extract knowledge from large databases in reasonable time using high-performance architectures. The computational challenge that cluster analysis faces due to increasing data volumes can be overcome by exploiting the power of these architectures. Recent developments in the parallel power of the Graphics Processing Unit enable low-cost, high-performance solutions for general-purpose applications. The Compute Unified Device Architecture programming model provides application programming interface methods to handle data efficiently on the Graphics Processing Unit for iterative clustering algorithms like K-Means. Existing Graphics Processing Unit based K-Means algorithms focus heavily on improving the speedup of the algorithm and fall short of addressing the high time spent on transferring data between the Central Processing Unit and the Graphics Processing Unit. A competent K-Means algorithm is proposed in this paper to lessen the transfer time by introducing a novel approach to checking the convergence of the algorithm and by utilizing pinned memory for direct access. This algorithm outperforms the other algorithms by maximizing parallelism and exploiting the memory features. The relative speedups and the validity measure for the proposed algorithm are higher when compared with K-Means on the Graphics Processing Unit and K-Means using a flag on the Graphics Processing Unit. Thus the proposed approach shows that communication overhead can be reduced in K-Means clustering.
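The two transfer-saving mechanics the abstract describes, a device-side convergence flag and pinned host memory, might look like the following hypothetical sketch (not the authors' code). Only a single flag word crosses the CPU-GPU boundary each iteration, instead of the full assignment array.

    #include <cuda_runtime.h>

    // Hypothetical sketch: assign each point to its nearest centroid and
    // record in a device-side flag whether any assignment changed.
    __global__ void assign_points(const float* points, const float* centroids,
                                  int* labels, int* changed,
                                  int n, int k, int dim)
    {
        int p = blockIdx.x * blockDim.x + threadIdx.x;
        if (p >= n) return;

        int best = 0; float best_d = 1e30f;
        for (int c = 0; c < k; ++c) {
            float d = 0.0f;
            for (int j = 0; j < dim; ++j) {
                float diff = points[p * dim + j] - centroids[c * dim + j];
                d += diff * diff;
            }
            if (d < best_d) { best_d = d; best = c; }
        }
        if (labels[p] != best) { labels[p] = best; atomicExch(changed, 1); }
    }

    // Host side (sketch): pinned memory makes the one-word readback cheap.
    //   int* h_changed;
    //   cudaHostAlloc(&h_changed, sizeof(int), cudaHostAllocDefault);
    //   Each iteration: zero the device flag, run assign_points and the
    //   centroid update, copy the flag back, stop when *h_changed == 0.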



2017 ◽  
Vol 06 (04) ◽  
pp. 1750009
Author(s):  
Jonathan Van Belle ◽  
Richard Armstrong ◽  
James Gain

Deconvolution of native radio interferometric images constitutes a major computational component of the imaging process. An efficient and robust deconvolution operation is essential for reconstructing the true sky signal from measured telescope data. The techniques of compressed sensing provide a mathematically rigorous framework within which to implement deconvolution of images formed from a sparse set of nearly random measurements. We present an accelerated implementation of the orthogonal matching pursuit (OMP) algorithm (a compressed sensing method) that makes use of graphics processing unit (GPU) hardware. We show that OMP correctly identifies more sources than CLEAN, identifying up to 82% of the sources in 100 test images, whereas CLEAN identifies only up to 61%. In addition, the residual after source extraction is [Formula: see text] times lower for OMP than for CLEAN. Furthermore, the GPU implementation of OMP performs around 23 times faster than a 4-core CPU.
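For reference, a compact host-side sketch of the OMP loop itself, written as plain C++ that compiles under CUDA; in an accelerated implementation such as the one described, the atom-residual correlations and the least-squares solve are the steps that would be offloaded to the GPU. All names here are illustrative assumptions, not the paper's code.

    #include <cmath>
    #include <vector>

    // Hypothetical OMP sketch. A is n x m, column-major, with unit-norm
    // columns (atoms); y has length n; k is the target sparsity.
    void omp(const std::vector<float>& A, const std::vector<float>& y,
             int n, int m, int k,
             std::vector<int>& support, std::vector<float>& coef)
    {
        std::vector<float> r(y);                       // residual
        for (int t = 0; t < k; ++t) {
            // 1. Correlate every atom with the residual; pick the best.
            int best = -1; float best_abs = -1.0f;
            for (int j = 0; j < m; ++j) {
                float dot = 0.0f;
                for (int i = 0; i < n; ++i) dot += A[j * n + i] * r[i];
                if (std::fabs(dot) > best_abs) { best_abs = std::fabs(dot); best = j; }
            }
            support.push_back(best);
            int s = (int)support.size();

            // 2. Least squares on the selected atoms via normal equations
            //    G x = b; G is SPD, so plain Gaussian elimination suffices.
            std::vector<float> G(s * s), b(s);
            for (int a = 0; a < s; ++a) {
                for (int c = 0; c < s; ++c) {
                    float g = 0.0f;
                    for (int i = 0; i < n; ++i)
                        g += A[support[a] * n + i] * A[support[c] * n + i];
                    G[a * s + c] = g;
                }
                float bb = 0.0f;
                for (int i = 0; i < n; ++i) bb += A[support[a] * n + i] * y[i];
                b[a] = bb;
            }
            coef.assign(s, 0.0f);
            for (int p = 0; p < s; ++p) {              // forward elimination
                for (int q = p + 1; q < s; ++q) {
                    float f = G[q * s + p] / G[p * s + p];
                    for (int c = p; c < s; ++c) G[q * s + c] -= f * G[p * s + c];
                    b[q] -= f * b[p];
                }
            }
            for (int p = s - 1; p >= 0; --p) {         // back substitution
                float sum = b[p];
                for (int c = p + 1; c < s; ++c) sum -= G[p * s + c] * coef[c];
                coef[p] = sum / G[p * s + p];
            }

            // 3. Update the residual: r = y - A_S x.
            r = y;
            for (int a = 0; a < s; ++a)
                for (int i = 0; i < n; ++i)
                    r[i] -= coef[a] * A[support[a] * n + i];
        }
    }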


