Theoretical Parallel Computing Models for GPU Computing

Author(s):  
Koji Nakano

In the age of emerging technologies, the amount of data is increasing rapidly, and the scale of computation is growing with it. A computer executes instructions sequentially, but times have changed and technology has advanced: we now manage enormous data centers that perform billions of operations on a daily basis. In fact, if we look closely at processor architecture and mechanisms, even a sequential machine works in parallel internally. Parallel computing is growing quickly as an alternative to distributed computing. The performance-to-functionality ratio of parallel systems is high, and their I/O usage is lower because all operations can be performed simultaneously. In contrast, the performance-to-functionality ratio of distributed systems is low, and their I/O usage is higher because not all operations can be performed simultaneously. This paper gives an overview of distributed and parallel computing, discusses the basic concepts of both, and describes the pros and cons of distributed and parallel computing models. Considering these many aspects, we conclude that parallel systems are better than distributed systems.


2020 ◽  
Vol 2020 ◽  
pp. 1-15
Author(s):  
Jianqi Lai ◽  
Hang Yu ◽  
Zhengyu Tian ◽  
Hua Li

Graphics processing units (GPUs) have strong floating-point capability and high memory bandwidth for data parallelism and have been widely used in high-performance computing (HPC). Compute unified device architecture (CUDA) is used as a parallel computing platform and programming model for the GPU to reduce the complexity of programming. Programmable GPUs are becoming popular in computational fluid dynamics (CFD) applications. In this work, we propose a hybrid parallel algorithm combining the message passing interface (MPI) and CUDA for CFD applications on multi-GPU HPC clusters. The AUSM+UP upwind scheme and the three-step Runge–Kutta method are used for spatial and time discretization, respectively. Turbulence is modeled with the k-ω SST two-equation model. The CPU only manages GPU execution and communication, while the GPU is responsible for data processing. Parallel execution and memory access optimizations are used to optimize the GPU-based CFD codes. We propose a nonblocking communication method that fully overlaps GPU computing, CPU–CPU communication, and CPU–GPU data transfer by creating two CUDA streams. Furthermore, a one-dimensional domain decomposition method is used to balance the workload among GPUs. Finally, we evaluate the hybrid parallel algorithm on the compressible turbulent flow over a flat plate. The performance of a single-GPU implementation and the scalability of multi-GPU clusters are discussed. Performance measurements show that multi-GPU parallelization can achieve a speedup of more than 36 times with respect to CPU-based parallel computing, and the parallel algorithm has good scalability.
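
To make the overlap concrete, the following sketch shows one way the two-stream idea could look in MPI+CUDA code. It is an illustration under stated assumptions, not the authors' implementation: interior_flux_kernel, boundary_flux_kernel, the buffer names, and the placeholder kernel bodies are hypothetical, and the halo exchange is reduced to a single send/receive pair per rank.

// Hedged sketch (not the paper's code): overlap interior computation with
// halo transfer and exchange using two CUDA streams and nonblocking MPI.
// Host halo buffers are assumed to be pinned (cudaMallocHost).
#include <mpi.h>
#include <cuda_runtime.h>

__global__ void interior_flux_kernel(double *q, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) q[i] += 1.0;                 // placeholder update for interior cells
}

__global__ void boundary_flux_kernel(double *q, const double *halo, int nb) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < nb) q[i] += halo[i];            // placeholder update using received halo
}

void exchange_and_compute(double *d_q, double *d_halo_send, double *d_halo_recv,
                          double *h_halo_send, double *h_halo_recv,
                          int n, int nb, int left, int right)
{
    cudaStream_t s_comp, s_comm;
    cudaStreamCreate(&s_comp);
    cudaStreamCreate(&s_comm);

    // Stream 1: interior cells, which need no halo data.
    interior_flux_kernel<<<(n + 255) / 256, 256, 0, s_comp>>>(d_q, n);

    // Stream 2: copy the halo layer to the host, exchange it with the
    // neighbouring ranks, and copy the received layer back to the device.
    cudaMemcpyAsync(h_halo_send, d_halo_send, nb * sizeof(double),
                    cudaMemcpyDeviceToHost, s_comm);
    cudaStreamSynchronize(s_comm);          // interior kernel keeps running on s_comp

    MPI_Request req[2];
    MPI_Irecv(h_halo_recv, nb, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &req[0]);
    MPI_Isend(h_halo_send, nb, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &req[1]);
    MPI_Waitall(2, req, MPI_STATUSES_IGNORE);

    cudaMemcpyAsync(d_halo_recv, h_halo_recv, nb * sizeof(double),
                    cudaMemcpyHostToDevice, s_comm);

    // Boundary cells run only after the halo has arrived on the device.
    boundary_flux_kernel<<<(nb + 255) / 256, 256, 0, s_comm>>>(d_q, d_halo_recv, nb);

    cudaStreamSynchronize(s_comp);
    cudaStreamSynchronize(s_comm);
    cudaStreamDestroy(s_comp);
    cudaStreamDestroy(s_comm);
}

Because the interior kernel is issued on its own stream before the host starts the halo copy and MPI exchange, the bulk of the computation proceeds on the GPU while the communication is in flight; only the small boundary kernel waits for the received halo.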


2020 ◽  
Vol 31 (01) ◽  
pp. 2050049 ◽  
Author(s):  
Zeqiong Lv ◽  
Tingting Bao ◽  
Nan Zhou ◽  
Hong Peng ◽  
Xiangnian Huang ◽  
...  

This paper discusses a new variant of spiking neural P systems (SNP systems, for short): spiking neural P systems with extended channel rules (SNP–ECR systems, for short). SNP–ECR systems are a class of distributed parallel computing models. In SNP–ECR systems, a new type of spiking rule, called an ECR, is introduced. With an ECR, a neuron can send different numbers of spikes to its subsequent neurons. SNP–ECR systems therefore provide a stronger firing control mechanism than SNP systems and the variant with multiple channels. We discuss the Turing universality of SNP–ECR systems: it is proven that SNP–ECR systems are Turing universal as number generating/accepting devices. Moreover, we provide a small universal SNP–ECR system as a function computing device.
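
The effect of an ECR can be pictured with a small, heavily simplified sketch. The code below only illustrates the idea that one firing can emit different spike counts on different outgoing channels; the struct names, the plain equality test standing in for the regular-expression firing condition, and the example numbers are all hypothetical and do not reproduce the formal SNP–ECR semantics.

// Simplified, host-side sketch of an extended channel rule (ECR).
// Illustration of the idea only, not the formal SNP-ECR definition.
#include <cstdio>
#include <vector>

struct ChannelEmission {
    int target_neuron;   // index of a subsequent neuron
    int spikes_sent;     // ECR: each channel may carry a different spike count
};

struct ExtendedChannelRule {
    int required_spikes;                     // stands in for the regular-expression firing condition
    int consumed_spikes;                     // spikes removed when the rule fires
    std::vector<ChannelEmission> emissions;  // per-channel spike counts
};

int main() {
    std::vector<int> spikes = {3, 0, 0};     // spike counts of neurons 0, 1, 2

    // Rule for neuron 0: if it holds 3 spikes, consume 2 and send 1 spike to
    // neuron 1 but 4 spikes to neuron 2 (different counts on different channels).
    ExtendedChannelRule rule{3, 2, {{1, 1}, {2, 4}}};

    if (spikes[0] == rule.required_spikes) {
        spikes[0] -= rule.consumed_spikes;
        for (const auto &e : rule.emissions)
            spikes[e.target_neuron] += e.spikes_sent;
    }

    for (int i = 0; i < 3; ++i)
        std::printf("neuron %d: %d spikes\n", i, spikes[i]);
    return 0;
}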


Author(s):  
Tzvetomir Ivanov Vassilev

This paper describes several techniques for accelerating a virtual try-on garment simulation on a mobile device (smartphone or tablet) using parallel computing on a multicore CPU, GPU computing, or both, depending on the mobile hardware. The system exploits a mass-spring cloth model with a velocity-modification approach to overcome super-elasticity. The simulation starts from flat garment pattern meshes positioned around a 3D human body; seaming forces are then applied to the edges of the panels until the garment is sewn together, and several cloth-draping steps are performed at the end. The cloth–body collision detection and response algorithm is based on image-space interference tests, while cloth–cloth collision detection uses an entirely GPU-based approach on newer hardware or a recursive parallel algorithm on the CPU. As the results section shows, the average time to dress a virtual body with a garment is 2 seconds on a modern smartphone supporting OpenGL ES 2.0 and less than one second on a tablet supporting OpenGL ES 3.0 or 3.1.
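
The velocity-modification idea can be sketched as a per-spring GPU pass. The kernel below is one common formulation of the approach, not necessarily the exact scheme used in the paper: limit_stretch, the stretch threshold passed in as max_stretch, and the equal split of the correction between the two endpoints are assumptions made for illustration.

// Hedged sketch of a velocity-modification step for over-stretched springs,
// one thread per spring. One common formulation of the idea, not necessarily
// the paper's exact scheme.
#include <cuda_runtime.h>

struct Spring { int a, b; float rest_length; };

__global__ void limit_stretch(const Spring *springs, int n_springs,
                              const float3 *pos, float3 *vel,
                              float max_stretch /* e.g. 1.1f */)
{
    int s = blockIdx.x * blockDim.x + threadIdx.x;
    if (s >= n_springs) return;

    int a = springs[s].a, b = springs[s].b;
    float3 d = make_float3(pos[b].x - pos[a].x,
                           pos[b].y - pos[a].y,
                           pos[b].z - pos[a].z);
    float len = sqrtf(d.x * d.x + d.y * d.y + d.z * d.z);
    if (len <= max_stretch * springs[s].rest_length || len == 0.0f) return;

    // Unit direction from a to b and the relative velocity along it.
    float3 u = make_float3(d.x / len, d.y / len, d.z / len);
    float3 rv = make_float3(vel[b].x - vel[a].x,
                            vel[b].y - vel[a].y,
                            vel[b].z - vel[a].z);
    float sep = rv.x * u.x + rv.y * u.y + rv.z * u.z;   // > 0: still stretching

    if (sep > 0.0f) {
        // Remove the stretching component, split equally between the endpoints.
        float half = 0.5f * sep;
        atomicAdd(&vel[a].x,  half * u.x);
        atomicAdd(&vel[a].y,  half * u.y);
        atomicAdd(&vel[a].z,  half * u.z);
        atomicAdd(&vel[b].x, -half * u.x);
        atomicAdd(&vel[b].y, -half * u.y);
        atomicAdd(&vel[b].z, -half * u.z);
    }
}

Atomic adds are used because several springs can share a particle and would otherwise race when correcting its velocity.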


2014 ◽  
Vol 15 (2) ◽  
pp. 285-329 ◽  
Author(s):  
Cristóbal A. Navarro ◽  
Nancy Hitschfeld-Kahler ◽  
Luis Mateu

Parallel computing has become an important subject in the field of computer science and has proven to be critical when researching high-performance solutions. The evolution of computer architectures (multi-core and many-core) towards a higher number of cores can only confirm that parallelism is the method of choice for speeding up an algorithm. In the last decade, the graphics processing unit, or GPU, has gained an important place in the field of high-performance computing (HPC) because of its low cost and massive parallel processing power. Supercomputing has become, for the first time, available to anyone at the price of a desktop computer. In this paper, we survey the concept of parallel computing and especially GPU computing. Achieving efficient parallel algorithms for the GPU is not a trivial task; several technical restrictions must be satisfied in order to achieve the expected performance. Some of these limitations are consequences of the underlying architecture of the GPU and the theoretical models behind it. Our goal is to present a set of theoretical and technical concepts that are often required to understand the GPU and its massive parallelism model. In particular, we show how this new technology can help the field of computational physics, especially when the problem is data-parallel. We present four examples of computational physics problems: n-body, collision detection, Potts model, and cellular automata simulations. These examples represent well the kinds of problems that are suitable for GPU computing. By understanding the GPU architecture and its massive parallelism programming model, one can overcome many of the technical limitations found along the way, design better GPU-based algorithms for computational physics problems, and achieve speedups of up to two orders of magnitude compared to sequential implementations.
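
As a flavor of the data-parallel problems listed above, the following kernel sketches one step of a simple cellular automaton (Conway's Game of Life), with one thread per cell. The example is chosen only to illustrate the class of problems the survey discusses; it is not taken from the paper, and the wrap-around boundary handling is an arbitrary choice.

// Minimal sketch of a data-parallel cellular-automaton step (Game of Life),
// one thread per cell; illustrative only, not from the surveyed paper.
#include <cuda_runtime.h>

__global__ void life_step(const unsigned char *in, unsigned char *out,
                          int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    // Count the eight neighbours with periodic (wrap-around) boundaries.
    int alive = 0;
    for (int dy = -1; dy <= 1; ++dy)
        for (int dx = -1; dx <= 1; ++dx) {
            if (dx == 0 && dy == 0) continue;
            int nx = (x + dx + width) % width;
            int ny = (y + dy + height) % height;
            alive += in[ny * width + nx];
        }

    unsigned char self = in[y * width + x];
    out[y * width + x] = (alive == 3 || (self && alive == 2)) ? 1 : 0;
}

A typical launch would cover the grid with 16x16 thread blocks and swap the in and out buffers between steps.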

