GPU Architecture
Recently Published Documents

TOTAL DOCUMENTS: 178 (FIVE YEARS: 38)
H-INDEX: 17 (FIVE YEARS: 3)

2022 ◽ Vol 19 (1) ◽ pp. 1-23
Author(s): Yaosheng Fu, Evgeny Bolotin, Niladrish Chatterjee, David Nellans, Stephen W. Keckler

As GPUs scale their low-precision matrix math throughput to boost deep learning (DL) performance, they upset the balance between math throughput and memory system capabilities. We demonstrate that a converged GPU design trying to address the diverging architectural requirements of FP32 (or larger)-based HPC and FP16 (or smaller)-based DL workloads results in configurations that are sub-optimal for both application domains. We argue that a Composable On-PAckage GPU (COPA-GPU) architecture providing domain-specialized GPU products is the most practical solution to these diverging requirements. A COPA-GPU leverages multi-chip-module disaggregation to support maximal design reuse, along with per-domain memory system specialization. We show how a COPA-GPU enables DL-specialized products by modular augmentation of the baseline GPU architecture with up to 4× higher off-die bandwidth, 32× larger on-package cache, and 2.3× higher DRAM bandwidth and capacity, while conveniently supporting scaled-down HPC-oriented designs. This work explores the microarchitectural design necessary to enable composable GPUs and evaluates the benefits composability can provide to HPC, DL training, and DL inference. We show that, compared to a converged GPU design, a DL-optimized COPA-GPU featuring a combination of 16× larger cache capacity and 1.6× higher DRAM bandwidth scales per-GPU training and inference performance by 31% and 35%, respectively, and reduces the number of GPU instances by 50% in scale-out training scenarios.
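The tension between math throughput and memory bandwidth that motivates COPA-GPU can be made concrete with a simple roofline-style balance check. The sketch below is illustrative only; the throughput and bandwidth figures are hypothetical placeholders, not measurements from the paper.

```python
# Roofline-style balance check: how much arithmetic intensity (FLOPs per byte)
# a kernel needs before it is limited by math throughput rather than memory.
# All hardware numbers below are hypothetical placeholders.

def balance_point(peak_flops: float, mem_bw: float) -> float:
    """Arithmetic intensity (FLOPs/byte) at which compute and memory balance."""
    return peak_flops / mem_bw

fp32_peak = 20e12    # hypothetical FP32 peak, FLOP/s
fp16_peak = 320e12   # hypothetical FP16 tensor-core peak, FLOP/s
dram_bw   = 2e12     # hypothetical DRAM bandwidth, bytes/s

print(f"FP32 balance: {balance_point(fp32_peak, dram_bw):.1f} FLOPs/byte")
print(f"FP16 balance: {balance_point(fp16_peak, dram_bw):.1f} FLOPs/byte")
# The ~16x gap in required intensity is why a DL-specialized COPA-GPU adds
# large on-package caches and extra DRAM bandwidth rather than more math units.
```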


Sensors ◽ 2022 ◽ Vol 22 (1) ◽ pp. 396
Author(s): Gaogao Liu, Wenbo Yang, Peng Li, Guodong Qin, Jingjing Cai, ...

MIMO radar involves huge data volumes and computational loads, so very high-speed computation is necessary for real-time processing. In this paper, we study the time-division MIMO radar signal-processing flow and propose an improved MIMO radar signal-processing algorithm that raises processing speed relative to previous algorithms; on this basis, a parallel simulation system for MIMO radar based on a CPU/GPU architecture is proposed. The outer layer of the framework uses coarse-grained OpenMP parallelism for acceleration on the CPU, while the inner layer of fine-grained data processing is accelerated on the GPU. Its performance is significantly faster than serial computation, and satisfactory acceleration effects have been achieved in the CPU/GPU architecture simulation. The experimental results show that the MIMO radar parallel simulation system with the CPU/GPU architecture greatly improves on the computing power of the CPU-based method: compared with the serial CPU method, the GPU simulation achieves a speedup of 130 times, and the CPU/GPU-based parallel simulation system delivers a further 13% performance improvement over the GPU-only method.
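A minimal sketch of the two-level decomposition described above: a coarse-grained outer loop over radar channels runs on CPU threads, while the fine-grained per-channel work (here just a range FFT) is offloaded to the GPU. CuPy stands in for the GPU stage; the channel count, sample count, and processing step are hypothetical.

```python
# Two-level CPU/GPU decomposition sketch: coarse-grained channel-level
# parallelism on CPU threads, fine-grained per-channel work on the GPU.
# Assumes CuPy is installed; the workload shapes are hypothetical.
from concurrent.futures import ThreadPoolExecutor

import numpy as np
import cupy as cp

N_CHANNELS, N_SAMPLES = 16, 4096  # hypothetical TDM-MIMO dimensions
echoes = np.random.randn(N_CHANNELS, N_SAMPLES).astype(np.complex64)

def process_channel(ch: np.ndarray) -> np.ndarray:
    """Fine-grained inner stage: range FFT on the GPU for one channel."""
    return cp.asnumpy(cp.fft.fft(cp.asarray(ch)))

# Coarse-grained outer stage: one task per channel, analogous to the
# OpenMP-parallelized outer loop in the paper's framework.
with ThreadPoolExecutor(max_workers=4) as pool:
    range_profiles = list(pool.map(process_channel, echoes))
```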


2021 ◽ Vol 8 (4) ◽ pp. 1-23
Author(s): Shao-Chung Wang, Lin-Ya Yu, Li-An Her, Yuan-Shin Hwang, Jenq-Kuen Lee

A modern GPU is designed with many large thread groups to achieve high throughput and performance. Within these groups, threads are grouped into fixed-size SIMD batches in which the same instruction is applied to vectors of data in lockstep. This GPU architecture is suitable for applications with a high degree of data parallelism, but its performance degrades seriously when divergence occurs. Many optimizations for divergence have been proposed, and they vary with the divergence information available about variables and branches. A previous analysis scheme treated pointers and function return values as divergent directly and focused only on OpenCL 1.x. In this article, we present a novel scheme that reports divergence information for pointer-intensive OpenCL programs. The approach is based on extended static single assignment (SSA) and adds special functions and annotations from memory SSA and gated SSA. The proposed scheme first constructs extended SSA, which is then used to build a divergence relation graph that includes all of the possible points-to relationships of the pointers and their initial divergence states. The divergence state of the pointers can then be determined by propagating divergence states through the divergence relation graph. The scheme is further extended to interprocedural cases by considering function-related statements. The proposed scheme was implemented in an LLVM compiler and can be applied to OpenCL programs. We analyzed 10 programs with 24 kernels, with a total analyzed program size of 1,306 instructions in the LLVM intermediate representation, containing 885 variables, 108 branches, and 313 pointer-related statements. The total number of divergent pointers detected was 146 for the proposed scheme, 200 for a scheme in which pointers are always treated as divergent, and 155 for the current LLVM default scheme; the total numbers of divergent variables detected were 458, 519, and 482, respectively, with 31, 34, and 32 divergent branches. These experimental results indicate that the proposed scheme is more precise than both a scheme in which a pointer is always divergent and the current LLVM default scheme.
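The core of such an analysis is a fixed-point propagation of divergence states over a relation graph. The sketch below is a simplified stand-in, assuming a plain edge list rather than the paper's extended-SSA construction: nodes seeded as divergent (e.g., thread-ID-derived values) taint everything reachable through def-use and points-to edges.

```python
# Simplified divergence propagation sketch: a worklist fixed-point over a
# divergence relation graph. Edges approximate def-use and points-to
# relations; the graph and seeds here are hypothetical, not from the paper.
from collections import defaultdict, deque

edges = defaultdict(list)          # node -> nodes whose state depends on it
def add_edge(src: str, dst: str) -> None:
    edges[src].append(dst)

# Hypothetical kernel fragment: tid = get_global_id(0); p = &buf[tid]; v = *p + c
add_edge("tid", "p")               # pointer computed from a divergent index
add_edge("p", "v")                 # load through a possibly divergent pointer
add_edge("c", "v")                 # a uniform contribution

divergent = {"tid"}                # seed: thread IDs are divergent by definition
work = deque(divergent)
while work:                        # propagate until a fixed point is reached
    node = work.popleft()
    for succ in edges[node]:
        if succ not in divergent:
            divergent.add(succ)
            work.append(succ)

print(sorted(divergent))           # ['p', 'tid', 'v'] -- 'c' stays uniform
```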


2021 ◽ Vol 2131 (3) ◽ pp. 032025
Author(s): Oleg Agibalov, Nikolay Ventsov

Abstract The problem under consideration consists in choosing a number k of individuals such that the time to process k individuals with a genetic algorithm (GA) on the CPU architecture is close to the time to process l individuals with the GA on the GPU architecture. The initial information consists of data arrays containing the processing times of a given number of individuals by the genetic algorithm on the available hardware architectures. From these arrays, fuzzy numbers are determined that describe the processing time of a given number of individuals on the CPU and GPU architectures, respectively. The peculiarities of the subject area do not allow the well-known comparison methods, based on equality of membership functions and on nearest crisp sets, to be considered adequate. Based on the known formula "close to Y (around Y)," a method for comparing these fuzzy numbers was developed in order to determine the degree of closeness of the processing times of k and l individuals on the CPU and GPU hardware architectures, respectively.
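One common way to quantify how close two fuzzy processing-time estimates are is a possibility-style measure: the supremum of the pointwise minimum of the two membership functions. The sketch below assumes triangular fuzzy numbers and example parameters; the paper's actual "close to Y (around Y)" formula may differ.

```python
# Closeness of two triangular fuzzy processing-time estimates, measured as
# sup_x min(mu_A(x), mu_B(x)) -- a standard possibility measure. Triangular
# shapes and all parameters are assumptions for illustration.
import numpy as np

def triangular(x: np.ndarray, a: float, b: float, c: float) -> np.ndarray:
    """Membership of a triangular fuzzy number with support [a, c], peak b."""
    return np.maximum(np.minimum((x - a) / (b - a), (c - x) / (c - b)), 0.0)

x = np.linspace(0.0, 20.0, 2001)            # time axis, arbitrary units
mu_cpu = triangular(x, 6.0, 8.0, 11.0)      # time for k individuals on CPU
mu_gpu = triangular(x, 7.0, 9.0, 10.0)      # time for l individuals on GPU

closeness = np.max(np.minimum(mu_cpu, mu_gpu))
print(f"degree of closeness: {closeness:.3f}")  # 1.0 would mean coinciding peaks
```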


SPE Journal ◽ 2021 ◽ pp. 1-20
Author(s): A. M. Manea, T. Almani

Summary In this work, the scalability of two key multiscale solvers for the pressure equation arising from incompressible flow in heterogeneous porous media, namely the multiscale finite volume (MSFV) solver and the restriction-smoothed basis multiscale (MsRSB) solver, is investigated on the massively parallel graphics processing unit (GPU) architecture. The robustness and scalability of both solvers are compared against their corresponding carefully optimized implementations on the shared-memory multicore architecture in a structured problem setting. Although several components of the MSFV and MsRSB algorithms are directly parallelizable, their scalability on the GPU architecture depends heavily on the underlying algorithmic details and data-structure design of every step, where one needs to ensure favorable control and data flow on the GPU while extracting enough parallel work for a massively parallel environment. In addition, the type of algorithm chosen for each step greatly influences the overall robustness of the solver. Thus, we extend the work on the parallel multiscale methods of Manea et al. (2016) to map the MSFV and MsRSB special kernels to the massively parallel GPU architecture. The scalability of our optimized parallel MSFV and MsRSB GPU implementations is demonstrated using highly heterogeneous structured 3D problems derived from the SPE10 Benchmark (Christie and Blunt 2001). These problems range in size from millions to tens of millions of cells. For both solvers, the multicore implementations are benchmarked on a shared-memory multicore architecture consisting of two Intel® Cascade Lake Xeon Gold 6246 central processing unit (CPU) packages, whereas the GPU implementations are benchmarked on a massively parallel architecture consisting of NVIDIA Volta V100 GPUs. We compare the multicore implementations to the GPU implementations for both the setup and solution stages. Finally, we compare the parallel MsRSB scalability to the scalability of MSFV on the multicore (Manea et al. 2016) and GPU architectures. To the best of our knowledge, this is the first parallel implementation and demonstration of these versatile multiscale solvers on the GPU architecture. NOTE: This paper is published as part of the 2021 SPE Reservoir Simulation Conference Special Issue.
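Both MSFV and MsRSB are two-level methods: a prolongation operator built from local basis functions maps a coarse solution back to the fine grid, and the coarse system is formed as a Galerkin triple product. The sketch below shows only that generic two-level skeleton with a deliberately crude piecewise-constant prolongation; the basis-function construction (finite-volume or restriction-smoothed) is the part each solver specializes, and is not reproduced here.

```python
# Generic two-level multiscale skeleton shared by MSFV/MsRSB: build a coarse
# operator A_c = R A P, solve it, and prolong back to the fine grid. The
# piecewise-constant prolongation is a placeholder for real basis functions.
import numpy as np

n_fine, n_coarse, block = 16, 4, 4
A = np.diag(2.0 * np.ones(n_fine)) \
    - np.diag(np.ones(n_fine - 1), 1) \
    - np.diag(np.ones(n_fine - 1), -1)        # 1D pressure-like operator
b = np.ones(n_fine)

# Prolongation P: each coarse dof covers a contiguous block of fine cells.
P = np.zeros((n_fine, n_coarse))
for j in range(n_coarse):
    P[j * block:(j + 1) * block, j] = 1.0
R = P.T                                        # restriction (Galerkin choice)

A_c = R @ A @ P                                # coarse-scale system
x_c = np.linalg.solve(A_c, R @ b)              # cheap coarse solve
x_ms = P @ x_c                                 # multiscale approximation
print(np.linalg.norm(b - A @ x_ms))            # residual of the approximation
```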


2021
Author(s): Jheanel Estrada, Gil Opina Jr, Anshuman Tripathi

The autonomous vehicle has been an exciting yet complex field to dig into these past few years. Many have ventured to develop Level 4 autonomous vehicles, but up to this point many issues still arise concerning safety, perception and sensing capabilities, tracking, and localization. This paper aims to address the struggles of developing an acceptable model for real-time object detection. Object detection is one of the challenging areas of autonomous vehicles due to the limitations of the camera, lidar, radar, and other sensors, especially at night-time. Various datasets and models are available, but the number of samples, the labels, occlusions, and other factors may affect a dataset's performance. To address this problem, this study underwent a rigorous process of scene selection and imitation to deal with the imbalanced dataset and applied the state-of-the-art YOLO architecture for model development. After development, the model was deployed on a multi-GPU architecture that lessens the computational load relative to a single-GPU setup, and it was tested on a 12-meter fully electric autonomous bus. This study will contribute to the development of a usable and safe autonomous bus that will lead the future of public transportation.
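A minimal sketch of spreading detection inference across several GPUs. The paper does not specify its deployment framework, so PyTorch's DataParallel is only one plausible realization, and the tiny detector module is a hypothetical stand-in for a YOLO-style network.

```python
# Multi-GPU inference sketch: replicate a detector across available GPUs and
# split each input batch among them. PyTorch and DataParallel are assumed;
# the paper does not name its deployment stack, and the model is a stub.
import torch
import torch.nn as nn

class TinyDetector(nn.Module):           # stand-in for a YOLO-style network
    def __init__(self):
        super().__init__()
        self.backbone = nn.Conv2d(3, 16, 3, padding=1)
        self.head = nn.Conv2d(16, 5, 1)  # 4 box coords + 1 objectness score

    def forward(self, x):
        return self.head(torch.relu(self.backbone(x)))

model = TinyDetector()
if torch.cuda.device_count() > 1:        # shard batches across GPUs
    model = nn.DataParallel(model)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)

frames = torch.randn(8, 3, 416, 416)     # a hypothetical camera batch
with torch.no_grad():
    detections = model(frames.to(device))
```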


2021
Author(s): Usuf Middya, Abdulrahman Manea, Maitham Alhubail, Todd Ferguson, Thomas Byer, ...

Abstract Reservoir simulation computational costs have been growing continuously due to high-resolution reservoir characterization, increasing model complexity, and uncertainty-analysis workflows. Reducing simulation cost by upscaling is often necessary to meet operational requirements. Fast-evolving HPC technologies offer opportunities to reduce cost without compromising fidelity. This work presents a novel in-house massively parallel full-physics reservoir simulator running on the emerging GPU architecture. Almost all of the simulation kernels have been designed and implemented to honor the GPU SIMD programming paradigm. These kernels include physical property calculations, phase-equilibrium computations, Jacobian construction, linear and nonlinear solvers, and wells. Novel techniques are devised in various kernels to expose enough parallelism to ensure that the control- and data-flow patterns are well suited to the GPU environment. Mixed-precision computation is also employed when appropriate (e.g., in derivative calculation) to reduce computational costs without compromising solution accuracy. The GPU implementation of the simulator is tested and benchmarked using various reservoir models, ranging from the synthetic SPE10 Benchmark (Christie & Blunt, 2001) to several industrial-scale models. These real field models range in size from tens of millions of cells to more than a billion cells, with black-oil and multicomponent compositional fluids. The GPU simulator is benchmarked on the IBM AC922 massively parallel architecture with tens of NVIDIA Volta V100 GPUs. To compare performance with CPU architectures, an optimized CPU implementation of the simulator is benchmarked on the IBM AC922 CPUs and on a cluster consisting of thousands of Intel Haswell-EP Xeon® E5-2680 v3 CPUs. A detailed analysis of several numerical experiments comparing the simulator performance on the GPU and the CPU architectures is presented. In almost all cases, the analysis shows that hardware acceleration offers substantial benefits in terms of wall time and power consumption. This novel in-house full-physics, black-oil and compositional reservoir simulator employs several novel techniques in various simulation kernels to ensure full utilization of the GPU resources. A detailed analysis is presented to highlight the simulator performance in terms of runtime reduction, parallel scalability, and power savings.
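The mixed-precision idea mentioned above (e.g., in derivative calculation) can be illustrated with a generic Newton step: residuals kept in float64, the finite-difference Jacobian assembled in float32. The toy residual function and sizes below are hypothetical, not the simulator's physics.

```python
# Mixed-precision Newton step sketch: keep the residual in float64 but
# assemble the (approximate) finite-difference Jacobian in float32, which is
# often accurate enough for derivatives while halving their memory footprint.
# The residual function below is a hypothetical stand-in for real physics.
import numpy as np

def residual(x: np.ndarray) -> np.ndarray:
    return x**3 - np.roll(x, 1) - 1.0        # toy coupled nonlinear system

def jacobian_fp32(x: np.ndarray, eps: float = 1e-4) -> np.ndarray:
    n = x.size
    J = np.empty((n, n), dtype=np.float32)   # derivatives in single precision
    r0 = residual(x)
    for j in range(n):
        xp = x.copy(); xp[j] += eps
        J[:, j] = ((residual(xp) - r0) / eps).astype(np.float32)
    return J

x = np.ones(8, dtype=np.float64)             # float64 solution state
for _ in range(10):                          # Newton iterations
    r = residual(x)                          # exact float64 residual
    dx = np.linalg.solve(jacobian_fp32(x).astype(np.float64), -r)
    x += dx
# The float32 Jacobian only slows convergence slightly; because the residual
# stays in float64, the converged solution is not degraded.
print(np.linalg.norm(residual(x)))
```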


2021 ◽ pp. 100267
Author(s): Lining Yu, Tiezheng Nie, Derong Shen, Yue Kou

Mathematics ◽ 2021 ◽ Vol 9 (14) ◽ pp. 1685
Author(s): Julian Miller, Lukas Trümper, Christian Terboven, Matthias S. Müller

With the quickly evolving hardware landscape of high-performance computing (HPC) and its increasing specialization, implementing efficient software applications becomes more challenging. This is especially prevalent for domain scientists and may hinder advances in large-scale simulation software. One idea for overcoming these challenges is software abstraction. We present a model of parallel algorithms that allows for global optimization of their synchronization and dataflow and for optimal mapping to complex and heterogeneous architectures. The presented model strictly separates the structure of an algorithm from its executed functions. It utilizes a hierarchical decomposition of parallel design patterns as well-established building blocks for algorithmic structures and captures them in an abstract pattern tree (APT). A data-centric flow graph is constructed from the APT, which acts as an intermediate representation for rich and automated structural transformations. We demonstrate the applicability of this model to three representative algorithms and show runtime speedups between 1.83× and 2.45× on a typical heterogeneous CPU/GPU architecture.
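The key idea, separating an algorithm's structure from its executed functions, can be sketched as a small tree of parallel design patterns whose leaves carry opaque user callables. The node types and the tiny evaluator below are hypothetical illustrations of an abstract pattern tree, not the paper's implementation.

```python
# Abstract pattern tree (APT) sketch: algorithmic structure (the pattern
# nodes) is kept separate from the executed functions (plain callables at
# the leaves), so the structure can be analyzed and transformed on its own.
# Node names and the evaluator are illustrative, not the paper's API.
from dataclasses import dataclass
from functools import reduce
from typing import Any, Callable

@dataclass
class Map:                      # apply a function elementwise (parallelizable)
    fn: Callable[[Any], Any]

@dataclass
class Reduce:                   # combine elements pairwise (tree-reducible)
    fn: Callable[[Any, Any], Any]

@dataclass
class Seq:                      # sequential composition of pattern nodes
    stages: list

def evaluate(node, data):
    """Reference (serial) semantics; a backend could map Map/Reduce to a GPU."""
    if isinstance(node, Map):
        return [node.fn(x) for x in data]
    if isinstance(node, Reduce):
        return reduce(node.fn, data)
    if isinstance(node, Seq):
        for stage in node.stages:
            data = evaluate(stage, data)
        return data

# Structure: square every element, then sum -- the functions stay swappable.
apt = Seq([Map(lambda x: x * x), Reduce(lambda a, b: a + b)])
print(evaluate(apt, range(5)))  # 30
```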

