MG-Join: A Scalable Join for Massively Parallel Multi-GPU Architectures

GPUs deliver higher performance than traditional processors while offering remarkable energy efficiency, and they are quickly becoming popular processors for HPC applications. Still, writing efficient and scalable programs for GPUs is not easy, as codes must adapt to increasingly parallel architectural features. In this chapter, the authors describe in full detail design and implementation strategies for lattice Boltzmann (LB) codes that meet these goals. Most of the discussion uses a state-of-the-art thermal lattice Boltzmann method in 2D, but all lessons learned in this particular case can be readily extended to most LB and other scientific applications. The authors describe the structure of the code, discussing in detail several key design choices that were guided by theoretical models of performance and experimental benchmarks, with both single-GPU codes and massively parallel implementations on commodity clusters of GPUs in mind. The authors then present and analyze performance on several recent GPU architectures, including data on energy optimization.
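
For readers unfamiliar with the structure of an LB code, the following is a minimal sketch of one collide-and-stream update. It assumes the standard isothermal D2Q9 lattice with a BGK collision operator on a periodic grid, which is much simpler than the thermal model the chapter discusses. The function names (`equilibrium`, `lb_step`, `total_mass`, `init_populations`) and the grid size are illustrative assumptions, not taken from the chapter.

```python
# Minimal D2Q9 BGK lattice Boltzmann step on a periodic grid (illustrative only).
# Assumption: isothermal D2Q9 stands in for the chapter's thermal model.

# Discrete velocities and weights of the standard D2Q9 lattice.
VELOCITIES = [(0, 0), (1, 0), (0, 1), (-1, 0), (0, -1),
              (1, 1), (-1, 1), (-1, -1), (1, -1)]
WEIGHTS = [4 / 9] + [1 / 9] * 4 + [1 / 36] * 4


def equilibrium(rho, ux, uy):
    """Second-order equilibrium populations for one lattice site."""
    usq = ux * ux + uy * uy
    feq = []
    for (cx, cy), w in zip(VELOCITIES, WEIGHTS):
        cu = cx * ux + cy * uy
        feq.append(w * rho * (1.0 + 3.0 * cu + 4.5 * cu * cu - 1.5 * usq))
    return feq


def lb_step(f, nx, ny, tau):
    """One collide-and-stream update; f[q][x][y] keeps the population index outermost."""
    new_f = [[[0.0] * ny for _ in range(nx)] for _ in range(9)]
    for x in range(nx):
        for y in range(ny):
            pops = [f[q][x][y] for q in range(9)]
            rho = sum(pops)
            ux = sum(p * c[0] for p, c in zip(pops, VELOCITIES)) / rho
            uy = sum(p * c[1] for p, c in zip(pops, VELOCITIES)) / rho
            feq = equilibrium(rho, ux, uy)
            for q, (cx, cy) in enumerate(VELOCITIES):
                # BGK collision: relax towards equilibrium with time scale tau.
                post = pops[q] - (pops[q] - feq[q]) / tau
                # Streaming: move the population to its neighbour (periodic wrap).
                new_f[q][(x + cx) % nx][(y + cy) % ny] = post
    return new_f


def total_mass(f):
    """Sum of all populations over the whole grid."""
    return sum(v for plane in f for row in plane for v in row)


def init_populations(nx, ny):
    """Fluid at rest with a small density bump in the middle of the grid."""
    f = [[[0.0] * ny for _ in range(nx)] for _ in range(9)]
    for x in range(nx):
        for y in range(ny):
            rho = 1.0 + (0.1 if (x, y) == (nx // 2, ny // 2) else 0.0)
            feq = equilibrium(rho, 0.0, 0.0)
            for q in range(9):
                f[q][x][y] = feq[q]
    return f


# Hypothetical usage: an 8x8 grid advanced by five steps.
f_start = init_populations(8, 8)
f_end = f_start
for _ in range(5):
    f_end = lb_step(f_end, 8, 8, tau=0.8)
```

Each site reads only its own nine populations during collision and writes to fixed neighbours during streaming, which is why this kind of update maps naturally onto one GPU thread per lattice site.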

Performance evaluation in the reconstruction of 2D images of computed tomography using massively parallel programming CUDA

DOI: 10.21203/rs.3.rs-863369/v1 (2021)

Author(s): Alexssandro Ferreira Cordeiro, Pedro Luiz de Paula Filho, Hamilton Pereira Silva, Arnaldo Candido Junior, Edresson Casanova, ...

Keyword(s): Parallel Programming, Processing Time, Data Type, Massively Parallel, Data Types, Sequential Approach, Time Performance, Sequential Programming, Gpu Architectures, 2D Images

Abstract Purpose: analysis of processing time and similarity of images generated between CPU and GPU architectures and between sequential and parallel programming methodologies. Material and methods: for image processing, a computer with an AMD FX-8350 processor and an Nvidia GTX 960 Maxwell GPU was used, along with the CUDAFY library and the C# programming language with the Visual Studio IDE. Results: the comparisons indicate that sequential programming on a CPU generates reliable images at a high cost in time when compared to parallel programming on CPU and GPU, while parallel programming generates faster results but with increased noise in the reconstructed image. For the float data type, the GPU obtained the best result, with an average time equivalent to 1/3 of that of the processor; however, when the data is of type double, the parallel CPU approach obtained the best performance. Conclusion: for the float data type, the GPU had the best average time performance, while for the double data type the best average time performance was obtained by the parallel CPU approach. Regarding image quality, the sequential approach obtained similar outputs, while the parallel approaches generated noise in their outputs.
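
The kind of comparison described in this abstract (processing time plus similarity of outputs across data types) can be sketched as a small measurement harness. The example below is a toy under stated assumptions: it accumulates hypothetical per-angle contributions into one image row, emulates single precision by rounding through `struct`, and uses RMSE as the similarity measure. The abstract does not say which similarity metric or reconstruction algorithm the authors used, and the helper names (`to_float32`, `accumulate`, `rmse`, `timed`) are illustrative.

```python
import math
import struct
import time


def to_float32(value):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def accumulate(rows, single_precision):
    """Sum per-angle contributions into one image row, pixel by pixel."""
    width = len(rows[0])
    out = [0.0] * width
    for row in rows:
        for i, v in enumerate(row):
            total = out[i] + v
            # Emulate a float accumulator by rounding after every addition.
            out[i] = to_float32(total) if single_precision else total
    return out


def rmse(a, b):
    """Root-mean-square error between two equally sized pixel lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


def timed(fn, *args):
    """Return (result, elapsed seconds) for a single call."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


# Hypothetical data: 360 projection angles contributing to a 64-pixel row.
rows = [[math.sin(0.01 * a * (i + 1)) + 1.5 for i in range(64)] for a in range(360)]
img_double, t_double = timed(accumulate, rows, False)
img_float, t_float = timed(accumulate, rows, True)

print(f"RMSE float vs double: {rmse(img_float, img_double):.3e}")
# Note: the float path is emulated in software here, so these timings only
# demonstrate the harness and say nothing about real CPU or GPU performance.
print(f"time double: {t_double:.4f}s, float (emulated): {t_float:.4f}s")
```

On real hardware the same harness would wrap the sequential, parallel CPU, and GPU implementations, with the sequential double-precision output serving as the reference image.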

A Massively Parallel Restriction-Smoothed Basis Multiscale Solver on Multi-Core and GPU Architectures

DOI: 10.2118/203939-ms (2021)

Author(s): Abdulrahman Manea

Keyword(s): Shared Memory, Parallel Implementation, Real Life, Parallel Architecture, Industrial Applications, Multiscale Methods, Basis Functions, Massively Parallel, Gpu Architectures, Gpu Implementation

Abstract Due to its simplicity, adaptability, and applicability to various grid formats, the restriction-smoothed basis multiscale method (MsRSB) (Møyner and Lie 2016) has received wide attention and has been extended to various flow problems in porous media. Unlike the standard multiscale methods, MsRSB relies on iterative smoothing to find the multiscale basis functions in an adaptive manner, giving it the ability to naturally adjust to various complex grid orientations often encountered in real-life industrial applications. In this work, we investigate the scalability of MsRSB on various state-of-the-art parallel architectures, including multi-core systems and GPUs. While MsRSB is, like most other multiscale methods, directly amenable to parallelization, the dependence on a smoother to find the basis functions creates unique control- and data-flow patterns. These patterns require careful design and implementation in parallel environments to achieve good scalability. We extend the work on parallel multiscale methods in Manea et al. (2016) and Manea and Almani (2019) to map the MsRSB special kernels to the shared-memory parallel multi-core and GPU architectures. The scalability of our optimized parallel MsRSB implementation is demonstrated using highly heterogeneous 3D problems derived from the SPE10 Benchmark (Christie and Blunt 2001). Those problems range in size from millions to tens of millions of cells. The multi-core implementation is benchmarked on a shared memory multi-core architecture consisting of two packages of Intel's Cascade Lake Xeon® Gold 6246 CPU, while the GPU implementation is benchmarked on a massively parallel architecture consisting of Nvidia Volta V100 GPUs. We compare the multi-core implementation to the GPU implementation for both the setup and solution stages. To the best of our knowledge, this is the first parallel implementation and demonstration of the versatile MsRSB method on the GPU architecture.
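
To make the idea of finding basis functions by iterative smoothing concrete, here is a simplified 1D sketch. It is not the authors' implementation: it assumes a damped-Jacobi smoother, restricts each basis function to a fixed support region, and then renormalises each row to keep the basis functions summing to one, which stands in for the more careful boundary treatment of the published method. The function names, the permeability values, and the block and support layouts are all hypothetical.

```python
def build_matrix(perm):
    """Tridiagonal 1D flow matrix from harmonic-averaged face transmissibilities."""
    n = len(perm)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n - 1):
        t = 2.0 * perm[i] * perm[i + 1] / (perm[i] + perm[i + 1])
        A[i][i] += t
        A[i + 1][i + 1] += t
        A[i][i + 1] -= t
        A[i + 1][i] -= t
    return A


def smooth_basis(A, blocks, supports, omega=2.0 / 3.0, iterations=50):
    """Restricted, damped-Jacobi smoothing of coarse-block indicator functions."""
    n = len(A)
    m = len(blocks)
    # Start from the indicator function of each coarse block.
    P = [[1.0 if i in blocks[j] else 0.0 for j in range(m)] for i in range(n)]
    for _ in range(iterations):
        new_P = [row[:] for row in P]
        for j in range(m):
            # Restriction: only cells inside the support region are updated,
            # so the basis function stays zero everywhere else.
            for i in supports[j]:
                # Jacobi increment: -omega * (A P_j)_i / A_ii, using the old P only.
                residual = sum(A[i][k] * P[k][j] for k in range(n))
                new_P[i][j] = P[i][j] - omega * residual / A[i][i]
        # Renormalise each row so the basis functions still sum to one.
        for i in range(n):
            s = sum(new_P[i])
            new_P[i] = [v / s for v in new_P[i]]
        P = new_P
    return P


# Hypothetical setup: 12 fine cells, 3 coarse blocks, overlapping supports.
perm = [1.0, 5.0, 0.1, 2.0, 1.0, 10.0, 0.5, 1.0, 3.0, 0.2, 1.0, 4.0]
blocks = [range(0, 4), range(4, 8), range(8, 12)]
supports = [range(0, 6), range(2, 10), range(6, 12)]
A = build_matrix(perm)
P = smooth_basis(A, blocks, supports)
```

Because every update in one sweep reads only the previous iterate, the (cell, basis function) pairs are independent of each other within a sweep, which is the property that makes this style of smoother a candidate for multi-core and GPU execution.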
