A Novel Multi-GPU Parallel Optimization Model for The Sparse Matrix-Vector Multiplication

2016 · Vol 26 (04) · pp. 1640001
Author(s): Jiaquan Gao, Yuanshen Zhou, Kesong Wu

Accelerating sparse matrix-vector multiplication (SpMV) on graphics processing units (GPUs) has attracted considerable attention recently. We observe that on a specific multiple-GPU platform, SpMV performance can usually be greatly improved when a matrix is partitioned into several blocks according to a predetermined rule and each block is assigned to a GPU with an appropriate storage format. This motivates us to propose a novel multi-GPU parallel SpMV optimization model. Our model involves two stages. In the first stage, a simple rule is defined to divide any given matrix among multiple GPUs, and a performance model, which is independent of the problems and dependent only on the resources of the devices, is proposed to accurately predict the execution time of SpMV kernels. Using these models, in the second stage we construct an optimal multi-GPU parallel SpMV algorithm that is generated automatically and rapidly for any problem on the platform. Because the performance model is general, problem-independent, and dependent only on device resources, it needs to be constructed only once for each type of GPU. The experiments validate the high efficiency of the proposed model.
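As a rough illustration of the first-stage partitioning (the paper's actual rule is not reproduced here), the host-side CUDA C++ sketch below splits the rows of a CSR matrix into contiguous blocks with approximately equal nonzero counts, one block per GPU. The function name partition_rows_by_nnz and the nnz-balancing heuristic are illustrative assumptions, not the authors' method.

#include <cstdint>
#include <vector>

// Illustrative only: split CSR rows into num_gpus contiguous blocks with
// roughly equal nonzero counts; the paper's predetermined rule may differ.
std::vector<int> partition_rows_by_nnz(const std::vector<int>& row_ptr,
                                       int num_gpus) {
    const int n_rows = static_cast<int>(row_ptr.size()) - 1;
    const std::int64_t nnz = row_ptr[n_rows];
    std::vector<int> bounds(num_gpus + 1, n_rows);
    bounds[0] = 0;
    int row = 0;
    for (int g = 1; g < num_gpus; ++g) {
        // Advance until the first g blocks hold ~g/num_gpus of all nonzeros.
        const std::int64_t target = nnz * g / num_gpus;
        while (row < n_rows && row_ptr[row + 1] < target) ++row;
        bounds[g] = row;
    }
    return bounds;  // rows [bounds[g], bounds[g+1]) go to GPU g
}

Each row block would then be stored on its GPU in whichever format the performance model predicts to be fastest for that block.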

2016 · Vol 2016 · pp. 1-12
Author(s): Guixia He, Jiaquan Gao

Sparse matrix-vector multiplication (SpMV) is an important operation in scientific computations. Compressed sparse row (CSR) is the most frequently used format for storing sparse matrices. However, CSR-based SpMV implementations on graphics processing units (GPUs), for example, CSR-scalar and CSR-vector, usually perform poorly due to irregular memory access patterns. This motivates us to propose a perfect CSR-based SpMV algorithm for the GPU, called PCSR. PCSR involves two kernels and accesses the CSR arrays in a fully coalesced manner by introducing a middle array, which greatly alleviates the deficiencies of CSR-scalar (rare coalescing) and CSR-vector (partial coalescing). Test results on a single C2050 GPU show that PCSR outperforms CSR-scalar, CSR-vector, and the CSRMV and HYBMV routines in the vendor-tuned CUSPARSE library, and is comparable with a recently proposed CSR-based algorithm, CSR-Adaptive. Furthermore, we extend PCSR from a single GPU to multiple GPUs. Experimental results on four C2050 GPUs show that, whether or not the communication between GPUs is counted, PCSR on multiple GPUs achieves good performance and high parallel efficiency.
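PCSR itself is not reproduced here; for context, the sketch below shows a standard CSR-vector kernel (one warp per row), whose only partially coalesced access to the vals and cols arrays is exactly the deficiency PCSR targets. The kernel and parameter names are illustrative.

// Standard CSR-vector SpMV (one 32-thread warp per row). Lane accesses to
// vals/cols coalesce only while a row still has a full warp-width of
// consecutive nonzeros, hence "partial coalescing".
__global__ void csr_vector_spmv(int n_rows, const int* __restrict__ row_ptr,
                                const int* __restrict__ cols,
                                const double* __restrict__ vals,
                                const double* __restrict__ x,
                                double* __restrict__ y) {
    const int warp_id = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    const int lane = threadIdx.x & 31;
    if (warp_id >= n_rows) return;  // whole warp exits together

    double sum = 0.0;
    for (int j = row_ptr[warp_id] + lane; j < row_ptr[warp_id + 1]; j += 32)
        sum += vals[j] * x[cols[j]];

    // Warp-level reduction of the partial sums.
    for (int offset = 16; offset > 0; offset >>= 1)
        sum += __shfl_down_sync(0xffffffffu, sum, offset);
    if (lane == 0) y[warp_id] = sum;
}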


2016 · Vol 54 · pp. 490-500
Author(s): Jilin Zhang, Jian Wan, Fangfang Li, Jie Mao, Li Zhuang, ...

2019 · Vol 76 (3) · pp. 2063-2081
Author(s): Yishui Li, Peizhen Xie, Xinhai Chen, Jie Liu, Bo Yang, ...

2020 · Vol 11 (3) · pp. 61-84
Author(s): Konstantin Isupov, Vladimir Knyazkov

We consider a parallel implementation of matrix-vector multiplication (GEMV, Level 2 of the BLAS) for graphics processing units (GPUs) using multiple-precision arithmetic based on the residue number system. In our GEMV implementation, element-wise operations with multiple-precision vectors and matrices consist of several parts, each of which is calculated by a separate CUDA kernel. This eliminates branch divergence when performing the sequential parts of multiple-precision operations and allows full utilization of the GPU's resources. An efficient data structure for storing arrays with multiple-precision entries provides a coalesced access pattern to the GPU global memory. We have performed a rounding error analysis and derived error bounds for the proposed GEMV implementation. Experimental results show the high efficiency of the proposed solution compared to existing high-precision packages deployed on GPUs.
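The paper's data structure and kernel decomposition are not reproduced here; the sketch below only illustrates the residue-number-system property being exploited: with digits stored in a planar (structure-of-arrays) layout, a multiple-precision element-wise product reduces to independent, carry-free modular multiplications with naturally coalesced loads. The moduli, layout, and names are assumptions, and the sign/exponent handling of a full multiple-precision format is omitted.

#define NMOD 8  // number of RNS moduli (illustrative)
__constant__ unsigned int MODULI[NMOD];  // pairwise-coprime moduli, set by
                                         // the host via cudaMemcpyToSymbol

// Element-wise product c = a .* b of multiple-precision vectors of length n.
// Planar layout: digit k of element i sits at index k * n + i, so adjacent
// threads read adjacent addresses (coalesced global-memory access).
__global__ void rns_vec_mul(int n, const unsigned int* __restrict__ a,
                            const unsigned int* __restrict__ b,
                            unsigned int* __restrict__ c) {
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    for (int k = 0; k < NMOD; ++k) {
        // Digits are independent: no carries propagate between moduli.
        const unsigned long long p =
            (unsigned long long)a[k * n + i] * b[k * n + i];
        c[k * n + i] = (unsigned int)(p % MODULI[k]);
    }
}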


Author(s): Travis J. Carrigan, Jacob Watt, Brian H. Dennis

Often thought of as tools for image rendering or data visualization, graphics processing units (GPUs) are becoming increasingly popular in scientific computing due to their low-cost, massively parallel architecture. With NVIDIA's introduction of CUDA C and CUDA-enabled GPUs, general-purpose computation no longer requires shading languages. One application that benefits from the capabilities of NVIDIA hardware is computational continuum mechanics (CCM). Solving sparse linear systems of equations is a common need in CCM when partial differential equations are discretized. Often these systems are solved iteratively using domain decomposition among distributed processors working in parallel. In this paper we explore the benefits of using GPUs to improve the performance of sparse matrix operations, more specifically sparse matrix-vector multiplication. Our approach does not require domain decomposition, so it is simpler than corresponding implementations for distributed-memory parallel computers. We demonstrate that for matrices produced from finite element discretizations on unstructured meshes, the matrix-vector multiplication runs just under 13 times faster than the serial version on an Intel i5 system. Furthermore, we show that when used in conjunction with the biconjugate gradient stabilized method (BiCGSTAB), a gradient-based iterative linear solver, the method is over 13 times faster than the serially executed C equivalent. Lastly, we apply the method to solving Poisson's equation with the Galerkin finite element method and demonstrate over 10.5 times higher performance on the GPU compared with the Intel i5 system.
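The authors' kernels are not given in the abstract; as a minimal illustration of the SpMV building block that dominates each BiCGSTAB iteration (the method performs two matrix-vector products per iteration), here is a thread-per-row CSR kernel. The names are illustrative and the paper's actual implementation may differ.

// Minimal thread-per-row CSR SpMV (y = A*x): the operation BiCGSTAB
// invokes twice per iteration. Illustrative sketch, not the paper's code.
__global__ void csr_scalar_spmv(int n_rows, const int* __restrict__ row_ptr,
                                const int* __restrict__ cols,
                                const double* __restrict__ vals,
                                const double* __restrict__ x,
                                double* __restrict__ y) {
    const int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n_rows) return;
    double sum = 0.0;
    for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
        sum += vals[j] * x[cols[j]];  // gather from x is uncoalesced
    y[row] = sum;
}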

