Design Patterns for Sparse-Matrix Computations on Hybrid CPU/GPU Platforms

2014 ◽  
Vol 22 (1) ◽  
pp. 1-19 ◽  
Author(s):  
Valeria Cardellini ◽  
Salvatore Filippone ◽  
Damian W.I. Rouson

We apply object-oriented software design patterns to develop code for scientific software involving sparse matrices. Design patterns arise when multiple independent developments produce similar designs that converge onto a generic solution. We demonstrate how to use design patterns to implement an interface for sparse matrix computations on NVIDIA GPUs, starting from PSBLAS, an existing sparse matrix library, and from existing sets of GPU kernels for sparse matrices. We also compare the throughput of the PSBLAS sparse matrix-vector multiplication on two GPU-equipped platforms with that of a CPU-only PSBLAS implementation. Our double-precision experiments show encouraging results, with speedups of up to 35.35 on an NVIDIA GTX 285 relative to an AMD Athlon 7750, and up to 10.15 on an NVIDIA Tesla C2050 relative to an Intel Xeon X5650.
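To make the design-pattern idea concrete, here is a minimal CUDA C++ sketch of a State-style pattern for a sparse matrix whose storage backend can be swapped between CPU and GPU without changing the calling code. All class and function names are illustrative assumptions, not PSBLAS's actual API (PSBLAS realizes this pattern in Fortran 2003).

```cuda
#include <cuda_runtime.h>

// GPU CSR SpMV kernel: one thread per row.
__global__ void csr_spmv(int n, const int* rowptr, const int* colind,
                         const double* val, const double* x, double* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double t = 0.0;
    for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
        t += val[k] * x[colind[k]];
    y[i] = t;
}

// Abstract storage "state": every backend must provide y = A*x.
struct Backend {
    virtual void spmv(int n, const double* x, double* y) const = 0;
    virtual ~Backend() {}
};

// Host-side CSR backend.
struct HostCsr : Backend {
    const int* rowptr; const int* colind; const double* val;
    void spmv(int n, const double* x, double* y) const override {
        for (int i = 0; i < n; ++i) {
            double t = 0.0;
            for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
                t += val[k] * x[colind[k]];
            y[i] = t;
        }
    }
};

// Device-side CSR backend; all pointers refer to GPU memory.
struct DeviceCsr : Backend {
    const int* d_rowptr; const int* d_colind; const double* d_val;
    void spmv(int n, const double* d_x, double* d_y) const override {
        int threads = 256, blocks = (n + threads - 1) / threads;
        csr_spmv<<<blocks, threads>>>(n, d_rowptr, d_colind, d_val, d_x, d_y);
    }
};

// Client-facing matrix: swapping `state` moves the computation between
// CPU and GPU while client code keeps calling the same spmv().
struct SparseMatrix {
    int n;
    Backend* state;
    void spmv(const double* x, double* y) const { state->spmv(n, x, y); }
};
```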

Author(s):  
Vikalp Mishra ◽  
Krishnan Suresh

A serious computational bottleneck in finite element analysis today is the solution of the underlying system of equations. To alleviate this problem, researchers have proposed the use of graphics processing units (GPUs) for fast iterative solution of such equations. Indeed, researchers have shown that a GPU implementation of double-precision sparse matrix-vector multiplication (which underlies all iterative methods) is approximately an order of magnitude faster than an optimized CPU implementation. Unfortunately, fast matrix-vector multiplication alone is insufficient; a good preconditioner is necessary for rapid convergence. Furthermore, most modern preconditioners, such as incomplete Cholesky, are expensive to compute and cannot be easily ported to the GPU. In this paper, we propose a special class of preconditioners for the analysis of thin structures, such as beams and plates. The proposed preconditioners are developed by combining the multigrid method with a recently developed dual-representation method for thin structures. It is shown that these preconditioners are computationally inexpensive, perform better than standard preconditioners, and can be easily ported to the GPU.
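To show where the preconditioner enters an iterative solver, here is a host-side sketch of preconditioned conjugate gradient (PCG). This is a generic skeleton, not the paper's multigrid/dual-representation preconditioner; `apply_A` and `apply_Minv` are stand-in callbacks for the (possibly GPU-side) SpMV and the preconditioner solve.

```cuda
#include <cmath>
#include <cstddef>
#include <vector>

using Op = void (*)(const std::vector<double>&, std::vector<double>&);

static double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Solves A x = b with preconditioner M; returns the iteration count.
int pcg(Op apply_A, Op apply_Minv, const std::vector<double>& b,
        std::vector<double>& x, double tol, int maxit)
{
    std::size_t n = b.size();
    std::vector<double> r(b), z(n), p(n), q(n);
    apply_A(x, q);                                   // r = b - A x0
    for (std::size_t i = 0; i < n; ++i) r[i] -= q[i];
    apply_Minv(r, z);                                // z = M^{-1} r
    p = z;
    double rz = dot(r, z);
    for (int it = 0; it < maxit; ++it) {
        if (std::sqrt(dot(r, r)) < tol) return it;
        apply_A(p, q);
        double alpha = rz / dot(p, q);
        for (std::size_t i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
        apply_Minv(r, z);                            // preconditioner application
        double rz_new = dot(r, z);
        double beta = rz_new / rz;
        rz = rz_new;
        for (std::size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
    }
    return maxit;
}
```

A cheap, GPU-friendly `apply_Minv`, which is what the paper's preconditioners provide, cuts the iteration count without dominating the per-iteration cost.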


2016 ◽  
Vol 2016 ◽  
pp. 1-12 ◽  
Author(s):  
Guixia He ◽  
Jiaquan Gao

Sparse matrix-vector multiplication (SpMV) is an important operation in scientific computations. Compressed sparse row (CSR) is the most frequently used format for storing sparse matrices. However, CSR-based SpMV kernels on graphics processing units (GPUs), such as CSR-scalar and CSR-vector, usually perform poorly due to irregular memory access patterns. This motivates us to propose a perfect CSR-based SpMV on the GPU, called PCSR. PCSR involves two kernels and accesses the CSR arrays in a fully coalesced manner by introducing a middle array, which greatly alleviates the deficiencies of CSR-scalar (rare coalescing) and CSR-vector (partial coalescing). Test results on a single C2050 GPU show that PCSR outperforms CSR-scalar, CSR-vector, and the CSRMV and HYBMV routines in the vendor-tuned CUSPARSE library, and is comparable with a recently proposed CSR-based algorithm, CSR-Adaptive. Furthermore, we extend PCSR from a single GPU to multiple GPUs. Experimental results on four C2050 GPUs show that, whether or not inter-GPU communication is taken into account, PCSR on multiple GPUs achieves good performance and high parallel efficiency.
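For reference, here is a CUDA sketch of the two baseline kernels PCSR improves on. CSR-scalar assigns one thread per row, so threads within a warp walk unrelated rows and loads of `colind`/`val` are rarely coalesced; CSR-vector assigns one 32-thread warp per row, so consecutive lanes touch consecutive nonzeros. PCSR's own middle-array scheme is not reproduced here, and the warp-shuffle reduction assumes a modern GPU (the C2050 of the experiments predates `__shfl_down_sync`).

```cuda
// CSR-scalar: one thread per row; per-thread access is stride-1, but
// across a warp the addresses are scattered (rare coalescing).
__global__ void csr_scalar(int n, const int* rowptr, const int* colind,
                           const double* val, const double* x, double* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;
    double t = 0.0;
    for (int k = rowptr[row]; k < rowptr[row + 1]; ++k)
        t += val[k] * x[colind[k]];
    y[row] = t;
}

// CSR-vector: one warp per row; lanes cover a row's nonzeros with
// stride 32, giving coalesced loads within each row segment.
__global__ void csr_vector(int n, const int* rowptr, const int* colind,
                           const double* val, const double* x, double* y)
{
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane = threadIdx.x & 31;
    if (warp >= n) return;
    double t = 0.0;
    for (int k = rowptr[warp] + lane; k < rowptr[warp + 1]; k += 32)
        t += val[k] * x[colind[k]];
    for (int off = 16; off > 0; off >>= 1)   // warp-level sum reduction
        t += __shfl_down_sync(0xffffffff, t, off);
    if (lane == 0) y[warp] = t;
}
```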


Author(s):  
Rob H. Bisseling

This chapter introduces irregular algorithms and presents the example of parallel sparse matrix-vector multiplication (SpMV), which is the central operation in iterative linear system solvers. The irregular sparsity pattern of the matrix does not change during the multiplication, which may be repeated many times. This justifies putting a lot of effort into finding a good data distribution. The Mondriaan distribution of a sparse matrix is a useful non-Cartesian distribution that can be found by hypergraph-based partitioning. The Mondriaan package implements such a partitioning and also the newer medium-grain partitioning method. The chapter analyses the special cases of random sparse matrices and Laplacian matrices. It uses performance profiles and geometric means to compare different partitioning methods. Furthermore, it presents the hybrid-BSP model and a hybrid-BSP SpMV, which are aimed at hybrid distributed/shared-memory architectures. The parallel SpMV can be incorporated in applications, ranging from PageRank computation to artificial neural networks.
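As a structural illustration of the parallel SpMV just described, the following serial sketch walks through its BSP supersteps: fan-out of needed vector components, local multiply, then fan-in and summation of partial results. The data layout and names are illustrative assumptions; a real program would replace the in-memory copies with BSPlib or MPI communication and synchronization.

```cuda
#include <map>
#include <vector>

// What one processor owns under a (e.g. Mondriaan-style) distribution.
struct LocalPart {
    std::vector<int> rows;           // global indices of owned rows
    std::vector<int> rowptr, colind; // local CSR; column indices are global
    std::vector<double> val;
    std::vector<int> needed_cols;    // x components to fetch in the fan-out
};

// y must be zero-initialized by the caller.
void bsp_spmv(const std::vector<LocalPart>& parts,
              const std::vector<double>& x, std::vector<double>& y)
{
    std::vector<std::map<int, double>> partial(parts.size());
    for (std::size_t p = 0; p < parts.size(); ++p) {
        // Superstep 0: fan-out -- gather the x entries this part needs.
        std::map<int, double> xloc;
        for (int j : parts[p].needed_cols) xloc[j] = x[j];
        // Superstep 1: local multiply into partial sums per global row.
        for (std::size_t i = 0; i < parts[p].rows.size(); ++i)
            for (int k = parts[p].rowptr[i]; k < parts[p].rowptr[i + 1]; ++k)
                partial[p][parts[p].rows[i]] +=
                    parts[p].val[k] * xloc[parts[p].colind[k]];
    }
    // Supersteps 2 and 3: fan-in and summation of partials into y.
    for (auto& contrib : partial)
        for (auto& [row, v] : contrib) y[row] += v;
}
```

A good partitioning shrinks `needed_cols` and the number of rows with contributions from several parts, which is exactly the communication volume the Mondriaan and medium-grain methods minimize.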


1993 ◽  
Vol 04 (01) ◽  
pp. 65-83 ◽  
Author(s):  
SERGE PETITON ◽  
YOUCEF SAAD ◽  
KESHENG WU ◽  
WILLIAM FERNG

This paper presents a preliminary experimental study of the performance of basic sparse matrix computations on the CM-5. We concentrate on examining various ways of performing general sparse matrix-vector operations and the basic primitives on which they are based. We compare various data structures for storing sparse matrices and their corresponding matrix-vector operations. Both SPMD and data-parallel modes are examined and compared.
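Two classic storage layouts of the kind such studies compare are sketched below (the paper's exact set of formats is not reproduced here). CSR is compact but has variable row lengths; ELLPACK pads every row to the same length, trading memory for the regular access that vector and data-parallel machines favor.

```cuda
// Compressed sparse row: row i occupies [rowptr[i], rowptr[i+1]).
struct Csr {
    int n, nnz;
    int*    rowptr;   // n+1 entries
    int*    colind;   // nnz column indices
    double* val;      // nnz values
};

// ELLPACK: every row padded to max_nnz_per_row entries; padding slots
// hold 0.0 and a dummy column index, so inner loops have fixed length.
struct Ellpack {
    int n, max_nnz_per_row;
    int*    colind;   // n * max_nnz_per_row entries
    double* val;      // same shape as colind
};
```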


1998 ◽  
Vol. 2 ◽
Author(s):  
Giovanni Manzini

In this paper we consider the problem of computing on a local memory machine the product y = Ax, where A is a random n×n sparse matrix with Θ(n) nonzero elements. To study the average-case communication cost of this problem, we introduce four different probability measures on the set of sparse matrices. We prove that on most local memory machines with p processors, this computation requires Ω((n/p) log p) time on average. We prove that the same lower bound also holds, in the worst case, for matrices with only 2n or 3n nonzero elements.


2021 ◽  
Vol 53 (10) ◽  
Author(s):  
Michael Haider ◽  
Michael Riesch ◽  
Christian Jirauschek

Efforts in providing high-quality scientific software are hardly rewarded, as scientific output is typically measured in terms of publications in high-ranking journals. As a result, scientific software is often developed without proper documentation and without the support of modern software design patterns. Ready-to-use project skeletons can be employed to accelerate the development process while at the same time taking care of the implementation of best practices in software engineering. In this work, we revisit best practices in software engineering and review existing project skeletons. Special emphasis is given to the realization of best practices. Finally, we present a new project skeleton for scientific writing in LaTeX, which takes care of the attainment of best practices, adapted for use in academic publications.


2021 ◽  
Vol 2021 ◽  
pp. 1-17
Author(s):  
Wenpeng Ma ◽  
Yiwen Hu ◽  
Wu Yuan ◽  
Xiazhen Liu

Solving triangular systems is a building block of the preconditioned GMRES algorithm. Inexact preconditioning is attractive because of its high degree of parallelism on accelerators. In this paper, we propose and implement an iterative, inexact block triangular solve on multiple GPUs based on PETSc's framework. In addition, by developing a distributed block sparse matrix-vector multiplication procedure and optimizing the vector operations, we form a multi-GPU-enabled preconditioned GMRES with the block Jacobi preconditioner. The implementation employs the GPU-Direct technique to avoid host-device memory copies. Preconditioning steps based on PETSc's data structures and on the cuSPARSE library are also investigated for performance comparison. Experiments show that the developed GMRES with inexact preconditioning on 8 GPUs achieves up to a 4.4x speedup over the CPU-only implementation with exact preconditioning using 8 MPI processes.
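The idea behind an iterative, inexact triangular solve can be sketched in a few lines of CUDA: instead of the inherently sequential exact substitution, run a few Jacobi sweeps x_{k+1} = D^{-1}(b - (L - D)x_k), each of which is fully parallel over rows. The names and the fixed sweep count below are illustrative assumptions, not PETSc's actual implementation.

```cuda
#include <cuda_runtime.h>

// One Jacobi sweep for a lower-triangular CSR system L x = b.
__global__ void jacobi_sweep_lower(int n, const int* rowptr, const int* colind,
                                   const double* val, const double* b,
                                   const double* x_old, double* x_new)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double s = b[i], diag = 1.0;
    for (int k = rowptr[i]; k < rowptr[i + 1]; ++k) {
        int j = colind[k];
        if (j == i) diag = val[k];            // diagonal entry of L
        else        s  -= val[k] * x_old[j];  // strictly lower part
    }
    x_new[i] = s / diag;
}

// Host driver: a fixed, small number of sweeps yields an inexact solve,
// often sufficient inside a preconditioner. Returns the device buffer
// holding the final iterate (d_x or d_tmp, depending on sweep parity).
double* inexact_trsv(int n, const int* d_rowptr, const int* d_colind,
                     const double* d_val, const double* d_b,
                     double* d_x, double* d_tmp, int sweeps)
{
    int threads = 256, blocks = (n + threads - 1) / threads;
    cudaMemset(d_x, 0, n * sizeof(double));   // x_0 = 0
    for (int s = 0; s < sweeps; ++s) {
        jacobi_sweep_lower<<<blocks, threads>>>(n, d_rowptr, d_colind,
                                                d_val, d_b, d_x, d_tmp);
        double* t = d_x; d_x = d_tmp; d_tmp = t;   // ping-pong buffers
    }
    return d_x;                                    // latest iterate
}
```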


2010 ◽  
Vol 8 ◽  
pp. 289-294 ◽  
Author(s):  
C.-C. Sun ◽  
J. Götze ◽  
H.-Y. Jheng ◽  
S.-J. Ruan

In this paper, we present an approach to performing matrix-vector multiplication using a Network-on-Chip (NoC) architecture. In traditional IC design, on-chip communication is realized with dedicated point-to-point interconnections, so regular local data transfer is the central concept of many parallel implementations. However, in the parallel implementation of sparse matrix-vector multiplication (SMVM), which is the main step of all iterative algorithms for solving systems of linear equations, the required data transfers depend on the sparsity structure of the matrix and can be extremely irregular. The NoC architecture makes it possible to handle arbitrary data-transfer structures, i.e., the irregular structure of sparse matrices. So far, we have implemented the proposed SMVM-NoC architecture in sizes 4×4 and 5×5 with IEEE 754 single-precision floating-point arithmetic on an FPGA.

