Design Patterns for Sparse-Matrix Computations on Hybrid CPU/GPU Platforms

2014 ◽  
Vol 22 (1) ◽  
pp. 1-19 ◽  
Author(s):  
Valeria Cardellini ◽  
Salvatore Filippone ◽  
Damian W.I. Rouson

We apply object-oriented software design patterns to develop code for scientific software involving sparse matrices. Design patterns arise when multiple independent developments produce similar designs that converge onto a generic solution. We demonstrate how to use design patterns to implement an interface for sparse matrix computations on NVIDIA GPUs, starting from PSBLAS, an existing sparse matrix library, and from existing sets of GPU kernels for sparse matrices. We also compare the throughput of the PSBLAS sparse matrix-vector multiplication on two GPU-equipped platforms with that of a CPU-only PSBLAS implementation. Our double-precision experiments show encouraging results, with speedups of up to 35.35 on an NVIDIA GTX 285 relative to an AMD Athlon 7750, and up to 10.15 on an NVIDIA Tesla C2050 relative to an Intel Xeon X5650.
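To make the design-pattern idea concrete, here is a minimal CUDA C++ sketch of a State-style pattern for a sparse matrix whose storage backend can be swapped between CPU and GPU without changing the calling code. All class and function names are illustrative assumptions, not PSBLAS's actual API (PSBLAS realizes this pattern in Fortran 2003).

```cuda
#include <cuda_runtime.h>

// GPU CSR SpMV kernel: one thread per row.
__global__ void csr_spmv(int n, const int* rowptr, const int* colind,
                         const double* val, const double* x, double* y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double t = 0.0;
    for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
        t += val[k] * x[colind[k]];
    y[i] = t;
}

// Abstract storage "state": every backend must provide y = A*x.
struct Backend {
    virtual void spmv(int n, const double* x, double* y) const = 0;
    virtual ~Backend() {}
};

// Host-side CSR backend.
struct HostCsr : Backend {
    const int* rowptr; const int* colind; const double* val;
    void spmv(int n, const double* x, double* y) const override {
        for (int i = 0; i < n; ++i) {
            double t = 0.0;
            for (int k = rowptr[i]; k < rowptr[i + 1]; ++k)
                t += val[k] * x[colind[k]];
            y[i] = t;
        }
    }
};

// Device-side CSR backend; all pointers refer to GPU memory.
struct DeviceCsr : Backend {
    const int* d_rowptr; const int* d_colind; const double* d_val;
    void spmv(int n, const double* d_x, double* d_y) const override {
        int threads = 256, blocks = (n + threads - 1) / threads;
        csr_spmv<<<blocks, threads>>>(n, d_rowptr, d_colind, d_val, d_x, d_y);
    }
};

// Client-facing matrix: swapping `state` moves the computation between
// CPU and GPU while client code keeps calling the same spmv().
struct SparseMatrix {
    int n;
    Backend* state;
    void spmv(const double* x, double* y) const { state->spmv(n, x, y); }
};
```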

Author(s):  
Vikalp Mishra ◽  
Krishnan Suresh

A serious computational bottleneck in finite element analysis today is the solution of the underlying system of equations. To alleviate this problem, researchers have proposed the use of graphics processing units (GPUs) for fast iterative solution of such equations. Indeed, researchers have shown that a GPU implementation of double-precision sparse matrix-vector multiplication (which underlies all iterative methods) is approximately an order of magnitude faster than an optimized CPU implementation. Unfortunately, fast matrix-vector multiplication alone is insufficient; a good preconditioner is necessary for rapid convergence. Furthermore, most modern preconditioners, such as incomplete Cholesky, are expensive to compute and cannot be easily ported to the GPU. In this paper, we propose a special class of preconditioners for the analysis of thin structures, such as beams and plates. The proposed preconditioners are developed by combining the multigrid method with a recently developed dual-representation method for thin structures. It is shown that these preconditioners are computationally inexpensive, perform better than standard preconditioners, and can be easily ported to the GPU.
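To show where the preconditioner enters an iterative solver, here is a host-side sketch of preconditioned conjugate gradient (PCG). This is a generic skeleton, not the paper's multigrid/dual-representation preconditioner; `apply_A` and `apply_Minv` are stand-in callbacks for the (possibly GPU-side) SpMV and the preconditioner solve.

```cuda
#include <cmath>
#include <cstddef>
#include <vector>

using Op = void (*)(const std::vector<double>&, std::vector<double>&);

static double dot(const std::vector<double>& a, const std::vector<double>& b) {
    double s = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) s += a[i] * b[i];
    return s;
}

// Solves A x = b with preconditioner M; returns the iteration count.
int pcg(Op apply_A, Op apply_Minv, const std::vector<double>& b,
        std::vector<double>& x, double tol, int maxit)
{
    std::size_t n = b.size();
    std::vector<double> r(b), z(n), p(n), q(n);
    apply_A(x, q);                                   // r = b - A x0
    for (std::size_t i = 0; i < n; ++i) r[i] -= q[i];
    apply_Minv(r, z);                                // z = M^{-1} r
    p = z;
    double rz = dot(r, z);
    for (int it = 0; it < maxit; ++it) {
        if (std::sqrt(dot(r, r)) < tol) return it;
        apply_A(p, q);
        double alpha = rz / dot(p, q);
        for (std::size_t i = 0; i < n; ++i) { x[i] += alpha * p[i]; r[i] -= alpha * q[i]; }
        apply_Minv(r, z);                            // preconditioner application
        double rz_new = dot(r, z);
        double beta = rz_new / rz;
        rz = rz_new;
        for (std::size_t i = 0; i < n; ++i) p[i] = z[i] + beta * p[i];
    }
    return maxit;
}
```

A cheap, GPU-friendly `apply_Minv`, which is what the paper's preconditioners provide, cuts the iteration count without dominating the per-iteration cost.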


2016 ◽  
Vol 2016 ◽  
pp. 1-12 ◽  
Author(s):  
Guixia He ◽  
Jiaquan Gao

Sparse matrix-vector multiplication (SpMV) is an important operation in scientific computations. Compressed sparse row (CSR) is the most frequently used format for storing sparse matrices. However, CSR-based SpMV kernels on graphics processing units (GPUs), such as CSR-scalar and CSR-vector, usually perform poorly due to irregular memory access patterns. This motivates us to propose a perfect CSR-based SpMV on the GPU, called PCSR. PCSR involves two kernels and accesses the CSR arrays in a fully coalesced manner by introducing a middle array, which greatly alleviates the deficiencies of CSR-scalar (rare coalescing) and CSR-vector (partial coalescing). Test results on a single C2050 GPU show that PCSR outperforms CSR-scalar, CSR-vector, and the CSRMV and HYBMV routines in the vendor-tuned CUSPARSE library, and is comparable with a recently proposed CSR-based algorithm, CSR-Adaptive. Furthermore, we extend PCSR from a single GPU to multiple GPUs. Experimental results on four C2050 GPUs show that, whether or not inter-GPU communication is taken into account, PCSR on multiple GPUs achieves good performance and high parallel efficiency.
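For reference, here is a CUDA sketch of the two baseline kernels PCSR improves on. CSR-scalar assigns one thread per row, so threads within a warp walk unrelated rows and loads of `colind`/`val` are rarely coalesced; CSR-vector assigns one 32-thread warp per row, so consecutive lanes touch consecutive nonzeros. PCSR's own middle-array scheme is not reproduced here, and the warp-shuffle reduction assumes a modern GPU (the C2050 of the experiments predates `__shfl_down_sync`).

```cuda
// CSR-scalar: one thread per row; per-thread access is stride-1, but
// across a warp the addresses are scattered (rare coalescing).
__global__ void csr_scalar(int n, const int* rowptr, const int* colind,
                           const double* val, const double* x, double* y)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= n) return;
    double t = 0.0;
    for (int k = rowptr[row]; k < rowptr[row + 1]; ++k)
        t += val[k] * x[colind[k]];
    y[row] = t;
}

// CSR-vector: one warp per row; lanes cover a row's nonzeros with
// stride 32, giving coalesced loads within each row segment.
__global__ void csr_vector(int n, const int* rowptr, const int* colind,
                           const double* val, const double* x, double* y)
{
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane = threadIdx.x & 31;
    if (warp >= n) return;
    double t = 0.0;
    for (int k = rowptr[warp] + lane; k < rowptr[warp + 1]; k += 32)
        t += val[k] * x[colind[k]];
    for (int off = 16; off > 0; off >>= 1)   // warp-level sum reduction
        t += __shfl_down_sync(0xffffffff, t, off);
    if (lane == 0) y[warp] = t;
}
```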


Author(s):  
Rob H. Bisseling

This chapter introduces irregular algorithms and presents the example of parallel sparse matrix-vector multiplication (SpMV), which is the central operation in iterative linear system solvers. The irregular sparsity pattern of the matrix does not change during the multiplication, which may be repeated many times. This justifies putting a lot of effort into finding a good data distribution. The Mondriaan distribution of a sparse matrix is a useful non-Cartesian distribution that can be found by hypergraph-based partitioning. The Mondriaan package implements such a partitioning and also the newer medium-grain partitioning method. The chapter analyses the special cases of random sparse matrices and Laplacian matrices. It uses performance profiles and geometric means to compare different partitioning methods. Furthermore, it presents the hybrid-BSP model and a hybrid-BSP SpMV, which are aimed at hybrid distributed/shared-memory architectures. The parallel SpMV can be incorporated in applications, ranging from PageRank computation to artificial neural networks.
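As a structural illustration of the parallel SpMV just described, the following serial sketch walks through its BSP supersteps: fan-out of needed vector components, local multiply, then fan-in and summation of partial results. The data layout and names are illustrative assumptions; a real program would replace the in-memory copies with BSPlib or MPI communication and synchronization.

```cuda
#include <map>
#include <vector>

// What one processor owns under a (e.g. Mondriaan-style) distribution.
struct LocalPart {
    std::vector<int> rows;           // global indices of owned rows
    std::vector<int> rowptr, colind; // local CSR; column indices are global
    std::vector<double> val;
    std::vector<int> needed_cols;    // x components to fetch in the fan-out
};

// y must be zero-initialized by the caller.
void bsp_spmv(const std::vector<LocalPart>& parts,
              const std::vector<double>& x, std::vector<double>& y)
{
    std::vector<std::map<int, double>> partial(parts.size());
    for (std::size_t p = 0; p < parts.size(); ++p) {
        // Superstep 0: fan-out -- gather the x entries this part needs.
        std::map<int, double> xloc;
        for (int j : parts[p].needed_cols) xloc[j] = x[j];
        // Superstep 1: local multiply into partial sums per global row.
        for (std::size_t i = 0; i < parts[p].rows.size(); ++i)
            for (int k = parts[p].rowptr[i]; k < parts[p].rowptr[i + 1]; ++k)
                partial[p][parts[p].rows[i]] +=
                    parts[p].val[k] * xloc[parts[p].colind[k]];
    }
    // Supersteps 2 and 3: fan-in and summation of partials into y.
    for (auto& contrib : partial)
        for (auto& [row, v] : contrib) y[row] += v;
}
```

A good partitioning shrinks `needed_cols` and the number of rows with contributions from several parts, which is exactly the communication volume the Mondriaan and medium-grain methods minimize.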


1993 ◽  
Vol 04 (01) ◽  
pp. 65-83 ◽  
Author(s):  
SERGE PETITON ◽  
YOUCEF SAAD ◽  
KESHENG WU ◽  
WILLIAM FERNG

This paper presents a preliminary experimental study of the performance of basic sparse matrix computations on the CM-5. We concentrate on examining various ways of performing general sparse matrix-vector operations and the basic primitives on which they are based. We compare various data structures for storing sparse matrices and their corresponding matrix-vector operations. Both SPMD and data-parallel modes are examined and compared.
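Two classic storage layouts of the kind such studies compare are sketched below (the paper's exact set of formats is not reproduced here). CSR is compact but has variable row lengths; ELLPACK pads every row to the same length, trading memory for the regular access that vector and data-parallel machines favor.

```cuda
// Compressed sparse row: row i occupies [rowptr[i], rowptr[i+1]).
struct Csr {
    int n, nnz;
    int*    rowptr;   // n+1 entries
    int*    colind;   // nnz column indices
    double* val;      // nnz values
};

// ELLPACK: every row padded to max_nnz_per_row entries; padding slots
// hold 0.0 and a dummy column index, so inner loops have fixed length.
struct Ellpack {
    int n, max_nnz_per_row;
    int*    colind;   // n * max_nnz_per_row entries
    double* val;      // same shape as colind
};
```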


1998 ◽  
Vol. 2 ◽
Author(s):  
Giovanni Manzini

In this paper we consider the problem of computing on a local memory machine the product y = Ax, where A is a random n×n sparse matrix with Θ(n) nonzero elements. To study the average-case communication cost of this problem, we introduce four different probability measures on the set of sparse matrices. We prove that on most local memory machines with p processors, this computation requires Ω((n/p) log p) time on average. We prove that the same lower bound also holds, in the worst case, for matrices with only 2n or 3n nonzero elements.


2021 ◽  
Vol 53 (10) ◽  
Author(s):  
Michael Haider ◽  
Michael Riesch ◽  
Christian Jirauschek

Efforts in providing high-quality scientific software are hardly rewarded, as scientific output is typically measured in terms of publications in high-ranking journals. As a result, scientific software is often developed without proper documentation and without the support of modern software design patterns. Ready-to-use project skeletons can be employed to accelerate the development process while at the same time taking care of the implementation of best practices in software engineering. In this work, we revisit best practices in software engineering and review existing project skeletons. Special emphasis is given to the realization of best practices. Finally, we present a new project skeleton for scientific writing in LaTeX, which takes care of the attainment of best practices, adapted for use in academic publications.


2021 ◽  
Vol 2021 ◽  
pp. 1-17
Author(s):  
Wenpeng Ma ◽  
Yiwen Hu ◽  
Wu Yuan ◽  
Xiazhen Liu

Solving triangular systems is a building block of the preconditioned GMRES algorithm. Inexact preconditioning is attractive because of its high degree of parallelism on accelerators. In this paper, we propose and implement an iterative, inexact block triangular solve on multiple GPUs based on PETSc's framework. In addition, by developing a distributed block sparse matrix-vector multiplication procedure and optimizing the vector operations, we form a multi-GPU-enabled preconditioned GMRES with the block Jacobi preconditioner. The implementation employs the GPU-Direct technique to avoid host-device memory copies. Preconditioning steps based on PETSc's data structures and on the cuSPARSE library are also investigated for performance comparison. Experiments show that the developed GMRES with inexact preconditioning on 8 GPUs achieves up to a 4.4x speedup over the CPU-only implementation with exact preconditioning using 8 MPI processes.
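The idea behind an iterative, inexact triangular solve can be sketched in a few lines of CUDA: instead of the inherently sequential exact substitution, run a few Jacobi sweeps x_{k+1} = D^{-1}(b - (L - D)x_k), each of which is fully parallel over rows. The names and the fixed sweep count below are illustrative assumptions, not PETSc's actual implementation.

```cuda
#include <cuda_runtime.h>

// One Jacobi sweep for a lower-triangular CSR system L x = b.
__global__ void jacobi_sweep_lower(int n, const int* rowptr, const int* colind,
                                   const double* val, const double* b,
                                   const double* x_old, double* x_new)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    double s = b[i], diag = 1.0;
    for (int k = rowptr[i]; k < rowptr[i + 1]; ++k) {
        int j = colind[k];
        if (j == i) diag = val[k];            // diagonal entry of L
        else        s  -= val[k] * x_old[j];  // strictly lower part
    }
    x_new[i] = s / diag;
}

// Host driver: a fixed, small number of sweeps yields an inexact solve,
// often sufficient inside a preconditioner. Returns the device buffer
// holding the final iterate (d_x or d_tmp, depending on sweep parity).
double* inexact_trsv(int n, const int* d_rowptr, const int* d_colind,
                     const double* d_val, const double* d_b,
                     double* d_x, double* d_tmp, int sweeps)
{
    int threads = 256, blocks = (n + threads - 1) / threads;
    cudaMemset(d_x, 0, n * sizeof(double));   // x_0 = 0
    for (int s = 0; s < sweeps; ++s) {
        jacobi_sweep_lower<<<blocks, threads>>>(n, d_rowptr, d_colind,
                                                d_val, d_b, d_x, d_tmp);
        double* t = d_x; d_x = d_tmp; d_tmp = t;   // ping-pong buffers
    }
    return d_x;                                    // latest iterate
}
```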


2010 ◽  
Vol 8 ◽  
pp. 289-294 ◽  
Author(s):  
C.-C. Sun ◽  
J. Götze ◽  
H.-Y. Jheng ◽  
S.-J. Ruan

In this paper, we present an approach to performing matrix-vector multiplication using a Network-on-Chip (NoC) architecture. In traditional IC design, on-chip communication is realized with dedicated point-to-point interconnections, so regular local data transfer is the central concept of many parallel implementations. However, in the parallel implementation of sparse matrix-vector multiplication (SMVM), which is the main step of all iterative algorithms for solving systems of linear equations, the required data transfers depend on the sparsity structure of the matrix and can be extremely irregular. The NoC architecture makes it possible to handle arbitrary data-transfer structures, i.e., the irregular structure of sparse matrices. So far, we have implemented the proposed SMVM-NoC architecture in sizes 4×4 and 5×5 with IEEE 754 single-precision floating-point arithmetic on an FPGA.

