scholarly journals Acceleration of Sparse Matrix-Vector Multiplication by Region Traversal

10.14311/1029 ◽  
2008 ◽  
Vol 48 (4) ◽  
Author(s):  
I. Šimeček

Sparse matrix-vector multiplication (shortly SpM×V) is one of most common subroutines in numerical linear algebra. The problem is that the memory access patterns during SpM×V are irregular, and utilization of the cache can suffer from low spatial or temporal locality. Approaches to improve the performance of SpM×V are based on matrix reordering and register blocking. These matrix transformations are designed to handle randomly occurring dense blocks in a sparse matrix. The efficiency of these transformations depends strongly on the presence of suitable blocks. The overhead of reorganization of a matrix from one format to another is often of the order of tens of executions ofSpM×V. For this reason, such a reorganization pays off only if the same matrix A is multiplied by multiple different vectors, e.g., in iterative linear solvers.This paper introduces an unusual approach to accelerate SpM×V. This approach can be combined with other acceleration approaches andconsists of three steps:1) dividing matrix A into non-empty regions,2) choosing an efficient way to traverse these regions (in other words, choosing an efficient ordering of partial multiplications),3) choosing the optimal type of storage for each region.All these three steps are tightly coupled. The first step divides the whole matrix into smaller parts (regions) that can fit in the cache. The second step improves the locality during multiplication due to better utilization of distant references. The last step maximizes the machine computation performance of the partial multiplication for each region.In this paper, we describe aspects of these 3 steps in more detail (including fast and time-inexpensive algorithms for all steps). Ourmeasurements prove that our approach gives a significant speedup for almost all matrices arising from various technical areas.

10.14311/826 ◽  
2006 ◽  
Vol 46 (3) ◽  
Author(s):  
I. Šimeček

Sparse matrix-vector multiplication (shortly SpM×V) is an important building block in algorithms solving sparse systems of linear equations, e.g., FEM. Due to matrix sparsity, the memory access patterns are irregular and utilization of the cache can suffer from low spatial or temporal locality. Approaches to improve the performance of SpM×V are based on matrix reordering and register blocking [1, 2], sometimes combined with software-pipelining [3]. Due to its overhead, register blocking achieves good speedups only for a large number of executions of SpM×V with the same matrix A.We have investigated the impact of two simple SW transformation techniques (software-pipelining and loop unrolling) on the performance of SpM×V, and have compared it with several implementation modifications aimed at reducing computational and memory complexity and improving the spatial locality. We investigate performance gains of these modifications on four CPU platforms.


2016 ◽  
Vol 2016 ◽  
pp. 1-12 ◽  
Author(s):  
Guixia He ◽  
Jiaquan Gao

Sparse matrix-vector multiplication (SpMV) is an important operation in scientific computations. Compressed sparse row (CSR) is the most frequently used format to store sparse matrices. However, CSR-based SpMVs on graphic processing units (GPUs), for example, CSR-scalar and CSR-vector, usually have poor performance due to irregular memory access patterns. This motivates us to propose a perfect CSR-based SpMV on the GPU that is called PCSR. PCSR involves two kernels and accesses CSR arrays in a fully coalesced manner by introducing a middle array, which greatly alleviates the deficiencies of CSR-scalar (rare coalescing) and CSR-vector (partial coalescing). Test results on a single C2050 GPU show that PCSR fully outperforms CSR-scalar, CSR-vector, and CSRMV and HYBMV in the vendor-tuned CUSPARSE library and is comparable with a most recently proposed CSR-based algorithm, CSR-Adaptive. Furthermore, we extend PCSR on a single GPU to multiple GPUs. Experimental results on four C2050 GPUs show that no matter whether the communication between GPUs is considered or not PCSR on multiple GPUs achieves good performance and has high parallel efficiency.


2017 ◽  
Vol 43 (4) ◽  
pp. 1-49 ◽  
Author(s):  
Salvatore Filippone ◽  
Valeria Cardellini ◽  
Davide Barbieri ◽  
Alessandro Fanfarillo

Sign in / Sign up

Export Citation Format

Share Document