On the Memory Wall and Performance of Symmetric Sparse Matrix Vector Multiplications In Different Data Structures on Shared Memory Machines

Author(s):  
Tongxiang Gu ◽  
Xingping Liu ◽  
Zeyao Mo ◽  
Xiaowen Xu ◽  
Shengxin Zhu
1993 ◽  
Vol 04 (01) ◽  
pp. 65-83 ◽  
Author(s):  
SERGE PETITON ◽  
YOUCEF SAAD ◽  
KESHENG WU ◽  
WILLIAM FERNG

This paper presents a preliminary experimental study of the performance of basic sparse matrix computations on the CM-5. We concentrate on examining various ways of performing general sparse matrix-vector operations and the basic primitives on which they are based. We compare various data structures for storing sparse matrices and their corresponding matrix-vector operations. Both SPMD and data-parallel modes are examined and compared.
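
The abstract does not reproduce the formats themselves; as a rough illustration of what such a comparison involves, the sketch below shows two storage schemes commonly examined in studies of this kind, compressed sparse row (CSR) and ELLPACK, together with their serial matrix-vector products. The type and function names, and the fixed-width ELLPACK padding, are assumptions for illustration, not taken from the paper.

/* Hypothetical sketch: serial SpMV in two common sparse formats (CSR, ELLPACK).
   Names and layouts are illustrative, not taken from the paper. */

/* CSR: row_ptr[i] .. row_ptr[i+1] index the nonzeros of row i. */
typedef struct {
    int     n;        /* number of rows */
    int    *row_ptr;  /* length n+1     */
    int    *col_idx;  /* length nnz     */
    double *val;      /* length nnz     */
} csr_t;

void spmv_csr(const csr_t *A, const double *x, double *y) {
    for (int i = 0; i < A->n; ++i) {
        double s = 0.0;
        for (int k = A->row_ptr[i]; k < A->row_ptr[i + 1]; ++k)
            s += A->val[k] * x[A->col_idx[k]];
        y[i] = s;
    }
}

/* ELLPACK: every row padded to a fixed width; column-major storage gives
   unit-stride access on vector/SIMD hardware. Padded entries carry value 0
   and point at column 0, so they are harmless. */
typedef struct {
    int     n, width; /* rows, padded nonzeros per row */
    int    *col_idx;  /* n * width entries, column-major */
    double *val;      /* n * width entries, column-major */
} ell_t;

void spmv_ell(const ell_t *A, const double *x, double *y) {
    for (int i = 0; i < A->n; ++i) y[i] = 0.0;
    for (int j = 0; j < A->width; ++j)
        for (int i = 0; i < A->n; ++i)
            y[i] += A->val[j * A->n + i] * x[A->col_idx[j * A->n + i]];
}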


Author(s):  
Hartwig Anzt ◽  
Moritz Kreutzer ◽  
Eduardo Ponce ◽  
Gregory D Peterson ◽  
Gerhard Wellein ◽  
...  

In this paper, we present an optimized GPU implementation of the induced dimension reduction (IDR) algorithm. We improve data locality, combine it with an efficient sparse matrix-vector kernel, and investigate the potential of overlapping computation with communication as well as the possibility of concurrent kernel execution. A comprehensive performance evaluation is conducted using a suitable performance model. The analysis reveals an efficiency of up to 90%, which indicates that the implementation achieves performance close to the theoretically attainable bound.
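
The paper's kernels are not shown in the abstract; the CUDA sketch below only illustrates the general mechanism under investigation, namely issuing independent kernels and transfers into separate streams so they may overlap or run concurrently. The axpy placeholder kernel and all names are hypothetical, not the IDR(s) kernels from the paper.

// Hypothetical sketch of overlap via CUDA streams; the kernel is a
// placeholder, not one of the IDR(s) kernels from the paper.
#include <cuda_runtime.h>

__global__ void axpy(int n, double a, const double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] += a * x[i];
}

void overlapped_step(int n, double *d_x, double *d_y, double *d_z,
                     double *h_buf /* pinned host buffer */) {
    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    int threads = 256, blocks = (n + threads - 1) / threads;

    // Independent work items go into different streams so the runtime may
    // execute the kernels concurrently and overlap the copy with stream s1.
    axpy<<<blocks, threads, 0, s0>>>(n, 2.0, d_x, d_y);
    axpy<<<blocks, threads, 0, s1>>>(n, 3.0, d_x, d_z);
    cudaMemcpyAsync(h_buf, d_y, n * sizeof(double),
                    cudaMemcpyDeviceToHost, s0);

    cudaStreamSynchronize(s0);
    cudaStreamSynchronize(s1);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}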


2014 ◽  
Vol 11 (supp01) ◽  
pp. 1344007 ◽  
Author(s):  
ABUL MUKID MOHAMMAD MUKADDES ◽  
MASAO OGINO ◽  
RYUJI SHIOYA

The use of proper data structures with corresponding algorithms is critical to achieving good performance in scientific computing. The need for sparse matrix-vector multiplication in each iteration of the iterative domain decomposition method has led to the implementation of a variety of sparse matrix storage formats. Many storage formats have been proposed to represent sparse matrices and have been integrated into the method. In this paper, the storage efficiency of these sparse matrix storage formats is evaluated and compared. The performance of sparse matrix-vector multiplication as used in the domain decomposition method is also considered. Based on our experiments on the FX10 supercomputer system, some useful conclusions are extracted that can serve as guidelines for the optimization of the domain decomposition method.
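
The abstract does not list the formats or their measured footprints; as a back-of-envelope illustration of what a storage-efficiency comparison measures, the sketch below estimates the memory required by three standard formats (COO, CSR, ELLPACK), assuming 4-byte indices and 8-byte values. The formulas are the standard ones; the example sizes are invented, not the paper's data.

// Hypothetical sketch: memory footprint of three common sparse formats for a
// matrix with n rows, nnz nonzeros, and max_row_nnz nonzeros in its fullest
// row. Assumes 4-byte indices and 8-byte values.
#include <cstdio>
#include <cstddef>

size_t bytes_coo(size_t nnz)                   { return nnz * (4 + 4 + 8); }         // row idx + col idx + value
size_t bytes_csr(size_t n, size_t nnz)         { return (n + 1) * 4 + nnz * (4 + 8); } // row pointers + (col idx, value)
size_t bytes_ell(size_t n, size_t max_row_nnz) { return n * max_row_nnz * (4 + 8); }   // padded (col idx, value)

int main() {
    size_t n = 100000, nnz = 700000, max_row_nnz = 27;  // invented example sizes
    std::printf("COO %zu  CSR %zu  ELL %zu bytes\n",
                bytes_coo(nnz), bytes_csr(n, nnz), bytes_ell(n, max_row_nnz));
    return 0;
}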


1999 ◽  
Vol 7 (3-4) ◽  
pp. 313-326 ◽  
Author(s):  
Jan F. Prins ◽  
Siddhartha Chatterjee ◽  
Martin Simons

Modern dialects of Fortran enjoy wide use and good support on high‐performance computers as performance‐oriented programming languages. By providing the ability to express nested data parallelism, modern Fortran dialects enable irregular computations to be incorporated into existing applications with minimal rewriting and without sacrificing performance within the regular portions of the application. Since performance of nested data‐parallel computation is unpredictable and often poor using current compilers, we investigate threading and flattening, two source‐to‐source transformation techniques that can improve performance and performance stability. For experimental validation of these techniques, we explore nested data‐parallel implementations of the sparse matrix‐vector product and the Barnes–Hut n‐body algorithm by hand‐coding thread‐based (using OpenMP directives) and flattening‐based versions of these algorithms and evaluating their performance on an SGI Origin 2000 and an NEC SX‐4, two shared‐memory machines.
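
The hand-coded versions in the paper are written in Fortran with OpenMP directives; the C++ sketch below is only a rendering of the thread-based ("threading") approach for the sparse matrix-vector product, with the outer row loop parallelized by a directive. The names and the scheduling clause are illustrative choices, not the authors'.

// Hypothetical C++ rendering of the "threading" approach: the outer (row)
// loop of the nested data-parallel SpMV is parallelized with an OpenMP
// directive. Compile with OpenMP enabled (e.g. -fopenmp).
#include <vector>

void spmv_threaded(int n,
                   const std::vector<int>    &row_ptr,   // n+1 entries
                   const std::vector<int>    &col_idx,
                   const std::vector<double> &val,
                   const std::vector<double> &x,
                   std::vector<double>       &y) {
    // Irregular inner-loop lengths make static scheduling load-imbalanced,
    // which is one source of the performance instability studied here.
    #pragma omp parallel for schedule(dynamic, 64)
    for (int i = 0; i < n; ++i) {
        double s = 0.0;
        for (int k = row_ptr[i]; k < row_ptr[i + 1]; ++k)
            s += val[k] * x[col_idx[k]];
        y[i] = s;
    }
}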


2016 ◽  
Vol 2016 ◽  
pp. 1-14 ◽  
Author(s):  
Jiaquan Gao ◽  
Panpan Qi ◽  
Guixia He

Sparse matrix-vector multiplication (SpMV) is an important operation in computational science and needs to be accelerated because it often represents the dominant cost in many widely used iterative methods and eigenvalue problems. We achieve this objective by proposing a novel SpMV algorithm based on the compressed sparse row (CSR) format on the GPU. Our method dynamically assigns different numbers of rows to each thread block and executes different optimized implementations depending on the number of rows assigned to each block. Accesses to the CSR arrays are fully coalesced, and the GPU's DRAM bandwidth is efficiently utilized by loading data into shared memory, which alleviates the bottleneck of many existing CSR-based algorithms (i.e., CSR-scalar and CSR-vector). Test results on C2050 and K20c GPUs show that our method outperforms a perfect-CSR algorithm that inspired our work, the vendor-tuned CUSPARSE V6.5 and CUSP V0.5.1 libraries, and three popular algorithms: clSpMV, CSR5, and CSR-Adaptive.
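
The proposed kernel itself is not given in the abstract; for context, the CUDA sketches below show minimal versions of the two baselines it names, CSR-scalar (one thread per row) and CSR-vector (one warp per row), which make the coalescing bottleneck visible. These are textbook formulations, not the authors' code; the CSR-vector kernel assumes blockDim.x is a multiple of 32.

// Hypothetical sketches of the two CSR baselines named in the abstract,
// not the authors' proposed kernel.

// CSR-scalar: one thread per row. Neighbouring threads read distant parts of
// col_idx/val, so accesses to the CSR arrays are poorly coalesced.
__global__ void spmv_csr_scalar(int n, const int *row_ptr, const int *col_idx,
                                const double *val, const double *x, double *y) {
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n) {
        double s = 0.0;
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
            s += val[k] * x[col_idx[k]];
        y[row] = s;
    }
}

// CSR-vector: one 32-thread warp per row with a warp-level reduction.
// Accesses within a row are coalesced, but short rows leave lanes idle.
__global__ void spmv_csr_vector(int n, const int *row_ptr, const int *col_idx,
                                const double *val, const double *x, double *y) {
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane = threadIdx.x & 31;
    if (warp < n) {
        double s = 0.0;
        for (int k = row_ptr[warp] + lane; k < row_ptr[warp + 1]; k += 32)
            s += val[k] * x[col_idx[k]];
        for (int off = 16; off > 0; off >>= 1)   // warp-level sum reduction
            s += __shfl_down_sync(0xffffffff, s, off);
        if (lane == 0) y[warp] = s;
    }
}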

