Compression and load balancing for efficient sparse matrix‐vector product on multicore processors and graphics processing units

SURAA: A Novel Method and Tool for Loadbalanced and Coalesced SpMV Computations on GPUs

Applied Sciences ◽

10.3390/app9050947 ◽

2019 ◽

Vol 9 (5) ◽

pp. 947 ◽

Cited By ~ 9

Author(s):

Thaha Muhammed ◽

Rashid Mehmood ◽

Aiiad Albeshri ◽

Iyad Katib

Keyword(s):

Load Balancing ◽

Graphics Processing Units ◽

Sparse Matrix ◽

Memory Access ◽

Group Matrix ◽

The Matrix ◽

Novel Method ◽

Coalesced Memory ◽

Graphics Processing ◽

Matrix Vector

Sparse matrix-vector (SpMV) multiplication is a vital building block for numerous scientific and engineering applications. This paper proposes SURAA (which translates to "speed" in Arabic), a novel method for SpMV computations on graphics processing units (GPUs). The novelty lies in the way we group matrix rows into different segments and adaptively schedule the segments to different types of kernels. The sparse matrix data structure is created by sorting the rows of the matrix by the number of nonzero elements per row (npr) and forming segments of equal size (containing approximately an equal number of nonzero elements per row) using the Freedman–Diaconis rule. The segments are assembled into three groups based on the mean npr of the segments. For each group, we use multiple kernels to execute the group segments on different streams. Hence, the number of threads to execute each segment is adaptively chosen. Dynamic Parallelism, available in Nvidia GPUs, is utilized to execute the group containing segments with the largest mean npr, providing improved load balancing and coalesced memory access, and hence more efficient SpMV computations on GPUs. Therefore, SURAA minimizes the adverse effects of the npr variance by uniformly distributing the load using equal-sized segments. We implement the SURAA method as a tool and compare its performance with the de facto best commercial (cuSPARSE) and open source (CUSP, MAGMA) tools using widely used benchmarks comprising 26 high-npr-variance matrices from 13 diverse domains. SURAA outperforms the other tools by delivering 13.99x speedup on average. We believe that our approach provides a fundamental shift in addressing SpMV-related challenges on GPUs, including coalesced memory access, thread divergence, and load balancing, and is set to open new avenues for further improving SpMV performance in the future.
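
The segmentation step described in this abstract can be illustrated with a minimal CPU-side sketch. It is not the authors' implementation: it assumes the matrix is given as a CSR row-pointer array, and it assumes that the Freedman–Diaconis rule is applied to the npr values to obtain a bin width, 2 * IQR / n^(1/3), with each occupied bin becoming one segment. The function names `segment_rows_by_npr` and `mean_npr` are hypothetical, and the grouping into three groups and the GPU kernel scheduling are left out.

```python
from statistics import quantiles


def segment_rows_by_npr(row_ptr):
    """Sort rows by nonzeros per row (npr) and bin them into segments.

    Assumption: the bin width comes from the Freedman-Diaconis rule,
    width = 2 * IQR(npr) / n**(1/3); each non-empty bin is one segment.
    Needs at least two rows (a requirement of statistics.quantiles).
    """
    n = len(row_ptr) - 1
    npr = [row_ptr[i + 1] - row_ptr[i] for i in range(n)]

    # Row indices ordered by increasing npr.
    order = sorted(range(n), key=lambda i: npr[i])

    # Quartiles of npr; "inclusive" treats the data as the full population.
    q1, _, q3 = quantiles(npr, n=4, method="inclusive")
    width = 2.0 * (q3 - q1) / (n ** (1.0 / 3.0))
    if width <= 0:
        width = 1.0  # degenerate case: IQR is zero, fall back to unit bins

    lowest = npr[order[0]]
    bins = {}
    for i in order:
        b = int((npr[i] - lowest) // width)
        bins.setdefault(b, []).append(i)

    # Segments are returned in order of increasing npr.
    return [bins[b] for b in sorted(bins)]


def mean_npr(row_ptr, segment):
    """Mean number of nonzeros per row over the rows of one segment."""
    return sum(row_ptr[i + 1] - row_ptr[i] for i in segment) / len(segment)
```

Each returned segment is a list of row indices with similar npr, so a per-segment choice of threads per row (the adaptive part of the method) could be driven by `mean_npr`.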

An Effective Approach for Implementing Sparse Matrix-Vector Multiplication on Graphics Processing Units

2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems ◽

10.1109/hpcc.2012.68 ◽

2012 ◽

Cited By ~ 7

Author(s):

Walid Abu-Sufah ◽

Asma Abdel Karim

Keyword(s):

Graphics Processing Units ◽

Sparse Matrix ◽

Matrix Vector Multiplication ◽

Graphics Processing ◽

Matrix Vector

Performance Analysis of Sparse Matrix-Vector Multiplication (SpMV) on Graphics Processing Units (GPUs)

Electronics ◽

10.3390/electronics9101675 ◽

2020 ◽

Vol 9 (10) ◽

pp. 1675

Author(s):

Sarah AlAhmadi ◽

Thaha Mohammed ◽

Aiiad Albeshri ◽

Iyad Katib ◽

Rashid Mehmood

Keyword(s):

Performance Analysis ◽

Graphics Processing Units ◽

High Performance ◽

Heterogeneous Computing ◽

Performance Metrics ◽

Sparse Matrix ◽

Extensive Literature ◽

Compressed Sparse Row ◽

Graphics Processing ◽

Matrix Vector

Graphics processing units (GPUs) have delivered remarkable performance for a variety of high performance computing (HPC) applications through massive parallelism. One such application is sparse matrix-vector (SpMV) computation, which is central to many scientific, engineering, and other applications, including machine learning. No single SpMV storage or computation scheme provides consistent and sufficiently high performance for all matrices due to their varying sparsity patterns. An extensive literature review reveals that the performance of SpMV techniques on GPUs has not been studied in sufficient detail. In this paper, we provide a detailed analysis of SpMV performance on GPUs using four notable sparse matrix storage schemes (compressed sparse row (CSR), ELLPACK (ELL), hybrid ELL/COO (HYB), and compressed sparse row 5 (CSR5)), five performance metrics (execution time, giga floating point operations per second (GFLOPS), achieved occupancy, instructions per warp, and warp execution efficiency), five matrix sparsity features (nnz, anpr, nprvariance, maxnpr, and distavg), and 17 sparse matrices from 10 application domains (chemical simulations, computational fluid dynamics (CFD), electromagnetics, linear programming, economics, etc.). Subsequently, based on the deeper insights gained through the detailed performance analysis, we propose a technique called the heterogeneous CPU–GPU Hybrid (HCGHYB) scheme. It utilizes both the CPU and GPU in parallel and outperforms the HYB format with an average speedup of 1.7x. Heterogeneous computing is an important direction for SpMV and other application areas. Moreover, to the best of our knowledge, this is the first work in which SpMV performance on GPUs has been discussed in such depth. We believe that this work on SpMV performance analysis and the heterogeneous scheme will open up many new directions and improvements for the SpMV computing field in the future.
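
As a rough illustration of the CSR scheme and the sparsity features named in this abstract, the sketch below runs a sequential CSR SpMV and computes four of the five features. The abstract does not define the features, so the readings used here are assumptions: nnz as the total number of nonzeros, anpr as the average npr, nprvariance as the population variance of npr, and maxnpr as the largest npr. distavg is omitted because its definition cannot be inferred from the abstract. The function names are hypothetical.

```python
def csr_spmv(row_ptr, col_idx, values, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = [0.0] * (len(row_ptr) - 1)
    for i in range(len(y)):
        # Nonzeros of row i occupy positions row_ptr[i] .. row_ptr[i+1]-1.
        for k in range(row_ptr[i], row_ptr[i + 1]):
            y[i] += values[k] * x[col_idx[k]]
    return y


def sparsity_features(row_ptr):
    """Sparsity features derived from the CSR row pointer alone.

    The feature definitions are assumed interpretations of the names.
    """
    n = len(row_ptr) - 1
    npr = [row_ptr[i + 1] - row_ptr[i] for i in range(n)]
    nnz = row_ptr[-1]
    anpr = nnz / n
    nprvariance = sum((c - anpr) ** 2 for c in npr) / n
    return {
        "nnz": nnz,
        "anpr": anpr,
        "nprvariance": nprvariance,
        "maxnpr": max(npr),
    }


# 3x3 example: [[1, 0, 2], [0, 3, 0], [4, 5, 6]]
row_ptr = [0, 2, 3, 6]
col_idx = [0, 2, 1, 0, 1, 2]
values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = csr_spmv(row_ptr, col_idx, values, [1.0, 1.0, 1.0])
features = sparsity_features(row_ptr)
```

A large nprvariance relative to anpr signals the row-length imbalance that makes a one-thread-per-row CSR kernel diverge on a GPU.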

Load-balancing Sparse Matrix Vector Product Kernels on GPUs

ACM Transactions on Parallel Computing ◽

10.1145/3380930 ◽

2020 ◽

Vol 7 (1) ◽

pp. 1-26 ◽

Cited By ~ 4

Author(s):

Hartwig Anzt ◽

Terry Cojean ◽

Chen Yen-Chen ◽

Jack Dongarra ◽

Goran Flegar ◽

...

Keyword(s):

Load Balancing ◽

Sparse Matrix ◽

Vector Product ◽

Matrix Vector

On sparse matrix-vector product optimization

The 3rd ACS/IEEE International Conference on Computer Systems and Applications, 2005. ◽

10.1109/aiccsa.2005.1387022 ◽

2005 ◽

Cited By ~ 2

Author(s):

N. Emad ◽

O. Hamdi-Larbi ◽

Z. Mahjoub

Keyword(s):

Sparse Matrix ◽

Vector Product ◽

Product Optimization ◽

Matrix Vector

Balanced CSR Sparse Matrix-Vector Product on Graphics Processors

Lecture Notes in Computer Science - Euro-Par 2017: Parallel Processing ◽

10.1007/978-3-319-64203-1_50 ◽

2017 ◽

pp. 697-709 ◽

Cited By ~ 3

Author(s):

Goran Flegar ◽

Enrique S. Quintana-Ortí

Keyword(s):

Sparse Matrix ◽

Vector Product ◽

Graphics Processors ◽

Matrix Vector

A distributed computing approach to improve the performance of the Parallel Ocean Program (v2.1)

Geoscientific Model Development ◽

10.5194/gmd-7-267-2014 ◽

2014 ◽

Vol 7 (1) ◽

pp. 267-281 ◽

Cited By ~ 12

Author(s):

B. van Werkhoven ◽

J. Maassen ◽

M. Kliphuis ◽

H. A. Dijkstra ◽

S. E. Brunnabend ◽

...

Keyword(s):

Distributed Computing ◽

Load Balancing ◽

Ocean Circulation ◽

Graphics Processing Units ◽

Block Partitioning ◽

Model Code ◽

Graphics Processing ◽

Computing Approach ◽

Parallel Ocean Program

Abstract. The Parallel Ocean Program (POP) is used in many strongly eddying ocean circulation simulations. Ideally it would be desirable to be able to do thousand-year-long simulations, but the current performance of POP prohibits these types of simulations. In this work, using a new distributed computing approach, two methods to improve the performance of POP are presented. The first is a block-partitioning scheme for the optimization of the load balancing of POP such that it can be run efficiently in a multi-platform setting. The second is the implementation of part of the POP model code on graphics processing units (GPUs). We show that the combination of both innovations also leads to a substantial performance increase when running POP simultaneously over multiple computational platforms.

FPGA Coprocessor for Simulation of Neural Networks Using Compressed Matrix Storage

System and Circuit Design for Biologically-Inspired Intelligent Learning ◽

10.4018/978-1-60960-018-1.ch011 ◽

2011 ◽

pp. 255-275

Author(s):

Jörg Bornschein

Keyword(s):

Neural Network ◽

Neural Networks ◽

Sparse Matrix ◽

Receptive Fields ◽

The State ◽

Connectivity Matrix ◽

Vector Product ◽

Sparse Connectivity ◽

Matrix Vector ◽

Direct Implementation

An FPGA-based coprocessor has been implemented which simulates the dynamics of a large recurrent neural network composed of binary neurons. The design has been used for unsupervised learning of receptive fields. Since the number of neurons to be simulated (>10^4) exceeds the available FPGA logic capacity for direct implementation, a set of streaming processors has been designed. Given the state and activity vectors of the neurons at time t and a sparse connectivity matrix, these streaming processors calculate the state and activity vectors for time t + 1. The operation implemented by the streaming processors can be understood as a generalized form of a sparse matrix-vector product (SpMxV). The largest dataset, the sparse connectivity matrix, is stored and processed in a compressed format to better utilize the available memory bandwidth.

Sparse matrix partitioning for optimizing SpMV on CPU-GPU heterogeneous platforms

The International Journal of High Performance Computing Applications ◽

10.1177/1094342019886628 ◽

2019 ◽

Vol 34 (1) ◽

pp. 66-80 ◽

Cited By ~ 1

Author(s):

Akrem Benatia ◽

Weixing Ji ◽

Yizhuo Wang ◽

Feng Shi

Keyword(s):

Graphics Processing Units ◽

Sparse Matrix ◽

Sparse Matrices ◽

Heterogeneous Systems ◽

Input Matrix ◽

Heterogeneous Platforms ◽

Mapping Algorithm ◽

Matrix Vector Multiplication ◽

Graphics Processing ◽

Matrix Partitioning

The sparse matrix–vector multiplication (SpMV) kernel dominates the computing cost in numerous applications. Most existing studies dedicated to improving this kernel have targeted just one type of processing unit, mainly multicore CPUs or graphics processing units (GPUs), and have not explored the potential of the recent, rapidly emerging CPU-GPU heterogeneous platforms. To take full advantage of these heterogeneous systems, the input sparse matrix has to be partitioned across the available processing units. The partitioning problem is made more challenging by the existence of many sparse formats whose performance depends on both the sparsity of the input matrix and the hardware used. Thus, the best performance depends not only on how the input sparse matrix is partitioned but also on which sparse format is used for each partition. To address this challenge, we propose in this article a new CPU-GPU heterogeneous method for computing the SpMV kernel that combines different sparse formats to achieve better performance and better utilization of CPU-GPU heterogeneous platforms. The proposed solution horizontally partitions the input matrix into multiple block-rows and predicts their best sparse formats using machine learning-based performance models. A mapping algorithm is then used to assign the block-rows to the CPU and GPU(s) available in the system. Our experimental results using real-world large unstructured sparse matrices on two different machines show a noticeable performance improvement.
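
The horizontal partitioning into block-rows mentioned in this abstract can be sketched as follows. The abstract does not state how the cut points are chosen, so this sketch assumes one simple policy: contiguous block-rows holding roughly equal numbers of nonzeros, located by binary search on the CSR row pointer. The function name `partition_block_rows` is hypothetical, and the format prediction and the CPU/GPU mapping steps are not modeled.

```python
from bisect import bisect_left


def partition_block_rows(row_ptr, num_blocks):
    """Split rows into contiguous block-rows with roughly equal nnz.

    Returns a list of (first_row, end_row) half-open ranges.
    Assumption: balancing by nonzero count is the partitioning goal.
    """
    n = len(row_ptr) - 1
    nnz = row_ptr[-1]
    bounds = [0]
    for b in range(1, num_blocks):
        # First row boundary at which the running nnz reaches the target.
        target = nnz * b / num_blocks
        cut = bisect_left(row_ptr, target)
        # Keep the boundaries monotone and inside the matrix.
        cut = min(max(cut, bounds[-1]), n)
        bounds.append(cut)
    bounds.append(n)
    return [(bounds[i], bounds[i + 1]) for i in range(num_blocks)]


# Rows hold 2, 1, and 3 nonzeros: the first two rows form one block-row
# and the last row forms the other, with 3 nonzeros in each.
blocks = partition_block_rows([0, 2, 3, 6], 2)
```

Each block-row could then be converted to whichever sparse format is predicted to perform best on the device it is assigned to.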

A MEMORY EFFICIENT AND FAST SPARSE MATRIX VECTOR PRODUCT ON A GPU

Progress In Electromagnetics Research ◽

10.2528/pier11031607 ◽

2011 ◽

Vol 116 ◽

pp. 49-63 ◽

Cited By ~ 40

Author(s):

Adam Dziekonski ◽

Adam Lamecki ◽

Michal Mrozowski

Keyword(s):

Sparse Matrix ◽

Vector Product ◽

Matrix Vector ◽

Memory Efficient
