Sparse matrix partitioning for optimizing SpMV on CPU-GPU heterogeneous platforms

Author(s):  
Akrem Benatia ◽  
Weixing Ji ◽  
Yizhuo Wang ◽  
Feng Shi

The sparse matrix–vector multiplication (SpMV) kernel dominates the computing cost in numerous applications. Most existing studies dedicated to improving this kernel have targeted just one type of processing unit, mainly multicore CPUs or graphics processing units (GPUs), and have not explored the potential of the rapidly emerging CPU-GPU heterogeneous platforms. To take full advantage of these heterogeneous systems, the input sparse matrix has to be partitioned across the available processing units. The partitioning problem is made more challenging by the existence of many sparse formats whose performance depends on both the sparsity of the input matrix and the hardware used. Thus, the best performance depends not only on how the input sparse matrix is partitioned but also on which sparse format is used for each partition. To address this challenge, we propose in this article a new CPU-GPU heterogeneous method for computing the SpMV kernel that combines different sparse formats to achieve better performance and better utilization of CPU-GPU heterogeneous platforms. The proposed solution horizontally partitions the input matrix into multiple block-rows and predicts their best sparse formats using machine learning-based performance models. A mapping algorithm is then used to assign the block-rows to the CPU and GPU(s) available in the system. Our experimental results using real-world large unstructured sparse matrices on two different machines show a noticeable performance improvement.
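To make the pipeline concrete, here is a minimal Python sketch of horizontal block-row partitioning, per-block format prediction, and CPU/GPU mapping. The `predict_best_format` heuristic and the greedy `map_blocks` assignment are hypothetical stand-ins for the article's trained machine-learning models and mapping algorithm, shown only to illustrate the shape of the approach.

```python
import numpy as np
import scipy.sparse as sp

def partition_block_rows(A_csr, n_blocks):
    """Horizontally partition a CSR matrix into contiguous block-rows."""
    rows_per_block = int(np.ceil(A_csr.shape[0] / n_blocks))
    return [A_csr[i:i + rows_per_block, :]
            for i in range(0, A_csr.shape[0], rows_per_block)]

def predict_best_format(block):
    """Stand-in for the article's ML performance models: pick a format
    from simple sparsity features (a made-up heuristic, not the
    authors' trained predictor)."""
    nnz_per_row = np.diff(block.indptr)
    # Low variance in row lengths favors ELL-like formats; otherwise CSR.
    return "ELL" if nnz_per_row.std() < 0.5 * max(nnz_per_row.mean(), 1) else "CSR"

def map_blocks(blocks, devices):
    """Greedy mapping: assign each block-row to the least-loaded device,
    using nnz as a crude work estimate."""
    load = {d: 0 for d in devices}
    plan = []
    for b in blocks:
        fmt = predict_best_format(b)
        dev = min(load, key=load.get)
        load[dev] += b.nnz
        plan.append((dev, fmt, b))
    return plan

A = sp.random(10_000, 10_000, density=1e-3, format="csr")
for dev, fmt, b in map_blocks(partition_block_rows(A, 8), ["CPU", "GPU0"]):
    print(dev, fmt, b.shape, b.nnz)
```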

2016 ◽  
Vol 850 ◽  
pp. 129-135
Author(s):  
Buğra Şimşek ◽  
Nursel Akçam

This study presents a parallelization of the Hamming distance algorithm, which is used for iris comparison in iris recognition systems, for heterogeneous systems that can include central processing units (CPUs), graphics processing units (GPUs), digital signal processing (DSP) boards, field-programmable gate arrays (FPGAs), and some other mobile platforms, using OpenCL. OpenCL allows the same code to run on CPUs, GPUs, FPGAs, and DSP boards. Heterogeneous computing refers to systems that include different kinds of devices (CPUs, GPUs, FPGAs, and other accelerators), and it gains performance or reduces power consumption for suitable algorithms on these OpenCL-supported devices. In this study, the Hamming distance algorithm has been coded in C++ as sequential code and parallelized with OpenCL using a method we designed. Our OpenCL code has been executed on an Nvidia GT430 GPU and an Intel Xeon 5650 processor. The OpenCL implementation demonstrates a speedup of up to 87 times over the sequential code. Our study also differs from other studies that accelerate iris matching in that it ensures heterogeneous computing by using OpenCL.
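For reference, the data-parallel core being accelerated is simple: XOR the packed iris codes and count the set bits. Below is a minimal NumPy sketch of fractional Hamming distance matching (not the study's OpenCL kernel); each probe-gallery comparison is independent, which is exactly what an OpenCL version maps to parallel work-items.

```python
import numpy as np

def hamming_distance(code_a, code_b, mask_a=None, mask_b=None):
    """Fractional Hamming distance between two packed iris codes
    (uint8 arrays of equal length). Masks, if given, exclude occluded
    bits, as is usual in iris matching."""
    diff = np.bitwise_xor(code_a, code_b)
    if mask_a is not None and mask_b is not None:
        valid = np.bitwise_and(mask_a, mask_b)
        diff = np.bitwise_and(diff, valid)
        n_valid = int(np.unpackbits(valid).sum())
    else:
        n_valid = code_a.size * 8
    return np.unpackbits(diff).sum() / max(n_valid, 1)

# One probe against a gallery: every comparison is independent, so the
# OpenCL version can assign one comparison per work-item.
rng = np.random.default_rng(0)
gallery = rng.integers(0, 256, size=(1000, 256), dtype=np.uint8)
probe = gallery[42]
scores = [hamming_distance(probe, row) for row in gallery]
print(int(np.argmin(scores)))  # -> 42 (the probe matches itself)
```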


2016 ◽  
Vol 2016 ◽  
pp. 1-12 ◽  
Author(s):  
Guixia He ◽  
Jiaquan Gao

Sparse matrix-vector multiplication (SpMV) is an important operation in scientific computations. Compressed sparse row (CSR) is the most frequently used format for storing sparse matrices. However, CSR-based SpMV implementations on graphics processing units (GPUs), for example, CSR-scalar and CSR-vector, usually have poor performance due to irregular memory access patterns. This motivates us to propose a perfect CSR-based SpMV on the GPU, called PCSR. PCSR involves two kernels and accesses the CSR arrays in a fully coalesced manner by introducing a middle array, which greatly alleviates the deficiencies of CSR-scalar (rare coalescing) and CSR-vector (partial coalescing). Test results on a single C2050 GPU show that PCSR outperforms CSR-scalar, CSR-vector, and the CSRMV and HYBMV routines in the vendor-tuned CUSPARSE library, and is comparable with a recently proposed CSR-based algorithm, CSR-Adaptive. Furthermore, we extend PCSR from a single GPU to multiple GPUs. Experimental results on four C2050 GPUs show that PCSR on multiple GPUs achieves good performance and high parallel efficiency whether or not the communication between GPUs is considered.
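The two-kernel structure can be mimicked serially to show why the middle array helps. In phase 1, the CSR `data` and `indices` arrays are streamed once, in order (the access pattern that coalesces fully on a GPU), producing one partial product per nonzero; phase 2 is a segmented per-row reduction of that middle array. A minimal NumPy sketch of this structure follows; it illustrates the idea only and is not the PCSR CUDA kernels themselves.

```python
import numpy as np
import scipy.sparse as sp

def pcsr_style_spmv(A_csr, x):
    """Serial sketch of a two-phase, middle-array SpMV (not the PCSR
    CUDA code). Phase 1 streams data/indices once, in order; phase 2
    is a segmented reduction of the middle array per row."""
    # Phase 1: one partial product per stored nonzero.
    middle = A_csr.data * x[A_csr.indices]
    # Phase 2: segmented sum; row_of[k] is the row owning nonzero k.
    row_of = np.repeat(np.arange(A_csr.shape[0]), np.diff(A_csr.indptr))
    return np.bincount(row_of, weights=middle, minlength=A_csr.shape[0])

A = sp.random(5000, 5000, density=2e-3, format="csr")
x = np.random.rand(5000)
assert np.allclose(pcsr_style_spmv(A, x), A @ x)
```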


2015 ◽  
Vol 11 (4) ◽  
Author(s):  
Patryk Orzechowski ◽  
Krzysztof Boryczko

Parallel computing architectures have been proven to significantly shorten computation time for different clustering algorithms. Nonetheless, some characteristics of the architecture limit the application of graphics processing units (GPUs) to the biclustering task, whose function is to find focal similarities within the data. This might be one of the reasons why not many biclustering algorithms have been proposed so far. In this article, we verify whether there is any potential for applying heterogeneous (CPU+GPU) architectures to complex biclustering calculations. We introduce minimax with Pearson correlation, a complex biclustering method. The algorithm utilizes Pearson's correlation to determine similarity between rows of the input matrix. We present two implementations of the algorithm, sequential and parallel, which are dedicated to heterogeneous environments. We verify the weak scaling efficiency to assess whether a heterogeneous architecture may successfully shorten heavy biclustering computation time.
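The similarity computation at the heart of such a method reduces to all-pairs Pearson correlation between rows, which is a single dense matrix product on standardized rows, a naturally GPU-friendly bulk operation. Here is a minimal NumPy sketch of that similarity step only; it is an illustration, not the authors' minimax biclustering algorithm.

```python
import numpy as np

def row_pearson(X):
    """All-pairs Pearson correlation between rows of X, computed as one
    matrix product on standardized rows. On a GPU this product is the
    natural bulk-parallel step."""
    Z = X - X.mean(axis=1, keepdims=True)
    Z /= np.linalg.norm(Z, axis=1, keepdims=True) + 1e-12  # guard constant rows
    return Z @ Z.T

X = np.random.rand(200, 50)
R = row_pearson(X)
# Row pairs whose correlation exceeds a threshold are candidates to be
# grouped into the same bicluster.
candidates = np.argwhere(np.triu(R, k=1) > 0.9)
print(len(candidates))
```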


2014 ◽  
Vol 519-520 ◽  
pp. 102-107
Author(s):  
Yu Fei Yu ◽  
Bin Yan ◽  
Biao Wang ◽  
Lei Li ◽  
Yu Han ◽  
...  

An acceleration strategy for the TV-ADM reconstruction algorithm in Compton scattering tomography (CST) is proposed. First, by analyzing the sparsity characteristics of CST projection matrices, the CSR and ELL sparse matrix formats are used to store them, which greatly reduces memory consumption. Then, a sparse matrix-vector multiplication (SpMV) method is utilized to accelerate the projection and back-projection processes. Finally, exploiting its parallel features, the TV-ADM is computed on a graphics processing unit (GPU). Numerical experiments show that the TV-ADM with the presented acceleration strategy achieves a 96-fold speedup and a 224-fold memory compression ratio without precision loss.
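To illustrate why the ELL format suits GPU SpMV, here is a minimal Python sketch of a CSR-to-ELLPACK conversion and the corresponding SpMV; it is a generic illustration, not the paper's CST-specific implementation. ELL pads every row to the width of the longest row, trading some memory for fully regular, GPU-friendly access.

```python
import numpy as np
import scipy.sparse as sp

def csr_to_ell(A_csr, pad_value=0.0):
    """Convert CSR to ELLPACK: fixed-width rows padded to the longest
    row's length, giving regular memory access at the cost of padding."""
    n_rows = A_csr.shape[0]
    width = int(np.diff(A_csr.indptr).max())
    vals = np.full((n_rows, width), pad_value)
    cols = np.zeros((n_rows, width), dtype=np.int64)
    for i in range(n_rows):
        lo, hi = A_csr.indptr[i], A_csr.indptr[i + 1]
        vals[i, :hi - lo] = A_csr.data[lo:hi]
        cols[i, :hi - lo] = A_csr.indices[lo:hi]
    return vals, cols

def ell_spmv(vals, cols, x):
    # Padded entries hold 0.0, so they contribute nothing to the sums.
    return (vals * x[cols]).sum(axis=1)

A = sp.random(1000, 1000, density=5e-3, format="csr")
x = np.random.rand(1000)
vals, cols = csr_to_ell(A)
assert np.allclose(ell_spmv(vals, cols, x), A @ x)
```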


2020 ◽  
Vol 11 (3) ◽  
pp. 33-59 ◽  
Author(s):  
Константин Сергеевич Исупов ◽  
Владимир Сергеевич Князьков

We consider a parallel implementation of matrix-vector multiplication (GEMV, BLAS Level 2) for graphics processing units (GPUs) using multiple-precision arithmetic based on the residue number system. In our GEMV implementation, componentwise operations on multiple-precision vectors and matrices are split into parts, each of which is executed by a separate CUDA kernel. This eliminates branching in the execution logic and allows fuller utilization of GPU resources. An efficient data structure for storing multiple-precision arrays ensures that the accesses of parallel threads to the GPU's global memory are coalesced into transactions. A rounding error analysis of the proposed GEMV implementation is performed and accuracy estimates are obtained. Experimental results are presented that show the high efficiency of the developed implementation compared to existing multiple-precision software packages for GPUs.
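A toy example of the residue number system idea underlying this approach: a large integer is carried as independent residues modulo pairwise-coprime moduli, so multiplication proceeds channel by channel with no carries between channels, and the value is recovered via the Chinese Remainder Theorem. The moduli below are illustrative, not those used in the article.

```python
from math import prod

# Pairwise-coprime moduli (Mersenne primes); their product M bounds the
# representable range.
MODULI = [2**13 - 1, 2**17 - 1, 2**19 - 1, 2**31 - 1]
M = prod(MODULI)

def to_rns(x):
    """Represent x by its residues modulo each channel's modulus."""
    return [x % m for m in MODULI]

def rns_mul(a, b):
    """Channel-wise multiplication: no carries cross channels, so each
    channel can run in a separate thread or kernel."""
    return [(ai * bi) % m for ai, bi, m in zip(a, b, MODULI)]

def from_rns(res):
    """Chinese Remainder Theorem reconstruction."""
    x = 0
    for r, m in zip(res, MODULI):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)  # modular inverse of Mi mod m
    return x % M

a, b = 123456789, 987654321
assert from_rns(rns_mul(to_rns(a), to_rns(b))) == a * b
```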


Author(s):  
Rob H. Bisseling

This chapter introduces irregular algorithms and presents the example of parallel sparse matrix-vector multiplication (SpMV), which is the central operation in iterative linear system solvers. The irregular sparsity pattern of the matrix does not change during the multiplication, which may be repeated many times. This justifies putting a lot of effort into finding a good data distribution. The Mondriaan distribution of a sparse matrix is a useful non-Cartesian distribution that can be found by hypergraph-based partitioning. The Mondriaan package implements such a partitioning and also the newer medium-grain partitioning method. The chapter analyses the special cases of random sparse matrices and Laplacian matrices. It uses performance profiles and geometric means to compare different partitioning methods. Furthermore, it presents the hybrid-BSP model and a hybrid-BSP SpMV, which are aimed at hybrid distributed/shared-memory architectures. The parallel SpMV can be incorporated in applications, ranging from PageRank computation to artificial neural networks.
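The quantity such a partitioner tries to minimize can be stated compactly: for a 1D row distribution of y = Ax, the communication volume counts, for every column j, the processors that hold a nonzero in column j but do not own x[j]. Below is a minimal Python sketch of this cost metric for a contiguous block distribution; it illustrates the objective a hypergraph partitioner minimizes, and is not the Mondriaan package itself.

```python
import numpy as np
import scipy.sparse as sp

def spmv_comm_volume(A_csr, row_owner, vec_owner):
    """Communication volume of a 1D row distribution for y = A x: each
    processor must receive x[j] once for every column j in which it
    holds a nonzero but does not own x[j]. Hypergraph partitioners
    minimize this subject to load balance."""
    A_csc = A_csr.tocsc()
    volume = 0
    for j in range(A_csc.shape[1]):
        rows = A_csc.indices[A_csc.indptr[j]:A_csc.indptr[j + 1]]
        procs_needing_xj = set(row_owner[rows])
        volume += len(procs_needing_xj - {vec_owner[j]})
    return volume

A = sp.random(2000, 2000, density=2e-3, format="csr")
p = 4
row_owner = np.arange(2000) * p // 2000   # contiguous row blocks
vec_owner = row_owner.copy()              # x distributed like the rows
print(spmv_comm_volume(A, row_owner, vec_owner))
```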


2016 ◽  
Vol 26 (04) ◽  
pp. 1640001
Author(s):  
Jiaquan Gao ◽  
Yuanshen Zhou ◽  
Kesong Wu

Accelerating sparse matrix-vector multiplication (SpMV) on graphics processing units (GPUs) has attracted considerable attention recently. We observe that on a specific multiple-GPU platform, SpMV performance can usually be greatly improved when a matrix is partitioned into several blocks according to a predetermined rule and each block is assigned to a GPU with an appropriate storage format. This motivates us to propose a novel multi-GPU parallel SpMV optimization model. Our model involves two stages. In the first stage, a simple rule is defined to divide any given matrix among multiple GPUs, and a performance model, which is independent of the problems and dependent on the resources of the devices, is proposed to accurately predict the execution time of SpMV kernels. Using these models, in the second stage we construct an optimal multi-GPU parallel SpMV algorithm that is automatically and rapidly generated for the platform for any problem. Because the performance model is problem-independent and depends only on the device resources, it is constructed only once for each type of GPU. The experiments validate the high efficiency of our proposed model.
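A minimal Python sketch of the second stage's flavor: given per-(device, format) performance models fitted once per GPU type, each block is greedily assigned the (device, format) pair that keeps the predicted load lowest. The linear models and their coefficients below are invented for illustration and are not the article's fitted models.

```python
import numpy as np
import scipy.sparse as sp

# Hypothetical per-(GPU, format) linear models of the kind fitted once
# per device type: predicted time = a*nnz + b*rows + c. The coefficients
# are made up for illustration.
MODELS = {
    ("GPU0", "CSR"): (2.0e-9, 1.0e-9, 5e-6),
    ("GPU0", "ELL"): (1.5e-9, 2.0e-9, 5e-6),
    ("GPU1", "CSR"): (2.5e-9, 1.2e-9, 5e-6),
    ("GPU1", "ELL"): (1.8e-9, 2.4e-9, 5e-6),
}

def predict(dev, fmt, nnz, rows):
    a, b, c = MODELS[(dev, fmt)]
    return a * nnz + b * rows + c

def plan_spmv(blocks, devices, formats=("CSR", "ELL")):
    """Greedy makespan heuristic: give each block to the (device, format)
    pair that keeps the running predicted per-device time lowest."""
    load = {d: 0.0 for d in devices}
    plan = []
    for blk in blocks:
        dev, fmt = min(((d, f) for d in devices for f in formats),
                       key=lambda df: load[df[0]] +
                           predict(*df, blk.nnz, blk.shape[0]))
        load[dev] += predict(dev, fmt, blk.nnz, blk.shape[0])
        plan.append((dev, fmt))
    return plan, load

A = sp.random(8000, 8000, density=1e-3, format="csr")
blocks = [A[i:i + 2000] for i in range(0, 8000, 2000)]
plan, load = plan_spmv(blocks, ["GPU0", "GPU1"])
print(plan, load)
```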

