Parallel forming of preconditioners based on the approximation of the Sherman-Morrison inversion formula

Acceleration of preconditioned bi-conjugate gradient stabilized (BiCGStab) methods is studied for preconditioners based on approximating the matrix inverse with the Sherman-Morrison formula. A new form of the parallel algorithm is proposed that uses matrix-vector products to form the preconditioner matrices. The efficiency of parallelizing the most resource-intensive operations of such preconditioners on multi-core central and graphics processing units (CPUs and GPUs) is demonstrated.
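For orientation, the exact Sherman-Morrison identity that such preconditioners approximate is (A + u v^T)^{-1} = A^{-1} - (A^{-1} u)(v^T A^{-1}) / (1 + v^T A^{-1} u). The sketch below applies one exact rank-one update using only matrix-vector products. It is a minimal illustration in pure Python, not the approximate or parallel algorithm of the paper; the function names are hypothetical.

```python
def mat_vec(M, x):
    """Dense matrix-vector product M @ x for a list-of-rows matrix."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def sherman_morrison_update(A_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1} (exact formula, no approximation)."""
    n = len(u)
    Ainv_u = mat_vec(A_inv, u)                      # A^{-1} u
    A_inv_T = [list(col) for col in zip(*A_inv)]    # transpose of A^{-1}
    vT_Ainv = mat_vec(A_inv_T, v)                   # (v^T A^{-1}) as a vector
    denom = 1.0 + sum(v_i * w_i for v_i, w_i in zip(v, Ainv_u))
    if abs(denom) < 1e-14:
        raise ValueError("A + u v^T is singular or nearly singular")
    return [
        [A_inv[i][j] - Ainv_u[i] * vT_Ainv[j] / denom for j in range(n)]
        for i in range(n)
    ]


# Example: A = diag(2, 4), so A^{-1} = diag(0.5, 0.25); update with u = e1, v = e2.
A_inv = [[0.5, 0.0], [0.0, 0.25]]
updated = sherman_morrison_update(A_inv, [1.0, 0.0], [0.0, 1.0])
```

Here `updated` is the inverse of [[2, 1], [0, 4]]. Both intermediate vectors come from matrix-vector products, which is the kind of operation the abstract identifies as the target for parallelization.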


SURAA: A Novel Method and Tool for Loadbalanced and Coalesced SpMV Computations on GPUs

Applied Sciences ◽

10.3390/app9050947 ◽

2019 ◽

Vol 9 (5) ◽

pp. 947 ◽

Cited By ~ 9

Author(s):

Thaha Muhammed ◽

Rashid Mehmood ◽

Aiiad Albeshri ◽

Iyad Katib

Keyword(s):

Load Balancing ◽

Graphics Processing Units ◽

Sparse Matrix ◽

Memory Access ◽

Group Matrix ◽

The Matrix ◽

Novel Method ◽

Coalesced Memory ◽

Graphics Processing ◽

Matrix Vector

Sparse matrix-vector (SpMV) multiplication is a vital building block for numerous scientific and engineering applications. This paper proposes SURAA (which translates to "speed" in Arabic), a novel method for SpMV computations on graphics processing units (GPUs). The novelty lies in the way we group matrix rows into different segments and adaptively schedule the segments to different types of kernels. The sparse matrix data structure is created by sorting the rows of the matrix by the number of nonzero elements per row (npr) and forming segments of equal size (containing approximately an equal number of nonzero elements per row) using the Freedman–Diaconis rule. The segments are assembled into three groups based on the mean npr of the segments. For each group, we use multiple kernels to execute the group segments on different streams. Hence, the number of threads used to execute each segment is chosen adaptively. Dynamic Parallelism, available in NVIDIA GPUs, is used to execute the group containing the segments with the largest mean npr, providing improved load balancing and coalesced memory access, and hence more efficient SpMV computations on GPUs. SURAA therefore minimizes the adverse effects of npr variance by distributing the load uniformly across equal-sized segments. We implement the SURAA method as a tool and compare its performance with the de facto best commercial (cuSPARSE) and open-source (CUSP, MAGMA) tools using widely used benchmarks comprising 26 high-npr-variance matrices from 13 diverse domains. SURAA outperforms the other tools, delivering a 13.99x speedup on average. We believe that our approach provides a fundamental shift in addressing SpMV-related challenges on GPUs, including coalesced memory access, thread divergence, and load balancing, and is set to open new avenues for further improving SpMV performance in the future.
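The row-grouping step can be sketched roughly as follows. This is one plain reading of the abstract, not the authors' data structure: rows are sorted by npr and binned with the Freedman–Diaconis width 2·IQR / n^(1/3). The CSR `row_ptr` input, the function name, and the fallback width are assumptions made for illustration.

```python
import statistics


def rows_to_segments(row_ptr):
    """Group CSR rows into segments of similar nonzeros-per-row (npr).

    row_ptr is a CSR row-pointer array (assumed input format).
    Returns (npr, bin_width, segments), where segments is a list of row-index lists.
    """
    npr = [row_ptr[i + 1] - row_ptr[i] for i in range(len(row_ptr) - 1)]
    order = sorted(range(len(npr)), key=lambda r: npr[r])   # rows sorted by npr

    # Freedman-Diaconis bin width: 2 * IQR / n^(1/3)
    q1, _, q3 = statistics.quantiles(npr, n=4)
    width = 2.0 * (q3 - q1) / len(npr) ** (1.0 / 3.0)
    if width <= 0:
        width = 1.0   # fallback when the IQR is zero (illustrative choice)

    lowest = npr[order[0]]
    bins = {}
    for r in order:
        bins.setdefault(int((npr[r] - lowest) // width), []).append(r)
    return npr, width, [bins[k] for k in sorted(bins)]


# Six rows with 1, 1, 2, 2, 8 and 16 nonzeros respectively.
npr, width, segments = rows_to_segments([0, 1, 2, 4, 6, 14, 30])
```

Each resulting segment holds rows with similar npr, so a kernel with a suitable thread count could be assigned per segment. The grouping into three classes and the stream scheduling described in the abstract are not modeled here.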


Compression and load balancing for efficient sparse matrix‐vector product on multicore processors and graphics processing units

Concurrency and Computation Practice and Experience ◽

10.1002/cpe.6515 ◽

2021 ◽

Author(s):

José I. Aliaga ◽

Hartwig Anzt ◽

Thomas Grützmacher ◽

Enrique S. Quintana‐Ortí ◽

Andrés E. Tomás

Keyword(s):

Load Balancing ◽

Graphics Processing Units ◽

Sparse Matrix ◽

Multicore Processors ◽

Vector Product ◽

Graphics Processing ◽

Matrix Vector


Multiple-precision matrix-vector multiplication on graphics processing units

Program systems theory and applications ◽

10.25209/2079-3316-2020-11-3-33-59 ◽

2020 ◽

Vol 11 (3) ◽

pp. 33-59 ◽

Cited By ~ 1

Author(s):

Константин Сергеевич Исупов ◽

Владимир Сергеевич Князьков

Keyword(s):

Graphics Processing Units ◽

Precision Matrix ◽

Matrix Vector Multiplication ◽

Multiple Precision ◽

Graphics Processing ◽

Matrix Vector

We consider a parallel implementation of matrix-vector multiplication (GEMV, Level 2 BLAS) for graphics processing units (GPUs) using multiple-precision arithmetic based on the residue number system. In our GEMV implementation, element-wise operations on multiple-precision vectors and matrices are split into parts, each of which is executed by a separate CUDA kernel. This eliminates branch divergence and allows fuller use of GPU resources. An efficient data structure for storing multiple-precision arrays lets accesses by parallel threads to GPU global memory be coalesced into transactions. A rounding error analysis is carried out for the proposed GEMV implementation, and accuracy bounds are obtained. Experimental results show the high efficiency of the developed implementation compared with existing multiple-precision software packages for GPUs.


A PARALLEL FAST MULTIPOLE METHOD FOR THE HELMHOLTZ EQUATION

Parallel Processing Letters ◽

10.1142/s0129626495000242 ◽

1995 ◽

Vol 05 (02) ◽

pp. 263-274 ◽

Cited By ~ 3

Author(s):

MARK A. STALZER

Keyword(s):

Helmholtz Equation ◽

Parallel Algorithm ◽

Fast Multipole Method ◽

Iterative Solvers ◽

Dense Matrix ◽

Vector Product ◽

Fast Multipole ◽

Multipole Method ◽

The Matrix ◽

Matrix Vector

Presented is a parallel algorithm based on the fast multipole method (FMM) for the Helmholtz equation. This variant of the FMM is useful for computing radar cross sections and antenna radiation patterns. The FMM decomposes the impedance matrix into sparse components, reducing the operation count of the matrix-vector multiplication in iterative solvers to O(N^{3/2}) (where N is the number of unknowns). The parallel algorithm divides the problem into groups and assigns the computation involved with each group to a processor node. Careful consideration is given to the communication costs. A time complexity analysis of the algorithm is presented and compared with empirical results from a Paragon XP/S running the lightweight Sandia/University of New Mexico operating system (SUNMOS). For a 90,000 unknown problem running on 60 nodes, the sparse representation fits in memory and the algorithm computes the matrix-vector product in 1.26 seconds. It sustains an aggregate rate of 1.4 Gflop/s. The corresponding dense matrix would occupy over 100 Gbytes and, assuming that I/O is free, would require on the order of 50 seconds to form the matrix-vector product.
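The dense-storage figure can be checked with back-of-the-envelope arithmetic. The sketch assumes 16 bytes per matrix entry (double-precision complex), which the abstract does not state, and compares the N^2 and N^{3/2} scalings with constant factors ignored.

```python
n = 90_000                 # number of unknowns, from the abstract
bytes_per_entry = 16       # assumption: double-precision complex entries

dense_entries = n * n                              # entries in the dense impedance matrix
dense_gbytes = dense_entries * bytes_per_entry / 1e9

fmm_scaling = n ** 1.5                             # O(N^{3/2}) term, constants ignored
scaling_ratio = dense_entries / fmm_scaling        # equals sqrt(N)

print(f"dense storage: {dense_gbytes:.1f} GB, N^2 / N^1.5 = {scaling_ratio:.0f}")
```

Under that assumption the dense matrix needs about 129.6 GB, consistent with the "over 100 Gbytes" in the abstract, and the asymptotic operation count drops by a factor of sqrt(N) = 300.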


Efficient parallel algorithm for multiple sequence alignments with regular expression constraints on graphics processing units

International Journal of Computational Science and Engineering ◽

10.1504/ijcse.2014.058687 ◽

2014 ◽

Vol 9 (1/2) ◽

pp. 11 ◽

Cited By ~ 7

Author(s):

Chun Yuan Lin ◽

Yu Shiang Lin

Keyword(s):

Parallel Algorithm ◽

Graphics Processing Units ◽

Regular Expression ◽

Sequence Alignments ◽

Multiple Sequence ◽

Multiple Sequence Alignments ◽

Graphics Processing


A Factored Sparse Approximate Inverse Preconditioned Conjugate Gradient Solver on Graphics Processing Units

SIAM Journal on Scientific Computing ◽

10.1137/15m1027826 ◽

2016 ◽

Vol 38 (1) ◽

pp. C53-C72 ◽

Cited By ~ 8

Author(s):

Massimo Bernaschi ◽

Mauro Bisson ◽

Carlo Fantozzi ◽

Carlo Janna

Keyword(s):

Conjugate Gradient ◽

Graphics Processing Units ◽

Preconditioned Conjugate Gradient ◽

Approximate Inverse ◽

Sparse Approximate Inverse ◽

Graphics Processing ◽

Conjugate Gradient Solver


Parallel algorithm for solving Kepler’s equation on Graphics Processing Units: Application to analysis of Doppler exoplanet searches

New Astronomy ◽

10.1016/j.newast.2008.12.001 ◽

2009 ◽

Vol 14 (4) ◽

pp. 406-412 ◽

Cited By ~ 31

Author(s):

Eric B. Ford

Keyword(s):

Parallel Algorithm ◽

Graphics Processing Units ◽

Kepler's Equation ◽

Graphics Processing


An Effective Approach for Implementing Sparse Matrix-Vector Multiplication on Graphics Processing Units

2012 IEEE 14th International Conference on High Performance Computing and Communication & 2012 IEEE 9th International Conference on Embedded Software and Systems ◽

10.1109/hpcc.2012.68 ◽

2012 ◽

Cited By ~ 7

Author(s):

Walid Abu-Sufah ◽

Asma Abdel Karim

Keyword(s):

Graphics Processing Units ◽

Sparse Matrix ◽

Matrix Vector Multiplication ◽

Graphics Processing ◽

Matrix Vector


Method of parallelization of loops for grid calculation problems on GPU accelerators

PROBLEMS IN PROGRAMMING ◽

10.15407/pp2017.01.059 ◽

2017 ◽

pp. 059-066

Author(s):

A.Yu. Doroshenko ◽

O.G. Beketov

Keyword(s):

Parallel Algorithm ◽

Graphics Processing Units ◽

Suggested Method ◽

Automated System ◽

Practical Implementation ◽

Heterogeneous Clusters ◽

Cuda Technology ◽

Nested Loops ◽

Simd Architecture ◽

Graphics Processing

A formal parallelizing transformation of a nest of computational loops is developed for SIMD-architecture devices, in particular graphics processing units using CUDA technology, and for heterogeneous clusters. The procedure for the transition from a sequential to a parallel algorithm is described and illustrated. Data serialization is applied to optimize the processing of large volumes of data. The advantage of the suggested method is its applicability to data whose volume exceeds the memory of the operating device. An experiment is conducted to demonstrate the feasibility of the proposed approach. The technique presented in the paper provides the basis for further practical implementation of an automated system for parallelizing nested loops.


Multiple-precision matrix-vector multiplication on graphics processing units

Program systems theory and applications ◽

10.25209/2079-3316-2020-11-3-61-84 ◽

2020 ◽

Vol 11 (3) ◽

pp. 61-84

Author(s):

Konstantin Isupov ◽

Vladimir Knyazkov

Keyword(s):

Graphics Processing Units ◽

High Efficiency ◽

Parallel Implementation ◽

Number System ◽

Residue Number System ◽

Global Memory ◽

Matrix Vector Multiplication ◽

Multiple Precision ◽

Graphics Processing ◽

Matrix Vector

We consider a parallel implementation of matrix-vector multiplication (GEMV, Level 2 of the BLAS) for graphics processing units (GPUs) using multiple-precision arithmetic based on the residue number system. In our GEMV implementation, element-wise operations on multiple-precision vectors and matrices consist of several parts, each of which is computed by a separate CUDA kernel. This eliminates branch divergence when performing the sequential parts of multiple-precision operations and allows full utilization of the GPU's resources. An efficient data structure for storing arrays with multiple-precision entries provides a coalesced access pattern to the GPU global memory. We have performed a rounding error analysis and derived error bounds for the proposed GEMV implementation. Experimental results show the high efficiency of the proposed solution compared with existing high-precision packages deployed on GPUs.
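To illustrate why a residue number system (RNS) suits parallel hardware, the sketch below represents integers by their residues modulo a small set of pairwise-coprime moduli, so that addition and multiplication act independently on each residue. It is an integer-only toy under assumed moduli and hypothetical function names; it does not reproduce the paper's multiple-precision floating-point format or its CUDA kernels.

```python
from math import prod

MODULI = (7, 11, 13, 15)   # illustrative pairwise-coprime moduli (assumption)


def to_rns(x):
    """Encode a non-negative integer as its residues modulo each modulus."""
    return tuple(x % m for m in MODULI)


def rns_add(a, b):
    """Residue-wise addition; each component is independent of the others."""
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))


def rns_mul(a, b):
    """Residue-wise multiplication; no carries propagate between components."""
    return tuple((x * y) % m for x, y, m in zip(a, b, MODULI))


def from_rns(r):
    """Decode residues with the Chinese remainder theorem."""
    M = prod(MODULI)
    total = 0
    for r_i, m in zip(r, MODULI):
        M_i = M // m
        total += r_i * M_i * pow(M_i, -1, m)   # modular inverse (Python 3.8+)
    return total % M


def rns_dot(row, x):
    """Dot product of two integer vectors carried out entirely in RNS."""
    acc = to_rns(0)
    for a, b in zip(row, x):
        acc = rns_add(acc, rns_mul(to_rns(a), to_rns(b)))
    return acc


result = from_rns(rns_dot([3, 4, 5], [6, 7, 8]))   # 3*6 + 4*7 + 5*8
```

The decoded value is exact as long as the true result stays below the product of the moduli (15015 here). Because every residue channel is computed without reference to the others, the work maps naturally onto independent GPU threads.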
