SPARSE COMPUTATION WITH PEI

1999 ◽  
Vol 10 (04) ◽  
pp. 425-442 ◽  
Author(s):  
FRÉDÉRIQUE VOISIN ◽  
GUY-RENÉ PERRIN

The PEI formalism was designed for reasoning about and developing parallel programs in the context of data parallelism. In this paper, we focus on the use of PEI to transform a program operating on dense matrices into a new program operating on sparse matrices, using the matrix-vector product as an example.
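
For readers unfamiliar with the transformation target, a minimal sketch (in Python with SciPy, not in PEI notation) contrasting the dense and sparse forms of the matrix-vector product:

    # A minimal sketch, assuming SciPy's CSR format; PEI's own notation is not shown.
    import numpy as np
    from scipy.sparse import csr_matrix

    A_dense = np.array([[4.0, 0.0, 0.0],
                        [0.0, 0.0, 2.0],
                        [1.0, 0.0, 3.0]])
    x = np.array([1.0, 2.0, 3.0])

    y_dense = A_dense @ x           # visits every entry, zeros included
    A_sparse = csr_matrix(A_dense)  # stores only the four nonzeros
    y_sparse = A_sparse @ x         # iterates over nonzeros only

    assert np.allclose(y_dense, y_sparse)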

2010 ◽  
Vol 09 (05) ◽  
pp. 825-846 ◽  
Author(s):  
WENWU CHEN ◽  
BILL POIRIER

The eigenvalue/eigenvector and linear-solve problems arising in computational quantum dynamics applications (e.g. rovibrational spectroscopy, reaction cross-sections) often involve large sparse matrices that exhibit a certain block structure. In such cases, specialized iterative methods that employ optimal separable basis (OSB) preconditioners (derived from a block Jacobi diagonalization procedure) have been found to be very efficient at reducing the required CPU effort on serial computing platforms. Recently [1, 2], a parallel implementation was introduced, based on a nonstandard domain decomposition scheme. Near-perfect parallel scalability was observed for the OSB preconditioner construction routines up to hundreds of nodes; however, the fundamental matrix–vector product operation itself was found not to scale well in general. In addition, the number of nodes had to be chosen selectively to ensure perfect load balancing. In this paper, two essential improvements are discussed: (1) a new algorithm for the matrix–vector product operation with greatly improved parallel scalability, and (2) a generalization to arbitrary numbers of nodes and basis sizes. These improvements render the resultant parallel quantum dynamics codes suitable for robust application to a wide range of real molecular problems running on massively parallel computing architectures.
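
As a rough illustration of the block Jacobi idea behind OSB preconditioning, here is a hedged Python/SciPy sketch (not the authors' parallel implementation; the matrix, block size, and solver are invented for the example): the diagonal blocks are inverted once and then reused as a preconditioner at every iteration.

    import numpy as np
    from scipy.sparse import identity, random as sprandom
    from scipy.sparse.linalg import LinearOperator, gmres

    n, b = 120, 12  # illustrative problem and block sizes
    rng = np.random.default_rng(0)
    A = (sprandom(n, n, density=0.05, random_state=rng) + 4.0 * identity(n)).tocsr()

    # Invert each diagonal block once, then reuse the inverses at every iteration.
    block_invs = [np.linalg.inv(A[i:i + b, i:i + b].toarray()) for i in range(0, n, b)]

    def apply_preconditioner(v):
        out = np.empty_like(v)
        for k, B_inv in enumerate(block_invs):
            out[k * b:(k + 1) * b] = B_inv @ v[k * b:(k + 1) * b]
        return out

    M = LinearOperator((n, n), matvec=apply_preconditioner)
    x, info = gmres(A, np.ones(n), M=M)
    assert info == 0  # converged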


Author(s):  
Rob H. Bisseling

This chapter introduces irregular algorithms and presents the example of parallel sparse matrix-vector multiplication (SpMV), which is the central operation in iterative linear system solvers. The irregular sparsity pattern of the matrix does not change during the multiplication, which may be repeated many times. This justifies putting a lot of effort into finding a good data distribution. The Mondriaan distribution of a sparse matrix is a useful non-Cartesian distribution that can be found by hypergraph-based partitioning. The Mondriaan package implements such a partitioning and also the newer medium-grain partitioning method. The chapter analyses the special cases of random sparse matrices and Laplacian matrices. It uses performance profiles and geometric means to compare different partitioning methods. Furthermore, it presents the hybrid-BSP model and a hybrid-BSP SpMV, which are aimed at hybrid distributed/shared-memory architectures. The parallel SpMV can be incorporated in applications, ranging from PageRank computation to artificial neural networks.
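
The economics of the approach can be seen in miniature below: a hedged Python/SciPy sketch in which one fixed sparsity pattern is reused across many multiplications (a plain power iteration stands in for the iterative solver; the Mondriaan partitioning itself is not reproduced).

    import numpy as np
    from scipy.sparse import random as sprandom

    rng = np.random.default_rng(1)
    A = sprandom(500, 500, density=0.01, random_state=rng).tocsr()

    # The one-time partitioning/setup cost would be paid here, then amortized below.
    x = np.ones(A.shape[1])
    for _ in range(50):            # many SpMVs over one unchanging pattern
        x = A @ x
        x /= np.linalg.norm(x)     # power iteration toward the dominant eigenvector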


1995 ◽  
Vol 05 (02) ◽  
pp. 263-274 ◽  
Author(s):  
MARK A. STALZER

Presented is a parallel algorithm based on the fast multipole method (FMM) for the Helmholtz equation. This variant of the FMM is useful for computing radar cross sections and antenna radiation patterns. The FMM decomposes the impedance matrix into sparse components, reducing the operation count of the matrix-vector multiplication in iterative solvers to O(N^(3/2)) (where N is the number of unknowns). The parallel algorithm divides the problem into groups and assigns the computation involved with each group to a processor node. Careful consideration is given to the communications costs. A time complexity analysis of the algorithm is presented and compared with empirical results from a Paragon XP/S running the lightweight Sandia/University of New Mexico operating system (SUNMOS). For a 90,000 unknown problem running on 60 nodes, the sparse representation fits in memory and the algorithm computes the matrix-vector product in 1.26 seconds. It sustains an aggregate rate of 1.4 Gflop/s. The corresponding dense matrix would occupy over 100 Gbytes and, assuming that I/O is free, would require on the order of 50 seconds to form the matrix-vector product.
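
The decomposition at the heart of the method can be caricatured as follows (a schematic Python sketch with invented dimensions and rank, not the Helmholtz FMM itself): the dense impedance matrix Z is replaced by a sparse near-field part plus a compressed far-field part, so the matvec cost drops accordingly.

    import numpy as np
    from scipy.sparse import csr_matrix

    rng = np.random.default_rng(2)
    n, r = 400, 8                                   # r: assumed far-field rank
    band = np.triu(np.tril(rng.normal(size=(n, n)), 5), -5)
    Z_near = csr_matrix(band)                       # sparse near-field interactions
    U = rng.normal(size=(n, r))                     # compressed far field: Z_far ~ U @ V
    V = rng.normal(size=(r, n))

    x = rng.normal(size=n)
    y = Z_near @ x + U @ (V @ x)                    # O(nnz + n*r), not O(n^2)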


2009 ◽  
Vol 42 (6) ◽  
pp. 1020-1029 ◽  
Author(s):  
Boris V. Strokopytov

A novel algorithm is described for multiplying a normal equation matrix by an arbitrary real vector using the fast Fourier transform technique during anisotropic crystallographic refinement. The matrix–vector algorithm allows one to solve normal matrix equations using the conjugate-gradients or conjugate-directions technique without explicit calculation of a normal matrix. The anisotropic version of the algorithm has been implemented in a new version of the computer program FMLSQ. The updated program has been tested on several protein structures at high resolution. In addition, rapid methods for preconditioner and normal matrix–vector product calculations are described.
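
A hedged sketch of the matrix-free idea (Python/SciPy; the FFT-based product of the paper is replaced by an explicit sparse design matrix, and the damping term is an invented stand-in for regularization): conjugate gradients only ever asks for products (A^T A)v, so the normal matrix itself is never formed.

    import numpy as np
    from scipy.sparse import random as sprandom
    from scipy.sparse.linalg import LinearOperator, cg

    rng = np.random.default_rng(3)
    A = sprandom(300, 200, density=0.05, random_state=rng).tocsr()
    b = rng.normal(size=300)

    def normal_matvec(v):
        # The paper evaluates this product via FFTs; here it is explicit.
        return A.T @ (A @ v) + 0.1 * v   # damping keeps the operator SPD

    N = LinearOperator((200, 200), matvec=normal_matvec)
    x, info = cg(N, A.T @ b)             # solves (A^T A + 0.1 I) x = A^T b
    assert info == 0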


Author(s):  
Maria Barreda ◽  
Manuel F Dolz ◽  
M Asunción Castaño

Modeling the performance and energy consumption of the sparse matrix-vector product (SpMV) is essential for off-line analysis, for example, to choose a target computer architecture that delivers the best performance/energy-consumption ratio. However, this task is especially complex given the memory-bound nature and irregular memory accesses of the SpMV, mainly dictated by the input sparse matrix. In this paper, we propose a Machine Learning (ML)-driven approach that leverages Convolutional Neural Networks (CNNs) to provide accurate estimates of the performance and energy consumption of the SpMV kernel. The proposed CNN-based models use a blockwise approach to make the CNN architecture independent of the matrix size. These models are trained to estimate execution time as well as total, package, and DRAM energy consumption at different processor frequencies. The experimental results reveal that the overall relative error ranges between 0.5% and 14%, while at the matrix level it does not exceed 10%. To demonstrate the applicability and accuracy of the SpMV CNN-based models, the study is complemented with an ad hoc time-energy model for the PageRank algorithm, a popular web information retrieval algorithm used by search engines that internally relies on the SpMV kernel.
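
The blockwise trick can be sketched as follows (a hedged Python example; the 32x32 grid and the feature definition are illustrative assumptions, not the authors' exact pipeline): any matrix is reduced to a fixed-size map of per-block nonzero densities, which gives the CNN an input shape independent of the matrix dimensions.

    import numpy as np
    from scipy.sparse import random as sprandom

    def block_density_image(A, grid=32):
        """Map a sparse matrix to a grid x grid image of per-block nonzero densities."""
        A = A.tocoo()
        img = np.zeros((grid, grid))
        rows = A.row * grid // A.shape[0]
        cols = A.col * grid // A.shape[1]
        np.add.at(img, (rows, cols), 1.0)
        return img / max(A.nnz, 1)

    A = sprandom(1000, 1500, density=0.001, random_state=0)
    features = block_density_image(A)    # fixed 32 x 32 CNN input, any matrix size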


2018 ◽  
Vol 4 ◽  
pp. e151 ◽  
Author(s):  
Bérenger Bramas ◽  
Pavel Kus

The sparse matrix-vector product (SpMV) is a fundamental operation in many scientific applications from various fields, so the High Performance Computing (HPC) community has continuously invested considerable effort in providing efficient SpMV kernels for modern CPU architectures. Although block-based kernels have been shown to achieve high performance, they are difficult to use in practice because of the zero padding they require. In this paper, we propose new kernels using the AVX-512 instruction set, which make it possible to use a blocking scheme without any zero padding in the matrix storage. We describe mask-based sparse matrix formats and their corresponding SpMV kernels, highly optimized in assembly language. Because the optimal block size depends on the matrix, we also provide a method to predict the best kernel to use, based on a simple interpolation of results from previous executions. We compare the performance of our approach against the Intel MKL CSR kernel and the CSR5 open-source package on a set of standard benchmark matrices, and show significant improvements in many cases, for both sequential and parallel executions. Finally, we provide the corresponding code in an open-source library called SPC5.
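
The storage idea can be emulated in plain Python (a hedged sketch; the real SPC5 kernels operate on AVX-512 registers in assembly, and the block layout below is a simplified stand-in): each block records a start column, a bitmask of occupied lanes, and only the nonzero values, so no zeros are ever padded in.

    import numpy as np

    def spmv_masked_blocks(blocks, x, n_rows):
        """blocks: list of (row, col_start, mask, values) over 8-wide column blocks."""
        y = np.zeros(n_rows)
        for row, col_start, mask, values in blocks:
            k = 0
            for lane in range(8):              # one bit per lane of the block
                if mask & (1 << lane):         # lane holds a nonzero value
                    y[row] += values[k] * x[col_start + lane]
                    k += 1
        return y

    # Row 0 has nonzeros at columns 0 and 3; row 1 has one at column 9.
    blocks = [(0, 0, 0b00001001, [4.0, 2.0]),
              (1, 8, 0b00000010, [5.0])]
    x = np.arange(16, dtype=float)
    print(spmv_masked_blocks(blocks, x, n_rows=2))   # [ 6. 45.]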


Author(s):  
Ernesto Dufrechou ◽  
Pablo Ezzatti ◽  
Enrique S Quintana-Ortí

More than ten years of research on efficient GPU routines for the sparse matrix-vector product (SpMV) have led to several realizations, each with its own strengths and weaknesses. In this work, we review some of the most relevant efforts on the subject, evaluate a few prominent publicly available routines using more than 3000 matrices from different applications, and apply machine learning techniques to anticipate which SpMV realization will perform best for each sparse matrix on a given parallel platform. Our numerical experiments confirm that these methods behave so differently depending on the matrix structure that identifying general rules to select the optimal method for a given matrix is extremely difficult, although some useful heuristics can be defined. Using a machine learning approach, we show that it is possible to obtain inexpensive classifiers that predict the best method for a given sparse matrix with over 80% accuracy, demonstrating that this approach can deliver important reductions in both execution time and energy consumption.
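
A hedged sketch of the selection step (Python with scikit-learn; the features, labels, and model choice are illustrative assumptions rather than the study's configuration): cheap structural features of each matrix feed a small classifier that names the routine expected to run fastest.

    import numpy as np
    from scipy.sparse import random as sprandom
    from sklearn.tree import DecisionTreeClassifier

    def matrix_features(A):
        """Cheap structural features of a CSR matrix: size, density, row statistics."""
        row_nnz = np.diff(A.indptr)
        return [A.shape[0], A.nnz / (A.shape[0] * A.shape[1]),
                row_nnz.mean(), row_nnz.std(), row_nnz.max()]

    # Placeholder training set: in practice, one feature row per benchmark matrix
    # and a label giving the routine measured fastest (e.g. 0=CSR, 1=CSR5, 2=blocked).
    rng = np.random.default_rng(4)
    X_train = rng.normal(size=(100, 5))
    y_train = rng.integers(0, 3, size=100)

    clf = DecisionTreeClassifier(max_depth=5).fit(X_train, y_train)
    A_new = sprandom(500, 500, density=0.01, random_state=5).tocsr()
    best = clf.predict([matrix_features(A_new)])[0]   # predicted fastest routine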

