Accelerating the general band matrix multiplication using graphics processors

2019 ◽

pp. 144-157

Author(s):

A. Myasishchev ◽

S. Lienkov ◽

V. Dzhulii ◽

I. Muliar

Keyword(s):

Linear Algebra ◽

Linear Equations ◽

Matrix Multiplication ◽

Performance Comparison ◽

Double Precision ◽

Graphics Processors ◽

Performance Study ◽

Software Module ◽

Maximum Acceleration ◽

Computational Procedures

Research goals and objectives: the purpose of the article is to study the feasibility of graphics processors using in solving linear equations systems and calculating matrix multiplication as compared with conventional multi-core processors. The peculiarities of the MAGMA and CUBLAS libraries use for various graphics processors are considered. A performance comparison is made between the Tesla C2075 and GeForce GTX 480 GPUs and a six-core AMD processor. Subject of research: the software is developed basing on the MAGMA and CUBLAS libraries for the purpose of the NVIDIA Tesla C2075 and GeForce GTX 480 GPUs performance study for linear equation systems solving and matrix multiplication calculating. Research methods used: libraries were used to parallelize the linear algebra problems solution. For GPUs, these are MAGMA and CUBLAS, for multi-core processors, the ScaLAPACK and ATLAS libraries. To study the operational speed there are used methods and algorithms of computational procedures parallelization similar to these libraries. A software module has been developed for linear equations systems solving and matrix multiplication calculating by parallel systems. Results of the research: it has been determined that for double-precision numbers the GPU GeForce GTX 480 and the GPU Tesla C2075 performance is approximately 3.5 and 6.3 times higher than that of the AMD CPU. And the GPU GeForce GTX 480 performance is 1.3 times higher than the GPU Tesla C2075 performance for single precision numbers. To achieve maximum performance of the NVIDIA CUDA GPU, you need to use the MAGMA or CUBLAS libraries, which accelerate the calculations by about 6.4 times as compared to the traditional programming method. It has been determined that in equations systems solving on a 6-core CPU, it is possible to achieve a maximum acceleration of 3.24 times as compared to calculations on the 1st core using the ScaLAPACK and ATLAS libraries instead of 6-fold theoretical acceleration. Therefore, it is impossible to efficiently use processors with a large number of cores with considered libraries. It is demonstrated that the advantage of the GPU over the CPU increases with the number of equations.

Download Full-text

VLSI systems for band matrix multiplication

Parallel Computing ◽

10.1016/0167-8191(87)90024-x ◽

1987 ◽

Vol 4 (3) ◽

pp. 239-258 ◽

Cited By ~ 4

Author(s):

Kam Hoi Cheng ◽

Sartaj Sahni

Keyword(s):

Matrix Multiplication ◽

Band Matrix

Download Full-text

VLSI systolic arrays for band matrix multiplication

Integration ◽

10.1016/s0167-9260(83)80021-2 ◽

1983 ◽

Vol 1 (2-3) ◽

pp. 233-249 ◽

Cited By ~ 3

Author(s):

G. Alia

Keyword(s):

Matrix Multiplication ◽

Systolic Arrays ◽

Band Matrix

Download Full-text

High-Performance Systolic Arrays for Band Matrix Multiplication

2005 IEEE International Symposium on Circuits and Systems ◽

10.1109/iscas.2005.1464792 ◽

2005 ◽

Cited By ~ 4

Author(s):

Yun Yang ◽

Wenqing Zhao ◽

Y. Inoue

Keyword(s):

High Performance ◽

Matrix Multiplication ◽

Systolic Arrays ◽

Band Matrix

Download Full-text

ALGORITHMIC OPTIMIZATION OF SOFTWARE IMPLEMENTATION OF ALGORITHMS FOR MULTIPLYING DENSE REAL MATRICES ON GRAPHICS PROCESSORS WITH OPENGL TECHNOLOGY SUPPORT

Proceedings of Southwest State University ◽

10.21869/2223-1560-2017-21-5-06-15 ◽

2017 ◽

Vol 21 (5) ◽

pp. 6-15

Author(s):

Y. A. Zatolokin ◽

E. I. Vatutin ◽

V. S. Titov

Keyword(s):

Matrix Multiplication ◽

Software Implementation ◽

Graphics Processors ◽

Local Memory ◽

Algorithmic Optimization ◽

Real Performance ◽

Software Implementations ◽

On Chip ◽

High Level ◽

Global And Local

In the article was given statement of a problem of matrix multiplication. Is is show that desired problem can be simpl formulated but for its solving may be required both heuristic methods and set of algorithmic modifications relating to algorithmic and high-level software optimization taking into account the particular problem and allow to increase the multiplication performance. These include: a comparative analysis of the performance of the actions performed without GPU-specific optimizations and with optimizations, which showed that computations without optimizing the work with global GPU memory have low processing performance. Optimizing data distribution in global and local memory The GPU allows you to reuse the calculation time and increase real performance. To compare the performance of the developed software implementations for OpenGL and CUDA technologies, identical calculations on identical GPUs were performed, which showed higher real performance when using CUDA cores. Specific values of generation performance measured for multi-threaded software implementation on GPU are given for all of described optimizations. It is shown that the most effective approach is based on the method we can get much more performance by technique of caching sub-blocks of the matrices (tiles) in the GPU's on-chip local memory, that with specialized software implementation is provide the performance of 275,3 GFLOP/s for GPU GeForce GTX 960M.

Download Full-text

Efficient Symmetric Band Matrix-Matrix Multiplication on GPUs

Communications in Computer and Information Science - High Performance Computing ◽

10.1007/978-3-662-45483-1_1 ◽

2014 ◽

pp. 1-12 ◽

Cited By ~ 1

Author(s):

Ernesto Dufrechou ◽

Pablo Ezzatti ◽

Enrique S. Quintana-Ortí ◽

Alfredo Remón

Keyword(s):

Matrix Multiplication ◽

Band Matrix ◽

Symmetric Band

Download Full-text

Parallel Algorithm for Quasi-Band Matrix-Matrix Multiplication

Parallel Processing and Applied Mathematics - Lecture Notes in Computer Science ◽

10.1007/978-3-319-32149-3_11 ◽

2016 ◽

pp. 106-115

Author(s):

Dharma Teja Vooturi ◽

Kishore Kothapalli

Keyword(s):

Parallel Algorithm ◽

Matrix Multiplication ◽

Band Matrix

Download Full-text

Stable and Supported Semantics in Continuous Vector Spaces

Proceedings of the Seventeenth International Conference on Principles of Knowledge Representation and Reasoning ◽

10.24963/kr.2020/7 ◽

2020 ◽

Author(s):

Yaniv Aspis ◽

Krysia Broda ◽

Alessandra Russo ◽

Jorge Lobo

Keyword(s):

Logic Program ◽

Matrix Multiplication ◽

Vector Spaces ◽

Continuous Space ◽

Continuous Vector ◽

Novel Approach ◽

Gradient Based ◽

Normal Logic ◽

Parameter Values ◽

Normal Logic Program

We introduce a novel approach for the computation of stable and supported models of normal logic programs in continuous vector spaces by a gradient-based search method. Specifically, the application of the immediate consequence operator of a program reduct can be computed in a vector space. To do this, Herbrand interpretations of a propositional program are embedded as 0-1 vectors in $\mathbb{R}^N$ and program reducts are represented as matrices in $\mathbb{R}^{N \times N}$. Using these representations we prove that the underlying semantics of a normal logic program is captured through matrix multiplication and a differentiable operation. As supported and stable models of a normal logic program can now be seen as fixed points in a continuous space, non-monotonic deduction can be performed using an optimisation process such as Newton's method. We report the results of several experiments using synthetically generated programs that demonstrate the feasibility of the approach and highlight how different parameter values can affect the behaviour of the system.

Download Full-text