CPU versus GPU: which can perform matrix computation faster—performance comparison for basic linear algebra subprograms

2018 ◽  
Vol 31 (8) ◽  
pp. 4353-4365 ◽  
Author(s):  
Feng Li ◽  
Yunming Ye ◽  
Zhaoyang Tian ◽  
Xiaofeng Zhang

Author(s):
Wesley Petersen ◽  
Peter Arbenz

Linear algebra is often the kernel of most numerical computations. It deals with vectors and matrices and simple operations like addition and multiplication on these objects. Vectors are one-dimensional arrays of, say, n real or complex numbers x_0, x_1, ..., x_{n-1}. We denote such a vector by x and think of it as a column vector, x = (x_0, x_1, ..., x_{n-1})^T. On a sequential computer, these numbers occupy n consecutive memory locations. This is also true, at least conceptually, on a shared-memory multiprocessor computer. On distributed-memory multicomputers, the primary issue is how to distribute vectors over the memories of the processors involved in the computation.

Matrices are two-dimensional arrays of the form A = (a_{ij}). The n·m real (complex) matrix elements a_{ij} are stored in n·m (respectively 2·n·m, if a complex data type is available) consecutive memory locations. This is achieved either by stacking the columns on top of each other or by appending row after row; the former is called column-major, the latter row-major order. The actual procedure depends on the programming language: in Fortran, matrices are stored in column-major order, in C in row-major order. There is no difference in principle, but for writing efficient programs one has to respect how matrices are laid out. To be consistent with the libraries we will use, which are mostly written in Fortran, we explicitly program in column-major order. Thus, the matrix element a_{ij} of the m×n matrix A is located i + j·m memory locations after a_{00}; therefore, in our C codes we write a[i+j*m]. Notice that there is no such simple procedure for determining the memory location of an element of a sparse matrix; in Section 2.3, we outline data descriptors to handle sparse matrices.

In this and later chapters we deal with one of the simplest operations one wants to do with vectors and matrices: the so-called saxpy operation (2.3), y ← αx + y. Tables 2.1 and 2.2 list some of the acronyms and conventions for the basic linear algebra subprograms discussed in this book.
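The indexing rule a[i+j*m] and the saxpy operation can be illustrated with a short, self-contained C sketch. It is not code from the book; the matrix size, the values, and the naive loops are purely illustrative, assuming column-major storage as described above.

```c
#include <stdio.h>
#include <stdlib.h>

/* saxpy: y <- alpha*x + y, the Level-1 BLAS operation referred to as (2.3). */
static void saxpy(int n, float alpha, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

int main(void)
{
    int m = 3, n = 2;                        /* A is an m x n matrix */
    float *a = malloc((size_t)m * n * sizeof *a);

    /* Column-major storage: element a_{ij} lies i + j*m slots after a_{00}. */
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++)
            a[i + j * m] = (float)(10 * i + j);

    printf("a_{21} = %g\n", a[2 + 1 * m]);   /* prints 21 */

    float x[3] = {1, 2, 3}, y[3] = {4, 5, 6};
    saxpy(3, 2.0f, x, y);                    /* y becomes {6, 9, 12} */
    printf("y = {%g, %g, %g}\n", y[0], y[1], y[2]);

    free(a);
    return 0;
}
```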


Author(s):  
Thilo Kielmann ◽  
Sergei Gorlatch ◽  
Utpal Banerjee ◽  
Rocco De Nicola ◽  
Jack Dongarra ◽  
...  

1990 ◽  
Vol 16 (1) ◽  
pp. 1-17 ◽  
Author(s):  
J. J. Dongarra ◽  
Jeremy Du Croz ◽  
Sven Hammarling ◽  
I. S. Duff

1988 ◽  
Vol 14 (1) ◽  
pp. 18-32 ◽  
Author(s):  
Jack J. Dongarra ◽  
Jeremy Du Croz ◽  
Sven Hammarling ◽  
Richard J. Hanson

2009 ◽  
Vol 19 (01) ◽  
pp. 159-174 ◽  
Author(s):  
Mostafa I. Soliman

Multi-core technology is a natural next step in delivering the benefits of Moore's law to computing platforms. On multi-core processors, the performance of many applications can be improved by running code as parallel threads using multi-threading techniques. This paper evaluates the performance of multi-core Intel Xeon processors on the widely used basic linear algebra subprograms (BLAS). On two dual-core Intel Xeon processors with Hyper-Threading technology, our results show that a performance of around 20 GFLOPS is achieved on Level-3 BLAS (matrix-matrix operations) using multi-threading, SIMD, matrix blocking, and loop unrolling techniques. On small Level-2 (matrix-vector operations) and Level-1 (vector operations) BLAS, however, multi-threading slows down the execution because of the thread-creation overhead. Thus, using the Intel SIMD instruction set is the way to improve the performance of single-threaded Level-2 (6 GFLOPS) and Level-1 BLAS (3 GFLOPS). When the problem size becomes too large to fit in the L2 cache, the performance of the four Xeon cores drops below 2 and 1 GFLOPS on Level-2 and Level-1 BLAS, respectively, even though eight threads are executed in parallel on eight logical processors.
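The combination of techniques named in this abstract can be sketched in a few lines of C. The block below is not the paper's implementation; it is a minimal illustration of a blocked, multi-threaded Level-3 kernel, where the block size BS, the column-major layout, and the use of OpenMP for the threading are our assumptions.

```c
#include <stddef.h>

#define BS 64  /* illustrative block size; in practice it is tuned to the cache */

/* C <- C + A*B for n x n column-major matrices, with n assumed to be a
 * multiple of BS. Sketch of three techniques from the abstract:
 * multi-threading (here via an OpenMP pragma), matrix blocking, and an
 * inner loop the compiler can unroll and vectorize.
 * Compile with, e.g., gcc -O3 -fopenmp.                                  */
void blocked_dgemm(int n, const double *A, const double *B, double *C)
{
    #pragma omp parallel for collapse(2) schedule(static)
    for (int jj = 0; jj < n; jj += BS)          /* block of columns of C */
        for (int ii = 0; ii < n; ii += BS)      /* block of rows of C    */
            for (int kk = 0; kk < n; kk += BS)
                for (int j = jj; j < jj + BS; j++)
                    for (int k = kk; k < kk + BS; k++) {
                        double bkj = B[k + (size_t)j * n];
                        /* unit-stride inner loop over a column of A and C */
                        for (int i = ii; i < ii + BS; i++)
                            C[i + (size_t)j * n] += A[i + (size_t)k * n] * bkj;
                    }
}
```

The unit-stride inner loop gives the compiler (or hand-written SIMD intrinsics) something to unroll and vectorize, while the blocking keeps the working set in cache, which the abstract identifies as the limiting factor once the problem no longer fits in the L2 cache.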


Author(s):  
A. Myasishchev ◽  
S. Lienkov ◽  
V. Dzhulii ◽  
I. Muliar

Research goals and objectives: the purpose of the article is to study the feasibility of using graphics processors, as compared with conventional multi-core processors, for solving systems of linear equations and computing matrix products. The specifics of using the MAGMA and CUBLAS libraries with various graphics processors are considered, and a performance comparison is made between the Tesla C2075 and GeForce GTX 480 GPUs and a six-core AMD processor. Subject of research: software developed on the basis of the MAGMA and CUBLAS libraries to study the performance of the NVIDIA Tesla C2075 and GeForce GTX 480 GPUs in solving systems of linear equations and computing matrix products. Research methods used: libraries were used to parallelize the linear algebra computations: MAGMA and CUBLAS for the GPUs, and ScaLAPACK and ATLAS for the multi-core processor. To study execution speed, methods and algorithms for parallelizing the computational procedures, similar to those of these libraries, are used. A software module has been developed for solving systems of linear equations and computing matrix products on parallel systems. Results of the research: it has been determined that, for double-precision numbers, the GeForce GTX 480 and Tesla C2075 GPUs are approximately 3.5 and 6.3 times faster, respectively, than the AMD CPU, while for single-precision numbers the GeForce GTX 480 is about 1.3 times faster than the Tesla C2075. To achieve maximum performance from an NVIDIA CUDA GPU, the MAGMA or CUBLAS libraries should be used; they accelerate the calculations by about 6.4 times compared with the traditional programming method. It has also been determined that, when solving systems of equations on a 6-core CPU with the ScaLAPACK and ATLAS libraries, the maximum speedup over a single core is 3.24 times instead of the theoretical 6 times, so these libraries cannot efficiently exploit processors with a large number of cores. It is demonstrated that the advantage of the GPU over the CPU grows with the number of equations.
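As a concrete illustration of the CUBLAS route discussed above, the sketch below calls cublasDgemm for a double-precision matrix product on the GPU. It is not the authors' benchmark code: the matrix size, the initialization, and the omission of error checking are our simplifications.

```c
/* Minimal cuBLAS DGEMM example (C). Build, e.g.: nvcc dgemm_demo.c -lcublas */
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void)
{
    const int n = 1024;                       /* illustrative matrix size */
    const double alpha = 1.0, beta = 0.0;
    size_t bytes = (size_t)n * n * sizeof(double);

    /* cuBLAS expects column-major storage. */
    double *hA = malloc(bytes), *hB = malloc(bytes), *hC = malloc(bytes);
    for (int i = 0; i < n * n; i++) { hA[i] = 1.0; hB[i] = 2.0; }

    double *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    /* C <- alpha*A*B + beta*C, all matrices n x n with leading dimension n */
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);
    cublasDestroy(handle);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    printf("C[0] = %g (expected %g)\n", hC[0], 2.0 * n);

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```

Timing this call (and its single-precision counterpart cublasSgemm) for growing n, against a CPU run using a library such as ATLAS, reproduces the kind of GPU-versus-CPU comparison reported in the article.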

