PB-BLAS: A Set of Parallel Block Basic Linear Algebra Subprograms

Author(s):  
Jaeyoung Choi ◽  
J.J. Dongarra ◽  
D.W. Walker
Author(s):  
Wesley Petersen ◽  
Peter Arbenz

Linear algebra lies at the kernel of most numerical computations. It deals with vectors and matrices and with simple operations on these objects, such as addition and multiplication. Vectors are one-dimensional arrays of, say, n real or complex numbers x0, x1, . . . , xn−1. We denote such a vector by x and think of it as a column vector, x = (x0, x1, . . . , xn−1)^T. On a sequential computer, these numbers occupy n consecutive memory locations. This is also true, at least conceptually, on a shared-memory multiprocessor computer. On distributed-memory multicomputers, the primary issue is how to distribute vectors over the memories of the processors involved in the computation.

Matrices are two-dimensional arrays of the form A = (aij), with m rows and n columns. The n · m real (complex) matrix elements aij are stored in n · m (respectively 2 · n · m, if no complex datatype is available) consecutive memory locations. This is achieved either by stacking the columns on top of each other or by appending row after row. The former is called column-major, the latter row-major order. The actual procedure depends on the programming language: in Fortran, matrices are stored in column-major order, in C in row-major order. There is no fundamental difference, but to write efficient programs one has to respect how matrices are laid out. To be consistent with the libraries that we will use, which are mostly written in Fortran, we will explicitly program in column-major order. Thus, the matrix element aij of the m×n matrix A is located i + j · m memory locations after a00. Therefore, in our C codes we will write a[i+j*m]. Notice that there is no such simple procedure for determining the memory location of an element of a sparse matrix. In Section 2.3, we outline data descriptors to handle sparse matrices.

In this and later chapters we deal with one of the simplest operations one wants to perform on vectors and matrices: the so-called saxpy operation (2.3), y ← αx + y. Tables 2.1 and 2.2 list some of the acronyms and conventions for the basic linear algebra subprograms discussed in this book.
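As a concrete illustration of the conventions just described, the following minimal C sketch (not taken from the book; the sizes and values are arbitrary assumptions) stores a small m×n matrix in column-major order, addresses element a(i,j) as a[i+j*m], and applies a saxpy operation y ← αx + y to its columns.

#include <stdio.h>

/* saxpy: y <- alpha*x + y on vectors of length n (the operation (2.3)). */
void saxpy(int n, float alpha, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

int main(void)
{
    enum { M = 3, N = 2 };          /* a small m x n matrix, m = 3, n = 2 */
    float a[M * N];

    /* Column-major storage: a(i,j) lives i + j*M locations after a(0,0). */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            a[i + j * M] = 10.0f * i + j;       /* a(i,j) = 10*i + j */

    printf("a(2,1) = %g\n", a[2 + 1 * M]);      /* prints 21 */

    /* Apply saxpy to the columns: column 0 <- 2*column 1 + column 0. */
    saxpy(M, 2.0f, &a[0 + 1 * M], &a[0 + 0 * M]);
    printf("a(0,0) = %g\n", a[0 + 0 * M]);      /* 0 + 2*1 = 2 */
    return 0;
}

Compiled as C99 (e.g. cc -std=c99), this prints 21 and 2; the indexing expression a[i+j*m] is the same one used throughout the book's C codes.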


Author(s):  
Thilo Kielmann ◽  
Sergei Gorlatch ◽  
Utpal Banerjee ◽  
Rocco De Nicola ◽  
Jack Dongarra ◽  
...  

1990 ◽  
Vol 16 (1) ◽  
pp. 1-17 ◽  
Author(s):  
J. J. Dongarra ◽  
Jeremy Du Croz ◽  
Sven Hammarling ◽  
I. S. Duff

1988 ◽  
Vol 14 (1) ◽  
pp. 18-32 ◽  
Author(s):  
Jack J. Dongarra ◽  
Jeremy Du Croz ◽  
Sven Hammarling ◽  
Richard J. Hanson

2009 ◽  
Vol 19 (01) ◽  
pp. 159-174 ◽  
Author(s):  
Mostafa I. Soliman

Multi-core technology is a natural next step in delivering the benefits of Moore's law to computing platforms. On multi-core processors, the performance of many applications can be improved by running threads of code in parallel using multi-threading techniques. This paper evaluates the performance of multi-core Intel Xeon processors on the widely used basic linear algebra subprograms (BLAS). On two dual-core Intel Xeon processors with Hyper-Threading technology, our results show that around 20 GFLOPS is achieved on Level-3 BLAS (matrix-matrix operations) using multi-threading, SIMD, matrix blocking, and loop unrolling techniques. For small problem sizes, however, multi-threading slows down the execution of Level-2 (matrix-vector operations) and Level-1 (vector operations) BLAS because of thread-creation overheads; there, using the Intel SIMD instruction set is the way to improve the performance of single-threaded Level-2 (6 GFLOPS) and Level-1 (3 GFLOPS) BLAS. When the problem size becomes too large to fit in the L2 cache, the performance of the four Xeon cores drops below 2 GFLOPS on Level-2 BLAS and below 1 GFLOPS on Level-1 BLAS, even though eight threads execute in parallel on eight logical processors.
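To make the blocking-plus-threading idea concrete, here is a minimal C/OpenMP sketch of a cache-blocked matrix-matrix update C := C + A·B in column-major storage. It is only an illustration under assumed parameters (block size BS, matrix order n); it is not the authors' kernel, which additionally uses SIMD intrinsics and loop unrolling to reach the reported performance.

#include <stdio.h>
#include <stdlib.h>

#define BS 64   /* cache block size (an assumed value) */

/* Blocked C := C + A*B for n x n column-major matrices.  The OpenMP pragma
 * distributes whole column blocks of C over the cores, so threads never
 * write to the same memory; compiled without OpenMP it is simply ignored. */
static void gemm_blocked(int n, const double *A, const double *B, double *C)
{
    #pragma omp parallel for schedule(static)
    for (int jj = 0; jj < n; jj += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int j = jj; j < jj + BS && j < n; j++)
                for (int k = kk; k < kk + BS && k < n; k++) {
                    double bkj = B[k + j * n];
                    for (int i = 0; i < n; i++)
                        C[i + j * n] += A[i + k * n] * bkj;
                }
}

int main(void)
{
    int n = 256;
    double *A = calloc((size_t)n * n, sizeof *A);
    double *B = calloc((size_t)n * n, sizeof *B);
    double *C = calloc((size_t)n * n, sizeof *C);
    if (!A || !B || !C) return 1;

    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++) {
            A[i + j * n] = (i == j);        /* A = identity, so C should equal B */
            B[i + j * n] = i + 0.001 * j;
        }

    gemm_blocked(n, A, B, C);
    printf("C(3,7) = %g, B(3,7) = %g\n", C[3 + 7 * n], B[3 + 7 * n]);

    free(A); free(B); free(C);
    return 0;
}

The same structure pays off much less for Level-1 and Level-2 operations: a single pass over a vector does too little arithmetic per element to amortize the cost of creating threads, which is exactly the effect reported in the abstract.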


1991 ◽  
Vol 17 (2) ◽  
pp. 253-263 ◽  
Author(s):  
David S. Dodson ◽  
Roger G. Grimes ◽  
John G. Lewis
