PB-BLAS: A Set of Parallel Block Basic Linear Algebra Subprograms

Author(s):  
Jaeyoung Choi ◽  
J.J. Dongarra ◽  
D.W. Walker
Author(s):  
Wesley Petersen ◽  
Peter Arbenz

Linear algebra lies at the kernel of most numerical computations. It deals with vectors and matrices and with simple operations on these objects, such as addition and multiplication. Vectors are one-dimensional arrays of, say, n real or complex numbers x0, x1, . . . , xn−1. We denote such a vector by x and think of it as a column vector, x = (x0, x1, . . . , xn−1)^T. On a sequential computer, these numbers occupy n consecutive memory locations. This is also true, at least conceptually, on a shared-memory multiprocessor computer. On distributed-memory multicomputers, the primary issue is how to distribute vectors over the memories of the processors involved in the computation.

Matrices are two-dimensional arrays of the form A = (aij), with m rows and n columns. The n · m real (complex) matrix elements aij are stored in n · m (respectively 2 · n · m, if no complex datatype is available) consecutive memory locations. This is achieved either by stacking the columns on top of each other or by appending row after row. The former is called column-major, the latter row-major order. The actual procedure depends on the programming language: in Fortran, matrices are stored in column-major order, in C in row-major order. There is no fundamental difference, but to write efficient programs one has to respect how matrices are laid out. To be consistent with the libraries that we will use, which are mostly written in Fortran, we will explicitly program in column-major order. Thus, the matrix element aij of the m×n matrix A is located i + j · m memory locations after a00. Therefore, in our C codes we will write a[i+j*m]. Notice that there is no such simple procedure for determining the memory location of an element of a sparse matrix. In Section 2.3, we outline data descriptors to handle sparse matrices.

In this and later chapters we deal with one of the simplest operations one wants to perform on vectors and matrices: the so-called saxpy operation (2.3), y ← αx + y. Tables 2.1 and 2.2 list some of the acronyms and conventions for the basic linear algebra subprograms discussed in this book.
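As a concrete illustration of the conventions just described, the following minimal C sketch (not taken from the book; the sizes and values are arbitrary assumptions) stores a small m×n matrix in column-major order, addresses element a(i,j) as a[i+j*m], and applies a saxpy operation y ← αx + y to its columns.

#include <stdio.h>

/* saxpy: y <- alpha*x + y on vectors of length n (the operation (2.3)). */
void saxpy(int n, float alpha, const float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] += alpha * x[i];
}

int main(void)
{
    enum { M = 3, N = 2 };          /* a small m x n matrix, m = 3, n = 2 */
    float a[M * N];

    /* Column-major storage: a(i,j) lives i + j*M locations after a(0,0). */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < M; i++)
            a[i + j * M] = 10.0f * i + j;       /* a(i,j) = 10*i + j */

    printf("a(2,1) = %g\n", a[2 + 1 * M]);      /* prints 21 */

    /* Apply saxpy to the columns: column 0 <- 2*column 1 + column 0. */
    saxpy(M, 2.0f, &a[0 + 1 * M], &a[0 + 0 * M]);
    printf("a(0,0) = %g\n", a[0 + 0 * M]);      /* 0 + 2*1 = 2 */
    return 0;
}

Compiled as C99 (e.g. cc -std=c99), this prints 21 and 2; the indexing expression a[i+j*m] is the same one used throughout the book's C codes.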


Author(s):  
Thilo Kielmann ◽  
Sergei Gorlatch ◽  
Utpal Banerjee ◽  
Rocco De Nicola ◽  
Jack Dongarra ◽  
...  

1990 ◽  
Vol 16 (1) ◽  
pp. 1-17 ◽  
Author(s):  
J. J. Dongarra ◽  
Jeremy Du Croz ◽  
Sven Hammarling ◽  
I. S. Duff

1988 ◽  
Vol 14 (1) ◽  
pp. 18-32 ◽  
Author(s):  
Jack J. Dongarra ◽  
Jeremy Du Croz ◽  
Sven Hammarling ◽  
Richard J. Hanson

2009 ◽  
Vol 19 (01) ◽  
pp. 159-174 ◽  
Author(s):  
Mostafa I. Soliman

Multi-core technology is a natural next step in delivering the benefits of Moore's law to computing platforms. On multi-core processors, the performance of many applications can be improved by running threads of code in parallel using multi-threading techniques. This paper evaluates the performance of multi-core Intel Xeon processors on the widely used basic linear algebra subprograms (BLAS). On two dual-core Intel Xeon processors with Hyper-Threading technology, our results show that around 20 GFLOPS is achieved on Level-3 BLAS (matrix-matrix operations) using multi-threading, SIMD, matrix blocking, and loop unrolling techniques. For small problem sizes, however, multi-threading slows down the execution of Level-2 (matrix-vector operations) and Level-1 (vector operations) BLAS because of thread-creation overheads; there, using the Intel SIMD instruction set is the way to improve the performance of single-threaded Level-2 (6 GFLOPS) and Level-1 (3 GFLOPS) BLAS. When the problem size becomes too large to fit in the L2 cache, the performance of the four Xeon cores drops below 2 GFLOPS on Level-2 BLAS and below 1 GFLOPS on Level-1 BLAS, even though eight threads execute in parallel on eight logical processors.
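To make the blocking-plus-threading idea concrete, here is a minimal C/OpenMP sketch of a cache-blocked matrix-matrix update C := C + A·B in column-major storage. It is only an illustration under assumed parameters (block size BS, matrix order n); it is not the authors' kernel, which additionally uses SIMD intrinsics and loop unrolling to reach the reported performance.

#include <stdio.h>
#include <stdlib.h>

#define BS 64   /* cache block size (an assumed value) */

/* Blocked C := C + A*B for n x n column-major matrices.  The OpenMP pragma
 * distributes whole column blocks of C over the cores, so threads never
 * write to the same memory; compiled without OpenMP it is simply ignored. */
static void gemm_blocked(int n, const double *A, const double *B, double *C)
{
    #pragma omp parallel for schedule(static)
    for (int jj = 0; jj < n; jj += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int j = jj; j < jj + BS && j < n; j++)
                for (int k = kk; k < kk + BS && k < n; k++) {
                    double bkj = B[k + j * n];
                    for (int i = 0; i < n; i++)
                        C[i + j * n] += A[i + k * n] * bkj;
                }
}

int main(void)
{
    int n = 256;
    double *A = calloc((size_t)n * n, sizeof *A);
    double *B = calloc((size_t)n * n, sizeof *B);
    double *C = calloc((size_t)n * n, sizeof *C);
    if (!A || !B || !C) return 1;

    for (int j = 0; j < n; j++)
        for (int i = 0; i < n; i++) {
            A[i + j * n] = (i == j);        /* A = identity, so C should equal B */
            B[i + j * n] = i + 0.001 * j;
        }

    gemm_blocked(n, A, B, C);
    printf("C(3,7) = %g, B(3,7) = %g\n", C[3 + 7 * n], B[3 + 7 * n]);

    free(A); free(B); free(C);
    return 0;
}

The same structure pays off much less for Level-1 and Level-2 operations: a single pass over a vector does too little arithmetic per element to amortize the cost of creating threads, which is exactly the effect reported in the abstract.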


1991 ◽  
Vol 17 (2) ◽  
pp. 253-263 ◽  
Author(s):  
David S. Dodson ◽  
Roger G. Grimes ◽  
John G. Lewis
