Study on Dense Matrix Multiplication Algorithms and Performance Evaluation of HPCC in 81 Nodes IBM Power 8 Architecture

Optimizing HPC systems based on performance factors and bottlenecks is essential for designing an HPC infrastructure with the best characteristics and at a reasonable cost. Such insight can only be achieved through a detailed analysis of existing HPC systems and the execution of their workloads. The “Quinde I” is the only and most powerful supercomputer in Ecuador and is currently listed third on the South America. It was built with the IBM Power 8 servers. In this work, we measured its performance using different parameters from High-Performance Computing (HPC) to compare it with theoretical values and values obtained from tests on similar models. To measure its performance, we compiled and ran different benchmarks with the specific optimization flags for Power 8 to get the maximum performance with the current configuration in the hardware installed by the vendor. The inputs of the benchmarks were varied to analyze their impact on the system performance. In addition, we compile and compare the performance of two algorithms for dense matrix multiplication SRUMMA and DGEMM.

Download Full-text

SYSTEMC IMPLEMENTATION AND PERFORMANCE EVALUATION OF A DECOUPLED GENERAL-PURPOSE MATRIX PROCESSOR

Parallel Processing Letters ◽

10.1142/s0129626410000090 ◽

2010 ◽

Vol 20 (02) ◽

pp. 103-121 ◽

Cited By ~ 1

Author(s):

MOSTAFA I. SOLIMAN ◽

ABDULMAJID F. Al-JUNAID

Keyword(s):

Performance Evaluation ◽

Matrix Multiplication ◽

General Purpose ◽

System Level ◽

Memory Latency ◽

Single Chip ◽

Wide Range ◽

Matrix Unit ◽

And Performance ◽

Vector Matrix

Technological advances in IC manufacturing provide us with the capability to integrate more and more functionality into a single chip. Today's modern processors have nearly one billion transistors on a single chip. With the increasing complexity of today's system, the designs have to be modeled at a high-level of abstraction before partitioning into hardware and software components for final implementation. This paper explains in detail the implementation and performance evaluation of a matrix processor called Mat-Core with SystemC (system level modeling language). Mat-Core is a research processor aiming at exploiting the increasingly number of transistors per IC to improve the performance of a wide range of applications. It extends a general-purpose scalar processor with a matrix unit. To hide memory latency, the extended matrix unit is decoupled into two components: address generation and data computation, which communicate through data queues. Like vector architectures, the data computation unit is organized in parallel lanes. However, on parallel lanes, Mat-Core can execute matrix-scalar, matrix-vector, and matrix-matrix instructions in addition to vector-scalar and vector-vector instructions. For controlling the execution of vector/matrix instructions on the matrix core, this paper extends the well known scoreboard technique. Furthermore, the performance of Mat-Core is evaluated on vector and matrix kernels. Our results show that the performance of four lanes Mat-Core with matrix registers of size 4 × 4 or 16 elements each, queues size of 10, start up time of 6 clock cycles, and memory latency of 10 clock cycles is about 0.94, 1.3, 2.3, 1.6, 2.3, and 5.5 FLOPs per clock cycle; achieved on scalar-vector multiplication, SAXPY, Givens, rank-1 update, vector-matrix multiplication, and matrix-matrix multiplication, respectively.

Download Full-text