Effectiveness of performance tuning techniques for general matrix multiplication on the PEZY-SC2

Double-precision general matrix multiplication (DGEMM) is an essential kernel for measuring the potential performance of an HPC platform. ARMv8-based system-on-chips (SoCs) have become the candidates for the next-generation HPC systems with their highly competitive performance and energy efficiency. Therefore, it is meaningful to design high-performance DGEMM for ARMv8-based SoCs. However, as ARMv8-based SoCs integrate increasing cores, modern CPU uses non-uniform memory access (NUMA). NUMA restricts the performance and scalability of DGEMM when many threads access remote NUMA domains. This poses a challenge to develop high-performance DGEMM on multi-NUMA architecture. We present a NUMA-aware method to reduce the number of cross-die and cross-chip memory access events. The critical enabler for NUMA-aware DGEMM is to leverage two levels of parallelism between and within nodes in a purely threaded implementation, which allows the task independence and data localization of NUMA nodes. We have implemented NUMA-aware DGEMM in the OpenBLAS and evaluated it on a dual-socket server with 48-core processors based on the Kunpeng920 architecture. The results show that NUMA-aware DGEMM has effectively reduced the number of cross-die and cross-chip memory access, resulting in enhancing the scalability of DGEMM significantly and increasing the performance of DGEMM by 17.1% on average, with the most remarkable improvement being 21.9%.

Download Full-text

Comparing the Performance of General Matrix Multiplication Routine on Heterogeneous Computing Systems

Journal of Parallel and Distributed Computing ◽

10.1016/j.jpdc.2021.10.002 ◽

2021 ◽

Author(s):

Aleksei Sorokin ◽

Sergey Malkovsky ◽

Georgiy Tsoy

Keyword(s):

Heterogeneous Computing ◽

Matrix Multiplication ◽

Computing Systems ◽

General Matrix ◽

Heterogeneous Computing Systems

Download Full-text

A mechanism for balancing accuracy and scope in cross-machine black-box GPU performance modeling

The International Journal of High Performance Computing Applications ◽

10.1177/1094342020921340 ◽

2020 ◽

Vol 34 (6) ◽

pp. 589-614

Author(s):

James D Stevens ◽

Andreas Klöckner

Keyword(s):

Performance Optimization ◽

Heterogeneous Computing ◽

Performance Modeling ◽

Matrix Multiplication ◽

Black Box ◽

Ease Of Use ◽

Performance Tuning ◽

Parallel Applications ◽

Accuracy Evaluation ◽

Trade Offs

The ability to model, analyze, and predict execution time of computations is an important building block that supports numerous efforts, such as load balancing, benchmarking, job scheduling, developer-guided performance optimization, and the automation of performance tuning for high performance, parallel applications. In today’s increasingly heterogeneous computing environment, this task must be accomplished efficiently across multiple architectures, including massively parallel coprocessors like GPUs, which are increasingly prevalent in the world’s fastest supercomputers. To address this challenge, we present an approach for constructing customizable, cross-machine performance models for GPU kernels, including a mechanism to automatically and symbolically gather performance-relevant kernel operation counts, a tool for formulating mathematical models using these counts, and a customizable parameterized collection of benchmark kernels used to calibrate models to GPUs in a black-box fashion. With this approach, we empower the user to manage trade-offs between model accuracy, evaluation speed, and generalizability. A user can define their own model and customize the calibration process, making it as simple or complex as desired, and as application-targeted or general as desired. As application examples of our approach, we demonstrate both linear and nonlinear models; these examples are designed to predict execution times for multiple variants of a particular computation: two matrix-matrix multiplication variants, four discontinuous Galerkin differentiation operation variants, and two 2D five-point finite difference stencil variants. For each variant, we present accuracy results on GPUs from multiple vendors and hardware generations. We view this highly user-customizable approach as a response to a central question arising in GPU performance modeling: how can we model GPU performance in a cost-explanatory fashion while maintaining accuracy, evaluation speed, portability, and ease of use, an attribute we believe precludes approaches requiring manual collection of kernel or hardware statistics.

Download Full-text

Extending the four Russians' bound to general matrix multiplication

Information Processing Letters ◽

10.1016/s0020-0190(80)90080-0 ◽

1980 ◽

Vol 10 (2) ◽

pp. 87-88 ◽

Cited By ~ 1

Author(s):

Nicola Santoro

Keyword(s):

Matrix Multiplication ◽

General Matrix

Download Full-text

A Flexible-blocking Based Approach for Performance Tuning of Matrix Multiplication Routines for Large Matrices with Edge Cases

2018 IEEE International Conference on Big Data (Big Data) ◽

10.1109/bigdata.2018.8622013 ◽

2018 ◽

Author(s):

Md Mosharaf Hossain ◽

Thomas M. Hines ◽

Sheikh K. Ghafoor ◽

Sheikh Rabiul Islam ◽

Ramakrishnan Kannan ◽

...

Keyword(s):

Matrix Multiplication ◽

Performance Tuning

Download Full-text

Optimizing Hardware Accelerated General Matrix-Matrix Multiplication for CNNs on FPGAs

IEEE Transactions on Circuits & Systems II Express Briefs ◽

10.1109/tcsii.2020.2965154 ◽

2020 ◽

Vol 67 (11) ◽

pp. 2692-2696

Author(s):

Afzal Ahmad ◽

Muhammad Adeel Pasha

Keyword(s):

Matrix Multiplication ◽

General Matrix

Download Full-text

Register-based implementation of the sparse general matrix-matrix multiplication on GPUs

Proceedings of the 23rd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming - PPoPP '18 ◽

10.1145/3178487.3178529 ◽

2018 ◽

Cited By ~ 9

Author(s):

Junhong Liu ◽

Xin He ◽

Weifeng Liu ◽

Guangming Tan

Keyword(s):

Matrix Multiplication ◽

General Matrix

Download Full-text

Parallel Multi Channel convolution using General Matrix Multiplication

2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP) ◽

10.1109/asap.2017.7995254 ◽

2017 ◽

Cited By ~ 26

Author(s):

Aravind Vasudevan ◽

Andrew Anderson ◽

David Gregg

Keyword(s):

Matrix Multiplication ◽

General Matrix

Download Full-text

An Approximate GEMM Unit for Energy-Efficient Object Detection

Sensors ◽

10.3390/s21124195 ◽

2021 ◽

Vol 21 (12) ◽

pp. 4195

Author(s):

Ratko Pilipović ◽

Vladimir Risojević ◽

Janko Božič ◽

Patricio Bulić ◽

Uroš Lotrič

Keyword(s):

Neural Network ◽

Artificial Intelligence ◽

Neural Networks ◽

Convolutional Neural Network ◽

Convolutional Neural Networks ◽

Energy Efficient ◽

Graphics Processing Units ◽

Matrix Multiplication ◽

Approximate Computing ◽

General Matrix

Edge computing brings artificial intelligence algorithms and graphics processing units closer to data sources, making autonomy and energy-efficient processing vital for their design. Approximate computing has emerged as a popular strategy for energy-efficient circuit design, where the challenge is to achieve the best tradeoff between design efficiency and accuracy. The essential operation in artificial intelligence algorithms is the general matrix multiplication (GEMM) operation comprised of matrix multiplication and accumulation. This paper presents an approximate general matrix multiplication (AGEMM) unit that employs approximate multipliers to perform matrix–matrix operations on four-by-four matrices given in sixteen-bit signed fixed-point format. The synthesis of the proposed AGEMM unit to the 45 nm Nangate Open Cell Library revealed that it consumed only up to 36% of the area and 25% of the energy required by the exact general matrix multiplication unit. The AGEMM unit is ideally suited to convolutional neural networks, which can adapt to the error induced in the computation. We evaluated the AGEMM units’ usability for honeybee detection with the YOLOv4-tiny convolutional neural network. The results implied that we can deploy the AGEMM units in convolutional neural networks without noticeable performance degradation. Moreover, the AGEMM unit’s employment can lead to more area- and energy-efficient convolutional neural network processing, which in turn could prolong sensors’ and edge nodes’ autonomy.

Download Full-text