NUMA-Aware DGEMM Based on 64-Bit ARMv8 Multicore Processors Architecture

Double-precision general matrix multiplication (DGEMM) is an essential kernel for measuring the potential performance of an HPC platform. ARMv8-based system-on-chips (SoCs) have become candidates for next-generation HPC systems thanks to their highly competitive performance and energy efficiency, so it is worthwhile to design a high-performance DGEMM for them. However, as ARMv8-based SoCs integrate more and more cores, they adopt non-uniform memory access (NUMA). NUMA restricts the performance and scalability of DGEMM when many threads access remote NUMA domains, which makes it challenging to develop a high-performance DGEMM on multi-NUMA architectures. We present a NUMA-aware method that reduces the number of cross-die and cross-chip memory access events. The critical enabler for NUMA-aware DGEMM is to leverage two levels of parallelism, between and within NUMA nodes, in a purely threaded implementation, which allows task independence and data localization on each NUMA node. We have implemented NUMA-aware DGEMM in OpenBLAS and evaluated it on a dual-socket server with 48-core processors based on the Kunpeng920 architecture. The results show that NUMA-aware DGEMM effectively reduces the number of cross-die and cross-chip memory accesses, which significantly improves the scalability of DGEMM and increases its performance by 17.1% on average, with a maximum improvement of 21.9%.
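
The two levels of parallelism mentioned in the abstract can be illustrated with a small partitioning sketch. It is not the OpenBLAS implementation from the paper: the function names are hypothetical, and splitting the rows of C across NUMA nodes and the columns across the threads of a node is an assumption chosen only to show how each node can own an independent block of the result.

```python
def split_range(n, parts):
    """Split range(n) into `parts` contiguous, nearly equal (start, stop) chunks."""
    base, extra = divmod(n, parts)
    chunks, start = [], 0
    for i in range(parts):
        stop = start + base + (1 if i < extra else 0)
        chunks.append((start, stop))
        start = stop
    return chunks


def two_level_partition(m, n, numa_nodes, threads_per_node):
    """Map (node, thread) to the block of the m x n matrix C that thread computes.

    Level 1 (assumed): rows of C are divided between NUMA nodes, so each node
    works on its own row panel and can keep that panel in local memory.
    Level 2 (assumed): columns of the node's panel are divided between its threads.
    """
    plan = {}
    for node, rows in enumerate(split_range(m, numa_nodes)):
        for thread, cols in enumerate(split_range(n, threads_per_node)):
            plan[(node, thread)] = (rows, cols)
    return plan


if __name__ == "__main__":
    # Illustrative sizes only, not the configuration evaluated in the paper.
    for key, block in sorted(two_level_partition(8, 6, 2, 3).items()):
        print(key, block)
```

Because no block of C is shared between nodes in this sketch, the only cross-node traffic left would come from reading the input matrices, which is the kind of access the abstract says the NUMA-aware method tries to reduce.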

FPGA Based High Performance Double-Precision Matrix Multiplication

2009 22nd International Conference on VLSI Design ◽

10.1109/vlsi.design.2009.13 ◽

2009 ◽

Cited By ~ 20

Author(s):

Vinay B.Y. Kumar ◽

Siddharth Joshi ◽

Sachin B. Patkar ◽

H. Narayanan

Keyword(s):

High Performance ◽

Matrix Multiplication ◽

Double Precision ◽

Precision Matrix

Supporting Mixed-domain Mixed-precision Matrix Multiplication within the BLIS Framework

ACM Transactions on Mathematical Software ◽

10.1145/3402225 ◽

2021 ◽

Vol 47 (2) ◽

pp. 1-26

Author(s):

Field G. Van Zee ◽

Devangi N. Parikh ◽

Robert A. Van De Geijn

Keyword(s):

High Performance ◽

Matrix Multiplication ◽

Software Framework ◽

Matrix Product ◽

Double Precision ◽

Precision Matrix ◽

Implementation Approach ◽

Mixed Precision ◽

The Matrix ◽

Performance Results

We approach the problem of implementing mixed-datatype support within the general matrix multiplication (gemm) operation of the BLAS-like Library Instantiation Software framework, whereby each matrix operand A, B, and C may be stored as single- or double-precision real or complex values. Another factor of complexity, whereby the matrix product and accumulation are allowed to take place in a precision different from the storage precisions of either A or B, is also discussed. We first break the problem into orthogonal dimensions, considering the mixing of domains separately from mixing precisions. Support for all combinations of matrix operands stored in either the real or complex domain is mapped out by enumerating the cases and describing an implementation approach for each. Supporting all combinations of storage and computation precisions is handled by typecasting the matrices at key stages of the computation (during packing and/or accumulation, as needed). Several optional optimizations are also documented. Performance results gathered on a 56-core Marvell ThunderX2 and a 52-core Intel Xeon Platinum demonstrate that high performance is mostly preserved, with modest slowdowns incurred from unavoidable typecast instructions. The mixed-datatype implementation confirms that combinatorial intractability is avoided, with the framework relying on only two assembly microkernels to implement 128 datatype combinations.
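
The idea of typecasting during packing and accumulation can be sketched in a few lines of Python. This is not BLIS code: the function names are hypothetical, only the real domain is covered, and the choice of single-precision storage for A and C, double-precision storage for B, and double-precision computation is an assumption made for illustration.

```python
from array import array


def pack(values, typecode):
    """Copy a flat row-major matrix into a new buffer, casting to `typecode`.

    'f' is single precision and 'd' is double precision in the array module.
    """
    return array(typecode, (float(v) for v in values))


def gemm_mixed(a, b, c, m, n, k):
    """Compute C += A @ B for flat row-major arrays with mixed storage types.

    A and B are cast to double while being packed, the product is accumulated
    in double (Python floats), and each result is cast back to the storage
    type of C when it is written, because `c` is a typed array.
    """
    a_packed = pack(a, "d")  # typecast during packing
    b_packed = pack(b, "d")
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a_packed[i * k + p] * b_packed[p * n + j]
            c[i * n + j] = c[i * n + j] + acc  # typecast during accumulation into C
    return c


if __name__ == "__main__":
    a = array("f", [1.0, 2.0, 3.0, 4.0])  # 2 x 2, stored in single precision
    b = array("d", [0.5, 0.0, 0.0, 0.5])  # 2 x 2, stored in double precision
    c = array("f", [0.0] * 4)             # result stored in single precision
    print(list(gemm_mixed(a, b, c, 2, 2, 2)))
```

The inner loops never need to know how the operands were stored, which hints at how one kernel can serve several storage combinations once the casts are confined to the packing and write-back steps.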

First Steps in Porting the LFRic Weather and Climate Model to the FPGAs of the EuroExa Architecture

Scientific Programming ◽

10.1155/2019/7807860 ◽

2019 ◽

Vol 2019 ◽

pp. 1-18 ◽

Cited By ~ 2

Author(s):

Mike Ashworth ◽

Graham D. Riley ◽

Andrew Attwood ◽

John Mawer

Keyword(s):

Energy Efficiency ◽

High Performance ◽

Climate Model ◽

Forecast Model ◽

Peak Performance ◽

Multiprocessor System ◽

Double Precision ◽

Power Budget ◽

Weather And Climate ◽

Fpga Acceleration

In recent years, there has been renewed interest in the use of field-programmable gate arrays (FPGAs) for high-performance computing (HPC). In this paper, we explore the techniques required by traditional HPC programmers in porting HPC applications to FPGAs, using as an example the LFRic weather and climate model. We report on the first steps in porting LFRic to the FPGAs of the EuroExa architecture. We have used Vivado High-Level Synthesis to implement a matrix-vector kernel from the LFRic code on a Xilinx UltraScale+ development board containing an XCZU9EG multiprocessor system-on-chip. We describe the porting of the code, discuss the optimization decisions, and report performance of 5.34 Gflop/s with double precision and 5.58 Gflop/s with single precision. We discuss sources of inefficiencies, comparisons with peak performance, comparisons with CPU and GPU performance (taking into account power and price), comparisons with published techniques, and comparisons with published performance, and we conclude with some comments on the prospects for future progress with FPGA acceleration of the weather forecast model. The realization of practical exascale-class high-performance computing systems requires significant improvements in the energy efficiency of such systems and their components. This has generated interest in computer architectures which utilize accelerators alongside traditional CPUs. FPGAs offer huge potential as an accelerator which can deliver performance for scientific applications at high levels of energy efficiency. The EuroExa project is developing and building a high-performance architecture based upon ARM CPUs with FPGA acceleration targeting exascale-class performance within a realistic power budget.

CENNA: Cost-Effective Neural Network Accelerator

Electronics ◽

10.3390/electronics9010134 ◽

2020 ◽

Vol 9 (1) ◽

pp. 134

Author(s):

Sang-Soo Park ◽

Ki-Seok Chung

Keyword(s):

Neural Network ◽

High Performance ◽

Data Exchange ◽

Matrix Multiplication ◽

Cost Effective ◽

Classification Performance ◽

Memory Access ◽

Control Logic ◽

Silicon Area ◽

Data Movement

Convolutional neural networks (CNNs) are widely adopted in various applications. State-of-the-art CNN models deliver excellent classification performance, but they require a large amount of computation and data exchange because they typically employ many processing layers. Among these processing layers, convolution layers, which carry out many multiplications and additions, account for a major portion of computation and memory access. Therefore, reducing the amount of computation and memory access is the key for high-performance CNNs. In this study, we propose a cost-effective neural network accelerator, named CENNA, whose hardware cost is reduced by employing a cost-centric matrix multiplication that employs both Strassen’s multiplication and a naïve multiplication. Furthermore, the convolution method using the proposed matrix multiplication can minimize data movement by reusing both the feature map and the convolution kernel without any additional control logic. In terms of throughput, power consumption, and silicon area, the efficiency of CENNA is up to 88 times higher than that of conventional designs for the CNN inference.
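
For reference, the difference between the two multiplication schemes named in the abstract can be shown on a 2 x 2 product: the naive method uses 8 multiplications, while Strassen's method uses 7 at the price of extra additions. The sketch below shows only these textbook formulas; how CENNA combines the two in hardware is not described in the abstract and is not modeled here.

```python
def naive_2x2(a, b):
    """Multiply two 2 x 2 matrices with the standard 8 multiplications."""
    return [
        [a[0][0] * b[0][0] + a[0][1] * b[1][0], a[0][0] * b[0][1] + a[0][1] * b[1][1]],
        [a[1][0] * b[0][0] + a[1][1] * b[1][0], a[1][0] * b[0][1] + a[1][1] * b[1][1]],
    ]


def strassen_2x2(a, b):
    """Multiply two 2 x 2 matrices with Strassen's 7 multiplications."""
    m1 = (a[0][0] + a[1][1]) * (b[0][0] + b[1][1])
    m2 = (a[1][0] + a[1][1]) * b[0][0]
    m3 = a[0][0] * (b[0][1] - b[1][1])
    m4 = a[1][1] * (b[1][0] - b[0][0])
    m5 = (a[0][0] + a[0][1]) * b[1][1]
    m6 = (a[1][0] - a[0][0]) * (b[0][0] + b[0][1])
    m7 = (a[0][1] - a[1][1]) * (b[1][0] + b[1][1])
    return [
        [m1 + m4 - m5 + m7, m3 + m5],
        [m2 + m4, m1 - m2 + m3 + m6],
    ]


if __name__ == "__main__":
    a = [[1, 2], [3, 4]]
    b = [[5, 6], [7, 8]]
    print(naive_2x2(a, b))
    print(strassen_2x2(a, b))
```

Saving one multiplier per 2 x 2 block is the kind of trade between multiplier count and adder count that a cost-centric design can exploit.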

FPGA Based High Performance Double-Precision Matrix Multiplication

International Journal of Parallel Programming ◽

10.1007/s10766-010-0131-8 ◽

2010 ◽

Vol 38 (3-4) ◽

pp. 322-338 ◽

Cited By ~ 18

Author(s):

Vinay B. Y. Kumar ◽

Siddharth Joshi ◽

Sachin B. Patkar ◽

H. Narayanan

Keyword(s):

High Performance ◽

Matrix Multiplication ◽

Double Precision ◽

Precision Matrix

High-Performance and Memory-Saving Sparse General Matrix-Matrix Multiplication for NVIDIA Pascal GPU

2017 46th International Conference on Parallel Processing (ICPP) ◽

10.1109/icpp.2017.19 ◽

2017 ◽

Cited By ~ 13

Author(s):

Yusuke Nagasaka ◽

Akira Nukada ◽

Satoshi Matsuoka

Keyword(s):

High Performance ◽

Matrix Multiplication ◽

General Matrix

Performance engineering for real and complex tall & skinny matrix multiplication kernels on GPUs

The International Journal of High Performance Computing Applications ◽

10.1177/1094342020965661 ◽

2020 ◽

Vol 35 (1) ◽

pp. 5-19

Author(s):

Dominik Ernst ◽

Georg Hager ◽

Jonas Thies ◽

Gerhard Wellein

Keyword(s):

Code Generation ◽

Large Range ◽

State Of The Art ◽

Matrix Multiplication ◽

Double Precision ◽

General Matrix ◽

Performance Engineering ◽

Key Characteristics ◽

Roofline Model ◽

Mapping Scheme

General matrix-matrix multiplications with double-precision real and complex entries (DGEMM and ZGEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often perform poorly for tall & skinny matrices, which are much taller than they are wide. In this case, NVIDIA’s current CUBLAS implementation delivers only a fraction of the potential performance indicated by the roofline model. We describe the challenges and key characteristics of an implementation that can achieve close to optimal performance. We further evaluate different strategies of parallelization and thread distribution and devise a flexible, configurable mapping scheme. To ensure flexibility and allow for highly tailored implementations, we use code generation combined with autotuning. For a large range of matrix sizes in the domain of interest we achieve at least 2/3 of the roofline performance and often substantially outperform state-of-the-art CUBLAS results on an NVIDIA Volta GPGPU.
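
The roofline bound referred to in the abstract can be computed with a short sketch. It assumes a real double-precision product C = AᵀB of a K x M and a K x N matrix with K much larger than M and N, that every input element is read from memory exactly once, and that the small result stays on chip. The function names and the peak and bandwidth figures are hypothetical placeholders, not measurements from the paper.

```python
def tsmm_intensity(m, n, bytes_per_element=8):
    """Arithmetic intensity in flop/byte of a tall & skinny product.

    Per row of the tall inputs, 2*m*n flops are performed (one multiply and
    one add per result entry) while m + n elements are loaded.
    """
    return (2 * m * n) / (bytes_per_element * (m + n))


def roofline(intensity, peak_gflops, bandwidth_gbs):
    """Attainable performance in Gflop/s: min(compute peak, intensity * bandwidth)."""
    return min(peak_gflops, intensity * bandwidth_gbs)


if __name__ == "__main__":
    peak, bandwidth = 7000.0, 800.0  # hypothetical GPU: Gflop/s and GB/s
    for width in (1, 8, 64, 256):
        ai = tsmm_intensity(width, width)
        print(width, ai, roofline(ai, peak, bandwidth))
```

Under these assumptions, narrow matrices have a low arithmetic intensity and are limited by memory bandwidth, while wider ones approach the compute peak, which is why the roofline limit depends so strongly on the matrix width.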

Statistical and machine learning models for optimizing energy in parallel applications

The International Journal of High Performance Computing Applications ◽

10.1177/1094342019842915 ◽

2019 ◽

Vol 33 (6) ◽

pp. 1079-1097 ◽

Cited By ~ 2

Author(s):

Mark Endrei ◽

Chao Jin ◽

Minh Ngoc Dinh ◽

David Abramson ◽

Heidi Poxon ◽

...

Keyword(s):

Machine Learning ◽

Energy Efficiency ◽

High Performance ◽

Large Scale ◽

Energy Use ◽

Parallel Applications ◽

Learning Models ◽

Trade Off ◽

Time Required ◽

Machine Learning Models

Rising power costs and constraints are driving a growing focus on the energy efficiency of high performance computing systems. The unique characteristics of a particular system and workload and their effect on performance and energy efficiency are typically difficult for application users to assess and to control. Settings for optimum performance and energy efficiency can also diverge, so we need to identify trade-off options that guide a suitable balance between energy use and performance. We present statistical and machine learning models that only require a small number of runs to make accurate Pareto-optimal trade-off predictions using parameters that users can control. We study model training and validation using several parallel kernels and more complex workloads, including Algebraic Multigrid (AMG), Large-scale Atomic Molecular Massively Parallel Simulator, and Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics. We demonstrate that we can train the models using as few as 12 runs, with prediction error of less than 10%. Our AMG results identify trade-off options that provide up to 45% improvement in energy efficiency for around 10% performance loss. We reduce the sample measurement time required for AMG by 90%, from 13 h to 74 min.
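
The notion of Pareto-optimal trade-offs between run time and energy can be made concrete with a small filter. This is a generic sketch, not the statistical or machine learning models from the paper; the function name and the sample (time, energy) pairs are hypothetical.

```python
def pareto_front(points):
    """Return the (time, energy) points that no other point dominates.

    A point q dominates p when q is no worse than p in both time and energy
    and differs from p; lower is better for both quantities.
    """
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1] for q in points
        )
        if not dominated:
            front.append(p)
    return front


if __name__ == "__main__":
    # Hypothetical (run time in s, energy in kJ) results for five settings.
    runs = [(10, 100), (12, 80), (11, 120), (15, 79), (12, 90)]
    print(pareto_front(runs))
```

Each point on the front is a setting where energy cannot be lowered without increasing run time, so the front is the set of options a user would choose between.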
