scholarly journals NUMA-Aware DGEMM Based on 64-Bit ARMv8 Multicore Processors Architecture

Electronics ◽  
2021 ◽  
Vol 10 (16) ◽  
pp. 1984
Author(s):  
Wei Zhang ◽  
Zihao Jiang ◽  
Zhiguang Chen ◽  
Nong Xiao ◽  
Yang Ou

Double-precision general matrix multiplication (DGEMM) is an essential kernel for measuring the potential performance of an HPC platform. ARMv8-based system-on-chips (SoCs) have become the candidates for the next-generation HPC systems with their highly competitive performance and energy efficiency. Therefore, it is meaningful to design high-performance DGEMM for ARMv8-based SoCs. However, as ARMv8-based SoCs integrate increasing cores, modern CPU uses non-uniform memory access (NUMA). NUMA restricts the performance and scalability of DGEMM when many threads access remote NUMA domains. This poses a challenge to develop high-performance DGEMM on multi-NUMA architecture. We present a NUMA-aware method to reduce the number of cross-die and cross-chip memory access events. The critical enabler for NUMA-aware DGEMM is to leverage two levels of parallelism between and within nodes in a purely threaded implementation, which allows the task independence and data localization of NUMA nodes. We have implemented NUMA-aware DGEMM in the OpenBLAS and evaluated it on a dual-socket server with 48-core processors based on the Kunpeng920 architecture. The results show that NUMA-aware DGEMM has effectively reduced the number of cross-die and cross-chip memory access, resulting in enhancing the scalability of DGEMM significantly and increasing the performance of DGEMM by 17.1% on average, with the most remarkable improvement being 21.9%.

2021 ◽  
Vol 47 (2) ◽  
pp. 1-26
Author(s):  
Field G. Van Zee ◽  
Devangi N. Parikh ◽  
Robert A. Van De Geijn

We approach the problem of implementing mixed-datatype support within the general matrix multiplication ( gemm ) operation of the BLAS-like Library Instantiation Software framework, whereby each matrix operand A , B , and C may be stored as single- or double-precision real or complex values. Another factor of complexity, whereby the matrix product and accumulation are allowed to take place in a precision different from the storage precisions of either A or B , is also discussed. We first break the problem into orthogonal dimensions, considering the mixing of domains separately from mixing precisions. Support for all combinations of matrix operands stored in either the real or complex domain is mapped out by enumerating the cases and describing an implementation approach for each. Supporting all combinations of storage and computation precisions is handled by typecasting the matrices at key stages of the computation—during packing and/or accumulation, as needed. Several optional optimizations are also documented. Performance results gathered on a 56-core Marvell ThunderX2 and a 52-core Intel Xeon Platinum demonstrate that high performance is mostly preserved, with modest slowdowns incurred from unavoidable typecast instructions. The mixed-datatype implementation confirms that combinatorial intractability is avoided, with the framework relying on only two assembly microkernels to implement 128 datatype combinations.


2019 ◽  
Vol 2019 ◽  
pp. 1-18 ◽  
Author(s):  
Mike Ashworth ◽  
Graham D. Riley ◽  
Andrew Attwood ◽  
John Mawer

In recent years, there has been renewed interest in the use of field-programmable gate arrays (FPGAs) for high-performance computing (HPC). In this paper, we explore the techniques required by traditional HPC programmers in porting HPC applications to FPGAs, using as an example the LFRic weather and climate model. We report on the first steps in porting LFRic to the FPGAs of the EuroExa architecture. We have used Vivado High-Level Syntheusywwi to implement a matrix-vector kernel from the LFRic code on a Xilinx UltraScale+ development board containing an XCZU9EG multiprocessor system-on-chip. We describe the porting of the code, discuss the optimization decisions, and report performance of 5.34 Gflop/s with double precision and 5.58 Gflop/s with single precision. We discuss sources of inefficiencies, comparisons with peak performance, comparisons with CPU and GPU performance (taking into account power and price), comparisons with published techniques, and comparisons with published performance, and we conclude with some comments on the prospects for future progress with FPGA acceleration of the weather forecast model. The realization of practical exascale-class high-performance computinems requires significant improvements in the energy efficiency of such systems and their components. This has generated interest in computer architectures which utilize accelerators alongside traditional CPUs. FPGAs offer huge potential as an accelerator which can deliver performance for scientific applications at high levels of energy efficiency. The EuroExa project is developing and building a high-performance architecture based upon ARM CPUs with FPGA acceleration targeting exascale-class performance within a realistic power budget.


Electronics ◽  
2020 ◽  
Vol 9 (1) ◽  
pp. 134
Author(s):  
Sang-Soo Park ◽  
Ki-Seok Chung

Convolutional neural networks (CNNs) are widely adopted in various applications. State-of-the-art CNN models deliver excellent classification performance, but they require a large amount of computation and data exchange because they typically employ many processing layers. Among these processing layers, convolution layers, which carry out many multiplications and additions, account for a major portion of computation and memory access. Therefore, reducing the amount of computation and memory access is the key for high-performance CNNs. In this study, we propose a cost-effective neural network accelerator, named CENNA, whose hardware cost is reduced by employing a cost-centric matrix multiplication that employs both Strassen’s multiplication and a naïve multiplication. Furthermore, the convolution method using the proposed matrix multiplication can minimize data movement by reusing both the feature map and the convolution kernel without any additional control logic. In terms of throughput, power consumption, and silicon area, the efficiency of CENNA is up to 88 times higher than that of conventional designs for the CNN inference.


2010 ◽  
Vol 38 (3-4) ◽  
pp. 322-338 ◽  
Author(s):  
Vinay B. Y. Kumar ◽  
Siddharth Joshi ◽  
Sachin B. Patkar ◽  
H. Narayanan

Author(s):  
Dominik Ernst ◽  
Georg Hager ◽  
Jonas Thies ◽  
Gerhard Wellein

General matrix-matrix multiplications with double-precision real and complex entries (DGEMM and ZGEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often show bad performance for tall & skinny matrices, which are much taller than wide. NVIDIA’s current CUBLAS implementation delivers only a fraction of the potential performance as indicated by the roofline model in this case. We describe the challenges and key characteristics of an implementation that can achieve close to optimal performance. We further evaluate different strategies of parallelization and thread distribution and devise a flexible, configurable mapping scheme. To ensure flexibility and allow for highly tailored implementations we use code generation combined with autotuning. For a large range of matrix sizes in the domain of interest we achieve at least 2/3 of the roofline performance and often substantially outperform state-of-the art CUBLAS results on an NVIDIA Volta GPGPU.


Author(s):  
Mark Endrei ◽  
Chao Jin ◽  
Minh Ngoc Dinh ◽  
David Abramson ◽  
Heidi Poxon ◽  
...  

Rising power costs and constraints are driving a growing focus on the energy efficiency of high performance computing systems. The unique characteristics of a particular system and workload and their effect on performance and energy efficiency are typically difficult for application users to assess and to control. Settings for optimum performance and energy efficiency can also diverge, so we need to identify trade-off options that guide a suitable balance between energy use and performance. We present statistical and machine learning models that only require a small number of runs to make accurate Pareto-optimal trade-off predictions using parameters that users can control. We study model training and validation using several parallel kernels and more complex workloads, including Algebraic Multigrid (AMG), Large-scale Atomic Molecular Massively Parallel Simulator, and Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics. We demonstrate that we can train the models using as few as 12 runs, with prediction error of less than 10%. Our AMG results identify trade-off options that provide up to 45% improvement in energy efficiency for around 10% performance loss. We reduce the sample measurement time required for AMG by 90%, from 13 h to 74 min.


2018 ◽  
Vol 53 (1) ◽  
pp. 407-408
Author(s):  
Junhong Liu ◽  
Xin He ◽  
Weifeng Liu ◽  
Guangming Tan

Sign in / Sign up

Export Citation Format

Share Document