Dedicated architecture for double precision matrix multiplication in supercomputing environment

Author(s): P. Russek, K. Wiatr


2021, Vol 47 (2), pp. 1-26
Author(s): Field G. Van Zee, Devangi N. Parikh, Robert A. Van De Geijn

We approach the problem of implementing mixed-datatype support within the general matrix multiplication (gemm) operation of the BLAS-like Library Instantiation Software (BLIS) framework, whereby each matrix operand A, B, and C may be stored as single- or double-precision real or complex values. Another layer of complexity, whereby the matrix product and accumulation are allowed to take place in a precision different from the storage precisions of either A or B, is also discussed. We first break the problem into orthogonal dimensions, considering the mixing of domains separately from the mixing of precisions. Support for all combinations of matrix operands stored in either the real or complex domain is mapped out by enumerating the cases and describing an implementation approach for each. Supporting all combinations of storage and computation precisions is handled by typecasting the matrices at key stages of the computation (during packing and/or accumulation), as needed. Several optional optimizations are also documented. Performance results gathered on a 56-core Marvell ThunderX2 and a 52-core Intel Xeon Platinum demonstrate that high performance is mostly preserved, with modest slowdowns incurred from unavoidable typecast instructions. The mixed-datatype implementation confirms that combinatorial intractability is avoided, with the framework relying on only two assembly microkernels to implement 128 datatype combinations.
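As a rough illustration of the typecast-during-packing idea described above, the plain-C sketch below promotes a single-precision operand to double precision while packing it into micro-panels, so that a single double-precision microkernel can serve the mixed-precision case. This is not code from the BLIS source; the panel width MR, the column-major layout, and the function name pack_block_s2d are illustrative assumptions.

/* Minimal sketch (not the BLIS implementation): promote a single-precision
 * operand to double precision while packing a block, so that a double-precision
 * microkernel can be reused for a mixed-precision gemm. */
#include <stddef.h>

#define MR 8  /* assumed register-block (micro-panel) height */

/* Pack an m x k block of a column-major float matrix A (leading dimension lda)
 * into contiguous double-precision micro-panels of MR rows each.  The typecast
 * from float to double happens here, once per element, so the compute kernel
 * itself never needs a mixed-precision variant. */
static void pack_block_s2d(size_t m, size_t k,
                           const float *A, size_t lda,
                           double *Ap)
{
    for (size_t i0 = 0; i0 < m; i0 += MR) {          /* one micro-panel at a time */
        size_t mr = (m - i0 < MR) ? (m - i0) : MR;
        for (size_t p = 0; p < k; ++p) {             /* walk across the panel */
            for (size_t i = 0; i < MR; ++i) {
                /* zero-pad the edge case so the kernel always sees MR rows */
                *Ap++ = (i < mr) ? (double)A[(i0 + i) + p * lda] : 0.0;
            }
        }
    }
}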


2010, Vol 38 (3-4), pp. 322-338
Author(s): Vinay B. Y. Kumar, Siddharth Joshi, Sachin B. Patkar, H. Narayanan

Electronics, 2021, Vol 10 (16), pp. 1984
Author(s): Wei Zhang, Zihao Jiang, Zhiguang Chen, Nong Xiao, Yang Ou

Double-precision general matrix multiplication (DGEMM) is an essential kernel for measuring the potential performance of an HPC platform. ARMv8-based systems-on-chip (SoCs) have become candidates for next-generation HPC systems thanks to their highly competitive performance and energy efficiency, so it is worthwhile to design high-performance DGEMM for ARMv8-based SoCs. However, as these SoCs integrate ever more cores, modern CPUs adopt non-uniform memory access (NUMA), which restricts the performance and scalability of DGEMM when many threads access remote NUMA domains. This poses a challenge for developing high-performance DGEMM on multi-NUMA architectures. We present a NUMA-aware method that reduces the number of cross-die and cross-chip memory accesses. The critical enabler for NUMA-aware DGEMM is to exploit two levels of parallelism, between and within nodes, in a purely threaded implementation, which allows task independence and data localization for each NUMA node. We have implemented NUMA-aware DGEMM in OpenBLAS and evaluated it on a dual-socket server with 48-core processors based on the Kunpeng 920 architecture. The results show that NUMA-aware DGEMM effectively reduces cross-die and cross-chip memory accesses, significantly enhancing the scalability of DGEMM and increasing its performance by 17.1% on average, with a peak improvement of 21.9%.
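A minimal sketch of the two-level decomposition described in the abstract is given below, in plain C with OpenMP. It is not the authors' OpenBLAS implementation: the per-node column split, the use of one outer OpenMP thread per NUMA node, and the assumption that the node-local BLAS threads stay on the calling node (e.g. via affinity settings and first-touch page placement) are all illustrative assumptions.

/* Sketch of the two-level split: one OpenMP thread per NUMA node owns a
 * contiguous slab of columns of B and C, and the second level of parallelism
 * comes from the node-local BLAS threads.  Assumes column-major storage,
 * C = A*B, and a thread-safe BLAS whose threads are confined to the node. */
#include <stddef.h>
#include <omp.h>
#include <cblas.h>

void dgemm_numa_sketch(int m, int n, int k,
                       const double *A, const double *B, double *C,
                       int num_nodes)
{
    #pragma omp parallel num_threads(num_nodes) proc_bind(spread)
    {
        int node = omp_get_thread_num();
        /* Column range [n0, n1) of B and C owned by this node. */
        int n0 = (int)((long long)n * node / num_nodes);
        int n1 = (int)((long long)n * (node + 1) / num_nodes);

        /* Reads of A are shared across nodes; accesses to B and C stay within
         * the local slab, which is what removes most cross-die and cross-chip
         * traffic when those pages were first touched locally. */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, n1 - n0, k,
                    1.0, A, m,
                    B + (size_t)n0 * k, k,
                    0.0, C + (size_t)n0 * m, m);
    }
}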


2013, Vol 411-414, pp. 1670-1673
Author(s): Sheng Chang, Heng Cai, Hao Wang, Jin He, Qi Jun Huang

Single precision provides only 6-7 significant decimal digits, which does not satisfy the accuracy demands of many calculations. Double precision yields 13-14 decimal digits, but at a high resource cost. In this paper, the effects of data bit-width on digital logic design are studied, and the accuracy achievable at different bit-widths is determined. Addition, multiplication, and matrix multiplication at different bit-widths are then tested on an FPGA platform. The results show that bit-width and the circuit design platform have a marked effect on resource cost and circuit efficiency. Finally, a bit-width-based circuit design optimization method is proposed.
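The precision figures quoted above can be checked on any host with a few lines of C. The snippet below is an illustrative host-side check, not one of the paper's FPGA designs; it prints the guaranteed decimal digits for float and double and shows how a naive single-precision accumulation drifts from the double-precision result.

/* Host-side check of single vs. double precision: FLT_DIG / DBL_DIG give the
 * number of decimal digits guaranteed to round-trip, and a long accumulation
 * shows how much earlier single precision loses accuracy. */
#include <stdio.h>
#include <float.h>

int main(void)
{
    printf("float:  %d guaranteed decimal digits, epsilon = %g\n", FLT_DIG, FLT_EPSILON);
    printf("double: %d guaranteed decimal digits, epsilon = %g\n", DBL_DIG, DBL_EPSILON);

    /* Accumulate the same harmonic series in both precisions. */
    float  sf = 0.0f;
    double sd = 0.0;
    for (int i = 1; i <= 1000000; ++i) {
        sf += 1.0f / (float)i;
        sd += 1.0  / (double)i;
    }
    printf("harmonic(1e6): float = %.9f, double = %.15f, diff = %g\n",
           sf, sd, (double)sf - sd);
    return 0;
}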


Author(s): A. Myasishchev, S. Lienkov, V. Dzhulii, I. Muliar

Research goals and objectives: the purpose of the article is to study the feasibility of using graphics processors for solving systems of linear equations and computing matrix multiplication, as compared with conventional multi-core processors. The particulars of using the MAGMA and CUBLAS libraries with various graphics processors are considered, and a performance comparison is made between the Tesla C2075 and GeForce GTX 480 GPUs and a six-core AMD processor. Subject of research: software is developed on top of the MAGMA and CUBLAS libraries to study the performance of the NVIDIA Tesla C2075 and GeForce GTX 480 GPUs in solving systems of linear equations and computing matrix products. Research methods used: libraries were employed to parallelize the linear algebra computations: MAGMA and CUBLAS for the GPUs, and ScaLAPACK and ATLAS for the multi-core processor. To study execution speed, parallelization methods and algorithms similar to those used in these libraries were applied, and a software module was developed for solving systems of linear equations and computing matrix products on parallel systems. Results of the research: for double-precision numbers, the GeForce GTX 480 and Tesla C2075 GPUs are approximately 3.5 and 6.3 times faster, respectively, than the AMD CPU, while for single-precision numbers the GeForce GTX 480 is 1.3 times faster than the Tesla C2075. To achieve maximum performance on an NVIDIA CUDA GPU, the MAGMA or CUBLAS libraries should be used; they accelerate the calculations by about 6.4 times compared with the traditional programming approach. When solving systems of equations on a 6-core CPU with the ScaLAPACK and ATLAS libraries, a maximum speedup of 3.24 times over a single core was achieved instead of the theoretical 6-fold speedup, so these libraries cannot make efficient use of processors with a large number of cores. It is also demonstrated that the advantage of the GPU over the CPU grows with the number of equations.
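For reference, the library-based GPU approach the article advocates amounts to a single cuBLAS call on device-resident data. The sketch below is plain C host code with error handling omitted and square n-by-n column-major matrices assumed; it is a generic illustration of the typical structure, not the authors' software module.

/* Minimal cuBLAS DGEMM host sketch: C = alpha*A*B + beta*C, column-major.
 * Matrix contents are assumed to be filled in by the caller. */
#include <cuda_runtime.h>
#include <cublas_v2.h>

void gpu_dgemm(int n, const double *hA, const double *hB, double *hC)
{
    size_t bytes = (size_t)n * n * sizeof(double);
    double *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const double alpha = 1.0, beta = 0.0;
    /* A single library call replaces a hand-written kernel; this is the kind
     * of library use behind the ~6.4x speedup quoted above. */
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);
    cublasDestroy(handle);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}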


Author(s): Pablo San Juan, Rafael Rodríguez-Sánchez, Francisco D. Igual, Pedro Alonso-Jordá, Enrique S. Quintana-Ortí
