Dedicated architecture for double precision matrix multiplication in supercomputing environment

Author(s): P. Russek, K. Wiatr


2021, Vol 47 (2), pp. 1-26
Author(s): Field G. Van Zee, Devangi N. Parikh, Robert A. Van De Geijn

We approach the problem of implementing mixed-datatype support within the general matrix multiplication (gemm) operation of the BLAS-like Library Instantiation Software (BLIS) framework, whereby each matrix operand A, B, and C may be stored as single- or double-precision real or complex values. Another layer of complexity, whereby the matrix product and accumulation are allowed to take place in a precision different from the storage precisions of either A or B, is also discussed. We first break the problem into orthogonal dimensions, considering the mixing of domains separately from the mixing of precisions. Support for all combinations of matrix operands stored in either the real or complex domain is mapped out by enumerating the cases and describing an implementation approach for each. Supporting all combinations of storage and computation precisions is handled by typecasting the matrices at key stages of the computation (during packing and/or accumulation), as needed. Several optional optimizations are also documented. Performance results gathered on a 56-core Marvell ThunderX2 and a 52-core Intel Xeon Platinum demonstrate that high performance is mostly preserved, with modest slowdowns incurred from unavoidable typecast instructions. The mixed-datatype implementation confirms that combinatorial intractability is avoided, with the framework relying on only two assembly microkernels to implement 128 datatype combinations.
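As a rough illustration of the typecast-during-packing idea described above, the plain-C sketch below promotes a single-precision operand to double precision while packing it into micro-panels, so that a single double-precision microkernel can serve the mixed-precision case. This is not code from the BLIS source; the panel width MR, the column-major layout, and the function name pack_block_s2d are illustrative assumptions.

/* Minimal sketch (not the BLIS implementation): promote a single-precision
 * operand to double precision while packing a block, so that a double-precision
 * microkernel can be reused for a mixed-precision gemm. */
#include <stddef.h>

#define MR 8  /* assumed register-block (micro-panel) height */

/* Pack an m x k block of a column-major float matrix A (leading dimension lda)
 * into contiguous double-precision micro-panels of MR rows each.  The typecast
 * from float to double happens here, once per element, so the compute kernel
 * itself never needs a mixed-precision variant. */
static void pack_block_s2d(size_t m, size_t k,
                           const float *A, size_t lda,
                           double *Ap)
{
    for (size_t i0 = 0; i0 < m; i0 += MR) {          /* one micro-panel at a time */
        size_t mr = (m - i0 < MR) ? (m - i0) : MR;
        for (size_t p = 0; p < k; ++p) {             /* walk across the panel */
            for (size_t i = 0; i < MR; ++i) {
                /* zero-pad the edge case so the kernel always sees MR rows */
                *Ap++ = (i < mr) ? (double)A[(i0 + i) + p * lda] : 0.0;
            }
        }
    }
}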


2010, Vol 38 (3-4), pp. 322-338
Author(s): Vinay B. Y. Kumar, Siddharth Joshi, Sachin B. Patkar, H. Narayanan

Electronics, 2021, Vol 10 (16), pp. 1984
Author(s): Wei Zhang, Zihao Jiang, Zhiguang Chen, Nong Xiao, Yang Ou

Double-precision general matrix multiplication (DGEMM) is an essential kernel for measuring the potential performance of an HPC platform. ARMv8-based systems-on-chip (SoCs) have become candidates for next-generation HPC systems thanks to their highly competitive performance and energy efficiency, so it is worthwhile to design high-performance DGEMM for ARMv8-based SoCs. However, as these SoCs integrate ever more cores, modern CPUs adopt non-uniform memory access (NUMA), which restricts the performance and scalability of DGEMM when many threads access remote NUMA domains. This poses a challenge for developing high-performance DGEMM on multi-NUMA architectures. We present a NUMA-aware method that reduces the number of cross-die and cross-chip memory accesses. The critical enabler for NUMA-aware DGEMM is to exploit two levels of parallelism, between and within nodes, in a purely threaded implementation, which allows task independence and data localization for each NUMA node. We have implemented NUMA-aware DGEMM in OpenBLAS and evaluated it on a dual-socket server with 48-core processors based on the Kunpeng 920 architecture. The results show that NUMA-aware DGEMM effectively reduces cross-die and cross-chip memory accesses, significantly enhancing the scalability of DGEMM and increasing its performance by 17.1% on average, with a peak improvement of 21.9%.
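A minimal sketch of the two-level decomposition described in the abstract is given below, in plain C with OpenMP. It is not the authors' OpenBLAS implementation: the per-node column split, the use of one outer OpenMP thread per NUMA node, and the assumption that the node-local BLAS threads stay on the calling node (e.g. via affinity settings and first-touch page placement) are all illustrative assumptions.

/* Sketch of the two-level split: one OpenMP thread per NUMA node owns a
 * contiguous slab of columns of B and C, and the second level of parallelism
 * comes from the node-local BLAS threads.  Assumes column-major storage,
 * C = A*B, and a thread-safe BLAS whose threads are confined to the node. */
#include <stddef.h>
#include <omp.h>
#include <cblas.h>

void dgemm_numa_sketch(int m, int n, int k,
                       const double *A, const double *B, double *C,
                       int num_nodes)
{
    #pragma omp parallel num_threads(num_nodes) proc_bind(spread)
    {
        int node = omp_get_thread_num();
        /* Column range [n0, n1) of B and C owned by this node. */
        int n0 = (int)((long long)n * node / num_nodes);
        int n1 = (int)((long long)n * (node + 1) / num_nodes);

        /* Reads of A are shared across nodes; accesses to B and C stay within
         * the local slab, which is what removes most cross-die and cross-chip
         * traffic when those pages were first touched locally. */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    m, n1 - n0, k,
                    1.0, A, m,
                    B + (size_t)n0 * k, k,
                    0.0, C + (size_t)n0 * m, m);
    }
}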


2013, Vol 411-414, pp. 1670-1673
Author(s): Sheng Chang, Heng Cai, Hao Wang, Jin He, Qi Jun Huang

Single precision provides only 6-7 significant decimal digits, which does not satisfy the accuracy demands of many calculations. Double precision yields 13-14 decimal digits, but at a high resource cost. In this paper, the effects of data bit-width on digital logic design are studied, and the accuracy achievable at different bit-widths is determined. Addition, multiplication, and matrix multiplication at different bit-widths are then tested on an FPGA platform. The results show that bit-width and the circuit design platform have a marked effect on resource cost and circuit efficiency. Finally, a bit-width-based circuit design optimization method is proposed.
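The precision figures quoted above can be checked on any host with a few lines of C. The snippet below is an illustrative host-side check, not one of the paper's FPGA designs; it prints the guaranteed decimal digits for float and double and shows how a naive single-precision accumulation drifts from the double-precision result.

/* Host-side check of single vs. double precision: FLT_DIG / DBL_DIG give the
 * number of decimal digits guaranteed to round-trip, and a long accumulation
 * shows how much earlier single precision loses accuracy. */
#include <stdio.h>
#include <float.h>

int main(void)
{
    printf("float:  %d guaranteed decimal digits, epsilon = %g\n", FLT_DIG, FLT_EPSILON);
    printf("double: %d guaranteed decimal digits, epsilon = %g\n", DBL_DIG, DBL_EPSILON);

    /* Accumulate the same harmonic series in both precisions. */
    float  sf = 0.0f;
    double sd = 0.0;
    for (int i = 1; i <= 1000000; ++i) {
        sf += 1.0f / (float)i;
        sd += 1.0  / (double)i;
    }
    printf("harmonic(1e6): float = %.9f, double = %.15f, diff = %g\n",
           sf, sd, (double)sf - sd);
    return 0;
}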


Author(s): A. Myasishchev, S. Lienkov, V. Dzhulii, I. Muliar

Research goals and objectives: the purpose of the article is to study the feasibility of using graphics processors for solving systems of linear equations and computing matrix multiplication, as compared with conventional multi-core processors. The particulars of using the MAGMA and CUBLAS libraries with various graphics processors are considered, and a performance comparison is made between the Tesla C2075 and GeForce GTX 480 GPUs and a six-core AMD processor. Subject of research: software is developed on top of the MAGMA and CUBLAS libraries to study the performance of the NVIDIA Tesla C2075 and GeForce GTX 480 GPUs in solving systems of linear equations and computing matrix products. Research methods used: libraries were employed to parallelize the linear algebra computations: MAGMA and CUBLAS for the GPUs, and ScaLAPACK and ATLAS for the multi-core processor. To study execution speed, parallelization methods and algorithms similar to those used in these libraries were applied, and a software module was developed for solving systems of linear equations and computing matrix products on parallel systems. Results of the research: for double-precision numbers, the GeForce GTX 480 and Tesla C2075 GPUs are approximately 3.5 and 6.3 times faster, respectively, than the AMD CPU, while for single-precision numbers the GeForce GTX 480 is 1.3 times faster than the Tesla C2075. To achieve maximum performance on an NVIDIA CUDA GPU, the MAGMA or CUBLAS libraries should be used; they accelerate the calculations by about 6.4 times compared with the traditional programming approach. When solving systems of equations on a 6-core CPU with the ScaLAPACK and ATLAS libraries, a maximum speedup of 3.24 times over a single core was achieved instead of the theoretical 6-fold speedup, so these libraries cannot make efficient use of processors with a large number of cores. It is also demonstrated that the advantage of the GPU over the CPU grows with the number of equations.
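For reference, the library-based GPU approach the article advocates amounts to a single cuBLAS call on device-resident data. The sketch below is plain C host code with error handling omitted and square n-by-n column-major matrices assumed; it is a generic illustration of the typical structure, not the authors' software module.

/* Minimal cuBLAS DGEMM host sketch: C = alpha*A*B + beta*C, column-major.
 * Matrix contents are assumed to be filled in by the caller. */
#include <cuda_runtime.h>
#include <cublas_v2.h>

void gpu_dgemm(int n, const double *hA, const double *hB, double *hC)
{
    size_t bytes = (size_t)n * n * sizeof(double);
    double *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const double alpha = 1.0, beta = 0.0;
    /* A single library call replaces a hand-written kernel; this is the kind
     * of library use behind the ~6.4x speedup quoted above. */
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);
    cublasDestroy(handle);

    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}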


Author(s): Pablo San Juan, Rafael Rodríguez-Sánchez, Francisco D. Igual, Pedro Alonso-Jordá, Enrique S. Quintana-Ortí
