Task Parallel Implementation of Matrix Multiplication on Multi-socket Multi-core Architectures

Author(s):  
Yizhuo Wang ◽  
Weixing Ji ◽  
Xu Chen ◽  
Sensen Hu
1996 ◽  
Vol 06 (01) ◽  
pp. 3-12 ◽  
Author(s):  
BRIAN GRAYSON ◽  
ROBERT VAN DE GEIJN

In this paper, we give a practical high-performance parallel implementation of Strassen’s algorithm for matrix multiplication. We show how, under restricted conditions, this algorithm can be implemented plug-compatible with standard parallel matrix multiplication algorithms. Results obtained on a large Intel Paragon system show a 10–20% reduction in execution time compared to what we believe to be the fastest standard parallel matrix multiplication implementation available at this time.
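For orientation, the serial recursion behind Strassen's algorithm can be sketched as follows. This is a minimal illustration, not the parallel implementation the abstract describes: it assumes square matrices whose size is a power of two, and the helper names (`split`, `mat_add`, `mat_sub`, `naive_mul`) and the `cutoff` parameter are hypothetical choices made for this example.

```python
def mat_add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def mat_sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def naive_mul(A, B):
    # Standard row-by-column product, used below the recursion cutoff.
    Bt = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]

def split(M, h):
    # Return the four h-by-h quadrants of M.
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])

def strassen(A, B, cutoff=2):
    """Multiply two n-by-n matrices, n a power of two (assumption)."""
    n = len(A)
    if n <= max(cutoff, 1):
        return naive_mul(A, B)
    h = n // 2
    A11, A12, A21, A22 = split(A, h)
    B11, B12, B21, B22 = split(B, h)
    # Seven recursive products instead of the eight a block product needs.
    M1 = strassen(mat_add(A11, A22), mat_add(B11, B22), cutoff)
    M2 = strassen(mat_add(A21, A22), B11, cutoff)
    M3 = strassen(A11, mat_sub(B12, B22), cutoff)
    M4 = strassen(A22, mat_sub(B21, B11), cutoff)
    M5 = strassen(mat_add(A11, A12), B22, cutoff)
    M6 = strassen(mat_sub(A21, A11), mat_add(B11, B12), cutoff)
    M7 = strassen(mat_sub(A12, A22), mat_add(B21, B22), cutoff)
    C11 = mat_add(mat_sub(mat_add(M1, M4), M5), M7)
    C12 = mat_add(M3, M5)
    C21 = mat_add(M2, M4)
    C22 = mat_add(mat_add(mat_sub(M1, M2), M3), M6)
    # Stitch the quadrants back into one matrix.
    return [l + r for l, r in zip(C11, C12)] + [l + r for l, r in zip(C21, C22)]
```

In practice the cutoff would be much larger than in this toy, since the extra additions only pay off for big blocks.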


2020 ◽  
Vol 8 ◽  
Author(s):  
Zhaolong Luo ◽  
Xinming Qin ◽  
Lingyun Wan ◽  
Wei Hu ◽  
Jinlong Yang

Linear-scaling density functional theory (DFT) is an efficient method for describing the electronic structures of molecules, semiconductors, and insulators while avoiding the high cubic-scaling cost of conventional DFT calculations. Here, we present a parallel implementation of a linear-scaling density matrix trace-correcting (TC) purification algorithm to solve the Kohn–Sham (KS) equations with numerical atomic orbitals in the HONPAS package. Such a linear-scaling density matrix purification algorithm is based on Kohn's nearsightedness principle, which results in a sparse Hamiltonian matrix when localized basis sets are used in the DFT calculations. Sparse matrix multiplication is therefore the most time-consuming step in the density matrix purification algorithm for linear-scaling DFT calculations. We propose to use the MPI_Allgather function for parallel programming to handle sparse matrix multiplication in the compressed sparse row (CSR) format, which can scale up to hundreds of processing cores on modern heterogeneous supercomputers. We demonstrate the computational accuracy and efficiency of this parallel density matrix purification algorithm by performing large-scale DFT calculations on boron nitrogen nanotubes containing tens of thousands of atoms.
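The CSR multiplication step can be illustrated with a small serial sketch. It assumes a row-block partition in which each rank multiplies its own rows of A by a full copy of B; that partition and the function names (`dense_to_csr`, `spgemm_rows`, `simulated_parallel_spgemm`) are assumptions for this example, not the HONPAS implementation, and the MPI_Allgather call is only simulated by concatenating the per-rank results.

```python
def dense_to_csr(M):
    # CSR triple: nonzero values, their column indices, and row pointers.
    data, indices, indptr = [], [], [0]
    for row in M:
        for j, v in enumerate(row):
            if v != 0:
                data.append(v)
                indices.append(j)
        indptr.append(len(data))
    return data, indices, indptr

def spgemm_rows(A, B, lo, hi):
    """Rows lo..hi-1 of A @ B, each returned as a dict {column: value}."""
    a_data, a_idx, a_ptr = A
    b_data, b_idx, b_ptr = B
    out = []
    for i in range(lo, hi):
        acc = {}
        # Row-wise (Gustavson-style) product: scale rows of B by entries of A.
        for p in range(a_ptr[i], a_ptr[i + 1]):
            k, a = a_idx[p], a_data[p]
            for q in range(b_ptr[k], b_ptr[k + 1]):
                j = b_idx[q]
                acc[j] = acc.get(j, 0) + a * b_data[q]
        out.append(acc)
    return out

def simulated_parallel_spgemm(A, B, n_ranks):
    n = len(A[2]) - 1
    # Contiguous row blocks, one per simulated rank.
    bounds = [n * r // n_ranks for r in range(n_ranks + 1)]
    pieces = [spgemm_rows(A, B, bounds[r], bounds[r + 1]) for r in range(n_ranks)]
    # Stand-in for the gather step: concatenate the blocks in rank order.
    return [row for piece in pieces for row in piece]
```

With real MPI, each rank would hold only its block of rows and the gather would exchange CSR arrays between processes rather than Python lists.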


2018 ◽  
Vol 2018 ◽  
pp. 1-10 ◽  
Author(s):  
Taehwan Park ◽  
Hwajeong Seo ◽  
Junsub Kim ◽  
Haeryong Park ◽  
Howon Kim

Recently, various types of postquantum cryptography algorithms have been proposed for the National Institute of Standards and Technology’s Postquantum Cryptography Standardization competition. Lattice-based cryptography built on the Learning with Errors problem relies on matrix multiplication, and a large matrix multiplication requires a long execution time for key generation, encryption, and decryption. In this paper, we propose an efficient parallel implementation of matrix multiplication and vector addition with matrix transpose using ARM NEON instructions on ARM Cortex-A platforms. The proposed method achieves performance enhancements of 36.93%, 6.95%, 32.92%, and 7.66%. Applied to the Lizard.CCA key generation step, the optimized method enhances performance by 7.04%, 3.66%, 7.57%, and 9.32% over previous state-of-the-art implementations.
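To show why a transposed layout helps with vector instructions, here is a scalar sketch of an LWE-style product b = Aᵀs + e (mod q) computed by scaling and adding whole rows of A. The function name, the toy parameters, and this particular formulation are assumptions for illustration; the sketch is not the NEON code from the paper and does not use Lizard's actual parameters.

```python
def lwe_rows_accumulate(A, s, e, q):
    """Return b = A^T s + e (mod q) by accumulating scaled rows of A."""
    b = [x % q for x in e]
    for row, si in zip(A, s):
        if si == 0:
            continue  # sparse or small secrets let whole rows be skipped
        # The inner loop touches one contiguous row, the access pattern
        # that SIMD lanes can process several elements at a time.
        for j, a in enumerate(row):
            b[j] = (b[j] + si * a) % q
    return b

def lwe_reference(A, s, e, q):
    # Column-by-column reference used only to check the result.
    cols = list(zip(*A))
    return [(sum(a * si for a, si in zip(col, s)) + ej) % q
            for col, ej in zip(cols, e)]
```

For example, `lwe_rows_accumulate([[1, 2, 3], [4, 5, 6]], [1, 2], [1, 0, 1], 7)` returns `[3, 5, 2]`.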


Sensors ◽  
2020 ◽  
Vol 20 (19) ◽  
pp. 5558
Author(s):  
Yunping Zhao ◽  
Jianzhuang Lu ◽  
Xiaowen Chen

Due to the high throughput and high computing capability of convolutional neural networks (CNNs), researchers are paying increasing attention to the design of CNN hardware accelerator architectures. Accordingly, in this paper, we propose a block parallel computing algorithm based on the matrix transformation computing algorithm (MTCA) to realize the convolution expansion and resolve the block problem of the intermediate matrix. It enables highly parallel implementation on hardware. We also provide a specific calculation method for the optimal partition of matrix multiplication to optimize performance. In our evaluation, the proposed method saves more than 60% of hardware storage space compared with the im2col (image to column) approach. More specifically, in the case of large-scale convolutions, it saves nearly 82% of storage space. Under the accelerator architecture framework designed in this paper, we achieve a performance of 26.7–33.4 GFLOPS (depending on convolution type) on an FPGA (Field Programmable Gate Array) by reducing bandwidth and improving data reusability. It is 1.2×–4.0× faster than memory-efficient convolution (MEC) and im2col, respectively, and represents an effective solution for a large-scale convolution accelerator.
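The im2col baseline mentioned above can be sketched in a few lines, which also makes its storage cost visible. This is a minimal single-channel example with stride 1 and no padding (assumptions chosen for brevity); it illustrates im2col only, not the MTCA-based block algorithm, and the function names are hypothetical.

```python
def im2col(image, k):
    """Unroll every k-by-k patch of a 2D image into one row."""
    h, w = len(image), len(image[0])
    return [[image[i + di][j + dj] for di in range(k) for dj in range(k)]
            for i in range(h - k + 1)
            for j in range(w - k + 1)]

def conv2d_im2col(image, kernel):
    """CNN-style convolution (cross-correlation) as a matrix-vector product."""
    k = len(kernel)
    flat = [v for row in kernel for v in row]
    patches = im2col(image, k)
    out_w = len(image[0]) - k + 1
    vals = [sum(p * f for p, f in zip(patch, flat)) for patch in patches]
    # Fold the flat result back into the output feature map.
    return [vals[r:r + out_w] for r in range(0, len(vals), out_w)]
```

For a 3×3 image and a 2×2 kernel, the unrolled matrix holds 4 patches of 4 values, that is 16 entries for 9 input pixels. This duplication of overlapping pixels is the storage overhead that memory-saving schemes aim to reduce.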

