Task Parallel Implementation of Matrix Multiplication on Multi-socket Multi-core Architectures

A Task Parallel Implementation of Fast Multipole Methods

2012 SC Companion: High Performance Computing, Networking Storage and Analysis ◽

10.1109/sc.companion.2012.86 ◽

2012 ◽

Cited By ~ 7

Author(s):

Kenjiro Taura ◽

Jun Nakashima ◽

Rio Yokota ◽

Naoya Maruyama

Keyword(s):

Parallel Implementation ◽

Fast Multipole ◽

Fast Multipole Methods ◽

Task Parallel ◽

Multipole Methods

A HIGH PERFORMANCE PARALLEL STRASSEN IMPLEMENTATION

Parallel Processing Letters ◽

10.1142/s0129626496000029 ◽

1996 ◽

Vol 06 (01) ◽

pp. 3-12 ◽

Cited By ~ 20

Author(s):

BRIAN GRAYSON ◽

ROBERT VAN DE GEIJN

Keyword(s):

Execution Time ◽

High Performance ◽

Parallel Implementation ◽

Matrix Multiplication ◽

Intel Paragon ◽

Strassen's Algorithm

In this paper, we give a practical high-performance parallel implementation of Strassen's algorithm for matrix multiplication. We show how, under restricted conditions, this algorithm can be implemented plug-compatible with standard parallel matrix multiplication algorithms. Results obtained on a large Intel Paragon system show a 10–20% reduction in execution time compared to what we believe to be the fastest standard parallel matrix multiplication implementation available at this time.
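For orientation, the recursion behind Strassen's algorithm can be sketched in a few lines. This is a serial, pure-Python illustration of the textbook seven-product scheme, not the parallel Intel Paragon implementation the abstract describes; the function names and the `cutoff` parameter are assumptions made for this sketch, and it falls back to ordinary multiplication for odd or small sizes.

```python
def add(A, B):
    return [[a + b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def naive_mul(A, B):
    # Standard O(n^3) product, used as the base case.
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def split(M):
    # Quadrants of an even-sized square matrix: M11, M12, M21, M22.
    h = len(M) // 2
    return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
            [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])

def strassen(A, B, cutoff=2):
    # Square matrices only; odd or small sizes use the standard product.
    n = len(A)
    if n <= cutoff or n % 2:
        return naive_mul(A, B)
    a11, a12, a21, a22 = split(A)
    b11, b12, b21, b22 = split(B)
    # Seven recursive products instead of eight.
    m1 = strassen(add(a11, a22), add(b11, b22), cutoff)
    m2 = strassen(add(a21, a22), b11, cutoff)
    m3 = strassen(a11, sub(b12, b22), cutoff)
    m4 = strassen(a22, sub(b21, b11), cutoff)
    m5 = strassen(add(a11, a12), b22, cutoff)
    m6 = strassen(sub(a21, a11), add(b11, b12), cutoff)
    m7 = strassen(sub(a12, a22), add(b21, b22), cutoff)
    c11 = add(sub(add(m1, m4), m5), m7)
    c12 = add(m3, m5)
    c21 = add(m2, m4)
    c22 = add(add(sub(m1, m2), m3), m6)
    # Stitch the four quadrants back together.
    top = [left + right for left, right in zip(c11, c12)]
    bottom = [left + right for left, right in zip(c21, c22)]
    return top + bottom
```

Because `strassen` takes and returns the same kind of operands as `naive_mul`, it can be swapped in for the standard product, which is the sense of "plug-compatible" at this toy scale.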

Task Parallel Implementation of Belief Propagation in Factor Graphs

2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum ◽

10.1109/ipdpsw.2012.238 ◽

2012 ◽

Author(s):

Nam Ma ◽

Yinglong Xia ◽

Viktor K. Prasanna

Keyword(s):

Belief Propagation ◽

Parallel Implementation ◽

Factor Graphs ◽

Task Parallel

Parallel Implementation of Large-Scale Linear Scaling Density Functional Theory Calculations With Numerical Atomic Orbitals in HONPAS

Frontiers in Chemistry ◽

10.3389/fchem.2020.589910 ◽

2020 ◽

Vol 8 ◽

Author(s):

Zhaolong Luo ◽

Xinming Qin ◽

Lingyun Wan ◽

Wei Hu ◽

Jinlong Yang

Keyword(s):

Density Matrix ◽

DFT Calculations ◽

Density Functional ◽

Large Scale ◽

Parallel Implementation ◽

Sparse Matrix ◽

Matrix Multiplication ◽

Linear Scaling ◽

Functional Theory ◽

Atomic Orbitals

Linear-scaling density functional theory (DFT) is an efficient method for describing the electronic structures of molecules, semiconductors, and insulators that avoids the high cubic-scaling cost of conventional DFT calculations. Here, we present a parallel implementation of the linear-scaling density matrix trace-correcting (TC) purification algorithm to solve the Kohn–Sham (KS) equations with numerical atomic orbitals in the HONPAS package. This linear-scaling density matrix purification algorithm is based on Kohn's nearsightedness principle, which results in a sparse Hamiltonian matrix when localized basis sets are used in the DFT calculations. Sparse matrix multiplication is therefore the most time-consuming step in the density matrix purification algorithm for linear-scaling DFT calculations. We propose using the MPI_Allgather function to parallelize the sparse matrix multiplication in the compressed sparse row (CSR) format, which can scale up to hundreds of processing cores on modern heterogeneous supercomputers. We demonstrate the computational accuracy and efficiency of this parallel density matrix purification algorithm by performing large-scale DFT calculations on boron nitrogen nanotubes containing tens of thousands of atoms.
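The communication pattern named in the abstract can be illustrated without MPI. The sketch below simulates it in a single process, under the assumption that each rank owns a contiguous block of rows of both operands: an all-gather step gives every rank the complete right-hand matrix, after which each rank multiplies its own CSR rows locally. All function names are hypothetical, the result is returned as dense rows for easy checking, and nothing here is taken from the HONPAS source.

```python
def dense_to_csr(M):
    # CSR triple: row pointers, column indices, nonzero values.
    indptr, indices, data = [0], [], []
    for row in M:
        for j, v in enumerate(row):
            if v != 0:
                indices.append(j)
                data.append(v)
        indptr.append(len(indices))
    return indptr, indices, data

def csr_rows_times_csr(A, B, n_cols):
    # Multiply a CSR row block A by a CSR matrix B; return dense rows.
    a_ptr, a_idx, a_val = A
    b_ptr, b_idx, b_val = B
    out = []
    for i in range(len(a_ptr) - 1):
        acc = {}  # sparse accumulator for output row i
        for p in range(a_ptr[i], a_ptr[i + 1]):
            k, a = a_idx[p], a_val[p]
            for q in range(b_ptr[k], b_ptr[k + 1]):
                acc[b_idx[q]] = acc.get(b_idx[q], 0) + a * b_val[q]
        out.append([acc.get(j, 0) for j in range(n_cols)])
    return out

def simulated_allgather_spmm(A_dense, B_dense, n_ranks):
    n = len(A_dense)
    bounds = [n * r // n_ranks for r in range(n_ranks + 1)]
    # Each simulated rank starts with only its own row block of B.
    local_B = [B_dense[bounds[r]:bounds[r + 1]] for r in range(n_ranks)]
    # Stand-in for MPI_Allgather: every rank ends up with all blocks of B.
    gathered_B = [row for block in local_B for row in block]
    B_csr = dense_to_csr(gathered_B)
    result = []
    for r in range(n_ranks):
        # Local work: this rank's rows of A times the full gathered B.
        A_local = dense_to_csr(A_dense[bounds[r]:bounds[r + 1]])
        result.extend(csr_rows_times_csr(A_local, B_csr, len(B_dense[0])))
    return result
```

In a real MPI program the gather would exchange the CSR arrays themselves and each rank would keep only its own rows of the product; the loop over `r` stands in for processes that would run concurrently.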

Task-parallel implementation of 3D shortest path raytracing for geophysical applications

Computers & Geosciences ◽

10.1016/j.cageo.2012.12.005 ◽

2013 ◽

Vol 54 ◽

pp. 130-141 ◽

Cited By ~ 7

Author(s):

Bernard Giroux ◽

Benoît Larouche

Keyword(s):

Shortest Path ◽

Parallel Implementation ◽

Task Parallel

Efficient Parallel Implementation of Matrix Multiplication for Lattice-Based Cryptography on Modern ARM Processor

Security and Communication Networks ◽

10.1155/2018/7012056 ◽

2018 ◽

Vol 2018 ◽

pp. 1-10 ◽

Cited By ~ 1

Author(s):

Taehwan Park ◽

Hwajeong Seo ◽

Junsub Kim ◽

Haeryong Park ◽

Howon Kim

Keyword(s):

Parallel Implementation ◽

Matrix Multiplication ◽

Key Generation ◽

ARM Processor ◽

Large Size ◽

Vector Addition ◽

Encryption And Decryption ◽

Previous State ◽

Learning With Errors ◽

Lattice Based Cryptography

Recently, various types of postquantum cryptography algorithms have been proposed for the National Institute of Standards and Technology's Postquantum Cryptography Standardization competition. Lattice-based cryptography built on Learning with Errors relies on matrix multiplication, and a large-size matrix multiplication requires a long execution time for key generation, encryption, and decryption. In this paper, we propose an efficient parallel implementation of matrix multiplication and vector addition with matrix transpose using ARM NEON instructions on ARM Cortex-A platforms. The proposed method achieves performance enhancements of 36.93%, 6.95%, 32.92%, and 7.66%. Applied to Lizard.CCA, the optimized method enhances the performance of the key generation step by 7.04%, 3.66%, 7.57%, and 9.32% over previous state-of-the-art implementations.
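The role of the transpose can be shown with a scalar sketch. Transposing the right-hand matrix first means each output entry is a dot product of two rows, so both operands are read contiguously, which is the access pattern SIMD loads favor. The code below only illustrates that idea in pure Python; it uses no NEON instructions, and the function names, the error matrix `E`, and the modulus `q` are assumptions for the example rather than parameters from the paper.

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def matmul_add_transposed(A, B, E, q):
    # Computes (A @ B + E) mod q, reading B through its transpose.
    Bt = transpose(B)  # row j of Bt is column j of B
    return [
        [(sum(a * b for a, b in zip(row, col)) + e) % q
         for col, e in zip(Bt, e_row)]
        for row, e_row in zip(A, E)
    ]
```

For example, `matmul_add_transposed([[1, 2], [3, 4]], [[5, 6], [7, 8]], [[1, 1], [1, 1]], 17)` returns `[[3, 6], [10, 0]]`.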

An Accelerator Design Using a MTCA Decomposition Algorithm for CNNs

Sensors ◽

10.3390/s20195558 ◽

2020 ◽

Vol 20 (19) ◽

pp. 5558

Author(s):

Yunping Zhao ◽

Jianzhuang Lu ◽

Xiaowen Chen

Keyword(s):

Large Scale ◽

Parallel Implementation ◽

Matrix Multiplication ◽

Storage Space ◽

Specific Calculation ◽

Computing Algorithm ◽

Architecture Framework ◽

The Matrix ◽

Field Programmable ◽

Accelerator Design

Due to the high throughput and high computing capability of convolutional neural networks (CNNs), researchers are paying increasing attention to the design of CNN hardware accelerator architectures. Accordingly, in this paper, we propose a block parallel computing algorithm based on the matrix transformation computing algorithm (MTCA) to realize the convolution expansion and resolve the block problem of the intermediate matrix. It enables a highly parallel implementation on hardware. Moreover, we also provide a specific calculation method for the optimal partition of matrix multiplication to optimize performance. In our evaluation, our proposed method saves more than 60% of hardware storage space compared with the im2col (image to column) approach. More specifically, in the case of large-scale convolutions, it saves nearly 82% of storage space. Under the accelerator architecture framework designed in this paper, we realize a performance of 26.7–33.4 GFLOPS (depending on convolution type) on an FPGA (Field Programmable Gate Array) by reducing bandwidth and improving data reusability. It is 1.2×–4.0× faster than memory-efficient convolution (MEC) and im2col, respectively, and represents an effective solution for a large-scale convolution accelerator.
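The im2col baseline that the abstract compares against can be sketched as follows. This is a minimal single-channel, stride-1, no-padding illustration of the im2col approach, not the MTCA method itself; the function names are assumptions for this example. As is usual in CNNs, the kernel is applied without flipping (cross-correlation).

```python
def im2col(image, k):
    # One row per output pixel; each row is a flattened k-by-k patch.
    h, w = len(image), len(image[0])
    return [
        [image[i + di][j + dj] for di in range(k) for dj in range(k)]
        for i in range(h - k + 1)
        for j in range(w - k + 1)
    ]

def conv2d_via_im2col(image, kernel):
    k = len(kernel)
    flat = [v for row in kernel for v in row]
    patches = im2col(image, k)
    out_w = len(image[0]) - k + 1
    # The convolution reduces to a matrix-vector product over the patches.
    vals = [sum(p * f for p, f in zip(patch, flat)) for patch in patches]
    return [vals[r:r + out_w] for r in range(0, len(vals), out_w)]
```

Even on a 3×3 image with a 2×2 kernel, the patch matrix holds 16 values against 9 in the original image, because overlapping patches are stored repeatedly. That duplication is the storage overhead the abstract reports reducing.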

A parallel implementation of Strassen’s matrix multiplication algorithm for wormhole-routed all-port 2D torus networks

The Journal of Supercomputing ◽

10.1007/s11227-011-0730-1 ◽

2011 ◽

Vol 62 (1) ◽

pp. 486-509 ◽

Cited By ~ 5

Author(s):

Cesur Baransel ◽

Kayhan M. İmre

Keyword(s):

Parallel Implementation ◽

Matrix Multiplication ◽

Matrix Multiplication Algorithm ◽

Multiplication Algorithm ◽

Torus Networks ◽

Wormhole Routed ◽

2D Torus

Parallel implementation of Strassen's matrix multiplication algorithm for heterogeneous clusters

18th International Parallel and Distributed Processing Symposium, 2004. Proceedings. ◽

10.1109/ipdps.2004.1303066 ◽

2004 ◽

Cited By ~ 10

Author(s):

Y. Ohtaki ◽

D. Takahashi ◽

T. Boku ◽

M. Sato

Keyword(s):

Parallel Implementation ◽

Matrix Multiplication ◽

Heterogeneous Clusters ◽

Matrix Multiplication Algorithm ◽

Multiplication Algorithm

OpenMP-based parallel implementation of matrix-matrix multiplication on the Intel Knights Landing

Proceedings of Workshops of HPC Asia on - HPC Asia '18 ◽

10.1145/3176364.3176374 ◽

2018 ◽

Cited By ~ 4

Author(s):

Roktaek Lim ◽

Yeongha Lee ◽

Raehyun Kim ◽

Jaeyoung Choi

Keyword(s):

Parallel Implementation ◽

Matrix Multiplication ◽

Knights Landing
