A high-performance batched matrix multiplication framework for GPUs under unbalanced input distribution

TSM2X: High-performance tall-and-skinny matrix–matrix multiplication on GPUs

Journal of Parallel and Distributed Computing ◽

10.1016/j.jpdc.2021.02.013 ◽

2021 ◽

Vol 151 ◽

pp. 70-85

Author(s):

Cody Rivera ◽

Jieyang Chen ◽

Nan Xiong ◽

Jing Zhang ◽

Shuaiwen Leon Song ◽

...

Keyword(s):

High Performance ◽

Matrix Multiplication

Get full-text (via PubEx)

NUMA-Aware DGEMM Based on 64-Bit ARMv8 Multicore Processors Architecture

Electronics ◽

10.3390/electronics10161984 ◽

2021 ◽

Vol 10 (16) ◽

pp. 1984

Author(s):

Wei Zhang ◽

Zihao Jiang ◽

Zhiguang Chen ◽

Nong Xiao ◽

Yang Ou

Keyword(s):

Energy Efficiency ◽

High Performance ◽

Multicore Processors ◽

Matrix Multiplication ◽

Memory Access ◽

Double Precision ◽

Competitive Performance ◽

General Matrix ◽

Remarkable Improvement ◽

Task Independence

Double-precision general matrix multiplication (DGEMM) is an essential kernel for measuring the potential performance of an HPC platform. ARMv8-based system-on-chips (SoCs) have become the candidates for the next-generation HPC systems with their highly competitive performance and energy efficiency. Therefore, it is meaningful to design high-performance DGEMM for ARMv8-based SoCs. However, as ARMv8-based SoCs integrate increasing cores, modern CPU uses non-uniform memory access (NUMA). NUMA restricts the performance and scalability of DGEMM when many threads access remote NUMA domains. This poses a challenge to develop high-performance DGEMM on multi-NUMA architecture. We present a NUMA-aware method to reduce the number of cross-die and cross-chip memory access events. The critical enabler for NUMA-aware DGEMM is to leverage two levels of parallelism between and within nodes in a purely threaded implementation, which allows the task independence and data localization of NUMA nodes. We have implemented NUMA-aware DGEMM in the OpenBLAS and evaluated it on a dual-socket server with 48-core processors based on the Kunpeng920 architecture. The results show that NUMA-aware DGEMM has effectively reduced the number of cross-die and cross-chip memory access, resulting in enhancing the scalability of DGEMM significantly and increasing the performance of DGEMM by 17.1% on average, with the most remarkable improvement being 21.9%.

Get full-text (via PubEx)

Multi-Softcore Architecture on FPGA

International Journal of Reconfigurable Computing ◽

10.1155/2014/979327 ◽

2014 ◽

Vol 2014 ◽

pp. 1-13 ◽

Cited By ~ 4

Author(s):

Mouna Baklouti ◽

Mohamed Abid

Keyword(s):

High Performance ◽

Design Methodology ◽

Matrix Multiplication ◽

Rapid Prototype ◽

General Purpose ◽

Parallel Applications ◽

Multicore Systems ◽

Processor Core ◽

Nios Ii ◽

Wide Range

To meet the high performance demands of embedded multimedia applications, embedded systems are integrating multiple processing units. However, they are mostly based on custom-logic design methodology. Designing parallel multicore systems using available standards intellectual properties yet maintaining high performance is also a challenging issue. Softcore processors and field programmable gate arrays (FPGAs) are a cheap and fast option to develop and test such systems. This paper describes a FPGA-based design methodology to implement a rapid prototype of parametric multicore systems. A study of the viability of making the SoC using the NIOS II soft-processor core from Altera is also presented. The NIOS II features a general-purpose RISC CPU architecture designed to address a wide range of applications. The performance of the implemented architecture is discussed, and also some parallel applications are used for testing speedup and efficiency of the system. Experimental results demonstrate the performance of the proposed multicore system, which achieves better speedup than the GPU (29.5% faster for the FIR filter and 23.6% faster for the matrix-matrix multiplication).

Get full-text (via PubEx)

High performance and memory efficient implementation of matrix multiplication on FPGAs

2010 International Conference on Field-Programmable Technology ◽

10.1109/fpt.2010.5681769 ◽

2010 ◽

Cited By ~ 5

Author(s):

Guiming Wu ◽

Yong Dou ◽

Miao Wang

Keyword(s):

High Performance ◽

Matrix Multiplication ◽

Efficient Implementation ◽

Memory Efficient

Get full-text (via PubEx)

Implementing High-performance Complex Matrix Multiplication via the 3m and 4m Methods

ACM Transactions on Mathematical Software ◽

10.1145/3086466 ◽

2017 ◽

Vol 44 (1) ◽

pp. 1-36 ◽

Cited By ~ 4

Author(s):

Field G. Van Zee ◽

Tyler M. Smith

Keyword(s):

High Performance ◽

Matrix Multiplication ◽

Complex Matrix

Get full-text (via PubEx)

HIERARCHICAL MAPPING FOR HPC APPLICATIONS

Parallel Processing Letters ◽

10.1142/s0129626411000229 ◽

2011 ◽

Vol 21 (03) ◽

pp. 279-299 ◽

Cited By ~ 1

Author(s):

I-HSIN CHUNG ◽

CHE-RUNG LEE ◽

JIAZHENG ZHOU ◽

YEH-CHING CHUNG

Keyword(s):

High Performance ◽

Large Scale ◽

Scale Up ◽

Matrix Multiplication ◽

Spectral Graph Theory ◽

Communication Patterns ◽

Fine Tuning ◽

Mapping Algorithm ◽

Communication Time ◽

Run Time

As the high performance computing systems scale up, mapping the tasks of a parallel application onto physical processors to allow efficient communication becomes one of the critical performance issues. Existing algorithms were usually designed to map applications with regular communication patterns. Their mapping criterion usually overlooks the size of communicated messages, which is the primary factor of communication time. In addition, most of their time complexities are too high to process large scale problems. In this paper, we present a hierarchical mapping algorithm (HMA), which is capable of mapping applications with irregular communication patterns. It first partitions tasks according to their run-time communication information. The tasks that communicate with each other more frequently are regarded as strongly connected. Based on their connectivity strength, the tasks are partitioned into supernodes based on the algorithms in spectral graph theory. The hierarchical partitioning reduces the mapping algorithm complexity to achieve scalability. Finally, the run-time communication information will be used again in fine tuning to explore better mappings. With the experiments, we show how the mapping algorithm helps to reduce the point-to-point communication time for the PDGEMM, a ScaLAPACK matrix multiplication computation kernel, up to 20% and the AMG2006, a tier 1 application of the Sequoia benchmark, up to 7%.

Get full-text (via PubEx)

Anatomy of high-performance matrix multiplication

ACM Transactions on Mathematical Software ◽

10.1145/1356052.1356053 ◽

2008 ◽

Vol 34 (3) ◽

pp. 1-25 ◽

Cited By ~ 296

Author(s):

Kazushige Goto ◽

Robert A. van de Geijn

Keyword(s):

High Performance ◽

Matrix Multiplication ◽

Performance Matrix

Get full-text (via PubEx)

A framework for high-performance matrix multiplication based on hierarchical abstractions, algorithms and optimized low-level kernels

Concurrency and Computation Practice and Experience ◽

10.1002/cpe.630 ◽

2002 ◽

Vol 14 (10) ◽

pp. 805-839 ◽

Cited By ~ 21

Author(s):

Vinod Valsalam ◽

Anthony Skjellum

Keyword(s):

High Performance ◽

Matrix Multiplication ◽

Low Level ◽

Performance Matrix

Get full-text (via PubEx)

A high-performance matrix–matrix multiplication methodology for CPU and GPU architectures

The Journal of Supercomputing ◽

10.1007/s11227-015-1613-7 ◽

2016 ◽

Vol 72 (3) ◽

pp. 804-844 ◽

Cited By ~ 3

Author(s):

Vasilios Kelefouras ◽

A. Kritikakou ◽

Iosif Mporas ◽

Vasilios Kolonias

Keyword(s):

High Performance ◽

Matrix Multiplication ◽

Performance Matrix ◽

Gpu Architectures

Get full-text (via PubEx)

FPGA Based High Performance Double-Precision Matrix Multiplication

2009 22nd International Conference on VLSI Design ◽

10.1109/vlsi.design.2009.13 ◽

2009 ◽

Cited By ~ 20

Author(s):

Vinay B.Y. Kumar ◽

Siddharth Joshi ◽

Sachin B. Patkar ◽

H. Narayanan

Keyword(s):

High Performance ◽

Matrix Multiplication ◽

Double Precision ◽

Precision Matrix

Get full-text (via PubEx)