Communication-Avoiding Parallel Sparse-Dense Matrix-Matrix Multiplication

Author(s): Penporn Koanantakool, Ariful Azad, Aydin Buluc, Dmitriy Morozov, Sang-Yun Oh, ...

2014, Vol 82 (1), pp. 147-158
Author(s): Wilson M. José, Ana Rita Silva, Mário P. Véstias, Horácio C. Neto




2004, Vol 12 (3), pp. 169-183
Author(s): Alexandros V. Gerbessiotis, Seung-Yeop Lee

In this work we make a strong case for remote memory access (RMA) as an effective way to program a parallel computer, by proposing a framework that supports RMA in a library-independent, simple, and intuitive way. With our approach, the parallel code one writes runs transparently under MPI-2-enabled libraries as well as bulk-synchronous parallel libraries. The advantages of using RMA are code simplicity, reduced programming complexity, and increased efficiency. We support these claims by implementing, under this framework, a collection of benchmark programs: a communication and synchronization performance assessment program, a dense matrix multiplication algorithm, and two variants of a parallel radix-sort algorithm. We examine their performance on a Linux-based PC cluster under three different RMA-enabled libraries: LAM MPI, BSPlib, and PUB. We conclude that implementations of such parallel algorithms using RMA communication primitives lead to code that is as efficient as the equivalent message-passing code and, in the case of radix sort, substantially more efficient. In addition, our work can serve as a comparative study of the relevant capabilities of the three libraries.
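To illustrate the one-sided programming style the abstract argues for, here is a minimal sketch of MPI-2 remote memory access in C. It is not the authors' library-independent framework; it only shows the RMA idiom (expose a window, separate epochs with fences, write into a neighbour's memory with a put) on a ring of processes, an example chosen purely for illustration.

```c
/* Minimal MPI-2 RMA sketch: each rank puts its rank id into the window of
 * its right neighbour on a ring. No matching receive is posted anywhere. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process exposes one integer as a window for remote access. */
    int local = -1;
    MPI_Win win;
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Bulk-synchronous style: access epochs are delimited by fences. */
    MPI_Win_fence(0, win);
    int value  = rank;                 /* data this process contributes   */
    int target = (rank + 1) % size;    /* right neighbour on the ring     */
    /* One-sided put: write 'value' into the neighbour's window. */
    MPI_Put(&value, 1, MPI_INT, target, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);

    printf("rank %d received %d from its left neighbour\n", rank, local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

The two fences play the role of superstep boundaries: all puts issued between them are guaranteed complete at the second fence, which is what lets the same code map onto both MPI-2 and bulk-synchronous libraries.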



2021
Author(s): Zhongming Yu, Guohao Dai, Guyue Huang, Yu Wang, Huazhong Yang


2021, pp. 105-125
Author(s): Eduardo Patricio Estévez Ruiz, Giovanny Eduardo Caluña Chicaiza, Fabian Rodolfo Jiménez Patiño, Joaquín Cayetano López Lago, Saravana Prakash Thirumuruganandham


2016, Vol 26 (02), pp. 1650007
Author(s): Jing Wu, Joseph Jaja

In this paper, we illustrate the possibility of developing strategies to carry out matrix computations on heterogeneous platforms that achieve native GPU performance on very large data sizes, up to the capacity of the CPU memory. More specifically, we present a dense matrix multiplication strategy on a heterogeneous platform, tailored for the case when the input is too large to fit in device memory, which achieves near-peak GPU performance. Our strategy relies on CUDA-stream-based software pipelines that effectively overlap PCIe data transfers with kernel executions. As a result, we achieve over 1 and 2 TFLOPS on a single node using 1 and 2 GPUs, respectively.
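The abstract describes overlapping PCIe transfers with kernel execution via CUDA streams. Below is a minimal host-side C sketch of that double-buffered pipeline under stated assumptions: column-major storage, the A matrix small enough to stay resident on the GPU, B and C tiled along their columns, and a hypothetical helper name pipelined_gemm. For the asynchronous copies to actually overlap with compute, B_host and C_host would need to be pinned (allocated with cudaMallocHost). This is an illustrative sketch, not the authors' implementation.

```c
/* Double-buffered GEMM pipeline: while one stream runs cuBLAS on tile i,
 * the other stream copies tile i+1 over PCIe. C (m x n) = A (m x k) * B (k x n). */
#include <cuda_runtime.h>
#include <cublas_v2.h>

void pipelined_gemm(const float *A, const float *B_host, float *C_host,
                    int m, int k, int n, int tile) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    cudaStream_t streams[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

    /* A stays resident on the device; B and C get one buffer per stream. */
    float *dA, *dB[2], *dC[2];
    cudaMalloc((void **)&dA, sizeof(float) * m * k);
    for (int s = 0; s < 2; ++s) {
        cudaMalloc((void **)&dB[s], sizeof(float) * k * tile);
        cudaMalloc((void **)&dC[s], sizeof(float) * m * tile);
    }
    cudaMemcpy(dA, A, sizeof(float) * m * k, cudaMemcpyHostToDevice);

    const float alpha = 1.0f, beta = 0.0f;
    for (int j = 0; j < n; j += tile) {
        int cols = (n - j < tile) ? (n - j) : tile;
        int s = (j / tile) % 2;            /* alternate between the two streams */

        /* Stage 1: copy this tile of B; overlaps with the other stream's GEMM. */
        cudaMemcpyAsync(dB[s], B_host + (size_t)j * k,
                        sizeof(float) * k * cols,
                        cudaMemcpyHostToDevice, streams[s]);

        /* Stage 2: GEMM on the tile, ordered after its copy within the stream. */
        cublasSetStream(handle, streams[s]);
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, cols, k, &alpha, dA, m, dB[s], k, &beta, dC[s], m);

        /* Stage 3: copy the finished tile of C back to the host. */
        cudaMemcpyAsync(C_host + (size_t)j * m, dC[s],
                        sizeof(float) * m * cols,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < 2; ++s) {
        cudaFree(dB[s]); cudaFree(dC[s]); cudaStreamDestroy(streams[s]);
    }
    cudaFree(dA);
    cublasDestroy(handle);
}
```

Because operations within a CUDA stream execute in order, each tile's GEMM waits only for its own transfer, while transfers and kernels on the other stream proceed concurrently; this is the overlap of PCIe traffic and computation the abstract refers to.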


