Communication-Avoiding Parallel Sparse-Dense Matrix-Matrix Multiplication

Author(s): Penporn Koanantakool, Ariful Azad, Aydin Buluc, Dmitriy Morozov, Sang-Yun Oh, ...

2014, Vol 82 (1), pp. 147-158
Author(s): Wilson M. José, Ana Rita Silva, Mário P. Véstias, Horácio C. Neto




2004, Vol 12 (3), pp. 169-183
Author(s): Alexandros V. Gerbessiotis, Seung-Yeop Lee

In this work we make a strong case for remote memory access (RMA) as an effective way to program a parallel computer, by proposing a framework that supports RMA in a library-independent, simple, and intuitive way. With our approach, the parallel code one writes runs transparently under MPI-2-enabled libraries as well as bulk-synchronous parallel libraries. The advantages of using RMA are code simplicity, reduced programming complexity, and increased efficiency. We support these claims by implementing, under this framework, a collection of benchmark programs: a communication and synchronization performance assessment program, a dense matrix multiplication algorithm, and two variants of a parallel radix-sort algorithm. We examine their performance on a Linux-based PC cluster under three different RMA-enabled libraries: LAM MPI, BSPlib, and PUB. We conclude that implementations of such parallel algorithms using RMA communication primitives lead to code that is as efficient as the equivalent message-passing code and, in the case of radix sort, substantially more efficient. In addition, our work can serve as a comparative study of the relevant capabilities of the three libraries.
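To illustrate the one-sided programming style the abstract argues for, here is a minimal sketch of MPI-2 remote memory access in C. It is not the authors' library-independent framework; it only shows the RMA idiom (expose a window, separate epochs with fences, write into a neighbour's memory with a put) on a ring of processes, an example chosen purely for illustration.

```c
/* Minimal MPI-2 RMA sketch: each rank puts its rank id into the window of
 * its right neighbour on a ring. No matching receive is posted anywhere. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each process exposes one integer as a window for remote access. */
    int local = -1;
    MPI_Win win;
    MPI_Win_create(&local, sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    /* Bulk-synchronous style: access epochs are delimited by fences. */
    MPI_Win_fence(0, win);
    int value  = rank;                 /* data this process contributes   */
    int target = (rank + 1) % size;    /* right neighbour on the ring     */
    /* One-sided put: write 'value' into the neighbour's window. */
    MPI_Put(&value, 1, MPI_INT, target, 0, 1, MPI_INT, win);
    MPI_Win_fence(0, win);

    printf("rank %d received %d from its left neighbour\n", rank, local);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

The two fences play the role of superstep boundaries: all puts issued between them are guaranteed complete at the second fence, which is what lets the same code map onto both MPI-2 and bulk-synchronous libraries.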



2021
Author(s): Zhongming Yu, Guohao Dai, Guyue Huang, Yu Wang, Huazhong Yang


2021, pp. 105-125
Author(s): Eduardo Patricio Estévez Ruiz, Giovanny Eduardo Caluña Chicaiza, Fabian Rodolfo Jiménez Patiño, Joaquín Cayetano López Lago, Saravana Prakash Thirumuruganandham


2016, Vol 26 (02), pp. 1650007
Author(s): Jing Wu, Joseph Jaja

In this paper, we illustrate the possibility of developing strategies to carry out matrix computations on heterogeneous platforms that achieve native GPU performance on very large data sizes, up to the capacity of the CPU memory. More specifically, we present a dense matrix multiplication strategy on a heterogeneous platform, tailored for the case when the input is too large to fit in device memory, which achieves near-peak GPU performance. Our strategy relies on CUDA-stream-based software pipelines that effectively overlap PCIe data transfers with kernel executions. As a result, we achieve over 1 and 2 TFLOPS on a single node using 1 and 2 GPUs, respectively.
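The abstract describes overlapping PCIe transfers with kernel execution via CUDA streams. Below is a minimal host-side C sketch of that double-buffered pipeline under stated assumptions: column-major storage, the A matrix small enough to stay resident on the GPU, B and C tiled along their columns, and a hypothetical helper name pipelined_gemm. For the asynchronous copies to actually overlap with compute, B_host and C_host would need to be pinned (allocated with cudaMallocHost). This is an illustrative sketch, not the authors' implementation.

```c
/* Double-buffered GEMM pipeline: while one stream runs cuBLAS on tile i,
 * the other stream copies tile i+1 over PCIe. C (m x n) = A (m x k) * B (k x n). */
#include <cuda_runtime.h>
#include <cublas_v2.h>

void pipelined_gemm(const float *A, const float *B_host, float *C_host,
                    int m, int k, int n, int tile) {
    cublasHandle_t handle;
    cublasCreate(&handle);

    cudaStream_t streams[2];
    for (int s = 0; s < 2; ++s) cudaStreamCreate(&streams[s]);

    /* A stays resident on the device; B and C get one buffer per stream. */
    float *dA, *dB[2], *dC[2];
    cudaMalloc((void **)&dA, sizeof(float) * m * k);
    for (int s = 0; s < 2; ++s) {
        cudaMalloc((void **)&dB[s], sizeof(float) * k * tile);
        cudaMalloc((void **)&dC[s], sizeof(float) * m * tile);
    }
    cudaMemcpy(dA, A, sizeof(float) * m * k, cudaMemcpyHostToDevice);

    const float alpha = 1.0f, beta = 0.0f;
    for (int j = 0; j < n; j += tile) {
        int cols = (n - j < tile) ? (n - j) : tile;
        int s = (j / tile) % 2;            /* alternate between the two streams */

        /* Stage 1: copy this tile of B; overlaps with the other stream's GEMM. */
        cudaMemcpyAsync(dB[s], B_host + (size_t)j * k,
                        sizeof(float) * k * cols,
                        cudaMemcpyHostToDevice, streams[s]);

        /* Stage 2: GEMM on the tile, ordered after its copy within the stream. */
        cublasSetStream(handle, streams[s]);
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    m, cols, k, &alpha, dA, m, dB[s], k, &beta, dC[s], m);

        /* Stage 3: copy the finished tile of C back to the host. */
        cudaMemcpyAsync(C_host + (size_t)j * m, dC[s],
                        sizeof(float) * m * cols,
                        cudaMemcpyDeviceToHost, streams[s]);
    }
    cudaDeviceSynchronize();

    for (int s = 0; s < 2; ++s) {
        cudaFree(dB[s]); cudaFree(dC[s]); cudaStreamDestroy(streams[s]);
    }
    cudaFree(dA);
    cublasDestroy(handle);
}
```

Because operations within a CUDA stream execute in order, each tile's GEMM waits only for its own transfer, while transfers and kernels on the other stream proceed concurrently; this is the overlap of PCIe traffic and computation the abstract refers to.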


