DESIGNING PARALLEL ALGORITHMS FOR HIERARCHICAL SMP CLUSTERS

2003 ◽  
Vol 14 (01) ◽  
pp. 59-78
Author(s):  
MARTIN SCHMOLLINGER ◽  
MICHAEL KAUFMANN

Clusters of symmetric multiprocessor nodes (SMP clusters) are currently among the most important parallel architectures. The architecture consists of shared-memory nodes with multiple processors connected by a fast interconnection network. New programming models try to exploit this architecture by using threads within the nodes and message-passing libraries for inter-node communication. To develop efficient algorithms, it is necessary to take the hybrid nature of both the architecture and the programming models into account. We present the κNUMA model and a methodology that together form a solid basis for designing efficient algorithms for SMP clusters. The κNUMA model is a computational model that extends the bulk-synchronous parallel (BSP) model with the characteristics of SMP clusters and of the new hybrid programming models. The κNUMA methodology suggests building efficient overall algorithms from efficient algorithms for each level of the hierarchy. We use personalized one-to-all broadcast and dense matrix-vector multiplication as example problems. The theoretical results of the analysis of dense matrix-vector multiplication are verified in practice with experiments on a Linux cluster of dual Pentium-III nodes.
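To make the hybrid programming style concrete, here is a minimal sketch (our own illustration, not the authors' κNUMA algorithm) of dense matrix-vector multiplication in the style the paper targets: MPI distributes block rows of the matrix across SMP nodes, while OpenMP threads share the row loop within each node.

```c
/* Hybrid MPI + OpenMP dense matrix-vector multiplication y = A*x.
 * Block rows of A live on each node; threads split rows within a node.
 * Illustrative sketch only; assumes `size` divides n evenly. */
#include <mpi.h>
#include <stdlib.h>

void hybrid_matvec(const double *A_local, const double *x,
                   double *y_local, int local_rows, int n)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < local_rows; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A_local[(size_t)i * n + j] * x[j];
        y_local[i] = sum;
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int n = 1024;
    int local_rows = n / size;
    double *A_local = malloc((size_t)local_rows * n * sizeof *A_local);
    double *x       = malloc((size_t)n * sizeof *x);
    double *y_local = malloc((size_t)local_rows * sizeof *y_local);
    for (int i = 0; i < local_rows * n; i++) A_local[i] = 1.0; /* dummy data */
    for (int j = 0; j < n; j++) x[j] = 1.0;

    hybrid_matvec(A_local, x, y_local, local_rows, n);

    /* Gather the distributed result slices on rank 0. */
    double *y = rank == 0 ? malloc((size_t)n * sizeof *y) : NULL;
    MPI_Gather(y_local, local_rows, MPI_DOUBLE,
               y, local_rows, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(A_local); free(x); free(y_local); free(y);
    MPI_Finalize();
    return 0;
}
```

Compiled with e.g. `mpicc -fopenmp` and launched with one MPI rank per node, the OpenMP threads fill the processors within each SMP node, mirroring the two-level hierarchy the κNUMA model captures.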

2018 ◽  
Vol 30 (19) ◽  
pp. e4705 ◽  
Author(s):  
Guixia He ◽  
Jiaquan Gao ◽  
Jun Wang

2004 ◽  
Vol 12 (3) ◽  
pp. 169-183 ◽  
Author(s):  
Alexandros V. Gerbessiotis ◽  
Seung-Yeop Lee

In this work we make a strong case for remote memory access (RMA) as an effective way to program a parallel computer, by proposing a framework that supports RMA in a library-independent, simple, and intuitive way. With our approach, the parallel code one writes runs transparently under MPI-2-enabled libraries as well as bulk-synchronous parallel libraries. The advantages of using RMA are code simplicity, reduced programming complexity, and increased efficiency. We support the latter claims by implementing under this framework a collection of benchmark programs, consisting of a communication and synchronization performance assessment program, a dense matrix multiplication algorithm, and two variants of a parallel radix-sort algorithm, and by examining their performance on a Linux-based PC cluster under three different RMA-enabled libraries: LAM MPI, BSPlib, and PUB. We conclude that implementations of such parallel algorithms using RMA communication primitives lead to code that is as efficient as the equivalent message-passing code, and in the case of radix-sort substantially more efficient. In addition, our work can serve as a comparative study of the relevant capabilities of the three libraries.
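For readers unfamiliar with the RMA style, a minimal MPI-2 one-sided sketch (our own illustration, not the paper's library-independent framework): each process exposes a buffer through a window, and a peer writes into it directly with MPI_Put between synchronization fences, with no matching receive posted.

```c
/* Minimal MPI-2 one-sided (RMA) communication: rank 0 puts a value
 * directly into rank 1's exposed window. Illustration only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int buf = -1;                      /* memory exposed for remote access */
    MPI_Win win;
    MPI_Win_create(&buf, sizeof buf, sizeof buf,
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);             /* open the access epoch */
    if (rank == 0) {
        int val = 42;
        /* Write into rank 1's window; rank 1 posts no receive. */
        MPI_Put(&val, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);             /* close the epoch; data is visible */

    if (rank == 1)
        printf("rank 1 received %d via MPI_Put\n", buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

The appeal the paper argues for is visible even here: the target process does nothing but open and close the epoch, so communication patterns need not be matched send-for-receive.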


Author(s):  
Christopher De Sa ◽  
Albert Gu ◽  
Rohan Puttagunta ◽  
Christopher Ré ◽  
Atri Rudra

2008 ◽  
Vol 18 (04) ◽  
pp. 511-530 ◽  
Author(s):  
NORIYUKI FUJIMOTO

Recently, GPUs have acquired the ability to perform fast general-purpose computation by running thousands of threads concurrently. This paper presents a new algorithm for dense matrix-vector multiplication on the NVIDIA CUDA architecture. The experiments were conducted on a PC with a GeForce 8800GTX and a 2.0 GHz Intel Xeon E5335 CPU. The results show that the proposed algorithm runs up to 11.19 times faster than NVIDIA's BLAS library CUBLAS 1.1 on the GPU and up to 35.15 times faster than the Intel Math Kernel Library 9.1 on a single x86 core with SSE3 SIMD instructions. The performance of Jacobi's iterative method for solving linear equations, which includes the data-transfer time between CPU and GPU, shows that the proposed algorithm is practical for real applications.
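As a point of reference for the CUBLAS baseline, here is a minimal host-side GPU matrix-vector multiplication using cuBLAS's gemv routine. This is a sketch with the modern v2 C API, not the CUBLAS 1.1 interface the paper benchmarked against and not the paper's own kernel; error checks are omitted for brevity.

```c
/* Dense matrix-vector multiplication y = A*x on the GPU via cuBLAS.
 * hA is an n-by-n row-major host matrix. */
#include <cuda_runtime.h>
#include <cublas_v2.h>

void gpu_matvec(const float *hA, const float *hx, float *hy, int n)
{
    float *dA, *dx, *dy;
    cudaMalloc(&dA, (size_t)n * n * sizeof(float));
    cudaMalloc(&dx, (size_t)n * sizeof(float));
    cudaMalloc(&dy, (size_t)n * sizeof(float));

    /* Transfers included: the paper's Jacobi timing counts them too. */
    cudaMemcpy(dA, hA, (size_t)n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, hx, (size_t)n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    /* cuBLAS assumes column-major storage, so a row-major buffer is the
     * transpose; CUBLAS_OP_T therefore computes y = A*x for hA's layout. */
    cublasSgemv(handle, CUBLAS_OP_T, n, n, &alpha, dA, n,
                dx, 1, &beta, dy, 1);

    cudaMemcpy(hy, dy, (size_t)n * sizeof(float), cudaMemcpyDeviceToHost);
    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dx); cudaFree(dy);
}
```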


2017 ◽  
Vol 43 (4) ◽  
pp. 1-49 ◽  
Author(s):  
Salvatore Filippone ◽  
Valeria Cardellini ◽  
Davide Barbieri ◽  
Alessandro Fanfarillo

Author(s):  
Rawad Bitar ◽  
Yuxuan Xing ◽  
Yasaman Keshtkarjahromi ◽  
Venkat Dasari ◽  
Salim El Rouayheb ◽  
...  

Edge computing is emerging as a new paradigm that allows data to be processed near the edge of the network, where it is typically generated and collected. This enables critical computations at the edge in applications such as the Internet of Things (IoT), in which an increasing number of devices (sensors, cameras, health-monitoring devices, etc.) collect data that must be processed by computationally intensive algorithms under stringent reliability, security, and latency constraints. Our key tool is the theory of coded computation, which advocates mixing data in computationally intensive tasks by employing erasure codes and offloading these tasks to other devices for computation. Coded computation has recently been gaining interest thanks to its higher reliability, smaller delay, and lower communication costs. In this paper, we develop a private and rateless adaptive coded computation (PRAC) algorithm for distributed matrix-vector multiplication that takes into account (1) the privacy requirements of IoT applications and devices, and (2) the heterogeneous and time-varying resources of edge devices. We show that PRAC outperforms known secure coded computing methods when resources are heterogeneous. We provide theoretical guarantees on the performance of PRAC and compare it to baselines. Moreover, we confirm our theoretical results through simulations and implementations on Android-based smartphones.
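A toy illustration of the coded-computation idea underlying PRAC (our own minimal single-parity example, not PRAC's private rateless code): the master splits A into two row blocks, forms a third "parity" block A1 + A2, and can recover y = A*x from any two of the three worker results, so one straggling worker can simply be ignored.

```c
/* Toy coded matrix-vector multiplication: workers compute A1*x, A2*x,
 * and (A1+A2)*x; any two results suffice to recover y = A*x. */
#include <stdio.h>

#define K 2          /* rows per block */
#define N 4          /* columns */

static void matvec(const double a[K][N], const double x[N], double y[K])
{
    for (int i = 0; i < K; i++) {
        y[i] = 0.0;
        for (int j = 0; j < N; j++)
            y[i] += a[i][j] * x[j];
    }
}

int main(void)
{
    double A1[K][N] = {{1, 2, 3, 4}, {5, 6, 7, 8}};
    double A2[K][N] = {{9, 8, 7, 6}, {5, 4, 3, 2}};
    double P[K][N];                       /* parity block A1 + A2 */
    for (int i = 0; i < K; i++)
        for (int j = 0; j < N; j++)
            P[i][j] = A1[i][j] + A2[i][j];

    double x[N] = {1, 1, 2, 2};
    double y1[K], y2[K], yp[K];
    matvec(A1, x, y1);                    /* worker 1 */
    matvec(A2, x, y2);                    /* worker 2 (may straggle) */
    matvec(P,  x, yp);                    /* worker 3 (parity) */

    /* If worker 2 straggles, its block is recovered as yp - y1. */
    for (int i = 0; i < K; i++)
        printf("y2[%d] = %g (recovered %g)\n", i, y2[i], yp[i] - y1[i]);
    return 0;
}
```

PRAC replaces this fixed parity scheme with rateless (fountain-style) coding, so the amount of redundancy adapts to the time-varying speeds of edge devices, and adds masking for privacy; the recovery-from-any-sufficient-subset principle is the same.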

