DESIGNING PARALLEL ALGORITHMS FOR HIERARCHICAL SMP CLUSTERS

2003 ◽  
Vol 14 (01) ◽  
pp. 59-78
Author(s):  
MARTIN SCHMOLLINGER ◽  
MICHAEL KAUFMANN

Clusters of symmetric multiprocessor nodes (SMP clusters) are currently among the most important parallel architectures. The architecture consists of shared-memory nodes with multiple processors connected by a fast interconnection network. New programming models try to exploit this architecture by using threads within the nodes and message-passing libraries for inter-node communication. To develop efficient algorithms, it is necessary to take the hybrid nature of both the architecture and the programming models into account. We present the κNUMA model and a methodology that together form a solid basis for designing efficient algorithms for SMP clusters. The κNUMA model is a computational model that extends the bulk-synchronous parallel (BSP) model with the characteristics of SMP clusters and of the new hybrid programming models. The κNUMA methodology suggests building efficient overall algorithms from efficient algorithms for each level of the hierarchy. We use personalized one-to-all broadcast and dense matrix-vector multiplication as example problems. The theoretical results of the analysis of dense matrix-vector multiplication are verified in practice with experiments on a Linux cluster of dual Pentium-III nodes.
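To make the hybrid programming style concrete, here is a minimal sketch (our own illustration, not the authors' κNUMA algorithm) of dense matrix-vector multiplication in the style the paper targets: MPI distributes block rows of the matrix across SMP nodes, while OpenMP threads share the row loop within each node.

```c
/* Hybrid MPI + OpenMP dense matrix-vector multiplication y = A*x.
 * Block rows of A live on each node; threads split rows within a node.
 * Illustrative sketch only; assumes `size` divides n evenly. */
#include <mpi.h>
#include <stdlib.h>

void hybrid_matvec(const double *A_local, const double *x,
                   double *y_local, int local_rows, int n)
{
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < local_rows; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += A_local[(size_t)i * n + j] * x[j];
        y_local[i] = sum;
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int n = 1024;
    int local_rows = n / size;
    double *A_local = malloc((size_t)local_rows * n * sizeof *A_local);
    double *x       = malloc((size_t)n * sizeof *x);
    double *y_local = malloc((size_t)local_rows * sizeof *y_local);
    for (int i = 0; i < local_rows * n; i++) A_local[i] = 1.0; /* dummy data */
    for (int j = 0; j < n; j++) x[j] = 1.0;

    hybrid_matvec(A_local, x, y_local, local_rows, n);

    /* Gather the distributed result slices on rank 0. */
    double *y = rank == 0 ? malloc((size_t)n * sizeof *y) : NULL;
    MPI_Gather(y_local, local_rows, MPI_DOUBLE,
               y, local_rows, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    free(A_local); free(x); free(y_local); free(y);
    MPI_Finalize();
    return 0;
}
```

Compiled with e.g. `mpicc -fopenmp` and launched with one MPI rank per node, the OpenMP threads fill the processors within each SMP node, mirroring the two-level hierarchy the κNUMA model captures.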

2018 ◽  
Vol 30 (19) ◽  
pp. e4705 ◽  
Author(s):  
Guixia He ◽  
Jiaquan Gao ◽  
Jun Wang

2004 ◽  
Vol 12 (3) ◽  
pp. 169-183 ◽  
Author(s):  
Alexandros V. Gerbessiotis ◽  
Seung-Yeop Lee

In this work we make a strong case for remote memory access (RMA) as an effective way to program a parallel computer, by proposing a framework that supports RMA in a library-independent, simple, and intuitive way. With our approach, the parallel code one writes runs transparently under MPI-2-enabled libraries as well as bulk-synchronous parallel libraries. The advantages of using RMA are code simplicity, reduced programming complexity, and increased efficiency. We support the latter claims by implementing under this framework a collection of benchmark programs, consisting of a communication and synchronization performance assessment program, a dense matrix multiplication algorithm, and two variants of a parallel radix-sort algorithm, and by examining their performance on a Linux-based PC cluster under three different RMA-enabled libraries: LAM MPI, BSPlib, and PUB. We conclude that implementations of such parallel algorithms using RMA communication primitives lead to code that is as efficient as the equivalent message-passing code, and in the case of radix-sort substantially more efficient. In addition, our work can serve as a comparative study of the relevant capabilities of the three libraries.
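For readers unfamiliar with the RMA style, a minimal MPI-2 one-sided sketch (our own illustration, not the paper's library-independent framework): each process exposes a buffer through a window, and a peer writes into it directly with MPI_Put between synchronization fences, with no matching receive posted.

```c
/* Minimal MPI-2 one-sided (RMA) communication: rank 0 puts a value
 * directly into rank 1's exposed window. Illustration only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int buf = -1;                      /* memory exposed for remote access */
    MPI_Win win;
    MPI_Win_create(&buf, sizeof buf, sizeof buf,
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);             /* open the access epoch */
    if (rank == 0) {
        int val = 42;
        /* Write into rank 1's window; rank 1 posts no receive. */
        MPI_Put(&val, 1, MPI_INT, 1, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(0, win);             /* close the epoch; data is visible */

    if (rank == 1)
        printf("rank 1 received %d via MPI_Put\n", buf);

    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```

The appeal the paper argues for is visible even here: the target process does nothing but open and close the epoch, so communication patterns need not be matched send-for-receive.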


Author(s):  
Christopher De Sa ◽  
Albert Gu ◽  
Rohan Puttagunta ◽  
Christopher Ré ◽  
Atri Rudra

2008 ◽  
Vol 18 (04) ◽  
pp. 511-530 ◽  
Author(s):  
NORIYUKI FUJIMOTO

Recently, GPUs have acquired the ability to perform fast general-purpose computation by running thousands of threads concurrently. This paper presents a new algorithm for dense matrix-vector multiplication on the NVIDIA CUDA architecture. The experiments were conducted on a PC with a GeForce 8800GTX and a 2.0 GHz Intel Xeon E5335 CPU. The results show that the proposed algorithm runs up to 11.19 times faster than NVIDIA's BLAS library CUBLAS 1.1 on the GPU and up to 35.15 times faster than the Intel Math Kernel Library 9.1 on a single x86 core with SSE3 SIMD instructions. The performance of Jacobi's iterative method for solving linear equations, which includes the data-transfer time between CPU and GPU, shows that the proposed algorithm is practical for real applications.
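As a point of reference for the CUBLAS baseline, here is a minimal host-side GPU matrix-vector multiplication using cuBLAS's gemv routine. This is a sketch with the modern v2 C API, not the CUBLAS 1.1 interface the paper benchmarked against and not the paper's own kernel; error checks are omitted for brevity.

```c
/* Dense matrix-vector multiplication y = A*x on the GPU via cuBLAS.
 * hA is an n-by-n row-major host matrix. */
#include <cuda_runtime.h>
#include <cublas_v2.h>

void gpu_matvec(const float *hA, const float *hx, float *hy, int n)
{
    float *dA, *dx, *dy;
    cudaMalloc(&dA, (size_t)n * n * sizeof(float));
    cudaMalloc(&dx, (size_t)n * sizeof(float));
    cudaMalloc(&dy, (size_t)n * sizeof(float));

    /* Transfers included: the paper's Jacobi timing counts them too. */
    cudaMemcpy(dA, hA, (size_t)n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dx, hx, (size_t)n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    /* cuBLAS assumes column-major storage, so a row-major buffer is the
     * transpose; CUBLAS_OP_T therefore computes y = A*x for hA's layout. */
    cublasSgemv(handle, CUBLAS_OP_T, n, n, &alpha, dA, n,
                dx, 1, &beta, dy, 1);

    cudaMemcpy(hy, dy, (size_t)n * sizeof(float), cudaMemcpyDeviceToHost);
    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dx); cudaFree(dy);
}
```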


2017 ◽  
Vol 43 (4) ◽  
pp. 1-49 ◽  
Author(s):  
Salvatore Filippone ◽  
Valeria Cardellini ◽  
Davide Barbieri ◽  
Alessandro Fanfarillo

Author(s):  
Rawad Bitar ◽  
Yuxuan Xing ◽  
Yasaman Keshtkarjahromi ◽  
Venkat Dasari ◽  
Salim El Rouayheb ◽  
...  

Edge computing is emerging as a new paradigm that allows data to be processed near the edge of the network, where it is typically generated and collected. This enables critical computations at the edge in applications such as the Internet of Things (IoT), in which an increasing number of devices (sensors, cameras, health-monitoring devices, etc.) collect data that must be processed by computationally intensive algorithms under stringent reliability, security, and latency constraints. Our key tool is the theory of coded computation, which advocates mixing data in computationally intensive tasks by employing erasure codes and offloading these tasks to other devices for computation. Coded computation has recently been gaining interest thanks to its higher reliability, smaller delay, and lower communication costs. In this paper, we develop a private and rateless adaptive coded computation (PRAC) algorithm for distributed matrix-vector multiplication that takes into account (1) the privacy requirements of IoT applications and devices, and (2) the heterogeneous and time-varying resources of edge devices. We show that PRAC outperforms known secure coded computing methods when resources are heterogeneous. We provide theoretical guarantees on the performance of PRAC and compare it to baselines. Moreover, we confirm our theoretical results through simulations and implementations on Android-based smartphones.
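A toy illustration of the coded-computation idea underlying PRAC (our own minimal single-parity example, not PRAC's private rateless code): the master splits A into two row blocks, forms a third "parity" block A1 + A2, and can recover y = A*x from any two of the three worker results, so one straggling worker can simply be ignored.

```c
/* Toy coded matrix-vector multiplication: workers compute A1*x, A2*x,
 * and (A1+A2)*x; any two results suffice to recover y = A*x. */
#include <stdio.h>

#define K 2          /* rows per block */
#define N 4          /* columns */

static void matvec(const double a[K][N], const double x[N], double y[K])
{
    for (int i = 0; i < K; i++) {
        y[i] = 0.0;
        for (int j = 0; j < N; j++)
            y[i] += a[i][j] * x[j];
    }
}

int main(void)
{
    double A1[K][N] = {{1, 2, 3, 4}, {5, 6, 7, 8}};
    double A2[K][N] = {{9, 8, 7, 6}, {5, 4, 3, 2}};
    double P[K][N];                       /* parity block A1 + A2 */
    for (int i = 0; i < K; i++)
        for (int j = 0; j < N; j++)
            P[i][j] = A1[i][j] + A2[i][j];

    double x[N] = {1, 1, 2, 2};
    double y1[K], y2[K], yp[K];
    matvec(A1, x, y1);                    /* worker 1 */
    matvec(A2, x, y2);                    /* worker 2 (may straggle) */
    matvec(P,  x, yp);                    /* worker 3 (parity) */

    /* If worker 2 straggles, its block is recovered as yp - y1. */
    for (int i = 0; i < K; i++)
        printf("y2[%d] = %g (recovered %g)\n", i, y2[i], yp[i] - y1[i]);
    return 0;
}
```

PRAC replaces this fixed parity scheme with rateless (fountain-style) coding, so the amount of redundancy adapts to the time-varying speeds of edge devices, and adds masking for privacy; the recovery-from-any-sufficient-subset principle is the same.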

