Parallel Scientific Computation
Published by Oxford University Press

ISBN: 9780198788348, 9780191830273

Author(s):  
Rob H. Bisseling

This chapter explores parallel algorithms for graph matching. Here, a graph is the mathematical representation of a network, with vertices representing the nodes of the network and edges representing their connections. The edges have positive weights, and the aim is to find a matching with maximum total weight. The chapter first presents a sequential but parallelizable approximation algorithm based on local dominance that is guaranteed to attain at least half the optimal weight in near-linear time. This algorithm, coupled with a vertex partitioning, forms the basis of a parallel algorithm. The BSP approach proves especially advantageous for graph problems, both in developing a parallel algorithm and in proving it correct. The basic parallel algorithm is enhanced by preferring local matches when breaking ties and by adding a load-balancing mechanism. The scalability of the parallel algorithm is put to the test on graphs with up to 150 million edges.
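As an illustration of the half-approximation guarantee, the following minimal C sketch implements the simpler sorted-edge greedy variant, which runs in O(m log m) time; the chapter's local-dominance algorithm attains the same guarantee without the global sort. The edge-list representation and names are illustrative, not taken from the book.

```c
/* Greedy 1/2-approximation for maximum-weight matching: scan the
   edges in order of decreasing weight and match an edge whenever
   both of its endpoints are still free. */
#include <stdlib.h>

typedef struct { int u, v; double w; } Edge;

static int by_weight_desc(const void *a, const void *b) {
    double wa = ((const Edge *)a)->w, wb = ((const Edge *)b)->w;
    return (wa < wb) - (wa > wb);   /* heavier edges first */
}

/* n vertices, m edges; on return, match[i] is the partner of
   vertex i, or -1 if i is unmatched. Returns the matching weight. */
double greedy_matching(Edge *e, int m, int n, int *match) {
    double total = 0.0;
    for (int i = 0; i < n; i++) match[i] = -1;
    qsort(e, m, sizeof(Edge), by_weight_desc);
    for (int i = 0; i < m; i++) {
        if (match[e[i].u] == -1 && match[e[i].v] == -1) {
            match[e[i].u] = e[i].v;
            match[e[i].v] = e[i].u;
            total += e[i].w;
        }
    }
    return total;
}
```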


Author(s):  
Rob H. Bisseling

This chapter demonstrates the use of different data distributions in different phases of a parallel fast Fourier transform (FFT), a regular computation with a predictable but challenging data access pattern. Both the block and cyclic distributions are used, as well as intermediates between them. Each required redistribution of the data is a permutation that involves communication; by making careful choices, the number of such redistributions can be kept to a minimum. FFT algorithms can be expressed concisely in matrix-vector notation with Kronecker matrix products, and this notation is used here as well. The chapter then shows how permutations with a regular pattern can be implemented more efficiently by packing the data. The parallelization techniques discussed for the specific case of the FFT also apply to related computations, for instance in signal processing and weather forecasting.
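To make the distributions concrete, the sketch below gives the owner and local-index maps for the block and cyclic distributions of a vector of length n over p processors, together with a block-to-cyclic redistribution written in BSPlib style. It assumes p divides n and that the destination array has already been registered; the function names are illustrative.

```c
#include <bsp.h>

/* Block distribution: global index j lives on processor j/(n/p),
   at local index j mod (n/p). */
int block_owner(int j, int n, int p) { return j / (n / p); }
int block_local(int j, int n, int p) { return j % (n / p); }

/* Cyclic distribution: global index j lives on processor j mod p,
   at local index j/p. */
int cyclic_owner(int j, int p) { return j % p; }
int cyclic_local(int j, int p) { return j / p; }

/* Redistribute from block to cyclic: processor s puts each of its
   n/p local elements into the cyclic owner of the corresponding
   global index. Assumes xc was registered with bsp_push_reg; the
   bsp_sync ends the communication superstep. */
void block_to_cyclic(double *xb, double *xc, int n, int p, int s) {
    int b = n / p;
    for (int i = 0; i < b; i++) {
        int j = s * b + i;                      /* global index */
        bsp_put(j % p, &xb[i], xc,
                (j / p) * sizeof(double), sizeof(double));
    }
    bsp_sync();
}
```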


Author(s):  
Rob H. Bisseling

This chapter introduces irregular algorithms and presents the example of parallel sparse matrix-vector multiplication (SpMV), the central operation in iterative linear system solvers. The irregular sparsity pattern of the matrix does not change during the multiplication, which may be repeated many times; this justifies putting considerable effort into finding a good data distribution. The Mondriaan distribution of a sparse matrix is a useful non-Cartesian distribution that can be found by hypergraph-based partitioning. The Mondriaan package implements such a partitioning as well as the newer medium-grain partitioning method. The chapter analyses the special cases of random sparse matrices and Laplacian matrices, using performance profiles and geometric means to compare different partitioning methods. Furthermore, it presents the hybrid-BSP model and a hybrid-BSP SpMV, which are aimed at hybrid distributed/shared-memory architectures. The parallel SpMV can be incorporated into applications ranging from PageRank computation to artificial neural networks.
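The computation superstep of the parallel SpMV reduces to a local sparse matrix-vector multiply; a minimal sketch in compressed row storage (CRS) follows. In the full BSP algorithm it is preceded by a fanout that communicates the needed vector components and followed by a fanin that sums partial results, both omitted here; the struct layout and names are illustrative.

```c
/* Local sparse matrix-vector multiply u := A*v in compressed row
   storage (CRS): for each local row, accumulate the products of
   its nonzeros with the corresponding vector components. */
typedef struct {
    int nrows;    /* number of local rows */
    int *start;   /* row i's nonzeros are start[i] .. start[i+1]-1 */
    int *col;     /* (local) column index of each nonzero */
    double *val;  /* numerical value of each nonzero */
} SparseMatrix;

void local_spmv(const SparseMatrix *A, const double *v, double *u) {
    for (int i = 0; i < A->nrows; i++) {
        double sum = 0.0;
        for (int k = A->start[i]; k < A->start[i + 1]; k++)
            sum += A->val[k] * v[A->col[k]];
        u[i] = sum;
    }
}
```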


Author(s):  
Rob H. Bisseling

This chapter discusses parallel dense matrix computations, in particular the solution of linear systems by LU decomposition with partial row pivoting. It first presents a general Cartesian scheme for the distribution of matrices. Based on BSP cost analysis, the square cyclic distribution is proposed as particularly suitable for matrix computations such as LU decomposition and Gaussian elimination. The chapter introduces two-phase broadcasting of vectors, which is a useful collective-communication method for sending copies of matrix rows or columns to a group of processors. It also discusses how to achieve high performance by delaying rank-1 matrix updates to create a multiple-rank update, which can be carried out by multiplying tall-and-skinny matrices in a cache-friendly manner. The high-performance parallel LU decomposition is tested on a top-ranking supercomputer, and its performance is analysed with respect to computation, communication, and synchronization.
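As a concrete instance of the Cartesian scheme, the sketch below maps a global matrix element (i, j) to its owner and local indices under the cyclic distribution on an M x N processor grid; taking M = N = sqrt(p) yields the square cyclic distribution favoured by the BSP cost analysis. Function names are illustrative.

```c
/* Cyclic distribution of a matrix over an M x N processor grid:
   element (i, j) is owned by processor (i mod M, j mod N) and
   stored locally at row i/M, column j/N. */
typedef struct { int row, col; } ProcCoord;

ProcCoord owner(int i, int j, int M, int N) {
    ProcCoord pc = { i % M, j % N };
    return pc;
}

int local_row(int i, int M) { return i / M; }
int local_col(int j, int N) { return j / N; }
```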


Author(s):  
Rob H. Bisseling

This chapter is a self-contained tutorial that explains how to get started with parallel programming and how to design and implement parallel algorithms in a structured way using supersteps. It introduces a simple target architecture for designing parallel algorithms, the bulk synchronous parallel (BSP) computer. Using the computation of the inner product of two vectors as an example, the chapter shows how an algorithm is designed hand in hand with its cost analysis. The inner-product algorithm is implemented in a short program that demonstrates the most important primitives of the communication library BSPlib. Furthermore, a benchmarking program is given for measuring the BSP parameters of a parallel computer; its use is demonstrated on a desktop computer and a supercomputer. Finally, a parallel sorting algorithm based on regular sampling is presented, implemented, and tested.
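The following minimal sketch shows the shape of such an inner-product computation in BSPlib, assuming each of the p processors holds n/p components of x and y and that p divides n. It is written in the spirit of the chapter's program, not as the book's exact code.

```c
/* Parallel inner product of x and y: superstep 1 computes a local
   partial inner product and broadcasts it to all processors with
   bsp_put; superstep 2 sums the p partial results locally. */
#include <bsp.h>
#include <stdlib.h>

double bsp_inprod(int p, int s, int n, double *x, double *y) {
    double *partial = malloc(p * sizeof(double));
    bsp_push_reg(partial, p * sizeof(double));
    bsp_sync();

    double local = 0.0;                 /* superstep 1 */
    for (int i = 0; i < n / p; i++)
        local += x[i] * y[i];
    for (int t = 0; t < p; t++)
        bsp_put(t, &local, partial, s * sizeof(double),
                sizeof(double));
    bsp_sync();

    double inprod = 0.0;                /* superstep 2 */
    for (int t = 0; t < p; t++)
        inprod += partial[t];
    bsp_pop_reg(partial);
    free(partial);
    return inprod;
}
```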

