Parallel Scientific Computation
Published by Oxford University Press

ISBN: 9780198788348, 9780191830273

Author(s):  
Rob H. Bisseling

This chapter explores parallel algorithms for graph matching. Here, a graph is the mathematical representation of a network, with vertices representing the nodes of the network and edges representing their connections. The edges have positive weights, and the aim is to find a matching with maximum total weight. The chapter first presents a sequential but parallelizable approximation algorithm based on local dominance that is guaranteed to attain at least half the optimal weight in near-linear time. This algorithm, coupled with a vertex partitioning, forms the basis of a parallel algorithm. The BSP approach proves especially advantageous for graph problems, both in developing a parallel algorithm and in proving it correct. The basic parallel algorithm is enhanced by preferring local matches when breaking ties and by adding a load-balancing mechanism. The scalability of the parallel algorithm is put to the test on graphs with up to 150 million edges.
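As an illustration of the half-approximation guarantee, the following minimal C sketch implements the simpler sorted-edge greedy variant, which runs in O(m log m) time; the chapter's local-dominance algorithm attains the same guarantee without the global sort. The edge-list representation and names are illustrative, not taken from the book.

```c
/* Greedy 1/2-approximation for maximum-weight matching: scan the
   edges in order of decreasing weight and match an edge whenever
   both of its endpoints are still free. */
#include <stdlib.h>

typedef struct { int u, v; double w; } Edge;

static int by_weight_desc(const void *a, const void *b) {
    double wa = ((const Edge *)a)->w, wb = ((const Edge *)b)->w;
    return (wa < wb) - (wa > wb);   /* heavier edges first */
}

/* n vertices, m edges; on return, match[i] is the partner of
   vertex i, or -1 if i is unmatched. Returns the matching weight. */
double greedy_matching(Edge *e, int m, int n, int *match) {
    double total = 0.0;
    for (int i = 0; i < n; i++) match[i] = -1;
    qsort(e, m, sizeof(Edge), by_weight_desc);
    for (int i = 0; i < m; i++) {
        if (match[e[i].u] == -1 && match[e[i].v] == -1) {
            match[e[i].u] = e[i].v;
            match[e[i].v] = e[i].u;
            total += e[i].w;
        }
    }
    return total;
}
```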


Author(s):  
Rob H. Bisseling

This chapter demonstrates the use of different data distributions in different phases of a parallel fast Fourier transform (FFT), a regular computation with a predictable but challenging data access pattern. Both the block and cyclic distributions are used, as well as intermediates between them. Each required redistribution of the data is a permutation that involves communication; by making careful choices, the number of such redistributions can be kept to a minimum. FFT algorithms can be expressed concisely in matrix-vector notation with Kronecker matrix products, and this notation is used here as well. The chapter then shows how permutations with a regular pattern can be implemented more efficiently by packing the data. The parallelization techniques discussed for the specific case of the FFT also apply to related computations, for instance in signal processing and weather forecasting.
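To make the distributions concrete, the sketch below gives the owner and local-index maps for the block and cyclic distributions of a vector of length n over p processors, together with a block-to-cyclic redistribution written in BSPlib style. It assumes p divides n and that the destination array has already been registered; the function names are illustrative.

```c
#include <bsp.h>

/* Block distribution: global index j lives on processor j/(n/p),
   at local index j mod (n/p). */
int block_owner(int j, int n, int p) { return j / (n / p); }
int block_local(int j, int n, int p) { return j % (n / p); }

/* Cyclic distribution: global index j lives on processor j mod p,
   at local index j/p. */
int cyclic_owner(int j, int p) { return j % p; }
int cyclic_local(int j, int p) { return j / p; }

/* Redistribute from block to cyclic: processor s puts each of its
   n/p local elements into the cyclic owner of the corresponding
   global index. Assumes xc was registered with bsp_push_reg; the
   bsp_sync ends the communication superstep. */
void block_to_cyclic(double *xb, double *xc, int n, int p, int s) {
    int b = n / p;
    for (int i = 0; i < b; i++) {
        int j = s * b + i;                      /* global index */
        bsp_put(j % p, &xb[i], xc,
                (j / p) * sizeof(double), sizeof(double));
    }
    bsp_sync();
}
```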


Author(s):  
Rob H. Bisseling

This chapter introduces irregular algorithms and presents the example of parallel sparse matrix-vector multiplication (SpMV), the central operation in iterative linear system solvers. The irregular sparsity pattern of the matrix does not change during the multiplication, which may be repeated many times; this justifies putting considerable effort into finding a good data distribution. The Mondriaan distribution of a sparse matrix is a useful non-Cartesian distribution that can be found by hypergraph-based partitioning. The Mondriaan package implements such a partitioning as well as the newer medium-grain partitioning method. The chapter analyses the special cases of random sparse matrices and Laplacian matrices, using performance profiles and geometric means to compare different partitioning methods. Furthermore, it presents the hybrid-BSP model and a hybrid-BSP SpMV, which are aimed at hybrid distributed/shared-memory architectures. The parallel SpMV can be incorporated into applications ranging from PageRank computation to artificial neural networks.
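The computation superstep of the parallel SpMV reduces to a local sparse matrix-vector multiply; a minimal sketch in compressed row storage (CRS) follows. In the full BSP algorithm it is preceded by a fanout that communicates the needed vector components and followed by a fanin that sums partial results, both omitted here; the struct layout and names are illustrative.

```c
/* Local sparse matrix-vector multiply u := A*v in compressed row
   storage (CRS): for each local row, accumulate the products of
   its nonzeros with the corresponding vector components. */
typedef struct {
    int nrows;    /* number of local rows */
    int *start;   /* row i's nonzeros are start[i] .. start[i+1]-1 */
    int *col;     /* (local) column index of each nonzero */
    double *val;  /* numerical value of each nonzero */
} SparseMatrix;

void local_spmv(const SparseMatrix *A, const double *v, double *u) {
    for (int i = 0; i < A->nrows; i++) {
        double sum = 0.0;
        for (int k = A->start[i]; k < A->start[i + 1]; k++)
            sum += A->val[k] * v[A->col[k]];
        u[i] = sum;
    }
}
```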


Author(s):  
Rob H. Bisseling

This chapter discusses parallel dense matrix computations, in particular the solution of linear systems by LU decomposition with partial row pivoting. It first presents a general Cartesian scheme for the distribution of matrices. Based on BSP cost analysis, the square cyclic distribution is proposed as particularly suitable for matrix computations such as LU decomposition and Gaussian elimination. The chapter introduces two-phase broadcasting of vectors, which is a useful collective-communication method for sending copies of matrix rows or columns to a group of processors. It also discusses how to achieve high performance by delaying rank-1 matrix updates to create a multiple-rank update, which can be carried out by multiplying tall-and-skinny matrices in a cache-friendly manner. The high-performance parallel LU decomposition is tested on a top-ranking supercomputer, and its performance is analysed with respect to computation, communication, and synchronization.
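As a concrete instance of the Cartesian scheme, the sketch below maps a global matrix element (i, j) to its owner and local indices under the cyclic distribution on an M x N processor grid; taking M = N = sqrt(p) yields the square cyclic distribution favoured by the BSP cost analysis. Function names are illustrative.

```c
/* Cyclic distribution of a matrix over an M x N processor grid:
   element (i, j) is owned by processor (i mod M, j mod N) and
   stored locally at row i/M, column j/N. */
typedef struct { int row, col; } ProcCoord;

ProcCoord owner(int i, int j, int M, int N) {
    ProcCoord pc = { i % M, j % N };
    return pc;
}

int local_row(int i, int M) { return i / M; }
int local_col(int j, int N) { return j / N; }
```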


Author(s):  
Rob H. Bisseling

This chapter is a self-contained tutorial that explains how to get started with parallel programming and how to design and implement parallel algorithms in a structured way using supersteps. It introduces a simple target architecture for designing parallel algorithms, the bulk synchronous parallel (BSP) computer. Using the computation of the inner product of two vectors as an example, the chapter shows how an algorithm is designed hand in hand with its cost analysis. The inner-product algorithm is implemented in a short program that demonstrates the most important primitives of the communication library BSPlib. Furthermore, a benchmarking program is given for measuring the BSP parameters of a parallel computer; its use is demonstrated on a desktop computer and a supercomputer. Finally, a parallel sorting algorithm based on regular sampling is presented, implemented, and tested.
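The following minimal sketch shows the shape of such an inner-product computation in BSPlib, assuming each of the p processors holds n/p components of x and y and that p divides n. It is written in the spirit of the chapter's program, not as the book's exact code.

```c
/* Parallel inner product of x and y: superstep 1 computes a local
   partial inner product and broadcasts it to all processors with
   bsp_put; superstep 2 sums the p partial results locally. */
#include <bsp.h>
#include <stdlib.h>

double bsp_inprod(int p, int s, int n, double *x, double *y) {
    double *partial = malloc(p * sizeof(double));
    bsp_push_reg(partial, p * sizeof(double));
    bsp_sync();

    double local = 0.0;                 /* superstep 1 */
    for (int i = 0; i < n / p; i++)
        local += x[i] * y[i];
    for (int t = 0; t < p; t++)
        bsp_put(t, &local, partial, s * sizeof(double),
                sizeof(double));
    bsp_sync();

    double inprod = 0.0;                /* superstep 2 */
    for (int t = 0; t < p; t++)
        inprod += partial[t];
    bsp_pop_reg(partial);
    free(partial);
    return inprod;
}
```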

