Introduction to Parallel Computing

Published by Oxford University Press
ISBN: 9780198515760, 9780191916595

Author(s): Wesley Petersen, Peter Arbenz

Linear algebra is the kernel of most numerical computations. It deals with vectors and matrices and simple operations like addition and multiplication on these objects. Vectors are one-dimensional arrays of, say, n real or complex numbers x0, x1, ..., xn−1. We denote such a vector by x and think of it as a column vector. On a sequential computer, these numbers occupy n consecutive memory locations. This is also true, at least conceptually, on a shared memory multiprocessor computer. On distributed memory multicomputers, the primary issue is how to distribute vectors over the memories of the processors involved in the computation. Matrices are two-dimensional arrays with elements aij. The n · m real matrix elements (respectively 2 · n · m, if a complex datatype is available) are stored in consecutive memory locations. This is achieved either by stacking the columns on top of each other or by appending row after row; the former is called column-major, the latter row-major order. The actual procedure depends on the programming language: in Fortran, matrices are stored in column-major order, in C in row-major order. There is no difference in principle, but to write efficient programs one has to respect how matrices are laid out. To be consistent with the libraries that we will use, which are mostly written in Fortran, we explicitly program in column-major order. Thus, the matrix element aij of the m × n matrix A is located i + j · m memory locations after a00; therefore, in our C codes we will write a[i+j*m]. Notice that there is no such simple procedure for determining the memory location of an element of a sparse matrix; in Section 2.3, we outline data descriptors to handle sparse matrices. In this and later chapters we deal with one of the simplest operations one wants to do with vectors and matrices: the so-called saxpy operation (2.3). Tables 2.1 and 2.2 list some of the acronyms and conventions for the basic linear algebra subprograms discussed in this book.
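As a concrete illustration of the column-major convention and of a saxpy loop, here is a minimal C sketch. It is ours, not the book's code; the macro name A, the array sizes, and the values are arbitrary choices made for the example.

    /* Minimal sketch (not from the book): column-major addressing of an
       m x n matrix stored as a flat array, and saxpy: y <- alpha*x + y. */
    #include <stdio.h>

    /* element a_ij of an m x n column-major matrix lives at a[i + j*m] */
    #define A(a, i, j, m) ((a)[(i) + (j) * (m)])

    /* saxpy on vectors of length n */
    void saxpy(int n, float alpha, const float *x, float *y)
    {
        for (int i = 0; i < n; i++)
            y[i] += alpha * x[i];
    }

    int main(void)
    {
        enum { m = 3, n = 2 };
        float a[m * n];

        /* fill a with a_ij = 10*i + j so the layout is visible */
        for (int j = 0; j < n; j++)
            for (int i = 0; i < m; i++)
                A(a, i, j, m) = 10.0f * i + j;

        /* prints 0 10 20 1 11 21: column 0 first, then column 1 */
        for (int k = 0; k < m * n; k++)
            printf("%g ", a[k]);
        printf("\n");

        float x[4] = {1, 2, 3, 4}, y[4] = {4, 3, 2, 1};
        saxpy(4, 2.0f, x, y);                /* y becomes 6 7 8 9 */
        printf("%g %g %g %g\n", y[0], y[1], y[2], y[3]);
        return 0;
    }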


Author(s): Wesley Petersen, Peter Arbenz

The multiple instruction, multiple data (MIMD) programming model usually refers to computing on distributed memory machines with multiple independent processors. Although processors may run independent instruction streams, we are interested in streams that are always portions of a single program. Between processors which share a coherent memory view (within a node), data access is immediate, whereas between nodes data access is effected by message passing. In this book, we use MPI for such message passing. MPI has emerged as a more or less standard message passing system used on both shared memory and distributed memory machines. It is often the case that although the system consists of multiple independent instruction streams, the programming model is not too different from SIMD. Namely, the totality of a program is logically split into many independent tasks, each processed by a group (see Appendix D) of processes, but the overall program is effectively single threaded at the beginning and likewise at the end. The MIMD model, however, is extremely flexible in that no one process is always master and the other processes slaves. A communicator group of processes performs certain tasks, usually with an arbitrary master/slave relationship. One process may be assigned to be the master (or root) and coordinate the tasks of the others in the group. We emphasize that the assignment of which process is root is arbitrary: any processor may be chosen. Frequently, however, this choice is one of convenience, for example a file server node. Processors and memory are connected by a network; see, for example, Figure 5.1. In this form, each processor has its own local memory. This is not always the case: the Cray X1 and the NEC SX-6 through SX-8 series machines have common memory within nodes. Within a node, memory coherency is maintained within local caches. Between nodes, it remains the programmer's responsibility to assure a proper read–update relationship in the shared data. Data updated by one set of processes should not be clobbered by another set until the data are properly used.
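The master/root pattern described above can be sketched with a handful of MPI calls. The following minimal C program is ours, not the book's; the choice of rank 0 as root and the work performed (summing 1..n) are arbitrary illustrations.

    /* Minimal MPI sketch (not from the book): rank 0 acts as root,
       broadcasts a parameter to the group, each rank does its share of
       the work, and the root collects the result with a reduction. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        int n = 0;
        if (rank == 0)                 /* the choice of root is arbitrary */
            n = 1000;
        MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* each process sums a cyclic share of 1..n */
        long local = 0, total = 0;
        for (int i = rank; i < n; i += size)
            local += i + 1;
        MPI_Reduce(&local, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);

        if (rank == 0)
            printf("sum 1..%d = %ld\n", n, total);

        MPI_Finalize();
        return 0;
    }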


Author(s): Wesley Petersen, Peter Arbenz

The single instruction, multiple data (SIMD) mode is the simplest method of parallelism and is now becoming the most common. In most cases this SIMD mode means the same as vectorization. Ten years ago, vector computers were expensive but reasonably simple to program. Today, encouraged by multimedia applications, vector hardware is commonly available in Intel Pentium III and Pentium 4 PCs and Apple/Motorola G-4 machines. In this chapter, we cover both old and new and find that the old paradigms for programming were simpler because CMOS or ECL memories permitted easy non-unit stride memory access. Most of the ideas are the same, so the simpler programming methodology makes it easy to understand the concepts. As PC and Mac compilers improve, perhaps automatic vectorization will become as effective as on the older non-cache machines. In the meantime, on PCs and Macs we will often need to use intrinsics [23, 22, 51]. At first it seems that intrinsics keep a programmer close to the hardware, which is not a bad thing, but this is somewhat misleading: hardware control in this method of programming is only indirect. Actual register assignments are made by the compiler and may not be quite what the programmer wants. SSE2 or Altivec programming serves to illustrate a form of instruction level parallelism we wish to emphasize. This form, SIMD or vectorization, has single instructions which operate on multiple data. There are variants on this theme which use templates or macros consisting of multiple instructions carefully scheduled to accomplish the same objective, but which are not, strictly speaking, SIMD; see, for example, Section 1.2.2.1. Intrinsics are C macros which contain one or more SIMD instructions to execute certain operations on multiple data, usually four words at a time in our case. Data are explicitly declared __m128 datatypes in the Intel SSE case and vector variables using the G-4 Altivec. Our examples will show you how this works (a sketch of one such kernel is given after the list below). Four basic concepts are important. Consistent with our notion that examples are the best way to learn, several will be illustrated:
• from linear algebra, the Level 1 basic linear algebra subprograms (BLAS):
  — vector updates (-axpy)
  — reduction operations and linear searches
• recurrence formulae and polynomial evaluations
• uniform random number generation.
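As a rough illustration of such intrinsics, here is a minimal C sketch of the -axpy vector update using Intel SSE, processing four single-precision words per iteration. It is ours, not a listing from the book; it assumes the vector length is a multiple of 4 and that the compiler targets SSE.

    /* Minimal SSE sketch (not from the book): saxpy on four single-precision
       words per iteration using __m128 intrinsics. */
    #include <stdio.h>
    #include <xmmintrin.h>

    void saxpy_sse(int n, float alpha, const float *x, float *y)
    {
        __m128 va = _mm_set1_ps(alpha);              /* broadcast alpha   */
        for (int i = 0; i < n; i += 4) {
            __m128 vx = _mm_loadu_ps(x + i);         /* load 4 words of x */
            __m128 vy = _mm_loadu_ps(y + i);         /* load 4 words of y */
            vy = _mm_add_ps(vy, _mm_mul_ps(va, vx)); /* y <- y + alpha*x  */
            _mm_storeu_ps(y + i, vy);                /* store 4 words     */
        }
    }

    int main(void)
    {
        float x[8] = {1, 2, 3, 4, 5, 6, 7, 8};
        float y[8] = {8, 7, 6, 5, 4, 3, 2, 1};
        saxpy_sse(8, 2.0f, x, y);
        for (int i = 0; i < 8; i++)
            printf("%g ", y[i]);      /* prints 10 11 12 13 14 15 16 17 */
        printf("\n");
        return 0;
    }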


Author(s): Wesley Petersen, Peter Arbenz

Shared memory machines typically have relatively few processors, say 2–128. An intrinsic characteristic of these machines is a strategy for memory coherence and a fast, tightly coupled network for distributing data from a commonly accessible memory system. Our test examples were run on two HP Superdome clusters: Stardust is a production machine with 64 PA-8700 processors, and Pegasus is a 32-CPU machine with the same kind of processors. The HP9000 is grouped into cells, each with 4 CPUs and a common memory per cell, connected by a CCNUMA crossbar network. The network consists of sets of 4×4 crossbars and is shown in Figure 4.2. An effective bandwidth test, the EFF_BW benchmark [116], groups processors into two equally sized sets. Arbitrary pairings are made between elements from each group (Figure 4.3), and the cross-sectional bandwidth of the network is measured for a fixed number of processors and varying message sizes. The results from the HP9000 machine Stardust are shown in Figure 4.4. It is clear from this figure that the cross-sectional bandwidth of the network is quite high. Although not apparent from Figure 4.4, the latency for this test (the intercept near Message Size = 0) is not high. Due to the low incremental resolution of MPI_Wtime, multiple test runs must be done to quantify the latency; Dr Byrde's tests show that the minimum latency is ≳ 1.5 μs. A clearer example of a shared memory architecture is the Cray X1 machine, shown in Figures 4.5 and 4.6. In Figure 4.6, the shared memory design is obvious. Each multi-streaming processor (MSP) shown in Figure 4.5 has 4 processors (custom-designed processor chips forged by IBM) and 4 corresponding caches. Although not clear from available diagrams, vector memory access apparently permits cache by-pass, hence the term streaming in MSP. That is, vector registers are loaded directly from memory; see, for example, Figure 3.4. On each board (called a node) are 4 such MSPs and 16 memory modules, which share a common (coherent) memory view. Coherence is maintained only on each board, not across multiple-board systems.
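The bandwidth measurement idea can be illustrated with a much-simplified ping-pong test between one pair of MPI processes. The sketch below is ours, not the EFF_BW benchmark [116]; the message size and repetition count are arbitrary choices, and repeating the exchange many times also mitigates the coarse resolution of MPI_Wtime mentioned above.

    /* Simplified ping-pong bandwidth sketch (not the EFF_BW code):
       ranks 0 and 1 exchange a message of LEN bytes NREP times and
       rank 0 reports the achieved bandwidth.  Run with at least two
       MPI processes; extra ranks simply idle. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    #define LEN  (1 << 20)   /* message size in bytes (assumption) */
    #define NREP 100

    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char *buf = calloc(1, LEN);
        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();

        for (int r = 0; r < NREP; r++) {
            if (rank == 0) {
                MPI_Send(buf, LEN, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, LEN, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(buf, LEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(buf, LEN, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }

        double t = MPI_Wtime() - t0;
        if (rank == 0)       /* 2*LEN bytes move per round trip */
            printf("bandwidth ~ %.1f MB/s\n", 2.0 * LEN * NREP / t / 1e6);

        free(buf);
        MPI_Finalize();
        return 0;
    }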


Author(s): Wesley Petersen, Peter Arbenz

Since Gordon Moore (an Intel founder) first proposed it in 1965, his law [107] that the number of transistors on microprocessors doubles roughly every one to two years has proven remarkably astute. Its corollary, that central processing unit (CPU) performance would also double every two years or so, has also remained prescient. Figure 1.1 shows Intel microprocessor data on the number of transistors beginning with the 4004 in 1972. Figure 1.2 indicates that when one includes multi-processor machines and algorithmic development, computer performance is actually better than Moore's two-year performance doubling time estimate. Alas, in recent years a disagreeable mismatch between CPU and memory performance has developed: CPUs now outperform memory systems by orders of magnitude, according to some reckonings [71]. This is not completely accurate, of course: it is mostly a matter of cost. In the 1980s and 1990s, Cray Research Y-MP series machines had well-balanced CPU-to-memory performance. Likewise, NEC (Nippon Electric Corp.), using CMOS (see glossary, Appendix F) and direct memory access, has well-balanced CPU/memory performance. ECL (see glossary, Appendix F) and CMOS static random access memory (SRAM) systems were and remain expensive and, like their CPU counterparts, have to be carefully kept cool. Worse, because they have to be cooled, close packing is difficult and such systems tend to have small storage per volume. Almost any personal computer (PC) these days has a much larger memory than the supercomputer memory systems of the 1980s or early 1990s. In consequence, nearly all memory systems these days are hierarchical, frequently with multiple levels of cache. Figure 1.3 shows the diverging trends between CPU and memory performance. Dynamic random access memory (DRAM) in some variety has become standard for bulk memory. There are many projects and ideas about how to close this performance gap, for example, the IRAM [78] and RDRAM [85] projects. We are confident that this disparity between CPU and memory access performance will eventually be tightened, but in the meantime we must deal with the world as it is.
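To see why the memory hierarchy matters in practice, the following small C sketch (ours, not from the book) sweeps the same array once with unit stride and once with a large stride; on a cached machine the second sweep is typically much slower even though both touch exactly the same data. The array size and stride are arbitrary assumptions.

    /* Small illustration (not from the book) of cache effects:
       time a unit-stride sweep versus a large-stride sweep. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define N      (1 << 24)   /* 16M floats, larger than typical caches */
    #define STRIDE 4096        /* arbitrary large stride */

    static double sweep(const float *a, int stride)
    {
        clock_t t0 = clock();
        volatile float s = 0.0f;   /* volatile keeps the loop from being removed */
        for (int k = 0; k < stride; k++)
            for (int i = k; i < N; i += stride)
                s += a[i];
        return (double)(clock() - t0) / CLOCKS_PER_SEC;
    }

    int main(void)
    {
        float *a = calloc(N, sizeof *a);
        printf("unit stride:   %.3f s\n", sweep(a, 1));
        printf("stride %5d: %.3f s\n", STRIDE, sweep(a, STRIDE));
        free(a);
        return 0;
    }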

