Shared Memory Parallelism
Shared memory machines typically have relatively few processors, say 2–128. Two intrinsic characteristics of these machines are a strategy for memory coherence and a fast, tightly coupled network for distributing data from a commonly accessible memory system. Our test examples were run on two HP Superdome clusters: Stardust is a production machine with 64 PA-8700 processors, and Pegasus is a 32-CPU machine with the same kind of processors. The HP9000 is organized into cells, each with 4 CPUs and a common per-cell memory, connected to a ccNUMA crossbar network. The network consists of sets of 4×4 crossbars and is shown in Figure 4.2.
An effective bandwidth test, the EFF_BW benchmark [116], groups processors into two equally sized sets. Arbitrary pairings are made between elements from each group (Figure 4.3), and the cross-sectional bandwidth of the network is measured for a fixed number of processors and varying message sizes. The results from the HP9000 machine Stardust are shown in Figure 4.4. It is clear from this figure that the cross-sectional bandwidth of the network is quite high. Although not apparent from Figure 4.4, the latency for this test (the intercept near Message Size = 0) is not high. Because of the low incremental resolution of MPI_Wtime, multiple test runs must be done to quantify the latency. Dr Byrde’s tests show that the minimum latency is ≳ 1.5 μs.
A clearer example of a shared memory architecture is the Cray X1 machine, shown in Figures 4.5 and 4.6. In Figure 4.6, the shared memory design is obvious. Each multi-streaming processor (MSP) shown in Figure 4.5 has 4 processors (custom-designed processor chips fabricated by IBM) and 4 corresponding caches. Although not clear from available diagrams, vector memory access apparently permits cache bypass; hence the term streaming in MSP. That is, vector registers are loaded directly from memory: see, for example, Figure 3.4.
Each board (called a node) holds 4 such MSPs and 16 memory modules, which share a common (coherent) memory view. Coherence is maintained only within each board, not across multi-board systems.
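To make the effective bandwidth test described above more concrete, the following is a minimal sketch in Python, not the EFF_BW benchmark code itself. It illustrates two ideas from the text: splitting the processors into two equally sized groups with arbitrary pairings between them, and reading the latency off as the intercept at message size zero. The function names `random_pairs` and `fit_latency_bandwidth` are hypothetical, and the straight-line model t = latency + size / bandwidth is an assumption made for illustration. The timings passed to the fit would come from averaging many repeated transfers, since a single transfer may be shorter than the timer resolution.

```python
import random


def random_pairs(num_procs, seed=0):
    """Split ranks 0..num_procs-1 into two equal groups and pair them at random.

    Hypothetical helper: the real benchmark may choose its pairings differently.
    """
    if num_procs % 2 != 0:
        raise ValueError("need an even number of processors")
    half = num_procs // 2
    group_a = list(range(half))
    group_b = list(range(half, num_procs))
    # Shuffle one group so each rank in group_a gets an arbitrary partner.
    random.Random(seed).shuffle(group_b)
    return list(zip(group_a, group_b))


def fit_latency_bandwidth(sizes, times):
    """Least-squares fit of the assumed model t = latency + size / bandwidth.

    sizes: message sizes in bytes; times: averaged transfer times in seconds.
    Returns (latency in seconds, bandwidth in bytes per second).
    """
    n = len(sizes)
    mean_s = sum(sizes) / n
    mean_t = sum(times) / n
    sxx = sum((s - mean_s) ** 2 for s in sizes)
    sxt = sum((s - mean_s) * (t - mean_t) for s, t in zip(sizes, times))
    slope = sxt / sxx                    # seconds per byte
    latency = mean_t - slope * mean_s    # intercept at message size 0
    return latency, 1.0 / slope


if __name__ == "__main__":
    # Synthetic timings for illustration only; these are not measurements.
    sizes = [0, 1024, 4096, 16384, 65536]
    times = [2.0e-6 + s / 5.0e8 for s in sizes]
    print(random_pairs(8))
    print(fit_latency_bandwidth(sizes, times))
```

In a real measurement each entry of `times` would be the total time for many back-to-back transfers of the same size divided by the repetition count, which is one way to work around the coarse resolution of a timer such as MPI_Wtime.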