In the last years, embedded systems have evolved so that they offer capabilities we could only find before in high performance systems. Portable devices already have multiprocessors on-chip (such as PowerPC 476FP or ARM Cortex A9 MP), usually multi-threaded, and a powerful multi-level cache memory hierarchy on-chip. As most of these systems are battery-powered, the power consumption becomes a critical issue. Achieving high performance and low power consumption is a high complexity challenge where some proposals have been already made. Suarez et al. proposed a new cache hierarchy on-chip, the LP-NUCA (Low Power NUCA), which is able to reduce the access latency taking advantage of NUCA (Non-Uniform Cache Architectures) properties. The key points are decoupling the functionality, and utilizing three specialized networks on-chip. This structure has been proved to be efficient for data hierarchies, achieving a good performance and reducing the energy consumption. On the other hand, instruction caches have different requirements and characteristics than data caches, contradicting the low-power embedded systems requirements, especially in SMT (simultaneous multi-threading) environments. We want to study the benefits of utilizing small tiled caches for the instruction hierarchy, so we propose a new design, ID-LP-NUCAs. Thus, we need to re-evaluate completely our previous design in terms of structure design, interconnection networks (including topologies, flow control and routing), content management (with special interest in hardware/software content allocation policies), and structure sharing. In CMP environments (chip multiprocessors) with parallel workloads, coherence plays an important role, and must be taken into consideration.
Index-shuffle graphs are introduced as candidate interconnection networks for parallel computers. The comparative advantages of index-shuffle graphs over the standard bounded-degree "approximations" of the hypercube, namely butterfly-like and shuffle-like graphs, are demonstrated in the theoretical framework of graph embedding and network emulations. An N-node index-shuffle graph emulates: • an N-node shuffle-exchange graph with no slowdown, which the currently best emulations of shuffle-like graphs by hypercubes and butterflies incur a slowdown of Ω( log N). • its like-sized butterfly graph with a slowdown O( log log log N), while the currently best emulations of butterfly-like graphs by shuffle-like graphs incur a slowdown of Ω( log log N). • an N-node hypercube that executes an on-line leveled algorithm with a slowdown O( log log N), while the slowdown of currently best such emulations of the hypercube by its bounded-degree shuffle-like and butterfly-like derivatives remains Ω( log N). Our emulation is based on an embedding of an N-node hypercube into an N-node index-shuffle graph with dilation O( log log N), while the currently best embeddings of the hypercube into its bounded-degree shuffle-like and butterfly-like derivatives incur a dilation of Ω( log N).