Studying the impact of hardware prefetching and bandwidth partitioning in chip-multiprocessors

Author(s):  
Fang Liu ◽  
Yan Solihin

2009 ◽  
Vol 17 (1-2) ◽  
pp. 59-76

Author(s):  
Alejandro Rico ◽  
Alex Ramirez ◽  
Mateo Valero

There is a clear industrial trend towards chip multiprocessors (CMP) as the most power-efficient way of further increasing performance. Heterogeneous CMP architectures take one more step along this power-efficiency trend by using multiple types of processors, tailored to the workloads they will execute. Programming these CMP architectures has been identified as one of the main challenges in the near future, and programming heterogeneous systems is even more challenging. High-level programming models that let the programmer identify parallel tasks, with runtime management of the inter-task dependencies, have been identified as a suitable model for programming such heterogeneous CMP architectures. In this paper we analyze the performance of Cell Superscalar, a task-based programming model for the Cell Broadband Engine Architecture, in terms of its scalability to a higher number of on-chip processors. Our results show that the low performance of the PPE component limits the scalability of some applications to fewer than 16 processors. Since the PPE has been identified as the limiting element, we perform a set of simulation studies evaluating the impact of out-of-order execution, branch prediction and larger caches on the task management overhead. We conclude that out-of-order execution is a very desirable feature, since it increases task management performance by 50%. We also identify memory latency as a fundamental performance factor, even though the working set is not large. We expect a significant performance impact if task management ran from a fast private memory holding the task dependency graph instead of relying on the cache hierarchy.
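The task-based model described above can be sketched as follows: a runtime records each task and its data dependencies in a task dependency graph and dispatches a task only once all of its predecessors have completed. This is a minimal illustrative sketch in the spirit of Cell Superscalar; the class and method names are assumptions, not CellSs APIs.

```python
# Minimal sketch of a task-dependency-graph runtime (illustrative only;
# not the CellSs implementation). Tasks become ready when all of their
# predecessors have finished, mirroring runtime dependency management.
from collections import defaultdict, deque

class TaskGraph:
    def __init__(self):
        self.deps = defaultdict(set)    # task -> unfinished predecessors
        self.succ = defaultdict(list)   # task -> tasks waiting on it
        self.ready = deque()            # tasks with no pending predecessors
        self.order = []                 # dispatch order (for inspection)

    def add_task(self, name, after=()):
        preds = set(after)
        self.deps[name] = preds
        for p in preds:
            self.succ[p].append(name)
        if not preds:
            self.ready.append(name)

    def run(self):
        # Dispatch ready tasks; completing a task may unblock successors.
        while self.ready:
            t = self.ready.popleft()
            self.order.append(t)        # stand-in for executing t on an SPE
            for s in self.succ[t]:
                self.deps[s].discard(t)
                if not self.deps[s]:
                    self.ready.append(s)
        return self.order
```

For example, adding tasks `load`, `compute` (after `load`) and `store` (after `compute`) yields the dispatch order `load`, `compute`, `store`. In a real runtime this bookkeeping is exactly the overhead that the abstract attributes to the PPE.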


Multicore systems are scaling to keep pace with processor technology and growing processing demands. These chip multiprocessors (CMPs) share on-chip and off-chip resources to achieve higher performance. As core integration increases, workload performance becomes dependent on how those resources are allocated. Although CMPs elevate performance, the subtle interactions of several applications contending for shared resources can degrade performance by several orders of magnitude. To this end, this work performs a two-fold evaluation of application programs. Applications running on CMP cores exhibit distinct behavior in their consumption of shared resources. First, the work characterizes the resource-centric nature of the SPEC CPU2006 benchmarks based on their resource consumption behavior. Second, it evaluates the effect of inter-core interference on the performance of application programs, based on the obtained characterization and the potential contention caused by co-runners. Finally, we make significant remarks on the performance impact of resource sharing and its implications for resource contention.
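A resource-centric characterization of the kind described above might, for instance, bucket workloads by counter-derived metrics. The sketch below is purely illustrative, not the paper's methodology: the metrics (LLC misses per kilo-instruction and off-chip bandwidth) and the thresholds are assumptions chosen for the example.

```python
# Illustrative classification of workloads by shared-resource consumption.
# The metric names and thresholds are hypothetical, not taken from the paper.
def classify(mpki, bandwidth_gbs):
    """mpki: LLC misses per kilo-instruction; bandwidth_gbs: off-chip GB/s."""
    if mpki < 1.0:
        return "core-bound"           # fits in private caches; low sharing pressure
    if bandwidth_gbs > 5.0:
        return "bandwidth-intensive"  # streams through the LLC, stresses the bus
    return "cache-sensitive"          # high miss rate; benefits from more LLC share
```

Under such a scheme, a co-runner labelled bandwidth-intensive would be expected to inflict the most interference on a cache-sensitive neighbor, which is the kind of pairing effect the evaluation measures.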


2009 ◽  
Vol 37 (1) ◽  
pp. 169-180 ◽  
Author(s):  
Ayse K. Coskun ◽  
Richard Strong ◽  
Dean M. Tullsen ◽  
Tajana Simunic Rosing

2018 ◽  
Vol 17 (2) ◽  
pp. 175-178 ◽  
Author(s):  
Laith M. AlBarakat ◽  
Paul V. Gratz ◽  
Daniel A. Jimenez

Author(s):  
Guillaume Aupy ◽  
Anne Benoit ◽  
Brice Goglin ◽  
Loïc Pottier ◽  
Yves Robert

With the recent advent of many-core architectures such as chip multiprocessors (CMPs), the number of processing units accessing a global shared memory is constantly increasing. Co-scheduling techniques are used to improve application throughput on such architectures, but sharing resources often generates critical interferences. In this article, we focus on the interferences in the last level of cache (LLC) and use the Cache Allocation Technology (CAT) recently provided by Intel to partition the LLC and give each co-scheduled application its own cache area. We consider m iterative HPC applications running concurrently and answer the following questions: (i) how to precisely model the behavior of these applications on the cache-partitioned platform? and (ii) how many cores and cache fractions should be assigned to each application to maximize platform efficiency? Here, platform efficiency is defined as maximizing performance either globally, or while guaranteeing a fixed ratio of iterations per second for each application. Through extensive experiments using CAT, we demonstrate the impact of cache partitioning when multiple HPC applications are co-scheduled onto CMP platforms.
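Question (ii) above is an allocation search: split the cores and the LLC cache fractions (e.g., CAT ways) among the m applications to maximize aggregate throughput. A brute-force version of that search can be sketched as below; the per-application performance model `perf` is a placeholder assumption, not the article's model.

```python
# Exhaustive search for the core/cache-way split maximizing summed
# throughput across co-scheduled applications. Illustrative sketch only;
# perf(app, cores, ways) -> iterations/second is a user-supplied model.
from itertools import product

def splits(total, parts):
    """Yield all ways to split `total` units among `parts` apps, each >= 1."""
    if parts == 1:
        yield (total,)
        return
    for first in range(1, total - parts + 2):
        for rest in splits(total - first, parts - 1):
            yield (first,) + rest

def best_allocation(apps, cores, ways, perf):
    best, best_val = None, float("-inf")
    for cs, ws in product(splits(cores, len(apps)), splits(ways, len(apps))):
        val = sum(perf(a, c, w) for a, c, w in zip(apps, cs, ws))
        if val > best_val:
            best, best_val = list(zip(apps, cs, ws)), val
    return best, best_val
```

For two applications sharing 4 cores and 4 cache ways under a toy model `perf = cores * ways`, the search concentrates cores and ways on one application, illustrating why the "fixed iterations-per-second ratio" objective in the article needs an explicit fairness constraint rather than plain throughput maximization.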

