Studying the impact of hardware prefetching and bandwidth partitioning in chip-multiprocessors

Author(s):  
Fang Liu ◽  
Yan Solihin

2009 ◽  
Vol 17 (1-2) ◽  
pp. 59-76

Author(s):  
Alejandro Rico ◽  
Alex Ramirez ◽  
Mateo Valero

There is a clear industrial trend towards chip multiprocessors (CMP) as the most power-efficient way of further increasing performance. Heterogeneous CMP architectures take one more step along this power-efficiency trend by using multiple types of processors, tailored to the workloads they will execute. Programming these CMP architectures has been identified as one of the main challenges in the near future, and programming heterogeneous systems is even more challenging. High-level programming models that let the programmer identify parallel tasks, with runtime management of the inter-task dependencies, have been identified as a suitable model for programming such heterogeneous CMP architectures. In this paper we analyze the performance of Cell Superscalar, a task-based programming model for the Cell Broadband Engine Architecture, in terms of its scalability to a higher number of on-chip processors. Our results show that the low performance of the PPE component limits the scalability of some applications to fewer than 16 processors. Since the PPE has been identified as the limiting element, we perform a set of simulation studies evaluating the impact of out-of-order execution, branch prediction and larger caches on the task management overhead. We conclude that out-of-order execution is a very desirable feature, since it increases task management performance by 50%. We also identify memory latency as a fundamental performance factor, even though the working set is not large. We expect a significant performance impact if task management ran from a fast private memory holding the task dependency graph instead of relying on the cache hierarchy.
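The task-based model described above can be sketched as follows: a runtime records each task and its data dependencies in a task dependency graph and dispatches a task only once all of its predecessors have completed. This is a minimal illustrative sketch in the spirit of Cell Superscalar; the class and method names are assumptions, not CellSs APIs.

```python
# Minimal sketch of a task-dependency-graph runtime (illustrative only;
# not the CellSs implementation). Tasks become ready when all of their
# predecessors have finished, mirroring runtime dependency management.
from collections import defaultdict, deque

class TaskGraph:
    def __init__(self):
        self.deps = defaultdict(set)    # task -> unfinished predecessors
        self.succ = defaultdict(list)   # task -> tasks waiting on it
        self.ready = deque()            # tasks with no pending predecessors
        self.order = []                 # dispatch order (for inspection)

    def add_task(self, name, after=()):
        preds = set(after)
        self.deps[name] = preds
        for p in preds:
            self.succ[p].append(name)
        if not preds:
            self.ready.append(name)

    def run(self):
        # Dispatch ready tasks; completing a task may unblock successors.
        while self.ready:
            t = self.ready.popleft()
            self.order.append(t)        # stand-in for executing t on an SPE
            for s in self.succ[t]:
                self.deps[s].discard(t)
                if not self.deps[s]:
                    self.ready.append(s)
        return self.order
```

For example, adding tasks `load`, `compute` (after `load`) and `store` (after `compute`) yields the dispatch order `load`, `compute`, `store`. In a real runtime this bookkeeping is exactly the overhead that the abstract attributes to the PPE.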


Multicore systems are scaling to keep pace with processor technology and growing processing demands. These chip multiprocessors (CMPs) share on-chip and off-chip resources to achieve higher performance. As core integration increases, workload performance becomes dependent on how those resources are allocated. Although CMPs elevate performance, the subtle interactions of several applications contending for shared resources can degrade performance by several orders of magnitude. To this end, this work performs a two-fold evaluation of application programs. Applications running on CMP cores exhibit distinct behavior in their consumption of shared resources. First, the work characterizes the resource-centric nature of the SPEC CPU2006 benchmarks based on their resource consumption behavior. Second, it evaluates the effect of inter-core interference on the performance of application programs, based on the obtained characterization and the potential contention caused by co-runners. Finally, we make significant remarks on the performance impact of resource sharing and its implications for resource contention.
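A resource-centric characterization of the kind described above might, for instance, bucket workloads by counter-derived metrics. The sketch below is purely illustrative, not the paper's methodology: the metrics (LLC misses per kilo-instruction and off-chip bandwidth) and the thresholds are assumptions chosen for the example.

```python
# Illustrative classification of workloads by shared-resource consumption.
# The metric names and thresholds are hypothetical, not taken from the paper.
def classify(mpki, bandwidth_gbs):
    """mpki: LLC misses per kilo-instruction; bandwidth_gbs: off-chip GB/s."""
    if mpki < 1.0:
        return "core-bound"           # fits in private caches; low sharing pressure
    if bandwidth_gbs > 5.0:
        return "bandwidth-intensive"  # streams through the LLC, stresses the bus
    return "cache-sensitive"          # high miss rate; benefits from more LLC share
```

Under such a scheme, a co-runner labelled bandwidth-intensive would be expected to inflict the most interference on a cache-sensitive neighbor, which is the kind of pairing effect the evaluation measures.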


2009 ◽  
Vol 37 (1) ◽  
pp. 169-180 ◽  
Author(s):  
Ayse K. Coskun ◽  
Richard Strong ◽  
Dean M. Tullsen ◽  
Tajana Simunic Rosing

2018 ◽  
Vol 17 (2) ◽  
pp. 175-178 ◽  
Author(s):  
Laith M. AlBarakat ◽  
Paul V. Gratz ◽  
Daniel A. Jimenez

Author(s):  
Guillaume Aupy ◽  
Anne Benoit ◽  
Brice Goglin ◽  
Loïc Pottier ◽  
Yves Robert

With the recent advent of many-core architectures such as chip multiprocessors (CMPs), the number of processing units accessing a global shared memory is constantly increasing. Co-scheduling techniques are used to improve application throughput on such architectures, but sharing resources often generates critical interferences. In this article, we focus on the interferences in the last level of cache (LLC) and use the Cache Allocation Technology (CAT) recently provided by Intel to partition the LLC and give each co-scheduled application its own cache area. We consider m iterative HPC applications running concurrently and answer the following questions: (i) how to precisely model the behavior of these applications on the cache-partitioned platform? and (ii) how many cores and cache fractions should be assigned to each application to maximize platform efficiency? Here, platform efficiency is defined as maximizing performance either globally, or while guaranteeing a fixed ratio of iterations per second for each application. Through extensive experiments using CAT, we demonstrate the impact of cache partitioning when multiple HPC applications are co-scheduled onto CMP platforms.
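Question (ii) above is an allocation search: split the cores and the LLC cache fractions (e.g., CAT ways) among the m applications to maximize aggregate throughput. A brute-force version of that search can be sketched as below; the per-application performance model `perf` is a placeholder assumption, not the article's model.

```python
# Exhaustive search for the core/cache-way split maximizing summed
# throughput across co-scheduled applications. Illustrative sketch only;
# perf(app, cores, ways) -> iterations/second is a user-supplied model.
from itertools import product

def splits(total, parts):
    """Yield all ways to split `total` units among `parts` apps, each >= 1."""
    if parts == 1:
        yield (total,)
        return
    for first in range(1, total - parts + 2):
        for rest in splits(total - first, parts - 1):
            yield (first,) + rest

def best_allocation(apps, cores, ways, perf):
    best, best_val = None, float("-inf")
    for cs, ws in product(splits(cores, len(apps)), splits(ways, len(apps))):
        val = sum(perf(a, c, w) for a, c, w in zip(apps, cs, ws))
        if val > best_val:
            best, best_val = list(zip(apps, cs, ws)), val
    return best, best_val
```

For two applications sharing 4 cores and 4 cache ways under a toy model `perf = cores * ways`, the search concentrates cores and ways on one application, illustrating why the "fixed iterations-per-second ratio" objective in the article needs an explicit fairness constraint rather than plain throughput maximization.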

