Is Reuse Distance Applicable to Data Locality Analysis on Chip Multiprocessors?

Author(s):  
Yunlian Jiang ◽  
Eddy Z. Zhang ◽  
Kai Tian ◽  
Xipeng Shen
2014 ◽  
Vol 13 (4) ◽  
pp. 1-36 ◽  
Author(s):  
Luis Angel D. Bathen ◽  
Nikil D. Dutt

2021 ◽  
Vol 64 (6) ◽  
pp. 107-116
Author(s):  
Yakun Sophia Shao ◽  
Jason Clemons ◽  
Rangharajan Venkatesan ◽  
Brian Zimmer ◽  
Matthew Fojtik ◽  
...  

Package-level integration using multi-chip modules (MCMs) is a promising approach for building large-scale systems. Compared to a large monolithic die, an MCM combines many smaller chiplets into a larger system, substantially reducing fabrication and design costs. Current MCMs typically contain only a handful of coarse-grained, large chiplets because of the high area, performance, and energy overheads associated with inter-chiplet communication. This work investigates and quantifies the costs and benefits of using MCMs with fine-grained chiplets for deep-learning inference, an application domain with large compute and on-chip storage requirements. To evaluate the approach, we architected, implemented, fabricated, and tested Simba, a 36-chiplet prototype MCM system for deep-learning inference. Each chiplet achieves 4 TOPS peak performance, and the 36-chiplet MCM package achieves up to 128 TOPS and up to 6.1 TOPS/W. The MCM is configurable to support a flexible mapping of DNN layers to the distributed compute and storage units. To mitigate inter-chiplet communication overheads, we introduce three tiling optimizations that improve data locality. These optimizations achieve up to 16% speedup compared to the baseline layer mapping. Our evaluation shows that Simba can process 1988 images/s running ResNet-50 with a batch size of one, delivering an inference latency of 0.50 ms.
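To make the tiling intuition concrete, the following is a minimal sketch, not the Simba implementation, that estimates per-chiplet activation traffic for one fully-connected layer mapped across a 36-chiplet grid; the layer dimensions (C, K) and the simple traffic model are illustrative assumptions only.

```python
# Hypothetical sketch (not the Simba implementation): estimate per-chiplet
# off-die activation traffic for two ways of tiling one fully-connected layer
# (C inputs -> K outputs) across an N x N grid of chiplets.

N = 6               # 6 x 6 = 36 chiplets, as in the prototype package
C, K = 4096, 4096   # assumed layer dimensions, for illustration only

def traffic_per_chiplet(row_split, col_split):
    """Activations a chiplet moves across package links, assuming the inputs
    are split over `col_split` chiplet columns, the outputs over `row_split`
    rows, weights stay resident, and each activation counts as one element."""
    fetched_inputs = C // col_split    # input slice this chiplet consumes
    emitted_outputs = K // row_split   # (partial) outputs it must send on
    return fetched_inputs + emitted_outputs

# 1D mapping: split only the outputs, so every chiplet fetches all C inputs.
naive = traffic_per_chiplet(row_split=N * N, col_split=1)
# 2D tiling: split both inputs and outputs across the grid, so each chiplet
# touches a much smaller slice of the activations.
tiled = traffic_per_chiplet(row_split=N, col_split=N)

print(f"naive 1D mapping: {naive} activations per chiplet")
print(f"2D tiled mapping: {tiled} activations per chiplet")
```

Under these assumptions the 2D tiling moves roughly a third of the activations per chiplet compared with the 1D mapping, which is the kind of locality gain the tiling optimizations target.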


2004 ◽  
Vol 32 (3) ◽  
pp. 11-18 ◽  
Author(s):  
Partha Kundu ◽  
Murali Annavaram ◽  
Trung Diep ◽  
John Shen

2009 ◽  
Vol 18 (02) ◽  
pp. 255-269 ◽  
Author(s):  
Jun Ho Bahn ◽  
Jung Sook Yang ◽  
Wen-Hsiang Hu ◽  
Nader Bagherzadeh

This paper presents parallel FFT algorithms with different degrees of computation and communication overhead for multiprocessors in a Network-on-Chip (NoC) environment. Of the three parallel FFT algorithms presented, two are proposed for a 2D NoC that can contain a variable number of processing elements (PEs), and the third serves as a reference parallel FFT algorithm for comparison. The first proposed algorithm increases performance by assigning well-balanced computation tasks to the PEs. Execution times are reduced because the algorithm exploits data locality to avoid unnecessary data exchanges among PEs and removes overall idle periods through balanced task scheduling. An enhanced version of this algorithm further reduces communication traffic: the step of returning transformed data to the originating PE after each computation stage, before sending it on to the next PE for the following stage, is eliminated, and we propose a method that nevertheless preserves the regularity of the data communication and of the computations with twiddle factors. According to simulation results from our cycle-accurate SystemC NoC model with a parameterizable 2-D mesh architecture, and an analysis of the algorithms' time and complexity, the proposed algorithms outperform both the reference parallel FFT algorithm and FFT implementations on TI Digital Signal Processors (DSPs) with specifications similar to our simulation environment.
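As a rough illustration of how a distributed FFT splits work across PEs, the sketch below, an assumption-laden simplification rather than the paper's algorithm, enumerates the butterfly stages of a radix-2 decimation-in-frequency FFT partitioned over P PEs holding contiguous blocks, and marks which stages stay PE-local and which require an inter-PE exchange; N, P, and the block distribution are hypothetical choices.

```python
import math

# Hypothetical sketch (not the paper's algorithm): for an N-point radix-2
# decimation-in-frequency FFT distributed over P PEs, each holding a
# contiguous block of N // P samples, list which butterfly stages stay
# PE-local and which require an inter-PE data exchange.

N, P = 1024, 16              # assumed transform size and number of PEs
block = N // P               # samples owned by each PE
stages = int(math.log2(N))

for s in range(stages):
    span = N >> (s + 1)      # distance between butterfly partners at stage s
    if span < block:
        kind = "local (no inter-PE traffic)"
    else:
        partner = span // block          # PE-index distance to the partner
        kind = f"remote (exchange with PE +/- {partner})"
    print(f"stage {s:2d}: butterfly span {span:4d} -> {kind}")
```

With these parameters only the first log2(P) stages need inter-PE exchanges; the remaining stages operate entirely on data already resident in each PE, which is why avoiding unnecessary round trips between stages matters for overall traffic.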

