Is Reuse Distance Applicable to Data Locality Analysis on Chip Multiprocessors?

Author(s):  
Yunlian Jiang ◽  
Eddy Z. Zhang ◽  
Kai Tian ◽  
Xipeng Shen
2014 ◽  
Vol 13 (4) ◽  
pp. 1-36 ◽  
Author(s):  
Luis Angel D. Bathen ◽  
Nikil D. Dutt

2021 ◽  
Vol 64 (6) ◽  
pp. 107-116
Author(s):  
Yakun Sophia Shao ◽  
Jason Clemons ◽  
Rangharajan Venkatesan ◽  
Brian Zimmer ◽  
Matthew Fojtik ◽  
...  

Package-level integration using multi-chip modules (MCMs) is a promising approach for building large-scale systems. Compared to a large monolithic die, an MCM combines many smaller chiplets into a larger system, substantially reducing fabrication and design costs. Current MCMs typically contain only a handful of coarse-grained, large chiplets because of the high area, performance, and energy overheads associated with inter-chiplet communication. This work investigates and quantifies the costs and benefits of using MCMs with fine-grained chiplets for deep-learning inference, an application domain with large compute and on-chip storage requirements. To evaluate the approach, we architected, implemented, fabricated, and tested Simba, a 36-chiplet prototype MCM system for deep-learning inference. Each chiplet achieves 4 TOPS peak performance, and the 36-chiplet MCM package achieves up to 128 TOPS and up to 6.1 TOPS/W. The MCM is configurable to support a flexible mapping of DNN layers to the distributed compute and storage units. To mitigate inter-chiplet communication overheads, we introduce three tiling optimizations that improve data locality. These optimizations achieve up to 16% speedup compared to the baseline layer mapping. Our evaluation shows that Simba can process 1988 images/s running ResNet-50 with a batch size of one, delivering an inference latency of 0.50 ms.
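To make the tiling intuition concrete, the following is a minimal sketch, not the Simba implementation, that estimates per-chiplet activation traffic for one fully-connected layer mapped across a 36-chiplet grid; the layer dimensions (C, K) and the simple traffic model are illustrative assumptions only.

```python
# Hypothetical sketch (not the Simba implementation): estimate per-chiplet
# off-die activation traffic for two ways of tiling one fully-connected layer
# (C inputs -> K outputs) across an N x N grid of chiplets.

N = 6               # 6 x 6 = 36 chiplets, as in the prototype package
C, K = 4096, 4096   # assumed layer dimensions, for illustration only

def traffic_per_chiplet(row_split, col_split):
    """Activations a chiplet moves across package links, assuming the inputs
    are split over `col_split` chiplet columns, the outputs over `row_split`
    rows, weights stay resident, and each activation counts as one element."""
    fetched_inputs = C // col_split    # input slice this chiplet consumes
    emitted_outputs = K // row_split   # (partial) outputs it must send on
    return fetched_inputs + emitted_outputs

# 1D mapping: split only the outputs, so every chiplet fetches all C inputs.
naive = traffic_per_chiplet(row_split=N * N, col_split=1)
# 2D tiling: split both inputs and outputs across the grid, so each chiplet
# touches a much smaller slice of the activations.
tiled = traffic_per_chiplet(row_split=N, col_split=N)

print(f"naive 1D mapping: {naive} activations per chiplet")
print(f"2D tiled mapping: {tiled} activations per chiplet")
```

Under these assumptions the 2D tiling moves roughly a third of the activations per chiplet compared with the 1D mapping, which is the kind of locality gain the tiling optimizations target.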


2004 ◽  
Vol 32 (3) ◽  
pp. 11-18 ◽  
Author(s):  
Partha Kundu ◽  
Murali Annavaram ◽  
Trung Diep ◽  
John Shen

2009 ◽  
Vol 18 (02) ◽  
pp. 255-269 ◽  
Author(s):  
Jun Ho Bahn ◽  
Jung Sook Yang ◽  
Wen-Hsiang Hu ◽  
Nader Bagherzadeh

This paper presents parallel FFT algorithms with different degrees of computation and communication overhead for multiprocessors in a Network-on-Chip (NoC) environment. Of the three parallel FFT algorithms presented, two are proposed for a 2D NoC that can contain a variable number of processing elements (PEs), and the third serves as a reference parallel FFT algorithm for comparison. The first proposed algorithm increases performance by assigning well-balanced computation tasks to the PEs. Execution times are reduced because the algorithm exploits data locality to avoid unnecessary data exchanges among PEs and removes overall idle periods through balanced task scheduling. An enhanced version of this algorithm further reduces communication traffic: the step of returning transformed data to the originating PE after each computation stage, before sending it on to the next PE for the following stage, is eliminated, and we propose a method that nevertheless preserves the regularity of the data communication and of the computations with twiddle factors. According to simulation results from our cycle-accurate SystemC NoC model with a parameterizable 2-D mesh architecture, and an analysis of the algorithms' time and complexity, the proposed algorithms outperform both the reference parallel FFT algorithm and FFT implementations on TI Digital Signal Processors (DSPs) with specifications similar to our simulation environment.
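As a rough illustration of how a distributed FFT splits work across PEs, the sketch below, an assumption-laden simplification rather than the paper's algorithm, enumerates the butterfly stages of a radix-2 decimation-in-frequency FFT partitioned over P PEs holding contiguous blocks, and marks which stages stay PE-local and which require an inter-PE exchange; N, P, and the block distribution are hypothetical choices.

```python
import math

# Hypothetical sketch (not the paper's algorithm): for an N-point radix-2
# decimation-in-frequency FFT distributed over P PEs, each holding a
# contiguous block of N // P samples, list which butterfly stages stay
# PE-local and which require an inter-PE data exchange.

N, P = 1024, 16              # assumed transform size and number of PEs
block = N // P               # samples owned by each PE
stages = int(math.log2(N))

for s in range(stages):
    span = N >> (s + 1)      # distance between butterfly partners at stage s
    if span < block:
        kind = "local (no inter-PE traffic)"
    else:
        partner = span // block          # PE-index distance to the partner
        kind = f"remote (exchange with PE +/- {partner})"
    print(f"stage {s:2d}: butterfly span {span:4d} -> {kind}")
```

With these parameters only the first log2(P) stages need inter-PE exchanges; the remaining stages operate entirely on data already resident in each PE, which is why avoiding unnecessary round trips between stages matters for overall traffic.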

