ICLA Unit: Intra-Cluster Locality-Aware Unit to Reduce L2 Access and NoC Pressure in GPGPUs

Author(s):  
Siamak Biglari Ardabili ◽  
Gholamreza Zare Fatin

As the number of streaming multiprocessors (SMs) in GPUs increases to deliver better performance, the reply network faces heavy traffic. This causes congestion in the Network-on-Chip (NoC) routers and the memory controller (MC) buffers. Because cooperative thread arrays (CTAs) are scheduled locally within clusters, there is a high probability of finding a copy of the same data in the L1 cache of another SM in the same cluster. To exploit this, SMs need access to the local L1 caches of neighboring SMs. NoC congestion is considerable because of a distinctive traffic pattern called many-to-few-to-many. Because the proposed Intra-Cluster Locality-Aware (ICLA) unit reduces the number of requests, the congested reply traffic becomes a many-to-many pattern, and reply data travels over the less-utilized core-to-core links, which mitigates NoC traffic. The proposed architecture has been evaluated using 15 workloads from the CUDA SDK, Rodinia, and ISPASS2009 benchmarks. The ICLA unit has been modeled and simulated in GPGPU-Sim. The results show an average reduction of about 23.79% (up to 49.82%) in network latency, an average reduction of 15.49% (up to 36.82%) in L2 cache accesses, and an average improvement of 18.18% (up to 58.1%) in instructions per cycle (IPC).
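The lookup order described in the abstract can be illustrated with a toy model. This is only a sketch under stated assumptions, not the authors' ICLA design: the `Cluster` class, the outcome labels, and the unbounded per-SM caches (no capacity limit, eviction, or coherence) are all hypothetical simplifications, and the per-SM cache is assumed to be the private first-level cache.

```python
from typing import Dict, List, Set


class Cluster:
    """Toy cluster of SMs, each with a private cache modeled as a set of line addresses.

    Hypothetical model: caches are unbounded and never evict.
    """

    def __init__(self, num_sms: int) -> None:
        self.caches: List[Set[int]] = [set() for _ in range(num_sms)]
        self.stats: Dict[str, int] = {"local_hit": 0, "neighbor_hit": 0, "l2_access": 0}

    def load(self, sm: int, line: int) -> str:
        """Serve a load from SM `sm` and return where the line was found."""
        if line in self.caches[sm]:
            outcome = "local_hit"
        elif any(line in cache for i, cache in enumerate(self.caches) if i != sm):
            # Served by a neighboring SM in the same cluster, so the request
            # never crosses the NoC to the L2 / memory controller.
            outcome = "neighbor_hit"
        else:
            # No copy inside the cluster: the request goes out to L2.
            outcome = "l2_access"
        self.stats[outcome] += 1
        self.caches[sm].add(line)  # the requesting SM now holds the line
        return outcome
```

In this model, if SM 0 loads a line first and SM 1 then loads the same line, the second request counts as a `neighbor_hit` rather than an `l2_access`, which is the kind of request the abstract says is removed from the reply network.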

2021 ◽  
Vol 20 (3) ◽  
pp. 1-6
Author(s):  
Mohammed Shaba Saliu ◽  
Muyideen Omuya Momoh ◽  
Pascal Uchenna Chinedu ◽  
Wilson Nwankwo ◽  
Aliu Daniel

Network-on-Chip (NoC) has been proposed as a viable solution to the communication challenges of Systems-on-Chip (SoCs). As the communication paradigm of an SoC, a NoC's performance depends mainly on the routing algorithm chosen. In this paper, different categories of routing algorithms were compared: XY routing, OE turn model adaptive routing, DyAD routing, and Age-Aware adaptive routing. The comparison was conducted on a 4 × 4 mesh topology by varying the load, i.e., the Packet Injection Rate (PIR), under a random traffic pattern. The Noxim simulator, a cycle-accurate SystemC-based simulator, was employed. Packet injection was modeled as a Poisson distribution, with first-in-first-out (FIFO) input buffers five (5) flits deep, a flit size of 32 bits, and a packet size of 3 flits. The simulation time was 10,000 cycles. The findings showed that the XY routing algorithm performed better when the PIR was low, whereas the DyAD and Age-Aware routing algorithms performed better when the load, i.e., the PIR, was high.
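As a point of reference for the baseline in this comparison, deterministic XY routing on a 2D mesh can be sketched in a few lines. This is a minimal illustration, not Noxim code: the function names `xy_route` and `node_to_coord` and the row-major node numbering are assumptions made for the example.

```python
from typing import List, Tuple

Coord = Tuple[int, int]  # (x, y) position of a router in the mesh


def node_to_coord(node: int, width: int = 4) -> Coord:
    """Map a node id to mesh coordinates, assuming row-major numbering."""
    return (node % width, node // width)


def xy_route(src: Coord, dst: Coord) -> List[Coord]:
    """Return the routers visited from src to dst under XY routing."""
    x, y = src
    path = [(x, y)]
    # Phase 1: travel along the X dimension until the column matches.
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    # Phase 2: travel along the Y dimension until the row matches.
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path
```

For example, `xy_route(node_to_coord(0), node_to_coord(6))` visits `(0, 0)`, `(1, 0)`, `(2, 0)`, `(2, 1)` on a 4 × 4 mesh. The route never depends on congestion, which is consistent with XY doing well at low PIR and adaptive schemes doing better at high PIR.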


2005 ◽  
Author(s):  
S. Mahadevan ◽  
F. Angiolini ◽  
M. Storgaard ◽  
R.G. Olsen ◽  
J. Sparso ◽  
...  

2018 ◽  
Vol 5 (1) ◽  
pp. 54-57
Author(s):  
Wahyudi Khusnandar ◽  
Fransiscus Ati Halim ◽  
Felix Lokananta

The XY adaptive routing protocol is the routing protocol used in the UTAR NoC communication architecture. This routing algorithm adapts the shortest-path-first algorithm, which cannot work optimally if the closest route no longer has enough bandwidth to forward the packet. In that case, the packet is stored inside the router and forwarded to the nearest router once the closest route has enough bandwidth. This paper suggests a TTL-based routing algorithm to resolve this issue. The TTL-based routing algorithm adapts the XY adaptive routing protocol by adding several parameters to the UTAR NoC RTL and an additional bit to each packet sent by a router. The TTL-based algorithm uses this additional bit and these parameters as extra factors in choosing alternative routes inside the communication architecture. The use of TTL in TTL-based routing differs from the use of TTL in communication networks. Packets whose TTL value equals the maximum TTL are routed using the XY adaptive routing protocol. The TTL-based routing algorithm showed better performance than XY adaptive routing in some of the experiments conducted with the MSCL NoC Traffic Pattern Suite. This research also shows that the TTL-based routing algorithm cannot work optimally on small-scale architectures.


Author(s):  
Shankar Mahadevan ◽  
Federico Angiolini ◽  
Jens Sparsø ◽  
Michael Storgaard ◽  
Jan Madsen ◽  
...  

2019 ◽  
Vol 28 (12) ◽  
pp. 1950202 ◽  
Author(s):  
Khyamling Parane ◽  
B. M. Prabhu Prasad ◽  
Basavaraj Talawar

Many-core systems employ the Network on Chip (NoC) as the underlying communication architecture. To achieve an optimized design for a given application, a fast and flexible NoC simulator is needed. This paper presents an FPGA-based NoC simulation acceleration framework that supports design space exploration of standard and custom NoC topologies over a full set of microarchitectural parameters. The framework supports the design of custom routing algorithms and various traffic patterns, such as uniform random, transpose, bit complement, and random permutation. For conventional NoCs, the standard minimal routing algorithms are supported. For custom topologies, table-based routing has been implemented. A custom topology called diagonal mesh has been evaluated using table-based routing and a novel shortest-path routing algorithm. A congestion-aware adaptive routing algorithm has been proposed to route packets along the minimally congested path. It has negligible FPGA area overhead compared to conventional XY routing and reduces network latency by 55% compared to the XY routing algorithm. Microarchitectural parameters such as buffer depth, traffic pattern, and flit width have been varied to observe their effect on NoC behavior. For the [Formula: see text] mesh topology with a buffer depth of 4, LUT and FF usage increase from 32.23% to 34.45% and from 12.62% to 15%, respectively, when the flit width goes from 16 bits to 32 bits. Similar behavior has been observed for other configurations of buffer depth and flit width. The torus topology consumes 24% more resources than the mesh topology. The 56-node fat tree topology consumes 27% and 2.2% more FPGA resources than the [Formula: see text] mesh and torus topologies, respectively. The 56-node fat tree topology with buffer depths of 8 and 16 flits saturates at injection rates of 40% and 45%, respectively.
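The synthetic traffic patterns named in the abstract have standard textbook definitions, sketched below as destination-selection functions. This is an illustration only and not the framework's implementation: the function names, the row-major node numbering, and the use of a square mesh for transpose are assumptions made for the example.

```python
import random
from typing import List


def bit_complement(src: int, num_nodes: int) -> int:
    """Destination is the bitwise complement of src (num_nodes must be a power of two)."""
    return ~src & (num_nodes - 1)


def transpose(src: int, width: int) -> int:
    """Node (x, y) sends to node (y, x) on a width x width mesh, row-major ids."""
    x, y = src % width, src // width
    return x * width + y


def uniform_random(src: int, num_nodes: int, rng: random.Random) -> int:
    """Pick any node other than src with equal probability."""
    dst = rng.randrange(num_nodes - 1)
    return dst if dst < src else dst + 1


def random_permutation(num_nodes: int, rng: random.Random) -> List[int]:
    """Fixed random one-to-one mapping: node i always sends to perm[i]."""
    perm = list(range(num_nodes))
    rng.shuffle(perm)
    return perm
```

With 16 nodes, `bit_complement(5, 16)` returns `10` (binary 0101 becomes 1010), and `transpose(1, 4)` returns `4`. Passing an explicit `random.Random` instance keeps the random patterns reproducible across runs.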


Electronics ◽  
2019 ◽  
Vol 9 (1) ◽  
pp. 6 ◽  
Author(s):  
Juan Fang ◽  
Tingwen Yu ◽  
Zelin Wei

Multi-core processors integrate multiple computing units on one chip. As this technology matures, communication between cores has become a major research focus. As the number of cores continues to increase, the simple bus structure can no longer meet the needs of multi-core processors. Network on chip (NoC) connects components through routing, which greatly improves communication efficiency. However, its communication power consumption and network latency cannot be ignored. An efficient mapping algorithm is an effective way to reduce both. This paper proposes a mapping method. First, the task is divided according to its scale. When the task scale is small, a given NoC substructure is selected for mapping the task, to reduce the communication distance between resource nodes; when the task scale is large, the tasks are clustered and tasks with dependencies are assigned to the same resource node, to reduce communication between tasks. This is then combined with an improved ant colony optimization (ACO) algorithm for mapping. The proposed method is verified experimentally on NoC platforms of different scales. The experimental results show that the proposed method is very effective at reducing communication power and network latency during NoC mapping.
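To make the effect of mapping on communication concrete, the sketch below scores a task-to-node mapping on a 2D mesh by summing traffic volume times hop distance. This cost function is a common proxy and an assumption here; the abstract does not state the paper's objective, and the names `hop_distance` and `mapping_cost` are hypothetical. It does not implement the ant colony search itself.

```python
from typing import Dict, Tuple


def hop_distance(a: int, b: int, width: int) -> int:
    """Manhattan distance between two nodes of a mesh with row-major ids."""
    ax, ay = a % width, a // width
    bx, by = b % width, b // width
    return abs(ax - bx) + abs(ay - by)


def mapping_cost(
    traffic: Dict[Tuple[str, str], int],
    mapping: Dict[str, int],
    width: int,
) -> int:
    """Sum of (traffic volume x hop distance) over all communicating task pairs."""
    return sum(
        volume * hop_distance(mapping[src], mapping[dst], width)
        for (src, dst), volume in traffic.items()
    )


# Illustrative traffic volumes between three tasks (made-up numbers).
traffic = {("A", "B"): 10, ("B", "C"): 5}

spread = {"A": 0, "B": 15, "C": 3}     # dependent tasks placed far apart
adjacent = {"A": 0, "B": 1, "C": 2}    # dependent tasks on neighboring nodes
clustered = {"A": 0, "B": 0, "C": 1}   # A and B share one resource node
```

On a 4 × 4 mesh these three mappings cost 75, 15, and 5 respectively, which shows why placing dependent tasks on the same or nearby resource nodes lowers communication. A search procedure such as ACO would explore mappings to minimize a cost of this kind.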


2014 ◽  
Vol 2014 ◽  
pp. 1-11 ◽  
Author(s):  
Hui Liu ◽  
Linquan Xie ◽  
Jiansheng Liu ◽  
Lei Ding

This paper studies the topology of NoC (Network-on-Chip). By combining the characteristics of the Clos network and the butterfly network, a new topology named the BFC (Butterfly Clos-network) network is proposed. This topology integrates several modules, which belong to the same layer but different dimensions, into a new module. In the BFC network, a bidirectional link is used to complete information exchange, instead of information exchange between different layers as in the original network. During routing, other non-destination nodes can be used as middle stages to transfer data packets and complete the routing. Therefore, this topology is multistage. Simulation analyses show that BFC inherits the rich path diversity of the Clos network and performs better than the butterfly network in throughput and delay under a heavily congested traffic pattern.

