Compiler-directed scratchpad memory data transfer optimization for multithreaded applications on a heterogeneous many-core architecture

The Journal of Supercomputing ◽

10.1007/s11227-021-03853-x ◽

2021 ◽

Author(s):

Xiaohan Tao ◽

Jianmin Pang ◽

Jinlong Xu ◽

Yu Zhu

Keyword(s):

Energy Consumption ◽

High Performance ◽

Scientific Computing ◽

Data Transfer ◽

Performance Model ◽

Experimental Result ◽

Transfer Model ◽

Scratchpad Memory ◽

On Chip ◽

Many Core

The heterogeneous many-core architecture plays an important role in the fields of high-performance computing and scientific computing. It uses accelerator cores with on-chip memories to improve performance and reduce energy consumption. Scratchpad memory (SPM) is a kind of fast on-chip memory with lower energy consumption compared with a hardware cache. However, data transfer between SPM and off-chip memory can be managed only by a programmer or compiler. In this paper, we propose a compiler-directed multithreaded SPM data transfer model (MSDTM) to optimize the process of data transfer in a heterogeneous many-core architecture. We use compile-time analysis to classify data accesses, check dependences and determine the allocation of data transfer operations. We further present the data transfer performance model to derive the optimal granularity of data transfer and select the most profitable data transfer strategy. We implement the proposed MSDTM on the GCC compiler and evaluate it on Sunway TaihuLight with selected test cases from benchmarks and scientific computing applications. The experimental result shows that the proposed MSDTM improves the application execution time by 5.49× and achieves an energy saving of 5.16× on average.


Hybrid silicon-photonic network-on-chip for future generations of high-performance many-core systems

The Journal of Supercomputing ◽

10.1007/s11227-015-1539-0 ◽

2015 ◽

Vol 71 (12) ◽

pp. 4446-4475 ◽

Cited By ~ 12

Author(s):

Achraf Ben Ahmed ◽

Abderazek Ben Abdallah

Keyword(s):

High Performance ◽

Network On Chip ◽

Future Generations ◽

Photonic Network ◽

Silicon Photonic ◽

Hybrid Silicon ◽

On Chip ◽

Many Core


Framework for Design Exploration and Performance Analysis of RF-NoC Manycore Architecture

Journal of Low Power Electronics and Applications ◽

10.3390/jlpea10040037 ◽

2020 ◽

Vol 10 (4) ◽

pp. 37

Author(s):

Habiba Lahdhiri ◽

Jordane Lorandel ◽

Salvatore Monteleone ◽

Emmanuelle Bourdel ◽

Maurizio Palesi

Keyword(s):

High Performance ◽

Design Space Exploration ◽

Routing Algorithm ◽

Long Distance ◽

Promising Solution ◽

And Performance ◽

On Chip ◽

Many Core ◽

High Degree ◽

Real Traffic

The network-on-chip (NoC) paradigm has been proposed as a promising solution to enable the handling of a high degree of integration in multi-/many-core architectures. Despite their advantages, wired NoC infrastructures are facing several performance issues regarding multi-hop long-distance communications. RF-NoC is an attractive solution offering high performance and multicast/broadcast capabilities. However, managing RF links is a critical aspect that relies on both application-dependent and architectural parameters. This paper proposes a design space exploration framework for OFDMA-based RF-NoC architecture, which takes advantage of both real application benchmarks simulated using Sniper and RF-NoC architecture modeled using Noxim. We adopted the proposed framework to finely configure a routing algorithm, working with real traffic, achieving a delay reduction of up to 45% compared to a wired NoC setup in similar conditions.


Design and Calibration of MIMU Based on Chip Size Micro Inertial Sensors

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.849.302 ◽

2013 ◽

Vol 849 ◽

pp. 302-309

Author(s):

Yun Xu ◽

Xin Hua Zhu ◽

Yu Wang

Keyword(s):

High Performance ◽

Inertial Sensors ◽

Low Cost ◽

Rapid Development ◽

Experimental Result ◽

Integrated Navigation ◽

Bias Stability ◽

Chip Size ◽

Order Of Magnitude ◽

On Chip

With the rapid development of microfabrication technology, the performance of MIMU has gradually improved. The MIMU introduced in this paper is based on the silicon micromachined gyroscope of type MSG7000D and accelerometer of type MSA6000. Its volume is 3 × 3 × 3 cm³, its mass is 68.5 g and its power consumption is less than 1 W. The experimental result shows that the bias stability of the gyroscope and accelerometer for each axis of the designed MIMU is less than 10°/h and 0.5 mg respectively. Because of the non-orthogonality of the three axes of the structure, the MIMU needs to be calibrated. After calibration, the measurement accuracy improves by an order of magnitude. The designed MIMU can satisfy the requirements of high performance, low cost, light weight and small size for strap-down navigation systems, so it can be widely applied not only to integrated vehicle navigation and attitude measurement but also to personal goods such as mobile devices, game consoles and so on.


Congestion aware adaptive routing for network-on-chip communication

10.32920/ryerson.14645025.v1 ◽

2021 ◽

Author(s):

Stephen Chui

Keyword(s):

Embedded Systems ◽

High Performance ◽

Data Transfer ◽

Adaptive Routing ◽

Network On Chip ◽

Message Routing ◽

Data Packets ◽

Novel Approach ◽

Communication Links ◽

On Chip

Network-on-chip (NoC) has surpassed traditional bus-based on-chip communication in offering better performance for data transfers among the many processing, peripheral and other cores of high-performance embedded systems. Adaptive routing provides an effective means of efficient on-chip communication among NoC cores. Message routing efficiency can further improve the performance of NoC-based embedded systems on a chip. Congestion awareness has been applied to adaptive routing to achieve better data throughput and latency. This thesis presents a novel approach to analyzing congestion that improves NoC throughput by improving packet allocation in NoC routers. The routers learn the traffic conditions around themselves by utilizing the congestion information. We employ header flits to store the congestion information, which does not require any additional communication links between the routers. Prioritizing data packets that are likely to suffer the worst congestion improves overall NoC data transfer latency.
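The prioritization idea in this abstract can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the `Packet` class, the `congestion_hint` field (standing in for the congestion information carried in a header flit) and the `pick_next` function are all hypothetical names, and the "highest hint wins" rule is an assumed policy chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Packet:
    packet_id: int
    # Hypothetical field: congestion estimate read from the header flit.
    # A larger value means the packet is expected to meet worse congestion.
    congestion_hint: int


def pick_next(waiting: Sequence[Packet]) -> Packet:
    """Pick the waiting packet expected to suffer the worst congestion.

    Assumed policy: highest congestion_hint wins; on a tie, the packet
    that appears earliest in the queue is chosen.
    """
    if not waiting:
        raise ValueError("no packets waiting")
    # max() returns the first maximal element, which gives the tie-break.
    return max(waiting, key=lambda p: p.congestion_hint)


queue = [Packet(1, 2), Packet(2, 5), Packet(3, 5)]
chosen = pick_next(queue)
```

Here `chosen` is the packet with id 2: it shares the highest hint with packet 3 but arrived earlier. Because the hint travels inside the packet itself, the router needs no extra links to make this decision, which is the property the abstract highlights.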


3D Stacked Cache Data Management for Energy Minimization of 3D Chip Multiprocessor

International Journal of Students Research in Technology & Management ◽

10.18510/ijsrtm.2015.325 ◽

2015 ◽

Vol 3 (2) ◽

pp. 264-268

Author(s):

K. Suresh Kumar ◽

S. Anitha ◽

M. Gayathri

Keyword(s):

Temperature Distribution ◽

High Performance ◽

Chip Multiprocessors ◽

Electrical Power ◽

Chip Multiprocessor ◽

Energy Reduction ◽

Experimental Result ◽

Data Mapping ◽

Promising Solution ◽

On Chip

This model presents a runtime cache data mapping for 3-D stacked L2 caches that minimizes the overall energy of 3-D chip multiprocessors (CMPs). The suggested method considers both the temperature distribution and the memory traffic of 3-D CMPs. The experimental result shows an energy reduction of up to 22.88% compared to an existing solution that considers only the temperature distribution. New trends envisage 3D Multi-Processor System-on-Chip (MPSoC) design as a promising solution for continuing to increase the performance of next-generation high-performance computing (HPC) systems. However, as the power density of HPC systems increases with the arrival of 3D MPSoCs (with energy reduction achieving up to 19.55%), supplying electrical power to the computing equipment and constantly removing the generated heat is rapidly becoming the dominant cost in any HPC facility.


Double-Layer Energy Efficient Synchronous-Asynchronous Circuit-Switched NoC

Electronics ◽

10.3390/electronics10151821 ◽

2021 ◽

Vol 10 (15) ◽

pp. 1821

Author(s):

Sandy A. Wasif ◽

Salma Hesham ◽

Diana Goehringer ◽

Klaus Hofmann ◽

Mohamed A. Abd El Ghany

Keyword(s):

Power Consumption ◽

Double Layer ◽

Energy Efficient ◽

High Performance ◽

Data Transfer ◽

Low Frequency ◽

Asynchronous Circuit ◽

Two Phase ◽

Phase Layer ◽

On Chip

A network-on-chip (NoC) offers high performance, flexibility and scalability in communication infrastructure within multi-core platforms. However, NoCs contribute significantly to the overall system’s power consumption. The double-layer energy efficient synchronous-asynchronous circuit-switched NoC (CS-NoC) is proposed to enhance the power utilization. To reduce the dynamic power consumption, single-rail asynchronous protocols are utilized. The two-phase and four-phase encoding algorithms are analyzed to determine the most efficient technique. For the data layer, the two asynchronous protocols reduced the power consumption by 80%, with an increase in latency when compared with the fully synchronous protocol. However, the two-phase single-rail protocol had better performance compared with the four-phase protocol by 38%, with the same power consumption and a slight increase in area of 5%. Based on this conducted analysis, the asynchronous two-phase layer had significant power reduction yet operated at a moderate frequency. Therefore, the proposed NoC is divided into two data transfer layers with a single control layer. The data transfer layers are designed using synchronous and asynchronous protocols. The synchronous layer is designated to high-frequency loads, and the asynchronous layer is confined to low-frequency loads. The switching between the layers creates a trade-off between the maximum allowed frequency and the power consumption. The proposed NoC reduces the overall power consumption by 23% when compared with recent previous work. The NoC maintains the same system performance with an 8% area increase over the fully synchronous double-layer in the literature.


Range smart cluster monitor based guesstimate approach for resource scheduling in small size clusters

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i2.9531 ◽

2018 ◽

Vol 7 (2) ◽

pp. 837

Author(s):

S Gokuldev ◽

Jathin R

Keyword(s):

Energy Consumption ◽

Resource Sharing ◽

System Performance ◽

High Performance ◽

Performance Metrics ◽

Resource Scheduling ◽

Turnaround Time ◽

Experimental Result ◽

Battery Life ◽

Power Efficient

Scheduling tasks with low energy consumption and high performance is one of the major concerns in distributed computing. Most existing systems have achieved improved energy efficiency but compromised on QoS metrics such as makespan and resource utilization. A resource scheduling strategy for wireless clusters is proposed that makes careful decisions to improve the battery life of nodes. The proposed strategy also incorporates a monitoring system within the clusters to optimize system performance as well as energy consumption. The system ensures "any case zero loss" performance, wherein each cluster is monitored by at least one cluster monitor. This is implemented by using predictive calculation at each cluster monitor to communicate only if absolutely essential while assigning jobs to resources, and by selecting optimal resources, assigning each job to the most power-efficient resource among the available idle resources within the cluster. The experimental result shows improved system performance with low power consumption in a homogeneous computing environment. The resource sharing strategy is experimentally analyzed through simulations, considering important performance metrics such as starvation deadline, turnaround time and miss/hit count. Significant results were observed with improved efficiency.
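The selection rule described in this abstract (assign each job to the most power-efficient resource among the idle ones) can be sketched as follows. This is an illustration under stated assumptions, not the authors' scheduler: the `Resource` class, the `watts_per_job` efficiency measure and the `select_resource` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Resource:
    name: str
    idle: bool
    # Hypothetical efficiency measure: lower means more power efficient.
    watts_per_job: float


def select_resource(cluster: Sequence[Resource]) -> Optional[Resource]:
    """Return the most power-efficient idle resource, or None if all are busy."""
    idle = [r for r in cluster if r.idle]
    if not idle:
        return None
    return min(idle, key=lambda r: r.watts_per_job)


cluster = [
    Resource("node-a", idle=True, watts_per_job=4.0),
    Resource("node-b", idle=False, watts_per_job=1.5),
    Resource("node-c", idle=True, watts_per_job=2.5),
]
target = select_resource(cluster)
```

In this example `target` is `node-c`: `node-b` is more efficient but busy, so it is excluded before the comparison. Returning `None` when nothing is idle leaves the queuing policy to the caller, which the abstract does not specify.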


On-chip AMBA Bus Based Efficient Bridge between High Performance and Low Peripheral Devices

International Journal of Reconfigurable and Embedded Systems (IJRES) ◽

10.11591/ijres.v6.i1.pp41-47 ◽

2018 ◽

Vol 6 (1) ◽

pp. 41

Author(s):

Anurag Shrivastava ◽

Sudhir Kumar Sharma

Keyword(s):

High Performance ◽

Data Transfer ◽

Functional Description ◽

On Chip

Today's SoC designs must deal with the integrity and sharing of information or data at various levels of communication. The AMBA bus protocol has been proposed by the ARM community to address the uneven demands on integrity. In this paper, the functional description and implementation of a protocol supporting high-performance peripheral devices, AXI2.0, and its interface to low-performance peripheral devices are proposed. The connection, called a bridge, takes care of the protocol mismatch and handles data transfer under uneven speed demands. An asynchronous FIFO is used to avoid a complex handshaking mechanism. The design has been written in VHDL and implemented on a Xilinx Virtex 4.


Networks on Chips: Structure and Design Methodologies

Journal of Electrical and Computer Engineering ◽

10.1155/2012/509465 ◽

2012 ◽

Vol 2012 ◽

pp. 1-15 ◽

Cited By ~ 20

Author(s):

Wen-Chung Tsai ◽

Ying-Cherng Lan ◽

Yu-Hen Hu ◽

Sao-Jie Chen

Keyword(s):

High Performance ◽

Chip Multiprocessors ◽

Multiprocessor System ◽

Communication Performance ◽

Core System ◽

Traditional System ◽

On Chip ◽

Many Core ◽

Bus Architecture

The next generation of multiprocessor system on chip (MPSoC) and chip multiprocessors (CMPs) will contain hundreds or thousands of cores. Such a many-core system requires high-performance interconnections to transfer data among the cores on the chip. Traditional system components interface with the interconnection backbone via a bus interface. This interconnection backbone can be an on-chip bus or multilayer bus architecture. With the advent of many-core architectures, the bus architecture becomes the performance bottleneck of the on-chip interconnection framework. In contrast, network on chip (NoC) becomes a promising on-chip communication infrastructure, which is commonly considered as an aggressive long-term approach for on-chip communications. Accordingly, this paper first discusses several common architectures and prevalent techniques that can deal well with the design issues of communication performance, power consumption, signal integrity, and system scalability in an NoC. Finally, a novel bidirectional NoC (BiNoC) architecture with a dynamically self-reconfigurable bidirectional channel is proposed to break the conventional performance bottleneck caused by bandwidth restriction in conventional NoCs.


Hybrid Network-on-Chip: An Application-Aware Framework for Big Data

Complexity ◽

10.1155/2018/1040869 ◽

2018 ◽

Vol 2018 ◽

pp. 1-11

Author(s):

Juan Fang ◽

Sitong Liu ◽

Shijian Liu ◽

Yanjin Cheng ◽

Lu Yu

Keyword(s):

Energy Efficiency ◽

Big Data ◽

Power Consumption ◽

High Performance ◽

Network On Chip ◽

Hybrid Network ◽

Big Data Applications ◽

Application Aware ◽

On Chip ◽

Many Core

The rapid growth of IoT and cloud computing demands exascale computing systems with high performance and low power consumption to process massive amounts of data. Modern system platforms based on fundamental requirements encounter a performance gap in keeping up with the exponential growth in data speed and volume. A heterogeneous design suggests a way to narrow the gap. A network-on-chip (NoC) introduces a packet-switched fabric for on-chip communication and has become the de facto many-core interconnection mechanism; it is a vital shared resource for diverse applications and notably affects system energy efficiency. Among all the challenges in NoC, a lack of awareness of application behavior causes considerable congestion, which wastes large amounts of on-chip bandwidth and power. In this paper, we propose a hybrid NoC framework, combining buffered and bufferless NoCs, to make the NoC framework aware of applications' performance demands. An optimized congestion control scheme is also devised to satisfy the requirements for energy efficiency and fairness of big data applications. We use a trace-driven simulator to model big data applications. Compared with the classical buffered NoC, the proposed hybrid NoC is able to improve the performance of mixed applications by 17% on average and 24% at most, decrease the power consumption by 38%, and improve the fairness by 13.3%.
