A Quantitative Study of the On-Chip Network and Memory Hierarchy Design for Many-Core Processor

With recent developments in parallel supercomputing architecture, many-core, multi-core, and GPU processors are now commonplace, resulting in more levels of parallelism, deeper memory hierarchies, and greater programming complexity. It has been necessary to adapt the MILC code to these new processors, starting with NVIDIA GPUs and, more recently, the Intel Xeon Phi processors. We report on our efforts to port and optimize our code for the Intel Knights Landing architecture. We consider the performance of the MILC code with MPI and OpenMP, and optimizations with QOPQDP and QPhiX. For the latter approach, we concentrate on the staggered conjugate gradient and gauge force. We also consider performance on recent NVIDIA GPUs using the QUDA library.


A Survey of Software-Defined Networks-on-Chip: Motivations, Challenges and Opportunities

Micromachines ◽ 10.3390/mi12020183 ◽ 2021 ◽ Vol 12 (2) ◽ pp. 183

Author(s): Jose Ricardo Gomez-Rodriguez ◽ Remberto Sandoval-Arechiga ◽ Salvador Ibarra-Delgado ◽ Viktor Ivan Rodriguez-Abdala ◽ Jose Luis Vazquez-Avila ◽ ...

Keyword(s): Single Chip ◽ Synthesis Time ◽ Networks On Chip ◽ Data Dependencies ◽ Layered Architecture ◽ Systems On Chip ◽ Challenges And Opportunities ◽ Computing Platforms ◽ On Chip ◽ Many Core

Current computing platforms encourage the integration of thousands of processing cores, and their interconnections, into a single chip. Mobile smartphones, IoT, embedded devices, desktops, and data centers use Many-Core Systems-on-Chip (SoCs) to exploit their compute power and parallelism to meet dynamic workload requirements. Networks-on-Chip (NoCs) provide scalable connectivity for diverse applications with distinct traffic patterns and data dependencies. However, when the system executes various applications on traditional NoCs, which are optimized and fixed at synthesis time, the mismatch between the interconnect and the different applications’ requirements limits performance. In the literature, NoC designs have embraced the Software-Defined Networking (SDN) strategy to evolve into an adaptable interconnection solution for future chips. However, the works surveyed implement only a partial Software-Defined Network-on-Chip (SDNoC) approach, leaving aside the SDN layered architecture that brings interoperability to conventional networking. This paper explores the SDNoC literature and classifies it according to the desired SDN features that each work presents. We then describe the challenges and opportunities identified in the literature survey. Moreover, we explain the motivation for an SDNoC approach and present both SDN and SDNoC concepts and architectures. We observe that works in the literature employ an incomplete layered SDNoC approach. This leaves various fertile areas in the SDNoC architecture where researchers may contribute to Many-Core SoC designs.


Near-Optimal Thermal Monitoring Framework for Many-Core Systems-on-Chip

IEEE Transactions on Computers ◽ 10.1109/tc.2015.2395423 ◽ 2015 ◽ Vol 64 (11) ◽ pp. 3197-3209 ◽ Cited By ~ 2

Author(s): Juri Ranieri ◽ Alessandro Vincenzi ◽ Amina Chebira ◽ David Atienza ◽ Martin Vetterli

Keyword(s): Thermal Monitoring ◽ Systems On Chip ◽ Monitoring Framework ◽ On Chip ◽ Many Core


MCVP-NoC: Many-Core Virtual Platform with Networks-on-Chip support

2013 IEEE 10th International Conference on ASIC ◽ 10.1109/asicon.2013.6811836 ◽ 2013

Author(s): Dexue Zhang ◽ Xiaoyang Zeng ◽ Zongyan Wang ◽ Weike Wang ◽ Xinhua Chen

Keyword(s): Networks On Chip ◽ Virtual Platform ◽ On Chip ◽ Many Core


FoToNoC: A Folded Torus-Like Network-on-Chip Based Many-Core Systems-on-Chip in the Dark Silicon Era

IEEE Transactions on Parallel and Distributed Systems ◽ 10.1109/tpds.2016.2643669 ◽ 2017 ◽ Vol 28 (7) ◽ pp. 1905-1918 ◽ Cited By ~ 16

Author(s): Lei Yang ◽ Weichen Liu ◽ Weiwen Jiang ◽ Mengquan Li ◽ Peng Chen ◽ ...

Keyword(s): Network On Chip ◽ Dark Silicon ◽ Systems On Chip ◽ On Chip ◽ Many Core


Compiler-directed scratchpad memory data transfer optimization for multithreaded applications on a heterogeneous many-core architecture

The Journal of Supercomputing ◽ 10.1007/s11227-021-03853-x ◽ 2021

Author(s): Xiaohan Tao ◽ Jianmin Pang ◽ Jinlong Xu ◽ Yu Zhu

Keyword(s): Energy Consumption ◽ High Performance ◽ Scientific Computing ◽ Data Transfer ◽ Performance Model ◽ Experimental Result ◽ Transfer Model ◽ Scratchpad Memory ◽ On Chip ◽ Many Core

The heterogeneous many-core architecture plays an important role in high-performance computing and scientific computing. It uses accelerator cores with on-chip memories to improve performance and reduce energy consumption. Scratchpad memory (SPM) is a kind of fast on-chip memory with lower energy consumption than a hardware cache. However, data transfer between SPM and off-chip memory can be managed only by a programmer or compiler. In this paper, we propose a compiler-directed multithreaded SPM data transfer model (MSDTM) to optimize data transfer in a heterogeneous many-core architecture. We use compile-time analysis to classify data accesses, check dependences, and determine the allocation of data transfer operations. We further present a data transfer performance model to derive the optimal granularity of data transfer and select the most profitable data transfer strategy. We implement the proposed MSDTM in the GCC compiler and evaluate it on Sunway TaihuLight with selected test cases from benchmarks and scientific computing applications. The experimental results show that the proposed MSDTM improves application execution time by 5.49× and achieves an energy saving of 5.16× on average.


Late Breaking Results: Building an On-Chip Deep Learning Memory Hierarchy Brick by Brick

2020 57th ACM/IEEE Design Automation Conference (DAC) ◽ 10.1109/dac18072.2020.9218728 ◽ 2020

Author(s): Isak Edo Vivancos ◽ Sayeh Sharify ◽ Milos Nikolic ◽ Ciaran Bannon ◽ Mostafa Mahmoud ◽ ...

Keyword(s): Deep Learning ◽ Memory Hierarchy ◽ On Chip


Hybrid silicon-photonic network-on-chip for future generations of high-performance many-core systems

The Journal of Supercomputing ◽ 10.1007/s11227-015-1539-0 ◽ 2015 ◽ Vol 71 (12) ◽ pp. 4446-4475 ◽ Cited By ~ 12

Author(s): Achraf Ben Ahmed ◽ Abderazek Ben Abdallah

Keyword(s): High Performance ◽ Network On Chip ◽ Future Generations ◽ Photonic Network ◽ Silicon Photonic ◽ Hybrid Silicon ◽ On Chip ◽ Many Core


Machine learning for design and optimization challenges in multi/many-core network-on-chip

10.1145/3477231.3490427 ◽ 2021

Author(s): Md Farhadur Reza

Keyword(s): Machine Learning ◽ Network On Chip ◽ Core Network ◽ Design And Optimization ◽ On Chip ◽ Many Core


A Novel Hybrid Cache Coherence with Global Snooping for Many-core Architectures

ACM Transactions on Design Automation of Electronic Systems ◽ 10.1145/3462775 ◽ 2022 ◽ Vol 27 (1) ◽ pp. 1-31

Author(s): Sri Harsha Gade ◽ Sujay Deb

Keyword(s): Lower Energy ◽ Cache Coherence ◽ Network On Chip ◽ Highly Efficient ◽ Wireless Links ◽ Coherence Protocols ◽ High Area ◽ On Chip ◽ Many Core ◽ Clustered Network

Cache coherence ensures the correctness of cached data in multi-core processors. Traditional implementations of existing protocols do not scale to many-core architectures. While snoopy coherence requires unscalable ordered networks, directory coherence is weighed down by high area and energy overheads. In this work, we propose Wireless-enabled Share-aware Hybrid (WiSH) to provide scalable coherence in many-core processors. WiSH implements a novel Snoopy over Directory protocol using on-chip wireless links and a hierarchical, clustered Network-on-Chip to achieve low-overhead and highly efficient coherence. A local directory protocol maintains coherence within a cluster of cores, while coherence among such clusters is achieved through a global snoopy protocol. The ordered network for global snooping is provided by low-latency, low-energy broadcast wireless links. The overheads are further reduced through share-aware cache segmentation, which eliminates coherence for private blocks. Evaluations show that WiSH reduces both traffic and runtime while requiring smaller storage and lower energy than existing hierarchical and hybrid coherence protocols. Owing to its modularity, WiSH provides highly efficient and scalable coherence for many-core processors.
