Exploiting loop-level parallelism on coarse-grained reconfigurable architectures using modulo scheduling

Author(s):  
Bingfeng Mei ◽  
S. Vernalde ◽  
D. Verkest ◽  
H. De Man ◽  
R. Lauwereins
Author(s):  
Rani Gnanaolivu ◽  
Theodore S. Norvell ◽  
Ramachandran Venkatesan

Coarse-Grained Reconfigurable Architectures (CGRAs) have gained currency in recent years due to their abundant parallelism and flexibility. To utilize the parallelism found in CGRAs, this paper proposes a fast and efficient Modulo-Constrained Hybrid Particle Swarm Optimization (MCHPSO) scheduling algorithm to exploit loop-level parallelism in applications. This paper shows that Particle Swarm Optimization (PSO) is capable of software pipelining loops by overlapping placement, scheduling and routing of successive loop iterations and executing them in parallel. The proposed algorithm has been experimentally validated on various DSP benchmarks under two different architecture configurations. These experiments indicate that the proposed MCHPSO algorithm can find schedules with small initiation intervals within a reasonable amount of time. The MCHPSO scheduling algorithm was analyzed with different topologies and Functional Unit (FU) configurations. The authors have tested the parallelizability of the algorithm and found that it exhibits a nearly linear speedup on a multi-core CPU.
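The "small initiation intervals" mentioned in the abstract are usually judged against the standard lower bound from the modulo scheduling literature, the minimum initiation interval (MII). The sketch below is not taken from the MCHPSO paper. It assumes homogeneous functional units and that the loop's dependence cycles are already listed as (total latency, total iteration distance) pairs. The function names are hypothetical.

```python
from math import ceil


def res_mii(num_ops: int, num_fus: int) -> int:
    """Resource-constrained bound: every op needs one FU slot per iteration.

    Assumes all FUs can execute all ops (homogeneous array).
    """
    if num_fus <= 0:
        raise ValueError("need at least one functional unit")
    return ceil(num_ops / num_fus)


def rec_mii(cycles: list[tuple[int, int]]) -> int:
    """Recurrence-constrained bound over the dependence cycles of the loop.

    Each cycle is (total_latency, total_distance), where distance is the
    number of iterations the loop-carried dependence spans (must be > 0).
    A loop with no recurrences yields 1.
    """
    return max((ceil(lat / dist) for lat, dist in cycles), default=1)


def min_ii(num_ops: int, num_fus: int, cycles: list[tuple[int, int]]) -> int:
    """MII = max(ResMII, RecMII); no valid schedule can have a smaller II."""
    return max(res_mii(num_ops, num_fus), rec_mii(cycles))
```

For example, a loop body of 10 operations on 4 functional units with one recurrence of latency 3 and distance 1 gives a bound of 3 cycles. A scheduler that reaches this value has found an optimal initiation interval under these assumptions.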


Electronics ◽  
2021 ◽  
Vol 10 (18) ◽  
pp. 2210
Author(s):  
Zhongyuan Zhao ◽  
Weiguang Sheng ◽  
Jinchao Li ◽  
Pengfei Ye ◽  
Qin Wang ◽  
...  

Modulo-scheduled coarse-grained reconfigurable array (CGRA) processors have shown their potential for exploiting loop-level parallelism at high energy efficiency. However, these CGRAs need frequent reconfiguration during execution, which incurs large area and power overhead for context memory and context fetching. To tackle this challenge, this paper uses an architecture/compiler co-design method for context reduction. From the architecture side, we carefully partition the context into several subsections and, when a new context is fetched, fetch only the subsections that differ from the previous context word. We package each differing subsection with an opcode and an index value to form a context-fetching primitive (CFP), and we explore the hardware design space by providing centralized and distributed CFP-fetching CGRAs that support this CFP-based context-fetching scheme. From the software side, we develop a similarity-aware tuning algorithm and integrate it into state-of-the-art modulo scheduling and memory access conflict optimization algorithms. The whole compilation flow efficiently increases the similarity between contexts in each PE, reducing both context-fetching latency and context footprint. Experimental results show that our HW/SW co-designed framework improves area efficiency and energy efficiency by up to 34% and 21%, respectively, with only 2% performance overhead.
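The idea of fetching only the changed subsections of a context word can be illustrated with a small delta-encoding sketch. This is a simplified illustration under stated assumptions, not the paper's hardware format: a context word is modeled as a list of subsection values, each primitive is reduced to an (index, value) pair with the opcode omitted, and the function names are hypothetical.

```python
def encode_cfps(prev: list[int], new: list[int]) -> list[tuple[int, int]]:
    """Return (index, value) pairs for the subsections that changed.

    prev and new are context words split into the same number of
    subsections; unchanged subsections produce no primitive.
    """
    if len(prev) != len(new):
        raise ValueError("context words must have the same subsection count")
    return [(i, n) for i, (p, n) in enumerate(zip(prev, new)) if p != n]


def apply_cfps(prev: list[int], cfps: list[tuple[int, int]]) -> list[int]:
    """Rebuild the new context word from the previous one plus the primitives."""
    ctx = list(prev)  # copy so the caller's previous context is untouched
    for index, value in cfps:
        ctx[index] = value
    return ctx
```

In this model, the more similar two consecutive contexts are, the fewer primitives must be fetched, which is why a compiler pass that increases context similarity reduces both fetch latency and storage.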


2009 ◽  
Vol 44 (7) ◽  
pp. 21-30 ◽  
Author(s):  
Taewook Oh ◽  
Bernhard Egger ◽  
Hyunchul Park ◽  
Scott Mahlke
