Translating Timing into an Architecture: The Synergy of COTSon and HLS (Domain Expertise—Designing a Computer Architecture via HLS)

Translating a system requirement into a low-level representation (e.g., register transfer level or RTL) is the typical goal of FPGA-based system design. However, the Design Space Exploration (DSE) needed to identify the final architecture may be time-consuming, even when using high-level synthesis (HLS) tools. In this article, we illustrate our hybrid methodology, which uses a frontend for HLS so that the DSE is performed more rapidly at a higher level of abstraction, but without losing accuracy, thanks to the HP-Labs COTSon simulation infrastructure in combination with our DSE tools (MYDSE tools). In particular, the proposed methodology proved useful for reaching an appropriate design of a whole system in a shorter time than trying to design everything directly in HLS. Our motivating problem was to deploy a novel execution model called data-flow threads (DF-Threads) running on yet-to-be-designed hardware. For that goal, using HLS directly was premature in the design cycle. Therefore, a key point of our methodology consists of defining the first prototype in our simulation framework and gradually migrating the design into the Xilinx HLS after validating the key performance metrics of our novel system in the simulator. To explain this workflow, we first use a simple driving example consisting of the modelling of a two-way associative cache. Then, we explain how we generalized this methodology and describe the types of results that we were able to analyze in the AXIOM project, which helped us reduce the development time from months/weeks to days/hours.
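The driving example named in the abstract, a two-way associative cache, can be sketched as a small functional model of the kind one might validate in a simulator before migrating to HLS. This is a minimal illustration, not the authors' COTSon or HLS code: the class name `TwoWayCache`, the LRU replacement policy, and the set and line sizes are all assumptions chosen for the example.

```python
class TwoWayCache:
    """Functional model of a two-way set-associative cache.

    Assumption: LRU replacement; the abstract does not state the policy.
    """

    def __init__(self, num_sets=64, line_bytes=64):
        self.num_sets = num_sets
        self.line_bytes = line_bytes
        # Each set holds at most two tags, ordered least to most recently used.
        self.sets = [[] for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, address):
        """Look up a byte address; return True on a hit, False on a miss."""
        line = address // self.line_bytes
        index = line % self.num_sets
        tag = line // self.num_sets
        ways = self.sets[index]
        if tag in ways:
            # Hit: move the tag to the most-recently-used position.
            ways.remove(tag)
            ways.append(tag)
            self.hits += 1
            return True
        if len(ways) == 2:
            ways.pop(0)  # Set is full: evict the least recently used tag.
        ways.append(tag)
        self.misses += 1
        return False


# Hypothetical trace: with 4 sets of 16-byte lines, addresses 0, 64 and 128
# all map to set 0, so the third distinct line forces an eviction.
cache = TwoWayCache(num_sets=4, line_bytes=16)
trace = [0, 64, 0, 128, 64]
results = [cache.access(a) for a in trace]
print(results, cache.hits, cache.misses)
```

A model like this only counts hits and misses; timing (hit latency, miss penalty) would be layered on top once the functional behaviour is validated.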

One-IPC high-level simulation of microthreaded many-core architectures

The International Journal of High Performance Computing Applications ◽

10.1177/1094342015584495 ◽

2016 ◽

Vol 31 (2) ◽

pp. 152-162 ◽

Cited By ~ 3

Author(s):

Irfan Uddin

Keyword(s):

Design Space Exploration ◽

Instruction Set ◽

Efficient Design ◽

Simulation Framework ◽

Fine Grained ◽

Detailed Simulation ◽

High Level ◽

Many Core ◽

The Cost ◽

Multiple Clusters

The microthreaded many-core architecture comprises multiple clusters of fine-grained multi-threaded cores. The management of concurrency is supported in the instruction set architecture of the cores, and the computational work of an application is asynchronously delegated to different clusters of cores, where each cluster is allocated dynamically. Computer architects are always interested in analyzing the complex interaction amongst the dynamically allocated resources. Generally, a detailed, cycle-accurate simulation of the execution time is used. However, the cycle-accurate simulator for the microthreaded architecture executes at the rate of 100,000 instructions per second, divided over the number of simulated cores. This means that the evaluation of a complex application executing on a contemporary multi-core machine can be very slow. To perform efficient design space exploration we present a co-simulation environment, where the detailed execution of instructions in the pipeline of microthreaded cores and the interactions amongst the hardware components are abstracted. We present the evaluation of the high-level simulation framework against the cycle-accurate simulation framework. The results show that the high-level simulator is faster and less complicated than the cycle-accurate simulator, but at the cost of losing accuracy.
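To make the quoted simulation rate concrete, the arithmetic can be worked through in a few lines. Only the 100,000 instructions per second figure comes from the abstract; the 128-core configuration and the one-billion-instruction workload are hypothetical values chosen for illustration.

```python
# Aggregate rate of the cycle-accurate simulator, as quoted in the abstract.
AGGREGATE_RATE = 100_000  # simulated instructions per second

def per_core_rate(cores):
    """Simulated instructions per second available to each simulated core."""
    return AGGREGATE_RATE / cores

def wall_clock_seconds(total_instructions):
    """Host time needed to simulate a workload of the given total size."""
    return total_instructions / AGGREGATE_RATE

# Hypothetical configuration and workload (not from the paper).
cores = 128
workload = 1_000_000_000

rate = per_core_rate(cores)
seconds = wall_clock_seconds(workload)
print(f"{rate} instr/s per core, {seconds / 3600:.2f} hours in total")
```

Under these assumed numbers each simulated core advances at under a thousand instructions per second, and the whole run takes close to three hours, which is the kind of cost that motivates a faster high-level simulator.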

Evaluation of Static Mapping for Dynamic Space-Shared Multi-task Processing on FPGAs

Journal of Signal Processing Systems ◽

10.1007/s11265-020-01633-z ◽

2021 ◽

Author(s):

Umar Ibrahim Minhas ◽

Roger Woods ◽

Georgios Karakonstantis

Keyword(s):

High Performance ◽

Design Space Exploration ◽

Design Space ◽

System Throughput ◽

Design Parameters ◽

Temporal Constraints ◽

Shared Resources ◽

Task Processing ◽

High Level ◽

Performance Computing

Whilst FPGAs have been used in cloud ecosystems, it is still extremely challenging to achieve high compute density when mapping heterogeneous multi-tasks on shared resources at runtime. This work addresses this by treating the FPGA resource as a service and employing multi-task processing at the high level, design space exploration and static off-line partitioning in order to allow more efficient mapping of heterogeneous tasks onto the FPGA. In addition, a new, comprehensive runtime functional simulator is used to evaluate the effect of various spatial and temporal constraints on both the existing and new approaches when varying system design parameters. A comprehensive suite of real high performance computing tasks was implemented on a Nallatech 385 FPGA card, and the results show that our approach can provide on average 2.9 × and 2.3 × higher system throughput for compute and mixed intensity tasks, while 0.2 × lower for memory intensive tasks due to external memory access latency and bandwidth limitations. The work has been extended by introducing a novel scheduling scheme to enhance temporal utilization of resources when using the proposed approach. Additional results for large queues of mixed intensity tasks (compute and memory) show that the proposed partitioning and scheduling approach can provide higher than 3 × system speedup over previous schemes.

Implementation and Design Space Exploration of a Turbo Decoder in High-Level Synthesis

2019 International Conference on ReConFigurable Computing and FPGAs (ReConFig) ◽

10.1109/reconfig48160.2019.8994787 ◽

2019 ◽

Author(s):

Wesley Stirk ◽

Jeff Goeders

Keyword(s):

Design Space Exploration ◽

Design Space ◽

Space Exploration ◽

High Level Synthesis ◽

Turbo Decoder ◽

High Level

Distributed design-space exploration for high-level synthesis systems

[1992] Proceedings 29th ACM/IEEE Design Automation Conference ◽

10.1109/dac.1992.227806 ◽

2003 ◽

Cited By ~ 24

Author(s):

R. Dutta ◽

J. Roy ◽

R. Vemuri

Keyword(s):

Design Space Exploration ◽

Design Space ◽

Space Exploration ◽

High Level Synthesis ◽

Distributed Design ◽

High Level

Book review: High-Level Language Computer Architecture edited by Veljko Milutinovic (Computer Science Press, 1989)

ACM SIGARCH Computer Architecture News ◽

10.1145/379126.773540 ◽

1990 ◽

Vol 18 (1) ◽

pp. 120-122

Author(s):

Robert P. Colwell

Keyword(s):

Computer Science ◽

Computer Architecture ◽

High Level Language ◽

High Level

High level performance metrics for FPGA-based multiprocessor systems

Performance Evaluation ◽

10.1016/j.peva.2009.12.004 ◽

2010 ◽

Vol 67 (6) ◽

pp. 417-431 ◽

Cited By ~ 2

Author(s):

Marta Beltrán ◽

Antonio Guzmán ◽

Fernando Sevillano

Keyword(s):

Performance Metrics ◽

Multiprocessor Systems ◽

High Level ◽

Level Performance

Divide and conquer high-level synthesis design space exploration

ACM Transactions on Design Automation of Electronic Systems ◽

10.1145/2209291.2209302 ◽

2012 ◽

Vol 17 (3) ◽

pp. 1-19 ◽

Cited By ~ 24

Author(s):

Benjamin Carrion Schafer ◽

Kazutoshi Wakabayashi

Keyword(s):

Design Space Exploration ◽

Design Space ◽

Space Exploration ◽

Divide And Conquer ◽

High Level Synthesis ◽

Synthesis Design ◽

High Level

High-Level Synthesis Design for Stencil Computations on FPGA with High Bandwidth Memory

Electronics ◽

10.3390/electronics9081275 ◽

2020 ◽

Vol 9 (8) ◽

pp. 1275

Author(s):

Changdao Du ◽

Yoshiki Yamaguchi

Keyword(s):

Programming Languages ◽

High Performance ◽

Design Space Exploration ◽

Scale Up ◽

High Level Synthesis ◽

Stencil Computations ◽

Temporal Domain ◽

High Bandwidth ◽

Promising Solution ◽

High Level

Due to performance and energy requirements, FPGA-based accelerators have become a promising solution for high-performance computations. Meanwhile, with the help of high-level synthesis (HLS) compilers, FPGAs can be programmed using common programming languages such as C, C++, or OpenCL, thereby improving design efficiency and portability. Stencil computations are significant kernels in various scientific applications. In this paper, we introduce an architecture design for implementing stencil kernels on a state-of-the-art FPGA with high bandwidth memory (HBM). Traditional FPGAs are usually equipped with external memory, e.g., DDR3 or DDR4, which limits the design space exploration in the spatial domain of stencil kernels. Therefore, many previous studies mainly relied on exploiting parallelism in the temporal domain to work around the bandwidth limitations. In our approach, we scale up the design performance by considering both the spatial and temporal parallelism of the stencil kernel equally. We also discuss the design portability among different HLS compilers. We use typical stencil kernels to evaluate our design on a Xilinx U280 FPGA board and compare the results with other existing studies. By adopting our method, developers can apply broad parallelization strategies based on specific FPGA resources to improve performance.

A rapid design space exploration approach for multi-objective optimization of DSP filter designs

10.32920/ryerson.14653857.v1 ◽

2021 ◽

Author(s):

Aakriti Tarun Sharma

Keyword(s):

Design Space Exploration ◽

Design Space ◽

Space Exploration ◽

Systems Design ◽

Design Solution ◽

Multi Objective Optimization ◽

Multi Objective ◽

Time Period ◽

And Performance ◽

High Level

The process of converting a behavioral specification of an application to its equivalent system architecture is referred to as High-Level Synthesis (HLS). A crucial stage in embedded systems design involves finding the trade-off between resource utilization and performance. An exhaustive search would yield the required results, but its time complexity is high: even for small designs it would take a very long time to arrive at the solution. We employ Design Space Exploration (DSE) to reduce the complexity of the design space and to reach the desired results in less time. In practice, multiple user-defined constraints need to be satisfied simultaneously, so the task is one of Multi-Objective Optimization. In this thesis, the design process of DSP benchmarks was analyzed based on user-defined constraints such as power and execution time. The analyzed outcome was compared with the existing approaches in DSE, and an optimal design solution was derived in a shorter time period.
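The multi-objective notion in this abstract, satisfying power and execution-time constraints at once, can be illustrated with a Pareto filter over candidate designs. This is a generic sketch and not the thesis's exploration algorithm: it scans a small list exhaustively, and the design names, power and time values, and constraint limits are all hypothetical.

```python
def pareto_front(designs):
    """Keep designs not dominated in (power, time); lower is better for both."""
    front = []
    for name, power, time in designs:
        # A design is dominated if another is no worse in both objectives
        # and strictly better in at least one.
        dominated = any(
            p <= power and t <= time and (p < power or t < time)
            for _, p, t in designs
        )
        if not dominated:
            front.append((name, power, time))
    return front

def within_constraints(designs, max_power, max_time):
    """Keep designs that satisfy both user-defined limits."""
    return [d for d in designs if d[1] <= max_power and d[2] <= max_time]

# Hypothetical candidates: (name, power, execution time), arbitrary units.
candidates = [
    ("A", 10, 50),
    ("B", 12, 40),
    ("C", 15, 45),
    ("D", 20, 30),
    ("E", 11, 55),
]

front = pareto_front(candidates)
feasible = within_constraints(front, max_power=15, max_time=45)
print(front)
print(feasible)
```

Filtering to the Pareto front first and applying the constraints second keeps the trade-off curve visible, so a designer can see what relaxing a limit would buy.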

Execution time - area tradeoff in GA using residual load decoder: Integrated exploration of chaining based schedule and allocation in HLS for hardware accelerators

Facta universitatis - series Electronics and Energetics ◽

10.2298/fuee1402235s ◽

2014 ◽

Vol 27 (2) ◽

pp. 235-249 ◽

Cited By ~ 1

Author(s):

Anirban Sengupta ◽

Reza Sedaghat ◽

Vipul Mishra

Keyword(s):

Execution Time ◽

Design Space Exploration ◽

Design Space ◽

Space Exploration ◽

Integrated Design ◽

Hardware Accelerators ◽

Average Improvement ◽

Genetic Algorithm Approach ◽

High Level ◽

The Cost

Design space exploration is an indispensable part of High Level Synthesis (HLS) design of hardware accelerators. This paper presents a novel technique for the area-execution time tradeoff using residual load decoding heuristics in genetic algorithms (GA) for integrated design space exploration (DSE) of scheduling and allocation. This approach is also able to resolve issues encountered during DSE of data paths for hardware accelerators, such as the accuracy of the solution found and the total exploration time. The integrated solution found by the proposed approach satisfies the user-specified constraints of hardware area and total execution time (not just latency), while at the same time offering a twofold unified solution of chaining-based schedule and allocation. The cost function proposed in the genetic algorithm approach takes into account the functional units, multiplexers and demultiplexers needed during implementation. The proposed exploration system (ExpSys) was tested on a large number of benchmarks drawn from the literature to assess its efficiency. Results indicate an average improvement in Quality of Results (QoR) greater than 26% when compared to a recent well-known GA-based exploration method.
