One-IPC high-level simulation of microthreaded many-core architectures

Author(s): Irfan Uddin

The microthreaded many-core architecture comprises multiple clusters of fine-grained multi-threaded cores. Concurrency management is supported in the instruction set architecture of the cores, and the computational work in an application is asynchronously delegated to clusters of cores, where each cluster is allocated dynamically. Computer architects are interested in analyzing the complex interactions amongst these dynamically allocated resources. Generally, a detailed, cycle-accurate simulation of the execution time is used. However, the cycle-accurate simulator for the microthreaded architecture executes at a rate of 100,000 instructions per second, divided over the number of simulated cores, so the evaluation of a complex application can be very slow even on a contemporary multi-core host machine. To perform efficient design space exploration we present a co-simulation environment in which the detailed execution of instructions in the pipelines of the microthreaded cores and the interactions amongst the hardware components are abstracted. We evaluate the high-level simulation framework against the cycle-accurate simulation framework. The results show that the high-level simulator is faster and less complicated than the cycle-accurate simulator, but at the cost of some accuracy.
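To illustrate the one-IPC abstraction described above, the following is a minimal sketch of how execution time might be estimated by assuming each core retires one instruction per cycle, with work spread over a dynamically allocated cluster. The function name and the overhead parameters are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a one-IPC timing estimate for work delegated to a
# dynamically allocated cluster. Names and overhead parameters are
# illustrative assumptions, not the simulator's actual code.

def one_ipc_cycles(total_instructions: int,
                   cluster_cores: int,
                   create_overhead: int = 0,
                   sync_overhead: int = 0) -> int:
    """Approximate execution time in cycles, assuming each core retires
    exactly one instruction per cycle and instructions are divided
    evenly over the cores of the allocated cluster."""
    if cluster_cores < 1:
        raise ValueError("a cluster must contain at least one core")
    compute = -(-total_instructions // cluster_cores)  # ceiling division
    return create_overhead + compute + sync_overhead

# Example: one million instructions on a 16-core cluster.
print(one_ipc_cycles(1_000_000, 16))
```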

2011 · Vol 21 (04) · pp. 413-438
Author(s): M. Irfan Uddin, Michiel W. van Tol, Chris R. Jesshope

The Microgrid is a many-core architecture comprising multiple clusters of fine-grained multi-threaded cores. The SVP API supported by the cores allows for the asynchronous delegation of work to different clusters of cores, which can be acquired dynamically. We want to explore the execution of complex applications and their interaction with dynamically allocated resources. To date, any evaluation of the Microgrid has used a detailed emulation with a cycle-accurate simulation of the execution time. Although the emulator can be used to evaluate small program kernels, it only executes at a rate of 100K instructions per second, divided over the number of emulated cores. This makes it inefficient to evaluate a complex application executing on many cores with dynamic allocation of clusters. To obtain a more efficient evaluation we have developed a co-simulation environment that executes high-level SVP control code but abstracts the scheduling of the low-level threads using two different techniques. The co-simulation is evaluated for both performance and simulation accuracy.
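The asynchronous delegation pattern behind the SVP control code can be pictured with the following stand-in sketch: a family of index-parameterised work items is "created" on a cluster and later "synchronised" on. Python futures are only a modelling device here; the real SVP API and the Microgrid cores are not involved.

```python
# Illustrative stand-in for SVP-style create/sync delegation. The helper
# names and the thread-pool "cluster" are assumptions for illustration.
from concurrent.futures import ThreadPoolExecutor

def create_family(executor, start, limit, step, thread_body):
    """'Create' a family of threads over an index range; returns handles."""
    return [executor.submit(thread_body, i) for i in range(start, limit, step)]

def sync_family(handles):
    """'Sync' on the family: block until all members have completed."""
    return [h.result() for h in handles]

if __name__ == "__main__":
    with ThreadPoolExecutor(max_workers=8) as cluster:  # stand-in for a cluster of cores
        family = create_family(cluster, 0, 16, 1, lambda i: i * i)
        print(sync_family(family))
```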


2014 · Vol 27 (2) · pp. 235-249
Author(s): Anirban Sengupta, Reza Sedaghat, Vipul Mishra

Design space exploration is an indispensable part of High Level Synthesis (HLS) design of hardware accelerators. This paper presents a novel technique for the area-execution time trade-off, using residual load decoding heuristics in a genetic algorithm (GA) for integrated design space exploration (DSE) of scheduling and allocation. The approach also addresses issues encountered during DSE of data paths for hardware accelerators, such as the accuracy of the solution found and the total exploration time. The integrated solution found by the proposed approach satisfies the user-specified constraints on hardware area and total execution time (not just latency), while at the same time offering a unified solution of chaining-based scheduling and allocation. The cost function proposed for the genetic algorithm takes into account the functional units, multiplexers and demultiplexers needed during implementation. The proposed exploration system (ExpSys) was tested on a large number of benchmarks drawn from the literature to assess its efficiency. Results indicate an average improvement in Quality of Results (QoR) of more than 26% when compared to a recent, well-known GA-based exploration method.
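As a rough illustration of the kind of cost function such a GA-based DSE evaluates, the sketch below combines an area estimate (functional units, multiplexers, demultiplexers) and execution time, each normalised by its user-specified constraint. The weights and unit costs are assumptions for illustration; the paper's residual load decoding heuristic is not reproduced here.

```python
# Hedged sketch of a GA fitness/cost function for area-execution time DSE.
# All unit costs and weights below are assumed values, not the paper's.

def cost(num_fus, num_muxes, num_demuxes, exec_time,
         area_constraint, time_constraint,
         fu_area=100, mux_area=8, demux_area=8, w_area=0.5, w_time=0.5):
    area = num_fus * fu_area + num_muxes * mux_area + num_demuxes * demux_area
    # Normalise each objective by its constraint so the two terms are
    # comparable; a term greater than 1 indicates a violated constraint.
    return w_area * (area / area_constraint) + w_time * (exec_time / time_constraint)

# A GA would evaluate this cost for each chromosome (a candidate
# schedule/allocation) and keep the lowest-cost feasible solutions.
print(cost(num_fus=4, num_muxes=10, num_demuxes=6, exec_time=480,
           area_constraint=1000, time_constraint=500))
```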


2019 · Vol 2019 · pp. 1-18
Author(s): Roberto Giorgi, Farnam Khalili, Marco Procaccini

Translating a system requirement into a low-level representation (e.g., register transfer level or RTL) is the typical goal of the design of FPGA-based systems. However, the Design Space Exploration (DSE) needed to identify the final architecture may be time-consuming, even when using high-level synthesis (HLS) tools. In this article, we illustrate our hybrid methodology, which uses a frontend for HLS so that the DSE is performed more rapidly at a higher level of abstraction, but without losing accuracy, thanks to the HP-Labs COTSon simulation infrastructure in combination with our DSE tools (MYDSE tools). In particular, the proposed methodology proved useful for achieving an appropriate design of a whole system in a shorter time than trying to design everything directly in HLS. Our motivating problem was to deploy a novel execution model called data-flow threads (DF-Threads) running on yet-to-be-designed hardware. For that goal, using HLS directly was premature at that point in the design cycle. Therefore, a key point of our methodology consists in defining the first prototype in our simulation framework and gradually migrating the design into the Xilinx HLS after validating the key performance metrics of our novel system in the simulator. To explain this workflow, we first use a simple driving example consisting of the modelling of a two-way associative cache. Then, we explain how we generalized this methodology and describe the types of results that we were able to analyze in the AXIOM project, which helped us reduce the development time from months/weeks to days/hours.
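To give a feel for the driving example mentioned above, here is a minimal behavioural model of a two-way associative cache with LRU replacement, of the kind one might prototype in a simulator before committing the design to HLS. The class name, sizes and replacement policy are illustrative assumptions, not the AXIOM/COTSon implementation.

```python
# Behavioural sketch of a two-way associative cache with LRU replacement.
# Parameters and structure are assumptions for illustration only.

class TwoWayCache:
    def __init__(self, num_sets=64, line_bytes=64):
        self.num_sets, self.line_bytes = num_sets, line_bytes
        # Each set holds up to two tags, ordered most- to least-recently used.
        self.sets = [[] for _ in range(num_sets)]
        self.hits = self.misses = 0

    def access(self, address: int) -> bool:
        line = address // self.line_bytes
        index, tag = line % self.num_sets, line // self.num_sets
        ways = self.sets[index]
        if tag in ways:                      # hit: promote to MRU
            ways.remove(tag)
            ways.insert(0, tag)
            self.hits += 1
            return True
        if len(ways) == 2:                   # miss with a full set: evict the LRU way
            ways.pop()
        ways.insert(0, tag)
        self.misses += 1
        return False

cache = TwoWayCache()
for addr in [0, 64, 4096, 0, 8192, 64]:
    cache.access(addr)
print(cache.hits, cache.misses)
```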


VLSI Design · 2012 · Vol 2012 · pp. 1-13
Author(s): Roberta Piscitelli, Andy D. Pimentel

This paper presents a framework for high-level power estimation of multiprocessor systems-on-chip (MPSoC) architectures on FPGA. The technique is based on abstract execution profiles, called event signatures, and it operates at a higher level of abstraction than, for example, commonly used instruction-set simulator (ISS)-based power estimation methods and should thus be capable of achieving good evaluation performance. As a consequence, the technique can be very useful in the context of early system-level design space exploration. We integrated the power estimation technique in a system-level MPSoC synthesis framework. Subsequently, using this framework, we designed a range of different candidate architectures which contain different numbers of MicroBlaze processors and compared our power estimation results to those from real measurements on a Virtex-6 FPGA board.
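The essence of signature-based estimation can be sketched as follows: an execution profile is reduced to counts of abstract events, and energy is estimated as the dot product of those counts with per-event energy costs calibrated for the target platform. The event names and cost values below are assumed for illustration; they are not the paper's calibrated numbers.

```python
# Hedged sketch of event-signature power estimation. Event types and
# per-event energies are illustrative assumptions.

EVENT_ENERGY_NJ = {          # energy per event, in nanojoules (assumed values)
    "alu_op": 0.4,
    "mem_read": 2.1,
    "mem_write": 2.6,
    "bus_transfer": 3.5,
}

def estimate_energy_nj(signature: dict) -> float:
    """Energy estimate for one event signature (event name -> count)."""
    return sum(EVENT_ENERGY_NJ[event] * count for event, count in signature.items())

def estimate_power_w(signature: dict, exec_time_s: float) -> float:
    """Average power = estimated energy / estimated execution time."""
    return estimate_energy_nj(signature) * 1e-9 / exec_time_s

sig = {"alu_op": 1_200_000, "mem_read": 300_000,
       "mem_write": 150_000, "bus_transfer": 40_000}
print(estimate_power_w(sig, exec_time_s=0.01))
```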


2012 · Vol 2012 · pp. 1-15
Author(s): Ilia Lebedev, Christopher Fletcher, Shaoyi Cheng, James Martin, Austin Doupnik, ...

We present a highly productive approach to hardware design based on a many-core microarchitectural template used to implement compute-bound applications expressed in a high-level data-parallel language such as OpenCL. The template is customized on a per-application basis via a range of high-level parameters such as the interconnect topology or processing element architecture. The key benefits of this approach are that it (i) allows programmers to express parallelism through an API defined in a high-level programming language, (ii) supports coarse-grained multithreading and fine-grained threading while permitting bit-level resource control, and (iii) reduces the effort required to repurpose the system for different algorithms or different applications. We compare template-driven design to both full-custom and programmable approaches by studying implementations of a compute-bound data-parallel Bayesian graph inference algorithm across several candidate platforms. Specifically, we examine a range of template-based implementations on both FPGA and ASIC platforms and compare each against full custom designs. Throughout this study, we use a general-purpose graphics processing unit (GPGPU) implementation as a performance and area baseline. We show that our approach, similar in productivity to programmable approaches such as GPGPU applications, yields implementations with performance approaching that of full-custom designs on both FPGA and ASIC platforms.
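The per-application customisation described above can be pictured as instantiating the template from a handful of high-level parameters rather than rewriting RTL. The field names and values in this sketch are assumptions for illustration, not the paper's actual parameter set.

```python
# Illustrative sketch of per-application customisation of a many-core
# template via high-level parameters. All fields are assumed examples.
from dataclasses import dataclass

@dataclass
class TemplateConfig:
    num_processing_elements: int = 16
    interconnect_topology: str = "ring"      # e.g. "ring", "mesh", "crossbar"
    threads_per_pe: int = 4                  # coarse-grained multithreading depth
    datapath_width_bits: int = 32            # bit-level resource control
    scratchpad_kib: int = 8

# One configuration per target application or algorithm.
bayes_inference_cfg = TemplateConfig(num_processing_elements=32,
                                     interconnect_topology="mesh",
                                     datapath_width_bits=16)
print(bayes_inference_cfg)
```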


2021
Author(s): Zhipeng Zeng

High Level Synthesis (HLS) has bridged the gap between the Electronic System Level (ESL) and its corresponding structural representation at the Register Transfer Level (RTL). However, the most critical task during HLS is to assess the design space and find a superior architecture that meets the design objectives. This thesis introduces a novel mechanism for efficient Design Space Exploration (DSE) based on a priority factor and a fuzzy search technique to achieve the optimum result. This approach is more efficient than traditional DSE approaches and drastically reduces the number of architectural variants that must be assessed for architecture selection. The proposed method, when applied to a number of benchmarks, yielded improved results with remarkable speedup compared to the existing approach. The HLS design flow presented in this thesis uses the proposed approach for DSE with optimization of three parameters: hardware area, execution time and power consumption.
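The general idea of priority-driven pruning of a design space can be sketched as below: each candidate architecture receives a priority score from cheap estimates, and only the highest-priority fraction is evaluated in detail. The scoring formula and pruning ratio are assumptions for illustration; the thesis' actual priority factor and fuzzy search procedure are not reproduced here.

```python
# Hedged sketch of priority-based pruning for DSE. The score weights,
# normalised metrics and keep fraction are all assumed values.

def priority(candidate, w_area=0.4, w_time=0.4, w_power=0.2):
    """Lower is better: weighted sum of normalised area, time and power estimates."""
    return (w_area * candidate["area"] +
            w_time * candidate["time"] +
            w_power * candidate["power"])

def prune_design_space(candidates, keep_fraction=0.1):
    """Keep only the most promising fraction for detailed evaluation."""
    ranked = sorted(candidates, key=priority)
    keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:keep]

space = [{"area": a, "time": t, "power": p}
         for a in (0.2, 0.5, 0.9) for t in (0.3, 0.6, 1.0) for p in (0.1, 0.4)]
print(prune_design_space(space, keep_fraction=0.2))
```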


2021 · Vol 18 (4) · pp. 1-25
Author(s): Paul Metzger, Volker Seeker, Christian Fensch, Murray Cole

Existing OS techniques for homogeneous many-core systems make it simple for single- and multithreaded applications to migrate between cores. Heterogeneous systems do not benefit as fully from this flexibility, and applications that cannot migrate in mid-execution may lose potential performance. The situation is particularly challenging when a switch of language runtime would be desirable in conjunction with a migration. We present a case study in making heterogeneous CPU + GPU systems more flexible in this respect. Our technique for fine-grained application migration allows switches between OpenMP, OpenCL, and CUDA execution, in conjunction with migrations from GPU to CPU and from CPU to GPU. To achieve this, we subdivide iteration spaces into slices and consider migration on a slice-by-slice basis. We show that slice sizes can be learned offline by machine learning models. To further improve performance, memory transfers are made migration-aware. The complexity of the migration capability is hidden from programmers behind a high-level programming model. We present a detailed evaluation of our mid-kernel migration mechanism with the First Come, First Served scheduling policy. We compare our technique in a focused evaluation scenario against idealized kernel-by-kernel scheduling, which is typical for current systems and makes perfect kernel-to-device scheduling decisions but cannot migrate kernels mid-execution. Models show that up to 1.33× speedup can be achieved over these systems by adding fine-grained migration. Our experimental results with all nine applicable SHOC and Rodinia benchmarks achieve speedups of up to 1.30× (1.08× on average) over an implementation of a perfect but migration-incapable scheduler when work is migrated to a faster device. Our mechanism and slice size choices introduce an average slowdown of only 2.44% if kernels never migrate. Lastly, our programming model reduces code size by at least 88% compared to manual implementations of migratable kernels.
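The slice-by-slice execution scheme described above can be sketched as follows: the iteration space is cut into fixed-size slices, and at each slice boundary a scheduler may move the remaining work to another device. The slice size, devices and migration policy below are assumptions; the real OpenMP/OpenCL/CUDA back ends are modelled here by plain Python functions.

```python
# Illustrative sketch of slice-by-slice execution with mid-kernel migration.
# Device back ends are stand-in callables; the policy is a toy assumption.

def run_sliced(total_iters, slice_size, kernels, schedule_device):
    """kernels: device name -> callable(start, end);
    schedule_device: slice index -> device name (the migration decision)."""
    results = []
    for slice_idx, start in enumerate(range(0, total_iters, slice_size)):
        end = min(start + slice_size, total_iters)
        device = schedule_device(slice_idx)          # possible migration point
        results.extend(kernels[device](start, end))
    return results

kernels = {
    "cpu": lambda s, e: [i * i for i in range(s, e)],
    "gpu": lambda s, e: [i * i for i in range(s, e)],  # same semantics, different back end
}
# Toy policy: start on the CPU, migrate to the GPU after the second slice.
print(run_sliced(10, slice_size=3, kernels=kernels,
                 schedule_device=lambda k: "cpu" if k < 2 else "gpu"))
```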


The productivity of land has often been discussed and deliberated by academia and policymakers to understand agriculture; however, very few studies have focused on agricultural worker productivity to analyse this sector. This study concentrates on the productivity of agricultural workers across the states, taking two time points into consideration. Agricultural worker productivity needs to be examined seriously and on a time-series basis, so that not only the marginal productivity of workers can be ascertained but also the dependency of workers on agriculture is revealed. There is still disguised unemployment in all the states and a high level of labour migration, yet most of the states showed that this dependency has gone down. A state like Madhya Pradesh is doing very well in terms of income earned, but at the cost of an increased agricultural workforce, as a result of which worker productivity has gone down. States like Mizoram, Meghalaya, Nagaland and Tripura, though small in size, showed remarkable growth in productivity, and all of these states showed a positive trend of workers shifting away from agriculture. The traditional states which gained the most from the Green Revolution of the sixties are performing decently well, but they need the next major policy push to move to the next orbit of growth.

