Hardware and Software Synthesis of Heterogeneous Systems from Dataflow Programs

The new generation of multicore processors and reconfigurable hardware platforms provides a dramatic increase in available parallelism and processing capability. However, one obstacle to exploiting the full promise of such platforms is deeply rooted in sequential thinking. The sequential programming model does not naturally expose the potential parallelism needed to build parallel applications that map efficiently onto different kinds of platforms. A paradigm shift is necessary at all levels of application development to yield portable and scalable implementations on the widest range of heterogeneous platforms. This paper presents a design flow for the hardware and software synthesis of heterogeneous systems. The flow automatically generates hardware and software components, as well as the appropriate interfaces, from a single high-level description of the application based on the dataflow paradigm, targeting heterogeneous architectures composed of reconfigurable hardware units and multicore processors. Experimental results from implementing several video coding algorithms on heterogeneous platforms show the effectiveness of the approach in terms of both portability and scalability.
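
To make the dataflow paradigm mentioned above concrete, here is a minimal Python sketch of actors that communicate only through FIFO queues and fire when input tokens are available. It is an illustration only, not the paper's design flow or tooling; the `Actor` class, the `run` scheduler, and the two example actors are hypothetical names introduced for this sketch.

```python
from collections import deque


class Actor:
    """Hypothetical dataflow actor: fires when every input FIFO holds a token."""

    def __init__(self, func, inputs, output):
        self.func = func        # computation applied to one token per input
        self.inputs = inputs    # list of input FIFOs (deques)
        self.output = output    # single output FIFO (deque)

    def can_fire(self):
        return all(len(queue) > 0 for queue in self.inputs)

    def fire(self):
        # Consume one token from each input and produce one output token.
        tokens = [queue.popleft() for queue in self.inputs]
        self.output.append(self.func(*tokens))


def run(actors):
    """Fire any ready actor until no actor can fire any more."""
    progress = True
    while progress:
        progress = False
        for actor in actors:
            if actor.can_fire():
                actor.fire()
                progress = True


# Two-actor network: (a, b) -> adder -> summed -> doubler -> out
a, b = deque([1, 2, 3]), deque([10, 20, 30])
summed, out = deque(), deque()
adder = Actor(lambda x, y: x + y, [a, b], summed)
doubler = Actor(lambda x: 2 * x, [summed], out)

# The listing order of actors does not matter: firing depends only on tokens.
run([doubler, adder])
print(list(out))
```

Because the actors share no state other than the queues, each one could in principle be mapped to a different processing element, which is the property a dataflow description exposes.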

A "NEAR-THE-BEST" SYSTEM-LEVEL DESIGN METHODOLOGY OF MULTI-CORE H.264 VIDEO DECODER BASED ON THE PARALLELIZED MULTI-CORE SIMULATOR

Journal of Circuits, Systems and Computers ◽

10.1142/s0218126612500582 ◽

2012 ◽

Vol 21 (07) ◽

pp. 1250058

Author(s):

BINGBING XIA ◽

FEI QIAO ◽

ZIDONG DU ◽

DI ZHU ◽

HUAZHONG YANG

Keyword(s):

Video Processing ◽

Power Efficiency ◽

Programming Model ◽

Good Choice ◽

Design Flow ◽

System Level ◽

Core System ◽

Video Decoder ◽

Level Design ◽

High Level

An H.264 video decoder is a good choice for embedded video processing applications because it achieves a higher compression ratio than MPEG-2, although it demands more run-time computational resources. Multi-core systems are the future of embedded processor design because of their power efficiency and multi-thread parallelization capability, and they fit the requirements of such video processing algorithms well. To simulate and evaluate the performance of these multi-core systems effectively, a system-level design flow is developed. At the higher level, a TLM language (SystemC) is combined with a shared-memory parallel programming model (OpenMP) for transaction-level simulation; at the lower level, a multi-core simulator that extends the SimpleScalar 3.0 tool set provides cycle-accurate simulation. Compared with other high-level simulation methods, ours can realize truly parallel simulation. Moreover, experiments show that this methodology can simulate complex multi-core applications in a short time (much less than RTL-level simulation) to find the appropriate core count and task allocation strategy, and the results deviate by less than 15% from the ideal ones calculated from Amdahl's Law. The parallelization strategy obtained from such simulation is therefore the best one, and it can be further applied to the RTL-level design of the final multi-core system.
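
The comparison against Amdahl's Law can be sketched in a few lines of Python. The formula is the standard one, speedup = 1 / ((1 - p) + p / n) for parallel fraction p and n cores. The parallel fraction, core count, and measured speedup below are made-up illustration values, not figures from the paper, and the function names are hypothetical.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup predicted by Amdahl's Law."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must lie in [0, 1]")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


def relative_deviation(measured, ideal):
    """Deviation of a measured speedup from the ideal one, as a fraction."""
    return abs(ideal - measured) / ideal


# Assumed example values (not taken from the paper).
ideal = amdahl_speedup(parallel_fraction=0.9, cores=4)
measured = 2.8
print(f"ideal speedup:  {ideal:.3f}")
print(f"deviation:      {relative_deviation(measured, ideal):.1%}")
```

With these assumed inputs the simulated result would fall inside the 15% band the abstract reports; other parallel fractions or core counts would of course give different numbers.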

A Framework for Developing Parallel Applications with high level Tasks on Heterogeneous Platforms

Proceedings of the 8th International Workshop on Programming Models and Applications for Multicores and Manycores - PMAM'17 ◽

10.1145/3026937.3026946 ◽

2017 ◽

Cited By ~ 1

Author(s):

Chao Liu ◽

Miriam Leeser

Keyword(s):

Parallel Applications ◽

Heterogeneous Platforms ◽

High Level

PERFORMANCE EVALUATION OF BLAS ON THE TRIDENT PROCESSOR

Parallel Processing Letters ◽

10.1142/s0129626405002325 ◽

2005 ◽

Vol 15 (04) ◽

pp. 407-414

Author(s):

MOSTAFA I. SOLIMAN ◽

STANISLAV G. SEDUKHIN

Keyword(s):

High Performance ◽

Programming Model ◽

Parallel Applications ◽

Instruction Set ◽

Code Size ◽

Data Parallel ◽

Fine Grain ◽

Multi Level ◽

High Level ◽

Programming Interface

Different subtasks of an application usually have different computational, memory, and I/O requirements, which result in different needs for computer capabilities. Thus, a more appropriate approach for achieving both high performance and a simple programming model is to design a processor with a multi-level instruction set architecture (ISA). This leads to high performance and minimum executable code size. Since the fundamental data structures for a wide variety of existing applications are scalars, vectors, and matrices, our research processor, Trident, has a three-level ISA that executes on zero-, one-, and two-dimensional arrays of data. These levels are used to express a great amount of fine-grain data parallelism directly to the processor, instead of extracting it dynamically with complicated logic or statically with compilers. This reduces design complexity and provides a high-level programming interface to the hardware. In this paper, the performance of the Trident processor is evaluated on BLAS, which represents the kernel operations of many data-parallel applications. We show that the Trident processor reduces the number of clock cycles per floating-point operation in proportion to the number of execution datapaths.

Implicitly threaded parallelism in Manticore

Journal of Functional Programming ◽

10.1017/s0956796810000201 ◽

2010 ◽

Vol 20 (5-6) ◽

pp. 537-576 ◽

Cited By ~ 29

Author(s):

MATTHEW FLUET ◽

MIKE RAINEY ◽

JOHN REPPY ◽

ADAM SHAW

Keyword(s):

Large Scale ◽

Multicore Processors ◽

Regular Structure ◽

Parallel Applications ◽

Parallel Language ◽

Fine Grained ◽

Data Parallel ◽

Parallel Languages ◽

Parallel Case ◽

High Level

The increasing availability of commodity multicore processors is making parallel computing ever more widespread. In order to exploit its potential, programmers need languages that make the benefits of parallelism accessible and understandable. Previous parallel languages have traditionally been intended for large-scale scientific computing, and they tend not to be well suited to programming the applications one typically finds on a desktop system. Thus, we need new parallel-language designs that address a broader spectrum of applications. The Manticore project is our effort to address this need. At its core is Parallel ML, a high-level functional language for programming parallel applications on commodity multicore hardware. Parallel ML provides a diverse collection of parallel constructs for different granularities of work. In this paper, we focus on the implicitly threaded parallel constructs of the language, which support fine-grained parallelism. We concentrate on those elements that distinguish our design from related ones, namely, a novel parallel binding form, a nondeterministic parallel case form, and the treatment of exceptions in the presence of data parallelism. These features differentiate the present work from related work on functional data-parallel language designs, which have focused largely on parallel problems with regular structure and the compiler transformations (most notably, flattening) that make such designs feasible. We present detailed examples utilizing various mechanisms of the language and give a formal description of our implementation.

Device Hopping

ACM Transactions on Architecture and Code Optimization ◽

10.1145/3471909 ◽

2021 ◽

Vol 18 (4) ◽

pp. 1-25

Author(s):

Paul Metzger ◽

Volker Seeker ◽

Christian Fensch ◽

Murray Cole

Keyword(s):

Programming Model ◽

Heterogeneous Systems ◽

Code Size ◽

Fine Grained ◽

Scheduling Policy ◽

High Level ◽

Many Core ◽

Execution Models ◽

Current Systems

Existing OS techniques for homogeneous many-core systems make it simple for single and multithreaded applications to migrate between cores. Heterogeneous systems do not benefit so fully from this flexibility, and applications that cannot migrate in mid-execution may lose potential performance. The situation is particularly challenging when a switch of language runtime would be desirable in conjunction with a migration. We present a case study in making heterogeneous CPU + GPU systems more flexible in this respect. Our technique for fine-grained application migration allows switches between OpenMP, OpenCL, and CUDA execution, in conjunction with migrations from GPU to CPU, and CPU to GPU. To achieve this, we subdivide iteration spaces into slices, and consider migration on a slice-by-slice basis. We show that slice sizes can be learned offline by machine learning models. To further improve performance, memory transfers are made migration-aware. The complexity of the migration capability is hidden from programmers behind a high-level programming model. We present a detailed evaluation of our mid-kernel migration mechanism with the First Come, First Served scheduling policy. We compare our technique in a focused evaluation scenario against idealized kernel-by-kernel scheduling, which is typical for current systems, and makes perfect kernel-to-device scheduling decisions, but cannot migrate kernels mid-execution. Models show that up to 1.33× speedup can be achieved over these systems by adding fine-grained migration. Our experimental results with all nine applicable SHOC and Rodinia benchmarks achieve speedups of up to 1.30× (1.08× on average) over an implementation of a perfect but kernel-migration-incapable scheduler when migrated to a faster device. Our mechanism and slice size choices introduce an average slowdown of only 2.44% if kernels never migrate. Lastly, our programming model reduces the code size by at least 88% if compared to manual implementations of migratable kernels.
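
The slice-by-slice idea can be illustrated with a small Python sketch: the iteration space is cut into fixed-size slices, and a device is chosen again before each slice, so execution can move between slices rather than only between kernels. This is a toy model under stated assumptions, not the authors' mechanism. The names `make_slices`, `run_sliced`, and `choose_device` are hypothetical, the "devices" are plain labels, and no OpenMP, OpenCL, or CUDA code is involved.

```python
def make_slices(n, slice_size):
    """Split the iteration space [0, n) into half-open slices of at most slice_size."""
    return [(start, min(start + slice_size, n)) for start in range(0, n, slice_size)]


def run_sliced(data, kernel, slice_size, choose_device):
    """Apply kernel to every element, re-deciding the device before each slice.

    choose_device is a hypothetical policy callback: it receives the slice
    index and returns a device label. In this sketch the label is only
    recorded; a real system would dispatch the slice to that device.
    """
    out = [None] * len(data)
    trace = []
    for index, (lo, hi) in enumerate(make_slices(len(data), slice_size)):
        trace.append(choose_device(index))
        for i in range(lo, hi):
            out[i] = kernel(data[i])
    return out, trace


# Assumed scenario: a faster device becomes free after the first slice.
data = list(range(10))
result, trace = run_sliced(
    data,
    kernel=lambda x: x * x,
    slice_size=4,
    choose_device=lambda index: "cpu" if index < 1 else "gpu",
)
print(result)
print(trace)
```

The slice size controls the trade-off: smaller slices give more migration points but more scheduling overhead, which is presumably why the abstract treats slice size as something to be learned.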

Durable functions: semantics for stateful serverless

Proceedings of the ACM on Programming Languages ◽

10.1145/3485510 ◽

2021 ◽

Vol 5 (OOPSLA) ◽

pp. 1-27

Author(s):

Sebastian Burckhardt ◽

Chris Gillum ◽

David Justo ◽

Konstantinos Kallas ◽

Connor McMahon ◽

...

Keyword(s):

Programming Model ◽

Lambda Calculus ◽

Application Development ◽

Wide Range ◽

Level Model ◽

Elastic Scaling ◽

High Level ◽

New Generation ◽

Execution Models ◽

Critical Sections

Serverless, or Functions-as-a-Service (FaaS), is an increasingly popular paradigm for application development, as it provides implicit elastic scaling and load-based billing. However, the weak execution guarantees and intrinsic compute-storage separation of FaaS create serious challenges when developing applications that require persistent state, reliable progress, or synchronization. This has motivated a new generation of serverless frameworks that provide stateful abstractions. For instance, Azure's Durable Functions (DF) programming model enhances FaaS with actors, workflows, and critical sections. As a programming model, DF is interesting because it combines task and actor parallelism, which makes it suitable for a wide range of serverless applications. We describe DF both informally, using examples, and formally, using an idealized high-level model based on the untyped lambda calculus. Next, we demystify how the DF runtime can (1) execute in a distributed unreliable serverless environment with compute-storage separation, yet still conform to the fault-free high-level model, and (2) persist execution progress without requiring checkpointing support by the language runtime. To this end, we define two progressively more complex execution models, which contain the compute-storage separation and the record-replay, and prove that they are equivalent to the high-level model.
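
The record-replay idea, persisting progress without language-level checkpointing, can be sketched with a Python generator. This is a simplified illustration and not the Durable Functions API or the paper's formal model. The `replay` function, the `workflow` orchestrator, and the activity table are hypothetical names, and the sketch assumes that activities are deterministic and that the recorded history survives a crash.

```python
def replay(orchestrator, history, activities):
    """Re-run an orchestrator from the start, answering requests from history.

    Results already recorded in history are replayed without re-executing the
    activity; new requests are executed and appended to history.
    Returns (final_value, number_of_activities_actually_executed).
    """
    gen = orchestrator()
    executed = 0
    step = 0
    try:
        request = next(gen)
        while True:
            if step < len(history):
                result = history[step]              # replayed from the record
            else:
                name, arg = request
                result = activities[name](arg)      # executed for the first time
                history.append(result)              # recorded for later replays
                executed += 1
            step += 1
            request = gen.send(result)
    except StopIteration as done:
        return done.value, executed


def workflow():
    """Hypothetical orchestrator: each yield requests one activity."""
    doubled = yield ("double", 3)
    final = yield ("increment", doubled)
    return final


activities = {"double": lambda x: 2 * x, "increment": lambda x: x + 1}

# Fresh run: both activities execute and their results are recorded.
history = []
first_value, first_executed = replay(workflow, history, activities)

# Simulated restart after a crash that happened once step one was recorded.
recovered_history = [6]
second_value, second_executed = replay(workflow, recovered_history, activities)
print(first_value, first_executed, second_value, second_executed)
```

The orchestrator code itself is never checkpointed: restarting it from the top and feeding it recorded results brings it back to the same point, which only works if the orchestrator is deterministic between yields.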

Designing heterogeneous systems including programmable hardware, multicore processors, DSP processors and more

IET and Electronics Weekly Seminar on Programmable Hardware Systems ◽

10.1049/ic:20080616 ◽

2008 ◽

Author(s):

C. Turner

Keyword(s):

Multicore Processors ◽

Heterogeneous Systems ◽

DSP Processors ◽

Programmable Hardware

High-Level Parallel Ant Colony Optimization with Algorithmic Skeletons

International Journal of Parallel Programming ◽

10.1007/s10766-021-00714-1 ◽

2021 ◽

Author(s):

Breno A. de Melo Menezes ◽

Nina Herrmann ◽

Herbert Kuchen ◽

Fernando Buarque de Lima Neto

Keyword(s):

Ant Colony Optimization ◽

High Performance ◽

Optimization Problems ◽

Programming Model ◽

Parallel Implementation ◽

Ant Colony ◽

Algorithmic Skeletons ◽

Low Level ◽

Programming Patterns ◽

High Level

Parallel implementations of swarm intelligence algorithms such as ant colony optimization (ACO) have been widely used to shorten the execution time when solving complex optimization problems. When aiming for a GPU environment, developing efficient parallel versions of such algorithms using CUDA can be a difficult and error-prone task even for experienced programmers. To overcome this issue, the parallel programming model of Algorithmic Skeletons simplifies parallel programs by abstracting from low-level features. This is realized by defining common programming patterns (e.g. map, fold and zip) that later on will be converted to efficient parallel code. In this paper, we show how algorithmic skeletons formulated in the domain-specific language Musket can cope with the development of a parallel implementation of ACO and how that compares to a low-level implementation. Our experimental results show that Musket suits the development of ACO. Besides making it easier for the programmer to deal with the parallelization aspects, Musket generates high-performance code with similar execution times when compared to low-level implementations.
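
The three patterns named above (map, fold, zip) can be sketched sequentially in Python to show how an ACO step might be phrased with them. This is not Musket syntax and not the authors' implementation; the skeleton function names, the pheromone update used as the example, and all numeric values are assumptions made for illustration. A skeleton library would supply parallel implementations behind the same interface.

```python
from functools import reduce


def map_skeleton(f, xs):
    """Apply f to every element independently (parallelizable per element)."""
    return [f(x) for x in xs]


def zip_skeleton(f, xs, ys):
    """Combine two equally long sequences element by element."""
    if len(xs) != len(ys):
        raise ValueError("zip_skeleton needs sequences of equal length")
    return [f(x, y) for x, y in zip(xs, ys)]


def fold_skeleton(f, xs, initial):
    """Reduce a sequence to one value with a binary operator."""
    return reduce(f, xs, initial)


# Assumed example: pheromone evaporation followed by deposit on three edges.
evaporation_rate = 0.5
pheromone = [1.0, 2.0, 4.0]
deposit = [0.5, 0.0, 1.0]

evaporated = map_skeleton(lambda tau: (1.0 - evaporation_rate) * tau, pheromone)
updated = zip_skeleton(lambda tau, delta: tau + delta, evaporated, deposit)
total = fold_skeleton(lambda acc, tau: acc + tau, updated, 0.0)
print(updated, total)
```

The program states only what is computed per element and how results are combined; how the work is distributed across GPU threads is left to the skeleton implementation, which is the abstraction the abstract describes.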
