Device Hopping

Existing OS techniques for homogeneous many-core systems make it simple for single-threaded and multithreaded applications to migrate between cores. Heterogeneous systems do not benefit so fully from this flexibility, and applications that cannot migrate in mid-execution may lose potential performance. The situation is particularly challenging when a switch of language runtime would be desirable in conjunction with a migration. We present a case study in making heterogeneous CPU + GPU systems more flexible in this respect. Our technique for fine-grained application migration allows switches between OpenMP, OpenCL, and CUDA execution, in conjunction with migrations from GPU to CPU and from CPU to GPU. To achieve this, we subdivide iteration spaces into slices and consider migration on a slice-by-slice basis. We show that slice sizes can be learned offline by machine learning models. To further improve performance, memory transfers are made migration-aware. The complexity of the migration capability is hidden from programmers behind a high-level programming model. We present a detailed evaluation of our mid-kernel migration mechanism with the First Come, First Served scheduling policy. We compare our technique in a focused evaluation scenario against idealized kernel-by-kernel scheduling, which is typical for current systems and makes perfect kernel-to-device scheduling decisions but cannot migrate kernels mid-execution. Models show that up to 1.33× speedup can be achieved over these systems by adding fine-grained migration. Our experimental results with all nine applicable SHOC and Rodinia benchmarks achieve speedups of up to 1.30× (1.08× on average) over an implementation of a perfect but kernel-migration-incapable scheduler when migrated to a faster device. Our mechanism and slice size choices introduce an average slowdown of only 2.44% if kernels never migrate.
Lastly, our programming model reduces code size by at least 88% compared with manual implementations of migratable kernels.
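
The slice-by-slice idea in the abstract above can be illustrated with a minimal sketch. This is not the authors' implementation: `run_sliced`, `pick_device`, and the two kernels are hypothetical names, the "devices" are plain Python functions standing in for OpenMP, OpenCL, or CUDA back ends, and the slice size is hard-coded here rather than learned offline.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical back ends: each processes iterations [start, stop) of one kernel.
# In a real system these would launch OpenMP, OpenCL, or CUDA code.
Kernel = Callable[[List[float], int, int], None]


def cpu_kernel(data: List[float], start: int, stop: int) -> None:
    for i in range(start, stop):
        data[i] = data[i] * 2.0


def gpu_kernel(data: List[float], start: int, stop: int) -> None:
    # Same result as cpu_kernel; only the (simulated) device differs.
    data[start:stop] = [x * 2.0 for x in data[start:stop]]


def run_sliced(
    data: List[float],
    devices: Dict[str, Kernel],
    pick_device: Callable[[int], str],
    slice_size: int,
) -> List[Tuple[str, int, int]]:
    """Run a kernel over `data` in slices, re-choosing the device per slice."""
    log = []
    for start in range(0, len(data), slice_size):
        stop = min(start + slice_size, len(data))
        name = pick_device(start)  # migration decision point between slices
        devices[name](data, start, stop)
        log.append((name, start, stop))
    return log


data = [float(i) for i in range(10)]
devices = {"cpu": cpu_kernel, "gpu": gpu_kernel}
# Assumed policy for illustration: the GPU becomes free from iteration 4 on.
log = run_sliced(data, devices, lambda s: "gpu" if s >= 4 else "cpu", slice_size=4)
```

Because the device is re-chosen only at slice boundaries, a smaller slice size gives more migration opportunities at the price of more per-slice overhead, which is the trade-off the learned slice sizes are meant to balance.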

One-IPC high-level simulation of microthreaded many-core architectures

The International Journal of High Performance Computing Applications ◽

10.1177/1094342015584495 ◽

2016 ◽

Vol 31 (2) ◽

pp. 152-162 ◽

Cited By ~ 3

Author(s):

Irfan Uddin

Keyword(s):

Design Space Exploration ◽

Instruction Set ◽

Efficient Design ◽

Simulation Framework ◽

Fine Grained ◽

Detailed Simulation ◽

High Level ◽

Many Core ◽

The Cost ◽

Multiple Clusters

The microthreaded many-core architecture comprises multiple clusters of fine-grained multi-threaded cores. Concurrency management is supported in the instruction set architecture of the cores, and the computational work in an application is asynchronously delegated to different clusters of cores, where each cluster is allocated dynamically. Computer architects are always interested in analyzing the complex interaction among the dynamically allocated resources. Generally, a detailed simulation with a cycle-accurate simulation of the execution time is used. However, the cycle-accurate simulator for the microthreaded architecture executes at a rate of 100,000 instructions per second, divided over the number of simulated cores. This means that the evaluation of a complex application executing on a contemporary multi-core machine can be very slow. To perform efficient design space exploration, we present a co-simulation environment in which the detailed execution of instructions in the pipeline of microthreaded cores and the interactions among the hardware components are abstracted. We evaluate the high-level simulation framework against the cycle-accurate simulation framework. The results show that the high-level simulator is faster and less complicated than the cycle-accurate simulator, but at the cost of some accuracy.
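
To make the simulation cost concrete, the following back-of-the-envelope sketch applies the rate of 100,000 instructions per second quoted in the abstract. The core count (128) and the workload size (one billion instructions) are assumed values chosen for illustration, not figures from the paper.

```python
SIM_RATE = 100_000  # instructions per second, aggregate rate quoted in the abstract


def per_core_rate(num_cores: int) -> float:
    """Simulated instructions per second available to each simulated core."""
    return SIM_RATE / num_cores


def wall_clock_seconds(total_instructions: int) -> float:
    """Time needed to simulate a workload at the aggregate rate."""
    return total_instructions / SIM_RATE


rate_128 = per_core_rate(128)             # 781.25 instructions/s per core
hours = wall_clock_seconds(10**9) / 3600  # roughly 2.8 hours
```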

Many-core needs fine-grained scheduling: A case study of query processing on Intel Xeon Phi processors

Journal of Parallel and Distributed Computing ◽

10.1016/j.jpdc.2017.09.005 ◽

2018 ◽

Vol 120 ◽

pp. 395-404 ◽

Cited By ~ 3

Author(s):

Xuntao Cheng ◽

Bingsheng He ◽

Mian Lu ◽

Chiew Tong Lau

Keyword(s):

Query Processing ◽

Xeon Phi ◽

Intel Xeon Phi ◽

Fine Grained ◽

Many Core ◽

Intel Xeon

City-Scale Mapping of Urban Façade Color Using Street-View Imagery

Remote Sensing ◽

10.3390/rs13081591 ◽

2021 ◽

Vol 13 (8) ◽

pp. 1591

Author(s):

Teng Zhong ◽

Cheng Ye ◽

Zian Wang ◽

Guoan Tang ◽

Wei Zhang ◽

...

Keyword(s):

Deep Learning ◽

Urban Space ◽

Space Planning ◽

Fine Grained ◽

Regular Grids ◽

Planning And Design ◽

Street View ◽

Gray Tone ◽

High Level

Precise urban façade color is the foundation of urban color planning. Nevertheless, existing research on urban colors usually relies on manual sampling due to technical limitations, which makes it challenging to evaluate urban façade color at city scale and fine-grained resolution simultaneously. In this study, we propose a deep learning-based approach for mapping the urban façade color using street-view imagery. The dominant color of the urban façade (DCUF) is adopted as an indicator to describe the urban façade color. A case study in Shenzhen was conducted to measure the urban façade color using Baidu Street View (BSV) panoramas, with city-scale mapping of the urban façade color in both irregular geographical units and regular grids. Shenzhen’s urban façade color has a gray tone with low chroma. The results demonstrate that the proposed method has a high level of accuracy for the extraction of the urban façade color. In short, this study contributes to the development of urban color planning by efficiently analyzing the urban façade color with higher levels of validity across city-scale areas. Insights into the mapping of the urban façade color from the humanistic perspective could facilitate higher quality urban space planning and design.

Exploring Many-Core Design Templates for FPGAs and ASICs

International Journal of Reconfigurable Computing ◽

10.1155/2012/439141 ◽

2012 ◽

Vol 2012 ◽

pp. 1-15 ◽

Cited By ~ 4

Author(s):

Ilia Lebedev ◽

Christopher Fletcher ◽

Shaoyi Cheng ◽

James Martin ◽

Austin Doupnik ◽

...

Keyword(s):

Graphics Processing Unit ◽

General Purpose ◽

Coarse Grained ◽

Processing Unit ◽

Fine Grained ◽

Data Parallel ◽

Level Data ◽

Graph Inference ◽

High Level ◽

Many Core

We present a highly productive approach to hardware design based on a many-core microarchitectural template used to implement compute-bound applications expressed in a high-level data-parallel language such as OpenCL. The template is customized on a per-application basis via a range of high-level parameters such as the interconnect topology or processing element architecture. The key benefits of this approach are that it (i) allows programmers to express parallelism through an API defined in a high-level programming language, (ii) supports coarse-grained multithreading and fine-grained threading while permitting bit-level resource control, and (iii) reduces the effort required to repurpose the system for different algorithms or different applications. We compare template-driven design to both full-custom and programmable approaches by studying implementations of a compute-bound data-parallel Bayesian graph inference algorithm across several candidate platforms. Specifically, we examine a range of template-based implementations on both FPGA and ASIC platforms and compare each against full custom designs. Throughout this study, we use a general-purpose graphics processing unit (GPGPU) implementation as a performance and area baseline. We show that our approach, similar in productivity to programmable approaches such as GPGPU applications, yields implementations with performance approaching that of full-custom designs on both FPGA and ASIC platforms.

PERFORMANCE EVALUATION OF BLAS ON THE TRIDENT PROCESSOR

Parallel Processing Letters ◽

10.1142/s0129626405002325 ◽

2005 ◽

Vol 15 (04) ◽

pp. 407-414

Author(s):

MOSTAFA I. SOLIMAN ◽

STANISLAV G. SEDUKHIN

Keyword(s):

High Performance ◽

Programming Model ◽

Parallel Applications ◽

Instruction Set ◽

Code Size ◽

Data Parallel ◽

Fine Grain ◽

Multi Level ◽

High Level ◽

Programming Interface

Different subtasks of an application usually have different computational, memory, and I/O requirements, which result in different needs for computer capabilities. Thus, a more appropriate approach for both high performance and a simple programming model is to design a processor with a multi-level instruction set architecture (ISA). This leads to high performance and minimum executable code size. Since the fundamental data structures for a wide variety of existing applications are scalars, vectors, and matrices, our research processor, Trident, has a three-level ISA executed on zero-, one-, and two-dimensional arrays of data. These levels are used to express a great amount of fine-grain data parallelism to the processor directly, instead of extracting it dynamically with complicated logic or statically with compilers. This reduces the design complexity and provides a high-level programming interface to hardware. In this paper, the performance of the Trident processor is evaluated on BLAS, which represents the kernel operations of many data-parallel applications. We show that the Trident processor reduces the number of clock cycles per floating-point operation proportionally as the number of execution datapaths increases.

Hardware and Software Synthesis of Heterogeneous Systems from Dataflow Programs

Journal of Electrical and Computer Engineering ◽

10.1155/2012/484962 ◽

2012 ◽

Vol 2012 ◽

pp. 1-11 ◽

Cited By ~ 10

Author(s):

Ghislain Roquier ◽

Endri Bezati ◽

Marco Mattavelli

Keyword(s):

Programming Model ◽

Multicore Processors ◽

Heterogeneous Systems ◽

Reconfigurable Hardware ◽

Design Flow ◽

Parallel Applications ◽

Application Development ◽

Software Synthesis ◽

Heterogeneous Platforms ◽

High Level

The new generation of multicore processors and reconfigurable hardware platforms provides a dramatic increase in available parallelism and processing capabilities. However, one obstacle to exploiting all the promises of such platforms is deeply rooted in sequential thinking. The sequential programming model does not naturally expose the potential parallelism needed to build parallel applications that can be efficiently mapped onto different kinds of platforms. A paradigm shift is necessary at all levels of application development to yield portable and scalable implementations on the widest range of heterogeneous platforms. This paper presents a design flow for the hardware and software synthesis of heterogeneous systems that automatically generates hardware and software components, as well as appropriate interfaces, from a single high-level description of the application based on the dataflow paradigm, targeting heterogeneous architectures composed of reconfigurable hardware units and multicore processors. Experimental results based on the implementation of several video coding algorithms on heterogeneous platforms are also provided to show the effectiveness of the approach in terms of both portability and scalability.

HIGH LEVEL SIMULATION OF SVP MANY-CORE SYSTEMS

Parallel Processing Letters ◽

10.1142/s0129626411000308 ◽

2011 ◽

Vol 21 (04) ◽

pp. 413-438 ◽

Cited By ~ 6

Author(s):

M. IRFAN UDDIN ◽

MICHIEL W. VAN TOL ◽

CHRIS R. JESSHOPE

Keyword(s):

Execution Time ◽

Simulation Environment ◽

Dynamic Allocation ◽

Simulation Accuracy ◽

Low Level ◽

Fine Grained ◽

Control Code ◽

High Level ◽

Many Core ◽

Multiple Clusters

The Microgrid is a many-core architecture comprising multiple clusters of fine-grained multi-threaded cores. The SVP API supported by the cores allows for the asynchronous delegation of work to different clusters of cores, which can be acquired dynamically. We want to explore the execution of complex applications and their interaction with dynamically allocated resources. To date, any evaluation of the Microgrid has used a detailed emulation with a cycle-accurate simulation of the execution time. Although the emulator can be used to evaluate small program kernels, it only executes at a rate of 100K instructions per second, divided over the number of emulated cores. This makes it inefficient to evaluate a complex application executing on many cores using dynamic allocation of clusters. To obtain a more efficient evaluation, we have developed a co-simulation environment that executes high-level SVP control code but abstracts the scheduling of the low-level threads using two different techniques. The co-simulation is evaluated for both performance and simulation accuracy.

Durable functions: semantics for stateful serverless

Proceedings of the ACM on Programming Languages ◽

10.1145/3485510 ◽

2021 ◽

Vol 5 (OOPSLA) ◽

pp. 1-27

Author(s):

Sebastian Burckhardt ◽

Chris Gillum ◽

David Justo ◽

Konstantinos Kallas ◽

Connor McMahon ◽

...

Keyword(s):

Programming Model ◽

Lambda Calculus ◽

Application Development ◽

Wide Range ◽

Level Model ◽

Elastic Scaling ◽

High Level ◽

New Generation ◽

Execution Models ◽

Critical Sections

Serverless, or Functions-as-a-Service (FaaS), is an increasingly popular paradigm for application development, as it provides implicit elastic scaling and load-based billing. However, the weak execution guarantees and intrinsic compute-storage separation of FaaS create serious challenges when developing applications that require persistent state, reliable progress, or synchronization. This has motivated a new generation of serverless frameworks that provide stateful abstractions. For instance, Azure's Durable Functions (DF) programming model enhances FaaS with actors, workflows, and critical sections. As a programming model, DF is interesting because it combines task and actor parallelism, which makes it suitable for a wide range of serverless applications. We describe DF both informally, using examples, and formally, using an idealized high-level model based on the untyped lambda calculus. Next, we demystify how the DF runtime can (1) execute in a distributed unreliable serverless environment with compute-storage separation, yet still conform to the fault-free high-level model, and (2) persist execution progress without requiring checkpointing support by the language runtime. To this end, we define two progressively more complex execution models, which contain the compute-storage separation and the record-replay, and prove that they are equivalent to the high-level model.
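
The record-replay idea mentioned in the abstract, persisting progress without checkpointing the language runtime, can be sketched generically. The code below is not the Durable Functions API or the paper's formal model: `orchestrator`, `run`, `tasks`, and the `max_new_steps` crash simulation are all hypothetical, and the sketch assumes the orchestrator is deterministic so that replaying recorded results reproduces the same sequence of calls.

```python
from typing import Any, Callable, Dict, Generator, List, Tuple

Call = Tuple[str, Any]  # (task name, argument)
calls: List[str] = []   # tracks which tasks were really executed


def double(x: int) -> int:
    calls.append("double")
    return x * 2


def add_one(x: int) -> int:
    calls.append("add_one")
    return x + 1


def orchestrator() -> Generator[Call, Any, Any]:
    # Hypothetical workflow: each yield asks the runtime to run one task.
    a = yield ("double", 3)
    b = yield ("add_one", a)
    return b


def run(orch, tasks: Dict[str, Callable], history: List[Any], max_new_steps: int):
    """Replay recorded results, then execute at most `max_new_steps` new tasks.

    `history` plays the role of the persisted log. Returns (done, result).
    """
    gen = orch()
    executed = 0
    step = 0
    try:
        call = next(gen)
        while True:
            if step < len(history):
                result = history[step]      # replay: task is not re-executed
            elif executed < max_new_steps:
                name, arg = call
                result = tasks[name](arg)
                history.append(result)      # record the result before continuing
                executed += 1
            else:
                return False, None          # simulated crash or eviction
            step += 1
            call = gen.send(result)
    except StopIteration as stop:
        return True, stop.value


tasks = {"double": double, "add_one": add_one}
history: List[Any] = []
first = run(orchestrator, tasks, history, max_new_steps=1)   # stops after one task
second = run(orchestrator, tasks, history, max_new_steps=5)  # restarts from the log
```

The second call restarts the orchestrator from the beginning, but `double` is not executed again because its result is read back from `history`; only the log needs to be stored durably, not the generator's in-memory state.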

Review for "Fine‐grained gravity flow deposits and their depositional processes: A case study from the Cretaceous Nenjiang Formation, Songliao Basin, NE China"

10.1002/gj.4017/v2/review2 ◽

2020 ◽

Keyword(s):

Songliao Basin ◽

Gravity Flow ◽

Fine Grained ◽

Depositional Processes

Work Life Balance Of Female Crew In The Aviation Industry: A Case Study

GIS Business ◽

10.26643/gis.v14i6.11699 ◽

2019 ◽

Vol 14 (6) ◽

pp. 206-212

Author(s):

Dr. D. Shoba ◽

Dr. G. Suganthi

Keyword(s):

Family Life ◽

Work Life ◽

Work Life Balance ◽

Aviation Industry ◽

Life Balance ◽

International Airport ◽

Work Pressure ◽

Women Employees ◽

High Level

Employees and employers both face issues with work-life balance. It has become a difficult area because work demands have grown with increasing work pressure and the complexity of handling technology. Drastic changes in the rules and regulations governing work in the aviation industry make work-life balance harder for employees and create more hurdles. As a result, women employees at all levels of the aviation industry experience many distractions and imbalances in their lives. This work pressure creates a high level of difficulty in maintaining a harmonious job and family life, especially for female aviation employees. Data were collected from 50 female crew members working at Cochin International Airport. The objective of this study is to analyze the work-life balance of women working at Cochin International Airport and its influence on their personal and professional lives. The results show that management should frame policies that help employees balance their personal and professional lives.
