Early Periodic Register Allocation on ILP Processors

2004
Vol 14 (02)
pp. 287-313
Author(s):
Sid-Ahmed-Ali TOUATI
Christine EISENBEIS

Register allocation in loops is generally performed after or during software pipelining, because a conventional register allocation carried out as a first step, without assuming a schedule, lacks information about the interferences between the live ranges of values. The register allocator may then introduce an excessive number of false dependences that dramatically reduce the instruction-level parallelism (ILP). We present a new theoretical framework for controlling register pressure before software pipelining. It is based on inserting anti-dependence edges (register reuse edges), labeled with reuse distances, directly into the data dependence graph. In this new graph we are able to fix the register pressure, measured as the number of simultaneously live variables in any schedule. The determination of register reuse and reuse distances is parameterized by the desired minimum initiation interval (MII) as well as by the register pressure constraints: either can be minimized while the other is fixed. After scheduling, register allocation is performed on conventional register sets or on rotating register files. We give an exact optimal model, and an approximation that generalizes the Ning-Gao [22] buffer optimization method. We provide experimental results showing good improvement compared to [22]. Our theoretical model covers superscalar, VLIW, and EPIC/IA64 processors.
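
A minimal sketch of the reuse-edge idea (an illustration under simplified assumptions, not the paper's exact model): a reuse edge is an anti-dependence from a value's consumer back to its producer, labeled with a reuse distance mu; every cycle it creates in the graph bounds the initiation interval from below, and in this simplified setting the register requirement of a reuse scheme can be read off as the sum of its mu values.

```python
# Sketch of reuse edges on a cyclic data dependence graph (illustrative only).
# An edge is (src, dst, latency, distance); a reuse edge is an anti-dependence
# from a value's consumer back to its producer with reuse distance mu > 0.

def cycles(edges, nodes):
    """Collect (total latency, total distance) of simple cycles by DFS.
    Brute force, but fine for tiny example graphs."""
    adj = {}
    for s, d, lat, dist in edges:
        adj.setdefault(s, []).append((d, lat, dist))
    found = []
    def walk(start, node, lat, dist, seen):
        for d, l, k in adj.get(node, []):
            if d == start:
                found.append((lat + l, dist + k))
            elif d not in seen:
                walk(start, d, lat + l, dist + k, seen | {d})
    for n in nodes:
        walk(n, n, 0, 0, {n})
    return found

def min_ii(edges, nodes):
    """MII = max over cycles of ceil(total latency / total distance)."""
    return max((-(-lat // dist) for lat, dist in cycles(edges, nodes)), default=1)

ddg   = [("a", "b", 2, 0), ("b", "c", 3, 0)]  # flow dependences within one iteration
reuse = [("c", "a", 0, 2)]                    # "a" reuses "c"'s register 2 iterations later

print(min_ii(ddg + reuse, {"a", "b", "c"}))   # ceil((2+3+0)/2) = 3: the reuse edge bounds the II
print(sum(mu for *_, mu in reuse))            # registers for this reuse scheme: sum of mu = 2
```

Minimizing the sum of the mu values under a fixed MII, or the MII under a fixed register budget, is exactly the trade-off the framework parameterizes.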

Author(s):  
Tatiana Nikolaevna Romanova
Dmitry Igorevich Gorin

A method is proposed for optimizing the filling of a machine word with independent instructions, which increases program performance by packing the maximum number of independent instructions into each bundle. The paper also confirms the hypothesis that switching the compiler to random register allocation increases bundle density, which in turn reduces the program's running time.
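
A sketch of the packing step (illustrative; the instruction representation, dependence test, and greedy strategy are assumptions, not the authors' algorithm): a greedy pass fills each fixed-width bundle with mutually independent instructions while never hoisting anything past an instruction it conflicts with.

```python
# Greedy packing of independent instructions into fixed-width bundles
# (illustrative sketch). An instruction is (name, reads, writes); two
# instructions are independent when neither touches a register the other writes.

def independent(a, b):
    _, ra, wa = a
    _, rb, wb = b
    return not (wa & (rb | wb)) and not (wb & (ra | wa))

def pack(instrs, width):
    """Fill each bundle with up to `width` mutually independent instructions;
    an instruction may not jump past anything it conflicts with."""
    bundles, pending = [], list(instrs)
    while pending:
        bundle, rest = [], []
        for ins in pending:
            ok = (len(bundle) < width
                  and all(independent(ins, x) for x in bundle)
                  and all(independent(ins, x) for x in rest))
            (bundle if ok else rest).append(ins)
        bundles.append([name for name, _, _ in bundle])
        pending = rest
    return bundles

prog = [("add", {"r1"}, {"r2"}), ("mul", {"r3"}, {"r4"}),
        ("sub", {"r2"}, {"r5"}), ("ld",  {"r6"}, {"r7"})]
print(pack(prog, 3))  # [['add', 'mul', 'ld'], ['sub']] -- sub waits on add's r2
```

The conservative check against already-skipped instructions is what keeps program order valid while still letting later independent instructions (here, `ld`) fill an earlier bundle.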


2011
Vol 2011
pp. 1-15
Author(s):
Jeffrey Kingyens
J. Gregory Steffan

We propose a soft processor programming model and architecture inspired by graphics processing units (GPUs) that are well-matched to the strengths of FPGAs, namely, highly parallel and pipelinable computation. In particular, our soft processor architecture exploits multithreading, vector operations, and predication to supply a floating-point pipeline of 64 stages via hardware support for up to 256 concurrent thread contexts. The key new contributions of our architecture are mechanisms for managing threads and register files that maximize data-level and instruction-level parallelism while overcoming the challenges of port limitations of FPGA block memories as well as memory and pipeline latency. Through simulation of a system that (i) is programmable via NVIDIA's high-level Cg language, (ii) supports AMD's CTM r5xx GPU ISA, and (iii) is realizable on an XtremeData XD1000 FPGA-based accelerator system, we demonstrate the potential for such a system to achieve 100% utilization of a deeply pipelined floating-point datapath.
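
The latency-hiding arithmetic behind these figures can be checked with a toy model (an illustration, not the authors' simulator): as long as at least as many thread contexts as pipeline stages are available, a round-robin issuer can insert one instruction into the pipeline every cycle.

```python
# Toy model of latency hiding through fine-grained multithreading (illustrative).
# A thread may issue again only after its previous instruction has left the
# pipeline, so `threads >= depth` is what keeps the datapath fully busy.

def utilization(threads, depth, cycles=10_000):
    ready_at = [0] * threads      # cycle at which each thread context may issue again
    issued, nxt = 0, 0            # nxt: round-robin scan pointer
    for cycle in range(cycles):
        for i in range(threads):
            t = (nxt + i) % threads
            if ready_at[t] <= cycle:
                ready_at[t] = cycle + depth   # occupied until the pipeline drains
                nxt = (t + 1) % threads
                issued += 1
                break             # one issue slot per cycle
    return issued / cycles

print(utilization(256, 64))  # ~1.00: 256 contexts easily cover a 64-stage pipeline
print(utilization(32, 64))   # ~0.50: too few contexts, the pipeline sits half idle
```

Having 256 contexts against a 64-stage datapath leaves slack to also cover memory latency, which is how 100% utilization remains reachable.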


2021
Vol 11 (3)
pp. 1225
Author(s):
Woohyong Lee
Jiyoung Lee
Bo Kyung Park
R. Young Chul Kim

Geekbench is one of the most referenced cross-platform benchmarks in the mobile world. Most of its workloads are synthetic, but some aim to simulate real-world behavior. Its microarchitectural behavior in the mobile world has rarely been reported, since hardware profiling features are largely unavailable to the public; despite Geekbench's popularity as a mobile performance workload, its microarchitectural characteristics on mobile devices are therefore hard to find. In this paper, a thorough experimental study of Geekbench performance characterization is reported with detailed performance metrics. The study also identifies the impact of the mobile system-on-chip (SoC) microarchitecture, such as the cache subsystem, instruction-level parallelism, and branch performance. It reveals the bottleneck of the workloads, especially in the cache subsystem: a change in data set size directly and significantly impacts the performance score on some systems, which ruins the fairness of the CPU benchmark. In the experiment, Samsung's Exynos9820-based platform was used as the device under test, with binaries built using the Android Native Development Kit (NDK). The Exynos9820 is a superscalar processor capable of dual-issuing some instructions. To support the performance analysis, we enabled the collection of performance events through performance monitoring unit (PMU) registers; the PMU is a set of hardware performance counters built into microprocessors to count hardware-related activities. The ARM DS5 tool was used to collect runtime PMU profiles, including OS-level performance data, and both functional and microarchitectural performance profiles were studied in full. This comparative study helps users better understand mobile architecture behavior and evaluate which benchmark is preferable for a fair performance comparison.
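
The derived metrics such a study relies on are simple ratios of raw PMU counts. The sketch below uses hypothetical event names and counts purely to illustrate the arithmetic; a real run would read Arm PMU events on the Exynos9820, for example through the ARM DS5 tooling mentioned above.

```python
# Turning raw PMU event counts into microarchitectural metrics (illustrative:
# the event names and counts below are hypothetical, not measured data).

raw = {
    "cycles":         1_200_000_000,
    "instructions":   1_800_000_000,
    "l1d_access":       600_000_000,
    "l1d_refill":        45_000_000,
    "branches":         250_000_000,
    "branch_mispred":     7_500_000,
}

ipc           = raw["instructions"] / raw["cycles"]
l1d_miss_rate = raw["l1d_refill"] / raw["l1d_access"]
mispredict    = raw["branch_mispred"] / raw["branches"]
l1d_mpki      = raw["l1d_refill"] / (raw["instructions"] / 1_000)

print(f"IPC:                {ipc:.2f}")         # > 1 shows superscalar dual issue paying off
print(f"L1D miss rate:      {l1d_miss_rate:.1%}")  # cache-subsystem pressure
print(f"Branch mispredicts: {mispredict:.1%}")
print(f"L1D MPKI:           {l1d_mpki:.1f}")    # misses per kilo-instruction
```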


Author(s):  
Dennis Wolf
Andreas Engel
Tajas Ruschke
Andreas Koch
Christian Hochberger

Abstract Coarse Grained Reconfigurable Arrays (CGRAs), or Architectures, are a concept for hardware accelerators based on the idea of distributing a workload over processing elements. These processors exploit instruction-level parallelism while remaining energy efficient thanks to their simple internal structure. However, their incorporation into a complete computing system raises severe challenges at both the hardware and software levels. This article evaluates in detail a CGRA integrated into a control engineering environment targeting a Xilinx Zynq System on Chip (SoC). Besides the actual application execution performance, the practicability of the configuration toolchain is validated. Challenges of the real-world integration are discussed and practical insights are highlighted.
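
To give a flavor of what such a toolchain has to solve, here is a toy greedy placer for a dataflow graph on a CGRA-like grid of processing elements (a sketch under an assumed grid topology and cost model, not the authors' scheduler):

```python
# Toy greedy placement of a dataflow graph onto a CGRA-like grid of processing
# elements (illustrative; real mappers also schedule in time and route data).

GRID = 4  # 4x4 array of processing elements

def neighbors(pe):
    x, y = pe
    for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        if 0 <= x + dx < GRID and 0 <= y + dy < GRID:
            yield (x + dx, y + dy)

def place(dfg):
    """dfg: list of (op, [operand ops]) in topological order. Each op is put
    next to one of its operands when possible, so results use local links."""
    loc = {}
    free = {(x, y) for x in range(GRID) for y in range(GRID)}
    for op, args in dfg:
        cands = sorted(p for a in args for p in neighbors(loc[a]) if p in free)
        pe = cands[0] if cands else sorted(free)[0]  # fall back to any free PE
        free.remove(pe)
        loc[op] = pe
    return loc

dfg = [("a", []), ("b", []), ("add", ["a", "b"]), ("mul", ["add", "a"])]
print(place(dfg))  # {'a': (0, 0), 'b': (0, 1), 'add': (0, 2), 'mul': (0, 3)}
```

A real mapper must additionally schedule operations over time and route data through the interconnect, which is where much of the integration effort discussed in the article lies.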


Author(s):  
Y. F. Zhang
A. Y. C. Nee
J. Y. H. Fuh

Abstract One of the most difficult tasks in automated process planning is determining the operation sequence. This paper describes a hybrid approach for identifying the optimal operation sequence for machining prismatic parts on a three-axis milling machining centre. In the proposed methodology, operation sequencing is carried out at two levels of planning: set-up planning and operation planning. Various constraints on the precedence relationships between features are identified, and rules and heuristics are created. Based on these precedence relationships, an optimization method is developed to find the optimal plan(s) with the minimum number of set-ups, in which conflicts between the feature precedence relationships and the set-up sequence are avoided. For each set-up, an optimal feature machining sequence with the minimum number of tool changes is determined using a purpose-built algorithm. The proposed system is still under development and the hybrid approach is partially implemented. An example is provided to demonstrate the approach.
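
The two-level structure lends itself to a compact sketch (the feature data, grouping rule, and greedy tie-break below are illustrative assumptions, not the paper's rules and heuristics): features are first grouped into set-ups by approach direction, then each set-up is ordered by a precedence-respecting greedy pass that prefers to keep the current tool.

```python
# Two-level sketch: group features into set-ups by approach direction, then
# order each set-up with a precedence-respecting greedy pass that prefers to
# keep the current tool (feature data and heuristics are illustrative).

from collections import defaultdict

features = {                      # name: (approach direction, tool, prerequisites)
    "face":  ("+z", "T1", []),
    "slot":  ("+z", "T2", ["face"]),
    "hole1": ("+z", "T1", ["face"]),
    "hole2": ("+z", "T2", []),
    "step":  ("+x", "T1", []),
}

def setups(fs):
    """One set-up per approach direction; each fixture change is a new set-up."""
    by_dir = defaultdict(list)
    for name, (d, _, _) in fs.items():
        by_dir[d].append(name)
    return sorted(by_dir.items())

def sequence(fs, group):
    """Greedy topological order over one set-up, minimizing tool changes.
    Prerequisites outside the group are assumed machined in an earlier set-up."""
    done, seq, tool = set(), [], None
    pending = set(group)
    while pending:
        ready = sorted(f for f in pending
                       if all(p in done or p not in pending for p in fs[f][2]))
        f = next((x for x in ready if fs[x][1] == tool), ready[0])
        tool = fs[f][1]
        seq.append(f)
        done.add(f)
        pending.remove(f)
    return seq

for d, group in setups(features):
    print(d, sequence(features, group))
# +x ['step']
# +z ['face', 'hole1', 'hole2', 'slot']  -- only one tool change (T1 -> T2)
```

In the example, "hole1" is pulled forward to share tool T1 with "face", so the +z set-up needs only one tool change despite the interleaved tool requirements.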

