CODE OPTIMIZATION METHOD FOR QUALCOMM HEXAGON PROCESSOR, SUPPORTING INSTRUCTION LEVEL PARALLELISM AND BUILT WITH VLIW (Very Long Instruction Word) ARCHITECTURE

ITNOU: Information technologies in education, science and management ◽

10.47501/itnou.2021.1.105-115 ◽

2021 ◽

Vol 115 ◽

pp. 105-115

Author(s):

Tatiana Nikolaevna Romanova ◽

◽

Dmitry Igorevich Gorin ◽

Keyword(s):

Optimization Method ◽

Instruction Level Parallelism ◽

Code Optimization ◽

Very Long Instruction Word ◽

Running Time ◽

Level Parallelism ◽

Packet Density

A method for optimizing the filling of a machine word with independent instructions is proposed, which allows to increase the performance of programs by stacking the maximum number of independent commands in a package. The paper also confirms the hypothesis that with the transition to random register allocation by the compiler, the packet density will increase, which will result in a decrease in the program's running time.

Download Full-text

Early Periodic Register Allocation on ILP Processors

Parallel Processing Letters ◽

10.1142/s012962640400188x ◽

2004 ◽

Vol 14 (02) ◽

pp. 287-313 ◽

Cited By ~ 14

Author(s):

Sid-Ahmed-Ali TOUATI ◽

Christine EISENBEIS

Keyword(s):

Optimization Method ◽

Software Pipelining ◽

Instruction Level Parallelism ◽

Good Improvement ◽

Level Parallelism ◽

Register allocation in loops is generally performed after or during the software pipelining process. This is because doing a conventional register allocation as a first step without assuming a schedule lacks the information of interferences between values live ranges. Thus, the register allocator may introduce an excessive amount of false dependences that dramatically reduce the ILP (Instruction Level Parallelism). We present a new theoretical framework for controlling the register pressure before software pipelining. Thus is based on inserting some anti-dependence edges (register reuse edges) labeled with reuse distances, directly on the data dependence graph. In this new graph, we are able to fix the register pressure, measured as the number of simultaneously alive variables in any schedule. The determination of register and distance reuse is parameterized by the desired minimum initiation interval (MII) as well as by the register pressure constraints - either can be minimized while the other one is fixed. After scheduling, register allocation is done on conventional register sets or on rotating register files. We give an optimal exact model, and an approximation that generalizes the Ning-Gao [22] buffer optimization method. We provide experimental results which show good improvement compared to [22]. Our theoretical model considers superscalar, VLIW and EPIC/IA64 processors.

Download Full-text

Research on Cipher Coprocessor Instruction Level Parallelism Compiler

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.130-134.2907 ◽

2011 ◽

Vol 130-134 ◽

pp. 2907-2910

Author(s):

Hong Yan Li

Keyword(s):

System Architecture ◽

Instruction Level Parallelism ◽

Design Technique ◽

Improve Performance ◽

Specific Instruction ◽

Very Long Instruction Word ◽

Important Method ◽

Reconfigurable Design ◽

Level Parallelism

The important method of studying cipher coprocessor is focus on system architecture of processor in combination with reconfigurable design technique. How to improve performance of cipher coprocessor is important. Based on very long instruction word (VLIW) structure and reconfigurable design technique, specific instruction cipher coprocessor is designed. In this paper, the cipher coprocessor instruction level parallelism compilation technique is studied to enhance the cipher coprocessor performance by increasing the instruction level parallelism.

Download Full-text

Code Optimization of Polynomial Approximation Functions on Clustered Instruction-Level Parallelism Processors

International Journal of Computers and Applications ◽

10.1080/1206212x.2006.11441823 ◽

2006 ◽

Vol 28 (4) ◽

pp. 367-378

Author(s):

M. Yang ◽

J. Wang ◽

S.Q. Zheng ◽

Y. Jiang

Keyword(s):

Polynomial Approximation ◽

Instruction Level Parallelism ◽

Code Optimization ◽

Level Parallelism

Download Full-text

CODE OPTIMIZATION OF POLYNOMIAL APPROXIMATION FUNCTIONS ON CLUSTERED INSTRUCTION-LEVEL PARALLELISM PROCESSORS

International Journal of Computers and Applications ◽

10.2316/journal.202.2006.4.202-1835 ◽

2006 ◽

Vol 28 (4) ◽

Author(s):

M. Yang ◽

J. Wang ◽

S.Q. Zheng ◽

Y. Jiang

Keyword(s):

Polynomial Approximation ◽

Instruction Level Parallelism ◽

Code Optimization ◽

Level Parallelism

Download Full-text

COMPARISON OF INSTRUCTION SCHEDULING AND REGISTER ALLOCATION FOR MIPS AND HPL-PD ARCHITECTURE FOR EXPLOITATION OF INSTRUCTION LEVEL PARALLELISM

Engineering Heritage Journal ◽

10.26480/gwk.02.2018.04.08 ◽

2018 ◽

Vol 2 (2) ◽

pp. 04-08

Author(s):

Rajendra Kumar

Keyword(s):

Instruction Scheduling ◽

Instruction Level Parallelism ◽

Level Parallelism

Download Full-text

A NOVEL IMPLEMENTATION OF 32-BIT VLIW-MISC PROCESSOR ON FPGA

International Journal of Computer and Communication Technology ◽

10.47893/ijcct.2016.1339 ◽

2016 ◽

pp. 60-63

Author(s):

M. KAMARAJU ◽

M. ALEKHYA ◽

K.LAL KISHORE

Keyword(s):

Embedded Systems ◽

High Performance ◽

Optimal Choice ◽

Performance Level ◽

Instruction Level Parallelism ◽

Very Long Instruction Word ◽

Risc Processor ◽

Level Parallelism ◽

High Performance Level

The main objective of this work is to implement a 32-bit pipelined RISC processor without interlocking stages. It is developed by S.I.M.E (Single Instruction Multiple Execution) that is with single instruction scheme more executions can be done and is based on VLIW(Very Long Instruction Word) architecture processing is an optimal choice in the attempt to obtain high performance level in Embedded Systems. In VLIW based architecture, the effectiveness of the processor depends on the ability of compilers to provide sufficient instruction level parallelism (ILP). The processor has been designed with VHDL, synthesized using Xilinx tool.

Download Full-text

COMPARISON OF INSTRUCTION SCHEDULING AND REGISTER ALLOCATION FOR MIPS AND HPL-PD ARCHITECTURE FOR EXPLOITATION OF INSTRUCTION LEVEL PARALLELISM

Engineering Heritage Journal ◽

10.26480/gwk.01.2018.04.08 ◽

2018 ◽

Vol 2 (2) ◽

pp. 04-08 ◽

Cited By ~ 1

Author(s):

Rajendra Kumar

Keyword(s):

Instruction Scheduling ◽

Instruction Level Parallelism ◽

Level Parallelism

Download Full-text

Microarchitectural Characterization on a Mobile Workload

Applied Sciences ◽

10.3390/app11031225 ◽

2021 ◽

Vol 11 (3) ◽

pp. 1225

Author(s):

Woohyong Lee ◽

Jiyoung Lee ◽

Bo Kyung Park ◽

R. Young Chul Kim

Keyword(s):

Performance Monitoring ◽

Performance Metrics ◽

Performance Comparison ◽

Instruction Level Parallelism ◽

Data Set ◽

Performance Events ◽

Hardware Performance Counters ◽

On Chip ◽

The Comparative Study ◽

Level Parallelism

Geekbench is one of the most referenced cross-platform benchmarks in the mobile world. Most of its workloads are synthetic but some of them aim to simulate real-world behavior. In the mobile world, its microarchitectural behavior has been reported rarely since the hardware profiling features are limited to the public. As a popular mobile performance workload, it is hard to find Geekbench’s microarchitecture characteristics in mobile devices. In this paper, a thorough experimental study of Geekbench performance characterization is reported with detailed performance metrics. This study also identifies mobile system on chip (SoC) microarchitecture impacts, such as the cache subsystem, instruction-level parallelism, and branch performance. After the study, we could understand the bottleneck of workloads, especially in the cache sub-system. This means that the change of data set size directly impacts performance score significantly in some systems and will ruin the fairness of the CPU benchmark. In the experiment, Samsung’s Exynos9820-based platform was used as the tested device with Android Native Development Kit (NDK) built binaries. The Exynos9820 is a superscalar processor capable of dual issuing some instructions. To help performance analysis, we enable the capability to collect performance events with performance monitoring unit (PMU) registers. The PMU is a set of hardware performance counters which are built into microprocessors to store the counts of hardware-related activities. Throughout the experiment, functional and microarchitectural performance profiles were fully studied. This paper describes the details of the mobile performance studies above. In our experiment, the ARM DS5 tool was used for collecting runtime PMU profiles including OS-level performance data. After the comparative study is completed, users will understand more about the mobile architecture behavior, and this will help to evaluate which benchmark is preferable for fair performance comparison.

Download Full-text

UltraSynth: Insights of a CGRA Integration into a Control Engineering Environment

Journal of Signal Processing Systems ◽

10.1007/s11265-021-01641-7 ◽

2021 ◽

Author(s):

Dennis Wolf ◽

Andreas Engel ◽

Tajas Ruschke ◽

Andreas Koch ◽

Christian Hochberger

Keyword(s):

Computing System ◽

Coarse Grained ◽

Instruction Level Parallelism ◽

Control Engineering ◽

Processing Elements ◽

Actual Application ◽

Reconfigurable Arrays ◽

Engineering Environment ◽

On Chip ◽

Level Parallelism

AbstractCoarse Grained Reconfigurable Arrays (CGRAs) or Architectures are a concept for hardware accelerators based on the idea of distributing workload over Processing Elements. These processors exploit instruction level parallelism, while being energy efficient due to their simplistic internal structure. However, the incorporation into a complete computing system raises severe challenges at the hardware and software level. This article evaluates a CGRA integrated into a control engineering environment targeting a Xilinx Zynq System on Chip (SoC) in detail. Besides the actual application execution performance, the practicability of the configuration toolchain is validated. Challenges of the real-world integration are discussed and practical insights are highlighted.

Download Full-text

Available instruction-level parallelism for superscalar and superpipelined machines

Proceedings of the third international conference on Architectural support for programming languages and operating systems - ASPLOS-III ◽

10.1145/70082.68207 ◽

1989 ◽

Cited By ~ 165

Author(s):

N. P. Jouppi ◽

D. W. Wall

Keyword(s):

Instruction Level Parallelism ◽

Level Parallelism

Download Full-text