A Novel Compiler Assisted Approach for Issue Queue Power Reduction

Superscalar processors contain complex control logic in order to extract sufficient instruction-level parallelism (ILP). The issue logic is one of the main sources of power dissipation in current superscalar processors: it has been estimated that up to 30% of the energy consumed by a processor is spent in the issue logic. This paper presents a novel compiler-assisted approach to power reduction in which compiler analysis passes information to the processor about the number of issue queue entries needed, allowing the processor to resize the issue queue dynamically. Limiting the number of instructions dispatched and resident in the queue reduces energy consumption without adversely affecting performance. Compared with hardware schemes, our approach is simpler, faster, and saves more energy. Using this approach we achieve 43.3% dynamic and 28.5% static power savings.
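
The resizing idea in this abstract can be illustrated with a small sketch. It is not the paper's implementation: the `Block` record, the per-block `queue_hint` field, and the assumption that powered entries scale linearly with the hint are all hypothetical, chosen only to show how a compiler-supplied entry count could bound the active part of an issue queue.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A code region annotated by a (hypothetical) compiler pass."""
    name: str
    instructions: int  # dynamic instructions executed in this region
    queue_hint: int    # compiler estimate of issue-queue entries needed


def active_entries(hint: int, physical_size: int) -> int:
    """Clamp the compiler hint to the physical queue size, keeping at least one entry."""
    return max(1, min(hint, physical_size))


def average_active_fraction(blocks, physical_size: int) -> float:
    """Instruction-weighted fraction of the issue queue that stays enabled."""
    total = sum(b.instructions for b in blocks)
    weighted = sum(
        b.instructions * active_entries(b.queue_hint, physical_size)
        for b in blocks
    )
    return weighted / (total * physical_size)
```

For example, a region of 80 instructions hinted at 8 entries and a region of 20 instructions hinted at 32 entries, on a 32-entry queue, keep 40% of the queue enabled on average under this simplified model.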

Optimizing Static Power Dissipation by Functional Units in Superscalar Processors

Lecture Notes in Computer Science - Compiler Construction ◽

10.1007/3-540-45937-5_19 ◽

2002 ◽

pp. 261-275 ◽

Cited By ~ 34

Author(s):

Siddharth Rele ◽

Santosh Pande ◽

Soner Onder ◽

Rajiv Gupta

Keyword(s):

Power Dissipation ◽

Superscalar Processors ◽

Static Power ◽

Functional Units

Efficient exploitation of instruction-level parallelism for superscalar processors by the conjugate register file scheme

IEEE Transactions on Computers ◽

10.1109/12.485567 ◽

1996 ◽

Vol 45 (3) ◽

pp. 278-293 ◽

Cited By ~ 4

Author(s):

Meng-Chou Chang ◽

Feipei Lai

Keyword(s):

Instruction Level Parallelism ◽

Superscalar Processors ◽

Level Parallelism

Improving instruction level parallelism through reconfigurable units in superscalar processors

ACM SIGARCH Computer Architecture News ◽

10.1145/1294313.1294320 ◽

2007 ◽

Vol 35 (3) ◽

pp. 20-27

Author(s):

Tameesh Suri

Keyword(s):

Instruction Level Parallelism ◽

Superscalar Processors ◽

Improving Instruction ◽

Level Parallelism

Microarchitectural Characterization on a Mobile Workload

Applied Sciences ◽

10.3390/app11031225 ◽

2021 ◽

Vol 11 (3) ◽

pp. 1225

Author(s):

Woohyong Lee ◽

Jiyoung Lee ◽

Bo Kyung Park ◽

R. Young Chul Kim

Keyword(s):

Performance Monitoring ◽

Performance Metrics ◽

Performance Comparison ◽

Instruction Level Parallelism ◽

Data Set ◽

Performance Events ◽

Hardware Performance Counters ◽

On Chip ◽

The Comparative Study ◽

Level Parallelism

Geekbench is one of the most referenced cross-platform benchmarks in the mobile world. Most of its workloads are synthetic, but some aim to simulate real-world behavior. Its microarchitectural behavior on mobile devices has rarely been reported, because the hardware profiling features available to the public are limited; as a result, the microarchitectural characteristics of this popular mobile performance workload are hard to find. In this paper, a thorough experimental study of Geekbench performance characterization is reported with detailed performance metrics. The study also identifies the impact of mobile system on chip (SoC) microarchitecture, such as the cache subsystem, instruction-level parallelism, and branch performance. From the study, we could understand the bottlenecks of the workloads, especially in the cache subsystem: in some systems, a change in data set size directly and significantly affects the performance score, which undermines the fairness of the CPU benchmark. In the experiment, Samsung’s Exynos9820-based platform was used as the device under test, with binaries built with the Android Native Development Kit (NDK). The Exynos9820 is a superscalar processor capable of dual-issuing some instructions. To help performance analysis, we enabled the collection of performance events through performance monitoring unit (PMU) registers. The PMU is a set of hardware performance counters built into microprocessors to store the counts of hardware-related activities. The ARM DS5 tool was used to collect runtime PMU profiles, including OS-level performance data. Throughout the experiment, functional and microarchitectural performance profiles were fully studied, and this paper describes those studies in detail. With the comparative study, users will understand more about mobile architecture behavior, which will help evaluate which benchmark is preferable for fair performance comparison.
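
As a rough illustration of how raw PMU event counts are usually turned into the kind of metrics this abstract mentions (instruction-level parallelism, cache behavior, branch performance), the sketch below computes a few standard ratios. The dictionary keys are hypothetical names, not actual Exynos9820 or ARM DS5 event identifiers, and the abstract does not say which derived metrics the authors used.

```python
def ratio(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def derived_metrics(counts: dict) -> dict:
    """Convert raw event counts (hypothetical key names) into common ratios."""
    return {
        # Instructions retired per cycle, a coarse proxy for exploited ILP.
        "ipc": ratio(counts["instructions"], counts["cycles"]),
        # Fraction of L1 data cache accesses that miss.
        "l1d_miss_rate": ratio(counts["l1d_misses"], counts["l1d_accesses"]),
        # L1 data cache misses per thousand instructions.
        "l1d_mpki": ratio(1000 * counts["l1d_misses"], counts["instructions"]),
        # Fraction of branches that were mispredicted.
        "branch_mispredict_rate": ratio(
            counts["branch_mispredicts"], counts["branches"]
        ),
    }
```

Normalizing by instructions or cycles, rather than comparing raw counts, is what makes runs with different data set sizes comparable.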

UltraSynth: Insights of a CGRA Integration into a Control Engineering Environment

Journal of Signal Processing Systems ◽

10.1007/s11265-021-01641-7 ◽

2021 ◽

Author(s):

Dennis Wolf ◽

Andreas Engel ◽

Tajas Ruschke ◽

Andreas Koch ◽

Christian Hochberger

Keyword(s):

Computing System ◽

Coarse Grained ◽

Instruction Level Parallelism ◽

Control Engineering ◽

Processing Elements ◽

Actual Application ◽

Reconfigurable Arrays ◽

Engineering Environment ◽

On Chip ◽

Level Parallelism

Coarse Grained Reconfigurable Arrays (CGRAs) or Architectures are a concept for hardware accelerators based on the idea of distributing workload over Processing Elements. These processors exploit instruction level parallelism, while being energy efficient due to their simplistic internal structure. However, the incorporation into a complete computing system raises severe challenges at the hardware and software level. This article evaluates a CGRA integrated into a control engineering environment targeting a Xilinx Zynq System on Chip (SoC) in detail. Besides the actual application execution performance, the practicability of the configuration toolchain is validated. Challenges of the real-world integration are discussed and practical insights are highlighted.

Available instruction-level parallelism for superscalar and superpipelined machines

Proceedings of the third international conference on Architectural support for programming languages and operating systems - ASPLOS-III ◽

10.1145/70082.68207 ◽

1989 ◽

Cited By ~ 165

Author(s):

N. P. Jouppi ◽

D. W. Wall

Keyword(s):

Instruction Level Parallelism ◽

Level Parallelism

Topic 8 Parallel Computer Architecture and Instruction-Level Parallelism

Euro-Par 2003 Parallel Processing - Lecture Notes in Computer Science ◽

10.1007/978-3-540-45209-6_78 ◽

2003 ◽

pp. 541-542

Author(s):

Stamatis Vassiliadis ◽

Nikitas Dimopoulos ◽

Jean-Francois Collard ◽

Arndt Bode

Keyword(s):

Computer Architecture ◽

Parallel Computer ◽

Instruction Level Parallelism ◽

Level Parallelism

ZTC bias point of advanced fin based device: The importance and exploration

Facta universitatis - series Electronics and Energetics ◽

10.2298/fuee1503393m ◽

2015 ◽

Vol 28 (3) ◽

pp. 393-405 ◽

Cited By ~ 3

Author(s):

Sushanta Mohapatra ◽

Kumar Pradhan ◽

Prasanna Sahu

Keyword(s):

Power Dissipation ◽

Temperature Compensation ◽

Performance Metrics ◽

Zero Temperature ◽

Sweet Spot ◽

Static Power ◽

20 nm ◽

Bias Point ◽

The Impact ◽

Present Understanding

This work evaluates and resolves the temperature compensation point (TCP), or zero temperature coefficient (ZTC) point, for a sub-20 nm FinFET. The sensitivity of various performance measures of the fin-based device to its geometry parameters, and its reliability over a wide range of temperatures (25°C to 225°C), are reviewed to extend the benchmark of device scalability. The impact of fin height (HFin), fin width (WFin), and temperature (T) on a broad set of performance metrics, including on-off ratio (Ion/Ioff), transconductance (gm), gain (AV), cut-off frequency (fT), static power dissipation (PD), energy (E), energy delay product (EDP), and sweet spot (gmfT/ID) of the FinFET, is studied using the commercially available TCAD simulator Sentaurus™ from Synopsys Inc.

A Flux Controlled Memristor using 90nm Technology

Indian Journal of Signal Processing ◽

10.54105/ijsp.b1004.051221 ◽

2021 ◽

pp. 1-6

Author(s):

B.T. Krishna ◽

Shaik. Mohaseena Salma

Keyword(s):

Power Supply ◽

Power Dissipation ◽

Power Efficiency ◽

Current Mode ◽

CMOS Technology ◽

Theoretical Simulation ◽

Transconductance Amplifier ◽

Higher Power ◽

Static Power ◽

A Current

A flux-controlled memristor using a complementary metal-oxide-semiconductor (CMOS) structure is presented in this study. The proposed circuit provides higher power efficiency, lower static power dissipation, and smaller area, and can also operate from a reduced power supply by using 90 nm CMOS technology. The circuit is based on a second-generation current conveyor (CCII) and an operational transconductance amplifier (OTA) with a few passive elements. It uses a current-mode approach, which improves high-frequency performance. Reducing the power supply is a crucial way to decrease power consumption in VLSI. The emulator in the proposed circuit operates well in both incremental and decremental configurations up to 26.3 MHz on the Cadence Virtuoso platform with the gpdk 90 nm CMOS technology. The proposed memristor circuit has very little static power dissipation when operating with a ±1 V supply. Transient analysis, memductance analysis, and DC analysis simulations are verified practically by an experimental demonstration using an ideal memristor made up of the ICs AD844AN and CA3080 in Multisim, and the agreement with the theoretical simulation is discussed.

An Efficient Power Optimized 32 bit BCD Adder Using Multi-Channel Technique

International Journal of New Practices in Management and Engineering ◽

10.17762/ijnpme.v6i02.57 ◽

2017 ◽

Vol 6 (02) ◽

pp. 07-12

Author(s):

Diksha Siddhamshittiwar

Keyword(s):

Average Power ◽

Full Adder ◽

Deep Submicron ◽

Power Reduction ◽

VLSI Circuits ◽

Power Gating ◽

Static Power ◽

Simulation Results ◽

Efficient Power ◽

Standby Mode

Static power reduction is a challenge in deep submicron VLSI circuits. In this paper, a 28T full adder circuit, a 14T full adder circuit, and a 32-bit power-gated BCD adder built from each of these full adders were designed, and their average power was compared. In existing work, a conventional full adder is designed using 28 transistors and is used to build a 32-bit BCD adder. In the proposed architecture, a 14T transmission-gate-based power-gated full adder is used for the design of the 32-bit BCD adder. The leakage power dissipated during standby mode in deep submicron CMOS devices is reduced using efficient power gating and a multi-channel technique. Simulation results were obtained using Tanner EDA with the TSMC_180nm library file for the design of the 28T full adder, the 14T full adder, and the power-gated BCD adder, and a significant power reduction is achieved in the proposed architecture.
