LP-NUCA: Networks-in-Cache for High-Performance Low-Power Embedded Processors

Embedded processors are widely used in systems that run different tasks under different workloads. A more complex micro-architecture yields better peak performance but higher power consumption. Shutting down the units designed for performance enhancement could improve energy efficiency in low-workload scenarios. In this paper, we evaluated the energy distribution in various embedded processors. According to the analysis, pipeline registers and the dynamic branch predictor, which are employed for better peak performance, have a great impact on energy efficiency. Thus, we proposed an ultra-low-power processor with a variable micro-architecture. The processor is based on a 4-stage pipeline core with a Gshare branch predictor, and all units are active in high-performance mode. In normal mode, the Gshare predictor is shut down and Always-Not-Taken prediction is used. In low-power mode, some of the pipeline registers are bypassed to avoid unnecessary energy dissipation and improve execution efficiency. A mode register (MR) indicates the current working mode. Switching between modes is controlled by software. The proposed core is implemented in 40 nm technology and simulated with the traces of 17 benchmarks in Embench. The average power consumed in the respective modes is 41.7 μW, 59.7 μW and 71.1 μW. The results show that normal mode (N-mode) and low-power mode (L-mode) consume 16.08% and 41.37% less power than high-performance mode (H-mode) on average. In the best case, they consume 25.36% and 49.30% less power than H-mode. Considering the execution efficiency evaluated by instructions per cycle (IPC), the proposed processor consumes 7.78% or 51.57% less energy per instruction than the baseline core. The area of the proposed processor is only 7.19% larger than that of the baseline core, and only 3.08% more power is consumed in H-mode.
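
As a quick arithmetic check of the figures above, the relative savings can be recomputed from the three reported averages. The sketch below assumes, since the abstract does not say so explicitly, that 41.7 μW, 59.7 μW and 71.1 μW belong to L-mode, N-mode and H-mode respectively. The dictionary and function names are illustrative only.

```python
# Average power per mode in microwatts, as quoted in the abstract.
# ASSUMPTION: the values are listed in the order L-mode, N-mode, H-mode.
AVERAGE_POWER_UW = {"L": 41.7, "N": 59.7, "H": 71.1}


def saving_vs_h_mode(mode: str) -> float:
    """Return the percentage power saving of `mode` relative to H-mode."""
    return 100.0 * (1.0 - AVERAGE_POWER_UW[mode] / AVERAGE_POWER_UW["H"])


if __name__ == "__main__":
    for mode in ("N", "L"):
        print(f"{mode}-mode saves {saving_vs_h_mode(mode):.2f}% versus H-mode")
```

Under that assumption the ratios come out close to, but not identical with, the reported 16.08% and 41.37%. A small gap of this kind would be expected if the published figures average per-benchmark savings rather than dividing the averaged powers.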

CORDIC Hardware Acceleration Using DMA-Based ISA Extension

Journal of Low Power Electronics and Applications ◽

10.3390/jlpea12010004 ◽

2022 ◽

Vol 12 (1) ◽

pp. 4

Author(s):

Erez Manor ◽

Avrech Ben-David ◽

Shlomo Greenberg

Keyword(s):

Low Power ◽

High Performance ◽

Low Cost ◽

Hardware Acceleration ◽

Embedded Processors ◽

Data Path ◽

Instruction Set ◽

Mathematical Functions ◽

Coordinate Rotation ◽

Software Implementations

RISC-based embedded processors aimed at low cost and low power are becoming an increasingly popular ecosystem for both hardware and software development. High-performance yet low-power embedded processors may be attained through hardware acceleration and Instruction Set Architecture (ISA) extension. Recent AI publications have demonstrated the use of the Coordinate Rotation Digital Computer (CORDIC) as a dedicated low-power solution for solving nonlinear equations applied to Neural Networks (NN). This paper proposes an ISA extension to support floating-point CORDIC, providing efficient hardware acceleration for mathematical functions. A new DMA-based ISA extension approach integrated with a pipelined CORDIC accelerator is proposed. The CORDIC ISA extension is directly interfaced with a standard processor data path, allowing efficient implementation of new trigonometric ALU-based custom instructions. The proposed DMA-based CORDIC accelerator can also be used to perform repeated array calculations, offering a significant speedup over software implementations. The proposed accelerator is evaluated on an Intel Cyclone-IV FPGA as an extension to the Nios processor. Experimental results show a speedup of over three orders of magnitude compared with a software implementation when applied to trigonometric arrays, and the accelerator outperforms the existing commercial CORDIC hardware accelerator.
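
For readers unfamiliar with CORDIC, the following is a minimal software sketch of the generic rotation-mode algorithm for sine and cosine. It is not the authors' floating-point pipelined hardware or their ISA extension; the function name, the iteration count and the use of Python floats are assumptions made purely for illustration.

```python
import math


def cordic_sin_cos(theta: float, iterations: int = 32) -> tuple[float, float]:
    """Rotation-mode CORDIC for theta in radians, |theta| <= pi/2.

    Returns (sin(theta), cos(theta)) using only shift-like scalings
    by 2**-i, additions, and a small table of arctangents.
    """
    # Elementary rotation angles atan(2^-i); hardware would store these in a ROM.
    angles = [math.atan(2.0 ** -i) for i in range(iterations)]

    # Each micro-rotation stretches the vector by sqrt(1 + 2^-2i);
    # pre-scaling the start vector by the inverse gain compensates for that.
    inverse_gain = 1.0
    for i in range(iterations):
        inverse_gain /= math.sqrt(1.0 + 2.0 ** (-2 * i))

    x, y, z = inverse_gain, 0.0, theta
    for i in range(iterations):
        direction = 1.0 if z >= 0.0 else -1.0  # rotate toward z == 0
        scale = 2.0 ** -i                      # a right shift in fixed-point hardware
        x, y = x - direction * y * scale, y + direction * x * scale
        z -= direction * angles[i]
    return y, x


if __name__ == "__main__":
    s, c = cordic_sin_cos(0.5)
    print(s, math.sin(0.5))
    print(c, math.cos(0.5))
```

Each iteration uses only additions and scalings by powers of two, which is why the method maps well onto a pipelined hardware data path of the kind the abstract describes.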

Design of a Low Power, High Performance BICMOS Current-limiting Circuit for DC-DC Converter Application

PIERS Online ◽

10.2529/piers060817034009 ◽

2007 ◽

Vol 3 (4) ◽

pp. 368-373 ◽

Cited By ~ 5

Author(s):

Hongbo Ma ◽

Quanyuan Feng

Keyword(s):

Low Power ◽

High Performance ◽

Current Limiting

Efficient Instruction and Data Caching for High Performance Embedded Processors

Jornada de Jóvenes Investigadores del I3A ◽

10.26754/jji-i3a.201201788 ◽

1970 ◽

pp. 9

Author(s):

A. Ferrerón Labari ◽

D. Suárez Gracia ◽

V. Viñals Yúfera

Keyword(s):

Embedded Systems ◽

Power Consumption ◽

Low Power ◽

Interconnection Networks ◽

High Performance ◽

Critical Issue ◽

Content Management ◽

Structure Design ◽

Portable Devices ◽

On Chip

In recent years, embedded systems have evolved to offer capabilities previously found only in high-performance systems. Portable devices already have on-chip multiprocessors (such as the PowerPC 476FP or ARM Cortex A9 MP), usually multi-threaded, and a powerful on-chip multi-level cache memory hierarchy. As most of these systems are battery-powered, power consumption becomes a critical issue. Achieving high performance and low power consumption is a highly complex challenge for which some proposals have already been made. Suarez et al. proposed a new on-chip cache hierarchy, the LP-NUCA (Low Power NUCA), which reduces access latency by taking advantage of NUCA (Non-Uniform Cache Architecture) properties. The key points are decoupling the functionality and using three specialized on-chip networks. This structure has proven efficient for data hierarchies, achieving good performance and reducing energy consumption. On the other hand, instruction caches have different requirements and characteristics than data caches, which conflict with the requirements of low-power embedded systems, especially in SMT (simultaneous multi-threading) environments. We want to study the benefits of using small tiled caches for the instruction hierarchy, so we propose a new design, ID-LP-NUCAs. Thus, we need to completely re-evaluate our previous design in terms of structure design, interconnection networks (including topologies, flow control and routing), content management (with special interest in hardware/software content allocation policies), and structure sharing. In CMP (chip multiprocessor) environments with parallel workloads, coherence plays an important role and must be taken into consideration.

High-Performance and Low-Power Full Color Reflective LCD for New Applications

Proceedings of the International Display Workshops ◽

10.36463/idw.2019.1411 ◽

2019 ◽

pp. 1411

Author(s):

Hiroyuki Hakoi ◽

Ming Ni ◽

Junichi Hashimoto ◽

Takashi Sato ◽

Shinji Shimada ◽

...

Keyword(s):

Low Power ◽

High Performance ◽

Full Color ◽

New Applications

Performance Analysis of Various Multipliers Using 8T-full Adder with 180nm Technology

Recent Advances in Electrical & Electronic Engineering (Formerly Recent Patents on Electrical & Electronic Engineering) ◽

10.2174/2352096513666200107091932 ◽

2020 ◽

Vol 13 (6) ◽

pp. 864-870

Author(s):

Sai Venkatramana Prasada G.S ◽

G. Seshikala ◽

S. Niranjana

Keyword(s):

Low Power ◽

Power Dissipation ◽

High Speed ◽

High Performance ◽

Full Adder ◽

Fundamental Operation ◽

Wallace Tree ◽

Power Delay Product ◽

The Comparative Study ◽

Wallace Tree Multiplier

Background: This paper presents a comparative study of the power dissipation, delay and power delay product (PDP) of different full adder and multiplier designs. Methods: The full adder performs the fundamental operation in processors, DSP architectures and VLSI systems. Here, ten different full adder structures were analyzed for their best performance using a Mentor Graphics tool with 180nm technology. Results: From the analysis results, a high-performance full adder is selected for further higher-level designs. The 8T full adder exhibits high speed, low power delay and low power delay product, and hence it is used to construct four different multiplier designs: the Array multiplier, Baugh Wooley multiplier, Braun multiplier and Wallace Tree multiplier. These multiplier structures were designed using the 8T full adder and simulated using the Mentor Graphics tool at a constant W/L aspect ratio. Conclusion: From the analysis, it is concluded that the Wallace Tree multiplier is the high-speed multiplier but dissipates comparatively high power. The Baugh Wooley multiplier dissipates less power but exhibits more time delay and low PDP.

A reconfigurable low-power high-performance matrix multiplier architecture with borrow parallel counters

Proceedings International Parallel and Distributed Processing Symposium ◽

10.1109/ipdps.2003.1213336 ◽

2004 ◽

Author(s):

Rong Lin

Keyword(s):

Low Power ◽

High Performance ◽

Performance Matrix

Optimization of a Low-Power Chemoresistive Gas Sensor: Predictive Thermal Modelling and Mechanical Failure Analysis

Sensors ◽

10.3390/s21030783 ◽

2021 ◽

Vol 21 (3) ◽

pp. 783 ◽

Cited By ~ 1

Author(s):

Andrea Gaiardo ◽

David Novel ◽

Elia Scattolo ◽

Michele Crivellari ◽

Antonino Picciotto ◽

...

Keyword(s):

Low Power ◽

Failure Analysis ◽

Gas Sensors ◽

High Performance ◽

Gas Sensing ◽

Mechanical Failure ◽

Sensing Applications ◽

Theoretical Approaches ◽

Sensing Material ◽

Silicon Bulk

The substrate plays a key role in chemoresistive gas sensors. It acts as mechanical support for the sensing material, hosts the heating element and, also, aids the sensing material in signal transduction. In recent years, a significant improvement in the substrate production process has been achieved, thanks to the advances in micro- and nanofabrication for micro-electro-mechanical system (MEMS) technologies. In addition, the use of innovative materials and smaller low-power consumption silicon microheaters led to the development of high-performance gas sensors. Various heater layouts were investigated to optimize the temperature distribution on the membrane, and a suspended membrane configuration was exploited to avoid heat loss by conduction through the silicon bulk. However, there is a lack of comprehensive studies focused on predictive models for the optimization of the thermal and mechanical properties of a microheater. In this work, three microheater layouts in three membrane sizes were developed using the microfabrication process. The performance of these devices was evaluated to predict their thermal and mechanical behaviors by using both experimental and theoretical approaches. Finally, a statistical method was employed to cross-correlate the thermal predictive model and the mechanical failure analysis, aiming at microheater design optimization for gas-sensing applications.

Ultracompact and low-power-consumption silicon thermo-optic switch for high-speed data

Nanophotonics ◽

10.1515/nanoph-2020-0496 ◽

2020 ◽

Vol 10 (2) ◽

pp. 937-945

Author(s):

Ruihuan Zhang ◽

Yu He ◽

Yong Zhang ◽

Shaohua An ◽

Qingming Zhu ◽

...

Keyword(s):

Power Consumption ◽

Low Power ◽

High Speed ◽

High Performance ◽

Pulse Amplitude ◽

Telecommunication Networks ◽

Low Power Consumption ◽

Power Efficient ◽

High Speed Data ◽

On Chip

Ultracompact and low-power-consumption optical switches are desired for high-performance telecommunication networks and data centers. Here, we demonstrate an on-chip power-efficient 2 × 2 thermo-optic switch unit by using a suspended photonic crystal nanobeam structure. A submilliwatt switching power of 0.15 mW is obtained with a tuning efficiency of 7.71 nm/mW in a compact footprint of 60 μm × 16 μm. The bandwidth of the switch is properly designed for a four-level pulse amplitude modulation signal with a 124 Gb/s raw data rate. To the best of our knowledge, the proposed switch is the most power-efficient resonator-based thermo-optic switch unit with the highest tuning efficiency and data ever reported.
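
The two reported figures can be combined into a rough estimate of the resonance shift at the switching point. The sketch below assumes that the wavelength shift scales linearly with heater power, which the abstract does not state; the function and variable names are illustrative.

```python
# Values quoted in the abstract.
SWITCHING_POWER_MW = 0.15          # heater power needed to switch, in mW
TUNING_EFFICIENCY_NM_PER_MW = 7.71  # resonance shift per unit heater power


def resonance_shift_nm(power_mw: float, efficiency_nm_per_mw: float) -> float:
    """Estimate the resonance wavelength shift in nm.

    ASSUMPTION: the shift is linear in heater power over this range.
    """
    return power_mw * efficiency_nm_per_mw


if __name__ == "__main__":
    shift = resonance_shift_nm(SWITCHING_POWER_MW, TUNING_EFFICIENCY_NM_PER_MW)
    print(f"Estimated shift at the switching power: {shift:.3f} nm")
```

Under the linearity assumption this gives a shift of a little over 1 nm, which should be read as an order-of-magnitude illustration rather than a figure from the paper.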
