Energy and Performance Trade-Off Optimization in Heterogeneous Computing via Reinforcement Learning

This paper presents an optimisation approach for heterogeneous computing systems that balances power consumption and efficiency. The work proposes a power measurement utility for a reinforcement learning (PMU-RL) algorithm that dynamically adjusts the resource utilisation of heterogeneous platforms in order to minimise power consumption. A reinforcement learning (RL) technique is applied to analyse and optimise the resource utilisation of field programmable gate array (FPGA) control state capabilities, which is built for a simulation environment with a Xilinx ZYNQ multi-processor systems-on-chip (MPSoC) board. In this study, a balanced operation mode for improving power consumption and performance is established to dynamically change the work state of the programmable logic (PL) end. It is based on an RL algorithm that can quickly discover the optimisation effect of the PL on different workloads to improve energy efficiency. The results demonstrate a substantial reduction of 18% in energy consumption without affecting the application’s performance. Thus, the proposed PMU-RL technique has the potential to be considered for other heterogeneous computing platforms.
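
The idea of learning which PL work state suits each workload can be illustrated with a minimal sketch. It is not the PMU-RL algorithm itself: it uses a one-step, contextual-bandit simplification of tabular Q-learning (no discounting), and every name and number in it (`WORKLOADS`, `PL_MODES`, the power and capacity tables, the penalty weight) is a hypothetical placeholder rather than a value from the paper.

```python
import random

# Hypothetical workload levels (states) and PL work states (actions).
WORKLOADS = ["light", "medium", "heavy"]
PL_MODES = ["low_power", "balanced", "high_perf"]

# Made-up power draw (W) and processing capacity per PL mode; not measured values.
POWER_W = {"low_power": 1.0, "balanced": 2.0, "high_perf": 3.5}
CAPACITY = {"low_power": 1, "balanced": 2, "high_perf": 3}
DEMAND = {"light": 1, "medium": 2, "heavy": 3}


def reward(workload, mode):
    """Penalise power draw, and penalise heavily when the mode cannot meet demand."""
    shortfall = max(0, DEMAND[workload] - CAPACITY[mode])
    return -POWER_W[mode] - 10.0 * shortfall


def train(episodes=5000, alpha=0.1, epsilon=0.1, seed=0):
    """Learn a value for every (workload, mode) pair with epsilon-greedy exploration."""
    rng = random.Random(seed)
    q = {(w, m): 0.0 for w in WORKLOADS for m in PL_MODES}
    for _ in range(episodes):
        w = rng.choice(WORKLOADS)  # the observed workload does not depend on past actions
        if rng.random() < epsilon:
            m = rng.choice(PL_MODES)  # explore
        else:
            m = max(PL_MODES, key=lambda a: q[(w, a)])  # exploit
        # One-step update toward the observed reward (no bootstrapped next-state term).
        q[(w, m)] += alpha * (reward(w, m) - q[(w, m)])
    return q


def policy(q):
    """Greedy PL mode for each workload level."""
    return {w: max(PL_MODES, key=lambda a: q[(w, a)]) for w in WORKLOADS}


if __name__ == "__main__":
    print(policy(train()))
```

With these placeholder tables the learned policy picks the cheapest mode that still meets demand, which is the kind of power/performance balance the abstract describes; a real controller would replace `reward` with measured power and performance counters.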

Performance and energy optimization of heterogeneous CPU-GPU systems for embedded applications

10.32920/ryerson.14661414 ◽

2021 ◽

Author(s):

Abdullah Siddiqui

Keyword(s):

Embedded Systems ◽

Power Consumption ◽

Optimization Algorithm ◽

Heterogeneous Computing ◽

Energy Optimization ◽

Systems Design ◽

Application Partitioning ◽

Computing Platforms ◽

Embedded Applications ◽

Software Partitioning

One of the most critical steps of embedded systems design is hardware-software partitioning. It consists of distributing the components of an application between hardware and software such that the user-defined system constraints are satisfied. Heterogeneous computing platforms consisting of CPUs and GPUs have tremendous potential for enhancing the performance of embedded applications. The challenge of application partitioning for CPU-GPU mapping is much greater on such platforms due to their unique and diverse characteristics. In this thesis, an optimization algorithm is devised and presented for partitioning and mapping computational tasks on CPU-GPU platforms while keeping a check on the power consumption. Our methodology also exploits parallelism in applications and their tasks by utilizing the architectural capabilities of the GPU. The optimization algorithm was tested with an MJPEG decoder, several benchmarks and synthetic graphs.
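
To make the partitioning problem concrete, the sketch below maps a handful of tasks to a CPU or a GPU by exhaustive search, minimising total execution time under an energy budget. This is an illustrative stand-in, not the thesis's optimization algorithm: the task table, the sequential-execution assumption, and the use of an energy budget as the power check are all assumptions made for the example.

```python
from itertools import product

# (name, cpu_time_ms, gpu_time_ms, cpu_power_w, gpu_power_w): illustrative numbers only.
TASKS = [
    ("parse", 4.0, 6.0, 2.0, 5.0),
    ("idct", 10.0, 2.0, 2.0, 6.0),
    ("colour", 8.0, 3.0, 2.0, 6.0),
]


def evaluate(mapping):
    """Return (total time in ms, total energy in mJ), assuming tasks run one after another."""
    time_ms = 0.0
    energy_mj = 0.0
    for (_, cpu_t, gpu_t, cpu_p, gpu_p), device in zip(TASKS, mapping):
        t, p = (cpu_t, cpu_p) if device == "cpu" else (gpu_t, gpu_p)
        time_ms += t
        energy_mj += t * p  # ms * W = mJ
    return time_ms, energy_mj


def best_mapping(energy_budget_mj):
    """Try every CPU/GPU assignment; keep the fastest one within the energy budget."""
    best = None
    for mapping in product(("cpu", "gpu"), repeat=len(TASKS)):
        time_ms, energy_mj = evaluate(mapping)
        if energy_mj <= energy_budget_mj and (best is None or time_ms < best[1]):
            best = (mapping, time_ms, energy_mj)
    return best  # None when no mapping fits the budget


if __name__ == "__main__":
    print(best_mapping(40.0))
```

Exhaustive search grows as 2^n in the number of tasks, so it only serves to show the objective and the constraint; larger task graphs need a heuristic or an exact solver.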

Multi-Vdd Design for Content Addressable Memories (CAM): A Power-Delay Optimization Analysis

Journal of Low Power Electronics and Applications ◽

10.3390/jlpea8030025 ◽

2018 ◽

Vol 8 (3) ◽

pp. 25 ◽

Cited By ~ 6

Author(s):

Siddhartha Joshi ◽

Dawei Li ◽

Seda Ogrenci-Memik ◽

Grzegorz Deptuch ◽

James Hoff ◽

...

Keyword(s):

Power Consumption ◽

Simulation Analysis ◽

CMOS Technology ◽

Power Measurement ◽

Test Chip ◽

Voltage Range ◽

Content Addressable Memory ◽

Noise Margin ◽

Low Power CMOS ◽

And Performance

In this paper, we characterize the interplay between power consumption and performance of a matchline-based Content Addressable Memory (CAM) and then propose the use of a multi-Vdd design to save power and increase post-fabrication tunability. Exploration of the power consumption behavior of a CAM chip shows drastically different behavior among its components and suggests the use of separate, independent power supplies. The complete design, simulation and testing of a multi-Vdd CAM chip along with an exploration of the multi-Vdd design space are presented. Our analysis has been applied to simulated models on two different technology nodes (130 nm and 45 nm), followed by experiments on a 246-kb test chip fabricated in 130 nm Global Foundries Low Power CMOS technology. The proposed design, operating at an optimal operating point in a triple-Vdd configuration, increases the power-delay operation range by 2.4 times and consumes 25.3% less dynamic power when compared to a conventional single-Vdd design operating over the same voltage range with equivalent noise margin. Our multi-Vdd design also saves 51.3% of standby power. Measurement results from the test chip combined with the simulation analysis at the two nodes validate our thesis.
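
As background for why independent supplies can save power, the standard first-order CMOS dynamic power model can be written per component. The symbols below (activity factor \alpha_i, switched capacitance C_i, supply voltage V_i, clock frequency f) are textbook notation assumed here, not taken from the paper, and the model ignores leakage and short-circuit power.

```latex
P_{\mathrm{dyn}}^{\mathrm{single}} = f \sum_{i} \alpha_i C_i V^2,
\qquad
P_{\mathrm{dyn}}^{\mathrm{multi}} = f \sum_{i} \alpha_i C_i V_i^2,
\qquad
\frac{P_{\mathrm{dyn}}^{\mathrm{multi}}}{P_{\mathrm{dyn}}^{\mathrm{single}}}
  = \frac{\sum_{i} \alpha_i C_i V_i^2}{V^2 \sum_{i} \alpha_i C_i}
```

Because each term scales with V_i^2, lowering the supply of a component with a large \alpha_i C_i reduces dynamic power quadratically, which is the lever a multi-Vdd design exposes. The 25.3% figure in the abstract is a reported result and is not derived from this model.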

Performance and energy optimization of heterogeneous CPU-GPU systems for embedded applications

10.32920/ryerson.14661414.v1 ◽

2021 ◽

Author(s):

Abdullah Siddiqui

Keyword(s):

Embedded Systems ◽

Power Consumption ◽

Optimization Algorithm ◽

Heterogeneous Computing ◽

Energy Optimization ◽

Systems Design ◽

Application Partitioning ◽

Computing Platforms ◽

Embedded Applications ◽

Software Partitioning

Faculty Opinions recommendation of Dopamine and performance in a reinforcement learning task: evidence from Parkinson's disease.

Faculty Opinions – Post-Publication Peer Review of the Biomedical Literature ◽

10.3410/f.713947855.789352805 ◽

2012 ◽

Author(s):

Kent Berridge

Keyword(s):

Parkinson's Disease ◽

Reinforcement Learning ◽

Learning Task ◽

And Performance

Scheduling Challenges in Mixed Critical Real-Time Heterogeneous Computing Platforms

Procedia Computer Science ◽

10.1016/j.procs.2013.05.358 ◽

2013 ◽

Vol 18 ◽

pp. 1891-1898

Author(s):

Chetan Kumar N G ◽

Sudhanshu Vyas ◽

Ron K. Cytron ◽

Christopher D. Gill ◽

Joseph Zambreno ◽

...

Keyword(s):

Real Time ◽

Heterogeneous Computing ◽

Computing Platforms

Value Iteration Architecture Based Deep Learning for Intelligent Routing Exploiting Heterogeneous Computing Platforms

IEEE Transactions on Computers ◽

10.1109/tc.2018.2874483 ◽

2019 ◽

Vol 68 (6) ◽

pp. 939-950 ◽

Cited By ~ 14

Author(s):

Zubair Md. Fadlullah ◽

Bomin Mao ◽

Fengxiao Tang ◽

Nei Kato

Keyword(s):

Deep Learning ◽

Heterogeneous Computing ◽

Value Iteration ◽

Intelligent Routing ◽

Computing Platforms

Power and Performance Evaluation of Memory-Intensive Applications

Energies ◽

10.3390/en14144089 ◽

2021 ◽

Vol 14 (14) ◽

pp. 4089

Author(s):

Kaiqiang Zhang ◽

Dongyang Ou ◽

Congfeng Jiang ◽

Yeliang Qiu ◽

Longchuan Yan

Keyword(s):

Energy Efficiency ◽

Energy Consumption ◽

Power Consumption ◽

Job Scheduling ◽

Memory System ◽

Processor Core ◽

Memory Efficiency ◽

And Performance ◽

Reasonable Use ◽

Server System

In terms of power and energy consumption, DRAM plays as key a role in a modern server system as the processors do. Although power-aware scheduling is based on the proportion of energy between DRAM and other components, when running memory-intensive applications, the energy consumption of the whole server system will be significantly affected by the non-energy proportion of DRAM. Furthermore, modern servers usually replace the original SMP architecture with a NUMA architecture to increase memory bandwidth. It is of great significance to study the energy efficiency of these two different memory architectures. Therefore, in order to explore the power consumption characteristics of servers under memory-intensive workloads, this paper evaluates the power consumption and performance of memory-intensive applications on different generations of real rack servers. Through analysis, we find that: (1) Workload intensity and the number of concurrent execution threads affect server power consumption, but a fully utilized memory system may not necessarily bring good energy efficiency indicators. (2) Even if the memory system is not fully utilized, the memory capacity of each processor core has a significant impact on application performance and server power consumption. (3) When running memory-intensive applications, memory utilization is not always a good indicator of server power consumption. (4) The reasonable use of the NUMA architecture will improve the memory energy efficiency significantly. The experimental results show that reasonable use of the NUMA architecture can improve memory efficiency by 16% compared with the SMP architecture, while unreasonable use of the NUMA architecture reduces memory efficiency by 13%. The findings we present in this paper provide useful insights and guidance for system designers and data center operators to help them in energy-efficiency-aware job scheduling and energy conservation.

Fuzzy-Based Thermal Management Scheme for 3D Chip Multicores with Stacked Caches

Electronics ◽

10.3390/electronics9020346 ◽

2020 ◽

Vol 9 (2) ◽

pp. 346 ◽

Cited By ~ 1

Author(s):

Lili Shen ◽

Ning Wu ◽

Gaizhen Yan

Keyword(s):

Power Consumption ◽

Thermal Management ◽

System Performance ◽

Control Policy ◽

Three Dimension ◽

Processor Core ◽

Management Scheme ◽

And Performance ◽

On Chip ◽

Silicon Vias

By using through-silicon-vias (TSV), three-dimensional integration technology can stack large memory on top of cores as a last-level on-chip cache (LLC) to reduce off-chip memory access and enhance system performance. However, the integration of more on-chip caches increases chip power density, which might lead to temperature-related issues in power consumption, reliability, cooling cost, and performance. An effective thermal management scheme is required to ensure the performance and reliability of the system. In this study, a fuzzy-based thermal management scheme (FBTM) is proposed that simultaneously considers cores and stacked caches. The proposed method combines a dynamic cache reconfiguration scheme with a fuzzy-based control policy in a temperature-aware manner. The dynamic cache reconfiguration scheme determines the size of the cache for the processor core according to the application, which yields substantial power savings. The fuzzy-based control policy is used to change the frequency level of the processor core based on dynamic cache reconfiguration, a process which can further improve the system performance. Experiments show that, compared with other thermal management schemes, the proposed FBTM can achieve, on average, a 3-degree reduction in temperature and a 41% reduction in leakage energy.
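
A fuzzy control policy of the kind mentioned above can be sketched in a few lines. The example is not the FBTM rule base: it uses temperature as the only input, triangular membership functions, and a zero-order Sugeno-style weighted average for defuzzification, and all set boundaries and frequency levels (`TEMP_SETS`, `RULE_FREQ_GHZ`) are hypothetical values chosen for illustration.

```python
def tri(x, a, b, c):
    """Triangular membership: 0 at or beyond a and c, rising to 1 at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


# Hypothetical fuzzy sets over core temperature in degrees Celsius: (left, peak, right).
TEMP_SETS = {"cool": (30, 50, 70), "warm": (50, 70, 90), "hot": (70, 90, 110)}

# Hypothetical rule consequents: one crisp frequency (GHz) per temperature set.
RULE_FREQ_GHZ = {"cool": 2.0, "warm": 1.5, "hot": 1.0}


def target_frequency(temp_c):
    """Blend the rule outputs, weighted by how strongly each rule fires."""
    weights = {name: tri(temp_c, *abc) for name, abc in TEMP_SETS.items()}
    total = sum(weights.values())
    if total == 0.0:
        # Outside every set: clamp to the nearest extreme rule.
        return RULE_FREQ_GHZ["cool"] if temp_c < 50 else RULE_FREQ_GHZ["hot"]
    return sum(weights[n] * RULE_FREQ_GHZ[n] for n in weights) / total


if __name__ == "__main__":
    for t in (40, 60, 80, 100):
        print(t, target_frequency(t))
```

The weighted average makes the frequency change smoothly between levels as temperature rises, rather than switching abruptly at a threshold; a scheme like the one in the abstract would add further inputs, such as the current cache configuration.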

BlastFunction: A Full-stack Framework Bringing FPGA Hardware Acceleration to Cloud-native Applications

ACM Transactions on Reconfigurable Technology and Systems ◽

10.1145/3472958 ◽

2022 ◽

Vol 15 (2) ◽

pp. 1-27

Author(s):

Andrea Damiani ◽

Giorgia Fiscaletti ◽

Marco Bacis ◽

Rolando Brondolin ◽

Marco D. Santambrogio

Keyword(s):

Heterogeneous Computing ◽

Performance Metrics ◽

State Of The Art ◽

Hardware Acceleration ◽

Standard Approach ◽

Cloud Models ◽

Experimental Campaign ◽

Computing Platforms ◽

Cloud Infrastructures

“Cloud-native” is the umbrella adjective describing the standard approach for developing applications that exploit cloud infrastructures’ scalability and elasticity at their best. As application complexity and user bases grow, designing for performance becomes a first-class engineering concern. As an answer to these needs, heterogeneous computing platforms have gained widespread attention as powerful tools to continue meeting SLAs for compute-intensive cloud-native workloads. We propose BlastFunction, an FPGA-as-a-Service full-stack framework to ease FPGAs’ adoption for cloud-native workloads, integrating with the vast spectrum of fundamental cloud models. At the IaaS level, BlastFunction time-shares FPGA-based accelerators to provide multi-tenant access to accelerated resources without any code rewriting. At the PaaS level, BlastFunction accelerates functionalities leveraging the serverless model and scales functions proactively, depending on the workload’s performance. Further lowering the FPGAs’ adoption barrier, an accelerators’ registry hosts accelerated functions ready to be used within cloud-native applications, bringing the simplicity of a SaaS-like approach to the developers. After an extensive experimental campaign against state-of-the-art cloud scenarios, we show how BlastFunction leads to higher performance metrics (utilization and throughput) compared with native execution, with minimal latency and overhead differences. Moreover, the scaling scheme we propose outperforms the main serverless autoscaling algorithms in workload performance and in the number of scaling operations.

Transient provisioning and performance evaluation for cloud computing platforms: A capacity value approach

Performance Evaluation ◽

10.1016/j.peva.2017.10.001 ◽

2018 ◽

Vol 118 ◽

pp. 48-62 ◽

Cited By ~ 2

Author(s):

Brendan Patch ◽

Thomas Taimre

Keyword(s):

Cloud Computing ◽

Performance Evaluation ◽

Computing Platforms ◽

And Performance
