ACM Transactions on Embedded Computing Systems
Latest Publications


Total documents: 1477 (five years: 310)
H-index: 50 (five years: 7)
Published by: Association for Computing Machinery
ISSN: 1539-9087

2022, Vol. 21(1), pp. 1-24
Author(s): Sheel Sindhu Manohar, Sparsh Mittal, Hemangee K. Kapoor

In the deep sub-micron region, “spin-transfer torque RAM” (STT-RAM) suffers from “read-disturbance error” (RDE), whereby a read operation disturbs the stored data. Mitigating RDE requires restore operations, which impose latency and energy penalties. Hence, RDE presents a crucial threat to the scaling of STT-RAM. In this paper, we offer three techniques to reduce the restore overhead. First, we avoid restore operations for reads whose block will soon be updated at a higher-level cache. Second, we identify read-intensive blocks using a lightweight mechanism and migrate them to a small SRAM buffer, so that future reads to these blocks avoid the restore operation. Third, for data blocks whose value is zero, the write operation is avoided and only a flag is set; based on this flag, both read and restore operations to the block are avoided. We combine these three techniques into our final policy, named CORIDOR. Compared to a baseline policy that performs a restore operation after each read, CORIDOR achieves a 31.6% reduction in total energy and brings the relative CPI (cycles-per-instruction) to 0.64×. By contrast, an ideal RDE-free STT-RAM saves 42.7% energy and brings the relative CPI to 0.62×. Thus, our CORIDOR policy achieves nearly the same performance as an ideal RDE-free STT-RAM cache and reaches three-fourths of the energy savings achieved by the ideal RDE-free cache. We also compare CORIDOR with four previous techniques and show that it provides higher restore energy savings than all of them.
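To make the combined policy concrete, below is a minimal Python sketch (not the authors' implementation) of a CORIDOR-style read path. The class name, the hot-read threshold, and the buffer size are illustrative assumptions.

```python
# Minimal sketch of a CORIDOR-style read path combining the three
# restore-avoidance techniques; structures and thresholds are assumptions.

class CoridorLikeCache:
    def __init__(self, sram_buffer_size=32, hot_threshold=4):
        self.array = {}              # stands in for the STT-RAM data array
        self.zero_flags = set()      # blocks flagged as all-zero (no array write needed)
        self.sram_buffer = {}        # small SRAM buffer for read-intensive blocks
        self.read_counts = {}        # lightweight read-intensity estimator
        self.sram_buffer_size = sram_buffer_size
        self.hot_threshold = hot_threshold

    def write(self, addr, data):
        if data == 0:
            self.zero_flags.add(addr)        # only set the flag; skip the array write
            return
        self.zero_flags.discard(addr)
        self.sram_buffer.pop(addr, None)
        self.array[addr] = data

    def read(self, addr, will_be_updated_upstream=False):
        """Return (data, restored): `restored` is True if a restore write was issued."""
        if addr in self.zero_flags:
            return 0, False                  # zero flag: no destructive array read at all
        if addr in self.sram_buffer:
            return self.sram_buffer[addr], False   # SRAM read does not disturb STT-RAM
        data = self.array.get(addr, 0)       # destructive array read disturbs the cell
        self.read_counts[addr] = self.read_counts.get(addr, 0) + 1
        if will_be_updated_upstream:
            return data, False               # the upper-level write will overwrite it anyway
        if (self.read_counts[addr] >= self.hot_threshold
                and len(self.sram_buffer) < self.sram_buffer_size):
            self.sram_buffer[addr] = data    # migrate a read-intensive block to SRAM
        return data, True                    # default case: restore after the read
```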


2022, Vol. 21(1), pp. 1-22
Author(s): Dongsuk Shin, Hakbeom Jang, Kiseok Oh, Jae W. Lee

A long battery life is a first-class design objective for mobile devices, and main memory accounts for a major portion of their total energy consumption. Moreover, the energy consumed by memory is expected to increase further with ever-growing demands for bandwidth and capacity. A hybrid memory system with both DRAM and PCM can be an attractive solution to provide additional capacity and reduce standby energy. Although PCM provides much greater density than DRAM, its longer access latency and limited write endurance make it challenging to architect for main memory. To address this challenge, this article introduces CAMP, a novel DRAM Cache Architecture for Mobile platforms with PCM-based main memory. A DRAM cache in this environment is required to filter most of the writes to PCM to increase its lifetime, and to deliver high efficiency even for the relatively small DRAM cache that mobile platforms can afford. To this end, CAMP divides the DRAM space into two regions: a page cache for exploiting spatial locality in a bandwidth-efficient manner and a dirty block buffer for maximally filtering writes. CAMP improves performance and energy-delay product by 29.2% and 45.2%, respectively, over a baseline PCM-oblivious DRAM cache, while increasing PCM lifetime by 2.7×. It also improves performance and energy-delay product by 29.3% and 41.5%, respectively, over a state-of-the-art design with a dirty block buffer, while increasing PCM lifetime by 2.5×.
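As an illustration of the two-region idea, the following hedged Python sketch routes requests either to the page cache or to the dirty block buffer; the routing conditions and the direct-to-PCM bypass are assumptions, not CAMP's exact policy.

```python
# Hedged sketch of request routing in a split DRAM cache (page cache + dirty block buffer).

PAGE_CACHE, DIRTY_BLOCK_BUFFER, PCM_DIRECT = "page_cache", "dirty_block_buffer", "pcm_direct"

def route_request(is_write, hits_page_cache, page_has_spatial_locality):
    """Decide which region services a request before it reaches PCM."""
    if is_write:
        # Dirty data is coalesced in the block buffer so most writes never
        # reach PCM, protecting its limited write endurance.
        return DIRTY_BLOCK_BUFFER
    if hits_page_cache or page_has_spatial_locality:
        # Reads with spatial locality are served at page granularity,
        # amortizing the PCM access over a whole page fill.
        return PAGE_CACHE
    return PCM_DIRECT  # cold, isolated reads bypass the DRAM cache
```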


2022, Vol. 21(1), pp. 1-29
Author(s): Lanshun Nie, Chenghao Fan, Shuang Lin, Li Zhang, Yajuan Li, ...

With the technology trend of hardware and workload consolidation for embedded systems and the rapid development of edge computing, there has been increasing interest in supporting parallel real-time tasks to better utilize multi-core platforms while meeting stringent real-time constraints. For parallel real-time tasks, the federated scheduling paradigm, which assigns each parallel task a set of dedicated cores, achieves good theoretical bounds by ensuring exclusive use of processing resources to reduce interference. However, because cores share the last-level cache and memory bandwidth, in practice tasks may still interfere with each other despite executing on dedicated cores. Such interference due to concurrent accesses can be even more severe on embedded platforms or edge servers, where computing power and cache/memory space are limited. To tackle this issue, we present a holistic resource allocation framework for parallel real-time tasks under federated scheduling. Under the proposed framework, each parallel task is assigned dedicated cache and memory bandwidth resources in addition to dedicated cores. We further propose a holistic resource allocation algorithm that balances the allocation across the different resources to achieve good schedulability. Additionally, we provide a full implementation of our framework by extending the federated scheduling system with Intel's Cache Allocation Technology and MemGuard. Finally, we demonstrate the practicality of the proposed framework via extensive numerical evaluations and empirical experiments using real benchmark programs.
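The sketch below illustrates one way such a holistic allocation could be searched: each task greedily receives the smallest (cores, cache ways, bandwidth share) triple that passes a schedulability test. The test is passed in as a callback and the greedy order is an assumption; the paper's actual algorithm and analysis may differ.

```python
# Illustrative greedy search for per-task (cores, cache ways, bandwidth share)
# assignments under federated scheduling. `is_schedulable` is a placeholder
# for a real schedulability test.

def allocate(tasks, total_cores, total_ways, total_bw_shares, is_schedulable):
    remaining = [total_cores, total_ways, total_bw_shares]
    plan = {}
    for task in tasks:
        assignment = None
        for cores in range(1, remaining[0] + 1):
            for ways in range(1, remaining[1] + 1):
                for bw in range(1, remaining[2] + 1):
                    if is_schedulable(task, cores, ways, bw):
                        assignment = (cores, ways, bw)
                        break
                if assignment:
                    break
            if assignment:
                break
        if assignment is None:
            return None  # no feasible allocation with the resources left
        plan[task] = assignment
        remaining = [r - used for r, used in zip(remaining, assignment)]
    return plan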


2022, Vol. 21(1), pp. 1-25
Author(s): Kazi Asifuzzaman, Rommel Sánchez Verdejo, Petar Radojković

It is questionable whether DRAM will continue to scale and meet the needs of next-generation systems; therefore, significant effort is being invested in the research and development of novel memory technologies. One candidate for next-generation memory is Spin-Transfer Torque Magnetic Random Access Memory (STT-MRAM), an emerging non-volatile memory with significant potential for the varied requirements of different computing systems. Although a young technology, STT-MRAM devices are already approaching DRAM in terms of capacity, frequency, and device size. Yet, while STT-MRAM has received significant attention from major memory manufacturers, academic research on STT-MRAM main memory remains marginal. This is mainly due to the lack of publicly available, detailed timing and current parameters for this technology, which are required for reliable main-memory simulation and for performance and power estimation. This study demonstrates an approach to cycle-accurate simulation of STT-MRAM main memory and is the first from academia to release detailed timing and current parameters for this technology, essentially enabling researchers to conduct reliable system-level simulation of STT-MRAM using widely accepted existing simulation infrastructure. The results show a fairly narrow overall performance deviation in response to significant variations in key timing parameters, and the power-consumption experiments identify the key power component that is most affected by STT-MRAM.
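For readers unfamiliar with what "timing and current parameters" means in this context, the sketch below shows the shape of a DRAM-simulator-style configuration for an STT-MRAM main memory. The field names are standard JEDEC-style knobs; the values are deliberately left as placeholders, since the calibrated numbers are the paper's contribution.

```python
# Shape of a cycle-accurate simulator configuration for STT-MRAM main memory.
# Field names follow common DRAM-simulator conventions; all values are
# placeholders (None), not the parameters released by the authors.

stt_mram_timing = {
    "tRCD": None,   # activate-to-column command delay (cycles)
    "tRAS": None,   # row active time
    "tRP":  None,   # precharge time
    "tCAS": None,   # column access (read) latency
    "tWR":  None,   # write recovery time
}

stt_mram_current = {
    "IDD0":  None,  # activate-precharge cycle current
    "IDD2N": None,  # precharge standby current
    "IDD4R": None,  # burst read current
    "IDD4W": None,  # burst write current
}
```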


2022, Vol. 21(1), pp. 1-24
Author(s): Katherine Missimer, Manos Athanassoulis, Richard West

Modern solid-state disks achieve high data transfer rates due to their massive internal parallelism. However, out-of-place updates for flash memory incur garbage collection costs when valid data needs to be copied during space reclamation. The root cause of this extra cost is that solid-state disks are not always able to accurately determine data lifetime and group together data that expires before the space needs to be reclaimed. Real-time systems found in autonomous vehicles, industrial control systems, and assembly-line robots store data from hundreds of sensors and often have predictable data lifetimes. These systems require guaranteed high storage bandwidth for read and write operations by mission-critical real-time tasks. In this article, we depart from the traditional block device interface to guarantee the high throughput needed to process large volumes of data. Using data lifetime information from the application layer, our proposed real-time design, called Telomere, is able to lay out data intelligently in NAND flash memory and eliminate valid page copies during garbage collection. Telomere's real-time admission control guarantees tasks their required read and write operations within their periods. Under randomly generated task sets containing 500 tasks, Telomere achieves 30% higher throughput with a 5% storage cost compared to pre-existing techniques.
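The following minimal sketch (not Telomere's code) shows the core placement idea: writes carrying an application-supplied lifetime are bucketed so that every page in a flash block expires at roughly the same time, letting garbage collection erase blocks without copying valid pages. The log-scale bucketing is an assumption.

```python
# Lifetime-aware placement sketch: group writes with similar expiry times into
# the same open flash block so the whole block can later be erased copy-free.

import math

def lifetime_bucket(lifetime_s, base=2.0):
    """Map a predicted data lifetime (seconds) to a coarse log-scale bucket."""
    return int(math.ceil(math.log(max(lifetime_s, 1e-3), base)))

class LifetimePlacer:
    def __init__(self, pages_per_block=256):
        self.pages_per_block = pages_per_block
        self.open_blocks = {}  # bucket -> list of (lba, expiry_time) in the open block

    def write(self, lba, lifetime_s, now):
        bucket = lifetime_bucket(lifetime_s)
        block = self.open_blocks.setdefault(bucket, [])
        block.append((lba, now + lifetime_s))
        if len(block) == self.pages_per_block:
            # Seal the block: by construction all of its pages expire together,
            # so reclaiming it later needs no valid-page copies.
            self.open_blocks[bucket] = []
```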


2022, Vol. 21(1), pp. 1-27
Author(s): Albin Eldstål-Ahrens, Angelos Arelakis, Ioannis Sourdis

In this article, we introduce L2C, a hybrid lossy/lossless compression scheme applicable both to the memory subsystem and to the I/O traffic of a processor chip. L2C employs general-purpose lossless compression and combines it with state-of-the-art lossy compression to achieve compression ratios of up to 16:1 and to improve the utilization of the chip's bandwidth resources. Compressing memory traffic yields lower memory access time, improving system performance and energy efficiency. Compressing I/O traffic offers several benefits for resource-constrained systems, including more efficient storage and networking. We evaluate L2C as a memory compressor in simulation with a set of approximation-tolerant applications. L2C improves baseline execution time by an average of 50% and total system energy consumption by 16%. Compared to the current state-of-the-art lossy and lossless memory compression approaches, L2C improves execution time by 9% and 26%, respectively, and reduces system energy costs by 3% and 5%, respectively. I/O compression efficacy is evaluated using a set of real-life datasets, for which L2C achieves compression ratios of up to 10.4:1 for a single dataset and about 4:1 on average, while introducing no more than 0.4% error.
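As a rough illustration of a hybrid pipeline of this kind, the sketch below quantizes approximation-tolerant blocks (the lossy stage) only when the introduced error stays within a budget, then applies a general-purpose lossless compressor. The compressors, quantization step, and error metric are stand-ins, not L2C's actual algorithms.

```python
# Hedged hybrid lossy/lossless compression sketch: optional quantization
# followed by a general-purpose lossless stage (zlib as a stand-in).

import zlib

def compress_block(values, approx_tolerant, error_budget=0.004, step=0.05):
    """Compress a block of floats; use the lossy stage only when allowed and safe."""
    payload = list(values)
    if approx_tolerant and payload:
        quantized = [round(v / step) * step for v in payload]
        scale = max(abs(v) for v in payload) or 1.0
        max_rel_err = max(abs(a - b) for a, b in zip(payload, quantized)) / scale
        if max_rel_err <= error_budget:
            payload = quantized  # lossy stage kept within the error budget
    raw = ",".join(f"{v:.4f}" for v in payload).encode()
    return zlib.compress(raw)  # lossless stage
```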


2022, Vol. 21(1), pp. 1-18
Author(s): Fei Wen, Mian Qin, Paul Gratz, Narasimha Reddy

Hybrid memory systems, composed of emerging non-volatile memory (NVM) and DRAM, have been proposed to address the growing memory demand of current mobile applications. Recently emerging NVM technologies, such as phase-change memory (PCM), memristors, and 3D XPoint, offer higher capacity density, minimal static power consumption, and lower cost per GB. However, NVM has longer access latency and limited write endurance compared to DRAM. The differing characteristics of these memory classes pose a new challenge for memory system design: ideally, pages should be placed or migrated between the two types of memory according to the access properties of the data objects. Prior system-software approaches exploit program information from the OS, but at the cost of high software latency incurred by the related kernel processes. Hardware approaches avoid these latencies; however, hardware's view is constrained to a short window of recent memory requests due to limited on-chip resources. In this work, we propose OpenMem, a hardware-software cooperative approach that combines the execution-time advantages of pure hardware approaches with data object properties visible at global scope. First, we built a hardware-based memory management unit (HMMU) that learns short-term access patterns through online profiling and executes data migration efficiently. Then, we built a heap memory manager for heterogeneous memory systems that allows the programmer to customize each data object's allocation directly to a favorable memory device within the presumed object life cycle. With the programmer's hints guiding data placement at allocation time, data objects with similar properties are grouped together, reducing unnecessary page migrations. We implemented the whole system on an FPGA board with embedded ARM processors. In tests with a set of benchmark applications from SPEC 2017 and PARSEC, experimental results show that OpenMem reduces energy consumption by 44.6% with only a 16% performance degradation compared to an all-DRAM memory system. The amount of writes to the NVM is reduced by 14% versus the HMMU-only design, extending the NVM device lifetime.
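The sketch below gives a feel for the programmer-facing side of such a heap manager: an allocation hint describing the object's expected behavior is mapped to a memory tier. The hint names and placement rules are illustrative assumptions, not OpenMem's API.

```python
# Illustrative tier-selection logic for a hint-based heterogeneous heap manager.

DRAM, NVM = "dram", "nvm"

def choose_tier(write_intensive, short_lived, latency_critical):
    """Map programmer-supplied object properties to a memory device."""
    if write_intensive or latency_critical:
        return DRAM   # spare NVM's write endurance and long write latency
    if short_lived:
        return DRAM   # short-lived objects rarely amortize NVM placement
    return NVM        # large, read-mostly, long-lived data can live in NVM

# Hypothetical usage: a large, read-mostly lookup table tagged at allocation time.
table_tier = choose_tier(write_intensive=False, short_lived=False, latency_critical=False)
```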


2022, Vol. 21(1), pp. 1-20
Author(s): Tommaso Marinelli, Jignacio Gómez Pérez, Christian Tenllado, Manu Komalan, Mohit Gupta, ...

As technology scaling advances, the limitations of traditional memories in terms of density and energy become more evident. Modern caches occupy a large part of a CPU's physical area, and high static leakage limits the overall efficiency of systems, including IoT/edge devices. Several alternatives to CMOS SRAM memories have been studied over the past few decades, some of which already represent a viable replacement for different levels of the cache hierarchy. One of the most promising technologies is spin-transfer torque magnetic RAM (STT-MRAM), thanks to its small basic cell design, nearly absent static current, and non-volatility as an added value. However, nothing comes for free, and designers have to deal with other limitations, such as the higher latency and dynamic energy consumption of write operations compared to reads. The goal of this work is to explore several microarchitectural parameters that may overcome some of those drawbacks when using STT-MRAM as the last-level cache (LLC) in embedded devices. These parameters include the number of cache banks, the number of miss status handling registers (MSHRs) and write buffer entries, and the presence of hardware prefetchers. We show that effective tuning of these parameters can virtually remove any performance loss while saving more than 60% of the LLC energy on average. The analysis is then extended by comparing the energy results from calibrated technology models with data obtained from freely available tools, highlighting the importance of using accurate models for architectural exploration.
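A design-space sweep over those parameters could look like the sketch below, where `simulate` stands in for a full-system simulation run returning execution time and LLC energy; the parameter ranges and the 1% slowdown tolerance are assumptions.

```python
# Sweep the LLC microarchitectural parameters discussed above and keep the
# lowest-energy point whose slowdown versus the fastest point is negligible.

import itertools

def sweep(simulate, slack=0.01):
    design_space = {
        "llc_banks": [1, 2, 4, 8],
        "mshrs": [4, 8, 16, 32],
        "write_buffer_entries": [8, 16, 32],
        "prefetcher": [False, True],
    }
    keys = list(design_space)
    results = []
    for combo in itertools.product(*(design_space[k] for k in keys)):
        cfg = dict(zip(keys, combo))
        exec_time, llc_energy = simulate(cfg)  # placeholder for a simulator run
        results.append((cfg, exec_time, llc_energy))
    best_time = min(t for _, t, _ in results)
    candidates = [r for r in results if r[1] <= (1.0 + slack) * best_time]
    return min(candidates, key=lambda r: r[2])  # (config, time, energy)
```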


2021, Vol. 20(6), pp. 1-28
Author(s): Jurn-Gyu Park, Nikil Dutt, Sung-Soo Lim

Modern heterogeneous CPU-GPU-based mobile architectures, which execute intensive mobile gaming/graphics applications, use software governors to achieve high performance with energy efficiency. However, existing governors typically rely on simple statistical or heuristic models that assume linear relationships and are built from small, unbalanced datasets of mobile games; these limitations result in high prediction errors for dynamic and diverse gaming workloads on heterogeneous platforms. To overcome these limitations, we propose an integrated CPU-GPU governor enhanced with interpretable machine learning (ML) models: (1) it builds tree-based piecewise linear models (i.e., model trees) offline, selecting for both high accuracy (low error) and interpretability, the latter quantified by a simulatability metric based on operation counts; and then (2) it deploys the selected models for online estimation in an integrated CPU-GPU dynamic voltage and frequency scaling (DVFS) governor. Our experiments on a test set of 20 mobile games with diverse characteristics show that our governor achieves significant energy-efficiency gains of over 10% (up to 38%) on average in energy-per-frame, with a surprising but modest 3% improvement in frames-per-second performance, compared to a typical state-of-the-art governor that employs simple linear regression models.
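To make "tree-based piecewise linear models" concrete, here is a toy two-leaf model tree in Python/NumPy: a single split on one counter feature, with an ordinary least-squares model per leaf. The feature choice, split rule, and depth are illustrative; the governor's actual models are selected offline for accuracy and simulatability as described above.

```python
# Toy model tree: one split, one linear model (with intercept) per leaf.
# Assumes each leaf receives at least one training sample.

import numpy as np

class TwoLeafModelTree:
    def __init__(self, split_feature=0):
        self.split_feature = split_feature

    def fit(self, X, y):
        self.threshold = np.median(X[:, self.split_feature])
        left = X[:, self.split_feature] <= self.threshold
        self.coef_ = {}
        for name, mask in (("left", left), ("right", ~left)):
            A = np.c_[X[mask], np.ones(mask.sum())]  # features plus intercept column
            self.coef_[name], *_ = np.linalg.lstsq(A, y[mask], rcond=None)
        return self

    def predict(self, X):
        A = np.c_[X, np.ones(len(X))]
        left = X[:, self.split_feature] <= self.threshold
        return np.where(left, A @ self.coef_["left"], A @ self.coef_["right"])
```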


2021, Vol. 20(6), pp. 1-35
Author(s): Junio Cezar Ribeiro Da Silva, Lorena Leão, Vinicius Petrucci, Abdoulaye Gamatié, Fernando Magno Quintão Pereira

A hardware configuration is a set of processors and their frequency levels in a multicore heterogeneous system. This article presents a compiler-based technique to match functions with hardware configurations. The technique uses multivariate linear regression to associate function arguments with particular hardware configurations. By showing that this classification space tends to be convex in practice, the article demonstrates that linear regression is not only an efficient tool for mapping computations to heterogeneous hardware, but also an effective one. To demonstrate the viability of multivariate linear regression as a way to perform adaptive compilation for heterogeneous architectures, we have implemented our ideas in the Soot Java bytecode analyzer. The code we produce can predict the best configuration for a large class of Java and Scala benchmarks running on an Odroid XU4 big.LITTLE board, thereby outperforming prior techniques such as ARM's GTS and CHOAMP, a recently released static program scheduler.
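The regression at the heart of this approach can be sketched as follows: fit one multivariate linear model per hardware configuration that predicts its cost (say, runtime) from a function's argument features, then pick the configuration with the lowest predicted cost at call time. The feature encoding and cost metric are assumptions, and this is not the authors' Soot-based implementation.

```python
# Per-configuration multivariate linear regression and call-time selection.

import numpy as np

def fit_per_config(arg_features, cost_per_config):
    """arg_features: (n_calls, n_features) array of encoded function arguments;
    cost_per_config: dict mapping a configuration name to (n_calls,) observed costs."""
    A = np.c_[arg_features, np.ones(len(arg_features))]  # add an intercept column
    return {cfg: np.linalg.lstsq(A, costs, rcond=None)[0]
            for cfg, costs in cost_per_config.items()}

def pick_config(models, arg_vector):
    """Choose the configuration with the lowest predicted cost for this call."""
    x = np.append(np.asarray(arg_vector, dtype=float), 1.0)
    return min(models, key=lambda cfg: float(x @ models[cfg]))
```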

