Heterogeneity-aware Multicore Synchronization for Intermittent Systems

2021 ◽  
Vol 20 (5s) ◽  
pp. 1-22
Author(s):  
Wei-Ming Chen ◽  
Tei-Wei Kuo ◽  
Pi-Cheng Hsiu

Intermittent systems enable batteryless devices to operate through energy harvesting by leveraging the complementary characteristics of volatile memory (VM) and non-volatile memory (NVM). Unfortunately, alternating and frequent accesses to heterogeneous memories for accumulative execution across power cycles can significantly hinder computation progress. The progress impediment is mainly due to more CPU time being wasted on slow NVM accesses than on fast VM accesses. This paper explores how to leverage heterogeneous cores to mitigate the progress impediment caused by heterogeneous memories. In particular, a delegable and adaptive synchronization protocol is proposed that allows memory accesses to be delegated between cores and dynamically adapts to diverse memory access latencies. Moreover, our design guarantees task serializability across multiple cores and maintains data consistency despite frequent power failures. We integrated our design into FreeRTOS running on a Cypress device featuring heterogeneous dual cores and hybrid memories. Experimental results show that, compared to recent approaches that assume single-core intermittent systems, our design improves computation progress by at least 1.8x and by up to 33.9x by leveraging core heterogeneity.
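The core idea of delegation can be illustrated with a minimal, hypothetical sketch: one core keeps computing on fast VM data while slow NVM writes are handed off to the other core through a shared bounded queue. The two cores are simulated with POSIX threads below; the names (nvm_request, delegate_nvm_write, nvm_core) and the queue design are assumptions for illustration only, not the paper's FreeRTOS implementation, and the sketch omits the adaptive latency handling and crash-consistency guarantees the paper provides.

```c
/* Minimal sketch (not the paper's implementation): a "compute" core
 * delegates slow NVM writes to a helper core through a shared bounded
 * queue so it can keep making progress on fast VM data. */
#include <pthread.h>
#include <stdio.h>
#include <stdbool.h>

#define QUEUE_LEN 16

typedef struct { int addr; int value; } nvm_request;

static nvm_request queue[QUEUE_LEN];
static int head = 0, tail = 0;          /* tail - head = items in queue */
static bool done = false;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;

static int fake_nvm[64];                /* stands in for slow non-volatile memory */

/* Compute core: enqueue a write request instead of stalling on NVM. */
static void delegate_nvm_write(int addr, int value)
{
    pthread_mutex_lock(&lock);
    while (tail - head == QUEUE_LEN)
        pthread_cond_wait(&not_full, &lock);
    queue[tail % QUEUE_LEN] = (nvm_request){ addr, value };
    tail++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
}

/* Helper core: drain delegated requests and perform the slow accesses. */
static void *nvm_core(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (head == tail && !done)
            pthread_cond_wait(&not_empty, &lock);
        if (head == tail && done) {
            pthread_mutex_unlock(&lock);
            break;
        }
        nvm_request r = queue[head % QUEUE_LEN];
        head++;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&lock);
        fake_nvm[r.addr] = r.value;     /* the slow NVM write happens here */
    }
    return NULL;
}

int main(void)
{
    pthread_t helper;
    pthread_create(&helper, NULL, nvm_core, NULL);
    for (int i = 0; i < 64; i++)
        delegate_nvm_write(i, i * i);   /* compute core keeps making progress */
    pthread_mutex_lock(&lock);
    done = true;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
    pthread_join(helper, NULL);
    printf("fake_nvm[10] = %d\n", fake_nvm[10]);
    return 0;
}
```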

PLoS ONE ◽  
2021 ◽  
Vol 16 (9) ◽  
pp. e0257047
Author(s):  
Adrián Lamela ◽  
Óscar G. Ossorio ◽  
Guillermo Vinuesa ◽  
Benjamín Sahelices

Non-volatile memory technology is now available in commodity hardware. This technology can be used as backup memory for an external DRAM cache without modifying the software. However, the higher read and write latencies of non-volatile memory may exacerbate the memory wall problem. In this work we present a novel off-chip prefetch technique based on a Hidden Markov Model that specifically addresses the latency problem caused by the complexity of off-chip memory access patterns. First, we present a thorough analysis of off-chip memory access patterns to characterize their complexity in multicore processors. Based on this study, we propose a prefetching module located in the LLC that uses two small tables and whose computational complexity is linear in the number of computing threads. Our Markov-based technique is able to track and cluster several simultaneous groups of memory accesses coming from multiple concurrent threads in a multicore processor. It can quickly identify complex address groups and trigger prefetches with very high accuracy. Our simulations show an improvement of up to 76% in the hit ratio of an off-chip DRAM cache for a multicore architecture over the conventional prefetch technique (G/DC). In addition, the overhead of prefetch requests (failed prefetches) is reduced by 48% in single-core simulations and by 83% in multicore simulations.
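To make the Markov-style prediction concrete, here is a minimal sketch of a first-order next-block predictor driven by LLC misses. It is an illustration of the general idea only, far simpler than the paper's HMM with clustering of simultaneous access groups; the table layout, confidence counters, and function names (on_llc_miss, issue_prefetch) are assumptions.

```c
/* Minimal sketch of a Markov-style next-block predictor for off-chip
 * prefetching: a small table remembers, for each recently missed block,
 * the block that most recently followed it; on a later miss to that
 * block, a prefetch for the remembered successor is issued. */
#include <stdio.h>
#include <stdint.h>

#define TABLE_SIZE 256
#define BLOCK_SHIFT 6                 /* 64-byte cache blocks */

typedef struct {
    uint64_t block;                   /* miss block this entry describes */
    uint64_t next;                    /* most recent successor block     */
    int      confidence;              /* simple saturating counter       */
} markov_entry;

static markov_entry table[TABLE_SIZE];
static uint64_t last_block = 0;

static void issue_prefetch(uint64_t block)
{
    printf("prefetch block 0x%llx\n", (unsigned long long)block);
}

/* Called on every off-chip (LLC) miss. */
static void on_llc_miss(uint64_t addr)
{
    uint64_t block = addr >> BLOCK_SHIFT;

    /* 1. Train: record that 'block' followed 'last_block'. */
    markov_entry *e = &table[last_block % TABLE_SIZE];
    if (e->block == last_block && e->next == block) {
        if (e->confidence < 3) e->confidence++;
    } else {
        *e = (markov_entry){ last_block, block, 1 };
    }

    /* 2. Predict: if a confident successor of 'block' is known, prefetch it. */
    markov_entry *p = &table[block % TABLE_SIZE];
    if (p->block == block && p->confidence >= 2)
        issue_prefetch(p->next);

    last_block = block;
}

int main(void)
{
    /* Synthetic trace: a repeating pointer-chasing pattern. */
    uint64_t trace[] = { 0x1000, 0x2040, 0x30C0, 0x1000, 0x2040, 0x30C0,
                         0x1000, 0x2040, 0x30C0 };
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++)
        on_llc_miss(trace[i]);
    return 0;
}
```

After two passes over the pattern the predictor becomes confident, and the third pass triggers a prefetch on every miss.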


Author(s):  
Eduardo H. M. Cruz ◽  
Matthias Diener ◽  
Laércio L. Pilla ◽  
Philippe O. A. Navaux

Current and future architectures rely on thread-level parallelism to sustain performance growth. These architectures have introduced a complex memory hierarchy, consisting of several cores organized hierarchically with multiple cache levels and NUMA nodes. Such memory hierarchies affect the performance and energy efficiency of parallel applications because memory access locality becomes more important. In order to improve locality, analyzing the memory access behavior of parallel applications is critical for mapping threads and data. Nevertheless, most previous work relies on indirect information about the memory accesses or does not combine thread and data mapping, resulting in less accurate mappings. In this paper, we propose the Sharing-Aware Memory Management Unit (SAMMU), an extension to the memory management unit that allows it to detect memory access behavior in hardware. With this information, the operating system can perform online mapping without any prior knowledge about the behavior of the application. In an evaluation with a wide range of parallel applications (the NAS Parallel Benchmarks and the PARSEC Benchmark Suite), performance improved by up to 35.7% (10.0% on average) and energy efficiency improved by up to 11.9% (4.1% on average). These improvements stem from a substantial reduction in cache misses and interconnection traffic.
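A minimal software sketch of the idea behind sharing-aware mapping follows: hardware (SAMMU in the paper) would count, per page, how often each NUMA node touches it, and the operating system would then place each page on the node that accesses it most. The counter layout, policy, and names (record_access, rebalance_pages) are illustrative assumptions, not the SAMMU design itself.

```c
/* Minimal sketch: per-page, per-node access counters drive a simple
 * data-mapping decision, standing in for what the hardware extension
 * would collect and the OS would act on. */
#include <stdio.h>

#define NUM_PAGES 4
#define NUM_NODES 2

static unsigned access_count[NUM_PAGES][NUM_NODES];
static int page_home[NUM_PAGES];      /* current NUMA node of each page */

/* What the MMU extension would record on every access. */
static void record_access(int page, int node)
{
    access_count[page][node]++;
}

/* What the OS would do periodically with that information. */
static void rebalance_pages(void)
{
    for (int p = 0; p < NUM_PAGES; p++) {
        int best = 0;
        for (int n = 1; n < NUM_NODES; n++)
            if (access_count[p][n] > access_count[p][best])
                best = n;
        if (best != page_home[p]) {
            printf("migrate page %d: node %d -> node %d\n",
                   p, page_home[p], best);
            page_home[p] = best;      /* a real OS would also move the data */
        }
    }
}

int main(void)
{
    /* Page 0 is used mostly by node 1, page 1 mostly by node 0. */
    for (int i = 0; i < 100; i++) record_access(0, 1);
    for (int i = 0; i < 80;  i++) record_access(1, 0);
    record_access(0, 0);
    rebalance_pages();
    return 0;
}
```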


2014 ◽  
Vol 6 ◽  
pp. 754835
Author(s):  
Li Tian ◽  
Cai Meng ◽  
Fugen Zhou

This paper addresses the problem that multi-DSP systems do not support OpenCL programming. With the proposed compiler, runtime, and kernel scheduler, an OpenCL application becomes portable not only across multicore CPUs and GPUs, but also to embedded multi-DSP systems. Firstly, the LLVM compiler was adopted for source-to-source translation, producing translated source supported by CCS. Secondly, a two-level scheduler was proposed to support efficient OpenCL kernel execution. DSP/BIOS is used to schedule system-level tasks such as interrupts and drivers; however, its synchronization mechanism incurs heavy overhead during task switching, so we designed an efficient second-level scheduler dedicated to OpenCL kernel work-item scheduling. Its context switch process utilizes the eight functional units and cross-path links, making it superior to DSP/BIOS in terms of task switching. Finally, dynamic loading and a software-managed cache were redesigned for OpenCL running on multi-DSP systems. We evaluated the performance using common OpenCL kernels from the NVIDIA, AMD, NAS, and Parboil benchmarks. Experimental results show that the DSP OpenCL implementation can efficiently exploit the computing resources of multiple cores.
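The role of a second-level, user-space work-item scheduler can be sketched as follows: the RTOS (DSP/BIOS in the paper) handles system tasks, while the work-items of one OpenCL kernel are dispatched by a tight loop in which a "context switch" is just a function call. This is a simplified illustration under assumed names (run_ndrange, kernel_fn); it ignores barriers, multi-dimensional NDRanges, and the DSP-specific register/functional-unit handling the paper describes.

```c
/* Minimal sketch of a second-level work-item scheduler: execute a 1-D
 * NDRange work-group by work-group with no OS-level context switches. */
#include <stdio.h>
#include <stddef.h>

typedef void (*kernel_fn)(size_t global_id, size_t local_id, void *args);

/* Execute an entire 1-D NDRange on one core. */
static void run_ndrange(kernel_fn k, void *args,
                        size_t global_size, size_t local_size)
{
    for (size_t group = 0; group < global_size / local_size; group++)
        for (size_t lid = 0; lid < local_size; lid++)
            k(group * local_size + lid, lid, args);  /* switching work-items
                                                        is just a call */
}

/* Example kernel: c[i] = a[i] + b[i]. */
struct vadd_args { const float *a, *b; float *c; };

static void vadd(size_t gid, size_t lid, void *p)
{
    (void)lid;
    struct vadd_args *v = p;
    v->c[gid] = v->a[gid] + v->b[gid];
}

int main(void)
{
    float a[8] = {1,2,3,4,5,6,7,8}, b[8] = {8,7,6,5,4,3,2,1}, c[8];
    struct vadd_args args = { a, b, c };
    run_ndrange(vadd, &args, 8, 4);
    for (int i = 0; i < 8; i++) printf("%.0f ", c[i]);
    printf("\n");
    return 0;
}
```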


2011 ◽  
Author(s):  
Steven Alan Wright ◽  
Thomas M. Conboy ◽  
Ross F. Radel ◽  
Gary Eugene Rochau

2013 ◽  
Vol 805-806 ◽  
pp. 477-481 ◽  
Author(s):  
Zhi Feng Chao ◽  
Zu Tao Zhang

Green, safe, and efficient road energy harvesting remains a challenge for road traffic applications. In this paper, we present the design and model of a novel road energy harvesting system based on a sliding deceleration plate for harvesting energy from road traffic. The main components of the system are a sliding deceleration mechanism, a rack-and-pinion transmission, and a mechanical motion rectifier, which together generate electricity from the vibration of the sliding plate excited by passing vehicles. Compared with conventional road harvesters based on speed bumps, the proposed system has potential advantages: it reduces dependence on speed bumps, increases the safety of passing vehicles, and broadens the range of application. Experimental results demonstrate the validity of our method under simulated conditions.


2018 ◽  
Vol 8 (11) ◽  
pp. 2034
Author(s):  
Masoud Hemmatpour ◽  
Bartolomeo Montrucchio ◽  
Maurizio Rebaudengo

Distributed systems are commonly built under the assumption that the network is the primary bottleneck; however, this assumption no longer holds with emerging high-performance RDMA-enabled protocols in data centers. Designing distributed applications over such protocols requires a fundamental rethinking of the communication components in comparison with traditional protocols (i.e., TCP/IP). In this paper, communication paradigms in existing systems as well as possible new paradigms are investigated. The advantages and drawbacks of each paradigm are comprehensively analyzed and experimentally evaluated. The experimental results show that writing requests to the server and reading the responses delivers up to 10 times better performance than the other communication paradigms. To further extend the investigation, the proposed communication paradigm was substituted into a real-world distributed application, enhancing its performance by up to seven times.
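The "write request, read response" paradigm can be sketched as a one-sided mailbox: the client places its request directly into a buffer exposed by the server and later pulls the response itself, so the server CPU only does application work. In the self-contained sketch below, memcpy stands in for the one-sided RDMA WRITE and RDMA READ verbs; the mailbox layout, flags, and function names are illustrative assumptions, not the paper's implementation.

```c
/* Minimal single-process sketch of the write-request / read-response
 * paradigm; memcpy stands in for one-sided RDMA WRITE / RDMA READ. */
#include <stdio.h>
#include <string.h>

struct mailbox {
    volatile int request_ready;       /* set by client after its "RDMA WRITE" */
    volatile int response_ready;      /* set by server when processing is done */
    char request[64];
    char response[64];
};

static struct mailbox server_memory;  /* registered, remotely accessible region */

/* Client side: one-sided write of the request into server memory. */
static void client_write_request(const char *msg)
{
    memcpy(server_memory.request, msg, strlen(msg) + 1);   /* ~RDMA WRITE */
    server_memory.request_ready = 1;
}

/* Server side: only application processing, no receive handling. */
static void server_poll_and_process(void)
{
    if (server_memory.request_ready) {
        snprintf(server_memory.response, sizeof server_memory.response,
                 "echo: %s", server_memory.request);
        server_memory.response_ready = 1;
    }
}

/* Client side: one-sided read of the response from server memory. */
static void client_read_response(char *out, size_t len)
{
    while (!server_memory.response_ready)
        ;                             /* client polls via repeated "RDMA READs" */
    memcpy(out, server_memory.response, len);               /* ~RDMA READ */
}

int main(void)
{
    char reply[64];
    client_write_request("get(key42)");
    server_poll_and_process();
    client_read_response(reply, sizeof reply);
    printf("%s\n", reply);
    return 0;
}
```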

