Heterogeneity-aware Multicore Synchronization for Intermittent Systems

2021 ◽  
Vol 20 (5s) ◽  
pp. 1-22
Author(s):  
Wei-Ming Chen ◽  
Tei-Wei Kuo ◽  
Pi-Cheng Hsiu

Intermittent systems enable batteryless devices to operate through energy harvesting by leveraging the complementary characteristics of volatile memory (VM) and non-volatile memory (NVM). Unfortunately, alternating and frequent accesses to heterogeneous memories for accumulative execution across power cycles can significantly hinder computation progress. The progress impediment is mainly due to more CPU time being wasted on slow NVM accesses than on fast VM accesses. This paper explores how to leverage heterogeneous cores to mitigate the progress impediment caused by heterogeneous memories. In particular, a delegable and adaptive synchronization protocol is proposed that allows memory accesses to be delegated between cores and dynamically adapts to diverse memory access latencies. Moreover, our design guarantees task serializability across multiple cores and maintains data consistency despite frequent power failures. We integrated our design into FreeRTOS running on a Cypress device featuring heterogeneous dual cores and hybrid memories. Experimental results show that, compared to recent approaches that assume single-core intermittent systems, our design improves computation progress by at least 1.8x and by up to 33.9x by leveraging core heterogeneity.
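The core idea of delegation can be illustrated with a minimal, hypothetical sketch: one core keeps computing on fast VM data while slow NVM writes are handed off to the other core through a shared bounded queue. The two cores are simulated with POSIX threads below; the names (nvm_request, delegate_nvm_write, nvm_core) and the queue design are assumptions for illustration only, not the paper's FreeRTOS implementation, and the sketch omits the adaptive latency handling and crash-consistency guarantees the paper provides.

```c
/* Minimal sketch (not the paper's implementation): a "compute" core
 * delegates slow NVM writes to a helper core through a shared bounded
 * queue so it can keep making progress on fast VM data. */
#include <pthread.h>
#include <stdio.h>
#include <stdbool.h>

#define QUEUE_LEN 16

typedef struct { int addr; int value; } nvm_request;

static nvm_request queue[QUEUE_LEN];
static int head = 0, tail = 0;          /* tail - head = items in queue */
static bool done = false;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t not_empty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t not_full  = PTHREAD_COND_INITIALIZER;

static int fake_nvm[64];                /* stands in for slow non-volatile memory */

/* Compute core: enqueue a write request instead of stalling on NVM. */
static void delegate_nvm_write(int addr, int value)
{
    pthread_mutex_lock(&lock);
    while (tail - head == QUEUE_LEN)
        pthread_cond_wait(&not_full, &lock);
    queue[tail % QUEUE_LEN] = (nvm_request){ addr, value };
    tail++;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
}

/* Helper core: drain delegated requests and perform the slow accesses. */
static void *nvm_core(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (head == tail && !done)
            pthread_cond_wait(&not_empty, &lock);
        if (head == tail && done) {
            pthread_mutex_unlock(&lock);
            break;
        }
        nvm_request r = queue[head % QUEUE_LEN];
        head++;
        pthread_cond_signal(&not_full);
        pthread_mutex_unlock(&lock);
        fake_nvm[r.addr] = r.value;     /* the slow NVM write happens here */
    }
    return NULL;
}

int main(void)
{
    pthread_t helper;
    pthread_create(&helper, NULL, nvm_core, NULL);
    for (int i = 0; i < 64; i++)
        delegate_nvm_write(i, i * i);   /* compute core keeps making progress */
    pthread_mutex_lock(&lock);
    done = true;
    pthread_cond_signal(&not_empty);
    pthread_mutex_unlock(&lock);
    pthread_join(helper, NULL);
    printf("fake_nvm[10] = %d\n", fake_nvm[10]);
    return 0;
}
```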

PLoS ONE ◽  
2021 ◽  
Vol 16 (9) ◽  
pp. e0257047
Author(s):  
Adrián Lamela ◽  
Óscar G. Ossorio ◽  
Guillermo Vinuesa ◽  
Benjamín Sahelices

Non-volatile memory technology is now available in commodity hardware. This technology can be used as backup memory for an external DRAM cache without modifying the software. However, the higher read and write latencies of non-volatile memory may exacerbate the memory wall problem. In this work we present a novel off-chip prefetch technique based on a Hidden Markov Model that specifically addresses the latency problem caused by the complexity of off-chip memory access patterns. First, we present a thorough analysis of off-chip memory access patterns to characterize their complexity in multicore processors. Based on this study, we propose a prefetching module located in the LLC that uses two small tables and whose computational complexity is linear in the number of computing threads. Our Markov-based technique is able to track and cluster several simultaneous groups of memory accesses coming from multiple concurrent threads in a multicore processor. It can quickly identify complex address groups and trigger prefetches with very high accuracy. Our simulations show an improvement of up to 76% in the hit ratio of an off-chip DRAM cache for a multicore architecture over the conventional prefetch technique (G/DC). In addition, the overhead of prefetch requests (failed prefetches) is reduced by 48% in single-core simulations and by 83% in multicore simulations.
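To make the Markov-style prediction concrete, here is a minimal sketch of a first-order next-block predictor driven by LLC misses. It is an illustration of the general idea only, far simpler than the paper's HMM with clustering of simultaneous access groups; the table layout, confidence counters, and function names (on_llc_miss, issue_prefetch) are assumptions.

```c
/* Minimal sketch of a Markov-style next-block predictor for off-chip
 * prefetching: a small table remembers, for each recently missed block,
 * the block that most recently followed it; on a later miss to that
 * block, a prefetch for the remembered successor is issued. */
#include <stdio.h>
#include <stdint.h>

#define TABLE_SIZE 256
#define BLOCK_SHIFT 6                 /* 64-byte cache blocks */

typedef struct {
    uint64_t block;                   /* miss block this entry describes */
    uint64_t next;                    /* most recent successor block     */
    int      confidence;              /* simple saturating counter       */
} markov_entry;

static markov_entry table[TABLE_SIZE];
static uint64_t last_block = 0;

static void issue_prefetch(uint64_t block)
{
    printf("prefetch block 0x%llx\n", (unsigned long long)block);
}

/* Called on every off-chip (LLC) miss. */
static void on_llc_miss(uint64_t addr)
{
    uint64_t block = addr >> BLOCK_SHIFT;

    /* 1. Train: record that 'block' followed 'last_block'. */
    markov_entry *e = &table[last_block % TABLE_SIZE];
    if (e->block == last_block && e->next == block) {
        if (e->confidence < 3) e->confidence++;
    } else {
        *e = (markov_entry){ last_block, block, 1 };
    }

    /* 2. Predict: if a confident successor of 'block' is known, prefetch it. */
    markov_entry *p = &table[block % TABLE_SIZE];
    if (p->block == block && p->confidence >= 2)
        issue_prefetch(p->next);

    last_block = block;
}

int main(void)
{
    /* Synthetic trace: a repeating pointer-chasing pattern. */
    uint64_t trace[] = { 0x1000, 0x2040, 0x30C0, 0x1000, 0x2040, 0x30C0,
                         0x1000, 0x2040, 0x30C0 };
    for (unsigned i = 0; i < sizeof trace / sizeof trace[0]; i++)
        on_llc_miss(trace[i]);
    return 0;
}
```

After two passes over the pattern the predictor becomes confident, and the third pass triggers a prefetch on every miss.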


Author(s):  
Eduardo H. M. Cruz ◽  
Matthias Diener ◽  
Laércio L. Pilla ◽  
Philippe O. A. Navaux

Current and future architectures rely on thread-level parallelism to sustain performance growth. These architectures have introduced a complex memory hierarchy, consisting of several cores organized hierarchically with multiple cache levels and NUMA nodes. Such memory hierarchies affect the performance and energy efficiency of parallel applications because memory access locality becomes more important. In order to improve locality, analyzing the memory access behavior of parallel applications is critical for mapping threads and data. Nevertheless, most previous work relies on indirect information about the memory accesses or does not combine thread and data mapping, resulting in less accurate mappings. In this paper, we propose the Sharing-Aware Memory Management Unit (SAMMU), an extension to the memory management unit that allows it to detect memory access behavior in hardware. With this information, the operating system can perform online mapping without any prior knowledge about the behavior of the application. In an evaluation with a wide range of parallel applications (the NAS Parallel Benchmarks and the PARSEC Benchmark Suite), performance improved by up to 35.7% (10.0% on average) and energy efficiency improved by up to 11.9% (4.1% on average). These improvements stem from a substantial reduction in cache misses and interconnection traffic.
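A minimal software sketch of the idea behind sharing-aware mapping follows: hardware (SAMMU in the paper) would count, per page, how often each NUMA node touches it, and the operating system would then place each page on the node that accesses it most. The counter layout, policy, and names (record_access, rebalance_pages) are illustrative assumptions, not the SAMMU design itself.

```c
/* Minimal sketch: per-page, per-node access counters drive a simple
 * data-mapping decision, standing in for what the hardware extension
 * would collect and the OS would act on. */
#include <stdio.h>

#define NUM_PAGES 4
#define NUM_NODES 2

static unsigned access_count[NUM_PAGES][NUM_NODES];
static int page_home[NUM_PAGES];      /* current NUMA node of each page */

/* What the MMU extension would record on every access. */
static void record_access(int page, int node)
{
    access_count[page][node]++;
}

/* What the OS would do periodically with that information. */
static void rebalance_pages(void)
{
    for (int p = 0; p < NUM_PAGES; p++) {
        int best = 0;
        for (int n = 1; n < NUM_NODES; n++)
            if (access_count[p][n] > access_count[p][best])
                best = n;
        if (best != page_home[p]) {
            printf("migrate page %d: node %d -> node %d\n",
                   p, page_home[p], best);
            page_home[p] = best;      /* a real OS would also move the data */
        }
    }
}

int main(void)
{
    /* Page 0 is used mostly by node 1, page 1 mostly by node 0. */
    for (int i = 0; i < 100; i++) record_access(0, 1);
    for (int i = 0; i < 80;  i++) record_access(1, 0);
    record_access(0, 0);
    rebalance_pages();
    return 0;
}
```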


2014 ◽  
Vol 6 ◽  
pp. 754835
Author(s):  
Li Tian ◽  
Cai Meng ◽  
Fugen Zhou

This paper addresses the problem that multi-DSP systems do not support OpenCL programming. With the proposed compiler, runtime, and kernel scheduler, an OpenCL application becomes portable not only across multicore CPUs and GPUs, but also to embedded multi-DSP systems. Firstly, the LLVM compiler was adopted for source-to-source translation, producing translated source supported by CCS. Secondly, a two-level scheduler was proposed to support efficient OpenCL kernel execution. DSP/BIOS is used to schedule system-level tasks such as interrupts and drivers; however, its synchronization mechanism incurs heavy overhead during task switching, so we designed an efficient second-level scheduler dedicated to OpenCL kernel work-item scheduling. Its context switch process utilizes the eight functional units and cross-path links, making it superior to DSP/BIOS in terms of task switching. Finally, dynamic loading and a software-managed cache were redesigned for OpenCL running on multi-DSP systems. We evaluated the performance using common OpenCL kernels from the NVIDIA, AMD, NAS, and Parboil benchmarks. Experimental results show that the DSP OpenCL implementation can efficiently exploit the computing resources of multiple cores.
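The role of a second-level, user-space work-item scheduler can be sketched as follows: the RTOS (DSP/BIOS in the paper) handles system tasks, while the work-items of one OpenCL kernel are dispatched by a tight loop in which a "context switch" is just a function call. This is a simplified illustration under assumed names (run_ndrange, kernel_fn); it ignores barriers, multi-dimensional NDRanges, and the DSP-specific register/functional-unit handling the paper describes.

```c
/* Minimal sketch of a second-level work-item scheduler: execute a 1-D
 * NDRange work-group by work-group with no OS-level context switches. */
#include <stdio.h>
#include <stddef.h>

typedef void (*kernel_fn)(size_t global_id, size_t local_id, void *args);

/* Execute an entire 1-D NDRange on one core. */
static void run_ndrange(kernel_fn k, void *args,
                        size_t global_size, size_t local_size)
{
    for (size_t group = 0; group < global_size / local_size; group++)
        for (size_t lid = 0; lid < local_size; lid++)
            k(group * local_size + lid, lid, args);  /* switching work-items
                                                        is just a call */
}

/* Example kernel: c[i] = a[i] + b[i]. */
struct vadd_args { const float *a, *b; float *c; };

static void vadd(size_t gid, size_t lid, void *p)
{
    (void)lid;
    struct vadd_args *v = p;
    v->c[gid] = v->a[gid] + v->b[gid];
}

int main(void)
{
    float a[8] = {1,2,3,4,5,6,7,8}, b[8] = {8,7,6,5,4,3,2,1}, c[8];
    struct vadd_args args = { a, b, c };
    run_ndrange(vadd, &args, 8, 4);
    for (int i = 0; i < 8; i++) printf("%.0f ", c[i]);
    printf("\n");
    return 0;
}
```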


2011 ◽  
Author(s):  
Steven Alan Wright ◽  
Thomas M. Conboy ◽  
Ross F. Radel ◽  
Gary Eugene Rochau

2013 ◽  
Vol 805-806 ◽  
pp. 477-481 ◽  
Author(s):  
Zhi Feng Chao ◽  
Zu Tao Zhang

Green, safe, and efficient road energy harvesting remains a challenge for road traffic applications. In this paper, we present the design and model of a novel road energy harvesting system based on a sliding deceleration plate for harvesting energy from road traffic. The main components of the system are a sliding deceleration mechanism, a rack-and-pinion transmission, and a mechanical motion rectifier, which together generate electricity from the vibration of the sliding plate excited by passing vehicles. Compared with conventional road harvesters based on speed bumps, the proposed system has potential advantages: it reduces dependence on speed bumps, increases the safety of passing vehicles, and broadens the range of application. Experimental results demonstrate the validity of our method under simulated conditions.


2018 ◽  
Vol 8 (11) ◽  
pp. 2034
Author(s):  
Masoud Hemmatpour ◽  
Bartolomeo Montrucchio ◽  
Maurizio Rebaudengo

Distributed systems are commonly built under the assumption that the network is the primary bottleneck; however, this assumption no longer holds with emerging high-performance RDMA-enabled protocols in data centers. Designing distributed applications over such protocols requires a fundamental rethinking of the communication components in comparison with traditional protocols (i.e., TCP/IP). In this paper, communication paradigms in existing systems as well as possible new paradigms are investigated. The advantages and drawbacks of each paradigm are comprehensively analyzed and experimentally evaluated. The experimental results show that writing requests to the server and reading the responses delivers up to 10 times better performance than the other communication paradigms. To further extend the investigation, the proposed communication paradigm was substituted into a real-world distributed application, enhancing its performance by up to seven times.
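The "write request, read response" paradigm can be sketched as a one-sided mailbox: the client places its request directly into a buffer exposed by the server and later pulls the response itself, so the server CPU only does application work. In the self-contained sketch below, memcpy stands in for the one-sided RDMA WRITE and RDMA READ verbs; the mailbox layout, flags, and function names are illustrative assumptions, not the paper's implementation.

```c
/* Minimal single-process sketch of the write-request / read-response
 * paradigm; memcpy stands in for one-sided RDMA WRITE / RDMA READ. */
#include <stdio.h>
#include <string.h>

struct mailbox {
    volatile int request_ready;       /* set by client after its "RDMA WRITE" */
    volatile int response_ready;      /* set by server when processing is done */
    char request[64];
    char response[64];
};

static struct mailbox server_memory;  /* registered, remotely accessible region */

/* Client side: one-sided write of the request into server memory. */
static void client_write_request(const char *msg)
{
    memcpy(server_memory.request, msg, strlen(msg) + 1);   /* ~RDMA WRITE */
    server_memory.request_ready = 1;
}

/* Server side: only application processing, no receive handling. */
static void server_poll_and_process(void)
{
    if (server_memory.request_ready) {
        snprintf(server_memory.response, sizeof server_memory.response,
                 "echo: %s", server_memory.request);
        server_memory.response_ready = 1;
    }
}

/* Client side: one-sided read of the response from server memory. */
static void client_read_response(char *out, size_t len)
{
    while (!server_memory.response_ready)
        ;                             /* client polls via repeated "RDMA READs" */
    memcpy(out, server_memory.response, len);               /* ~RDMA READ */
}

int main(void)
{
    char reply[64];
    client_write_request("get(key42)");
    server_poll_and_process();
    client_read_response(reply, sizeof reply);
    printf("%s\n", reply);
    return 0;
}
```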

