memory accesses
Recently Published Documents


TOTAL DOCUMENTS: 193 (FIVE YEARS: 60)

H-INDEX: 12 (FIVE YEARS: 2)

2022 ◽  
Vol 15 (2) ◽  
pp. 1-33
Author(s):  
Mikhail Asiatici ◽  
Paolo Ienne

Applications such as large-scale sparse linear algebra and graph analytics are challenging to accelerate on FPGAs due to their short, irregular memory accesses, which result in low cache hit rates. Nonblocking caches reduce the bandwidth consumed by misses by requesting each cache line only once, even when multiple misses correspond to it. However, such a reuse mechanism is traditionally implemented with an associative lookup, which limits the number of misses considered for reuse to a few tens at most. In this article, we present an efficient pipeline that can process and store thousands of outstanding misses in cuckoo hash tables in on-chip SRAM with minimal stalls. Because outstanding misses do not need a data array, this brings the same bandwidth advantage as a larger cache for a fraction of the area budget, and it can significantly speed up irregular, memory-bound, latency-insensitive applications. In addition, we extend nonblocking caches to generate variable-length bursts to memory, which increases the bandwidth delivered by DRAMs and their controllers. The resulting miss-optimized memory system provides up to 25% speedup with 24× area reduction on 15 large sparse matrix-vector multiplication benchmarks evaluated on an embedded and a datacenter FPGA system.
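
As a rough software analogue of the idea, the sketch below tracks outstanding misses in a two-way cuckoo hash table keyed by cache-line address; a hit means the line is already in flight, so the new request can piggyback on the pending memory access. All names, sizes, and hash constants here are illustrative assumptions, not details from the paper.

```cpp
// Minimal sketch: outstanding-miss tracking in a two-way cuckoo hash table,
// a software analogue of the paper's SRAM-based design (details hypothetical).
#include <cstdint>
#include <utility>
#include <vector>

struct MissEntry {
    uint64_t line_addr;   // cache-line address (the key)
    uint32_t request_id;  // head of the list of requests waiting on this line
    bool     valid = false;
};

class MissTable {
public:
    explicit MissTable(size_t slots) : t0_(slots), t1_(slots) {}

    // True if the line already has an outstanding miss, i.e., the new
    // request can reuse the in-flight memory access instead of issuing one.
    bool lookup(uint64_t line) const {
        const MissEntry& a = t0_[h0(line)];
        const MissEntry& b = t1_[h1(line)];
        return (a.valid && a.line_addr == line) ||
               (b.valid && b.line_addr == line);
    }

    // Cuckoo insertion: displace and relocate on collision, bounded to avoid
    // infinite loops (a real pipeline would stall or spill instead).
    bool insert(uint64_t line, uint32_t req) {
        MissEntry e{line, req, true};
        for (int attempt = 0; attempt < 32; ++attempt) {
            std::swap(e, t0_[h0(e.line_addr)]);
            if (!e.valid) return true;
            std::swap(e, t1_[h1(e.line_addr)]);
            if (!e.valid) return true;
        }
        return false;  // table too full; caller must handle
    }

private:
    size_t h0(uint64_t k) const { return (k * 0x9E3779B97F4A7C15ULL) % t0_.size(); }
    size_t h1(uint64_t k) const { return ((k ^ (k >> 29)) * 0xBF58476D1CE4E5B9ULL) % t1_.size(); }
    std::vector<MissEntry> t0_, t1_;
};
```

Because each entry stores only a tag and the head of a waiting-request list rather than a full cache-line data array, thousands of such entries fit in a modest on-chip SRAM budget.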


Author(s):  
Antonio Fuentes-Alventosa ◽  
Juan Gómez-Luna ◽  
José Maria González-Linares ◽  
Nicolás Guil ◽  
R. Medina-Carnicer

CAVLC (Context-Adaptive Variable Length Coding) is a high-performance entropy method for video and image compression, and the most commonly used entropy method in the H.264 video standard. In recent years, several hardware accelerators for CAVLC have been designed. In contrast, high-performance software implementations of CAVLC (e.g., GPU-based) are scarce. A high-performance GPU-based implementation of CAVLC is desirable in several scenarios. On the one hand, it can serve as the entropy component in GPU-based H.264 encoders, which are a very suitable solution when GPU built-in H.264 hardware encoders lack necessary functionality such as data encryption and information hiding. On the other hand, a GPU-based implementation of CAVLC can be reused in a wide variety of GPU-based compression systems for encoding images and videos in formats other than H.264, such as medical images. This is not possible with hardware implementations of CAVLC, as they are non-separable components of hardware H.264 encoders. In this paper, we present CAVLCU, an efficient implementation of CAVLC on GPU, which is based on four key ideas. First, we use only one kernel, avoiding both the long-latency global memory accesses required to transmit intermediate results among different kernels and the costly launches and terminations of additional kernels. Second, we apply an efficient synchronization mechanism for thread-blocks (in this paper, to prevent confusion, a block of pixels of a frame is referred to simply as a block, and a GPU thread block as a thread-block) that process adjacent frame regions (in the horizontal and vertical dimensions) to share results in global memory space. Third, we fully exploit the available global memory bandwidth by using vectorized loads to move the quantized transform coefficients directly to registers. Fourth, we use register tiling to implement the zigzag ordering, thus obtaining high instruction-level parallelism. An exhaustive experimental evaluation showed that our approach is between 2.5× and 5.4× faster than the only state-of-the-art GPU-based implementation of CAVLC.
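
For context, the sketch below shows the 4×4 zigzag reordering that CAVLC applies to quantized transform coefficients before level/run coding; the scan order is the standard H.264 one. The paper's GPU version keeps the block in registers via register tiling rather than in an array like this.

```cpp
// Plain C++ sketch of the H.264 4x4 zigzag scan that precedes CAVLC coding.
#include <array>
#include <cstdint>

// Standard H.264 zigzag scan order for a 4x4 block (raster-scan indices).
constexpr std::array<int, 16> kZigzag4x4 = {
    0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15
};

// Reorder a raster-scanned 4x4 coefficient block into zigzag order, so that
// the low-frequency coefficients (and trailing zeros) are grouped for coding.
std::array<int16_t, 16> zigzagScan(const std::array<int16_t, 16>& block) {
    std::array<int16_t, 16> out{};
    for (int i = 0; i < 16; ++i)
        out[i] = block[kZigzag4x4[i]];
    return out;
}
```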


2021 ◽  
Vol 20 (5s) ◽  
pp. 1-22
Author(s):  
Wei-Ming Chen ◽  
Tei-Wei Kuo ◽  
Pi-Cheng Hsiu

Intermittent systems enable batteryless devices to operate through energy harvesting by leveraging the complementary characteristics of volatile (VM) and non-volatile memory (NVM). Unfortunately, alternating and frequent accesses to heterogeneous memories for accumulative execution across power cycles can significantly hinder computation progress. The impediment is mainly due to more CPU time being wasted on slow NVM accesses than on fast VM accesses. This paper explores how to leverage heterogeneous cores to mitigate the progress impediment caused by heterogeneous memories. In particular, a delegable and adaptive synchronization protocol is proposed that allows memory accesses to be delegated between cores and dynamically adapts to diverse memory access latencies. Moreover, our design guarantees task serializability across multiple cores and maintains data consistency despite frequent power failures. We integrated our design into FreeRTOS running on a Cypress device featuring heterogeneous dual cores and hybrid memories. Experimental results show that, compared to recent approaches that assume single-core intermittent systems, our design improves computation progress by at least 1.8× and up to 33.9× by leveraging core heterogeneity.
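
To illustrate the delegation idea only (not the paper's protocol, which additionally handles power failures, adaptation, and serializability), the sketch below hands slow NVM writes from one core to another over a single-producer/single-consumer ring buffer; all names and sizes are hypothetical.

```cpp
// Illustrative analogue of delegating NVM accesses between two cores via a
// lock-free SPSC ring buffer. Crash consistency is deliberately not modeled.
#include <array>
#include <atomic>
#include <cstdint>

struct NvmRequest { uintptr_t addr; uint32_t value; };

class DelegationQueue {
public:
    // Producer core: delegate the NVM access instead of stalling on it.
    bool delegate(const NvmRequest& r) {
        size_t h = head_.load(std::memory_order_relaxed);
        size_t n = (h + 1) % kSlots;
        if (n == tail_.load(std::memory_order_acquire)) return false;  // full
        buf_[h] = r;
        head_.store(n, std::memory_order_release);
        return true;
    }

    // Consumer core: drain delegated requests and perform the NVM writes.
    template <typename F>
    void drain(F&& performNvmWrite) {
        size_t t = tail_.load(std::memory_order_relaxed);
        while (t != head_.load(std::memory_order_acquire)) {
            performNvmWrite(buf_[t]);
            t = (t + 1) % kSlots;
            tail_.store(t, std::memory_order_release);
        }
    }

private:
    static constexpr size_t kSlots = 64;
    std::array<NvmRequest, kSlots> buf_{};
    std::atomic<size_t> head_{0}, tail_{0};
};
```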


2021 ◽  
Vol 20 (5s) ◽  
pp. 1-20
Author(s):  
Qingfeng Zhuge ◽  
Hao Zhang ◽  
Edwin Hsing-Mean Sha ◽  
Rui Xu ◽  
Jun Liu ◽  
...  

Efficiently accessing remote file data remains a challenging problem for data processing systems. Developments in non-volatile dual in-line memory modules (NVDIMMs), in-memory file systems, and RDMA networks provide new opportunities for solving the problem of remote data access. The general understanding of NVDIMMs, such as Intel Optane DC Persistent Memory (DCPM), is that they expand main memory capacity at the cost of performance several times lower than DRAM's. With the in-depth exploration presented in this paper, however, we show the interesting finding that the potential of NVDIMMs for high-performance remote in-memory accesses can be unlocked through careful design. We explore multiple architectural structures for accessing remote NVDIMMs in a real system using Optane DCPM and compare the performance of the various structures. Experiments show significant performance gaps among the different ways of exposing NVDIMMs as memory address space accessible through an RDMA interface. Furthermore, we design and implement a prototype of a user-level in-memory file system, RIMFS, in device DAX mode on Optane DCPM. Compared against the DAX-supported Linux file system Ext4-DAX, the performance of remote reads on RIMFS over RDMA is 11.44× higher on average. The experimental results also show that the performance of remote accesses on RIMFS is maintained on a heavily loaded data server with CPU utilization as high as 90%, while the performance of remote reads on Ext4-DAX drops significantly, by 49.3%, and that of local reads on Ext4-DAX drops even more significantly, by 90.1%. The performance comparisons of writes exhibit the same trends.
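
The device-DAX access style that RIMFS builds on can be sketched as below: the persistent-memory device is mapped into the address space and "file reads" become plain loads, with no page cache in between. The device path and sizes are placeholders, error handling is minimal, and the RDMA side is not shown.

```cpp
// Minimal sketch of local device-DAX access to persistent memory (POSIX).
#include <cstdio>
#include <cstring>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main() {
    const char* path = "/dev/dax0.0";  // hypothetical devdax device
    const size_t len = 1UL << 30;      // map 1 GiB of the device

    int fd = open(path, O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    // devdax regions are mapped directly; loads/stores bypass the page cache.
    void* pmem = mmap(nullptr, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (pmem == MAP_FAILED) { perror("mmap"); return 1; }

    char buf[64];
    std::memcpy(buf, pmem, sizeof(buf));  // a "file read" is just a copy

    munmap(pmem, len);
    close(fd);
    return 0;
}
```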


2021 ◽  
Author(s):  
Xin Xin ◽  
Yanan Guo ◽  
Youtao Zhang ◽  
Jun Yang
Keyword(s):  

Cryptography ◽  
2021 ◽  
Vol 5 (4) ◽  
pp. 27
Author(s):  
Mohammad Anagreh ◽  
Peeter Laud ◽  
Eero Vainikko

In this paper, we propose and present secure multiparty computation (SMC) protocols for single-source shortest distance (SSSD) and all-pairs shortest distance (APSD) in sparse and dense graphs. Our protocols follow the structure of classical algorithms: Bellman–Ford and Dijkstra for SSSD; Johnson, Floyd–Warshall, and transitive closure for APSD. As the computational platforms offered by SMC protocol sets have performance profiles that differ from those of typical processors, we had to make extensive changes to the structure of these algorithms (including their control flow and memory accesses) and to their details in order to obtain good performance. We implemented our protocols on top of the secret-sharing-based protocol set offered by the Sharemind SMC platform, using single-instruction-multiple-data (SIMD) operations as much as possible to reduce round complexity. We benchmarked our protocols under several different network performance parameters and compared the performance figures against each other and against those reported previously.
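
One reason such restructuring is needed: SMC executions must not branch on secret data, so data-dependent control flow (e.g., Bellman–Ford's early exit) has to be replaced by fixed iteration counts and oblivious selection. The plaintext analogue below illustrates that shape; in the actual protocols the distances would be secret-shared and the relaxation done as a SIMD oblivious choice over all edges.

```cpp
// Plaintext analogue of an SMC-friendly Bellman-Ford: fixed round count and
// branch-free relaxation, so the operation pattern is independent of weights.
#include <cstdint>
#include <limits>
#include <vector>

struct Edge { int u, v; int64_t w; };

std::vector<int64_t> obliviousBellmanFord(int n, const std::vector<Edge>& es,
                                          int src) {
    const int64_t INF = std::numeric_limits<int64_t>::max() / 4;
    std::vector<int64_t> dist(n, INF);
    dist[src] = 0;
    // Always run exactly n-1 rounds: no data-dependent early termination.
    for (int round = 0; round < n - 1; ++round) {
        for (const Edge& e : es) {
            int64_t cand = dist[e.u] + e.w;
            bool take = cand < dist[e.v];   // in SMC this is a secret bit
            dist[e.v] = take ? cand : dist[e.v];  // oblivious select
        }
    }
    return dist;
}
```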


PLoS ONE ◽  
2021 ◽  
Vol 16 (9) ◽  
pp. e0257047
Author(s):  
Adrián Lamela ◽  
Óscar G. Ossorio ◽  
Guillermo Vinuesa ◽  
Benjamín Sahelices

Non-volatile memory technology is now available in commodity hardware. It can be used as backup memory for an external DRAM cache without modifying the software. However, the higher read and write latencies of non-volatile memory may exacerbate the memory wall problem. In this work we present a novel off-chip prefetch technique based on a Hidden Markov Model that specifically targets the latency problem caused by the complexity of off-chip memory access patterns. First, we present a thorough analysis of off-chip memory access patterns to characterize their complexity in multicore processors. Based on this study, we propose a prefetching module located at the LLC that uses two small tables and whose computational complexity is linear in the number of computing threads. Our Markov-based technique is able to track and cluster several simultaneous groups of memory accesses coming from multiple simultaneous threads in a multicore processor. It can quickly identify complex address groups and trigger prefetches with very high accuracy. Our simulations show an improvement of up to 76% in the hit ratio of an off-chip DRAM cache for a multicore architecture over the conventional prefetch technique (G/DC). Moreover, the overhead of prefetch requests (failed prefetches) is reduced by 48% in single-core simulations and by 83% in multicore simulations.
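
A much-reduced cousin of this idea is the classic first-order Markov predictor sketched below: one small table maps the last observed miss address to its most likely successor. The paper's design goes further (an HMM, per-thread clustering, bounded tables), so treat this only as an illustration of the table-driven prediction step.

```cpp
// Simplified first-order Markov predictor over off-chip access addresses.
#include <cstdint>
#include <optional>
#include <unordered_map>

class MarkovPrefetcher {
public:
    // Called on every observed off-chip access; returns a prefetch candidate.
    std::optional<uint64_t> observe(uint64_t addr) {
        std::optional<uint64_t> prediction;
        auto it = table_.find(addr);
        if (it != table_.end()) prediction = it->second;  // predicted successor
        if (lastValid_) table_[last_] = addr;             // learn: last -> current
        last_ = addr;
        lastValid_ = true;
        return prediction;
    }

private:
    std::unordered_map<uint64_t, uint64_t> table_;  // last address -> next address
    uint64_t last_ = 0;
    bool lastValid_ = false;
};
```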


2021 ◽  
Author(s):  
Youngmok Jung ◽  
Dongsu Han

The growing use of next-generation sequencing and increased sequencing throughput require efficient short-read alignment, in which seeding is one of the major performance bottlenecks. The key challenge in the seeding phase is searching for exact matches of substrings of short reads in the reference DNA sequence. Existing algorithms, however, are limited in performance by their frequent memory accesses. This paper presents BWA-MEME, the first full-fledged short-read alignment software that leverages learned indices to solve the exact-match search problem for efficient seeding. BWA-MEME is a practical and efficient seeding algorithm based on a suffix array search algorithm that overcomes the challenges of utilizing learned indices for SMEM search, which is used extensively in the seeding phase. Our evaluation shows that BWA-MEME achieves up to 3.45× speedup in seeding throughput over BWA-MEM2 by reducing the number of instructions by 4.60×, memory accesses by 8.77×, and LLC misses by 2.21×, while producing SAM output identical to BWA-MEM2's.
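
The learned-index principle involved can be sketched as follows: a model predicts approximately where a query lands in the suffix array, and a bounded local search corrects the prediction, replacing most of the cache-missing steps of a full binary search. The linear model and numeric key below are crude stand-ins; BWA-MEME's actual model and SMEM search differ.

```cpp
// Sketch: learned-index-guided lower-bound search over a suffix array.
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Compare the suffix of `text` starting at `pos` against `query`.
static bool suffixLess(const std::string& text, size_t pos,
                       const std::string& query) {
    return text.compare(pos, query.size(), query) < 0;
}

size_t learnedLowerBound(const std::string& text, const std::vector<size_t>& sa,
                         const std::string& query,
                         double slope, double intercept, size_t maxErr) {
    // 1. A (stand-in) linear model predicts an approximate rank for the query.
    double key = 0;  // crude numeric key from the query's first bytes
    for (int i = 0; i < 8 && i < (int)query.size(); ++i)
        key = key * 256 + (unsigned char)query[i];
    double p = slope * key + intercept;
    if (p < 0) p = 0;
    size_t guess = std::min((size_t)p, sa.size() - 1);

    // 2. Binary search only within the model's error bound around the guess.
    size_t lo = guess > maxErr ? guess - maxErr : 0;
    size_t hi = std::min(sa.size(), guess + maxErr + 1);
    while (lo < hi) {
        size_t mid = (lo + hi) / 2;
        if (suffixLess(text, sa[mid], query)) lo = mid + 1; else hi = mid;
    }
    return lo;  // rank of the first suffix >= query
}
```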


Algorithms ◽  
2021 ◽  
Vol 14 (8) ◽  
pp. 218
Author(s):  
João V. Roque ◽  
João D. Lopes ◽  
Mário P. Véstias ◽  
José T. de Sousa

Open-source processors are increasingly being adopted by industry, which requires open-source implementations of peripherals and other system-on-chip modules. Despite the recent advent of open-source hardware, the available open-source caches have low configurability, lack support for single-cycle pipelined memory accesses, and use non-standard hardware interfaces. In this paper, the IObundle cache (IOb-Cache), a high-performance configurable open-source cache, is proposed, developed, and deployed. The cache has front-end and back-end modules for fast integration with processors and memory controllers. The front-end module supports the native interface, and the back-end module supports both the native interface and the standard Advanced eXtensible Interface (AXI). The cache is highly configurable in structure and access policies. The back-end can be configured to read bursts of multiple words per transfer to take advantage of the available memory bandwidth. To the best of our knowledge, IOb-Cache is currently the only configurable cache that supports a pipelined Central Processing Unit (CPU) interface and an AXI memory bus interface. Additionally, it has a write-through buffer and an independent controller for fast, usually single-cycle, writes together with single-cycle reads, whereas previous works support only single-cycle reads. This allows the best clocks-per-instruction (CPI) to be close to one (1.055). IOb-Cache is integrated into the IOb System-on-Chip (IOb-SoC) GitHub repository, which has 29 stars and is already being used in 50 projects (forks).
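
The write-through-buffer policy can be modeled in software as below: the CPU-side write completes in one "cycle" by enqueueing, while an independent controller drains entries to backing memory. IOb-Cache itself is RTL (Verilog); this C++ model, with hypothetical names and depth, only illustrates why writes rarely stall.

```cpp
// Behavioral model of a write-through buffer with an independent drain path.
#include <cstdint>
#include <queue>

struct Write { uint32_t addr, data; };

class WriteThroughBuffer {
public:
    // Front-end: the CPU-side write "completes" in one cycle by enqueueing;
    // it only stalls when the buffer is full.
    bool write(uint32_t addr, uint32_t data) {
        if (q_.size() >= kDepth) return false;  // would stall the CPU
        q_.push({addr, data});
        return true;
    }

    // Back-end: the independent controller drains one entry per memory beat.
    template <typename Mem>
    void drainOne(Mem&& writeToMemory) {
        if (!q_.empty()) { writeToMemory(q_.front()); q_.pop(); }
    }

private:
    static constexpr size_t kDepth = 16;
    std::queue<Write> q_;
};
```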

