Virtual-Cache: A cache-line borrowing technique for efficient GPU cache architectures

2021 ◽  
pp. 104301
Author(s):  
Bingchao Li ◽  
Jizeng Wei ◽  
Nam Sung Kim
2018 ◽  
pp. 47-53
Author(s):  
B. Z. Shmeylin ◽  
E. A. Alekseeva

This paper addresses the tasks of managing the directory in coherence-maintenance systems for multiprocessor systems with a large number of processors (MSLP). In such systems the problem of maintaining the coherence of processor caches is significantly complicated, owing to increased traffic on the memory buses and the increased complexity of interprocessor communications, and it has been attacked in various ways. Here we propose the use of Bloom filters, structures that accelerate determining whether an element belongs to a given set. In this article, such filters are used to establish that a processor belongs to some subset of the processors and to determine whether a processor holds a particular cache line. The paper discusses in detail the processes of writing and reading data shared between processors, as well as the replacement of data from private caches; it also shows how cache-line addresses and processor numbers are removed from the Bloom filters. The proposed system significantly speeds up coherence-maintenance operations in MSLP compared with conventional systems. In terms of performance and additional hardware and software costs, it is not inferior to the most efficient of similar systems, and on some applications it significantly exceeds them.
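To make the membership test concrete, below is a minimal C++ sketch of a Bloom filter tracking which processors may hold a cache line. The counting variant is our assumption: the abstract says addresses and processor numbers are removed from the filters, and plain Bloom filters cannot delete entries while counting ones can. All sizes, hash choices, and names (SharerFilter, mayHold) are illustrative, not taken from the paper.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <functional>

// Counting Bloom filter tracking which processors may hold a cache line.
// Counters (rather than single bits) make removal possible, which the
// directory needs when a line is evicted from a private cache. Hardware
// would use small saturating counters; uint8_t keeps the sketch simple.
class SharerFilter {
    static constexpr std::size_t kCells  = 1024; // filter size (illustrative)
    static constexpr int         kHashes = 3;    // hash count (illustrative)
    std::array<std::uint8_t, kCells> counters_{}; // zero-initialized

    // Map a (cache-line address, processor id) pair to the i-th cell index
    // via double hashing; real designs use cheap hardware-friendly hashes.
    std::size_t index(std::uint64_t line, int proc, int i) const {
        std::uint64_t a =
            std::hash<std::uint64_t>{}(line ^ (std::uint64_t(proc) << 48));
        std::uint64_t b =
            std::hash<std::uint64_t>{}(line * 0x9E3779B97F4A7C15ULL + proc);
        return (a + std::uint64_t(i) * (b | 1)) % kCells;
    }

public:
    // Record that `proc` has fetched `line` into its private cache.
    void insert(std::uint64_t line, int proc) {
        for (int i = 0; i < kHashes; ++i) ++counters_[index(line, proc, i)];
    }

    // Remove the pair again, e.g. when `proc` evicts `line`.
    void remove(std::uint64_t line, int proc) {
        for (int i = 0; i < kHashes; ++i) --counters_[index(line, proc, i)];
    }

    // Membership test: false positives only cause spurious invalidation
    // traffic; the absence of false negatives keeps coherence correct.
    bool mayHold(std::uint64_t line, int proc) const {
        for (int i = 0; i < kHashes; ++i)
            if (counters_[index(line, proc, i)] == 0) return false;
        return true;
    }
};
```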


Author(s):  
Rodrigo Machniewicz Sokulski ◽  
Emmanuell Diaz Carreno ◽  
Marco Antonio Zanata Alves

2022 ◽  
Vol 15 (2) ◽  
pp. 1-33
Author(s):  
Mikhail Asiatici ◽  
Paolo Ienne

Applications such as large-scale sparse linear algebra and graph analytics are challenging to accelerate on FPGAs due to their short, irregular memory accesses, which result in low cache hit rates. Nonblocking caches reduce the bandwidth consumed by misses by requesting each cache line only once, even when multiple misses correspond to it. However, such a reuse mechanism is traditionally implemented with an associative lookup, which limits the number of misses considered for reuse to a few tens at most. In this article, we present an efficient pipeline that can process and store thousands of outstanding misses in cuckoo hash tables in on-chip SRAM with minimal stalls. This brings the same bandwidth advantage as a larger cache for a fraction of the area budget, because outstanding misses do not need a data array, and it can significantly speed up irregular, memory-bound, latency-insensitive applications. In addition, we extend nonblocking caches to generate variable-length bursts to memory, which increases the bandwidth delivered by DRAMs and their controllers. The resulting miss-optimized memory system provides up to 25% speedup with a 24× area reduction on 15 large sparse matrix-vector multiplication benchmarks evaluated on an embedded and a datacenter FPGA system.
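To illustrate the miss-reuse idea, here is a minimal C++ sketch of a cuckoo hash table holding outstanding misses keyed by cache-line address: a hit in the table merges the new request into an existing miss (so the line is requested only once), while an insertion that finds both candidate slots taken displaces a resident entry to its alternate slot. This is our own software illustration under stated assumptions; table sizes, hash constants, and names such as MissTable and recordMiss are hypothetical, and the paper's actual design is a hardware pipeline, not this data structure.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// One outstanding miss: the line being fetched plus the IDs of all
// requests that have been merged into it while the fetch is in flight.
struct MissEntry {
    std::uint64_t line_addr = 0;
    bool          valid = false;
    std::vector<std::uint32_t> waiters;
};

class MissTable {
    static constexpr std::size_t kSlots = 4096; // per-table (illustrative)
    std::vector<MissEntry> t0_ = std::vector<MissEntry>(kSlots);
    std::vector<MissEntry> t1_ = std::vector<MissEntry>(kSlots);

    // Two independent hash functions, one per table (constants illustrative).
    static std::size_t h0(std::uint64_t a) { return (a * 0x9E3779B97F4A7C15ULL) % kSlots; }
    static std::size_t h1(std::uint64_t a) { return (a * 0xC2B2AE3D27D4EB4FULL) % kSlots; }

public:
    // Returns true if the miss was merged into an outstanding one (no new
    // memory request needed); false if a new entry was created, in which
    // case the caller issues a single request for this line.
    bool recordMiss(std::uint64_t line, std::uint32_t req_id) {
        for (MissEntry* e : {&t0_[h0(line)], &t1_[h1(line)]}) {
            if (e->valid && e->line_addr == line) {
                e->waiters.push_back(req_id); // reuse: line already requested
                return true;
            }
        }
        // Not outstanding yet: insert with bounded cuckoo displacement.
        MissEntry incoming{line, true, {req_id}};
        std::size_t slot = h0(incoming.line_addr);
        for (int kicks = 0; kicks < 32; ++kicks) {
            std::vector<MissEntry>& table = (kicks % 2 == 0) ? t0_ : t1_;
            std::swap(incoming, table[slot]);
            if (!incoming.valid) return false;   // landed in a free slot
            // Displaced entry moves to its slot in the other table.
            slot = (kicks % 2 == 0) ? h1(incoming.line_addr)
                                    : h0(incoming.line_addr);
        }
        // A real pipeline would stall or buffer on overflow; the sketch
        // simply drops the last displaced entry for brevity.
        return false;
    }

    // On data return from memory, retire the entry and hand back the
    // merged requests so they can all be served from the arriving line.
    std::vector<std::uint32_t> complete(std::uint64_t line) {
        for (MissEntry* e : {&t0_[h0(line)], &t1_[h1(line)]}) {
            if (e->valid && e->line_addr == line) {
                e->valid = false;
                return std::move(e->waiters);
            }
        }
        return {};
    }
};
```

Because each key has exactly two candidate slots, a lookup is two constant-time probes, which is what lets such a table hold thousands of outstanding misses without the associativity limits of a CAM-style MSHR file.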


2019 ◽  
Vol E102.D (12) ◽  
pp. 2441-2450
Author(s):  
Dokeun LEE ◽  
Seongjin LEE ◽  
Youjip WON

Author(s):  
Stefanos Kaxiras ◽  
Zhigang Hu ◽  
Girija Narlikar ◽  
Rae McLellan

2008 ◽  
Vol 32 (7) ◽  
pp. 394-404 ◽  
Author(s):  
Ismail Kadayif ◽  
Ayhan Zorlubas ◽  
Selcuk Koyuncu ◽  
Olcay Kabal ◽  
Davut Akcicek ◽  
...  
