Locality-Based Cache Management and Warp Scheduling for Reducing Cache Contention in GPU

Micromachines, 2021, Vol. 12(10), pp. 1262
Author(s): Juan Fang, Zelin Wei, Huijing Yang

GPGPUs have gradually become a mainstream acceleration component in high-performance computing. The long latency of memory operations is the main bottleneck of GPU performance. In a GPU, threads are grouped into warps for scheduling and execution. The L1 data cache has very limited capacity and is shared by many warps, which causes severe cache contention and pipeline stalls. We propose Locality-Based Cache Management (LCM), combined with Locality-Based Warp Scheduling (LWS), to reduce cache contention and improve GPU performance. Each load instruction can be classified into one of three locality types: streaming locality (data used only once), intra-warp locality (data reused within the same warp), and inter-warp locality (data accessed by different warps). According to the locality of the load instruction, LCM applies cache bypassing to streaming requests to improve cache utilization, extends inter-warp memory request coalescing to make full use of inter-warp locality, and combines with LWS to alleviate cache contention. Together, LCM and LWS effectively improve cache performance and thereby overall GPU performance. In our experimental evaluation, LCM and LWS achieve an average performance improvement of 26% over the baseline GPU.
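A minimal sketch of the classify-then-bypass idea described above. The per-PC reuse table, the field names, and the thresholds are our own illustration (hypothetical), not the paper's hardware design:

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_map>

// Locality classes for a load instruction, as described in the abstract.
enum class Locality { Streaming, IntraWarp, InterWarp };

// Hypothetical per-PC reuse statistics a profiler or hardware table might keep.
struct ReuseStats {
    uint64_t same_warp_reuses;   // hits by the warp that inserted the line
    uint64_t cross_warp_reuses;  // hits by other warps
};

// Classify a load PC: data touched only once is streaming, data reused only by
// the inserting warp is intra-warp, data shared across warps is inter-warp.
Locality classify(const ReuseStats& s) {
    if (s.same_warp_reuses == 0 && s.cross_warp_reuses == 0)
        return Locality::Streaming;
    if (s.cross_warp_reuses > 0)
        return Locality::InterWarp;
    return Locality::IntraWarp;
}

// LCM-style decision: streaming requests bypass the small L1D so they
// do not evict lines that still have intra- or inter-warp reuse.
bool should_bypass_l1(Locality l) { return l == Locality::Streaming; }

int main() {
    std::unordered_map<uint64_t, ReuseStats> table = {
        {0x400, {0, 0}},  // streaming load
        {0x408, {3, 0}},  // intra-warp reuse
        {0x410, {1, 5}},  // inter-warp sharing
    };
    for (const auto& [pc, stats] : table) {
        std::cout << std::hex << "PC 0x" << pc << std::dec
                  << " bypass=" << should_bypass_l1(classify(stats)) << "\n";
    }
}
```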

Micromachines, 2019, Vol. 10(2), pp. 124
Author(s): Ho Shin, Eui-Young Chung

Recently, 3D-stacked dynamic random access memory (DRAM) has become a promising solution for ultra-high-capacity, high-bandwidth memory implementations. However, it suffers from the same long-latency memory wall problem as typical 2D DRAMs. Although various cache management techniques and latency-hiding schemes exist to reduce DRAM access time, in a high-performance system using high-capacity 3D-stacked DRAM it is ultimately essential to reduce the latency of the DRAM itself. To this end, various asymmetric in-DRAM cache structures have recently been proposed; they are especially attractive for high-capacity DRAMs because they can be implemented at low cost in 3D-stacked DRAMs. However, most prior research focuses on the architecture of the in-DRAM cache itself and pays little attention to how it should be managed. In this paper, we propose two new management algorithms for in-DRAM caches to achieve a low-latency, low-power 3D-stacked DRAM device. Through computing-system simulation, we demonstrate an improvement in energy-delay product of up to 67%.
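The energy-delay product (EDP) used as the figure of merit above is simply energy multiplied by execution time. A minimal sketch, with illustrative numbers chosen only to show how a 67% improvement would be computed (they are not taken from the paper):

```cpp
#include <iostream>

// Energy-delay product combines power efficiency and performance into a
// single figure of merit: EDP = energy consumed * execution time.
double edp(double energy_joules, double delay_seconds) {
    return energy_joules * delay_seconds;
}

int main() {
    // Illustrative numbers only: a baseline DRAM access stream versus one
    // served mostly from a faster, lower-power in-DRAM cache.
    double baseline = edp(/*energy=*/2.0, /*delay=*/1.0);   // 2.00
    double cached   = edp(/*energy=*/1.1, /*delay=*/0.6);   // 0.66
    std::cout << "EDP improvement: "
              << 100.0 * (baseline - cached) / baseline << "%\n";  // 67%
}
```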


2021, Vol. 17(2), pp. 1-45
Author(s): Cheng Pan, Xiaolin Wang, Yingwei Luo, Zhenlin Wang

Due to the large data volumes and low latency requirements of modern web services, an in-memory key-value (KV) cache is often an inevitable choice (e.g., Redis and Memcached). The in-memory cache holds hot data, reduces request latency, and alleviates the load on background databases. Inheriting from traditional hardware cache design, many existing KV cache systems still use recency-based cache replacement algorithms, e.g., least recently used (LRU) or its approximations. However, the diversity of miss penalties distinguishes a KV cache from a hardware cache: inadequate consideration of penalty can substantially compromise space utilization and request service time. KV accesses also exhibit locality, which needs to be coordinated with miss penalty to guide cache management. In this article, we first discuss how to enhance an existing cache model, the Average Eviction Time (AET) model, so that it can model a KV cache. We then apply the model to Redis and propose pRedis, Penalty- and Locality-aware Memory Allocation in Redis, which quantitatively synthesizes data locality and miss penalty to guide memory allocation and replacement in Redis. We also explore the diurnal behavior of a KV store and exploit long-term reuse, replacing the original passive eviction mechanism with an automatic dump/load mechanism to smooth the transition between access peaks and valleys. Our evaluation shows that pRedis effectively reduces average and tail access latency with minimal time and space overhead. For both real-world and synthetic workloads, our approach delivers an average of 14.0%∼52.3% latency reduction over a state-of-the-art penalty-aware cache management scheme, Hyperbolic Caching (HC), and shows more quantitatively predictable performance. Moreover, dynamically switching policies between pRedis and HC yields a further 1.1%∼5.5% reduction in average latency.
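To make the penalty/locality trade-off concrete, here is a minimal sketch of a penalty-weighted, hyperbolic-style eviction score. The Item fields, the scoring formula, and the numbers are our own illustration of the idea, not pRedis's actual implementation:

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

// A cached key-value item with the statistics a penalty-aware policy needs.
struct Item {
    const char* key;
    double accesses;      // access count since insertion (locality signal)
    double seconds_in;    // time since insertion
    double miss_penalty;  // cost to recompute/refetch on a miss, e.g. in ms
};

// Hyperbolic-caching-style priority: access rate since insertion.
double hc_score(const Item& it) { return it.accesses / it.seconds_in; }

// Penalty-aware variant: items that are both hot and expensive to miss are
// kept; items that are cheap to refetch become preferred eviction victims.
double penalty_aware_score(const Item& it) {
    return hc_score(it) * it.miss_penalty;
}

int main() {
    std::vector<Item> cache = {
        {"session:42", 90.0, 30.0, 0.2},   // hot, but cheap to refetch
        {"report:7",   12.0, 30.0, 50.0},  // cooler, but very costly miss
    };
    auto victim = std::min_element(cache.begin(), cache.end(),
        [](const Item& a, const Item& b) {
            return penalty_aware_score(a) < penalty_aware_score(b);
        });
    std::cout << "evict: " << victim->key << "\n";  // session:42 despite heat
}
```

Note the reversal: plain HC would evict the cooler report:7 (score 0.4 vs. 3.0), while weighting by miss penalty keeps it (score 20 vs. 0.6), because refetching it is 250 times more expensive.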


Author(s): Sameh Zakhary, Julian Rosser, Peer-Olaf Siebers, Yong Mao, Darren Robinson

Microsimulation is a class of Urban Building Energy Modeling techniques in which energetic interactions between buildings are explicitly resolved. Examples include SUNtool and CitySim+, both of which employ a sophisticated radiosity-based algorithm to solve for radiation exchange. The computational cost of this algorithm grows with the square of the number of surfaces composing an urban scene. To simulate large scenes, on the order of 10,000 to 1,000,000 surfaces, it is desirable to divide the scene and distribute the simulation task. However, this partitioning is not trivial, since the energy-related interactions create uneven inter-dependencies between computing nodes. To this end, we describe in this paper two approaches (K-means and Greedy Community Detection algorithms) for partitioning urban scenes and subsequently performing building energy microsimulation with CitySim+ on a distributed-memory High-Performance Computing cluster. To compare the partitioning techniques, we propose two measures evaluating the extent to which the obtained clusters exploit data locality. We show that Greedy Community Detection performs well in exploiting data locality and reducing inter-dependencies among sub-scenes, but at the expense of higher data-preparation cost and algorithm run-time.
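As one concrete example of such a locality measure: a partition exploits data locality well when most radiation-exchange pairs stay inside a single cluster, since cross-cluster pairs become inter-node communication. The sketch below computes that intra-cluster fraction; the function name, data layout, and numbers are our own illustration, not the paper's actual metrics:

```cpp
#include <iostream>
#include <utility>
#include <vector>

// Fraction of surface-to-surface radiation-exchange pairs whose endpoints
// land in the same cluster. Higher is better for a distributed simulation.
double intra_cluster_fraction(const std::vector<int>& cluster_of,
                              const std::vector<std::pair<int,int>>& exchanges) {
    if (exchanges.empty()) return 1.0;
    int intra = 0;
    for (const auto& [a, b] : exchanges)
        if (cluster_of[a] == cluster_of[b]) ++intra;
    return static_cast<double>(intra) / exchanges.size();
}

int main() {
    // Six surfaces placed in two clusters, with five exchange pairs.
    std::vector<int> cluster_of = {0, 0, 0, 1, 1, 1};
    std::vector<std::pair<int,int>> exchanges = {
        {0, 1}, {1, 2}, {3, 4}, {4, 5}, {2, 3}  // last pair crosses clusters
    };
    std::cout << intra_cluster_fraction(cluster_of, exchanges) << "\n";  // 0.8
}
```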


2000, Vol. 8(3), pp. 143-162
Author(s): Dimitrios S. Nikolopoulos, Theodore S. Papatheodorou, Constantine D. Polychronopoulos, Jesús Labarta, Eduard Ayguadé

This paper makes two important contributions. First, it investigates the performance implications of data placement in OpenMP programs running on modern NUMA multiprocessors. Data locality and minimization of the rate of remote memory accesses are critical for sustaining high performance on these systems. We show that, due to the low remote-to-local memory access latency ratio of contemporary NUMA architectures, reasonably balanced page placement schemes, such as round-robin or random distribution, incur only modest performance losses. Second, the paper presents a transparent, user-level page migration engine that can recover any performance loss stemming from suboptimal placement of pages in iterative OpenMP programs. The main body of the paper describes how our OpenMP runtime environment uses page migration to implement implicit data distribution and redistribution schemes without programmer intervention. Our experimental results verify the effectiveness of the proposed framework and provide a proof of concept that it is not necessary to introduce data distribution directives into OpenMP and thereby compromise the simplicity or portability of the programming model.
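The initial page placement such a runtime starts from typically follows the operating system's first-touch rule: each page is bound to the NUMA node of the thread that first writes it. A minimal OpenMP sketch of the first-touch idiom (a generic NUMA technique, not the paper's migration engine):

```cpp
#include <omp.h>
#include <cstdio>

int main() {
    const long n = 1 << 24;
    double* a = new double[n];  // allocation reserves pages; none touched yet

    // First touch: initializing in parallel with the same static schedule as
    // the compute loop places each page on the node that will later use it.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) a[i] = 0.0;

    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+ : sum)
    for (long i = 0; i < n; ++i) sum += a[i];  // mostly local accesses

    std::printf("sum=%f threads=%d\n", sum, omp_get_max_threads());
    delete[] a;
}
```

When the access pattern changes between iterations and this initial placement becomes suboptimal, a migration engine like the one described above can move pages at run time instead of requiring explicit distribution directives.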

