Locality-Based Cache Management and Warp Scheduling for Reducing Cache Contention in GPU

Micromachines, 2021, Vol. 12(10), pp. 1262
Author(s): Juan Fang, Zelin Wei, Huijing Yang

GPGPUs have gradually become a mainstream acceleration component in high-performance computing. The long latency of memory operations is the main bottleneck of GPU performance. In a GPU, threads are grouped into warps for scheduling and execution. The L1 data cache has very limited capacity and is shared by many warps, which causes severe cache contention and pipeline stalls. We propose Locality-Based Cache Management (LCM), combined with Locality-Based Warp Scheduling (LWS), to reduce cache contention and improve GPU performance. Each load instruction can be classified into one of three locality types: streaming locality (data used only once), intra-warp locality (data reused within the same warp), and inter-warp locality (data accessed by different warps). According to the locality of the load instruction, LCM applies cache bypassing to streaming requests to improve cache utilization, extends inter-warp memory request coalescing to make full use of inter-warp locality, and combines with LWS to alleviate cache contention. Together, LCM and LWS effectively improve cache performance and thereby overall GPU performance. In our experimental evaluation, LCM and LWS achieve an average performance improvement of 26% over the baseline GPU.
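A minimal sketch of the classify-then-bypass idea described above. The per-PC reuse table, the field names, and the thresholds are our own illustration (hypothetical), not the paper's hardware design:

```cpp
#include <cstdint>
#include <iostream>
#include <unordered_map>

// Locality classes for a load instruction, as described in the abstract.
enum class Locality { Streaming, IntraWarp, InterWarp };

// Hypothetical per-PC reuse statistics a profiler or hardware table might keep.
struct ReuseStats {
    uint64_t same_warp_reuses;   // hits by the warp that inserted the line
    uint64_t cross_warp_reuses;  // hits by other warps
};

// Classify a load PC: data touched only once is streaming, data reused only by
// the inserting warp is intra-warp, data shared across warps is inter-warp.
Locality classify(const ReuseStats& s) {
    if (s.same_warp_reuses == 0 && s.cross_warp_reuses == 0)
        return Locality::Streaming;
    if (s.cross_warp_reuses > 0)
        return Locality::InterWarp;
    return Locality::IntraWarp;
}

// LCM-style decision: streaming requests bypass the small L1D so they
// do not evict lines that still have intra- or inter-warp reuse.
bool should_bypass_l1(Locality l) { return l == Locality::Streaming; }

int main() {
    std::unordered_map<uint64_t, ReuseStats> table = {
        {0x400, {0, 0}},  // streaming load
        {0x408, {3, 0}},  // intra-warp reuse
        {0x410, {1, 5}},  // inter-warp sharing
    };
    for (const auto& [pc, stats] : table) {
        std::cout << std::hex << "PC 0x" << pc << std::dec
                  << " bypass=" << should_bypass_l1(classify(stats)) << "\n";
    }
}
```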

Micromachines, 2019, Vol. 10(2), pp. 124
Author(s): Ho Shin, Eui-Young Chung

Recently, 3D-stacked dynamic random access memory (DRAM) has become a promising solution for ultra-high-capacity, high-bandwidth memory implementations. However, it suffers from the same long-latency memory wall problem as typical 2D DRAMs. Although various cache management techniques and latency-hiding schemes exist to reduce DRAM access time, in a high-performance system using high-capacity 3D-stacked DRAM it is ultimately essential to reduce the latency of the DRAM itself. To this end, various asymmetric in-DRAM cache structures have recently been proposed; they are especially attractive for high-capacity DRAMs because they can be implemented at low cost in 3D-stacked DRAMs. However, most prior research focuses on the architecture of the in-DRAM cache itself and pays little attention to how it should be managed. In this paper, we propose two new management algorithms for in-DRAM caches to achieve a low-latency, low-power 3D-stacked DRAM device. Through computing-system simulation, we demonstrate an improvement in energy-delay product of up to 67%.
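The energy-delay product (EDP) used as the figure of merit above is simply energy multiplied by execution time. A minimal sketch, with illustrative numbers chosen only to show how a 67% improvement would be computed (they are not taken from the paper):

```cpp
#include <iostream>

// Energy-delay product combines power efficiency and performance into a
// single figure of merit: EDP = energy consumed * execution time.
double edp(double energy_joules, double delay_seconds) {
    return energy_joules * delay_seconds;
}

int main() {
    // Illustrative numbers only: a baseline DRAM access stream versus one
    // served mostly from a faster, lower-power in-DRAM cache.
    double baseline = edp(/*energy=*/2.0, /*delay=*/1.0);   // 2.00
    double cached   = edp(/*energy=*/1.1, /*delay=*/0.6);   // 0.66
    std::cout << "EDP improvement: "
              << 100.0 * (baseline - cached) / baseline << "%\n";  // 67%
}
```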


2021, Vol. 17(2), pp. 1-45
Author(s): Cheng Pan, Xiaolin Wang, Yingwei Luo, Zhenlin Wang

Due to the large data volumes and low latency requirements of modern web services, an in-memory key-value (KV) cache is often an inevitable choice (e.g., Redis and Memcached). The in-memory cache holds hot data, reduces request latency, and alleviates the load on background databases. Inheriting from traditional hardware cache design, many existing KV cache systems still use recency-based cache replacement algorithms, e.g., least recently used (LRU) or its approximations. However, the diversity of miss penalties distinguishes a KV cache from a hardware cache: inadequate consideration of penalty can substantially compromise space utilization and request service time. KV accesses also exhibit locality, which needs to be coordinated with miss penalty to guide cache management. In this article, we first discuss how to enhance an existing cache model, the Average Eviction Time (AET) model, so that it can model a KV cache. We then apply the model to Redis and propose pRedis, Penalty- and Locality-aware Memory Allocation in Redis, which quantitatively synthesizes data locality and miss penalty to guide memory allocation and replacement in Redis. We also explore the diurnal behavior of a KV store and exploit long-term reuse, replacing the original passive eviction mechanism with an automatic dump/load mechanism to smooth the transition between access peaks and valleys. Our evaluation shows that pRedis effectively reduces average and tail access latency with minimal time and space overhead. For both real-world and synthetic workloads, our approach delivers an average of 14.0%∼52.3% latency reduction over a state-of-the-art penalty-aware cache management scheme, Hyperbolic Caching (HC), and shows more quantitatively predictable performance. Moreover, dynamically switching policies between pRedis and HC yields a further 1.1%∼5.5% reduction in average latency.
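To make the penalty/locality trade-off concrete, here is a minimal sketch of a penalty-weighted, hyperbolic-style eviction score. The Item fields, the scoring formula, and the numbers are our own illustration of the idea, not pRedis's actual implementation:

```cpp
#include <algorithm>
#include <iostream>
#include <vector>

// A cached key-value item with the statistics a penalty-aware policy needs.
struct Item {
    const char* key;
    double accesses;      // access count since insertion (locality signal)
    double seconds_in;    // time since insertion
    double miss_penalty;  // cost to recompute/refetch on a miss, e.g. in ms
};

// Hyperbolic-caching-style priority: access rate since insertion.
double hc_score(const Item& it) { return it.accesses / it.seconds_in; }

// Penalty-aware variant: items that are both hot and expensive to miss are
// kept; items that are cheap to refetch become preferred eviction victims.
double penalty_aware_score(const Item& it) {
    return hc_score(it) * it.miss_penalty;
}

int main() {
    std::vector<Item> cache = {
        {"session:42", 90.0, 30.0, 0.2},   // hot, but cheap to refetch
        {"report:7",   12.0, 30.0, 50.0},  // cooler, but very costly miss
    };
    auto victim = std::min_element(cache.begin(), cache.end(),
        [](const Item& a, const Item& b) {
            return penalty_aware_score(a) < penalty_aware_score(b);
        });
    std::cout << "evict: " << victim->key << "\n";  // session:42 despite heat
}
```

Note the reversal: plain HC would evict the cooler report:7 (score 0.4 vs. 3.0), while weighting by miss penalty keeps it (score 20 vs. 0.6), because refetching it is 250 times more expensive.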


Author(s): Sameh Zakhary, Julian Rosser, Peer-Olaf Siebers, Yong Mao, Darren Robinson

Microsimulation is a class of Urban Building Energy Modeling techniques in which energetic interactions between buildings are explicitly resolved. Examples include SUNtool and CitySim+, both of which employ a sophisticated radiosity-based algorithm to solve for radiation exchange. The computational cost of this algorithm grows with the square of the number of surfaces composing an urban scene. To simulate large scenes, on the order of 10,000 to 1,000,000 surfaces, it is desirable to divide the scene and distribute the simulation task. However, this partitioning is not trivial, since the energy-related interactions create uneven inter-dependencies between computing nodes. To this end, we describe in this paper two approaches (K-means and Greedy Community Detection algorithms) for partitioning urban scenes and subsequently performing building energy microsimulation with CitySim+ on a distributed-memory High-Performance Computing cluster. To compare the partitioning techniques, we propose two measures evaluating the extent to which the obtained clusters exploit data locality. We show that Greedy Community Detection performs well in exploiting data locality and reducing inter-dependencies among sub-scenes, but at the expense of higher data-preparation cost and algorithm run-time.
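As one concrete example of such a locality measure: a partition exploits data locality well when most radiation-exchange pairs stay inside a single cluster, since cross-cluster pairs become inter-node communication. The sketch below computes that intra-cluster fraction; the function name, data layout, and numbers are our own illustration, not the paper's actual metrics:

```cpp
#include <iostream>
#include <utility>
#include <vector>

// Fraction of surface-to-surface radiation-exchange pairs whose endpoints
// land in the same cluster. Higher is better for a distributed simulation.
double intra_cluster_fraction(const std::vector<int>& cluster_of,
                              const std::vector<std::pair<int,int>>& exchanges) {
    if (exchanges.empty()) return 1.0;
    int intra = 0;
    for (const auto& [a, b] : exchanges)
        if (cluster_of[a] == cluster_of[b]) ++intra;
    return static_cast<double>(intra) / exchanges.size();
}

int main() {
    // Six surfaces placed in two clusters, with five exchange pairs.
    std::vector<int> cluster_of = {0, 0, 0, 1, 1, 1};
    std::vector<std::pair<int,int>> exchanges = {
        {0, 1}, {1, 2}, {3, 4}, {4, 5}, {2, 3}  // last pair crosses clusters
    };
    std::cout << intra_cluster_fraction(cluster_of, exchanges) << "\n";  // 0.8
}
```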


2000, Vol. 8(3), pp. 143-162
Author(s): Dimitrios S. Nikolopoulos, Theodore S. Papatheodorou, Constantine D. Polychronopoulos, Jesús Labarta, Eduard Ayguadé

This paper makes two important contributions. First, it investigates the performance implications of data placement in OpenMP programs running on modern NUMA multiprocessors. Data locality and minimization of the rate of remote memory accesses are critical for sustaining high performance on these systems. We show that, due to the low remote-to-local memory access latency ratio of contemporary NUMA architectures, reasonably balanced page placement schemes, such as round-robin or random distribution, incur only modest performance losses. Second, the paper presents a transparent, user-level page migration engine that can recover any performance loss stemming from suboptimal placement of pages in iterative OpenMP programs. The main body of the paper describes how our OpenMP runtime environment uses page migration to implement implicit data distribution and redistribution schemes without programmer intervention. Our experimental results verify the effectiveness of the proposed framework and provide a proof of concept that it is not necessary to introduce data distribution directives into OpenMP and thereby compromise the simplicity or portability of the programming model.
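The initial page placement such a runtime starts from typically follows the operating system's first-touch rule: each page is bound to the NUMA node of the thread that first writes it. A minimal OpenMP sketch of the first-touch idiom (a generic NUMA technique, not the paper's migration engine):

```cpp
#include <omp.h>
#include <cstdio>

int main() {
    const long n = 1 << 24;
    double* a = new double[n];  // allocation reserves pages; none touched yet

    // First touch: initializing in parallel with the same static schedule as
    // the compute loop places each page on the node that will later use it.
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < n; ++i) a[i] = 0.0;

    double sum = 0.0;
    #pragma omp parallel for schedule(static) reduction(+ : sum)
    for (long i = 0; i < n; ++i) sum += a[i];  // mostly local accesses

    std::printf("sum=%f threads=%d\n", sum, omp_get_max_threads());
    delete[] a;
}
```

When the access pattern changes between iterations and this initial placement becomes suboptimal, a migration engine like the one described above can move pages at run time instead of requiring explicit distribution directives.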

