cache capacity
Recently Published Documents

TOTAL DOCUMENTS: 29 (five years: 5)
H-INDEX: 7 (five years: 0)

2022 · Vol 19 (1) · pp. 1-26
Author(s): Aditya Ukarande, Suryakant Patidar, Ram Rangan

The compute work rasterizer, or GigaThread Engine, of a modern NVIDIA GPU focuses on maximizing compute work occupancy across all streaming multiprocessors in a GPU while retaining design simplicity. In this article, we identify the operational aspects of the GigaThread Engine that help it meet those goals but also lead to less-than-ideal cache locality for texture accesses in 2D compute shaders, which are an important optimization target for gaming applications. We develop three software techniques, namely LargeCTAs, Swizzle, and Agents, to show that it is possible to effectively exploit the texture data working-set overlap intrinsic to 2D compute shaders. We evaluate these techniques on gaming applications across two generations of NVIDIA GPUs, the RTX 2080 and RTX 3080, and find that they are effective on both. The bandwidth savings from our software techniques on the RTX 2080 far exceed the savings that baseline execution gains from the inter-generational cache capacity increase from the RTX 2080 to the RTX 3080. Our best-performing technique, Agents, records up to a 4.7% average full-frame speedup by reducing the bandwidth demand of targeted shaders at the L1-L2 and L2-DRAM interfaces by 23% and 32%, respectively, on the latest-generation RTX 3080. These results acutely highlight the sensitivity of cache locality to compute work rasterization order and the importance of locality-aware cooperative thread array (CTA) scheduling for gaming applications.
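The abstract does not spell out the exact Swizzle mapping, so the short Python sketch below only illustrates the general idea behind swizzled CTA scheduling: remap the linear launch order so that consecutively issued CTAs land in compact 2D tiles whose texture footprints overlap in the L1/L2 caches. The grid and tile dimensions (GRID_W, GRID_H, GROUP_W, GROUP_H) and the mapping itself are hypothetical, not the paper's.

```python
# Hedged sketch of a swizzled CTA launch order for a 2D compute shader.
# Row-major order issues one long strip of CTAs; the swizzled order below
# issues them tile-by-tile, so neighbors in launch order are neighbors in 2D.

GRID_W, GRID_H = 8, 8    # CTAs per dispatch row/column (hypothetical)
GROUP_W, GROUP_H = 4, 2  # CTAs grouped into one locality tile (hypothetical)

def swizzle(linear_id: int) -> tuple[int, int]:
    """Map a linear CTA id to (x, y) so ids walk tile-by-tile, not row-by-row."""
    per_tile = GROUP_W * GROUP_H
    tiles_per_row = GRID_W // GROUP_W
    tile, within = divmod(linear_id, per_tile)
    tile_y, tile_x = divmod(tile, tiles_per_row)
    in_y, in_x = divmod(within, GROUP_W)
    return tile_x * GROUP_W + in_x, tile_y * GROUP_H + in_y

# The first eight ids cover a compact 4x2 tile instead of a 1x8 strip.
print([swizzle(i) for i in range(8)])
```

CTAs that execute close together in time then share texture cache lines along both axes, which is the working-set overlap the paper's techniques exploit.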


2022 · Vol 19 (1) · pp. 1-23
Author(s): Yaosheng Fu, Evgeny Bolotin, Niladrish Chatterjee, David Nellans, Stephen W. Keckler

As GPUs scale their low-precision matrix math throughput to boost deep learning (DL) performance, they upset the balance between math throughput and memory system capabilities. We demonstrate that a converged GPU design trying to address the diverging architectural requirements of FP32 (or larger)-based HPC and FP16 (or smaller)-based DL workloads results in sub-optimal configurations for both application domains. We argue that a Composable On-PAckage GPU (COPA-GPU) architecture providing domain-specialized GPU products is the most practical solution to these diverging requirements. A COPA-GPU leverages multi-chip-module disaggregation to support maximal design reuse, along with per-domain memory system specialization. We show how a COPA-GPU enables DL-specialized products by modular augmentation of the baseline GPU architecture with up to 4× higher off-die bandwidth, 32× larger on-package cache, and 2.3× higher DRAM bandwidth and capacity, while conveniently supporting scaled-down HPC-oriented designs. This work explores the microarchitectural design necessary to enable composable GPUs and evaluates the benefits composability can provide to HPC, DL training, and DL inference. We show that, compared to a converged GPU design, a DL-optimized COPA-GPU combining 16× larger cache capacity with 1.6× higher DRAM bandwidth scales per-GPU training and inference performance by 31% and 35%, respectively, and reduces the number of GPU instances by 50% in scale-out training scenarios.
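To make the notion of composability concrete, here is a toy Python model assuming nothing beyond the multipliers quoted in the abstract: one shared baseline die, packaged with different memory-system modules per domain. The structure and names are illustrative, not NVIDIA's design.

```python
# Toy model of COPA-style composability: the same compute die is packaged
# with domain-specific memory-system modules. Multipliers are relative to
# the converged baseline and are taken from the abstract.

from dataclasses import dataclass, replace

@dataclass(frozen=True)
class GPUConfig:
    name: str
    offdie_bw_x: float = 1.0    # off-die bandwidth vs. baseline
    onpkg_cache_x: float = 1.0  # on-package cache capacity vs. baseline
    dram_bw_x: float = 1.0      # DRAM bandwidth and capacity vs. baseline

BASELINE = GPUConfig("converged-baseline")
HPC_COPA = replace(BASELINE, name="COPA-HPC")  # scaled-down memory system
DL_COPA = replace(BASELINE, name="COPA-DL",
                  offdie_bw_x=4.0, onpkg_cache_x=32.0, dram_bw_x=2.3)

for cfg in (BASELINE, HPC_COPA, DL_COPA):
    print(cfg)
```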


2021 · Vol 12 (1) · pp. 344
Author(s): Salman Rashid, Shukor Abd Razak, Fuad A. Ghaleb

In-network caching is an essential part of Content-Centric Networking (CCN). The main aim of a CCN caching module is data distribution within the network: each CCN node can cache content according to its placement policy, which makes CCN well equipped to meet the demands of future networks. The placement strategy decides where to cache content so as to optimize its location and minimize content redundancy within the network. When the cache is full, the content eviction policy decides which content stays in the cache and which is evicted. Hence, network performance and the cache hit ratio depend almost equally on the content placement and replacement policies. Content eviction policies face diverse requirements due to limited cache capacity, high request rates, and rapidly changing cache states. Many replacement policies rely on low or high popularity and data freshness to select content for eviction. However, content that loses its popularity after being very popular for a certain period remains in the cache, while other content is evicted before it ever becomes popular. To handle this issue, we introduce the concept of content maturity/immaturity. The proposed policy, named Immature Used (IMU), computes a content maturity index from the content's arrival time and its request frequency within a specific time frame, and determines the maturity level through a maturity classifier. When the cache is full, the least immature content is evicted. We performed extensive simulations in the Icarus simulator to compare the performance (cache hit ratio, path stretch, latency, and link load) of the proposed policy with well-known cache replacement policies in CCN. The results, obtained with varying popularity and cache sizes, indicate that our policy achieves up to 14.31% more cache hits, 5.91% lower latency, 3.82% better path stretch, and 9.53% lower link load than a recently proposed technique, and performs significantly better than the other baseline approaches.
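A minimal sketch of how an IMU-style eviction decision might look, assuming a placeholder maturity index; the paper's actual index formula and maturity classifier are not reproduced here, and we read "least immature" as the entry with the highest maturity score.

```python
import time

class IMUCache:
    """Hedged sketch of an Immature Used (IMU)-style cache."""

    def __init__(self, capacity: int, window: float = 60.0):
        self.capacity = capacity
        self.window = window   # time frame for counting requests (seconds)
        self.entries = {}      # name -> (arrival_time, request timestamps)

    def _maturity(self, name: str, now: float) -> float:
        arrival, hits = self.entries[name]
        recent = sum(1 for t in hits if now - t <= self.window)
        lifetime = now - arrival
        # Placeholder index, not the paper's: content counts as "mature"
        # once it has lived long and accumulated many requests while its
        # recent demand has faded.
        return lifetime * len(hits) / (recent + 1)

    def request(self, name: str) -> bool:
        """Record a request; on a miss, cache the content, evicting if full."""
        now = time.monotonic()
        if name in self.entries:
            self.entries[name][1].append(now)
            return True   # cache hit
        if len(self.entries) >= self.capacity:
            # Evict the "least immature" (highest-maturity) entry.
            victim = max(self.entries, key=lambda n: self._maturity(n, now))
            del self.entries[victim]
        self.entries[name] = (now, [now])
        return False      # cache miss
```

Under this placeholder index, content that was very popular but has since gone cold accumulates a high maturity score and is evicted first, while newly arrived content is protected long enough to reveal its popularity: the two failure cases the abstract highlights.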


2020 · Vol 2020 · pp. 1-11
Author(s): Chenyu Wu, Shuo Shi, Shushi Gu, Lingyan Zhang, Xuemai Gu

Cache-enabled unmanned aerial vehicles (UAVs) have been envisioned as a promising technology for many applications in future urban wireless communication. However, utilizing UAVs properly is challenging due to their limited endurance and storage capacity as well as the continuous roaming of mobile users. To meet the diversity of urban communication services, it is essential to exploit the mobility and storage resources of UAVs. Toward this end, we consider an urban cache-enabled communication network where UAVs serve mobile users under energy and cache capacity constraints. We formulate an optimization problem to maximize the sum achievable throughput in this system. To solve it, we propose a deep reinforcement learning-based joint content placement and trajectory design algorithm (DRL-JCT) that proceeds in two stages: an offline content placement stage and an online user tracking stage. First, we present a link-based scheme to maximize the cache hit rate of all users' file requests under the cache capacity constraint; the NP-hard problem is solved by approximation and convex optimization. Then, we leverage a Double Deep Q-Network (DDQN) to track mobile users online through their instantaneous two-dimensional coordinates under the energy constraint. Numerical results show that our algorithm converges after a small number of iterations. Compared with several benchmark schemes, it adapts to dynamic conditions and delivers significant gains in sum achievable throughput.
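As a rough illustration of the offline content placement stage, the sketch below greedily fills a UAV's cache budget with the most popular content per unit of size. The paper instead solves the NP-hard problem via approximation and convex optimization; the greedy knapsack heuristic and all demand numbers here are invented stand-ins.

```python
# Hedged sketch: greedy content placement under a cache capacity budget,
# a simple stand-in for the paper's approximation + convex optimization.

def place_contents(popularity: dict[str, float],
                   size: dict[str, int],
                   capacity: int) -> set[str]:
    """Greedily cache files with the best popularity-per-unit-size ratio."""
    cached, used = set(), 0
    for f in sorted(popularity, key=lambda f: popularity[f] / size[f],
                    reverse=True):
        if used + size[f] <= capacity:
            cached.add(f)
            used += size[f]
    return cached

# Hypothetical Zipf-like demand over five files and a 3-unit cache.
pop = {"f1": 0.40, "f2": 0.25, "f3": 0.15, "f4": 0.12, "f5": 0.08}
size = {"f1": 2, "f2": 1, "f3": 1, "f4": 2, "f5": 1}
print(place_contents(pop, size, capacity=3))  # caches f1 and f2
```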


2018 · Vol 2018 · pp. 1-12
Author(s): Jiequ Ji, Kun Zhu, Ran Wang, Bing Chen, Chen Dai

Caching popular contents at base stations (BSs) has been regarded as an effective approach to alleviate the backhaul load and improve quality of service. To meet explosive data traffic demand while saving energy, energy efficiency (EE) has become an extremely important performance index for 5th-generation (5G) cellular networks. In general, there are two ways to improve the EE of caching: improving the cache-hit rate and optimizing the cache size. In this work, we investigate the energy-efficient caching problem in backhaul-aware cellular networks, jointly considering both approaches. Most existing works assume that the content catalog and popularity are static; in practice, however, content popularity is dynamic. To estimate the dynamic content popularity in a timely manner, we propose a method based on the shot noise model (SNM). We then propose a distributed caching policy to improve the cache-hit rate in such a dynamic environment. Furthermore, we analyze the tradeoff between energy efficiency and cache capacity, for which an optimization problem is formulated. We prove its convexity and derive a closed-form optimal cache capacity that maximizes the EE. Simulation results validate the proposed scheme and show that the EE can be improved with an appropriate choice of cache capacity.
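To illustrate the EE-versus-cache-capacity tradeoff, the sketch below assumes a concave caching benefit (diminishing hit-rate returns) and a linearly growing cache power cost, then locates the optimum numerically; the paper's actual model differs and admits a closed-form solution.

```python
# Hedged sketch of the EE-vs-cache-capacity tradeoff: benefit is concave
# in capacity c, power cost is linear, so EE(c) peaks at an interior point.
# All constants and functional forms are illustrative assumptions.

import math

R0, P0, K = 100.0, 10.0, 0.5  # hypothetical rate and power constants

def energy_efficiency(c: float) -> float:
    rate = R0 * math.log(1.0 + c)  # concave benefit of a larger cache
    power = P0 + K * c             # linear energy cost of cache capacity
    return rate / power

# Ternary search over the unimodal EE curve; the paper instead proves
# convexity and derives the optimal capacity in closed form.
lo, hi = 0.0, 1000.0
for _ in range(200):
    m1, m2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
    if energy_efficiency(m1) < energy_efficiency(m2):
        lo = m1
    else:
        hi = m2
print(f"optimal capacity ≈ {lo:.1f}, EE ≈ {energy_efficiency(lo):.2f}")
```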


2015 · Vol 50 · pp. 101-113
Author(s): Dabin Kim, Sung-Won Lee, Young-Bae Ko, Jae-Hoon Kim
