cache access
Recently Published Documents


TOTAL DOCUMENTS: 51 (five years: 5)
H-INDEX: 8 (five years: 0)

Author(s):  
Siamak Biglari Ardabili ◽  
Gholamreza Zare Fatin

As the number of streaming multiprocessors (SMs) in GPUs grows to deliver better performance, the reply network faces heavy traffic, causing congestion in Network-on-Chip (NoC) routers and memory controller (MC) buffers. Because cooperative thread arrays (CTAs) are scheduled locally within clusters, there is a high probability of finding a copy of the requested data in another SM's L1 cache in the same cluster. To make this feasible, SMs must be able to access the local L1 caches of neighboring SMs. The NoC suffers considerable congestion due to its characteristic many-to-few-to-many traffic pattern. By reducing the number of requests, our proposed Intra-Cluster Locality-Aware (ICLA) unit turns this congested reply traffic into a many-to-many pattern, and the replied data travels over the less-utilized core-to-core links, mitigating NoC traffic. The proposed architecture has been evaluated using 15 workloads from the CUDA SDK, Rodinia, and ISPASS2009 benchmarks, with the ICLA unit modeled and simulated in GPGPU-Sim. The results show about a 23.79% (up to 49.82%) reduction in average network latency, a 15.49% (up to 36.82%) reduction in average L2 cache accesses, and an 18.18% (up to 58.1%) average improvement in instructions per cycle (IPC).
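As a rough illustration of the lookup order the abstract describes, here is a minimal Python sketch; the class names, cluster structure, and return labels are illustrative assumptions, not the paper's ICLA implementation:

```python
# Minimal sketch of intra-cluster L1 lookup (illustrative; not the
# paper's ICLA hardware). An SM checks its own L1, then its cluster
# neighbours' L1s over core-to-core links, and only then the NoC.

class SM:
    def __init__(self, sm_id):
        self.sm_id = sm_id
        self.l1 = {}  # address -> data, stands in for the SM's L1 cache

class Cluster:
    def __init__(self, sms):
        self.sms = sms  # SMs whose CTAs are scheduled together

    def load(self, requester, addr):
        if addr in requester.l1:                 # 1. local L1 hit
            return requester.l1[addr], "local L1 hit"
        for sm in self.sms:                      # 2. probe neighbour L1s
            if sm is not requester and addr in sm.l1:
                requester.l1[addr] = sm.l1[addr]
                return sm.l1[addr], "intra-cluster hit (core-to-core)"
        data = f"data@{addr:#x}"                 # 3. NoC round trip to the MC
        requester.l1[addr] = data
        return data, "NoC request to memory controller"
```

Steps 1 and 2 never touch the reply network, which is how the scheme converts the many-to-few-to-many pattern into lighter many-to-many traffic.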



2021 ◽  
pp. 2150010
Author(s):  
Shane Carroll ◽  
Wei-Ming Lin

In a CPU cache using least recently used (LRU) replacement, each cache set maintains a buffer that orders all cache lines in the set from LRU to most recently used (MRU). When a cache line is brought into the cache, it is placed at the MRU position and the LRU line is evicted; when a line is re-accessed, it is promoted to the MRU position. LRU replacement provides a simple heuristic for predicting the optimal cache line to evict, but it captures only simple, short-term access patterns. In this paper, we propose a method that uses a buffer called the history queue to record longer-term access-eviction patterns than the LRU buffer can capture. Using this information, we make a simple modification to the LRU insertion policy such that recently recalled blocks have priority over others. As lines are evicted, their addresses are recorded in a FIFO history queue. Incoming lines that were recently evicted and are now recalled (those found in the history queue at recall time) remain at the MRU position for an extended period, because non-recalled lines entering the cache thereafter are placed below the MRU. We show in simulations that the proposed LRU insertion prioritization increases performance in single-threaded and multi-threaded workloads with simple adjustments to baseline LRU.
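A minimal Python sketch of one cache set under this insertion policy; the 8-way set size and 16-entry history queue are our assumptions, not values from the paper:

```python
from collections import OrderedDict, deque

class HistoryLRUSet:
    """One cache set: LRU replacement plus a FIFO history queue of
    recently evicted addresses (sketch, not the authors' code)."""

    def __init__(self, ways=8, history_len=16):
        self.ways = ways
        self.lines = OrderedDict()         # front = LRU, back = MRU
        self.history = deque(maxlen=history_len)

    def access(self, addr):
        if addr in self.lines:             # hit: promote to MRU
            self.lines.move_to_end(addr)
            return True
        if len(self.lines) >= self.ways:   # miss: evict LRU, remember it
            victim, _ = self.lines.popitem(last=False)
            self.history.append(victim)
        self.lines[addr] = None            # insert at MRU for now
        if addr not in self.history and len(self.lines) >= 2:
            # Non-recalled fill: demote it below the current MRU so a
            # recently recalled line keeps the MRU position longer.
            keys = list(self.lines)
            self.lines.move_to_end(keys[-2])
        return False
```

Recalled lines thus enter at the MRU, while ordinary fills slot in one position below it, which is the extended-residency effect the abstract describes.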



2021 ◽  
Author(s):  
Lubna Badri Mohammed ◽  
Alagan Anpalagan ◽  
Muhammad Jaseemuddin

Future wireless networks pose research challenges as smart devices multiply and mobile data traffic grows exponentially. The advent of computation-heavy, real-time applications causes a huge expansion in traffic volume. The need to bring data closer to users and to offload traffic from the macrocell base station (MBS) motivates caches at the edge of the network. Storing the most popular files at the edge of mobile edge networks (MENs), in user terminal (UT) and small base station (SBS) caches, is a promising answer to the challenges facing data-rich wireless networks. Caching at the mobile UT allows requested content to be obtained directly from nearby UTs' caches through device-to-device (D2D) communication.

In this survey article, solutions for mobile edge computing and caching challenges in terms of energy and latency are presented, along with comparisons between different caching techniques in MENs. The survey also illustrates research on cache development for wireless networks that applies intelligent and learning techniques (ILTs) in specific design domains. We summarize the challenges facing the design of caching systems in MENs. Finally, some future research directions are discussed for the development of cache placement, cache access, and delivery in MENs.
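As a toy example of popularity-based placement with a D2D fallback, here is a short Python sketch; the Zipf popularity model, parameter values, and function names are our assumptions, not taken from the survey:

```python
# Sketch: cache the most popular files at the edge, then serve requests
# from the local UT cache, a nearby UT over D2D, or the base station.

def zipf_popularity(n_files, alpha=0.8):
    """Zipf-like request probabilities, a common assumption in caching work."""
    weights = [1 / (rank ** alpha) for rank in range(1, n_files + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def place_top_k(popularity, cache_size):
    """Place the cache_size most popular file indices in a cache."""
    ranked = sorted(range(len(popularity)), key=lambda f: -popularity[f])
    return set(ranked[:cache_size])

def serve(request, own_cache, neighbour_caches):
    if request in own_cache:
        return "local UT cache"
    for cache in neighbour_caches:        # D2D: ask nearby user terminals
        if request in cache:
            return "D2D from nearby UT"
    return "fetched from SBS/MBS"         # fall back to the base station

pop = zipf_popularity(100)
ut = place_top_k(pop, 5)
print(serve(3, ut, [place_top_k(pop, 10)]))  # -> "local UT cache"
```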



2020 ◽  
Vol 34 (23) ◽  
pp. 2050242
Author(s):  
Yao Wang ◽  
Lijun Sun ◽  
Haibo Wang ◽  
Lavanya Gopalakrishnan ◽  
Ronald Eaton

Cache sharing is critical in multi-core and multi-threaded systems, but it can delay the execution of real-time applications and makes predicting their worst-case execution time (WCET) more challenging. Prioritized caches have been demonstrated as a promising approach to this challenge. Instead of conventional prioritized cache schemes realized at the architecture level with cache controllers, this work presents two prioritized least recently used (LRU) cache replacement circuits that accomplish the prioritization directly inside the cache circuits, thereby significantly reducing cache access latency. The performance, hardware, and power overheads of the proposed prioritized LRU circuits are investigated in a 65 nm CMOS technology, and the results show very low overhead compared to conventional cache circuits. The presented techniques enable more effective prioritized shared-cache implementations and benefit the development of high-performance real-time systems.
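The replacement policy itself (as opposed to the paper's circuit-level realization) can be stated compactly; this Python sketch assumes the victim is the oldest line among those at the lowest priority level present in the set, which is one common way to define prioritized LRU:

```python
# Sketch of a prioritized LRU victim-selection policy. The paper
# implements this in circuitry; this models only the policy decision.

def choose_victim(lines):
    """lines: list of (address, priority, lru_age) tuples,
    where larger lru_age means older. Returns the address to evict."""
    lowest = min(priority for _, priority, _ in lines)
    candidates = [(addr, age) for addr, prio, age in lines if prio == lowest]
    return max(candidates, key=lambda entry: entry[1])[0]

# Example: the real-time task's line (priority 1) survives, and the
# oldest low-priority line (priority 0) is evicted.
print(choose_victim([("a", 1, 5), ("b", 0, 3), ("c", 0, 7)]))  # -> "c"
```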



2019 ◽  
Vol 29 (2) ◽  
pp. 17-27
Author(s):  
D. V. Znamenskiy ◽  
V. N. Kutsevol

The increasing complexity of modern microprocessors, combined with the slowdown of semiconductor technology progress, makes further performance gains more difficult. Under these circumstances, estimating the performance of prospective microprocessors through cycle-accurate simulation before they are produced in silicon is of growing importance. This paper presents an approach to implementing a cycle-accurate simulator of the core memory subsystem for the Elbrus architecture, driven by the existing functional simulator of that architecture. A method for validating the cycle-accurate simulator by comparison against simulation of the RTL description of the prospective microprocessor is considered. Data on the simulator's speed and the main optimization methods used to achieve acceptable performance are presented. Preliminary estimates, obtained with the cycle-accurate simulator, of the performance impact of several changes in the prospective processor core, including cache access latency and hardware support for virtualization, are given. These assessments are important for making architectural decisions when designing prospective Elbrus-architecture processors.



IEEE Access ◽  
2018 ◽  
Vol 6 ◽  
pp. 42984-42996 ◽  
Author(s):  
Zicong Wang ◽  
Xiaowen Chen ◽  
Zhonghai Lu ◽  
Yang Guo
Keyword(s):  
3D Mesh ◽  


Author(s):  
◽  

Data and instructions that are used regularly are kept in the cache so that they can be retrieved quickly, improving performance. When evaluating the execution of multi-core systems, the cache memory plays a very important role. A multi-core processor is a single circuit on which two or more processor cores are integrated to enhance performance and execute multiple tasks. This paper describes cache memory performance in terms of cache access time, miss rate, and miss penalty. Cache mapping methods are designed to increase cache performance but face many difficulties, and several methods and algorithms are used to mitigate them. The paper presents a study of recent competing processors to evaluate their cache memory performance.
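These three quantities combine in the standard average memory access time (AMAT) formula, AMAT = hit time + miss rate × miss penalty, illustrated below with numbers that are purely illustrative, not taken from the paper:

```python
# Average memory access time (AMAT): the standard way access time,
# miss rate, and miss penalty combine into a single metric.

def amat(hit_time_cycles, miss_rate, miss_penalty_cycles):
    return hit_time_cycles + miss_rate * miss_penalty_cycles

# Example (illustrative): a 2-cycle L1 with a 5% miss rate and a
# 100-cycle miss penalty averages 7 cycles per access.
print(amat(2, 0.05, 100))  # -> 7.0
```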


