A SPLIT L2 DATA CACHE FOR SCALABLE CC-NUMA MULTIPROCESSORS

2005 ◽  
Vol 14 (03) ◽  
pp. 605-617 ◽  
Author(s):  
SUNG WOO CHUNG ◽  
HYONG-SHIK KIM ◽  
CHU SHIK JHON

In scalable CC-NUMA multiprocessors, it is crucial to reduce the average memory access time. For applications where the second-level (L2) cache is large enough, we propose a split L2 cache to utilize the surplus space. The split L2 cache is composed of a traditional LRU cache and an RVC (Remote Victim Cache), which stores only data from the remote memory address range. It thus reduces the average L2 miss time by keeping remote blocks that would otherwise be discarded. Though the split cache does not reduce the miss rate, it is observed to reduce the total execution time by up to 27%. It even outperforms an LRU cache of twice the size.
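A minimal sketch of the lookup and eviction flow such a split L2 might follow (the class name, the `is_remote` predicate, and the capacities are illustrative assumptions, not the paper's implementation):

```python
from collections import OrderedDict

class SplitL2:
    """Illustrative split L2: a main LRU cache plus a Remote Victim Cache (RVC)
    that holds only blocks from the remote memory address range."""

    def __init__(self, main_blocks, rvc_blocks, is_remote):
        self.main = OrderedDict()          # block address -> data, kept in LRU order
        self.rvc = OrderedDict()           # remote victims only
        self.main_blocks = main_blocks
        self.rvc_blocks = rvc_blocks
        self.is_remote = is_remote         # predicate: is this address served by a remote node?

    def lookup(self, addr):
        if addr in self.main:              # hit in the conventional LRU part
            self.main.move_to_end(addr)
            return True
        if addr in self.rvc:               # hit on a remote block kept in the RVC
            self.main[addr] = self.rvc.pop(addr)   # promote it back into the main cache
            self._evict_main_if_full()
            return True
        return False                       # L2 miss: fetch from local or remote memory

    def fill(self, addr, data):
        self.main[addr] = data
        self._evict_main_if_full()

    def _evict_main_if_full(self):
        while len(self.main) > self.main_blocks:
            victim, vdata = self.main.popitem(last=False)
            if self.is_remote(victim):     # only remote victims are worth an RVC slot
                self.rvc[victim] = vdata
                if len(self.rvc) > self.rvc_blocks:
                    self.rvc.popitem(last=False)
```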

2019 ◽  
Author(s):  
Arthur Krause ◽  
Francis Moreira ◽  
Valéria Girelli ◽  
Philippe Olivier Navaux

As processors evolve, the performance of computer systems becomes increasingly limited by memory access time. Caches are employed to work around this problem, but intelligent management of the data stored in them is necessary to prevent problems such as pollution and thrashing from degrading their performance. This work presents an analysis of cache pollution and thrashing in high-performance parallel applications. The results show that caches with higher associativity suffer more from these problems. Up to 28% of the misses in the L1 cache could be avoided with a smarter replacement policy, rising to up to 62% in the L2 cache and 98% in the LLC.
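One common way to estimate how many misses a smarter replacement policy could avoid, in the spirit of the analysis above, is to replay a memory trace under both LRU and Belady's optimal (OPT) policy and compare miss counts. The sketch below assumes a fully associative, single-level cache model, which is a simplification of the per-level methodology implied by the abstract:

```python
from collections import OrderedDict

def lru_misses(trace, capacity):
    cache, misses = OrderedDict(), 0
    for block in trace:
        if block in cache:
            cache.move_to_end(block)
        else:
            misses += 1
            cache[block] = None
            if len(cache) > capacity:
                cache.popitem(last=False)        # evict the least recently used block
    return misses

def opt_misses(trace, capacity):
    # Belady's OPT: on a miss, evict the block whose next use is farthest in the future.
    next_use, future = {}, [0] * len(trace)
    for i in range(len(trace) - 1, -1, -1):
        future[i] = next_use.get(trace[i], len(trace))
        next_use[trace[i]] = i
    cache, misses = {}, 0                        # block -> index of its next use
    for i, block in enumerate(trace):
        if block in cache:
            cache[block] = future[i]
            continue
        misses += 1
        if len(cache) >= capacity:
            victim = max(cache, key=cache.get)   # farthest next use
            del cache[victim]
        cache[block] = future[i]
    return misses

trace = [1, 2, 3, 1, 4, 2, 5, 1, 2, 3, 4, 5] * 10
lru, opt = lru_misses(trace, 3), opt_misses(trace, 3)
print(f"avoidable misses: {100 * (lru - opt) / lru:.1f}%")
```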


2019 ◽  
Vol 8 (3) ◽  
pp. 6141-6145

Any processor cache has three parameters: capacity, line size, and associativity. Usually all three are fixed at design time. Algorithms with a variable number of cache sets have been proposed in the literature. This paper proposes a method to vary the number of cache sets logically: the cache is built with a fixed number of sets but is viewed as having any number of logical sets greater than or equal to one. An algorithm for line placement/replacement under this model is proposed. The model is simulated with SPEC2K benchmarks using the SimpleScalar toolkit for a two-level inclusive set-associative cache system. Compared with a traditional set-associative cache of the same size, power savings of 8.4% are observed for an L1 cache of 512x4, 17.58% for 1024x4, and 31.3% for 2048x4. Compared with a model proposed in the literature, power savings of 7.53% are observed for an L1 of 512x4, 7.64% for 1024x4, and 7.645% for 2048x4. The L2 cache size is fixed at 2048x8. The average memory access time (AMAT) degrades relative to a conventional set-associative cache by 19.63% for an L1 of 512x4, and by 24.68% for 1024x4 and 2048x4.
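A minimal sketch of what "logical sets over fixed physical sets" could look like for line placement: a logical set index is derived from the address modulo the chosen number of logical sets, and each logical set is then backed by a group of physical sets whose ways are searched together. The mapping and replacement choice below are illustrative assumptions, not the paper's algorithm:

```python
class LogicalSetCache:
    """Illustrative model: physical sets are fixed in hardware, but lookups index
    the cache through a configurable number of logical sets (>= 1)."""

    def __init__(self, physical_sets, ways, line_size, logical_sets):
        assert physical_sets % logical_sets == 0
        self.ways = ways
        self.line_size = line_size
        self.logical_sets = logical_sets
        self.group = physical_sets // logical_sets   # physical sets backing one logical set
        self.sets = [[] for _ in range(physical_sets)]   # each entry: list of (tag, lru_stamp)

    def access(self, addr, tick):
        block = addr // self.line_size
        logical = block % self.logical_sets          # logical set chosen by the address
        base = logical * self.group
        candidates = range(base, base + self.group)  # physical sets forming this logical set
        for s in candidates:
            for i, (tag, _) in enumerate(self.sets[s]):
                if tag == block:
                    self.sets[s][i] = (tag, tick)    # hit: refresh the LRU stamp
                    return True
        # miss: place the line in the least recently used way across the whole group
        victim_set = min(candidates,
                         key=lambda s: min((t for _, t in self.sets[s]), default=-1))
        ways = self.sets[victim_set]
        if len(ways) >= self.ways:
            ways.remove(min(ways, key=lambda e: e[1]))
        ways.append((block, tick))
        return False
```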


2021 ◽  
Vol 11 (3) ◽  
pp. 991
Author(s):  
Sae-Gyeol Choi ◽  
Jeong-Geun Kim ◽  
Shin-Dug Kim

The emergence of big data processing and machine learning has triggered exponential growth in the working-set sizes of applications. In addition, several modern applications are memory intensive with irregular memory access patterns. Therefore, we propose the concept of adaptive granularities to develop a prefetching methodology for analyzing memory access patterns based on a wider granularity concept that spans both cache-line and page granularity. The proposed prefetching module resides in the last-level cache (LLC) to handle the large working sets of memory-intensive workloads. Additionally, to support memory access streams with variable intervals, we introduce an embedded-DRAM-based LLC prefetch buffer that consists of three granularity-based prefetch engines and an access history table. By adaptively changing the granularity window for analyzing memory streams, the proposed model can swiftly and appropriately determine the stride of memory addresses and extract hidden delta chains from irregular memory access patterns. The proposed model achieves 18% and 15% improvements in energy consumption and execution time compared to the global history buffer and best-offset prefetchers, respectively. In addition, our model reduces total execution time and energy consumption by approximately 6% and 2.3% compared to the Markov prefetcher and the variable-length delta prefetcher.
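A minimal sketch of the underlying idea of analyzing an access stream at more than one granularity: deltas are computed both between cache-line addresses and between page addresses, and whichever granularity exhibits a stable recurring delta drives the prefetch. The table structure, shift amounts, and thresholds below are illustrative assumptions, not the paper's prefetcher:

```python
from collections import defaultdict, deque

LINE_BITS, PAGE_BITS = 6, 12          # 64 B lines, 4 KiB pages (assumed)
HISTORY, MIN_REPEATS = 8, 3

class GranularityEngine:
    """Tracks recent deltas at one granularity and reports a stable stride, if any."""
    def __init__(self, shift):
        self.shift = shift
        self.last = None
        self.deltas = deque(maxlen=HISTORY)

    def observe(self, addr):
        unit = addr >> self.shift
        if self.last is not None:
            self.deltas.append(unit - self.last)
        self.last = unit

    def stable_delta(self):
        counts = defaultdict(int)
        for d in self.deltas:
            if d != 0:
                counts[d] += 1
        best = max(counts, key=counts.get, default=None)
        return best if best is not None and counts[best] >= MIN_REPEATS else None

class AdaptiveGranularityPrefetcher:
    def __init__(self):
        self.engines = {"line": GranularityEngine(LINE_BITS),
                        "page": GranularityEngine(PAGE_BITS)}

    def access(self, addr):
        prefetches = []
        for name, eng in self.engines.items():
            eng.observe(addr)
            d = eng.stable_delta()
            if d is not None:             # prefetch the next unit along the detected delta
                next_unit = (addr >> eng.shift) + d
                prefetches.append((name, next_unit << eng.shift))
        return prefetches
```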


2021 ◽  
Vol 18 (3) ◽  
pp. 1-22
Author(s):  
Michael Stokes ◽  
David Whalley ◽  
Soner Onder

While data filter caches (DFCs) have been shown to be effective at reducing data access energy, they have not been adopted in processors due to the performance penalty caused by high DFC miss rates. In this article, we present a design that both decreases the DFC miss rate and completely eliminates the DFC performance penalty, even for a level-one data cache (L1 DC) with a single-cycle access time. First, we show that a DFC that lazily fills each word in a DFC line from the L1 DC only when the word is referenced is more energy-efficient than eagerly filling the entire DFC line. For a 512B DFC, we are able to eliminate loads of words into the DFC that are never referenced before being evicted, which occurred for about 75% of the words in 32B lines. Second, we demonstrate that a lazily word-filled DFC line can effectively share and pack data words from multiple L1 DC lines to lower the DFC miss rate. For a 512B DFC, we completely avoid accessing the L1 DC for about 23% of loads and avoid a fully associative L1 DC access for 50% of loads, while the DFC requires only about 2.5% of the size of the L1 DC. Finally, we present a method that completely eliminates the DFC performance penalty by speculatively performing DFC tag checks early and accessing DFC data only when a hit is guaranteed. For a 512B DFC, we improve data access energy usage for the DTLB and L1 DC by 33% with no performance degradation.
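A minimal sketch of the lazy word-fill idea: each DFC line keeps a per-word valid bit, and a word is copied from the L1 DC only on the first reference to it rather than filling the whole line on allocation. The names, line size, and eviction choice are illustrative assumptions:

```python
WORDS_PER_LINE = 8      # 32 B line with 4 B words (assumed)

class DFCLine:
    def __init__(self, tag):
        self.tag = tag
        self.valid = [False] * WORDS_PER_LINE   # per-word valid bits
        self.data = [None] * WORDS_PER_LINE

class LazyFillDFC:
    """Illustrative data filter cache that fills words lazily from the backing L1 DC."""
    def __init__(self, num_lines, l1_read_word):
        self.lines = {}                   # tag -> DFCLine (fully associative sketch)
        self.num_lines = num_lines
        self.l1_read_word = l1_read_word  # callback that reads one word from the L1 DC

    def load(self, line_tag, word_idx):
        line = self.lines.get(line_tag)
        if line is None:
            if len(self.lines) >= self.num_lines:
                self.lines.pop(next(iter(self.lines)))   # simple FIFO-style eviction
            line = DFCLine(line_tag)
            self.lines[line_tag] = line   # allocate the line, but copy no words yet
        if not line.valid[word_idx]:
            # lazy fill: fetch only the referenced word from the L1 DC
            line.data[word_idx] = self.l1_read_word(line_tag, word_idx)
            line.valid[word_idx] = True
            return line.data[word_idx], "filled from L1 DC"
        return line.data[word_idx], "DFC hit"
```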


Electronics ◽  
2021 ◽  
Vol 10 (12) ◽  
pp. 1454
Author(s):  
Yoshihiro Sugiura ◽  
Toru Tanzawa

This paper describes how one can reduce the memory access time with pre-emphasis (PE) pulses even in non-volatile random-access memory. Optimum PE pulse widths and the resultant minimum word-line (WL) delay times are investigated as a function of column address. The impact of process variation in the WL time constant, the cell current, and the resistance of the deciding path on the optimum PE pulses is discussed. Optimum PE pulse widths and the resultant minimum WL delay times are modeled with fitting curves as a function of the column address of the accessed memory cell, which gives designers the ability to set the optimum timing for WL and BL (bit-line) operations, reducing the average memory access time.
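A minimal sketch of the kind of fitting the abstract describes: given pairs of (column address, optimum PE pulse width) and (column address, minimum WL delay), a low-order polynomial is fitted so that timing can be looked up for any column. The data points and polynomial order below are invented placeholders, not the paper's measurements:

```python
import numpy as np

# Hypothetical measurements of optimum PE pulse width (ns) and minimum WL delay (ns)
# at a few column addresses along the word line.
columns      = np.array([0, 256, 512, 768, 1023])
opt_pe_width = np.array([1.2, 2.0, 2.9, 3.9, 5.1])      # placeholder values
min_wl_delay = np.array([3.5, 4.1, 4.9, 5.8, 6.9])      # placeholder values

# Fit simple second-order curves as functions of column address.
pe_fit    = np.polyfit(columns, opt_pe_width, 2)
delay_fit = np.polyfit(columns, min_wl_delay, 2)

def timing_for_column(col):
    """Return (PE pulse width, expected WL delay) for the accessed column."""
    return np.polyval(pe_fit, col), np.polyval(delay_fit, col)

pe, delay = timing_for_column(640)
print(f"column 640: drive PE pulse for {pe:.2f} ns, expect WL ready after {delay:.2f} ns")
```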


Algorithms ◽  
2021 ◽  
Vol 14 (6) ◽  
pp. 176
Author(s):  
Wei Zhu ◽  
Xiaoyang Zeng

Applications have different preferences for caches, sometimes even across their different running phases. Caches with fixed parameters may compromise the performance of a system. To solve this problem, we propose a real-time adaptive reconfigurable cache based on the decision tree algorithm, which can optimize the average memory access time of the cache without modifying the cache coherence protocol. By monitoring the application's running state, the cache associativity is periodically tuned to the optimal value determined by the decision tree model. This paper implements the proposed decision tree-based adaptive reconfigurable cache in the GEM5 simulator and designs the key modules using Verilog HDL. The simulation results show that the proposed decision tree-based adaptive reconfigurable cache reduces the average memory access time compared with other adaptive algorithms.
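A minimal sketch of the control loop such a scheme implies: runtime counters are sampled every interval, fed to a small decision tree, and the predicted best associativity is applied for the next interval. The feature names, thresholds, hand-written tree, and cache interface below are illustrative assumptions, not the trained model from the paper:

```python
def decision_tree_predict(miss_rate, conflict_ratio, reuse_distance):
    """Hand-written stand-in for a trained decision tree that picks an associativity."""
    if miss_rate < 0.02:
        return 2                      # few misses: lower associativity saves energy and time
    if conflict_ratio > 0.5:
        return 16 if reuse_distance > 64 else 8
    return 4

class AdaptiveCacheController:
    def __init__(self, cache, interval_accesses=100_000):
        self.cache = cache            # assumed to expose counters and set_associativity()
        self.interval = interval_accesses
        self.accesses = 0

    def on_access(self):
        self.accesses += 1
        if self.accesses % self.interval == 0:
            stats = self.cache.sample_and_reset_counters()   # assumed interface
            ways = decision_tree_predict(stats["miss_rate"],
                                         stats["conflict_ratio"],
                                         stats["avg_reuse_distance"])
            self.cache.set_associativity(ways)               # reconfigure for the next interval
```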


1988 ◽  
Vol 11 (1) ◽  
pp. 1-19
Author(s):  
Andrzej Rowicki

The purpose of the paper is to consider an algorithm for preemptive scheduling on two-processor systems with identical processors. Computations submitted to the system are composed of dependent tasks with arbitrary execution times; the task graphs contain no loops and have only one output. We assume that preemption times are completely unconstrained and that preemptions consume no time. Moreover, the algorithm determines the total execution time of the computation. It has been proved that this algorithm is optimal, that is, that the total execution time of the computation (schedule length) is minimized.
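The setting described (two identical processors, dependent tasks, arbitrary execution times, free preemption) is the one classically handled by level-based scheduling. The sketch below is a discretized highest-level-first heuristic in that spirit, offered as an illustration of the problem rather than the paper's algorithm; task data and the quantum size are placeholders:

```python
# Highest-level-first preemptive scheduling on two processors, discretized into small
# time quanta. "Level" = longest remaining path (in execution time) to the output task.
EPS = 0.25   # scheduling quantum; preemption may occur at any quantum boundary

def levels(tasks, succs):
    # tasks: name -> execution time; succs: name -> list of successor names
    memo = {}
    def level(t):
        if t not in memo:
            memo[t] = tasks[t] + max((level(s) for s in succs.get(t, [])), default=0.0)
        return memo[t]
    return {t: level(t) for t in tasks}

def schedule(tasks, succs):
    preds = {t: set() for t in tasks}
    for t, ss in succs.items():
        for s in ss:
            preds[s].add(t)
    remaining = dict(tasks)
    lvl = levels(tasks, succs)
    time = 0.0
    while any(r > 1e-9 for r in remaining.values()):
        done = {t for t, r in remaining.items() if r <= 1e-9}
        ready = [t for t, r in remaining.items() if r > 1e-9 and preds[t] <= done]
        # run the (at most) two ready tasks with the highest levels for one quantum
        for t in sorted(ready, key=lambda t: lvl[t], reverse=True)[:2]:
            remaining[t] = max(0.0, remaining[t] - EPS)
        time += EPS
    return time   # schedule length produced by this heuristic

tasks = {"a": 2.0, "b": 1.0, "c": 3.0, "d": 2.0}
succs = {"a": ["d"], "b": ["d"], "c": ["d"]}       # single output task "d"
print(schedule(tasks, succs))                      # 5.0 for this example
```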

