In-DRAM Cache Management for Low Latency and Low Power 3D-Stacked DRAMs

Micromachines ◽  
2019 ◽  
Vol 10 (2) ◽  
pp. 124 ◽  
Author(s):  
Ho Shin ◽  
Eui-Young Chung

Recently, 3D-stacked dynamic random access memory (DRAM) has become a promising solution for ultra-high-capacity and high-bandwidth memory implementations. However, it also suffers from memory wall problems due to long latency, just as typical 2D DRAMs do. Although various cache management techniques and latency hiding schemes exist to reduce DRAM access time, in a high-performance system using high-capacity 3D-stacked DRAM it is ultimately essential to reduce the latency of the DRAM itself. To solve this problem, various asymmetric in-DRAM cache structures have recently been proposed, which are especially attractive for high-capacity DRAMs because they can be implemented at a lower cost in 3D-stacked DRAMs. However, most research focuses mainly on the architecture of the in-DRAM cache itself and pays little attention to proper management methods. In this paper, we propose two new management algorithms for in-DRAM caches to achieve a low-latency and low-power 3D-stacked DRAM device. Through computing system simulation, we demonstrate an improvement in energy delay product of up to 67%.
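The abstract does not detail the two proposed management algorithms, so the following Python sketch only illustrates the general idea behind frequency-based in-DRAM cache management: rows that are accessed repeatedly are promoted into a small low-latency cache region inside the DRAM device. The capacity, promotion threshold, and latency values are illustrative assumptions, not figures from the paper.

```python
# Hypothetical sketch of a frequency-based in-DRAM cache fill policy.
# Thresholds, cache size, and latency figures are illustrative only.

from collections import OrderedDict

FAST_ROWS = 16          # assumed capacity of the low-latency cache region (rows)
PROMOTE_THRESHOLD = 4   # assumed access count that triggers promotion
T_FAST_NS = 15          # illustrative access latency from the in-DRAM cache
T_SLOW_NS = 45          # illustrative access latency from the normal DRAM array


class InDramCache:
    """Tracks hot DRAM rows and serves them from a faster cache region."""

    def __init__(self):
        self.access_count = {}       # row -> accesses observed since last promotion
        self.cache = OrderedDict()   # rows currently in the fast region, in LRU order

    def access(self, row):
        """Return the latency (ns) of one row access and update the policy state."""
        if row in self.cache:
            self.cache.move_to_end(row)          # refresh LRU position
            return T_FAST_NS

        self.access_count[row] = self.access_count.get(row, 0) + 1
        if self.access_count[row] >= PROMOTE_THRESHOLD:
            if len(self.cache) >= FAST_ROWS:
                self.cache.popitem(last=False)   # evict the least recently used row
            self.cache[row] = True
            self.access_count[row] = 0
        return T_SLOW_NS


if __name__ == "__main__":
    cache = InDramCache()
    trace = [3, 7, 3, 3, 3, 3, 9, 3]             # toy access trace of row addresses
    print([cache.access(r) for r in trace])      # repeated row 3 ends up served from the cache
```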

Pipelining overlaps the execution of multiple instructions to improve the utilization and throughput of hardware units. This paper presents the design and implementation of a six-stage pipelined architecture for a high-performance 64-bit Microprocessor without Interlocked Pipeline Stages (MIPS) based Reduced Instruction Set Computing (RISC) processor. In this work, a pre-fetching unit, a forwarding unit, a branch and jump prediction unit, and a hazard unit are combined to reduce hazards, and a low-power unit is used to minimize power consumption. Cache memories, other supporting units, and especially balanced pipeline stages optimize the speed of the design. A DDR4 SDRAM (Double Data Rate type 4 Synchronous Dynamic Random Access Memory) controller is employed in this pipeline to achieve high-speed data transfers and to manage the entire system efficiently. Low-power, low-delay flip-flops are used in the pipeline registers, which further enhances system performance. The proposed method provides better results compared to existing models. The simulation and synthesis results of the proposed architecture are evaluated with Xilinx 14.7 software, and supporting graphs are plotted with the MATLAB tool.
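As a companion to the pipeline description above, the sketch below models the forwarding and load-use hazard decisions of a classic MIPS-style pipeline in Python. The register numbers, mux encodings, and stage-latch names are textbook conventions assumed for illustration, not the paper's RTL.

```python
# Illustrative forwarding-unit and hazard-unit decisions for a MIPS-style pipeline.

def forward_select(ex_rs, ex_rt, exmem_rd, exmem_regwrite, memwb_rd, memwb_regwrite):
    """Return (forward_a, forward_b) mux selects for the two ALU operands.

    0 -> value from the register file, 2 -> forwarded from EX/MEM, 1 -> forwarded from MEM/WB.
    """
    def select(src_reg):
        if exmem_regwrite and exmem_rd != 0 and exmem_rd == src_reg:
            return 2                      # newest value, produced one cycle earlier
        if memwb_regwrite and memwb_rd != 0 and memwb_rd == src_reg:
            return 1                      # value being written back this cycle
        return 0                          # no hazard, read the register file
    return select(ex_rs), select(ex_rt)


def load_use_stall(idex_memread, idex_rt, ifid_rs, ifid_rt):
    """Hazard unit: stall one cycle when a load's result is needed immediately."""
    return idex_memread and idex_rt in (ifid_rs, ifid_rt)


if __name__ == "__main__":
    # add $t2,$t0,$t1 followed by sub $t3,$t2,$t0: $t2 is forwarded from EX/MEM
    print(forward_select(ex_rs=10, ex_rt=8, exmem_rd=10, exmem_regwrite=True,
                         memwb_rd=0, memwb_regwrite=False))
    # lw $t2,0($t0) followed by add $t3,$t2,$t1: one bubble is required
    print(load_use_stall(idex_memread=True, idex_rt=10, ifid_rs=10, ifid_rt=9))
```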


2020 ◽  
Vol 2020 (1) ◽  
pp. 000156-000159
Author(s):  
Dyi-Chung Hu ◽  
James Ho

In the era of AI, 5G, big data, and autonomous driving, these applications all require high-bandwidth, low-latency computing. Traditional electronic packaging structures are classified into many levels, and each level is connected by solder or cables. These many levels of structure cause system performance degradation. Hence, multi-chip packaging structures such as 2.5D, 2.1D, 2.3D, and 2.0D are needed for high-performance computing systems. Currently, 2.5D is the standard HPC structure; however, the cost and size limitations of 2.5D drive users to seek alternative solutions. The 2.0D, 2.1D, and 2.3D structures, which use less solder and fewer TXVs, are emerging as contenders to meet future requirements for large substrate sizes and fine lines. Among them, the 2.0D structure shows great potential. Three 2.0D test vehicles have been built to evaluate fine-pitch assembly, reliability, and structure enhancement. The results show that the 2.0D structure has great potential to become an HPC solution in the near future.


Micromachines ◽  
2021 ◽  
Vol 12 (10) ◽  
pp. 1262
Author(s):  
Juan Fang ◽  
Zelin Wei ◽  
Huijing Yang

GPGPUs have gradually become a mainstream acceleration component in high-performance computing. The long latency of memory operations is the bottleneck of GPU performance. In a GPU, multiple threads are grouped into a warp for scheduling and execution. The L1 data cache has little capacity, and multiple warps share this small cache, which causes heavy cache contention and pipeline stalls. We propose Locality-Based Cache Management (LCM), combined with Locality-Based Warp Scheduling (LWS), to reduce cache contention and improve GPU performance. Each load instruction can be classified into one of three locality types: streaming locality (data used only once), intra-warp locality (data accessed multiple times by the same warp), and inter-warp locality (data accessed by different warps). According to the locality of the load instruction, LCM applies cache bypassing to streaming-locality requests to improve cache utilization, extends inter-warp memory request coalescing to make full use of inter-warp locality, and combines with LWS to alleviate cache contention. LCM and LWS can effectively improve cache performance and thereby overall GPU performance. Through experimental evaluation, our LCM and LWS obtain an average performance improvement of 26% over the baseline GPU.
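A minimal sketch of the locality classification and bypass decision described above, assuming simple per-PC bookkeeping: a load PC is labelled streaming, intra-warp, or inter-warp according to which warps re-touch the cache lines it brought in, and streaming loads bypass the L1. The data structures and labels are illustrative, not the paper's implementation.

```python
# Hypothetical per-PC locality tracking for GPU load instructions.

from collections import defaultdict

line_owners = defaultdict(set)   # cache line -> warp ids that have accessed it
line_pc = {}                     # cache line -> PC of the load that first touched it
pc_locality = defaultdict(lambda: "streaming")   # default label until reuse is observed


def record_access(pc, warp_id, line_addr):
    """Update the locality label of a load from one observed cache-line access."""
    owners = line_owners[line_addr]
    if owners:
        issuing_pc = line_pc[line_addr]
        if warp_id in owners:
            pc_locality[issuing_pc] = "intra-warp"   # reused by the same warp
        else:
            pc_locality[issuing_pc] = "inter-warp"   # reused by a different warp
    else:
        line_pc[line_addr] = pc
    owners.add(warp_id)


def should_bypass(pc):
    """Streaming loads skip the tiny L1 so reusable data keeps its cache space."""
    return pc_locality[pc] == "streaming"


if __name__ == "__main__":
    record_access(pc=0x40, warp_id=0, line_addr=0x1000)   # first touch
    record_access(pc=0x40, warp_id=3, line_addr=0x1000)   # reused by another warp
    record_access(pc=0x48, warp_id=1, line_addr=0x2000)   # touched once only
    print(should_bypass(0x40))   # False: inter-warp locality, keep it in L1
    print(should_bypass(0x48))   # True: still looks streaming, bypass
```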


Electronics ◽  
2020 ◽  
Vol 9 (12) ◽  
pp. 2061
Author(s):  
Seung-Ho Lim ◽  
Hyunchul Seok ◽  
Ki-Woong Park

The key challenges of manycore systems are the large amount of memory and high bandwidth required to run many applications. Three-dimensional integrated on-chip memory is a promising candidate for addressing these challenges. The advent of on-chip memory has provided new opportunities to rethink traditional memory hierarchies and their management. In this study, we propose a polymorphic memory as a hybrid approach to using on-chip memory. In contrast to previous studies, we use the on-chip memory as both a main memory (called M1 memory) and a Dynamic Random Access Memory (DRAM) cache (called M2 cache). The main memory consists of M1 memory and a conventional DRAM memory called M2 memory. To achieve high performance when running many applications on this memory architecture, we propose management techniques for the main memory composed of M1 and M2 memories, and for polymorphic memory with dynamic memory allocation for many applications in a manycore system. The first technique moves frequently accessed pages to M1 memory via hardware monitoring in the memory controller. The second is M1 memory partitioning to mitigate contention among processes. Finally, we propose a method to use the M2 cache between a conventional last-level cache and M2 memory, and we determine the cache size that best improves performance with polymorphic memory. The proposed schemes are evaluated with the SPEC CPU2006 benchmark, and the experimental results show that the proposed approaches improve performance under various workloads of the benchmark. The performance evaluation confirms that the average performance improvement of polymorphic memory is 21.7%, with a standard deviation of 0.026 for the normalized results, compared with the previous method of using on-chip memory as a last-level cache.
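The following is a hedged sketch of the first technique, hot-page promotion: a software model of a memory controller counts per-page accesses over an epoch and migrates the hottest pages from the off-chip M2 memory into the on-chip M1 memory. The epoch length, M1 capacity, and hotness threshold are illustrative assumptions, not values from the study.

```python
# Hypothetical epoch-based hot-page promotion from M2 (off-chip) to M1 (on-chip) memory.

from collections import Counter

M1_PAGES = 4          # assumed number of page frames available in M1 memory
EPOCH_ACCESSES = 16   # assumed monitoring window before migration decisions
HOT_THRESHOLD = 3     # assumed access count that marks a page as hot

counters = Counter()  # per-page access counts within the current epoch
m1_resident = set()   # pages currently placed in M1
accesses_seen = 0


def touch(page):
    """Record one access; at the end of each epoch, promote the hottest pages."""
    global accesses_seen
    counters[page] += 1
    accesses_seen += 1
    if accesses_seen >= EPOCH_ACCESSES:
        migrate()
        counters.clear()
        accesses_seen = 0


def migrate():
    """Fill M1 with the most frequently accessed pages that cleared the threshold."""
    hot = [p for p, n in counters.most_common(M1_PAGES) if n >= HOT_THRESHOLD]
    m1_resident.clear()
    m1_resident.update(hot)


if __name__ == "__main__":
    trace = [1, 1, 1, 2, 7, 1, 2, 2, 9, 1, 2, 5, 1, 2, 7, 1]   # toy page-access trace
    for p in trace:
        touch(p)
    print(sorted(m1_resident))   # pages 1 and 2 dominate the epoch and move to M1
```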

