In-DRAM Cache Management for Low Latency and Low Power 3D-Stacked DRAMs

Micromachines ◽  
2019 ◽  
Vol 10 (2) ◽  
pp. 124 ◽  
Author(s):  
Ho Shin ◽  
Eui-Young Chung

Recently, 3D-stacked dynamic random access memory (DRAM) has become a promising solution for ultra-high-capacity and high-bandwidth memory implementations. However, it also suffers from memory wall problems due to long latency, just as typical 2D DRAMs do. Although various cache management techniques and latency hiding schemes exist to reduce DRAM access time, in a high-performance system using high-capacity 3D-stacked DRAM it is ultimately essential to reduce the latency of the DRAM itself. To solve this problem, various asymmetric in-DRAM cache structures have recently been proposed, which are especially attractive for high-capacity DRAMs because they can be implemented at a lower cost in 3D-stacked DRAMs. However, most research focuses mainly on the architecture of the in-DRAM cache itself and pays little attention to proper management methods. In this paper, we propose two new management algorithms for in-DRAM caches to achieve a low-latency and low-power 3D-stacked DRAM device. Through computing system simulation, we demonstrate an improvement in energy delay product of up to 67%.
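The abstract does not detail the two proposed management algorithms, so the following Python sketch only illustrates the general idea behind frequency-based in-DRAM cache management: rows that are accessed repeatedly are promoted into a small low-latency cache region inside the DRAM device. The capacity, promotion threshold, and latency values are illustrative assumptions, not figures from the paper.

```python
# Hypothetical sketch of a frequency-based in-DRAM cache fill policy.
# Thresholds, cache size, and latency figures are illustrative only.

from collections import OrderedDict

FAST_ROWS = 16          # assumed capacity of the low-latency cache region (rows)
PROMOTE_THRESHOLD = 4   # assumed access count that triggers promotion
T_FAST_NS = 15          # illustrative access latency from the in-DRAM cache
T_SLOW_NS = 45          # illustrative access latency from the normal DRAM array


class InDramCache:
    """Tracks hot DRAM rows and serves them from a faster cache region."""

    def __init__(self):
        self.access_count = {}       # row -> accesses observed since last promotion
        self.cache = OrderedDict()   # rows currently in the fast region, in LRU order

    def access(self, row):
        """Return the latency (ns) of one row access and update the policy state."""
        if row in self.cache:
            self.cache.move_to_end(row)          # refresh LRU position
            return T_FAST_NS

        self.access_count[row] = self.access_count.get(row, 0) + 1
        if self.access_count[row] >= PROMOTE_THRESHOLD:
            if len(self.cache) >= FAST_ROWS:
                self.cache.popitem(last=False)   # evict the least recently used row
            self.cache[row] = True
            self.access_count[row] = 0
        return T_SLOW_NS


if __name__ == "__main__":
    cache = InDramCache()
    trace = [3, 7, 3, 3, 3, 3, 9, 3]             # toy access trace of row addresses
    print([cache.access(r) for r in trace])      # repeated row 3 ends up served from the cache
```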

Pipelining overlaps the execution of multiple instructions to improve the utilization and throughput of hardware units. This paper presents the design and implementation of a six-stage pipelined architecture for a high-performance 64-bit Microprocessor without Interlocked Pipeline Stages (MIPS) based Reduced Instruction Set Computing (RISC) processor. In this work, a pre-fetching unit, a forwarding unit, a branch and jump prediction unit, and a hazard unit are combined to reduce hazards, and a low-power unit is used to minimize power consumption. Cache memories, other supporting units, and especially balanced pipeline stages optimize the speed of the design. A DDR4 SDRAM (Double Data Rate type 4 Synchronous Dynamic Random Access Memory) controller is employed in this pipeline to achieve high-speed data transfers and to manage the entire system efficiently. Low-power, low-delay flip-flops are used in the pipeline registers, which further enhances system performance. The proposed method provides better results compared to existing models. The simulation and synthesis results of the proposed architecture are evaluated with Xilinx 14.7 software, and supporting graphs are plotted with the MATLAB tool.
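As a companion to the pipeline description above, the sketch below models the forwarding and load-use hazard decisions of a classic MIPS-style pipeline in Python. The register numbers, mux encodings, and stage-latch names are textbook conventions assumed for illustration, not the paper's RTL.

```python
# Illustrative forwarding-unit and hazard-unit decisions for a MIPS-style pipeline.

def forward_select(ex_rs, ex_rt, exmem_rd, exmem_regwrite, memwb_rd, memwb_regwrite):
    """Return (forward_a, forward_b) mux selects for the two ALU operands.

    0 -> value from the register file, 2 -> forwarded from EX/MEM, 1 -> forwarded from MEM/WB.
    """
    def select(src_reg):
        if exmem_regwrite and exmem_rd != 0 and exmem_rd == src_reg:
            return 2                      # newest value, produced one cycle earlier
        if memwb_regwrite and memwb_rd != 0 and memwb_rd == src_reg:
            return 1                      # value being written back this cycle
        return 0                          # no hazard, read the register file
    return select(ex_rs), select(ex_rt)


def load_use_stall(idex_memread, idex_rt, ifid_rs, ifid_rt):
    """Hazard unit: stall one cycle when a load's result is needed immediately."""
    return idex_memread and idex_rt in (ifid_rs, ifid_rt)


if __name__ == "__main__":
    # add $t2,$t0,$t1 followed by sub $t3,$t2,$t0: $t2 is forwarded from EX/MEM
    print(forward_select(ex_rs=10, ex_rt=8, exmem_rd=10, exmem_regwrite=True,
                         memwb_rd=0, memwb_regwrite=False))
    # lw $t2,0($t0) followed by add $t3,$t2,$t1: one bubble is required
    print(load_use_stall(idex_memread=True, idex_rt=10, ifid_rs=10, ifid_rt=9))
```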


2020 ◽  
Vol 2020 (1) ◽  
pp. 000156-000159
Author(s):  
Dyi-Chung Hu ◽  
James Ho

In the era of AI, 5G, big data, and autonomous driving, these applications all require high-bandwidth, low-latency computing. Traditional electronic packaging structures are classified into many levels, and each level is connected by solder or cables. These many levels of structure cause system performance degradation. Hence, multi-chip packaging structures such as 2.5D, 2.1D, 2.3D, and 2.0D are needed for high-performance computing systems. Currently, 2.5D is the standard HPC structure; however, the cost and size limitations of 2.5D drive users to seek alternative solutions. The 2.0D, 2.1D, and 2.3D structures, which use less solder and fewer TXVs, are emerging as contenders to meet future requirements for large substrate sizes and fine lines. Among them, the 2.0D structure shows great potential. Three 2.0D test vehicles have been built to evaluate fine-pitch assembly, reliability, and structure enhancement. The results show that the 2.0D structure has great potential to become an HPC solution in the near future.


Micromachines ◽  
2021 ◽  
Vol 12 (10) ◽  
pp. 1262
Author(s):  
Juan Fang ◽  
Zelin Wei ◽  
Huijing Yang

GPGPUs have gradually become a mainstream acceleration component in high-performance computing. The long latency of memory operations is the bottleneck of GPU performance. In a GPU, multiple threads are grouped into a warp for scheduling and execution. The L1 data cache has little capacity, and multiple warps share this small cache, which causes heavy cache contention and pipeline stalls. We propose Locality-Based Cache Management (LCM), combined with Locality-Based Warp Scheduling (LWS), to reduce cache contention and improve GPU performance. Each load instruction can be classified into one of three locality types: streaming locality (data used only once), intra-warp locality (data accessed multiple times by the same warp), and inter-warp locality (data accessed by different warps). According to the locality of the load instruction, LCM applies cache bypassing to streaming-locality requests to improve cache utilization, extends inter-warp memory request coalescing to make full use of inter-warp locality, and combines with LWS to alleviate cache contention. LCM and LWS can effectively improve cache performance and thereby overall GPU performance. Through experimental evaluation, our LCM and LWS obtain an average performance improvement of 26% over the baseline GPU.
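A minimal sketch of the locality classification and bypass decision described above, assuming simple per-PC bookkeeping: a load PC is labelled streaming, intra-warp, or inter-warp according to which warps re-touch the cache lines it brought in, and streaming loads bypass the L1. The data structures and labels are illustrative, not the paper's implementation.

```python
# Hypothetical per-PC locality tracking for GPU load instructions.

from collections import defaultdict

line_owners = defaultdict(set)   # cache line -> warp ids that have accessed it
line_pc = {}                     # cache line -> PC of the load that first touched it
pc_locality = defaultdict(lambda: "streaming")   # default label until reuse is observed


def record_access(pc, warp_id, line_addr):
    """Update the locality label of a load from one observed cache-line access."""
    owners = line_owners[line_addr]
    if owners:
        issuing_pc = line_pc[line_addr]
        if warp_id in owners:
            pc_locality[issuing_pc] = "intra-warp"   # reused by the same warp
        else:
            pc_locality[issuing_pc] = "inter-warp"   # reused by a different warp
    else:
        line_pc[line_addr] = pc
    owners.add(warp_id)


def should_bypass(pc):
    """Streaming loads skip the tiny L1 so reusable data keeps its cache space."""
    return pc_locality[pc] == "streaming"


if __name__ == "__main__":
    record_access(pc=0x40, warp_id=0, line_addr=0x1000)   # first touch
    record_access(pc=0x40, warp_id=3, line_addr=0x1000)   # reused by another warp
    record_access(pc=0x48, warp_id=1, line_addr=0x2000)   # touched once only
    print(should_bypass(0x40))   # False: inter-warp locality, keep it in L1
    print(should_bypass(0x48))   # True: still looks streaming, bypass
```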


Electronics ◽  
2020 ◽  
Vol 9 (12) ◽  
pp. 2061
Author(s):  
Seung-Ho Lim ◽  
Hyunchul Seok ◽  
Ki-Woong Park

The key challenges of manycore systems are the large amount of memory and high bandwidth required to run many applications. Three-dimensional integrated on-chip memory is a promising candidate for addressing these challenges. The advent of on-chip memory has provided new opportunities to rethink traditional memory hierarchies and their management. In this study, we propose a polymorphic memory as a hybrid approach to using on-chip memory. In contrast to previous studies, we use the on-chip memory as both a main memory (called M1 memory) and a Dynamic Random Access Memory (DRAM) cache (called M2 cache). The main memory consists of M1 memory and a conventional DRAM memory called M2 memory. To achieve high performance when running many applications on this memory architecture, we propose management techniques for the main memory composed of M1 and M2 memories, and for polymorphic memory with dynamic memory allocation for many applications in a manycore system. The first technique moves frequently accessed pages to M1 memory via hardware monitoring in the memory controller. The second is M1 memory partitioning to mitigate contention among processes. Finally, we propose a method to use the M2 cache between a conventional last-level cache and M2 memory, and we determine the cache size that best improves performance with polymorphic memory. The proposed schemes are evaluated with the SPEC CPU2006 benchmark, and the experimental results show that the proposed approaches improve performance under various workloads of the benchmark. The performance evaluation confirms that the average performance improvement of polymorphic memory is 21.7%, with a standard deviation of 0.026 for the normalized results, compared with the previous method of using on-chip memory as a last-level cache.
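The following is a hedged sketch of the first technique, hot-page promotion: a software model of a memory controller counts per-page accesses over an epoch and migrates the hottest pages from the off-chip M2 memory into the on-chip M1 memory. The epoch length, M1 capacity, and hotness threshold are illustrative assumptions, not values from the study.

```python
# Hypothetical epoch-based hot-page promotion from M2 (off-chip) to M1 (on-chip) memory.

from collections import Counter

M1_PAGES = 4          # assumed number of page frames available in M1 memory
EPOCH_ACCESSES = 16   # assumed monitoring window before migration decisions
HOT_THRESHOLD = 3     # assumed access count that marks a page as hot

counters = Counter()  # per-page access counts within the current epoch
m1_resident = set()   # pages currently placed in M1
accesses_seen = 0


def touch(page):
    """Record one access; at the end of each epoch, promote the hottest pages."""
    global accesses_seen
    counters[page] += 1
    accesses_seen += 1
    if accesses_seen >= EPOCH_ACCESSES:
        migrate()
        counters.clear()
        accesses_seen = 0


def migrate():
    """Fill M1 with the most frequently accessed pages that cleared the threshold."""
    hot = [p for p, n in counters.most_common(M1_PAGES) if n >= HOT_THRESHOLD]
    m1_resident.clear()
    m1_resident.update(hot)


if __name__ == "__main__":
    trace = [1, 1, 1, 2, 7, 1, 2, 2, 9, 1, 2, 5, 1, 2, 7, 1]   # toy page-access trace
    for p in trace:
        touch(p)
    print(sorted(m1_resident))   # pages 1 and 2 dominate the epoch and move to M1
```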

