Runtime Memory Controller Profiling with Performance Analysis for DRAM Memory Controllers

2018 ◽  
Vol 27 (08) ◽  
pp. 1850126
Author(s):  
Dong-Ik Jeon ◽  
Min-Kyu Lee ◽  
Ji-Chan Kim ◽  
Ki-Seok Chung

The main memory system has become crucial not only because it has to meet an increasing bandwidth requirement, but also because it has to seamlessly support many concurrently executing applications. In order to improve memory performance, a memory controller with efficient arbitration is necessary. It is well known that memory performance depends on the memory access patterns. Offline performance analysis has difficulty analyzing Dynamic Random Access Memory (DRAM) performance accurately because a huge set of trace patterns is needed. This paper proposes a novel profiler that is synthesized with a memory controller in order to monitor and analyze the memory controller performance at runtime. Five key metrics for performance evaluation are defined, and they are monitored and evaluated at runtime by the proposed profiler. A prototype system with a processor core, a memory controller, DRAM modules, and peripheral devices is implemented on a field-programmable gate array (FPGA) board to carry out the experiments. It has been observed that the worst latency overhead differs for each benchmark. In addition, a new overall overhead estimation method is proposed to estimate the memory access latency overhead in time, and this method can be used to evaluate the performance of a given memory arbitration method depending on the running applications.

Electronics ◽  
2021 ◽  
Vol 10 (4) ◽  
pp. 438
Author(s):  
Rongshan Wei ◽  
Chenjia Li ◽  
Chuandong Chen ◽  
Guangyu Sun ◽  
Minghua He

Special accelerator architecture has achieved great success in processor architecture, and it is trending in computer architecture development. However, as the memory access pattern of an accelerator is relatively complicated, the memory access performance is relatively poor, limiting the overall performance improvement of hardware accelerators. Moreover, memory controllers for hardware accelerators have been scarcely researched. We consider that a special accelerator memory controller is essential for improving the memory access performance. To this end, we propose a dynamic random access memory (DRAM) memory controller called NNAMC for neural network accelerators, which monitors the memory access stream of an accelerator and transfers it to the optimal address mapping scheme bank based on the memory access characteristics. NNAMC includes a stream access prediction unit (SAPU) that analyzes the type of data stream accessed by the accelerator via hardware, and designs the address mapping for different banks using a bank partitioning model (BPM). The image mapping method and hardware architecture were analyzed in a practical neural network accelerator. In the experiment, NNAMC achieved significantly lower access latency of the hardware accelerator than the competing address mapping schemes, increased the row buffer hit ratio by 13.68% on average (up to 26.17%), reduced the system access latency by 26.3% on average (up to 37.68%), and lowered the hardware cost. In addition, we also confirmed that NNAMC efficiently adapted to different network parameters.
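To illustrate why the choice of address mapping affects the row buffer hit ratio mentioned in this abstract, the following is a minimal sketch and not the NNAMC design. It assumes a toy open-page DRAM model whose geometry (`COLS_PER_ROW`, `NUM_BANKS`, `ROWS_PER_BANK`), two mapping functions, and access stream are all hypothetical values chosen for illustration. Two sequential streams are interleaved, loosely resembling an accelerator that reads two buffers in alternation.

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical toy geometry (assumption, not taken from the paper).
COLS_PER_ROW = 8
NUM_BANKS = 4
ROWS_PER_BANK = 16

Mapping = Callable[[int], Tuple[int, int, int]]


def map_bank_interleaved(addr: int) -> Tuple[int, int, int]:
    """Row:bank:column layout, so consecutive rows fall in different banks."""
    col = addr % COLS_PER_ROW
    bank = (addr // COLS_PER_ROW) % NUM_BANKS
    row = addr // (COLS_PER_ROW * NUM_BANKS)
    return bank, row, col


def map_bank_contiguous(addr: int) -> Tuple[int, int, int]:
    """Bank:row:column layout, so each bank holds one contiguous region."""
    col = addr % COLS_PER_ROW
    row = (addr // COLS_PER_ROW) % ROWS_PER_BANK
    bank = addr // (COLS_PER_ROW * ROWS_PER_BANK)
    return bank, row, col


def row_buffer_hit_ratio(addresses: Iterable[int], mapping: Mapping) -> float:
    """Fraction of accesses that find their row already open in the bank."""
    open_rows = {}  # bank index -> currently open row (open-page policy)
    hits = 0
    total = 0
    for addr in addresses:
        bank, row, _ = mapping(addr)
        if open_rows.get(bank) == row:
            hits += 1
        else:
            open_rows[bank] = row  # miss or conflict: open the new row
        total += 1
    return hits / total if total else 0.0


def interleave(a: Iterable[int], b: Iterable[int]) -> List[int]:
    """Alternate elements of two equally long streams: a0, b0, a1, b1, ..."""
    out: List[int] = []
    for x, y in zip(a, b):
        out.extend((x, y))
    return out


if __name__ == "__main__":
    stream = interleave(range(0, 32), range(40, 72))
    print("bank-interleaved:", row_buffer_hit_ratio(stream, map_bank_interleaved))
    print("bank-contiguous: ", row_buffer_hit_ratio(stream, map_bank_contiguous))
```

In this toy setup the bank-contiguous mapping places both streams in the same bank but in different rows, so every access closes the row the other stream needs. The bank-interleaved mapping mostly keeps the two streams in different banks. A controller that picks the mapping per stream type, as the abstract describes, targets this kind of gap.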


Author(s):  
Wesley Petersen ◽  
Peter Arbenz

Since first proposed by Gordon Moore (an Intel founder) in 1965, his law [107] that the number of transistors on microprocessors doubles roughly every one to two years has proven remarkably astute. Its corollary, that central processing unit (CPU) performance would also double every two years or so has also remained prescient. Figure 1.1 shows Intel microprocessor data on the number of transistors beginning with the 4004 in 1972. Figure 1.2 indicates that when one includes multi-processor machines and algorithmic development, computer performance is actually better than Moore’s 2-year performance doubling time estimate. Alas, however, in recent years there has developed a disagreeable mismatch between CPU and memory performance: CPUs now outperform memory systems by orders of magnitude according to some reckoning [71]. This is not completely accurate, of course: it is mostly a matter of cost. In the 1980s and 1990s, Cray Research Y-MP series machines had well balanced CPU to memory performance. Likewise, NEC (Nippon Electric Corp.), using CMOS (see glossary, Appendix F) and direct memory access, has well balanced CPU/Memory performance. ECL (see glossary, Appendix F) and CMOS static random access memory (SRAM) systems were and remain expensive and like their CPU counterparts have to be carefully kept cool. Worse, because they have to be cooled, close packing is difficult and such systems tend to have small storage per volume. Almost any personal computer (PC) these days has a much larger memory than supercomputer memory systems of the 1980s or early 1990s. In consequence, nearly all memory systems these days are hierarchical, frequently with multiple levels of cache. Figure 1.3 shows the diverging trends between CPUs and memory performance. Dynamic random access memory (DRAM) in some variety has become standard for bulk memory. There are many projects and ideas about how to close this performance gap, for example, the IRAM [78] and RDRAM projects [85]. 
We are confident that this disparity between CPU and memory access performance will eventually be tightened, but in the meantime, we must deal with the world as it is.


Author(s):  
Sowmya K B ◽  
Gagana P

Memory performance has become the major bottleneck in improving the overall performance of computer systems. A memory controller provides effective control of data transfers between the processor and memory. In this paper, a memory controller for interfacing Synchronous Static Random Access Memory (SSRAM), Synchronous Dynamic Random Access Memory (SDRAM), Read Only Memory (ROM), and FLASH, which is an Electrically Erasable Programmable Read-Only Memory, is designed, and a coverage-driven constrained-random verification environment is built for the designed memory controller. Verification plays an important role in any design flow because it is done before silicon development. It is carried out during product development for quality checking and for fixing bugs in the design.


Electronics ◽  
2021 ◽  
Vol 10 (12) ◽  
pp. 1454
Author(s):  
Yoshihiro Sugiura ◽  
Toru Tanzawa

This paper describes how one can reduce the memory access time with pre-emphasis (PE) pulses even in non-volatile random-access memory. Optimum PE pulse widths and resultant minimum word-line (WL) delay times are investigated as a function of column address. The impact of process variation in the time constant of WL, the cell current, and the resistance of deciding path on optimum PE pulses is discussed. Optimum PE pulse widths and resultant minimum WL delay times are modeled with fitting curves as a function of the column address of the accessed memory cell, which allows designers to set the optimum timing for WL and BL (bit-line) operations and thereby reduce the average memory access time.


2020 ◽  
Vol 10 (3) ◽  
pp. 999
Author(s):  
Hyokyung Bahn ◽  
Kyungwoon Cho

Recently, non-volatile memory (NVM) has advanced as a fast storage medium, and legacy memory subsystems optimized for DRAM (dynamic random access memory) and HDD (hard disk drive) hierarchies need to be revisited. In this article, we explore the memory subsystems that use NVM as an underlying storage device and discuss the challenges and implications of such systems. As storage performance becomes close to DRAM performance, existing memory configurations and I/O (input/output) mechanisms should be reassessed. This article explores the performance of systems with NVM based storage emulated by the RAMDisk under various configurations. Through our measurement study, we make the following findings. (1) We can decrease the main memory size without performance penalties when NVM storage is adopted instead of HDD. (2) For buffer caching to be effective, judicious management techniques like admission control are necessary. (3) Prefetching is not effective in NVM storage. (4) The effect of synchronous I/O and direct I/O in NVM storage is less significant than that in HDD storage. (5) Performance degradation due to the contention of multi-threads is less severe in NVM based storage than in HDD. Based on these observations, we discuss a new PC configuration consisting of small memory and fast storage in comparison with a traditional PC consisting of large memory and slow storage. We show that this new memory-storage configuration can be an alternative solution for ever-growing memory demands and the limited density of DRAM memory. We anticipate that our results will provide directions in system software development in the presence of ever-faster storage devices.


Electronics ◽  
2021 ◽  
Vol 10 (4) ◽  
pp. 469
Author(s):  
Hyun Woo Oh ◽  
Ji Kwang Kim ◽  
Gwan Beom Hwang ◽  
Seung Eun Lee

Recently, advances in technology have enabled embedded systems to be adopted for a variety of applications. Some of these applications require real-time 2D graphics processing under tight design constraints such as low power consumption and a small area. In order to satisfy such conditions, including a dedicated 2D graphics accelerator in the embedded system is an effective method. This method reduces the workload of the processor in the embedded system by exploiting the accelerator. The accelerator helps the system perform 2D graphics processing in real time. Therefore, a variety of applications that require 2D graphics processing can be implemented with an embedded processor. In this paper, we present a 2D graphics accelerator for tiny embedded systems. The accelerator includes an optimized line-drawing operation based on Bresenham’s algorithm. The optimized operation enables the accelerator to deal with various kinds of 2D graphics processing and to perform the line-drawing instead of the system processor. Moreover, the accelerator also reduces the workload of the processor core by removing the need for the core to access the frame buffer memory. We measure the performance of the accelerator by implementing the processor, including the accelerator, on a field-programmable gate array (FPGA), and we assess the feasibility of realization by synthesizing the design in a 180 nm CMOS process.
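For reference, Bresenham's line algorithm, on which the accelerator's line-drawing operation is based, can be sketched in software as follows. This is the common textbook integer-only form that handles all directions, not the optimized hardware operation from the paper, whose details the abstract does not give. The function name `bresenham_line` is an assumption made for this example.

```python
from typing import List, Tuple


def bresenham_line(x0: int, y0: int, x1: int, y1: int) -> List[Tuple[int, int]]:
    """Return the pixels on the line from (x0, y0) to (x1, y1), inclusive.

    Uses only integer additions, comparisons, and a doubling, which is
    why the algorithm suits small hardware implementations.
    """
    points: List[Tuple[int, int]] = []
    dx = abs(x1 - x0)
    dy = -abs(y1 - y0)            # kept negative so one error term serves both axes
    sx = 1 if x0 < x1 else -1     # step direction along x
    sy = 1 if y0 < y1 else -1     # step direction along y
    err = dx + dy                 # combined error term

    while True:
        points.append((x0, y0))
        if x0 == x1 and y0 == y1:
            break
        e2 = 2 * err
        if e2 >= dy:              # error allows a step in x
            err += dy
            x0 += sx
        if e2 <= dx:              # error allows a step in y
            err += dx
            y0 += sy
    return points


if __name__ == "__main__":
    for px, py in bresenham_line(0, 0, 4, 2):
        print(px, py)
```

Each loop iteration emits exactly one pixel, so a line takes `max(|x1 - x0|, |y1 - y0|) + 1` iterations. In a system like the one described, such a loop would write pixels to the frame buffer without involving the processor core.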


2021 ◽  
Vol 21 (8) ◽  
pp. 4216-4222
Author(s):  
Songyi Yoo ◽  
In-Man Kang ◽  
Sung-Jae Cho ◽  
Wookyung Sun ◽  
Hyungsoon Shin

A capacitorless one-transistor dynamic random-access memory cell with a polysilicon body (poly-Si 1T-DRAM) has a cost-effective fabrication process and allows a three-dimensional stacked architecture that increases the integration density of memory cells. Also, since this device uses grain boundaries (GBs) as a storage region, it can be operated as a memory cell even in a thin body device. GBs are important to the memory characteristics of poly-Si 1T-DRAM because the amount of trapped charge in the GBs determines the memory’s data state. In this paper, we report on a statistical analysis of the memory characteristics of poly-Si 1T-DRAM cells according to the number and location of GBs using TCAD simulation. As the number of GBs increases, the sensing margin and retention time of memory cells deteriorate due to increasing trapped electron charge. Also, “0” state current increases and memory performance degrades in cells where all GBs are adjacent to the source or drain junction side in a strong electric field. These results mean that in poly-Si 1T-DRAM design, the number and location of GBs in a channel should be considered for optimal memory performance.

