Memory Access Optimization of a Neural Network Accelerator Based on Memory Controller

Rongshan Wei; Chenjia Li; Chuandong Chen; Guangyu Sun; Minghua He

doi:10.3390/electronics10040438

Electronics ◽

10.3390/electronics10040438 ◽

2021 ◽

Vol 10 (4) ◽

pp. 438

Author(s):

Rongshan Wei ◽

Chenjia Li ◽

Chuandong Chen ◽

Guangyu Sun ◽

Minghua He

Keyword(s):

Neural Network ◽

Computer Architecture ◽

Random Access ◽

Mapping Method ◽

Memory Access ◽

Hardware Accelerators ◽

Great Success ◽

Memory Controller ◽

Access Latency ◽

Address Mapping

Special-purpose accelerator architectures have achieved great success in processor design and are a growing trend in computer architecture. However, because the memory access pattern of an accelerator is relatively complicated, its memory access performance is relatively poor, which limits the overall performance gains of hardware accelerators. Moreover, memory controllers for hardware accelerators have received little research attention. We consider a dedicated accelerator memory controller essential for improving memory access performance. To this end, we propose NNAMC, a dynamic random access memory (DRAM) controller for neural network accelerators, which monitors the accelerator's memory access stream and directs it to the bank whose address mapping scheme best suits the stream's access characteristics. NNAMC includes a stream access prediction unit (SAPU) that analyzes, in hardware, the type of data stream accessed by the accelerator, and it uses a bank partitioning model (BPM) to design the address mapping for different banks. The image mapping method and hardware architecture were analyzed in a practical neural network accelerator. In the experiments, NNAMC gave the hardware accelerator significantly lower access latency than the competing address mapping schemes: it increased the row buffer hit ratio by 13.68% on average (up to 26.17%), reduced the system access latency by 26.3% on average (up to 37.68%), and lowered the hardware cost. We also confirmed that NNAMC adapted efficiently to different network parameters.
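As a rough illustration of why the choice of address mapping affects the row buffer hit ratio mentioned in the abstract above, the following Python sketch replays one access stream under two textbook mappings and counts row buffer hits under an open-page policy. It is not NNAMC's SAPU or BPM. The DRAM geometry, the function names, and the column-walk access pattern are all assumptions chosen for illustration.

```python
from typing import Callable, Dict, Iterable, Iterator, Tuple

# Hypothetical DRAM geometry (illustrative values, not taken from the paper).
NUM_BANKS = 8
COLS_PER_ROW = 1024    # addressable words in one DRAM row
ROWS_PER_BANK = 4096

Mapping = Callable[[int], Tuple[int, int]]  # address -> (bank, row)


def map_row_bank_col(addr: int) -> Tuple[int, int]:
    """Row:bank:column order. Consecutive row-sized blocks rotate across banks."""
    block = addr // COLS_PER_ROW
    return block % NUM_BANKS, block // NUM_BANKS


def map_bank_row_col(addr: int) -> Tuple[int, int]:
    """Bank:row:column order. Each bank holds one contiguous address region."""
    block = addr // COLS_PER_ROW
    return (block // ROWS_PER_BANK) % NUM_BANKS, block % ROWS_PER_BANK


def row_buffer_hit_ratio(stream: Iterable[int], mapping: Mapping) -> float:
    """Fraction of accesses that find their row already open in their bank."""
    open_row: Dict[int, int] = {}  # bank -> currently open row
    hits = total = 0
    for addr in stream:
        bank, row = mapping(addr)
        if open_row.get(bank) == row:
            hits += 1
        open_row[bank] = row       # open-page policy: the row stays open
        total += 1
    return hits / total if total else 0.0


def column_walk(cols: int = 4, rows: int = 8,
                row_stride: int = COLS_PER_ROW) -> Iterator[int]:
    """Walk a 2D array column by column (a strided, non-sequential stream)."""
    for c in range(cols):
        for r in range(rows):
            yield r * row_stride + c


if __name__ == "__main__":
    print(row_buffer_hit_ratio(column_walk(), map_row_bank_col))  # 0.75
    print(row_buffer_hit_ratio(column_walk(), map_bank_row_col))  # 0.0
```

In this toy stream, the row:bank:column mapping spreads the eight array rows over eight banks, so every access after the first pass hits an open row. The bank:row:column mapping places them all in one bank, where each access closes the previously open row. A controller that selects a mapping per stream type, as the abstract describes, is exploiting this kind of difference.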

Design Tradeoff of Internal Memory Size and Memory Access Energy in Deep Neural Network Hardware Accelerators

2018 IEEE 7th Global Conference on Consumer Electronics (GCCE) ◽

10.1109/gcce.2018.8574742 ◽

2018 ◽

Author(s):

Shen-Fu Hsiao ◽

Pei-Hsuen Wu

Keyword(s):

Neural Network ◽

Deep Neural Network ◽

Memory Access ◽

Hardware Accelerators ◽

Internal Memory ◽

Memory Size ◽

Neural Network Hardware

Reducing main memory access latency through SDRAM address mapping techniques and access reordering mechanisms

10.37099/mtu.dc.etds/72 ◽

2006 ◽

Author(s):

Jun Shao

Keyword(s):

Main Memory ◽

Memory Access ◽

Access Latency ◽

Address Mapping ◽

Mapping Techniques

Runtime Memory Controller Profiling with Performance Analysis for DRAM Memory Controllers

Journal of Circuits, Systems and Computers ◽

10.1142/s0218126618501268 ◽

2018 ◽

Vol 27 (08) ◽

pp. 1850126

Author(s):

Dong-Ik Jeon ◽

Min-Kyu Lee ◽

Ji-Chan Kim ◽

Ki-Seok Chung

Keyword(s):

Performance Analysis ◽

Memory Performance ◽

Estimation Method ◽

Random Access ◽

Main Memory ◽

Memory Access ◽

Prototype System ◽

Memory Controller ◽

Processor Core ◽

Field Programmable

The main memory system has become crucial not only because it has to meet an increasing bandwidth requirement, but also because it has to seamlessly support many concurrently executing applications. To improve memory performance, a memory controller with efficient arbitration is necessary. It is well known that memory performance depends on the memory access patterns. Offline performance analysis has difficulty characterizing Dynamic Random Access Memory (DRAM) performance accurately because it needs a huge set of trace patterns. This paper proposes a novel profiler that is synthesized with a memory controller in order to monitor and analyze the memory controller performance at runtime. Five key metrics for performance evaluation are defined, and the proposed profiler monitors and evaluates them at runtime. A prototype system with a processor core, a memory controller, DRAM modules, and peripheral devices is implemented on a field-programmable gate array (FPGA) board to carry out the experiments. The worst latency overhead is observed to differ for each benchmark. In addition, a new overall overhead estimation method is proposed to estimate the memory access latency overhead in time; this method can be used to evaluate the performance of a given memory arbitration method depending on the running applications.

Reducing memory access latency with asymmetric DRAM bank organizations

ACM SIGARCH Computer Architecture News ◽

10.1145/2508148.2485955 ◽

2013 ◽

Vol 41 (3) ◽

pp. 380-391 ◽

Cited By ~ 7

Author(s):

Young Hoon Son ◽

O. Seongil ◽

Yuhwan Ro ◽

Jae W. Lee ◽

Jung Ho Ahn

Keyword(s):

Memory Access ◽

Access Latency

ROMANet: Fine-Grained Reuse-Driven Off-Chip Memory Access Management and Data Organization for Deep Neural Network Accelerators

IEEE Transactions on Very Large Scale Integration (VLSI) Systems ◽

10.1109/tvlsi.2021.3060509 ◽

2021 ◽

pp. 1-14

Author(s):

Rachmad Vidya Wicaksana Putra ◽

Muhammad Abdullah Hanif ◽

Muhammad Shafique

Keyword(s):

Neural Network ◽

Deep Neural Network ◽

Memory Access ◽

Data Organization ◽

Access Management ◽

Fine Grained

Pre-Emphasis Pulse Design for Random-Access Memory

Electronics ◽

10.3390/electronics10121454 ◽

2021 ◽

Vol 10 (12) ◽

pp. 1454

Author(s):

Yoshihiro Sugiura ◽

Toru Tanzawa

Keyword(s):

Time Constant ◽

Random Access ◽

Memory Cell ◽

Random Access Memory ◽

Memory Access ◽

Access Time ◽

Access Memory ◽

Delay Times ◽

Cell Current ◽

The Impact

This paper describes how pre-emphasis (PE) pulses can reduce the memory access time, even in non-volatile random-access memory. Optimum PE pulse widths and the resultant minimum word-line (WL) delay times are investigated as a function of column address. The impact of process variation in the WL time constant, the cell current, and the resistance of deciding path on the optimum PE pulses is discussed. Optimum PE pulse widths and the resultant minimum WL delay times are modeled with fitting curves as a function of the column address of the accessed memory cell, which lets designers set the optimum timing for WL and bit-line (BL) operations and so reduce the average memory access time.

On the Correct Measurement of Application Memory Bandwidth and Memory Access Latency

Proceedings of the International Conference on High Performance Computing in Asia-Pacific Region ◽

10.1145/3368474.3368476 ◽

2020 ◽

Cited By ~ 1

Author(s):

Christian Helm ◽

Kenjiro Taura

Keyword(s):

Memory Access ◽

Memory Bandwidth ◽

Access Latency

Diabetic Retinal Grading Using Attention-Based Bilinear Convolutional Neural Network and Complement Cross Entropy

Entropy ◽

10.3390/e23070816 ◽

2021 ◽

Vol 23 (7) ◽

pp. 816

Author(s):

Pingping Liu ◽

Xiaokang Yang ◽

Baixin Jin ◽

Qiuzhan Zhou

Keyword(s):

Neural Network ◽

Image Processing ◽

Diabetic Retinopathy ◽

Convolutional Neural Network ◽

Image Classification ◽

Network Model ◽

Rapid Development ◽

Image Data ◽

Lesion Detection ◽

Great Success

Diabetic retinopathy (DR) is a common complication of diabetes mellitus (DM), and it is necessary to diagnose DR in its early stages. With the rapid development of convolutional neural networks in image processing, deep learning methods have achieved great success in medical image processing, and various lesion detection systems have been proposed to detect fundus lesions. At present, the image classification process for diabetic retinopathy ignores the fine-grained properties of the diseased image, and most retinopathy image data sets are severely imbalanced, which greatly limits the ability of the network to predict the classification of lesions. We propose a new non-homologous bilinear pooling convolutional neural network model and combine it with the attention mechanism to further improve the network’s ability to extract specific features of the image. The experimental results show that, compared with the most popular fundus image classification models, the proposed network model can greatly improve prediction accuracy while maintaining computational efficiency.

Noncontact thermal mapping method based on local temperature data using deep neural network regression

International Journal of Heat and Mass Transfer ◽

10.1016/j.ijheatmasstransfer.2021.122236 ◽

2021 ◽

pp. 122236

Author(s):

Sanghun Shin ◽

Byeongjo Ko ◽

Hongyun So

Keyword(s):

Neural Network ◽

Deep Neural Network ◽

Mapping Method ◽

Temperature Data ◽

Local Temperature ◽

Thermal Mapping

Static Analysis of Run-Time Inter-Core Interferences for Concurrent Programs in Shared Cache Multicore Architectures

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.198-199.523 ◽

2012 ◽

Vol 198-199 ◽

pp. 523-527

Author(s):

Fang Yuan Chen ◽

Dong Song Zhang ◽

Zhi Ying Wang

Keyword(s):

Multicore Processors ◽

Mapping Method ◽

Concurrent Programs ◽

Multicore Architectures ◽

Shared Resources ◽

Worst Case ◽

Shared Cache ◽

Address Mapping ◽

Novel Approach ◽

Time Systems

Worst-Case Execution Time (WCET) is crucial in real-time systems and is very challenging to estimate on multicore processors because of the possible runtime inter-thread interferences caused by shared resources. This paper proposes a novel approach to analyzing runtime inter-core interferences for consecutive or inconsecutive concurrent programs. Our approach can reasonably estimate runtime inter-core interferences in a shared cache by introducing lifetime analysis and instruction-fetch timing relation analysis into the address mapping method. Compared with the method based on lifetime alone, our proposed approach efficiently improves the tightness of WCET estimation.