Memory bottleneck
Recently Published Documents

Total documents: 28 (five years: 9)
H-index: 7 (five years: 1)

2022 · Vol. 18 (2) · pp. 1-24
Author(s): Saman Froehlich, Saeideh Shirinzadeh, Rolf Drechsler

Resistive Random Access Memory (ReRAM) is an emerging non-volatile memory technology. Besides its low power consumption and high scalability, its inherent computation capabilities make ReRAM especially interesting for future computer architectures. Merging computation into the memory is a promising approach to overcoming the memory bottleneck. To perform computations in ReRAM, efficient synthesis strategies for Boolean functions have to be developed. In this article, we give a thorough presentation of how to employ the parallel computing capabilities of ReRAM for the synthesis of functions given in state-of-the-art graph-based representations such as AIGs or BDDs. Additionally, we introduce a new graph-based representation called m-And-Inverter Graphs (m-AIGs), which allows us to fully exploit the computing capabilities of ReRAM. Our simulations show that the proposed approaches outperform state-of-the-art synthesis strategies, and that m-AIGs are superior to the standard AIG representation for ReRAM-based synthesis.
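To make the graph representations concrete, here is a minimal sketch of an AIG, the starting point of the synthesis flow described above: a DAG of two-input AND nodes with optional inversion on each edge. The class layout and names are illustrative, not the authors' implementation, and m-AIGs are not reproduced here since the abstract does not define them.

```python
# Minimal And-Inverter Graph (AIG) sketch: every internal node is a
# two-input AND; inversion is an attribute of the incoming edges.
class Node:
    def __init__(self, left=None, right=None, linv=False, rinv=False, var=None):
        self.left, self.right = left, right   # child nodes (None for inputs)
        self.linv, self.rinv = linv, rinv     # edge inversion flags
        self.var = var                        # input variable name, if a leaf

    def eval(self, assignment):
        if self.var is not None:              # primary input
            return assignment[self.var]
        l = self.left.eval(assignment) ^ self.linv
        r = self.right.eval(assignment) ^ self.rinv
        return l and r

# f = (a AND NOT b): one AND node with an inverted right edge.
a, b = Node(var="a"), Node(var="b")
f = Node(left=a, right=b, rinv=True)
assert f.eval({"a": True, "b": False}) is True
assert f.eval({"a": True, "b": True}) is False
```

Synthesis for ReRAM then amounts to traversing such a graph and emitting one in-memory operation (or a parallel batch of them) per node, which is where the representation's depth and node count directly determine cycle count.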


2022 · Vol. 18 (1) · pp. 1-23
Author(s): Jianhui Han, Xiang Fei, Zhaolin Li, Youhui Zhang

Memristor-based processing-in-memory architecture is a promising solution to the memory bottleneck in neural network (NN) processing. A major challenge for the programmability of such architectures is the automatic compilation of high-level NN workloads, spanning diverse operators, onto memristor-based hardware whose programming interfaces may expose different granularities. This article proposes a source-to-source compilation framework for such memristor-based NN accelerators, which can automatically detect and map multiple NN operators based on the flexible and rich representation capability of the polyhedral model. In contrast to previous studies, it supports pipeline generation to exploit the parallelism in NN workloads and thereby leverage hardware resources for higher efficiency. An evaluation on synthetic kernels and NN benchmarks demonstrates that the proposed framework reliably detects and maps the target operators. Case studies on typical memristor-based architectures also show its generality across architectural designs. The evaluation further shows that, compared with existing polyhedral-based compilation frameworks that do not support pipelined execution, performance improves by an order of magnitude with pipelining, which underscores the necessity of this improvement.
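To see why pipeline generation matters, a back-of-the-envelope model (with made-up stage counts and timings, not the paper's measurements) compares sequential and pipelined execution of a multi-stage NN workload:

```python
# Toy cost model of pipelining NN operators across in-memory compute
# units. Stage count, batch count, and per-stage time are illustrative.
def sequential_time(batches, stages, t_stage=1.0):
    # each batch traverses all stages before the next batch starts
    return batches * stages * t_stage

def pipelined_time(batches, stages, t_stage=1.0):
    # classic pipeline latency: fill the pipe once, then one result per step
    return (stages + batches - 1) * t_stage

B, S = 256, 8
print(sequential_time(B, S))   # 2048.0
print(pipelined_time(B, S))    # 263.0 -> ~7.8x speedup at these sizes
```

As the number of batches grows relative to the number of stages, the speedup approaches the stage count, which is consistent with the order-of-magnitude improvement reported above.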


2021
Author(s): Barak Hoffer, Nicolás Wainstein, Christopher M. Neumann, Eric Pop, Eilam Yalon, ...

Stateful logic is a digital processing-in-memory technique that could address the von Neumann memory bottleneck while maintaining backward compatibility with von Neumann architectures. In stateful logic, memory cells themselves perform the logic operations, without reading or moving any data outside the memory array. This has previously been demonstrated with several resistive memory types, but not with commercially available phase-change memory (PCM). Here we present the first implementation of stateful logic using PCM. We experimentally demonstrate four logic gate types (NOR, IMPLY, OR, NIMP) using commonly used PCM materials and crossbar-compatible structures. Our stateful logic gates form a functionally complete set, which enables sequential execution of any logic function within the memory and paves the way to PCM-based digital processing-in-memory systems.
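The in-place gate semantics can be sketched in a few lines. The following toy model uses the standard stateful-logic definitions of IMPLY and an IMPLY-based NOR; it is a bit-level simulation, not the authors' PCM circuit:

```python
from itertools import product

class Array:
    """Toy model of stateful logic: bits live in cells (resistance
    states) and gates update cells in place; nothing is read out."""
    def __init__(self):
        self.c = {}

    def write(self, name, bit):
        self.c[name] = int(bit)

    def imply(self, p, q):
        # IMPLY primitive: q <- (NOT p) OR q, computed in place
        self.c[q] = int((not self.c[p]) or self.c[q])

    def nor(self, a, b, s, out):
        # NOR(a, b) from IMPLY plus initialization, using work cell s
        self.write(s, 0)
        self.imply(a, s)    # s   = NOT a
        self.imply(s, b)    # b   = a OR b   (destructive on b)
        self.write(out, 0)
        self.imply(b, out)  # out = NOT (a OR b) = NOR(a, b)

# exhaustive check against the truth table
for a, b in product((0, 1), repeat=2):
    m = Array()
    m.write("a", a); m.write("b", b)
    m.nor("a", "b", "s", "out")
    assert m.c["out"] == int(not (a or b))
```

Because NOR (or IMPLY plus initialization) is functionally complete, any Boolean function can be executed as a sequence of such in-array steps, which is the claim the abstract makes for the PCM gate set.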


2021 · Vol. 16 (2) · pp. 1-17
Author(s): Paulo Cesar Santos, Francis Birck Moreira, Aline Santana Cordeiro, Sairo Raoní Santos, Tiago Rodrigo Kepe, ...

One of the main challenges for modern processors is data transfer between the processor and memory: such data movement incurs high latency and high energy consumption. In this context, Near-Data Processing (NDP) proposals have started to gain acceptance as accelerator devices that alleviate the memory bottleneck by moving instructions to where the data resides. The first proposals date back to the 1990s, but only in the 2010s did the number of papers addressing NDP grow noticeably, coinciding with the appearance of 3D-stacked chips that combine logic and memory layers. This survey presents a brief history of these accelerators, focusing on the application domains migrated to near-data and on the proposed architectures. We also introduce a new taxonomy that classifies such architectural proposals according to their data distance.


Sensors · 2020 · Vol. 20 (21) · pp. 6229
Author(s): Saike Zhu, Lidan Wang, Zhekang Dong, Shukai Duan

Convolution operations consume much of the time and energy in deep learning algorithms, and convolution is commonly used to remove noise or to extract the edges of an image. Under data-intensive conditions, however, frequent execution of these algorithms places a significant memory/communication burden on the computing system. This paper proposes a circuit based on a spin-memristor crossbar array to address these problems. First, a logic switch based on spin memristors is proposed, which controls the memristor crossbar array. Second, a new type of spin-memristor crossbar array with peripheral circuits is proposed, which realizes the multiply-and-add operations of convolution and significantly alleviates the computational memory bottleneck. Finally, color image filtering and edge extraction are simulated. By calculating the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) of the resulting images, the processing effects of different operators are compared and the correctness of the circuit is verified.
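The multiply-and-add idea can be sketched as follows: kernel weights are stored as crossbar conductances, image patches are applied as voltages, and each column current is a dot product by Ohm's and Kirchhoff's laws. The code below is an illustrative NumPy model of that mapping, not the proposed circuit:

```python
import numpy as np

# Crossbar multiply-accumulate: with weights stored as conductances G
# (siemens) and inputs applied as voltages v (volts), each column
# current is i_j = sum_k v_k * G[k, j] -- a dot product done by physics.
def crossbar_mvm(G, v):
    return v @ G                     # Kirchhoff current summation per column

# A 3x3 kernel applied via the crossbar: unroll each image patch into a
# 9-vector; one crossbar column holds the 9 kernel weights.
def conv2d_via_crossbar(img, kernel):
    kh, kw = kernel.shape
    G = kernel.reshape(-1, 1)        # one column of conductances
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            patch = img[y:y+kh, x:x+kw].reshape(-1)
            out[y, x] = crossbar_mvm(G, patch)[0]
    return out

sobel_x = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], float)  # edge extraction
img = np.random.rand(8, 8)
edges = conv2d_via_crossbar(img, sobel_x)
# sanity check against direct convolution
ref = np.array([[np.sum(img[y:y+3, x:x+3] * sobel_x) for x in range(6)]
                for y in range(6)])
assert np.allclose(edges, ref)
```

In the actual hardware all patches sharing a kernel reuse the same programmed column, so the weights never move: only input voltages and output currents cross the array boundary.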


Author(s): Chen Yang, Jingyu Zhang, Qi Chen, Yi Xu, Cimang Lu

Pedestrian recognition has achieved state-of-the-art performance thanks to the progress of recent convolutional neural networks (CNNs). However, mainstream CNN models are too complicated for emerging Computing-In-Memory (CIM) architectures to implement in hardware, because their enormous parameter counts and massive intermediate processing results can incur a severe memory bottleneck. This paper proposes a design methodology called Parameter Substitution with Nodes Compensation (PSNC) to significantly reduce the parameters of a CNN model without degrading inference accuracy. Based on the PSNC methodology, an ultra-lightweight convolutional neural network (UL-CNN) is designed. UL-CNN is a convolutional neural network specially optimized for a flash-based CIM architecture (Conv-Flash) and applied to pedestrian recognition. Running UL-CNN on Conv-Flash achieves inference accuracy of up to 94.7%. Compared to LeNet-5, with similar operation counts and accuracy, UL-CNN requires less than 37% of LeNet-5's parameters on the same benchmark dataset. Such parameter reduction can dramatically speed up training and economize on-chip storage, as well as reduce the power consumed by memory accesses. With the aid of UL-CNN, the Conv-Flash architecture provides the best energy efficiency among the compared platforms (CPU, GPU, FPGA, etc.), consuming only 2.2 × 10⁻⁵ J to complete pedestrian recognition for one frame.
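The headline comparison is a parameter count against LeNet-5. As a point of reference, the classic LeNet-5 layer shapes give the count below; UL-CNN's own layer shapes are not given in the abstract, so only the 37% budget is derived:

```python
# How such a parameter comparison is computed: conv layers contribute
# (k*k*C_in + 1)*C_out parameters, fully connected layers (N_in + 1)*N_out.
# These are the classic LeNet-5 shapes; UL-CNN's are not public here.
def conv_params(k, c_in, c_out):
    return (k * k * c_in + 1) * c_out

def fc_params(n_in, n_out):
    return (n_in + 1) * n_out

lenet5 = (conv_params(5, 1, 6)      # C1
          + conv_params(5, 6, 16)   # C3
          + fc_params(400, 120)     # C5 (as FC over flattened 5x5x16)
          + fc_params(120, 84)      # F6
          + fc_params(84, 10))      # output
print(lenet5)                       # 61706
budget = int(0.37 * lenet5)         # ~22831: the ballpark UL-CNN stays under
```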


Micromachines · 2020 · Vol. 11 (6) · pp. 622
Author(s): Hasan Erdem Yantır, Ahmed M. Eltawil, Khaled N. Salama

Traditional computer architectures suffer severely from the bottleneck between processing elements and memory, which is the biggest barrier to their scalability. Meanwhile, the amount of data that applications must process is growing rapidly, especially in the era of big data and artificial intelligence. This pushes computer architecture design toward more data-centric principles. Therefore, new paradigms such as in-memory and near-memory processing have begun to emerge to counteract the memory bottleneck by bringing memory closer to computation or integrating the two. Associative processors are a promising candidate for in-memory computation, combining processor and memory in the same location to alleviate the memory bottleneck. One class of applications that requires iterative processing of huge amounts of data is stencil codes, for which associative processors can therefore provide a substantial advantage. As a demonstration, two in-memory associative processor architectures for 2D stencil codes are proposed, implemented in both emerging memristor and traditional SRAM technologies. The proposed architecture achieves promising efficiency on a variety of stencil applications and thus proves its applicability to scientific stencil computing.
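A 5-point Jacobi relaxation is a representative example of the stencil codes targeted here: every interior cell is rewritten from a fixed neighborhood, sweep after sweep, which is exactly the data-parallel, iteration-heavy pattern that suits word-parallel in-memory execution. A plain NumPy version for reference:

```python
import numpy as np

# 5-point Jacobi stencil: each sweep replaces every interior cell with
# the average of its four neighbors. The same few arithmetic operations
# are applied to every word of a large array, over many iterations.
def jacobi_step(u):
    v = u.copy()
    v[1:-1, 1:-1] = 0.25 * (u[:-2, 1:-1] + u[2:, 1:-1] +
                            u[1:-1, :-2] + u[1:-1, 2:])
    return v

u = np.zeros((64, 64))
u[0, :] = 1.0                     # fixed boundary condition
for _ in range(500):
    u = jacobi_step(u)
```

On an associative processor, each such sweep maps to a short sequence of array-wide compare/write passes applied to all rows at once, so runtime scales with the stencil's operation count rather than the grid size.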


2019 · Vol. 2019 · pp. 1-12
Author(s): Shijie Li, Xiaolong Shen, Yong Dou, Shice Ni, Jinwei Xu, ...

Recently, machine learning, and especially deep learning, has become a core algorithmic tool widely used in fields such as natural language processing, speech recognition, and object recognition. At the same time, more and more applications are moving to wearable and mobile devices. However, traditional deep learning methods such as the convolutional neural network (CNN) and its variants consume large amounts of memory, which makes these powerful methods difficult to apply on memory-limited mobile platforms. To solve this problem, we present a novel memory-management strategy called mmCNN. With this method, a trained large-size CNN can be deployed on a platform of any memory size, such as a GPU, an FPGA, or a memory-limited mobile device. In our experiments, we run feed-forward CNN inference within extremely small memory budgets (as low as 5 MB) on a GPU platform. The results show that our method saves more than 98% of memory compared to a traditional CNN implementation, and more than 90% compared to the state-of-the-art related work vDNN (virtualized deep neural networks). Our work improves the computing scalability of lightweight applications and breaks the memory bottleneck of using deep learning on memory-limited devices.
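The general idea behind memory-capped execution can be illustrated with a simple peak-memory model: keep only the current layer's weights and input/output activations resident and stage everything else off-device. The numbers and the staging policy below are illustrative, not mmCNN's actual algorithm:

```python
# Peak-memory model for layer-by-layer CNN inference. layer_bytes holds
# (weights, activations) per layer in MB; values are made up.
def peak_memory(layer_bytes, keep_all=True):
    if keep_all:                       # naive: everything resident at once
        return sum(w + a for w, a in layer_bytes)
    peak = 0.0
    prev_act = 0.0
    for w, a in layer_bytes:           # staged: this layer's weights plus
        peak = max(peak, w + prev_act + a)  # its input and output buffers
        prev_act = a
    return peak

layers = [(0.5, 12.0), (2.0, 6.0), (8.0, 3.0), (16.0, 1.0)]
print(peak_memory(layers, keep_all=True))    # 48.5 MB resident
print(peak_memory(layers, keep_all=False))   # 20.0 MB peak
```

Under such a policy the device-side footprint is bounded by the largest single layer rather than by the whole network, which is what makes budgets as small as a few megabytes feasible.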


2019 · Vol. 94 · pp. 03013
Author(s): Jong-Il Park, Kwi Woo Park, Chansik Park

In this paper, we design and implement a GPU-based non-coherent GPS signal tracking module for a real-time SDR. When the CPU and GPU are used simultaneously, the signal tracking module must be designed with the memory bottleneck between the two processors in mind. The basic non-coherent module, which accumulates the 1 ms correlation value 20 times, is changed to accumulate M = 20/N values computed over N ms units. In experiments with real GPS signals, the computational performance at N = 20 improves by 80% compared to N = 1. As a result, the SDR implemented on a notebook computer can track more channels simultaneously in real time.
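The accumulation scheme reads as follows: instead of summing the power of twenty 1 ms correlations, the module coherently sums N consecutive 1 ms complex correlator outputs and then accumulates the power M = 20/N times, which reduces the number of values crossing the CPU/GPU boundary. A small simulation with synthetic correlator outputs (illustrative signal and noise levels, not the paper's data):

```python
import numpy as np

# Non-coherent accumulation over N ms coherent blocks: coherently sum N
# consecutive 1 ms complex correlator outputs, then accumulate the
# power of the M = 20/N resulting blocks.
rng = np.random.default_rng(0)
corr_1ms = 1.0 + (rng.normal(0, 1.0, 20) +
                  1j * rng.normal(0, 1.0, 20))   # 20 noisy 1 ms outputs

def noncoherent_power(c, N):
    M = len(c) // N                        # M = 20 / N accumulations
    blocks = c.reshape(M, N).sum(axis=1)   # coherent sum within each block
    return np.sum(np.abs(blocks) ** 2)     # non-coherent power accumulation

for N in (1, 4, 20):
    print(N, noncoherent_power(corr_1ms, N))
```

Larger N trades more coherent integration (fewer, larger transfers and squarings) against sensitivity to data-bit transitions within the 20 ms bit period, which is why N must divide 20.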

