High-throughput Near-Memory Processing on CNNs with 3D HBM-like Memory

2021 ◽  
Vol 26 (6) ◽  
pp. 1-20
Author(s):  
Naebeom Park ◽  
Sungju Ryu ◽  
Jaeha Kung ◽  
Jae-Joon Kim

This article discusses a high-performance near-memory neural network (NN) accelerator architecture that utilizes the logic die in three-dimensional (3D) High Bandwidth Memory (HBM)-like memory. As most previously reported 3D memory-based near-memory NN accelerator designs used Hybrid Memory Cube (HMC) memory, we first focus on identifying the key differences between HBM and HMC in terms of near-memory NN accelerator design. One of the major differences between the two 3D memories is that HBM has centralized through-silicon-via (TSV) channels, while HMC has distributed TSV channels for separate vaults. Based on this observation, we introduce the Round-Robin Data Fetching and Groupwise Broadcast schemes, which exploit the centralized TSV channels to improve the data feeding rate for the processing elements. Using synthesized designs in a 28-nm CMOS technology, the performance and energy consumption of the proposed architectures with various dataflow models are evaluated. Experimental results show that the proposed schemes reduce the runtime by 16.4–39.3% on average and the energy consumption by 2.1–5.1% on average compared to conventional data fetching schemes.

2015 ◽  
Vol 2015 (1) ◽  
pp. 000050-000054
Author(s):  
Andy Heinig ◽  
Muhammad Waqas Chaudhary ◽  
Robert Fischbach ◽  
Michael Dittrich

Further improvements in system performance are often limited by the achievable bandwidth between processor and memory. In this paper we look at interposer-based and stacked solutions for integrating a processor and 3D memory into a high-performance system. The comparison covers different technological decisions and the design problems faced when choosing a 3D memory type from Wide IO 1/2, High Bandwidth Memory (HBM), and Hybrid Memory Cube (HMC). We discuss how the routing requirements of the memory systems affect the logic die size, the metal layers, and the interposer material.


2022 ◽  
Vol 6 (1) ◽  
Author(s):  
Taikyu Kim ◽  
Cheol Hee Choi ◽  
Pilgyu Byeon ◽  
Miso Lee ◽  
Aeran Song ◽  
...  

Abstract: Achieving high-performance p-type semiconductors has been considered one of the most challenging tasks for three-dimensional vertically integrated nanoelectronics. Although many candidates have been presented to date, the facile and scalable realization of high-mobility p-channel field-effect transistors (FETs) is still elusive. Here, we report a high-performance p-channel tellurium (Te) FET fabricated through physical vapor deposition at room temperature. A growth route involving Te deposition by sputtering, oxidation, and subsequent reduction to an elemental Te film through alumina encapsulation allows the resulting p-channel FET to exhibit a high field-effect mobility of 30.9 cm² V⁻¹ s⁻¹ and an I_ON/OFF ratio of 5.8 × 10⁵ with 4-inch wafer-scale integrity on a SiO₂/Si substrate. Complementary metal-oxide semiconductor (CMOS) inverters using In-Ga-Zn-O and 4-nm-thick Te channels show a remarkably high gain of ~75.2 and great noise margins at a small supply voltage of 3 V. We believe that this low-cost and high-performance Te layer can pave the way for future CMOS technology enabling monolithic three-dimensional integration.


2018 ◽  
Vol 24 (6) ◽  
pp. 1-9 ◽  
Author(s):  
Myung-Jae Lee ◽  
Augusto Ronchini Ximenes ◽  
Preethi Padmanabhan ◽  
Tzu-Jui Wang ◽  
Kuo-Chin Huang ◽  
...  

Electronics ◽  
2019 ◽  
Vol 8 (10) ◽  
pp. 1096
Author(s):  
Chae Eun Rhee ◽  
Seung-Won Park ◽  
Jungwoo Choi ◽  
Hyunmin Jung ◽  
Hyuk-Jae Lee

Data-demanding application services such as deep learning, big data, and immersive video have recently created a strong need for dramatic improvements in memory performance. To this end, throughput-oriented memories such as high bandwidth memory (HBM) and hybrid memory cube (HMC) have been introduced to provide high bandwidth, and various research efforts have addressed their effective use. Among them, near-memory processing (NMP) is a concept that makes better use of bandwidth and power by placing computation logic near the memory. In an NMP-enabled system, a processor hierarchy consisting of hosts and NMPs is formed based on the distance from the main memory. In this paper, an evaluation tool is proposed to obtain the optimal design decision considering the power-time trade-off in the processor hierarchy. Every time the operating conditions and constraints change, the task-level offloading decision is made dynamically. For a realistic NMP-enabled system environment, the relationship among HBM, host, and NMP should be carefully considered: hosts and NMPs are almost hidden from each other, and the communication between them is extremely limited. In the simulation results, popular benchmarks and a machine learning application are used to demonstrate power-time trade-offs depending on applications and system conditions.
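The task-level offloading decision described in the abstract above can be illustrated with a minimal sketch. It is not the evaluation tool from the paper: the `Estimate` type, the `choose_target` function, the `power_budget_w` constraint, and the weighted cost are all hypothetical names and assumptions chosen only to show how a power-time trade-off might drive a host-versus-NMP choice.

```python
from dataclasses import dataclass


@dataclass
class Estimate:
    """Hypothetical per-task estimate for one execution target."""
    time_s: float   # estimated execution time in seconds
    power_w: float  # estimated average power in watts

    @property
    def energy_j(self) -> float:
        # Energy is average power multiplied by execution time.
        return self.time_s * self.power_w


def choose_target(host: Estimate, nmp: Estimate,
                  power_budget_w: float, time_weight: float = 0.5) -> str:
    """Pick "host" or "nmp" for one task (illustrative policy only).

    Targets that exceed the power budget are excluded. Among the rest,
    the one with the lowest weighted sum of normalized time and
    normalized energy wins. If no target fits the budget, the
    lowest-power target is used as a fallback.
    """
    candidates = {"host": host, "nmp": nmp}
    feasible = {name: est for name, est in candidates.items()
                if est.power_w <= power_budget_w}
    if not feasible:
        return min(candidates, key=lambda name: candidates[name].power_w)

    # Normalize by the largest value so time and energy are comparable.
    ref_time = max(est.time_s for est in candidates.values())
    ref_energy = max(est.energy_j for est in candidates.values())

    def cost(est: Estimate) -> float:
        return (time_weight * est.time_s / ref_time
                + (1.0 - time_weight) * est.energy_j / ref_energy)

    return min(feasible, key=lambda name: cost(feasible[name]))


# Made-up numbers: the host is faster, the NMP draws less power.
host = Estimate(time_s=1.0, power_w=10.0)
nmp = Estimate(time_s=1.5, power_w=4.0)
print(choose_target(host, nmp, power_budget_w=20.0, time_weight=0.5))
print(choose_target(host, nmp, power_budget_w=20.0, time_weight=0.9))
```

Re-running `choose_target` whenever the budget or the weight changes mirrors, in a very simplified way, the idea of revisiting the offloading decision as operating conditions change.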


2021 ◽  
Vol 17 (2) ◽  
pp. 1-25
Author(s):  
Palash Das ◽  
Hemangee K. Kapoor

Convolutional/Deep Neural Networks (CNNs/DNNs) are rapidly growing workloads for emerging AI-based systems. The gap between the processing speed and the memory-access latency in multi-core systems affects the performance and energy efficiency of CNN/DNN tasks. This article aims to alleviate this gap by providing a simple yet efficient near-memory accelerator-based system that expedites CNN inference. Towards this goal, we first design an efficient parallel algorithm to accelerate CNN/DNN tasks. The data is partitioned across the multiple memory channels (vaults) to assist in the execution of the parallel algorithm. Second, we design a hardware unit, namely the convolutional logic unit (CLU), which implements the parallel algorithm. To optimize inference, the CLU works in three phases for layer-wise processing of data. Last, to harness the benefits of near-memory processing (NMP), we integrate homogeneous CLUs on the logic layer of the 3D memory, specifically the Hybrid Memory Cube (HMC). The combined effect of these is a high-performing and energy-efficient system for CNNs/DNNs. The proposed system achieves a substantial gain in performance and energy reduction compared to multi-core CPU- and GPU-based systems, with a minimal area overhead of 2.37%.


2011 ◽  
Vol 4 (1) ◽  
pp. 109-117 ◽  
Author(s):  
Mohamed Mehdi Jatlaoui ◽  
Daniela Dragomirescu ◽  
Mariano Ercoli ◽  
Michael Krämer ◽  
Samuel Charlot ◽  
...  

This paper presents research done at LAAS-CNRS in the context of the “NANOCOMM” project. The project aims to demonstrate the potential of nanotechnology for developing reconfigurable, ultra-sensitive, low-consumption, easy-to-install sensor networks with high reliability, in line with the requirements of aeronautics and space. Each node of the sensor network is composed of nano-sensors, a transceiver, and a planar antenna. In this project, three-dimensional (3D) heterogeneous integration of these components on a flexible polyimide substrate is planned. Two types of sensors are selected: strain gauges are used for the structural health monitoring (SHM) application, and electrochemical cells are used to demonstrate the ability to detect frost. Sensor data are processed and then transmitted to the reader unit using an ultra-wide band (UWB) transceiver (digital baseband and radiofrequency (RF) head). Reconfigurable wireless communication architectures are designed and implemented according to the application requirements using 65 nm CMOS technology. The transceiver is to be integrated on the flexible substrate using the flip-chip technique. A 60 GHz planar antenna is connected to the transceiver for wireless data transmission. This paper focuses on the 3D integration techniques and the technological process used to realize such communicating nano-objects on a polyimide substrate. The first assembly tests were carried out. Tests of interconnection quality and electrical contacts (daisy chain, calibration kit, etc.) were also performed with good results. A bump contact resistance of 15 mΩ was measured.


Author(s):  
Haroon Rasheed S ◽  
Mohan Das S ◽  
Samba Sivudu Gaddam

This paper presents an energy-efficient 1-bit full adder designed with low-voltage, high-performance internal logic cells, which leads to a reduced Power Delay Product (PDP). The customized XNOR and XOR gates, which are essential building blocks, are also presented. The simulations for the designed circuits were performed in the Cadence Virtuoso tool with 45-nm CMOS technology at a supply voltage of 0.9 V. The proposed 1-bit adder cell is compared with various popular adders based on speed, power consumption, and energy (PDP). The proposed adder schemes with modified internal cells achieve significant savings in delay and energy consumption, of more than 77% and 40.47% respectively, when compared with the conventional “C-CMOS” 1-bit full adder and other counterparts.
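As a small worked illustration of the metric used in the abstract above, the sketch below computes a power-delay product and a percentage saving relative to a baseline. The function names and all numeric values are made up for illustration; they are not measurements from the paper.

```python
def power_delay_product(power_w: float, delay_s: float) -> float:
    """PDP in joules: average power (W) multiplied by propagation delay (s)."""
    return power_w * delay_s


def percent_saving(baseline: float, proposed: float) -> float:
    """Relative reduction of `proposed` versus `baseline`, in percent."""
    return 100.0 * (baseline - proposed) / baseline


# Hypothetical figures for a baseline adder and an improved adder.
baseline_power_w, baseline_delay_s = 4.0e-6, 200e-12
proposed_power_w, proposed_delay_s = 3.0e-6, 100e-12

baseline_pdp = power_delay_product(baseline_power_w, baseline_delay_s)
proposed_pdp = power_delay_product(proposed_power_w, proposed_delay_s)

print(f"delay saving: {percent_saving(baseline_delay_s, proposed_delay_s):.1f}%")
print(f"PDP saving:   {percent_saving(baseline_pdp, proposed_pdp):.1f}%")
```

Because PDP multiplies power by delay, a design can show a larger PDP saving than its saving in either quantity alone, which is why the abstract reports delay and energy figures separately.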


Author(s):  
Lee D. Peachey ◽  
Lou Fodor ◽  
John C. Haselgrove ◽  
Stanley M. Dunn ◽  
Junqing Huang

Stereo pairs of electron microscope images provide valuable visual impressions of the three-dimensional nature of specimens, including biological objects. Beyond this, one seeks quantitatively accurate models and measurements of the three-dimensional positions and sizes of structures in the specimen. In our laboratory, we have sought to combine high-resolution video cameras with high-performance computer graphics systems to improve both the ease of building 3D reconstructions and the accuracy of 3D measurements, by using multiple tilt images of the same specimen tilted over a wider range of angles than can be viewed stereoscopically. Ultimately we also wish to automate the reconstruction and measurement process, and have initiated work in that direction. Figure 1 is a stereo pair of 400 kV images from a 1 micrometer thick transverse section of frog skeletal muscle stained with the Golgi stain. This stain selectively increases the density of the transverse tubular network in these muscle cells, and it is this network that we reconstruct in this example.

