Range-Lookup Approximate Computing Acceleration for Any Activation Functions in Low-Power Neural Network

Author(s):  
Wen-Chang Yang ◽  
Shu-Yun Lin ◽  
Tsung-Chu Huang


2016 ◽  
Vol 7 ◽  
pp. 1397-1403 ◽  
Author(s):  
Andrey E Schegolev ◽  
Nikolay V Klenov ◽  
Igor I Soloviev ◽  
Maxim V Tereshonok

We propose the concept of using superconducting quantum interferometers for the implementation of neural network algorithms with extremely low power dissipation. These adiabatic elements are Josephson cells with sigmoid- and Gaussian-like activation functions. We optimize their parameters for application in three-layer perceptron and radial basis function networks.
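For readers unfamiliar with the two activation shapes mentioned, a minimal NumPy sketch may help; it is purely illustrative software, not the Josephson-cell transfer functions themselves, and every function name below is hypothetical. It shows a sigmoid hidden layer for a three-layer perceptron and a Gaussian layer for a radial basis function network:

```python
import numpy as np

def sigmoid(x):
    # Logistic activation: the kind of shape the sigmoid-like cell provides.
    return 1.0 / (1.0 + np.exp(-x))

def gaussian(r, sigma=1.0):
    # Bell-shaped response: the kind of shape the Gaussian-like cell provides.
    return np.exp(-(r ** 2) / (2.0 * sigma ** 2))

def three_layer_perceptron(x, W1, b1, W2, b2):
    # Input -> sigmoid hidden layer -> linear output layer.
    return W2 @ sigmoid(W1 @ x + b1) + b2

def rbf_network(x, centers, widths, w_out):
    # Hidden units respond to distance from their centers with a Gaussian profile.
    r = np.linalg.norm(centers - x, axis=1)
    return w_out @ gaussian(r, widths)
```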


2021 ◽  
Vol 17 (2) ◽  
pp. 1-29
Author(s):  
Anand Kumar Mukhopadhyay ◽  
Atul Sharma ◽  
Indrajit Chakrabarti ◽  
Arindam Basu ◽  
Mrigank Sharad

Spike sorting is the process of mapping recorded neural signals to the neurons from which they originate. A low-power spike sorting system is presented for a neural implant device. The spike sorter comprises a two-step trainer module that is shared by the signal acquisition channels associated with multiple electrodes, and a low-power Spiking Neural Network (SNN) module responsible for assigning the spike class. The two-step, shared, supervised on-chip training module improves training accuracy for the SNN. Post implant, the relatively power-hungry training module can be activated conditionally by a statistics-driven retraining algorithm that allows on-the-fly training and adaptation. A low-power analog implementation of the SNN classifier is proposed, based on resistive crossbar memory and exploiting its approximate-computing nature. Owing to the direct mapping of SNN functionality onto the physical characteristics of the devices, the analog implementation achieves ∼21× lower power than its fully digital counterpart. The effect of device variation is also incorporated into the training process to suppress the impact of the inevitable inaccuracies of such resistive crossbar devices on classification accuracy. A variation-aware, digitally calibrated analog front-end is also presented, which consumes less than ∼50 nW and interfaces with both the digital training module and the analog SNN spike-sorting module. The proposed scheme is therefore a low-power, variation-tolerant, adaptive, digitally trained, all-analog spike sorter, applicable to implantable and wearable multichannel brain-machine interfaces.
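The variation-aware training idea, absorbing device-to-device conductance spread into the learning loop, can be pictured with a short sketch. The multiplicative Gaussian noise model and the perceptron-style toy classifier below are assumptions for illustration only, not the authors' SNN training algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

def perturb_weights(W, sigma=0.1):
    # Model resistive-crossbar device variation as a multiplicative Gaussian spread
    # (an assumed noise model; the paper's actual variation statistics may differ).
    return W * (1.0 + sigma * rng.standard_normal(W.shape))

def train_variation_aware(X, y, n_classes, epochs=50, lr=0.01, sigma=0.1):
    # Toy linear spike classifier trained against noisy weight copies, so the
    # learned weights remain accurate when mapped onto imperfect conductances.
    W = 0.01 * rng.standard_normal((n_classes, X.shape[1]))
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            scores = perturb_weights(W, sigma) @ xi   # what the analog array would compute
            pred = int(np.argmax(scores))
            if pred != yi:                            # perceptron-style update on error
                W[yi] += lr * xi
                W[pred] -= lr * xi
    return W
```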


2021 ◽  
Vol 17 (2) ◽  
pp. 1-27
Author(s):  
Morteza Hosseini ◽  
Tinoosh Mohsenin

This article presents a low-power, programmable, domain-specific manycore accelerator, the Binarized neural Network Manycore Accelerator (BiNMAC), which adopts and efficiently executes binary-precision weight/activation neural network models. Such networks have compact models in which weights are constrained to 1 bit, so several can be packed into one memory entry, minimizing the memory footprint. Packing weights also facilitates single-instruction, multiple-data execution with simple circuitry, maximizing performance and efficiency. The proposed BiNMAC has lightweight cores that support domain-specific instructions and a router-based memory access architecture that enables efficient implementation of layers in binary-precision weight/activation neural networks of appropriate size. With only 3.73% area and 1.98% average power overhead, novel instructions such as Combined Population-Count-XNOR, Patch-Select, and Bit-based Accumulation are added to the instruction set architecture of the BiNMAC; each replaces the execution cycles of a frequently used function with 1 clock cycle that would otherwise have taken 54, 4, and 3 clock cycles, respectively. Additionally, customized logic is added to every core to transpose 16×16-bit blocks of memory at the bit level, which expedites reshaping intermediate data so that it is well aligned for bitwise operations. A 64-cluster architecture of the BiNMAC is fully placed and routed in 65-nm TSMC CMOS technology, where a single cluster occupies an area of 0.53 mm² with an average power of 232 mW at a 1-GHz clock frequency and 1.1 V. The 64-cluster architecture takes 36.5 mm² of area and, if fully exploited, consumes a total power of 16.4 W and can perform 1,360 Giga Operations Per Second (GOPS) while providing full programmability. To demonstrate its scalability, four binarized case studies, including ResNet-20 and LeNet-5 for high-performance image classification as well as a ConvNet and a multilayer perceptron for low-power physiological applications, were implemented on BiNMAC. The implementation results indicate that the population-count instruction alone can expedite performance by approximately 5×. When the other new instructions are added to a RISC machine that already has a population-count instruction, performance increases by 58% on average. To compare the performance of the BiNMAC with commercial off-the-shelf platforms, the case studies with their double-precision floating-point models were also implemented on the NVIDIA Jetson TX2 SoC (CPU+GPU). The results indicate that, within a margin of ∼2.1%–9.5% accuracy loss, BiNMAC on average outperforms the TX2 GPU by approximately 1.9× (or 7.5× with fabrication technology scaled) in energy consumption for image classification applications. In low-power settings, and within a margin of ∼3.7%–5.5% accuracy loss compared with an ARM Cortex-A57 CPU implementation, BiNMAC is ∼9.7×–17.2× (or 38.8×–68.8× with fabrication technology scaled) more energy efficient for physiological applications while meeting the application deadline.
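The payoff of the Combined Population-Count-XNOR instruction comes from the fact that a dot product between two ±1 vectors packed into machine words reduces to an XNOR followed by a population count. A minimal bit-packed sketch of that identity follows; it is illustrative Python, not BiNMAC's ISA, and the helper names are made up:

```python
import numpy as np

def pack_bits(v):
    # Map a ±1 vector to a packed bit array (+1 -> bit 1, -1 -> bit 0).
    return np.packbits((v > 0).astype(np.uint8))

def binary_dot(a_bits, b_bits, n):
    # XOR exposes mismatching positions; XNOR would count matches directly
    # (matches = n - mismatches). Each match contributes +1 and each mismatch -1,
    # so dot = 2 * matches - n.
    mismatches = int(np.unpackbits(a_bits ^ b_bits)[:n].sum())
    return 2 * (n - mismatches) - n

# The packed XNOR/popcount result agrees with the ordinary ±1 dot product.
a = np.array([1, -1, 1, 1, -1, 1, -1, -1])
b = np.array([1, 1, -1, 1, -1, -1, -1, 1])
assert binary_dot(pack_bits(a), pack_bits(b), len(a)) == int(a @ b)
```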

