Simba

Package-level integration using multi-chip-modules (MCMs) is a promising approach for building large-scale systems. Compared to a large monolithic die, an MCM combines many smaller chiplets into a larger system, substantially reducing fabrication and design costs. Current MCMs typically only contain a handful of coarse-grained large chiplets due to the high area, performance, and energy overheads associated with inter-chiplet communication. This work investigates and quantifies the costs and benefits of using MCMs with finegrained chiplets for deep learning inference, an application domain with large compute and on-chip storage requirements. To evaluate the approach, we architected, implemented, fabricated, and tested Simba, a 36-chiplet prototype MCM system for deep-learning inference. Each chiplet achieves 4 TOPS peak performance, and the 36-chiplet MCM package achieves up to 128 TOPS and up to 6.1 TOPS/W. The MCM is configurable to support a flexible mapping of DNN layers to the distributed compute and storage units. To mitigate inter-chiplet communication overheads, we introduce three tiling optimizations that improve data locality. These optimizations achieve up to 16% speedup compared to the baseline layer mapping. Our evaluation shows that Simba can process 1988 images/s running ResNet-50 with a batch size of one, delivering an inference latency of 0.50 ms.

Download Full-text

Counterfeit Currency Detection using Resource Efficient Neural Networks

International Journal of Innovative Technology and Exploring Engineering - Special Issue ◽

10.35940/ijitee.f3626.049620 ◽

2020 ◽

Vol 9 (6) ◽

pp. 220-223

Keyword(s):

Deep Learning ◽

Large Scale ◽

Paper Currency ◽

Currency Note ◽

Fast Pace ◽

Tremendous Amount ◽

Media Reports ◽

The Government ◽

To Come ◽

And Storage

One of the leading causes of economic instability is the large-scale counterfeiting of the paper currency notes. Several media reports bring to light the alarming cases and the humungous scales of currency counterfeiting and how this issue has become very serious now. A report on how the Government is coping with these threats with new and stricter rules however counterfeiters adapt to the new rules in an alarmingly fast pace. Criminals continue to find a loophole in the system despite such strict security features. There have been impressive discoveries in the field of counterfeit currency, and this coupled with new age digital technology, counterfeiting is being fought well. However, it is impossible to track all counterfeit notes and impossible to have them checked at a short amount of time. Existing systems involve filing a case with the police, sending the documents for verification and waiting for the results to come. This method is based on Deep Learning, which has seen tremendous success in image classification tasks in recent times. This technique can help both people and machine in identifying a fake currency note in real time through an image of the same. Traditional Deep Learning algorithms require tremendous amount of compute power and storage and hence it is an expensive and elaborate process. The main goal is to make a faster and simpler mechanism to detect a counterfeit note that can be implemented in any random place like an ATM dispenser or an android application. The success of this application will greatly help the quick identification of the threat and help law enforcement in finding the source of the threat faster.

Download Full-text

The Backpropagation Algorithm Implemented on Spiking Neuromorphic Hardware

10.21203/rs.3.rs-701752/v1 ◽

2021 ◽

Author(s):

Alpha Renner ◽

Forrest Sheldon ◽

Anatoly Zlotnik ◽

Louis Tao ◽

Andrew Sornborger

Keyword(s):

Machine Learning ◽

Deep Learning ◽

Low Power ◽

Large Scale ◽

Learning Algorithms ◽

Vlsi Circuits ◽

Machine Learning Algorithms ◽

Backpropagation Algorithm ◽

Neuromorphic Hardware ◽

On Chip

Abstract The capabilities of natural neural systems have inspired new generations of machine learning algorithms as well as neuromorphic very large-scale integrated (VLSI) circuits capable of fast, low-power information processing. However, it has been argued that most modern machine learning algorithms are not neurophysiologically plausible. In particular, the workhorse of modern deep learning, the backpropagation algorithm, has proven difficult to translate to neuromorphic hardware. In this study, we present a neuromorphic, spiking backpropagation algorithm based on synfire-gated dynamical information coordination and processing, implemented on Intel's Loihi neuromorphic research processor. We demonstrate a proof-of-principle three-layer circuit that learns to classify digits from the MNIST dataset. To our knowledge, this is the first work to show a Spiking Neural Network (SNN) implementation of the backpropagation algorithm that is fully on-chip, without a computer in the loop. It is competitive in accuracy with off-chip trained SNNs and achieves an energy-delay product suitable for edge computing. This implementation shows a path for using in-memory, massively parallel neuromorphic processors for low-power, low-latency implementation of modern deep learning applications.

Download Full-text

Adaptive neuron apoptosis for accelerating deep learning on large scale systems

2016 IEEE International Conference on Big Data (Big Data) ◽

10.1109/bigdata.2016.7840668 ◽

2016 ◽

Cited By ~ 2

Author(s):

Charles Siegel ◽

Jeff Daily ◽

Abhinav Vishnu

Keyword(s):

Deep Learning ◽

Large Scale ◽

Large Scale Systems ◽

Adaptive Neuron ◽

Neuron Apoptosis

Download Full-text

Strategies to Improve the Performance and Energy Efficiency of Stencil Computations for NVIDIA GPUs

10.5753/wperformance.2018.3348 ◽

2018 ◽

Author(s):

Pablo José Pavan ◽

Matheus da Silva Serpa ◽

Víctor Martínez ◽

Edson Luiz Padoin ◽

Jairo Panetta ◽

...

Keyword(s):

Energy Efficiency ◽

Energy Efficient ◽

Large Scale ◽

Data Locality ◽

Large Scale Systems ◽

Systems Research ◽

Gpu Algorithms ◽

Read Only Memory ◽

And Performance ◽

Gpu Architecture

Energy and performance of parallel systems are an increasing concern for new large-scale systems. Research has been developed in response to this challenge aiming the manufacture of more energy efficient systems. In this context, we improved the performance and achieved energy efficiency by the development of three different strategies which use the GPU memory subsystem (global-, shared-, and read-only- memory). We also develop two optimizations to use data locality and use of registers of GPU architecture. Our developed optimizations were applied to GPU algorithms for stencil applications achieve a performance improvement of up to 201:5% in K80 and 264:6% in P 100 when used shared memory and read-only cache respectively over the naive version. The computational results have shown that the combination of use read-only memory, the Z-axis internalization of stencil application and reuse of specific architecture registers allow increasing the energy efficiency of up to 255:6% in K80 and 314:8% in P 100.

Download Full-text