Modeling Teacher-Student Techniques in Deep Neural Networks for Knowledge Distillation

Although deep neural networks are powerful models that achieve appealing results on many tasks, they are too large to be deployed on edge devices such as smartphones or embedded sensor nodes. There have been efforts to compress these networks, and a popular method is knowledge distillation, in which a large pre-trained (teacher) network is used to train a smaller (student) network. However, in this paper, we show that student performance degrades when the gap between student and teacher is large. Given a fixed student network, one cannot employ an arbitrarily large teacher; in other words, a teacher can effectively transfer its knowledge to students down to a certain size, but not smaller. To alleviate this shortcoming, we introduce multi-step knowledge distillation, which employs an intermediate-sized network (teacher assistant) to bridge the gap between the student and the teacher. Moreover, we study the effect of teacher assistant size and extend the framework to multi-step distillation. Theoretical analysis and extensive experiments on the CIFAR-10, CIFAR-100, and ImageNet datasets and on CNN and ResNet architectures substantiate the effectiveness of our proposed approach.
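The abstract above does not give the training objective, so the following is only a minimal sketch of how one distillation step and a teacher-to-assistant-to-student chain might look. It assumes the common soft-target formulation (temperature-scaled softmax, KL divergence weighted by the squared temperature, mixed with cross-entropy on the hard label). The function names, the mixing weight `lam`, the temperature default, and the example sizes are all hypothetical and not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; a higher temperature gives softer probabilities."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, lam=0.5):
    """Assumed objective: (1 - lam) * CE(hard label) + lam * T^2 * KL(teacher || student)."""
    # Hard-label cross-entropy at temperature 1.
    hard = -math.log(softmax(student_logits)[label])
    # Soft-target term: KL divergence between the softened distributions.
    q_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    soft = sum(t * math.log(t / s)
               for t, s in zip(q_teacher, q_student) if t > 0)
    return (1.0 - lam) * hard + lam * temperature ** 2 * soft


def distillation_chain(sizes):
    """Return (teacher, student) size pairs, distilling from largest to smallest."""
    ordered = sorted(sizes, reverse=True)
    return list(zip(ordered, ordered[1:]))
```

With hypothetical network sizes 110, 20, and 8, `distillation_chain([110, 20, 8])` yields `[(110, 20), (20, 8)]`: the teacher first trains the intermediate teacher assistant, which in turn trains the student, so no single step has to bridge the full size gap.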

KDnet-RUL: A Knowledge Distillation Framework to Compress Deep Neural Networks for Machine Remaining Useful Life Prediction

IEEE Transactions on Industrial Electronics ◽ 10.1109/tie.2021.3057030 ◽ 2021 ◽ pp. 1-1

Author(s): Qing Xu ◽ Zhenghua Chen ◽ Keyu Wu ◽ Chao Wang ◽ Min Wu ◽ ...

Keyword(s): Neural Networks ◽ Life Prediction ◽ Deep Neural Networks ◽ Remaining Useful Life ◽ Knowledge Distillation ◽ Useful Life

Multi-teacher Knowledge Distillation for Compressed Video Action Recognition on Deep Neural Networks

ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) ◽ 10.1109/icassp.2019.8682450 ◽ 2019

Author(s): Meng-Chieh Wu ◽ Ching-Te Chiu ◽ Kun-Hsuan Wu

Keyword(s): Neural Networks ◽ Teacher Knowledge ◽ Action Recognition ◽ Deep Neural Networks ◽ Compressed Video ◽ Knowledge Distillation

From complex to simple: hierarchical free-energy landscape renormalized in deep neural networks

10.21468/scipostphyscore.2.2.005 ◽ 2020 ◽ Vol 2 (2)

Author(s): Hajime Yoshino

Keyword(s): Neural Networks ◽ Free Energy ◽ Deep Neural Networks ◽ Solid Phase ◽ Energy Landscape ◽ Training Data ◽ Free Energy Landscape ◽ Finite Width ◽ Random Inputs ◽ Teacher Student

We develop a statistical mechanical approach based on the replica method to study the design space of deep and wide neural networks constrained to meet a large number of training data. Specifically, we analyze the configuration space of the synaptic weights and neurons in the hidden layers of a simple feed-forward perceptron network in two scenarios: a setting with random inputs/outputs and a teacher-student setting. As the strength of the constraints increases, i.e., as the number of training data increases, successive second-order glass transitions (random inputs/outputs) or second-order crystalline transitions (teacher-student setting) take place layer by layer, starting next to the input/output boundaries and going deeper into the bulk, with the thickness of the solid phase growing logarithmically with the data size. This implies that the typical storage capacity of the network grows exponentially fast with the depth. In a deep enough network, the central part remains in the liquid phase. We argue that in systems of finite width N, a weak bias field can remain in the center and play the role of a symmetry-breaking field that connects the opposite sides of the system. The successive glass transitions bring about a hierarchical free-energy landscape with ultrametricity, which evolves in space: it is most complex close to the boundaries but becomes renormalized into progressively simpler ones in deeper layers. These observations provide clues to understanding why deep neural networks operate efficiently. Finally, we present some numerical simulations of learning that reveal spatially heterogeneous glassy dynamics truncated by a finite-width N effect.
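The step from logarithmic growth of the solid phase to exponential storage capacity can be sketched as follows. This is an illustrative reading of the abstract, not the paper's derivation: the symbols are assumptions introduced here, with \xi the thickness of the solid phase, \alpha a measure of the data size, c an unspecified positive constant, L the depth, and \alpha^{*} the data size at which the solid regions growing from the two boundaries would meet at the center.

```latex
\xi(\alpha) \simeq c \ln \alpha
\quad\Longrightarrow\quad
\xi(\alpha^{*}) = \frac{L}{2}
\quad\Longrightarrow\quad
\alpha^{*} \simeq \exp\!\left(\frac{L}{2c}\right)
```

Under these assumptions, the amount of data needed to solidify the whole network grows exponentially with the depth L, which is consistent with the statement that the central part of a deep enough network remains liquid.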
