Improved Knowledge Distillation via Teacher Assistant

Proceedings of the AAAI Conference on Artificial Intelligence ◽  
2020 ◽  
Vol 34 (04) ◽  
pp. 5191-5198 ◽  
Author(s):  
Seyed Iman Mirzadeh ◽  
Mehrdad Farajtabar ◽  
Ang Li ◽  
Nir Levine ◽  
Akihiro Matsukawa ◽  
...  

Although deep neural networks are powerful models that achieve appealing results on many tasks, they are too large to be deployed on edge devices like smartphones or embedded sensor nodes. There have been efforts to compress these networks, and a popular method is knowledge distillation, where a large pre-trained network (the teacher) is used to train a smaller network (the student). However, in this paper, we show that the student network's performance degrades when the gap between student and teacher is large. Given a fixed student network, one cannot employ an arbitrarily large teacher; in other words, a teacher can effectively transfer its knowledge to students only up to a certain size gap, not beyond. To alleviate this shortcoming, we introduce multi-step knowledge distillation, which employs an intermediate-sized network (the teacher assistant) to bridge the gap between the student and the teacher. Moreover, we study the effect of teacher assistant size and extend the framework to multiple distillation steps. Theoretical analysis and extensive experiments on the CIFAR-10, CIFAR-100, and ImageNet datasets with CNN and ResNet architectures substantiate the effectiveness of our proposed approach.
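
Concretely, the multi-step scheme amounts to running the standard soft-target distillation objective twice: first from the teacher into the teacher assistant, then from the assistant into the student. Below is a minimal PyTorch sketch of one such stage; the temperature, mixing weight, toy training loop, and hyperparameters are illustrative assumptions, not the authors' exact setup.

```python
# A minimal sketch of one distillation stage in the
# teacher -> teacher assistant -> student chain (assumed hyperparameters).
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Soft-target KD loss: KL between temperature-scaled distributions,
    plus standard cross-entropy on the hard labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=1),
        F.softmax(teacher_logits / T, dim=1),
        reduction="batchmean",
    ) * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

def distill(teacher, student, loader, epochs=1, lr=1e-3):
    """One distillation stage; call twice to realize the multi-step chain:
    distill(teacher, assistant, ...) then distill(assistant, student, ...)."""
    opt = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    teacher.eval()
    for _ in range(epochs):
        for x, y in loader:
            with torch.no_grad():
                t_logits = teacher(x)
            loss = distillation_loss(student(x), t_logits, y)
            opt.zero_grad()
            loss.backward()
            opt.step()
```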


Author(s):  
Ming Hou ◽  
Brahim Chaib-draa ◽  
Chao Li ◽  
Qibin Zhao

In this work, we consider the task of classifying binary positive-unlabeled (PU) data. Existing discriminative PU models seek an optimal reweighting strategy for the unlabeled (U) data so that a decent decision boundary can be found. However, given limited positive (P) data, conventional PU models tend to overfit when adapted to very flexible deep neural networks. In contrast, we are the first to attack the binary PU task from the perspective of generative learning, by leveraging the power of generative adversarial networks (GANs). Our generative positive-unlabeled (GenPU) framework incorporates an array of discriminators and generators endowed with different roles, simultaneously producing realistic positive and negative samples. We provide theoretical analysis to justify that, at equilibrium, GenPU is capable of recovering both the positive and the negative data distributions. Moreover, we show that GenPU is generalizable and closely related to semi-supervised classification. Given rather limited P data, experiments on both synthetic and real-world datasets demonstrate the effectiveness of our proposed framework. With the effectively unlimited stream of realistic and diverse samples generated by GenPU, a very flexible classifier can then be trained using deep neural networks.
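
To make the role assignment concrete, here is a highly simplified, self-contained toy sketch in PyTorch on 1-D Gaussian data: one generator is pushed to produce positive-looking samples (fooling both the P- and U-discriminators), while the other is pushed toward the rest of the unlabeled distribution (fooling the U-discriminator while being rejected by the P-discriminator). The toy distributions, network sizes, and unweighted losses are assumptions for illustration; the paper's exact value function and weighting terms differ.

```python
# A highly simplified 1-D toy sketch of the GenPU role assignment
# (toy data and unweighted losses are assumptions, not the paper's setup).
import torch
import torch.nn as nn

def mlp(out=1):
    return nn.Sequential(nn.Linear(1, 32), nn.ReLU(), nn.Linear(32, out))

G_p, G_n = mlp(), mlp()   # generators for positive / negative samples
D_p, D_u = mlp(), mlp()   # discriminators for P data / U data
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(list(G_p.parameters()) + list(G_n.parameters()), lr=1e-3)
opt_d = torch.optim.Adam(list(D_p.parameters()) + list(D_u.parameters()), lr=1e-3)

p_real = torch.randn(64, 1) * 0.5 + 2.0              # few labeled positives
u_real = torch.cat([torch.randn(32, 1) * 0.5 + 2.0,  # unlabeled mixture
                    torch.randn(32, 1) * 0.5 - 2.0])

for step in range(2000):
    z1, z2 = torch.randn(64, 1), torch.randn(64, 1)
    ones, zeros = torch.ones(64, 1), torch.zeros(64, 1)

    # Discriminators: D_p separates real P from G_p samples;
    # D_u separates real U from both generators' samples.
    fake_p, fake_n = G_p(z1), G_n(z2)
    d_loss = (bce(D_p(p_real), ones) + bce(D_p(fake_p.detach()), zeros)
              + bce(D_u(u_real), ones)
              + bce(D_u(fake_p.detach()), zeros) + bce(D_u(fake_n.detach()), zeros))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # Generators: G_p must fool both D_p and D_u; G_n must fool D_u
    # while being pushed away from the positive region by D_p.
    fake_p, fake_n = G_p(z1), G_n(z2)
    g_loss = (bce(D_p(fake_p), ones) + bce(D_u(fake_p), ones)
              + bce(D_u(fake_n), ones) + bce(D_p(fake_n), zeros))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```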


Author(s):  
NhatHai Phan ◽  
Minh N. Vu ◽  
Yang Liu ◽  
Ruoming Jin ◽  
Dejing Dou ◽  
...  

In this paper, we propose a novel Heterogeneous Gaussian Mechanism (HGM) to preserve differential privacy in deep neural networks, with provable robustness against adversarial examples. We first relax the constraint on the privacy budget in the traditional Gaussian Mechanism from (0, 1] to (0, ∞), with a new bound on the noise scale that preserves differential privacy. The noise in our mechanism can be arbitrarily redistributed, offering a distinctive ability to address the trade-off between model utility and privacy loss. To derive provable robustness, HGM is applied to inject Gaussian noise into the first hidden layer, and a tighter robustness bound is then derived. Theoretical analysis and thorough evaluations show that our mechanism notably improves the robustness of differentially private deep neural networks, compared with baseline approaches, under a variety of model attacks.
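
As a structural illustration, the sketch below injects per-neuron-scaled Gaussian noise into the first hidden layer of a network. Note that the base noise scale shown is the classical Gaussian mechanism bound (valid only for ε ∈ (0, 1]) and the allocation vector is a toy stand-in; HGM's relaxed bound for ε ∈ (0, ∞) and its exact redistribution constraint are derived in the paper and are not reproduced here.

```python
# Gaussian noise injection into the first hidden layer, with a toy
# heterogeneous per-neuron allocation (stand-in for HGM's mechanism).
import math
import torch
import torch.nn as nn

class NoisyFirstLayer(nn.Module):
    def __init__(self, d_in, d_hidden, epsilon, delta, sensitivity=1.0):
        super().__init__()
        self.fc = nn.Linear(d_in, d_hidden)
        # Classical Gaussian mechanism scale, valid for epsilon in (0, 1];
        # HGM replaces this with a relaxed bound valid on (0, infinity).
        # `sensitivity` assumes inputs are clipped to bound the layer's
        # per-example change.
        self.sigma = sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon
        # Toy heterogeneous allocation: per-neuron weights that redistribute
        # the noise while keeping its total budget fixed (an assumption
        # mirroring HGM's redistribution idea, not its exact constraint).
        r = torch.rand(d_hidden)
        self.register_buffer("alloc", r * d_hidden / r.sum())

    def forward(self, x):
        h = torch.relu(self.fc(x))
        noise = torch.randn_like(h) * self.sigma * self.alloc.sqrt()
        return h + noise
```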


Entropy ◽  
2020 ◽  
Vol 22 (12) ◽  
pp. 1379
Author(s):  
Betty Cortiñas-Lorenzo ◽  
Fernando Pérez-González

As training Deep Neural Networks (DNNs) becomes more expensive, interest in protecting the ownership of models with watermarking techniques increases. Uchida et al. proposed a digital watermarking algorithm that embeds a secret message into the model coefficients. However, despite its appeal, in this paper we show that its efficacy can be compromised by the optimization algorithm being used. In particular, we find through a theoretical analysis that, as opposed to Stochastic Gradient Descent (SGD), the update direction given by Adam optimization strongly depends on the sign of a combination of columns of the projection matrix used for watermarking. Consequently, as observed in our empirical results, this makes the coefficients move in unison, giving rise to heavily spiked weight distributions that can be easily detected by adversaries. To solve this problem, we propose a new method called Block-Orthonormal Projections (BOP) that allows watermarking to be combined with Adam optimization with only a minor impact on the detectability of the watermark and with increased robustness.
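
For reference, below is a minimal PyTorch sketch of the Uchida-style embedding regularizer, together with one plausible way to build a projection matrix from orthonormal column blocks in the spirit of BOP. The QR-based construction, block size, and regularization weight are illustrative assumptions, not necessarily the paper's exact scheme.

```python
# Uchida-style white-box watermark regularizer, plus an assumed
# block-orthonormal projection construction in the spirit of BOP.
import torch
import torch.nn.functional as F

def block_orthonormal_projection(T, d, block=None):
    """Build a T x d projection whose columns form orthonormal blocks
    (reduced QR on random Gaussian blocks of at most T columns)."""
    block = block or T                             # block width must not exceed T
    cols = []
    for start in range(0, d, block):
        w = min(block, d - start)
        q, _ = torch.linalg.qr(torch.randn(T, w))  # q: (T, w), orthonormal columns
        cols.append(q)
    return torch.cat(cols, dim=1)                  # shape (T, d)

def watermark_loss(conv_weight, X, b, lam=0.01):
    """Embedding regularizer: push sigmoid(X @ w) toward the secret bit
    string b, where w is the filter-averaged flattened kernel."""
    w = conv_weight.mean(dim=0).flatten()          # average over output filters
    return lam * F.binary_cross_entropy_with_logits(X @ w, b)

# Usage sketch: add watermark_loss(layer.weight, X, b) to the task loss at
# every training step; detection later thresholds sigmoid(X @ w) against b.
```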

