Online Knowledge Distillation with Diverse Peers

2020 ◽  
Vol 34 (04) ◽  
pp. 3430-3437
Author(s):  
Defang Chen ◽  
Jian-Ping Mei ◽  
Can Wang ◽  
Yan Feng ◽  
Chun Chen

Distillation is an effective knowledge-transfer technique that uses the predicted distributions of a powerful teacher model as soft targets to train a less-parameterized student model. A pre-trained, high-capacity teacher, however, is not always available. Recently proposed online variants use the aggregated intermediate predictions of multiple student models as targets to train each student model. Although group-derived targets give a good recipe for teacher-free distillation, group members homogenize quickly under simple aggregation functions, leading to early-saturated solutions. In this work, we propose Online Knowledge Distillation with Diverse peers (OKDDip), which performs two-level distillation during training with multiple auxiliary peers and one group leader. In the first-level distillation, each auxiliary peer holds an individual set of aggregation weights, generated with an attention-based mechanism, to derive its own targets from the predictions of the other auxiliary peers. Learning from distinct target distributions helps boost peer diversity, which is essential for effective group-based distillation. The second-level distillation then transfers the knowledge of the ensemble of auxiliary peers to the group leader, i.e., the model used for inference. Experimental results show that the proposed framework consistently outperforms state-of-the-art approaches without sacrificing training or inference complexity, demonstrating the effectiveness of the proposed two-level distillation framework.
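Below is a minimal PyTorch-style sketch of the first-level aggregation idea: each auxiliary peer derives its own soft targets from the other peers' predictions through attention-based weights. All tensor shapes and names (`peer_logits`, `feats`, the projection matrices) are illustrative assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def peer_targets(peer_logits, feats, W_q, W_k, T=3.0):
    """peer_logits: (m, B, C) logits of the m auxiliary peers.
    feats: (m, B, D) per-peer features; W_q, W_k: (D, D') projections."""
    q = feats @ W_q                              # queries, (m, B, D')
    k = feats @ W_k                              # keys, (m, B, D')
    # attention score of peer i on peer j, per sample: (B, m, m)
    scores = torch.einsum('ibd,jbd->bij', q, k)
    m = peer_logits.shape[0]
    eye = torch.eye(m, dtype=torch.bool, device=scores.device)
    scores = scores.masked_fill(eye, float('-inf'))  # exclude a peer's own prediction
    alpha = F.softmax(scores, dim=-1)            # individual aggregation weights
    probs = F.softmax(peer_logits / T, dim=-1)   # softened peer predictions
    # each peer's target is its own weighted mixture of the others' predictions
    return torch.einsum('bij,jbc->ibc', alpha, probs)   # (m, B, C)
```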

Author(s):  
Hsing-Hung Chou ◽  
Ching-Te Chiu ◽  
Yi-Ping Liao

Deep neural networks (DNNs) have solved many tasks, including image classification, object detection, and semantic segmentation. However, when a DNN model involves a huge number of parameters and a high level of computation, it becomes difficult to deploy on mobile devices. To address this difficulty, we propose an efficient compression method that can be split into three parts. First, we propose a cross-layer matrix to extract more features from the teacher's model. Second, we adopt Kullback-Leibler (KL) divergence in an offline environment to make the student model find a wider, more robust minimum. Finally, we propose an offline ensemble of pre-trained teachers to teach the student model. To address the dimension mismatch between teacher and student models, we adopt a $1\times 1$ convolution and two-stage knowledge distillation to relax this constraint. We conducted experiments with VGG and ResNet models, using the CIFAR-100 dataset. With VGG-11 as the teacher's model and VGG-6 as the student's model, experimental results showed that the Top-1 accuracy increased by 3.57% with a $2.08\times$ compression rate and $3.5\times$ computation rate. With ResNet-32 as the teacher's model and ResNet-8 as the student's model, experimental results showed that the Top-1 accuracy increased by 4.38% with a $6.11\times$ compression rate and $5.27\times$ computation rate. In addition, we conducted experiments using the ImageNet $64\times 64$ dataset. With MobileNet-16 as the teacher's model and MobileNet-9 as the student's model, experimental results showed that the Top-1 accuracy increased by 3.98% with a $1.59\times$ compression rate and $2.05\times$ computation rate.
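A hedged sketch (PyTorch) of two of the ingredients above: the temperature-softened KL distillation loss and a $1\times 1$ convolution that bridges a channel mismatch between teacher and student feature maps. The channel counts and layer names are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def kd_kl_loss(student_logits, teacher_logits, T=4.0):
    # KL(teacher || student) on temperature-softened distributions
    log_p_s = F.log_softmax(student_logits / T, dim=1)
    p_t = F.softmax(teacher_logits / T, dim=1)
    return F.kl_div(log_p_s, p_t, reduction='batchmean') * (T * T)

# 1x1 conv projecting a student feature map (here 128 channels) to the
# teacher's channel count (here 256) so intermediate features can be compared
adapter = nn.Conv2d(in_channels=128, out_channels=256, kernel_size=1)

def feature_match_loss(f_student, f_teacher):
    return F.mse_loss(adapter(f_student), f_teacher)
```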


2020 ◽  
Vol 34 (04) ◽  
pp. 3996-4003
Author(s):  
Micah Goldblum ◽  
Liam Fowl ◽  
Soheil Feizi ◽  
Tom Goldstein

Knowledge distillation is effective for producing small, high-performance neural networks for classification, but these small networks are vulnerable to adversarial attacks. This paper studies how adversarial robustness transfers from teacher to student during knowledge distillation. We first find that a large amount of robustness may be inherited by the student even when distilled on only clean images. We then introduce Adversarially Robust Distillation (ARD) for distilling robustness onto student networks. In addition to producing small models with high test accuracy like conventional distillation, ARD also passes the superior robustness of large networks onto the student. In our experiments, we find that ARD student models decisively outperform adversarially trained networks of identical architecture in terms of robust accuracy, surpassing state-of-the-art methods on standard robustness benchmarks. Finally, we adapt recent fast adversarial training methods to ARD for accelerated robust distillation.
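The following sketch conveys the ARD objective in PyTorch-style code: the student's prediction on an adversarially perturbed input is matched to the teacher's prediction on the clean input. The loss weighting `alpha` and the way `x_adv` is generated (e.g., by PGD against the distillation loss) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def ard_loss(student, teacher, x, y, x_adv, T=4.0, alpha=0.9):
    with torch.no_grad():
        t_clean = F.softmax(teacher(x) / T, dim=1)    # teacher on clean input
    s_adv = F.log_softmax(student(x_adv) / T, dim=1)  # student on attacked input
    kd = F.kl_div(s_adv, t_clean, reduction='batchmean') * (T * T)
    ce = F.cross_entropy(student(x), y)               # clean accuracy term
    return alpha * kd + (1 - alpha) * ce
```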


Algorithms ◽  
2021 ◽  
Vol 14 (5) ◽  
pp. 137
Author(s):  
Zhou Lei ◽  
Kangkang Yang ◽  
Kai Jiang ◽  
Shengbo Chen

Person re-identification (Re-ID) based on deep convolutional neural networks (CNNs) has achieved remarkable success with fast inference speed. However, prevailing Re-ID models are usually built upon backbones manually designed for classification. In order to automatically design an effective Re-ID architecture, we propose a pedestrian re-identification algorithm based on knowledge distillation, called KDAS-ReID. As the knowledge of the teacher model is transferred to the student model, the importance of the teacher's knowledge gradually decreases as the student model's performance improves. Therefore, instead of applying the distillation loss function directly, we use dynamic temperatures during the search stage and the training stage. Specifically, we start searching and training at a high temperature and gradually reduce the temperature to 1, so that the student model can better learn from the teacher model through soft targets. Extensive experiments demonstrate that KDAS-ReID outperforms not only other state-of-the-art Re-ID models on three benchmarks, but also the teacher model based on the ResNet-50 backbone.
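A small sketch of the dynamic-temperature idea, assuming a simple linear decay (the abstract only states that the temperature is gradually reduced to 1 during the search and training stages):

```python
def temperature(epoch, total_epochs, T_start=8.0, T_end=1.0):
    # linearly anneal the distillation temperature toward 1
    frac = min(epoch / max(total_epochs - 1, 1), 1.0)
    return T_start + (T_end - T_start) * frac

# usage inside the search/training loop (loss function assumed):
# T = temperature(epoch, num_epochs)
# loss = kd_loss(student_logits / T, teacher_logits / T)
```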


2020 ◽  
Vol 34 (04) ◽  
pp. 6542-6549
Author(s):  
Zhenhui Xu ◽  
Guolin Ke ◽  
Jia Zhang ◽  
Jiang Bian ◽  
Tie-Yan Liu

Model compression has become necessary when applying neural networks (NNs) to many real application tasks that can accept slightly reduced model accuracy but impose strict constraints on model complexity. Recently, knowledge distillation, which distills the knowledge of a well-trained and highly complex teacher model into a compact student model, has been widely used for model compression. However, under strict resource-cost requirements, it is quite challenging for the student model to achieve performance comparable to the teacher's, essentially due to the drastically reduced expressiveness of the compact student model. Inspired by the nature of expressiveness in NNs, we propose to use a multi-segment activation, which can significantly improve expressiveness at very little cost, in the compact student model. Specifically, we propose a highly efficient multi-segment activation, called Light Multi-segment Activation (LMA), which can rapidly produce multiple linear regions with very few parameters by leveraging statistical information. Using LMA, the compact student model achieves much better performance, both effectively and efficiently, than a ReLU-equipped one of the same model complexity. Furthermore, the proposed method is compatible with other model compression techniques, such as quantization, which means they can be used jointly for better compression performance. Experiments on state-of-the-art NN architectures over real-world tasks demonstrate the effectiveness and extensibility of LMA.
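As an illustration only, here is one way a multi-segment activation could be realized: a handful of learnable slopes over fixed breakpoints yields several linear regions at negligible parameter cost. The breakpoint placement is a plain assumption; the paper derives its segments from statistical information.

```python
import torch
import torch.nn as nn

class MultiSegmentActivation(nn.Module):
    def __init__(self, breakpoints=(-1.0, 0.0, 1.0)):
        super().__init__()
        self.register_buffer('bp', torch.tensor(breakpoints))
        # one learnable slope per breakpoint; acts like a sum of shifted ReLUs
        self.slopes = nn.Parameter(torch.ones(len(breakpoints)))

    def forward(self, x):
        out = torch.zeros_like(x)
        for b, s in zip(self.bp, self.slopes):
            out = out + s * torch.relu(x - b)  # adds one kink per breakpoint
        return out
```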


Author(s):  
Pengcheng Xu ◽  
Kyungsang Kim ◽  
Jeongwan Koh ◽  
Dufan Wu ◽  
Yu Rim Lee ◽  
...  

Segmentation has been widely used in diagnosis, lesion detection, and surgery planning. Although deep learning (DL)-based segmentation methods currently outperform traditional methods, most DL-based segmentation models are computationally expensive and memory inefficient, which makes them unsuitable for the intervention of liver surgery. To address this issue, a simple solution is to make the segmentation model very small for fast inference; however, there is a trade-off between model size and performance. In this paper, we propose a DL-based real-time 3-D liver CT segmentation method in which knowledge distillation (KD), i.e., knowledge transfer from teacher to student models, is incorporated to compress the model while preserving performance. Because knowledge transfer is known to be inefficient when the disparity between teacher and student model sizes is large, we propose a growing teacher assistant network (GTAN) to gradually learn the knowledge without extra computational cost, which can efficiently transfer knowledge even across a large gap between teacher and student model sizes. In our results, the dice similarity coefficient of the student model with KD improved by 1.2% (85.9% to 87.1%) compared to the student model without KD, achieving performance similar to that of the teacher model while using only 8% (100k) of the parameters. Furthermore, with a student model of 2% (30k) parameters, the proposed model using the GTAN improved the dice coefficient by about 2% compared to the student model without KD, with an inference time of 13 ms per case. Therefore, the proposed method has great potential for intervention in liver surgery and can also be utilized in many real-time applications.
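A minimal sketch of bridging a large teacher-student size gap by distilling through intermediate-capacity models. Note that the paper's GTAN grows a single assistant network; the fixed chain below is a simplifying assumption that conveys the same bridging idea.

```python
def distill_chain(teacher, assistants, student, train_with_soft_targets):
    """assistants: intermediate models ordered from largest to smallest."""
    source = teacher
    for model in assistants + [student]:
        train_with_soft_targets(student=model, teacher=source)
        source = model   # the freshly trained model becomes the next teacher
    return student
```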


2021 ◽  
Vol 11 (3) ◽  
pp. 1093
Author(s):  
Jeonghyun Lee ◽  
Sangkyun Lee

Convolutional neural networks (CNNs) have achieved tremendous success in solving complex classification problems. Motivated by this success, various compression methods have been proposed for downsizing CNNs to deploy them on resource-constrained embedded systems. However, a new type of vulnerability of compressed CNNs, known as adversarial examples, has been discovered recently; it is critical for security-sensitive systems because adversarial examples can cause CNNs to malfunction and can be crafted easily in many cases. In this paper, we propose a compression framework that produces compressed CNNs robust against such adversarial examples. To achieve this goal, our framework uses both pruning and knowledge distillation with adversarial training. We formulate our framework as an optimization problem and provide a solution algorithm based on the proximal gradient method, which is more memory-efficient than the popular ADMM-based compression approaches. In experiments, we show that our framework can improve the trade-off between adversarial robustness and compression rate compared to the existing state-of-the-art adversarial pruning approach.
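A hedged sketch of one proximal gradient step for sparsity-inducing pruning: a gradient step on the training loss followed by the proximal operator of an L1 penalty (soft-thresholding), which zeroes small weights. The paper's exact penalty and objective are not restated here, so this pairing is an assumption.

```python
import torch

@torch.no_grad()
def proximal_step(params, lr, lam):
    for p in params:
        p -= lr * p.grad                     # ordinary gradient step (backward() already called)
        # prox of lam * ||p||_1: shrink magnitudes, zeroing small weights
        p.copy_(torch.sign(p) * torch.clamp(p.abs() - lr * lam, min=0.0))
```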


Electronics ◽  
2021 ◽  
Vol 10 (14) ◽  
pp. 1614
Author(s):  
Jonghun Jeong ◽  
Jong Sung Park ◽  
Hoeseok Yang

Recently, the need to run high-performance neural networks (NNs) is increasing, even in resource-constrained embedded systems such as wearable devices. However, due to the high computational and memory requirements of NN applications, it is typically infeasible to execute them on a single device. Instead, it has been proposed to run a single NN application cooperatively on top of multiple devices, a so-called distributed neural network, in which the workload of a single big NN application is distributed over multiple tiny devices. While this approach effectively alleviates the computation overhead, existing distributed NN techniques, such as MoDNN, still suffer from heavy traffic between devices and vulnerability to communication failures. To eliminate such communication overhead, a knowledge distillation based distributed NN, called Network of Neural Networks (NoNN), was proposed; it partitions the filters in the final convolutional layer of the original NN into multiple independent subsets and derives a smaller NN from each subset. However, NoNN also has limitations: the partitioning result may be unbalanced, and it considerably compromises the correlation between filters in the original NN, which may result in unacceptable accuracy degradation in case of communication failure. In this paper, to overcome these issues, we propose to enhance the partitioning strategy of NoNN in two aspects. First, we enhance the redundancy of the filters that are used to derive the multiple smaller NNs by means of averaging, to increase the immunity of the distributed NN to communication failures. Second, we propose a novel partitioning technique, modified from eigenvector-based partitioning, to preserve the correlation between filters as much as possible while keeping the number of filters distributed to each device consistent. Through extensive experiments with the CIFAR-100 dataset, we observed that the proposed approach maintains high inference accuracy (over 70% on average, a 1.53× improvement over the state-of-the-art approach) even when half of the eight devices in a distributed NN fail to deliver their partial inference results.
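A rough sketch of balanced eigenvector-based partitioning: filters are ordered by the Fiedler vector of the Laplacian of a filter-affinity graph and split into equally sized groups. The affinity measure and the balancing rule are illustrative assumptions, not the paper's exact algorithm.

```python
import numpy as np

def balanced_spectral_partition(affinity, num_parts):
    """affinity: symmetric (F, F) nonnegative filter-affinity matrix."""
    laplacian = np.diag(affinity.sum(axis=1)) - affinity
    _, vecs = np.linalg.eigh(laplacian)      # eigenvectors, ascending eigenvalues
    fiedler = vecs[:, 1]                     # second-smallest eigenvector
    order = np.argsort(fiedler)              # correlated filters end up adjacent
    return np.array_split(order, num_parts)  # equal-sized filter groups
```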


2021 ◽  
Vol 13 (1) ◽  
pp. 1-25
Author(s):  
Michael Loster ◽  
Ioannis Koumarelas ◽  
Felix Naumann

The integration of multiple data sources is a common problem in a large variety of applications. Traditionally, handcrafted similarity measures are used to discover, merge, and integrate multiple representations of the same entity (duplicates) into a large homogeneous collection of data. Often, these similarity measures do not cope well with the heterogeneity of the underlying dataset. In addition, domain experts are needed to manually design and configure such measures, which is both time-consuming and requires extensive domain expertise. We propose a deep Siamese neural network capable of learning a similarity measure that is tailored to the characteristics of a particular dataset. Owing to the properties of deep learning methods, we are able to eliminate the manual feature-engineering process and thus considerably reduce the effort required for model construction. In addition, we show that it is possible to transfer knowledge acquired during the deduplication of one dataset to another, and thus significantly reduce the amount of data required to train a similarity measure. We evaluate our method on multiple datasets and compare our approach to state-of-the-art deduplication methods. Our approach outperforms competitors by up to +26 percent F-measure, depending on task and dataset. In addition, we show that knowledge transfer is not only feasible, but in our experiments led to an improvement in F-measure of up to +4.7 percent.
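A minimal sketch of the Siamese idea for record matching: two records pass through one shared encoder, and a small head scores the pair. The encoder architecture and input featurization are assumptions; the paper's network is more elaborate.

```python
import torch
import torch.nn as nn

class SiameseMatcher(nn.Module):
    def __init__(self, in_dim, hidden=128):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, hidden))
        self.head = nn.Linear(2 * hidden, 1)  # scores the record pair

    def forward(self, a, b):
        ea, eb = self.encoder(a), self.encoder(b)    # shared weights
        pair = torch.cat([torch.abs(ea - eb), ea * eb], dim=1)
        return torch.sigmoid(self.head(pair))        # duplicate probability
```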


2005 ◽  
Vol 128 (4) ◽  
pp. 678-688 ◽  
Author(s):  
Tung-King See ◽  
Kemper Lewis

Supporting the decision of a group in engineering design is a challenging and complicated problem when issues like consensus and compromise must be taken into account. In this paper, we present the foundations of the group hypothetical equivalents and inequivalents method and two fundamental extensions making it applicable to new classes of group decision problems. The first extension focuses on updating the formulation to place unequal importance on the preferences of the group members. The formulation presented in this paper allows team leaders to emphasize the input from certain group members based on experience or other factors. The second extension focuses on the theoretical implications of using a general class of aggregation functions. Illustration and validation of the developments are presented using a vehicle selection problem. Data from ten engineering design groups are used to demonstrate the application of the method.
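To make the first extension concrete, a weighted aggregation of member preference scores might look as follows; the weighted power mean and the example weights are assumptions standing in for the paper's general class of aggregation functions.

```python
def weighted_power_mean(scores, weights, p=1.0):
    """Aggregate one alternative's preference scores across group members."""
    assert abs(sum(weights) - 1.0) < 1e-9  # member weights sum to one
    return sum(w * s ** p for w, s in zip(weights, scores)) ** (1.0 / p)

# e.g., a team leader up-weights an experienced member's preference:
# weighted_power_mean([0.6, 0.9, 0.4], weights=[0.2, 0.5, 0.3], p=2.0)
```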


Sensors ◽  
2021 ◽  
Vol 21 (19) ◽  
pp. 6523
Author(s):  
Pieter Van Molle ◽  
Cedric De Boom ◽  
Tim Verbelen ◽  
Bert Vankeirsbilck ◽  
Jonas De Vylder ◽  
...  

Deep neural networks have achieved state-of-the-art performance in image classification. Due to this success, deep learning is now also being applied to other data modalities, such as multispectral images, lidar, and radar data. However, successfully training a deep neural network requires a large dataset. Therefore, transitioning to a new sensor modality (e.g., from regular camera images to multispectral camera images) might result in a drop in performance, due to the limited availability of data in the new modality. This might hinder the adoption rate and time to market for new sensor technologies. In this paper, we present an approach to leverage the knowledge of a teacher network, trained on the original data modality, to improve the performance of a student network on a new data modality: a technique known in the literature as knowledge distillation. By applying knowledge distillation to the problem of sensor transition, we can greatly speed up this process. We validate this approach using a multimodal version of the MNIST dataset. Especially when little data is available in the new modality (i.e., 10 images), training with additional teacher supervision results in increased performance, with the student network scoring a test set accuracy of 0.77, compared to an accuracy of 0.37 for the baseline. We also explore two extensions to the default method of knowledge distillation, which we evaluate on a multimodal version of the CIFAR-10 dataset: an annealing scheme for the hyperparameter α and selective knowledge distillation. Of these two, the first yields the best results: choosing the optimal annealing scheme results in an increase in test set accuracy of 6%. Finally, we apply our method to the real-world use case of skin lesion classification.
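A small sketch of an α-annealing scheme for the distillation weight, assuming a cosine decay from teacher-dominated to label-dominated supervision (the paper compares several schemes, and the best one is not restated here):

```python
import math

def alpha_schedule(epoch, total_epochs, alpha_start=0.9, alpha_end=0.1):
    # cosine decay of the teacher-imitation weight over training
    frac = min(epoch / max(total_epochs - 1, 1), 1.0)
    return alpha_end + 0.5 * (alpha_start - alpha_end) * (1 + math.cos(math.pi * frac))

# combined loss per step (loss terms assumed):
# alpha = alpha_schedule(epoch, num_epochs)
# loss = alpha * distillation_loss + (1 - alpha) * cross_entropy_loss
```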

