Progressive Blockwise Knowledge Distillation for Neural Network Acceleration

Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2018/384 ◽

2018 ◽

Cited By ~ 5

Author(s):

Hui Wang ◽

Hanbin Zhao ◽

Xi Li ◽

Xu Tan

Keyword(s):

Neural Network ◽

Function Approximation ◽

State Of The Art ◽

Design Criterion ◽

Structure Design ◽

Model Accuracy ◽

Teacher Student ◽

Block Level ◽

Knowledge Distillation ◽

Teacher Network

As an important and challenging problem in machine learning and computer vision, neural network acceleration essentially aims to enhance the computational efficiency without sacrificing the model accuracy too much. In this paper, we propose a progressive blockwise learning scheme for teacher-student model distillation at the subnetwork block level. The proposed scheme is able to distill the knowledge of the entire teacher network by locally extracting the knowledge of each block in terms of progressive blockwise function approximation. Furthermore, we propose a structure design criterion for the student subnetwork block, which is able to effectively preserve the original receptive field from the teacher network. Experimental results demonstrate the effectiveness of the proposed scheme against the state-of-the-art approaches.
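The progressive blockwise idea in the abstract above can be illustrated with a deliberately tiny sketch. Everything here is an assumption made for illustration: the "blocks" are 1-D functions, each hypothetical teacher block stacks two affine layers, each student block is a single affine map, and closed-form least squares stands in for gradient-based training. It is one simple reading of "progressive blockwise function approximation", not the authors' training procedure.

```python
# Toy illustration only: 1-D "blocks" and closed-form least squares stand in
# for real subnetworks and gradient-based training.

def fit_affine(xs, ys):
    """Least-squares fit of y ~ a*x + b; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = cov / var
    return a, my - a * mx


# Hypothetical teacher: three blocks, each made of two stacked affine layers.
teacher_blocks = [
    lambda x: 3.0 * (x + 1.0) - 2.0,
    lambda x: 0.5 * (2.0 * x - 4.0),
    lambda x: -1.0 * (x - 1.0) + 0.5,
]


def distill_progressively(blocks, inputs):
    """Replace teacher blocks one at a time with single-affine student blocks."""
    students = []
    current = list(inputs)  # activations entering the block being distilled
    for block in blocks:
        targets = [block(x) for x in current]  # teacher block output
        a, b = fit_affine(current, targets)    # student approximates this block
        students.append((a, b))
        # Later blocks see the outputs of the already-distilled student blocks.
        current = [a * x + b for x in current]
    return students


def run_student(students, x):
    """Apply the distilled student blocks in order."""
    for a, b in students:
        x = a * x + b
    return x


def run_teacher(blocks, x):
    """Apply the original teacher blocks in order."""
    for block in blocks:
        x = block(x)
    return x


data = [-2.0, -1.0, 0.0, 1.0, 2.0]
students = distill_progressively(teacher_blocks, data)
```

Because each toy teacher block is itself affine, the one-layer student blocks reproduce it exactly; in a real network the student block would only approximate the teacher block, and the approximation error is what the blockwise objective would minimize.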

Communication Failure Resilient Distributed Neural Network for Edge Devices

Electronics ◽

10.3390/electronics10141614 ◽

2021 ◽

Vol 10 (14) ◽

pp. 1614

Author(s):

Jonghun Jeong ◽

Jong Sung Park ◽

Hoeseok Yang

Keyword(s):

Neural Network ◽

Neural Networks ◽

High Performance ◽

State Of The Art ◽

Wearable Devices ◽

Communication Failure ◽

Canadian Institute ◽

Multiple Devices ◽

Knowledge Distillation ◽

Partitioning Technique

Recently, the need to run high-performance neural networks (NNs) has been increasing, even in resource-constrained embedded systems such as wearable devices. However, due to the high computational and memory requirements of NN applications, it is typically infeasible to execute them on a single device. Instead, it has been proposed to run a single NN application cooperatively on multiple devices, a so-called distributed neural network. In a distributed neural network, the workload of a single large NN application is distributed over multiple tiny devices. While this approach can effectively alleviate the computation overhead, existing distributed NN techniques, such as MoDNN, still suffer from heavy traffic between the devices and vulnerability to communication failures. To eliminate such large communication overheads, a knowledge-distillation-based distributed NN, called Network of Neural Networks (NoNN), was proposed, which partitions the filters in the final convolutional layer of the original NN into multiple independent subsets and derives a smaller NN from each subset. However, NoNN also has limitations: the partitioning result may be unbalanced, and it considerably compromises the correlation between filters in the original NN, which may result in unacceptable accuracy degradation in case of communication failure. In this paper, to overcome these issues, we propose to enhance the partitioning strategy of NoNN in two respects. First, we enhance the redundancy of the filters that are used to derive multiple smaller NNs by means of averaging, to increase the immunity of the distributed NN to communication failure. Second, we propose a novel partitioning technique, modified from eigenvector-based partitioning, to preserve the correlation between filters as much as possible while keeping the number of filters distributed to each device consistent. Through extensive experiments with the CIFAR-100 (Canadian Institute For Advanced Research-100) dataset, it has been observed that the proposed approach maintains high inference accuracy (over 70%, a 1.53× improvement over the state-of-the-art approach), on average, even when half of the eight devices in a distributed NN fail to deliver their partial inference results.

Viewpoint Robust Knowledge Distillation for Accelerating Vehicle Re-identification

10.21203/rs.3.rs-104548/v1 ◽

2020 ◽

Author(s):

Yi Xie ◽

Fei Shen ◽

Jianqing Zhu ◽

Huanqiang Zeng

Keyword(s):

Posterior Probability ◽

State Of The Art ◽

Probability Distributions ◽

Global Average ◽

Teacher Networks ◽

Leibler Divergence ◽

Deep Networks ◽

Speed Performance ◽

Knowledge Distillation ◽

Teacher Network

Vehicle re-identification is a challenging task that matches vehicle images captured by different cameras. Recent vehicle re-identification approaches exploit complex deep networks to learn viewpoint-robust features for accurate re-identification, which incurs heavy computation in the testing phase and limits re-identification speed. In this paper, we propose a viewpoint robust knowledge distillation (VRKD) method for accelerating vehicle re-identification. The VRKD method consists of a complex teacher network and a simple student network. Specifically, the teacher network uses quadruple directional deep networks to learn viewpoint-robust features. The student network contains only a shallow backbone sub-network and a global average pooling layer. The student network distills viewpoint-robust knowledge from the teacher network by minimizing the Kullback-Leibler divergence between the posterior probability distributions produced by the student and teacher networks. As a result, vehicle re-identification is significantly accelerated, since only the student network, with its small testing computation, is required. Experiments on the VeRi776 and VehicleID datasets show that the proposed VRKD method outperforms many state-of-the-art vehicle re-identification approaches in both accuracy and speed.
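The distillation objective described above, a Kullback-Leibler divergence between teacher and student posterior distributions, can be sketched in a few lines. The logits below are made-up values, and the direction KL(teacher || student) is an assumption (it is the common choice in knowledge distillation; the abstract does not state which direction VRKD uses).

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


# Hypothetical identity logits for one vehicle image.
teacher_logits = [4.0, 1.0, 0.5]
student_logits = [2.5, 1.5, 0.2]

teacher_probs = softmax(teacher_logits)
student_probs = softmax(student_logits)

# Distillation loss: how far the student's posterior is from the teacher's.
loss = kl_divergence(teacher_probs, student_probs)
```

The loss is zero only when the two distributions coincide, so minimizing it pushes the student's posterior toward the teacher's.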

Tying of embeddings for improving regularization in neural networks for named entity recognition task

Bulletin of Taras Shevchenko National University of Kyiv. Series: Physics and Mathematics ◽

10.17721/1812-5409.2018/3.8 ◽

2018 ◽

pp. 59-64

Author(s):

M. Bevza

Keyword(s):

Neural Network ◽

Network Architecture ◽

State Of The Art ◽

Named Entity Recognition ◽

Recognition Task ◽

Entity Recognition ◽

Named Entity ◽

Part Of Speech ◽

Recent Developments ◽

Knowledge Distillation

We analyze neural network architectures that yield state-of-the-art results on the named entity recognition task and propose a new architecture for improving results even further. We have analyzed a number of ideas and approaches that researchers have used to achieve state-of-the-art results in a variety of NLP tasks. In this work, we present the few that we consider most likely to improve existing state-of-the-art solutions for named entity recognition. The architecture is inspired by recent developments in language modeling. The suggested solution is based on a multi-task learning approach. We incorporate part-of-speech tags, produced by a state-of-the-art tagger, as input to the network, and we also ask the network to produce those tags in addition to the main named entity recognition tags. In this way, knowledge is distilled from a strong part-of-speech tagger into our smaller network. We hypothesize that designing the neural network architecture in this way improves the generalizability of the system, and we provide arguments to support this statement.

Hierarchical Knowledge Squeezed Adversarial Network Compression

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6799 ◽

2020 ◽

Vol 34 (07) ◽

pp. 11370-11377

Author(s):

Peng Li ◽

Chang Shu ◽

Yuan Xie ◽

Yan Qu ◽

Hui Kong

Keyword(s):

State Of The Art ◽

Teacher Student ◽

Adversarial Network ◽

Benchmark Datasets ◽

Knowledge Distillation ◽

Adversarial Training ◽

Rich Information ◽

Process Oriented ◽

Transfer Method ◽

Network Compression

Deep network compression has achieved notable progress via knowledge distillation, where a teacher-student learning scheme is adopted with a predetermined loss. Recently, more attention has shifted to employing adversarial training to minimize the discrepancy between the output distributions of the two networks. However, these methods emphasize result-oriented learning while neglecting process-oriented learning, leading to the loss of rich information contained in the whole network pipeline. In other (non-GAN-based) process-oriented methods, by contrast, the knowledge has usually been transferred in a redundant manner. Observing that a small network cannot perfectly mimic a large one due to the huge gap in network scale, we propose a knowledge transfer method, involving effective intermediate supervision, under the adversarial training framework to learn the student network. Unlike other intermediate supervision methods, we design the knowledge representation in a compact form by introducing a task-driven attention mechanism. Meanwhile, to improve the representation capability of the attention-based method, a hierarchical structure is utilized so that powerful but highly squeezed knowledge is realized and the knowledge from the teacher network can accommodate the size of the student network. Extensive experimental results on three typical benchmark datasets, i.e., CIFAR-10, CIFAR-100, and ImageNet, demonstrate that our method achieves highly superior performance against state-of-the-art methods.

Light Multi-Segment Activation for Model Compression

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.6128 ◽

2020 ◽

Vol 34 (04) ◽

pp. 6542-6549

Author(s):

Zhenhui Xu ◽

Guolin Ke ◽

Jia Zhang ◽

Jiang Bian ◽

Tie-Yan Liu

Keyword(s):

State Of The Art ◽

Model Complexity ◽

Student Model ◽

Model Accuracy ◽

Compression Performance ◽

Model Compression ◽

Comparable Performance ◽

Knowledge Distillation ◽

Resource Cost ◽

Strict Requirement

Model compression has become necessary when applying neural networks (NNs) to many real application tasks that can accept slightly reduced model accuracy but have strict limits on model complexity. Recently, knowledge distillation, which distills the knowledge from a well-trained and highly complex teacher model into a compact student model, has been widely used for model compression. However, under a strict requirement on resource cost, it is quite challenging to make the student model achieve performance comparable to the teacher's, essentially due to the drastically reduced expressive ability of the compact student model. Inspired by the nature of expressive ability in NNs, we propose to use multi-segment activation, which can significantly improve expressive ability at very little cost, in the compact student model. Specifically, we propose a highly efficient multi-segment activation, called Light Multi-segment Activation (LMA), which can rapidly produce multiple linear regions with very few parameters by leveraging statistical information. With LMA, the compact student model achieves much better performance, effectively and efficiently, than the ReLU-equipped one with the same model complexity. Furthermore, the proposed method is compatible with other model compression techniques, such as quantization, which means they can be used jointly for better compression performance. Experiments on state-of-the-art NN architectures over real-world tasks demonstrate the effectiveness and extensibility of LMA.
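To make "multiple linear regions" concrete, here is a generic continuous piecewise-linear activation compared with ReLU. This is not LMA's actual formulation, which the abstract does not give; the function name `multi_segment`, the breakpoints (chosen here as a hypothetical mean minus/plus one standard deviation of the pre-activations), and the slopes are all assumptions for illustration.

```python
import bisect


def relu(x):
    """Standard ReLU: two linear segments with a single breakpoint at 0."""
    return x if x > 0.0 else 0.0


def multi_segment(x, breakpoints, slopes, offset=0.0):
    """Continuous piecewise-linear activation.

    breakpoints: sorted list of K segment boundaries.
    slopes: K + 1 slopes, one per segment (left to right).
    offset: function value at the first breakpoint.
    """
    seg = bisect.bisect_right(breakpoints, x)  # index of the segment holding x
    if seg == 0:
        return offset + slopes[0] * (x - breakpoints[0])
    y = offset
    # Accumulate the rise over every full segment to the left of x.
    for i in range(1, seg):
        y += slopes[i] * (breakpoints[i] - breakpoints[i - 1])
    return y + slopes[seg] * (x - breakpoints[seg - 1])


# Hypothetical statistics of the pre-activations: mean 0, standard deviation 1.
mean, std = 0.0, 1.0
breakpoints = [mean - std, mean, mean + std]
slopes = [0.0, 0.2, 1.0, 0.5]  # four segments instead of ReLU's two

outputs = [multi_segment(x, breakpoints, slopes) for x in (-3.0, -0.5, 0.5, 2.0)]
```

ReLU is the special case `breakpoints=[0.0]`, `slopes=[0.0, 1.0]`; each extra breakpoint adds one slope parameter and one more linear region, which is the sense in which such activations add expressiveness cheaply.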

Viewpoint robust knowledge distillation for accelerating vehicle re-identification

EURASIP Journal on Advances in Signal Processing ◽

10.1186/s13634-021-00767-x ◽

2021 ◽

Vol 2021 (1) ◽

Author(s):

Yi Xie ◽

Fei Shen ◽

Jianqing Zhu ◽

Huanqiang Zeng

Keyword(s):

Posterior Probability ◽

State Of The Art ◽

Probability Distributions ◽

Global Average ◽

Teacher Networks ◽

Leibler Divergence ◽

Deep Networks ◽

Speed Performance ◽

Knowledge Distillation ◽

Teacher Network

Vehicle re-identification is a challenging task that matches vehicle images captured by different cameras. Recent vehicle re-identification approaches exploit complex deep networks to learn viewpoint-robust features for accurate re-identification, which incurs heavy computation in the testing phase and limits re-identification speed. In this paper, we propose a viewpoint robust knowledge distillation (VRKD) method for accelerating vehicle re-identification. The VRKD method consists of a complex teacher network and a simple student network. Specifically, the teacher network uses quadruple directional deep networks to learn viewpoint-robust features. The student network contains only a shallow backbone sub-network and a global average pooling layer. The student network distills viewpoint-robust knowledge from the teacher network by minimizing the Kullback-Leibler divergence between the posterior probability distributions produced by the student and teacher networks. As a result, vehicle re-identification is significantly accelerated, since only the student network, with its small testing computation, is required. Experiments on the VeRi776 and VehicleID datasets show that the proposed VRKD method outperforms many state-of-the-art vehicle re-identification approaches in both accuracy and speed.

MiDTD: A Simple and Effective Distillation Framework for Distantly Supervised Relation Extraction

ACM Transactions on Information Systems ◽

10.1145/3503917 ◽

2022 ◽

Vol 40 (4) ◽

pp. 1-32

Author(s):

Rui Li ◽

Cheng Yang ◽

Tingwei Li ◽

Sen Su

Keyword(s):

Information Extraction ◽

Temperature Regulation ◽

State Of The Art ◽

Relation Extraction ◽

Dynamic Temperature ◽

Teacher Student ◽

Distant Supervision ◽

Annotation Data ◽

Knowledge Distillation ◽

Moderate Range

Relation extraction (RE), an important information extraction task, faces the great challenge of limited annotated data. To address this, distant supervision was proposed to automatically label RE data, which greatly increases the number of annotated instances. Unfortunately, the many noisy relation annotations introduced by automatic labeling have become a new obstacle. Some recent studies have shown that the teacher-student framework of knowledge distillation can alleviate the interference of noisy relation annotations via label softening. Nevertheless, we find that they still suffer from two problems: propagation of inaccurate dark knowledge and the constraint of a unified distillation temperature. In this article, we propose a simple and effective Multi-instance Dynamic Temperature Distillation (MiDTD) framework, which is model-agnostic and mainly involves two modules: multi-instance target fusion (MiTF) and dynamic temperature regulation (DTR). MiTF combines the teacher’s predictions for multiple sentences with the same entity pair to amend the inaccurate dark knowledge in each student’s target. DTR allocates alterable distillation temperatures to different training instances so that the softness of most student targets is regulated to a moderate range. In experiments, we construct three concrete MiDTD instantiations with BERT, PCNN, and BiLSTM-based RE models, and the distilled students significantly outperform their teachers and the state-of-the-art (SOTA) methods.
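The two ingredients named above, label softening by a distillation temperature and fusion of teacher predictions across sentences, can be sketched as follows. The logits are made-up values, and plain averaging in `fuse_predictions` is an assumption: the abstract says MiTF "combines" predictions without giving the rule, and DTR's rule for choosing a per-instance temperature is not shown here.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Softmax of logits / temperature; a higher temperature gives softer targets."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy in nats, used here as a simple measure of softness."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def fuse_predictions(per_sentence_probs):
    """Average teacher distributions for sentences sharing an entity pair (assumed rule)."""
    n = len(per_sentence_probs)
    return [sum(column) / n for column in zip(*per_sentence_probs)]


# Hypothetical teacher logits over three relation labels.
teacher_logits = [5.0, 1.0, 0.0]
sharp = softmax_with_temperature(teacher_logits, 1.0)
soft = softmax_with_temperature(teacher_logits, 4.0)

# Two hypothetical sentences that mention the same entity pair.
fused = fuse_predictions([[0.8, 0.2], [0.4, 0.6]])
```

A single shared temperature softens every instance by the same factor, which is the "unified distillation temperature" constraint the abstract criticizes; a per-instance temperature would be passed to `softmax_with_temperature` separately for each training example.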

Uncertainty-Aware Multi-Shot Knowledge Distillation for Image-Based Object Re-Identification

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i07.6774 ◽

2020 ◽

Vol 34 (07) ◽

pp. 11165-11172 ◽

Cited By ~ 2

Author(s):

Xin Jin ◽

Cuiling Lan ◽

Wenjun Zeng ◽

Zhibo Chen

Keyword(s):

Student Learning ◽

State Of The Art ◽

Feature Learning ◽

The State ◽

Single Image ◽

Multiple Images ◽

Teacher Student ◽

Comprehensive Information ◽

Specific Object ◽

Knowledge Distillation

Object re-identification (re-id) aims to identify a specific object across times or camera views, with the person re-id and vehicle re-id as the most widely studied applications. Re-id is challenging because of the variations in viewpoints, (human) poses, and occlusions. Multi-shots of the same object can cover diverse viewpoints/poses and thus provide more comprehensive information. In this paper, we propose exploiting the multi-shots of the same identity to guide the feature learning of each individual image. Specifically, we design an Uncertainty-aware Multi-shot Teacher-Student (UMTS) Network. It consists of a teacher network (T-net) that learns the comprehensive features from multiple images of the same object, and a student network (S-net) that takes a single image as input. In particular, we take into account the data dependent heteroscedastic uncertainty for effectively transferring the knowledge from the T-net to S-net. To the best of our knowledge, we are the first to make use of multi-shots of an object in a teacher-student learning manner for effectively boosting the single image based re-id. We validate the effectiveness of our approach on the popular vehicle re-id and person re-id datasets. In inference, the S-net alone significantly outperforms the baselines and achieves the state-of-the-art performance.

The Convolution Neural Network Combined with the HT Person Fit Statistic to Develop an APP for Detecting Dengue Fever in Children: Development and Usability Study (Preprint)

10.2196/preprints.16347 ◽

2019 ◽

Author(s):

CHIEN WEI ◽

Chi Chow Julie ◽

Chou Willy

Keyword(s):

Neural Network ◽

Dengue Fever ◽

Prediction Accuracy ◽

Early Stage ◽

Family Members ◽

Convolution Neural Network ◽

Public Health Issue ◽

Person Fit ◽

Model Accuracy ◽

Comparison Results

Background: Dengue fever (DF) is an important public health issue in Asia. However, the disease is extremely hard to detect using traditional dichotomous (i.e., absent vs. present) evaluations of symptoms. Convolution neural network (CNN), a well-established deep learning method, can improve prediction accuracy on account of its usage of a large number of parameters for modeling. Whether the HT person fit statistic can be combined with CNN to increase the prediction accuracy of the model and develop an application (APP) to detect DF in children remains unknown. Objectives: The aim of this study is to build a model for the automatic detection and classification of DF with symptoms to help patients, family members, and clinicians identify the disease at an early stage. Methods: We extracted 19 feature variables of DF-related symptoms from 177 pediatric patients (69 diagnosed with DF) using CNN to predict DF risk. The accuracies of two sets of characteristics (19 symptoms and four other variables, including person mean, standard deviation, and two HT-related statistics matched to DF+ and DF−) for predicting DF were then compared. Data were separated into training and testing sets, and the former was used to predict the latter. We calculated the sensitivity (Sens), specificity (Spec), and area under the receiver operating characteristic curve (AUC) across studies for comparison. Results: We observed that (1) the 23-item model yields a higher accuracy rate (0.95) and AUC (0.94) than the 19-item model (accuracy = 0.92, AUC = 0.90) based on the 177-case training set; (2) the Sens values are higher than the corresponding Spec values in most cases (90% of 10 scenarios) for predicting DF; (3) the Sens and Spec values of the 23-item model are consistently higher than those of the 19-item model. An APP was subsequently designed to detect DF in children.
Conclusion: The 23-item model yielded higher accuracy rates (0.95) and AUC (0.94) than the 19-item model (accuracy = 0.92, AUC = 0.90). An APP could be developed to help patients, family members, and clinicians discriminate DF from other febrile illnesses at an early stage.
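For readers unfamiliar with the metrics reported above, the sketch below computes sensitivity, specificity, and accuracy from binary labels using their standard definitions. The label vectors are invented for illustration and are not data from the study.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 = DF present."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def sensitivity(tp, fn):
    """Share of true DF cases that the model flags (true positive rate)."""
    return tp / (tp + fn)


def specificity(tn, fp):
    """Share of non-DF cases that the model clears (true negative rate)."""
    return tn / (tn + fp)


def accuracy(tp, tn, fp, fn):
    """Share of all cases classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Invented example: 4 DF cases and 6 non-DF cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
sens = sensitivity(tp, fn)
spec = specificity(tn, fp)
acc = accuracy(tp, tn, fp, fn)
```

Sensitivity and specificity are computed on disjoint subsets of the cases (DF and non-DF respectively), which is why a model can score higher on one than on the other even when overall accuracy is high.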

CoCoX: Generating Conceptual and Counterfactual Explanations via Fault-Lines

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i03.5643 ◽

2020 ◽

Vol 34 (03) ◽

pp. 2594-2601

Author(s):

Arjun Akula ◽

Shuai Wang ◽

Song-Chun Zhu

Keyword(s):

Neural Network ◽

State Of The Art ◽

Input Image ◽

Classification Model ◽

Learning Models ◽

Fault Line ◽

Semantic Level ◽

Explainable Ai ◽

Fault Lines ◽

Classification Category

We present CoCoX (short for Conceptual and Counterfactual Explanations), a model for explaining decisions made by a deep convolutional neural network (CNN). In Cognitive Psychology, the factors (or semantic-level features) that humans zoom in on when they imagine an alternative to a model prediction are often referred to as fault-lines. Motivated by this, our CoCoX model explains decisions made by a CNN using fault-lines. Specifically, given an input image I for which a CNN classification model M predicts class c_pred, our fault-line based explanation identifies the minimal semantic-level features (e.g., stripes on zebra, pointed ears of dog), referred to as explainable concepts, that need to be added to or deleted from I in order to alter the classification category of I by M to another specified class c_alt. We argue that, due to the conceptual and counterfactual nature of fault-lines, our CoCoX explanations are practical and more natural for both expert and non-expert users to understand the internal workings of complex deep learning models. Extensive quantitative and qualitative experiments verify our hypotheses, showing that CoCoX significantly outperforms the state-of-the-art explainable AI models. Our implementation is available at https://github.com/arjunakula/CoCoX
