Novel Model Based on Stacked Autoencoders with Sample-Wise Strategy for Fault Diagnosis

Autoencoders are used for fault diagnosis in chemical engineering. To improve their performance, researchers have focused on regularization strategies and on designing new, more effective cost functions. However, existing methods modify only a single model. This study offers a new perspective for strengthening a fault diagnosis model: it extracts useful information from one model (the teacher model) and applies it to a new model (the student model). The teacher model is first pretrained by fitting ground truth labels, and a sample-wise strategy is then used to transfer knowledge from it. The transferred knowledge and the ground truth labels are used together to train a student model whose structure is identical to that of the teacher. The current student model then serves as the teacher of the next student model. After step-by-step teacher-student reconfiguration and training, the best-performing model is selected for fault diagnosis. Knowledge distillation is also applied during training. The proposed method is applied to several benchmark problems to demonstrate its effectiveness.

Robust cross-lingual knowledge base question answering via knowledge distillation

Data Technologies and Applications ◽ 10.1108/dta-12-2020-0312 ◽ 2021 ◽ Vol ahead-of-print (ahead-of-print)

Author(s): Shaofei Wang ◽ Depeng Dang

Keyword(s): Knowledge Base ◽ Design Methodology ◽ Question Answering ◽ Ground Truth ◽ Student Model ◽ Content Type ◽ Final Performance ◽ Knowledge Distillation ◽ The Cross ◽ Cross Lingual

Purpose: Previous knowledge base question answering (KBQA) models consider only the monolingual scenario and cannot be directly extended to the cross-lingual scenario, in which the language of the questions differs from that of the knowledge base (KB). Although a machine translation (MT) model can bridge the gap by translating questions into the language of the KB, noise in the translated questions can accumulate and sharply impair the final performance. Therefore, the authors propose a method to improve the robustness of KBQA models in the cross-lingual scenario.

Design/methodology/approach: The authors propose a knowledge distillation-based robustness enhancement (KDRE) method. First, a monolingual model (teacher) is trained on ground truth (GT) data. Then, to imitate practical noise, a noise-generating model is designed to inject two types of noise into questions: general noise and translation-aware noise. Finally, the noisy questions are fed into the student model, which is jointly trained on GT data and distilled data; the distilled data are derived from the teacher when it is fed GT questions.

Findings: The experimental results demonstrate that KDRE can improve the performance of models in the cross-lingual scenario. The performance of each module in the KBQA model is improved by KDRE. The knowledge distillation (KD) and the noise-generating model in the method complement each other in boosting the robustness of models.

Originality/value: The authors are the first to extend KBQA models from the monolingual to the cross-lingual scenario. The authors are also the first to apply KD to KBQA to develop robust cross-lingual models.

Knowledge distillation in deep learning and its applications

PeerJ Computer Science ◽ 10.7717/peerj-cs.474 ◽ 2021 ◽ Vol 7 ◽ pp. e474

Author(s): Abdolmaged Alkhulaifi ◽ Fahad Alsahli ◽ Irfan Ahmad

Keyword(s): Deep Learning ◽ Mobile Phones ◽ Learning Models ◽ Student Model ◽ Embedded Devices ◽ Research Directions ◽ Resource Limited ◽ Knowledge Distillation ◽ Teacher Model

Deep learning based models are relatively large, and it is hard to deploy such models on resource-limited devices such as mobile phones and embedded devices. One possible solution is knowledge distillation whereby a smaller model (student model) is trained by utilizing the information from a larger model (teacher model). In this paper, we present an outlook of knowledge distillation techniques applied to deep learning models. To compare the performances of different techniques, we propose a new metric called distillation metric which compares different knowledge distillation solutions based on models' sizes and accuracy scores. Based on the survey, some interesting conclusions are drawn and presented in this paper including the current challenges and possible research directions.
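The teacher-student training described above can be illustrated with a minimal sketch of a commonly used distillation objective: a weighted sum of the usual cross-entropy on the hard label and a temperature-softened KL term between teacher and student outputs. This is not taken from the survey itself; the function names, the `temperature` and `alpha` defaults, and the pure-Python setup are assumptions chosen for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax; higher temperatures give softer outputs."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(student_logits, teacher_logits, label,
                      temperature=4.0, alpha=0.5):
    """Hypothetical combined loss: alpha * hard-label CE + (1 - alpha) * soft KL.

    The soft term is multiplied by temperature**2 so that its gradient scale
    stays comparable to the hard term as the temperature changes.
    """
    hard = -math.log(softmax(student_logits)[label])
    soft = kl_divergence(softmax(teacher_logits, temperature),
                         softmax(student_logits, temperature))
    return alpha * hard + (1.0 - alpha) * temperature ** 2 * soft


if __name__ == "__main__":
    teacher = [0.5, 1.0, 4.0]   # made-up logits of a larger model
    student = [1.0, 1.5, 2.0]   # made-up logits of a smaller model
    print(distillation_loss(student, teacher, label=2))
```

In practice the logits would come from real networks and the loss would be minimized with an automatic-differentiation framework; the sketch only shows how the two terms are combined.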

Estimation of Pedestrian Pose Orientation Using Soft Target Training Based on Teacher–Student Framework

Sensors ◽ 10.3390/s19051147 ◽ 2019 ◽ Vol 19 (5) ◽ pp. 1147 ◽ Cited By ~ 1

Author(s): DuYeong Heo ◽ Jae Nam ◽ Byoung Ko

Keyword(s): Supervised Learning ◽ Spatial Information ◽ Classification Performance ◽ Input Image ◽ Student Model ◽ Teacher Student ◽ Specific Shape ◽ Soft Target ◽ Target Data ◽ Teacher Model

Semi-supervised learning is known to achieve better generalisation than a model learned solely from labelled data. Therefore, we propose a new method for estimating a pedestrian pose orientation using a soft-target method, which is a type of semi-supervised learning method. Because a convolutional neural network (CNN) based pose orientation estimation requires large numbers of parameters and operations, we apply the teacher–student algorithm to generate a compressed student model with high accuracy and compactness resembling that of the teacher model by combining a deep network with a random forest. After the teacher model is generated using hard target data, the softened outputs (soft-target data) of the teacher model are used for training the student model. Moreover, the orientation of the pedestrian has specific shape patterns, and a wavelet transform is applied to the input image as a pre-processing step owing to its good spatial frequency localisation property and the ability to preserve both the spatial information and gradient information of an image. For a benchmark dataset considering real driving situations based on a single camera, we used the TUD and KITTI datasets. We applied the proposed algorithm to various driving images in the datasets, and the results indicate that its classification performance with regard to the pose orientation is better than that of other state-of-the-art methods based on a CNN. In addition, the computational speed of the proposed student model is faster than that of other deep CNNs owing to the shorter model structure with a smaller number of parameters.

Deep Unsupervised Hashing for Large-Scale Cross-Modal Retrieval Using Knowledge Distillation Model

Computational Intelligence and Neuroscience ◽ 10.1155/2021/5107034 ◽ 2021 ◽ Vol 2021 ◽ pp. 1-11

Author(s): Mingyong Li ◽ Qiqi Li ◽ Lirong Tang ◽ Shuang Peng ◽ Yan Ma ◽ ...

Keyword(s): Large Scale ◽ Data Retrieval ◽ Multimedia Data ◽ Search Performance ◽ Similarity Matrix ◽ Student Model ◽ Deep Hashing ◽ Knowledge Distillation ◽ Semantic Alignment ◽ Teacher Model

Cross-modal hashing encodes heterogeneous multimedia data into compact binary codes to achieve fast and flexible retrieval across different modalities. Owing to its low storage cost and high retrieval efficiency, it has received widespread attention. Supervised deep hashing significantly improves search performance and usually yields more accurate results, but it requires extensive manual annotation of the data. In contrast, unsupervised deep hashing struggles to achieve satisfactory performance because it lacks reliable supervisory information. To solve this problem, inspired by knowledge distillation, we propose a novel unsupervised knowledge distillation cross-modal hashing method based on semantic alignment (SAKDH). It reconstructs the similarity matrix using the hidden correlation information of a pretrained unsupervised teacher model, and the reconstructed similarity matrix is used to guide the supervised student model. First, the teacher model adopts an unsupervised semantic alignment hashing method, which constructs a modal fusion similarity matrix. Second, under the supervision of the information distilled from the teacher model, the student model generates more discriminative hash codes. Experimental results on two extensive benchmark datasets (MIRFLICKR-25K and NUS-WIDE) show that, compared with several representative unsupervised cross-modal hashing methods, the proposed method achieves a significant improvement in mean average precision (MAP), which demonstrates its effectiveness in large-scale cross-modal data retrieval.

Comparing Kullback-Leibler Divergence and Mean Squared Error Loss in Knowledge Distillation

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽ 10.24963/ijcai.2021/362 ◽ 2021

Author(s): Taehyeon Kim ◽ Jaehoon Oh ◽ Nak Yil Kim ◽ Sangwook Cho ◽ Se-Young Yun

Keyword(s): Mean Squared Error ◽ Probability Distributions ◽ Student Model ◽ Kl Divergence ◽ Squared Error ◽ Leibler Divergence ◽ Temperature Scaling ◽ Knowledge Distillation ◽ The Mean ◽ Teacher Model

Knowledge distillation (KD), which transfers knowledge from a cumbersome teacher model to a lightweight student model, has been investigated as a way to design efficient neural architectures. Generally, the objective function of KD is the Kullback-Leibler (KL) divergence loss between the softened probability distributions of the teacher model and the student model, with the temperature scaling hyperparameter τ. Despite its widespread use, few studies have discussed how such softening influences generalization. Here, we theoretically show that the KL divergence loss focuses on logit matching when τ increases and on label matching when τ goes to 0, and we empirically show that logit matching is positively correlated with performance improvement in general. From this observation, we consider an intuitive KD loss function, the mean squared error (MSE) between the logit vectors, so that the student model can directly learn the logits of the teacher model. The MSE loss outperforms the KL divergence loss, which is explained by the difference in penultimate-layer representations between the two losses. Furthermore, we show that sequential distillation can improve performance and that KD, particularly when using the KL divergence loss with small τ, mitigates label noise. The code to reproduce the experiments is publicly available online at https://github.com/jhoon-oh/kd_data/.
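The two losses compared in this abstract can be sketched numerically. The code below is an illustration under stated assumptions, not the authors' released implementation: the function names are hypothetical, the KL term is scaled by τ² (a common convention), and `high_temperature_limit` is the second-order approximation of that scaled KL term at large τ, which reduces to a squared error between mean-centred logit differences.

```python
import math


def softmax(logits, tau=1.0):
    """Softmax of logits divided by the temperature tau."""
    scaled = [z / tau for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_kl_loss(student_logits, teacher_logits, tau):
    """tau^2 * KL(teacher_soft || student_soft) at temperature tau."""
    p = softmax(teacher_logits, tau)
    q = softmax(student_logits, tau)
    return tau ** 2 * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def kd_mse_loss(student_logits, teacher_logits):
    """Mean squared error taken directly between the two logit vectors."""
    diffs = [(s - t) ** 2 for s, t in zip(student_logits, teacher_logits)]
    return sum(diffs) / len(diffs)


def high_temperature_limit(student_logits, teacher_logits):
    """Large-tau approximation of kd_kl_loss: sum((d - mean(d))^2) / (2K)."""
    d = [s - t for s, t in zip(student_logits, teacher_logits)]
    mean_d = sum(d) / len(d)
    return sum((di - mean_d) ** 2 for di in d) / (2 * len(d))


if __name__ == "__main__":
    student = [2.0, 0.5, -1.0]   # made-up logits
    teacher = [1.5, 1.0, -0.5]   # made-up logits
    for tau in (1.0, 10.0, 1000.0):
        print(tau, kd_kl_loss(student, teacher, tau))
    print("limit", high_temperature_limit(student, teacher))
    print("mse", kd_mse_loss(student, teacher))
```

As τ grows, the printed KL values should approach the value of `high_temperature_limit`, which is one way to see the logit-matching behaviour described in the abstract.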

KDAS-ReID: Architecture Search for Person Re-Identification via Distilled Knowledge with Dynamic Temperature

Algorithms ◽ 10.3390/a14050137 ◽ 2021 ◽ Vol 14 (5) ◽ pp. 137

Author(s): Zhou Lei ◽ Kangkang Yang ◽ Kai Jiang ◽ Shengbo Chen

Keyword(s): State Of The Art ◽ Identification Algorithm ◽ Student Model ◽ Deep Convolutional Neural Networks ◽ Fast Speed ◽ Training Stage ◽ Knowledge Distillation ◽ And Training ◽ Better Than ◽ Teacher Model

Person re-identification (Re-ID) based on deep convolutional neural networks (CNNs) has achieved remarkable success with its fast speed. However, prevailing Re-ID models are usually built upon backbones that were manually designed for classification. To design an effective Re-ID architecture automatically, we propose a pedestrian re-identification algorithm based on knowledge distillation, called KDAS-ReID. When the knowledge of the teacher model is transferred to the student model, the importance of the teacher's knowledge gradually decreases as the performance of the student model improves. Therefore, instead of applying the distillation loss function directly, we use dynamic temperatures during the search stage and the training stage. Specifically, we start searching and training at a high temperature and gradually reduce the temperature to 1 so that the student model can better learn from the teacher model through soft targets. Extensive experiments demonstrate that KDAS-ReID performs better not only than other state-of-the-art Re-ID models on three benchmarks, but also than the teacher model based on the ResNet-50 backbone.
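A dynamic temperature of the kind described above can be sketched as a simple schedule. The abstract states only that the temperature starts high and is gradually reduced to 1; the linear decay, the starting value of 8.0, and the function name below are assumptions made for illustration, not the schedule used in KDAS-ReID.

```python
def linear_temperature(step, total_steps, start_temperature=8.0,
                       end_temperature=1.0):
    """Linearly decay the distillation temperature over training.

    step is the zero-based index of the current epoch or iteration;
    the schedule reaches end_temperature at the last step.
    """
    if total_steps <= 1:
        return end_temperature
    clamped = min(max(step, 0), total_steps - 1)  # keep step in range
    fraction = clamped / (total_steps - 1)
    return start_temperature + fraction * (end_temperature - start_temperature)


if __name__ == "__main__":
    for epoch in range(11):
        print(epoch, linear_temperature(epoch, 11))
```

The returned value would be passed as the temperature of the softened softmax in the distillation loss at each step, so early training sees very soft teacher targets and late training sees targets at temperature 1.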

Revisiting Label Smoothing Regularization with Knowledge Distillation

Applied Sciences ◽ 10.3390/app11104699 ◽ 2021 ◽ Vol 11 (10) ◽ pp. 4699

Author(s): Jiyue Wang ◽ Pei Zhang ◽ Qianhua He ◽ Yanxiong Li ◽ Yongjian Hu

Keyword(s): Computational Cost ◽ Ground Truth ◽ New Teacher ◽ Model Output ◽ Leibler Divergence ◽ Two Component ◽ Knowledge Distillation ◽ The One ◽ Generalize Classification ◽ Teacher Model

Label Smoothing Regularization (LSR) is a widely used tool for improving the generalization of classification models by replacing the one-hot ground truth with smoothed labels. Recent research on LSR has increasingly focused on the correlation between LSR and Knowledge Distillation (KD), which transfers knowledge from a teacher model to a lightweight student model by penalizing the Kullback–Leibler divergence between their outputs. Based on this observation, a Teacher-free Knowledge Distillation (Tf-KD) method was proposed in previous work. Instead of a real teacher model, a handcrafted distribution similar to LSR was used to guide the student's learning. Tf-KD is a promising substitute for LSR except for its hard-to-tune and model-dependent hyperparameters. This paper develops a new teacher-free framework, LSR-OS-TC, which decomposes the Tf-KD method into two components: model Output Smoothing (OS) and Teacher Correction (TC). First, LSR-OS extends the LSR method to the KD regime and applies a softer temperature to the model output softmax layer. Output smoothing is critical for stabilizing the KD hyperparameters across different models. Second, in the TC part, a larger proportion is assigned to the right class of the uniform distribution teacher to provide a more informative teacher. The two-component method was evaluated exhaustively on image (CIFAR-100, CIFAR-10, and CINIC-10 datasets) and audio (GTZAN dataset) classification tasks. The results showed that LSR-OS can improve LSR performance independently with no extra computational cost, especially on several deep neural networks where LSR is ineffective. The further training boost from the TC component showed the effectiveness of the two-component strategy. Overall, LSR-OS-TC is a practical substitute for LSR that, unlike the original Tf-KD method, can be tuned on one model and directly applied to other models.
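The two target distributions mentioned above can be written out in a few lines. The sketch below shows standard label smoothing and a handcrafted teacher that places a fixed probability on the correct class and spreads the rest evenly; the function names, the parameter values, and the exact form of the handcrafted teacher are assumptions for illustration and are not taken from the LSR-OS-TC paper.

```python
def smooth_labels(label, num_classes, epsilon=0.1):
    """Label smoothing: (1 - epsilon) * one-hot + epsilon * uniform."""
    uniform_share = epsilon / num_classes
    targets = [uniform_share] * num_classes
    targets[label] += 1.0 - epsilon
    return targets


def handcrafted_teacher(label, num_classes, correct_prob=0.9):
    """Teacher-free target: correct_prob on the true class, rest split evenly.

    correct_prob is a hypothetical hyperparameter of this sketch.
    """
    other = (1.0 - correct_prob) / (num_classes - 1)
    targets = [other] * num_classes
    targets[label] = correct_prob
    return targets


if __name__ == "__main__":
    print(smooth_labels(1, 4, epsilon=0.1))
    print(handcrafted_teacher(0, 5, correct_prob=0.9))
```

Either distribution can replace the one-hot target in a cross-entropy or KL loss; the difference lies in how much probability mass the correct class keeps relative to the others.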

Object Detection in Densely Packed Scenes via Semi-Supervised Learning with Dual Consistency

Proceedings of the Thirtieth International Joint Conference on Artificial Intelligence ◽ 10.24963/ijcai.2021/172 ◽ 2021

Author(s): Chao Ye ◽ Huaidong Zhang ◽ Xuemiao Xu ◽ Weiwei Cai ◽ Jing Qin ◽ ...

Keyword(s): Object Detection ◽ Deep Neural Networks ◽ State Of The Art ◽ Student Model ◽ Training Process ◽ Teacher Student ◽ Public Dataset ◽ Dual Consistency ◽ Bounding Boxes ◽ Teacher Model

Deep neural networks have been shown to be very powerful tools for object detection in various scenes. Their remarkable performance, however, depends heavily on the availability of large amounts of high-quality labeled data, which are time-consuming and costly to acquire for scenes with densely packed objects. We present a novel semi-supervised approach to this problem, built on a common teacher-student model and integrated with a novel intersection-over-union (IoU) aware consistency loss and a new proposal consistency loss. The IoU-aware consistency loss evaluates the IoU over the prediction pairs of the teacher model and the student model, which forces the predictions of the student model to closely approach those of the teacher model. The IoU-aware consistency loss also reweights the importance of different prediction pairs to suppress low-confidence pairs. The proposal consistency loss ensures proposal consistency between the two models, making it possible to involve the region proposal network in the training process with unlabeled data. We also construct a new dataset, RebarDSC, containing 2,125 rebar images annotated with 350,348 bounding boxes in total (164.9 annotations per image on average), to evaluate the proposed method. Extensive experiments are conducted on both the RebarDSC dataset and the well-known large public dataset SKU-110K. Experimental results corroborate that the proposed method improves object detection performance in densely packed scenes, consistently outperforming state-of-the-art approaches. The dataset is available at https://github.com/Armin1337/RebarDSC.
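The quantity at the core of the consistency loss described above is the IoU between a teacher box and a student box. The sketch below computes it for axis-aligned boxes given as `(x1, y1, x2, y2)`; that box format, the function names, and the simple `1 - IoU` discrepancy are assumptions for illustration and do not reproduce the paper's reweighted loss.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clipped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def box_discrepancy(teacher_box, student_box):
    """Hypothetical consistency term: zero when the two boxes coincide."""
    return 1.0 - iou(teacher_box, student_box)


if __name__ == "__main__":
    teacher_box = (0.0, 0.0, 2.0, 2.0)
    student_box = (1.0, 1.0, 3.0, 3.0)
    print(iou(teacher_box, student_box))
    print(box_discrepancy(teacher_box, student_box))
```

A training loop would average such a term over matched teacher-student prediction pairs on unlabeled images; how pairs are matched and weighted is specific to the method and is not shown here.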

Hybrid Learning with Teacher-student Knowledge Distillation for Recommenders

2020 International Conference on Data Mining Workshops (ICDMW) ◽ 10.1109/icdmw51313.2020.00040 ◽ 2020

Author(s): Hangbin Zhang ◽ Raymond K. Wong ◽ Victor W. Chu

Keyword(s): Hybrid Learning ◽ Student Knowledge ◽ Teacher Student ◽ Knowledge Distillation

Deep convolutional tree-inspired network: a decision-tree-structured neural network for hierarchical fault diagnosis of bearings

Frontiers of Mechanical Engineering ◽ 10.1007/s11465-021-0650-6 ◽ 2021

Author(s): Xu Wang ◽ Hongyang Gu ◽ Tianyang Wang ◽ Wei Zhang ◽ Aihua Li ◽ ...

Keyword(s): Neural Network ◽ Fault Diagnosis ◽ Decision Tree ◽ Structural Characteristics ◽ Proposed Model ◽ The Hierarchical Structure ◽ Hierarchical Decision ◽ Tree Methods ◽ New Perspective ◽ Simple Combination

The fault diagnosis of bearings is crucial in ensuring the reliability of rotating machinery. Deep neural networks have opened unprecedented opportunities for condition monitoring owing to their powerful ability to learn fault-related knowledge. However, the poor interpretability and low generalization ability of fault diagnosis models still hinder their practical application. To address this issue, this paper explores a decision-tree-structured neural network, the deep convolutional tree-inspired network (DCTN), for the hierarchical fault diagnosis of bearings. The proposed model integrates the advantages of convolutional neural network (CNN) and decision tree methods by rebuilding the output decision layer of the CNN according to the hierarchical structural characteristics of the decision tree, which is by no means a simple combination of the two models. The proposed DCTN model has unique advantages in 1) a hierarchical structure that supports more accurate and comprehensive fault diagnosis, 2) better interpretability of the model output through hierarchical decision making, and 3) stronger generalization to samples across fault severities. A multiclass fault diagnosis case and a cross-severity fault diagnosis case are carried out on a multicondition aeronautical bearing test rig. Experimental results demonstrate the feasibility and superiority of the proposed method.
