Relieving the Incompatibility of Network Representation and Classification for Long-Tailed Data Distribution

In the real-world scenario, data often have a long-tailed distribution and training deep neural networks on such an imbalanced dataset has become a great challenge. The main problem caused by a long-tailed data distribution is that common classes will dominate the training results and achieve a very low accuracy on the rare classes. Recent work focuses on improving the network representation ability to overcome the long-tailed problem, while it always ignores adapting the network classifier to a long-tailed case, which will cause the “incompatibility” problem of network representation and network classifier. In this paper, we use knowledge distillation to solve the long-tailed data distribution problem and fully optimize the network representation and classifier simultaneously. We propose multiexperts knowledge distillation with class-balanced sampling to jointly learn high-quality network representation and classifier. Also, a channel activation-based knowledge distillation method is also proposed to improve the performance further. State-of-the-art performance on several large-scale long-tailed classification datasets shows the superior generalization of our method.

Download Full-text

Data-Efficient Sensor Upgrade Path Using Knowledge Distillation

Sensors ◽

10.3390/s21196523 ◽

2021 ◽

Vol 21 (19) ◽

pp. 6523

Author(s):

Pieter Van Van Molle ◽

Cedric De De Boom ◽

Tim Verbelen ◽

Bert Vankeirsbilck ◽

Jonas De De Vylder ◽

...

Keyword(s):

Deep Neural Networks ◽

State Of The Art ◽

Original Data ◽

Radar Data ◽

Teacher Supervision ◽

Multispectral Images ◽

Test Set ◽

Time To Market ◽

Speed Up ◽

Knowledge Distillation

Deep neural networks have achieved state-of-the-art performance in image classification. Due to this success, deep learning is now also being applied to other data modalities such as multispectral images, lidar and radar data. However, successfully training a deep neural network requires a large reddataset. Therefore, transitioning to a new sensor modality (e.g., from regular camera images to multispectral camera images) might result in a drop in performance, due to the limited availability of data in the new modality. This might hinder the adoption rate and time to market for new sensor technologies. In this paper, we present an approach to leverage the knowledge of a teacher network, that was trained using the original data modality, to improve the performance of a student network on a new data modality: a technique known in literature as knowledge distillation. By applying knowledge distillation to the problem of sensor transition, we can greatly speed up this process. We validate this approach using a multimodal version of the MNIST dataset. Especially when little data is available in the new modality (i.e., 10 images), training with additional teacher supervision results in increased performance, with the student network scoring a test set accuracy of 0.77, compared to an accuracy of 0.37 for the baseline. We also explore two extensions to the default method of knowledge distillation, which we evaluate on a multimodal version of the CIFAR-10 dataset: an annealing scheme for the hyperparameter α and selective knowledge distillation. Of these two, the first yields the best results. Choosing the optimal annealing scheme results in an increase in test set accuracy of 6%. Finally, we apply our method to the real-world use case of skin lesion classification.

Download Full-text

Hearing Lips: Improving Lip Reading by Distilling Speech Recognizers

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.6174 ◽

2020 ◽

Vol 34 (04) ◽

pp. 6917-6924 ◽

Cited By ~ 1

Author(s):

Ya Zhao ◽

Rui Xu ◽

Xinchao Wang ◽

Peng Hou ◽

Haihong Tang ◽

...

Keyword(s):

Deep Learning ◽

Speech Recognition ◽

Error Rate ◽

Large Scale ◽

State Of The Art ◽

Lip Reading ◽

Speech Recognizers ◽

Lip Movement ◽

Knowledge Distillation ◽

The One

Lip reading has witnessed unparalleled development in recent years thanks to deep learning and the availability of large-scale datasets. Despite the encouraging results achieved, the performance of lip reading, unfortunately, remains inferior to the one of its counterpart speech recognition, due to the ambiguous nature of its actuations that makes it challenging to extract discriminant features from the lip movement videos. In this paper, we propose a new method, termed as Lip by Speech (LIBS), of which the goal is to strengthen lip reading by learning from speech recognizers. The rationale behind our approach is that the features extracted from speech recognizers may provide complementary and discriminant clues, which are formidable to be obtained from the subtle movements of the lips, and consequently facilitate the training of lip readers. This is achieved, specifically, by distilling multi-granularity knowledge from speech recognizers to lip readers. To conduct this cross-modal knowledge distillation, we utilize an efficacious alignment scheme to handle the inconsistent lengths of the audios and videos, as well as an innovative filtering strategy to refine the speech recognizer's prediction. The proposed method achieves the new state-of-the-art performance on the CMLR and LRS2 datasets, outperforming the baseline by a margin of 7.66% and 2.75% in character error rate, respectively.

Download Full-text

Interpolation Consistency Training for Semi-supervised Learning

Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2019/504 ◽

2019 ◽

Cited By ~ 39

Author(s):

Vikas Verma ◽

Alex Lamb ◽

Juho Kannala ◽

Yoshua Bengio ◽

David Lopez-Paz

Keyword(s):

Neural Network ◽

Neural Networks ◽

Supervised Learning ◽

Deep Neural Networks ◽

State Of The Art ◽

Data Distribution ◽

Network Architectures ◽

Low Density ◽

Decision Boundary ◽

Classification Problems

We introduce Interpolation Consistency Training (ICT), a simple and computation efficient algorithm for training Deep Neural Networks in the semi-supervised learning paradigm. ICT encourages the prediction at an interpolation of unlabeled points to be consistent with the interpolation of the predictions at those points. In classification problems, ICT moves the decision boundary to low-density regions of the data distribution. Our experiments show that ICT achieves state-of-the-art performance when applied to standard neural network architectures on the CIFAR-10 and SVHN benchmark dataset.

Download Full-text

Compact Cloud Detection with Bidirectional Self-Attention Knowledge Distillation

Remote Sensing ◽

10.3390/rs12172770 ◽

2020 ◽

Vol 12 (17) ◽

pp. 2770 ◽

Cited By ~ 1

Author(s):

Yajie Chai ◽

Kun Fu ◽

Xian Sun ◽

Wenhui Diao ◽

Zhiyuan Yan ◽

...

Keyword(s):

Remote Sensing ◽

Large Scale ◽

Semantic Information ◽

State Of The Art ◽

High Accuracy ◽

Compact Model ◽

Cloud Detection ◽

Significant Progress ◽

Remote Sensing Images ◽

Knowledge Distillation

The deep convolutional neural network has made significant progress in cloud detection. However, the compromise between having a compact model and high accuracy has always been a challenging task in cloud detection for large-scale remote sensing imagery. A promising method to tackle this problem is knowledge distillation, which usually lets the compact model mimic the cumbersome model’s output to get better generalization. However, vanilla knowledge distillation methods cannot properly distill the characteristics of clouds in remote sensing images. In this paper, we propose a novel self-attention knowledge distillation approach for compact and accurate cloud detection, named Bidirectional Self-Attention Distillation (Bi-SAD). Bi-SAD lets a model learn from itself without adding additional parameters or supervision. With bidirectional layer-wise features learning, the model can get a better representation of the cloud’s textural information and semantic information, so that the cloud’s boundaries become more detailed and the predictions become more reliable. Experiments on a dataset acquired by GaoFen-1 satellite show that our Bi-SAD has a great balance between compactness and accuracy, and outperforms vanilla distillation methods. Compared with state-of-the-art cloud detection models, the parameter size and FLOPs are reduced by 100 times and 400 times, respectively, with a small drop in accuracy.

Download Full-text

Neural models for information retrieval without labeled data

ACM SIGIR Forum ◽

10.1145/3458553.3458569 ◽

2019 ◽

Vol 53 (2) ◽

pp. 104-105

Author(s):

Hamed Zamani

Keyword(s):

Neural Network ◽

Information Retrieval ◽

Performance Prediction ◽

Large Scale ◽

Deep Neural Networks ◽

State Of The Art ◽

Training Data ◽

Retrieval Model ◽

Neural Models ◽

Retrieval Models

Recent developments of machine learning models, and in particular deep neural networks, have yielded significant improvements on several computer vision, natural language processing, and speech recognition tasks. Progress with information retrieval (IR) tasks has been slower, however, due to the lack of large-scale training data as well as neural network models specifically designed for effective information retrieval [9]. In this dissertation, we address these two issues by introducing task-specific neural network architectures for a set of IR tasks and proposing novel unsupervised or weakly supervised solutions for training the models. The proposed learning solutions do not require labeled training data. Instead, in our weak supervision approach, neural models are trained on a large set of noisy and biased training data obtained from external resources, existing models, or heuristics. We first introduce relevance-based embedding models [3] that learn distributed representations for words and queries. We show that the learned representations can be effectively employed for a set of IR tasks, including query expansion, pseudo-relevance feedback, and query classification [1, 2]. We further propose a standalone learning to rank model based on deep neural networks [5, 8]. Our model learns a sparse representation for queries and documents. This enables us to perform efficient retrieval by constructing an inverted index in the learned semantic space. Our model outperforms state-of-the-art retrieval models, while performing as efficiently as term matching retrieval models. We additionally propose a neural network framework for predicting the performance of a retrieval model for a given query [7]. Inspired by existing query performance prediction models, our framework integrates several information sources, such as retrieval score distribution and term distribution in the top retrieved documents. This leads to state-of-the-art results for the performance prediction task on various standard collections. We finally bridge the gap between retrieval and recommendation models, as the two key components in most information systems. Search and recommendation often share the same goal: helping people get the information they need at the right time. Therefore, joint modeling and optimization of search engines and recommender systems could potentially benefit both systems [4]. In more detail, we introduce a retrieval model that is trained using user-item interaction (e.g., recommendation data), with no need to query-document relevance information for training [6]. Our solutions and findings in this dissertation smooth the path towards learning efficient and effective models for various information retrieval and related tasks, especially when large-scale training data is not available.

Download Full-text

Learning Competitive and Discriminative Reconstructions for Anomaly Detection

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33015167 ◽

2019 ◽

Vol 33 ◽

pp. 5167-5174 ◽

Cited By ~ 1

Author(s):

Kai Tian ◽

Shuigeng Zhou ◽

Jianping Fan ◽

Jihong Guan

Keyword(s):

Anomaly Detection ◽

Large Scale ◽

State Of The Art ◽

Empirical Studies ◽

Data Distribution ◽

Training Algorithm ◽

Positive Data ◽

Unlabelled Data ◽

Detection Stage ◽

Test Instance

Most of the existing methods for anomaly detection use only positive data to learn the data distribution, thus they usually need a pre-defined threshold at the detection stage to determine whether a test instance is an outlier. Unfortunately, a good threshold is vital for the performance and it is really hard to find an optimal one. In this paper, we take the discriminative information implied in unlabeled data into consideration and propose a new method for anomaly detection that can learn the labels of unlabelled data directly. Our proposed method has an end-to-end architecture with one encoder and two decoders that are trained to model inliers and outliers’ data distributions in a competitive way. This architecture works in a discriminative manner without suffering from overfitting, and the training algorithm of our model is adopted from SGD, thus it is efficient and scalable even for large-scale datasets. Empirical studies on 7 datasets including KDD99, MNIST, Caltech-256, and ImageNet etc. show that our model outperforms the state-of-the-art methods.

Download Full-text

Innovative Topologies and Algorithms for Neural Networks

Future Internet ◽

10.3390/fi12070117 ◽

2020 ◽

Vol 12 (7) ◽

pp. 117

Author(s):

Salvatore Graziani ◽

Maria Gabriella Xibilia

Keyword(s):

Neural Networks ◽

Network Architecture ◽

Deep Neural Networks ◽

State Of The Art ◽

Text Processing ◽

Successful Outcome ◽

Training Algorithm ◽

Special Issue ◽

And Training ◽

Selection Of

The introduction of new topologies and training procedures to deep neural networks has solicited a renewed interest in the field of neural computation. The use of deep structures has significantly improved the state of the art in many applications, such as computer vision, speech and text processing, medical applications, and IoT (Internet of Things). The probability of a successful outcome from a neural network is linked to selection of an appropriate network architecture and training algorithm. Accordingly, much of the recent research on neural networks is devoted to the study and proposal of novel architectures, including solutions tailored to specific problems. The papers of this Special Issue make significant contributions to the above-mentioned fields by merging theoretical aspects and relevant applications. Twelve papers are collected in the issue, addressing many relevant aspects of the topic.

Download Full-text

Androgen Receptor Binding Category Prediction with Deep Neural Networks and Structure-, Ligand-, and Statistically Based Features

Molecules ◽

10.3390/molecules26051285 ◽

2021 ◽

Vol 26 (5) ◽

pp. 1285

Author(s):

Alfonso T. García-Sosa

Keyword(s):

Neural Networks ◽

Androgen Receptor ◽

Logistic Model ◽

Deep Neural Networks ◽

State Of The Art ◽

Protein Structures ◽

Training Set ◽

Multivariate Logistic Model ◽

And Training ◽

Better Than

Substances that can modify the androgen receptor pathway in humans and animals are entering the environment and food chain with the proven ability to disrupt hormonal systems and leading to toxicity and adverse effects on reproduction, brain development, and prostate cancer, among others. State-of-the-art databases with experimental data of human, chimp, and rat effects by chemicals have been used to build machine-learning classifiers and regressors and to evaluate these on independent sets. Different featurizations, algorithms, and protein structures lead to different results, with deep neural networks (DNNs) on user-defined physicochemically relevant features developed for this work outperforming graph convolutional, random forest, and large featurizations. The results show that these user-provided structure-, ligand-, and statistically based features and specific DNNs provided the best results as determined by AUC (0.87), MCC (0.47), and other metrics and by their interpretability and chemical meaning of the descriptors/features. In addition, the same features in the DNN method performed better than in a multivariate logistic model: validation MCC = 0.468 and training MCC = 0.868 for the present work compared to evaluation set MCC = 0.2036 and training set MCC = 0.5364 for the multivariate logistic regression on the full, unbalanced set. Techniques of this type may improve AR and toxicity description and prediction, improving assessment and design of compounds. Source code and data are available on github.

Download Full-text

KDAS-ReID: Architecture Search for Person Re-Identification via Distilled Knowledge with Dynamic Temperature

Algorithms ◽

10.3390/a14050137 ◽

2021 ◽

Vol 14 (5) ◽

pp. 137

Author(s):

Zhou Lei ◽

Kangkang Yang ◽

Kai Jiang ◽

Shengbo Chen

Keyword(s):

State Of The Art ◽

Identification Algorithm ◽

Student Model ◽

Deep Convolutional Neural Networks ◽

Fast Speed ◽

Training Stage ◽

Knowledge Distillation ◽

And Training ◽

Better Than ◽

Teacher Model

Person re-Identification(Re-ID) based on deep convolutional neural networks (CNNs) achieves remarkable success with its fast speed. However, prevailing Re-ID models are usually built upon backbones that manually design for classification. In order to automatically design an effective Re-ID architecture, we propose a pedestrian re-identification algorithm based on knowledge distillation, called KDAS-ReID. When the knowledge of the teacher model is transferred to the student model, the importance of knowledge in the teacher model will gradually decrease with the improvement of the performance of the student model. Therefore, instead of applying the distillation loss function directly, we consider using dynamic temperatures during the search stage and training stage. Specifically, we start searching and training at a high temperature and gradually reduce the temperature to 1 so that the student model can better learn from the teacher model through soft targets. Extensive experiments demonstrate that KDAS-ReID performs not only better than other state-of-the-art Re-ID models on three benchmarks, but also better than the teacher model based on the ResNet-50 backbone.

Download Full-text

Androgen Receptor Binding Category Prediction with Deep Neural Networks and Structure-, Ligand-, and Statistically-Based Features

10.20944/preprints202102.0318.v3 ◽

2021 ◽

Author(s):

Alfonso T. García-Sosa

Keyword(s):

Neural Networks ◽

Androgen Receptor ◽

Logistic Model ◽

Deep Neural Networks ◽

State Of The Art ◽

Protein Structures ◽

Training Set ◽

Multivariate Logistic Model ◽

And Training ◽

Better Than

Substances that can modify the androgen receptor pathway in humans and animals are entering the environment and food chain with the proven ability to disrupt hormonal systems and leading to toxicity and adverse effects on reproduction, brain development, and prostate cancer, among others. State-of-the-art databases with experimental data of human, chimp, and rat effects by chemicals have been used to build machine learning classifiers and regressors and evaluate these on independent sets. Different featurizations, algorithms, and protein structures lead to dif- ferent results, with deep neural networks (DNNs) on user-defined physicochemically-relevant features developed for this work outperforming graph convolutional, random forest, and large featurizations. The results show that these user-provided structure-, ligand-, and statistically-based features and specific DNNs provided the best results as determined by AUC (0.87), MCC (0.47), and other metrics and by their interpretability and chemical meaning of the descriptors/features. In addition, the same features in the DNN method performed better than in a multivariate logistic model: validation MCC = 0.468 and training MCC = 0.868 for the present work compared to evalu- ation set MCC = 0.2036 and training set MCC = 0.5364 for the multivariate logistic regression on the full, unbalanced set. Techniques of this type may improve AR and toxicity description and predic- tion, improving assessment and design of compounds. Source code and data are available at https://github.com/AlfonsoTGarcia-Sosa/ML

Download Full-text