Understanding deep learning (still) requires rethinking generalization

2021 ◽  
Vol 64 (3) ◽  
pp. 107-115
Author(s):  
Chiyuan Zhang ◽  
Samy Bengio ◽  
Moritz Hardt ◽  
Benjamin Recht ◽  
Oriol Vinyals

Despite their massive size, successful deep artificial neural networks can exhibit a remarkably small gap between training and test performance. Conventional wisdom attributes small generalization error either to properties of the model family or to the regularization techniques used during training. Through extensive systematic experiments, we show how these traditional approaches fail to explain why large neural networks generalize well in practice. Specifically, our experiments establish that state-of-the-art convolutional networks for image classification trained with stochastic gradient methods easily fit a random labeling of the training data. This phenomenon is qualitatively unaffected by explicit regularization and occurs even if we replace the true images with completely unstructured random noise. We corroborate these experimental findings with a theoretical construction showing that simple depth-two neural networks already have perfect finite-sample expressivity as soon as the number of parameters exceeds the number of data points, as it usually does in practice. We interpret our experimental findings by comparison with traditional models. We supplement this republication with a new section at the end summarizing recent progress in the field since the original version of this paper.
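
A minimal sketch of the randomization test described above, in which the true labels are replaced by uniformly random ones before training. The small CNN, optimizer, and epoch count below are illustrative assumptions, not the authors' exact setup.

import numpy as np
import tensorflow as tf

# Load CIFAR-10 and replace the labels with random ones, breaking any image-label relationship.
(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
x_train = x_train.astype("float32") / 255.0
rng = np.random.default_rng(0)
y_random = rng.integers(0, 10, size=y_train.shape)

model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(128, 3, activation="relu"),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(512, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer="sgd", loss="sparse_categorical_crossentropy", metrics=["accuracy"])

# With enough epochs, training accuracy approaches 100% even though the labels carry no signal;
# test accuracy stays at chance level, illustrating memorization rather than generalization.
model.fit(x_train, y_random, epochs=50, batch_size=128)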

Symmetry ◽  
2018 ◽  
Vol 10 (11) ◽  
pp. 648 ◽  
Author(s):  
Ismoilov Nusrat ◽  
Sung-Bong Jang

Artificial neural networks (ANNs) have attracted significant attention from researchers because many complex problems can be solved by training them. If enough data are provided during the training process, ANNs are capable of achieving good performance results. However, if the training data are insufficient, the predefined neural network model suffers from overfitting and underfitting problems. To solve these problems, several regularization techniques have been devised and widely applied in applications and data analysis. However, it is difficult for developers to choose the most suitable scheme for the application they are developing because there is little comparative information on the performance of each scheme. This paper describes comparative research on regularization techniques by evaluating the training and validation errors in a deep neural network model, using a weather dataset. For the comparisons, each algorithm was implemented using a recent version of the TensorFlow neural network library. The experimental results showed that the autoencoder had the worst performance among the schemes. When prediction accuracy was compared, data augmentation and the batch normalization scheme showed better performance than the others.
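
A sketch of how such a comparison might be set up in TensorFlow/Keras. The dense regression model, layer sizes, and the subset of schemes shown (L2, dropout, batch normalization) are assumptions for illustration; the paper's weather dataset and full list of schemes are not reproduced here.

import tensorflow as tf

def build_model(scheme: str, n_features: int) -> tf.keras.Model:
    """Build the same base network with one regularization scheme switched on."""
    layers = [tf.keras.Input(shape=(n_features,))]
    for _ in range(3):
        reg = tf.keras.regularizers.l2(1e-4) if scheme == "l2" else None
        layers.append(tf.keras.layers.Dense(64, activation="relu", kernel_regularizer=reg))
        if scheme == "dropout":
            layers.append(tf.keras.layers.Dropout(0.3))
        elif scheme == "batch_norm":
            layers.append(tf.keras.layers.BatchNormalization())
    layers.append(tf.keras.layers.Dense(1))
    model = tf.keras.Sequential(layers)
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model

# Training and validation errors can then be compared across schemes on the same split, e.g.
# for scheme in ["none", "l2", "dropout", "batch_norm"]:
#     build_model(scheme, n_features=10).fit(x_train, y_train,
#                                            validation_data=(x_val, y_val), epochs=100)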


Author(s):  
Peter K. Koo ◽  
Matt Ploenzke

Deep convolutional neural networks (CNNs) trained on regulatory genomic sequences tend to build representations in a distributed manner, making it a challenge to extract learned features that are biologically meaningful, such as sequence motifs. Here we perform a comprehensive analysis on synthetic sequences to investigate the role that CNN activations play in model interpretability. We show that employing an exponential activation in first-layer filters consistently leads to interpretable and robust representations of motifs compared to other commonly used activations. Strikingly, we demonstrate that CNNs with better test performance do not necessarily yield more interpretable representations with attribution methods. We find that CNNs with exponential activations significantly improve the efficacy of recovering biologically meaningful representations with attribution methods. We demonstrate that these results generalise to real DNA sequences across several in vivo datasets. Together, this work demonstrates how a small modification to existing CNNs, i.e. using exponential activations in the first layer, can significantly improve the robustness and interpretability of learned representations, both directly in convolutional filters and indirectly with attribution methods.
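
A minimal sketch of the modification described above: a Keras CNN for one-hot DNA input whose first-layer filters use an exponential activation instead of a ReLU. The sequence length, filter counts, and downstream layers are assumptions, not the authors' exact architecture.

import tensorflow as tf

inputs = tf.keras.Input(shape=(200, 4))                     # (sequence length, A/C/G/T channels)
x = tf.keras.layers.Conv1D(32, 19, padding="same")(inputs)  # first-layer motif filters
x = tf.keras.layers.Activation(tf.math.exp)(x)              # exponential activation in the first layer
x = tf.keras.layers.MaxPooling1D(4)(x)
x = tf.keras.layers.Conv1D(64, 7, padding="same", activation="relu")(x)
x = tf.keras.layers.GlobalMaxPooling1D()(x)
outputs = tf.keras.layers.Dense(1, activation="sigmoid")(x)
model = tf.keras.Model(inputs, outputs)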


2015 ◽  
Vol 1 (1) ◽  
Author(s):  
Audrey G Chung ◽  
Mohammad Javad Shafiee ◽  
Alexander Wong

Deep convolutional neural networks (ConvNets) have rapidly grown in popularity due to their powerful capabilities in representing and modelling the high-level abstraction of complex data. However, ConvNets require an abundance of data to adequately train network parameters. To tackle this problem, we introduce the concept of stochastic receptive fields, where the receptive fields are stochastic realizations of a random field that obey a learned distribution. We study the efficacy of incorporating layers of stochastic receptive fields into a ConvNet to boost performance without the need for additional training data. Preliminary results show that an improvement in accuracy (a 2% drop in test error) was achieved by adding a layer of stochastic receptive fields to a ConvNet, compared to adding a layer of fully trained receptive fields, when training with a small training set consisting of 20% of the STL-10 dataset.
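
An illustrative sketch of one plausible reading of a stochastic receptive field layer: a convolutional layer whose filter bank is sampled on each forward pass from a learned Gaussian distribution. The abstract does not specify the exact parameterization, so the class below is a hypothetical construction, not the authors' implementation.

import tensorflow as tf

class StochasticConv2D(tf.keras.layers.Layer):
    def __init__(self, filters, kernel_size, **kwargs):
        super().__init__(**kwargs)
        self.filters, self.kernel_size = filters, kernel_size

    def build(self, input_shape):
        shape = (self.kernel_size, self.kernel_size, int(input_shape[-1]), self.filters)
        self.mu = self.add_weight(name="mu", shape=shape, initializer="glorot_uniform")
        self.log_sigma = self.add_weight(name="log_sigma", shape=shape, initializer="zeros")

    def call(self, x):
        # Reparameterized sample of the filter bank: w = mu + sigma * eps.
        eps = tf.random.normal(tf.shape(self.mu))
        w = self.mu + tf.exp(self.log_sigma) * eps
        return tf.nn.conv2d(x, w, strides=1, padding="SAME")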


Geophysics ◽  
2021 ◽  
pp. 1-43
Author(s):  
Chao Zhang ◽  
Mirko van der Baan

Neural networks hold substantial promise to automate various processing and interpretation tasks. Yet their performance is often sub-optimal compared with standard but more closely guided approaches. Poor performance is often attributed to poor generalization, in particular if fewer training examples are provided than free parameters exist in the machine learning algorithm. In this case, the training data are typically memorized instead of the algorithm learning the underlying general trends. Network generalization is improved if the provided samples are representative, in that they describe all features of interest well. We argue that a more subtle condition preventing poor performance is that the provided examples must also be complete; the examples must span the full solution space. Ensuring completeness during training is challenging unless the target application is well understood. We illustrate that one possible solution is to make the problem more general if this greatly increases the number of available training data. For instance, if seismic images are treated as a subclass of natural images, then a deep-learning-based denoiser for seismic data can be trained using exclusively natural images. The latter are widely available. The resulting denoising algorithm has never seen any seismic data during the training stage, yet it displays a performance comparable to standard and advanced random-noise reduction methods. We exclude any seismic data during training to demonstrate that natural images are both complete and representative for this specific task. Furthermore, we apply a novel approach to increase the amount of training data, known as double noise injection, providing both noisy input and output images during the training process. Given the importance of network generalization, we hope that the insights gained in this study may help improve the performance of a range of machine learning applications in geophysics.
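
A short sketch of the double noise injection idea as described above: both the network input and the training target receive independent noise realizations derived from the same clean natural-image patch. The additive Gaussian noise model is an assumption for illustration.

import numpy as np

def make_training_pair(clean_patch: np.ndarray, sigma: float = 0.1):
    """Return a (noisy_input, noisy_target) pair built from one clean natural-image patch."""
    noise_in = sigma * np.random.randn(*clean_patch.shape)
    noise_out = sigma * np.random.randn(*clean_patch.shape)  # independent second realization
    return clean_patch + noise_in, clean_patch + noise_out

# A denoising CNN trained on such pairs never sees seismic data, yet can be applied to seismic
# images at inference time, provided natural images are complete and representative for the task.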


Author(s):  
Liang Yao ◽  
Chengsheng Mao ◽  
Yuan Luo

Text classification is an important and classical problem in natural language processing. There have been a number of studies that applied convolutional neural networks (convolution on a regular grid, e.g., a sequence) to classification. However, only a limited number of studies have explored the more flexible graph convolutional neural networks (convolution on a non-grid structure, e.g., an arbitrary graph) for the task. In this work, we propose to use graph convolutional networks for text classification. We build a single text graph for a corpus based on word co-occurrence and document-word relations, then learn a Text Graph Convolutional Network (Text GCN) for the corpus. Our Text GCN is initialized with one-hot representations for words and documents; it then jointly learns the embeddings for both words and documents, as supervised by the known class labels for documents. Our experimental results on multiple benchmark datasets demonstrate that a vanilla Text GCN without any external word embeddings or knowledge outperforms state-of-the-art methods for text classification. On the other hand, Text GCN also learns predictive word and document embeddings. In addition, experimental results show that the improvement of Text GCN over state-of-the-art comparison methods becomes more prominent as we lower the percentage of training data, suggesting the robustness of Text GCN to limited training data in text classification.
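
A minimal sketch of the graph-convolution step used on the single text graph, assuming the standard GCN propagation rule with symmetric normalization; how the edges are weighted (the abstract names word co-occurrence and document-word relations) is left outside this sketch.

import numpy as np

def gcn_layer(adj: np.ndarray, features: np.ndarray, weights: np.ndarray) -> np.ndarray:
    a_hat = adj + np.eye(adj.shape[0])                 # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt           # symmetric normalization
    return np.maximum(a_norm @ features @ weights, 0)  # ReLU activation

# With one-hot node features (an identity matrix over the word and document nodes), two such
# layers followed by a softmax over the document nodes give a Text-GCN-style classifier.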


Author(s):  
Kaushal Paneri ◽  
Vishnu TV ◽  
Pankaj Malhotra ◽  
Lovekesh Vig ◽  
Gautam Shroff

Deep neural networks are prone to overfitting, especially in small training data regimes. Often, these networks are overparameterized and the resulting learned weights tend to have strong correlations. However, convolutional networks in general, and fully convolutional neural networks (FCNs) in particular, have been shown to be relatively parameter efficient, and have recently been successfully applied to time series classification tasks. In this paper, we investigate how different regularizers affect the correlation between the learned convolutional filters in FCNs that use Batch Normalization (BN) as a regularizer, for time series classification (TSC) tasks. Results demonstrate that despite orthogonal initialization of the filters, the average correlation across filters (especially for filters in higher layers) tends to increase as training proceeds, indicating redundancy of filters. To mitigate this redundancy, we propose a strong regularizer using simple yet effective filter decorrelation. Our proposed method yields significant gains in classification accuracy for 44 diverse time series datasets from the UCR TSC benchmark repository.
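
A sketch of a filter-decorrelation penalty of the kind described above, assuming the regularizer penalizes the mean squared off-diagonal correlation between flattened filters of a convolutional layer; the paper's exact formulation may differ.

import tensorflow as tf

def decorrelation_penalty(kernel: tf.Tensor) -> tf.Tensor:
    """kernel: conv kernel of shape (width, in_channels, filters) for a 1-D FCN layer."""
    flat = tf.reshape(kernel, (-1, kernel.shape[-1]))            # one column per filter
    flat = flat - tf.reduce_mean(flat, axis=0, keepdims=True)
    flat = flat / (tf.norm(flat, axis=0, keepdims=True) + 1e-8)  # unit-normalize each filter
    corr = tf.matmul(flat, flat, transpose_a=True)               # filter-by-filter correlations
    off_diag = corr - tf.linalg.diag(tf.linalg.diag_part(corr))
    return tf.reduce_mean(tf.square(off_diag))

# The penalty is added to the classification loss, e.g.
# loss = cross_entropy + lam * sum(decorrelation_penalty(layer.kernel) for layer in conv_layers)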


2021 ◽  
pp. 415-423
Author(s):  
Svetlana Mustafina ◽  
Andrey Akimov ◽  
Sofia Mustafina ◽  
Alexandra Plotnikova

The article is devoted to a comparative analysis of the effectiveness of convolutional neural networks for the semantic segmentation of road surface damage. Currently, photo and video surveillance methods are used to monitor the condition of the road surface. Manually assessing and analyzing new data can take too long. Thus, a completely different procedure is required to inspect and assess the state of the monitored objects using machine vision. The authors compared three neural networks used in semantic segmentation (U-Net, LinkNet, PSPNet) on the example of the Crack500 dataset. The proposed architectures were implemented in the Keras and TensorFlow frameworks. The compared convolutional neural network models effectively solve the assigned tasks even with a limited amount of training data, and high accuracy is observed. The considered models can be used in various segmentation tasks. The results obtained can be used in the process of modeling, monitoring, and predicting the wear of the road surface.
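
One possible way to set up such a comparison in Keras is sketched below using the third-party segmentation_models package with a resnet34 backbone; the package, backbone, and input size are assumptions of this sketch, not necessarily the authors' implementation.

import segmentation_models as sm
sm.set_framework("tf.keras")

builders = {"unet": sm.Unet, "linknet": sm.Linknet, "pspnet": sm.PSPNet}
models = {
    name: build("resnet34", input_shape=(384, 384, 3), classes=1,
                activation="sigmoid", encoder_weights=None)
    for name, build in builders.items()
}
for model in models.values():
    model.compile("adam", loss=sm.losses.bce_jaccard_loss, metrics=[sm.metrics.iou_score])

# Each model is then trained on Crack500 images with binary crack masks, e.g.
# models["unet"].fit(x_train, y_train, validation_data=(x_val, y_val), epochs=50, batch_size=8)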


Sensors ◽  
2021 ◽  
Vol 21 (24) ◽  
pp. 8494
Author(s):  
Adrian Barbu ◽  
Hongyu Mou

Neural networks are popular and useful in many fields, but they have the problem of giving high-confidence responses for examples that are far away from the training data. This makes the neural networks very confident in their predictions even while making gross mistakes, thus limiting their reliability for safety-critical applications such as autonomous driving and space exploration. This paper introduces a novel neuron generalization that has the standard dot-product-based neuron and the radial basis function (RBF) neuron as two extreme cases of a shape parameter. Using a rectified linear unit (ReLU) as the activation function results in a novel neuron that has compact support, which means its output is zero outside a bounded domain. To address the difficulties in training the proposed neural network, the paper introduces a novel training method that takes a pretrained standard neural network and fine-tunes it while gradually increasing the shape parameter to the desired value. The theoretical findings of the paper are a bound on the gradient of the proposed neuron and a proof that a neural network with such neurons has the universal approximation property. This means that the network can approximate any continuous and integrable function with an arbitrary degree of accuracy. The experimental findings on standard benchmark datasets show that the proposed approach has smaller test errors than the state-of-the-art competing methods and outperforms the competing methods in detecting out-of-distribution samples on two out of three datasets.
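
An illustrative sketch of one way to interpolate between a dot-product neuron (alpha = 0) and an RBF-style neuron (alpha = 1) with a shape parameter, giving compact support under a ReLU for any alpha > 0. The abstract does not give the exact parameterization, so the formula below is an assumption for illustration only.

import numpy as np

def shaped_neuron(x: np.ndarray, w: np.ndarray, b: float, alpha: float) -> float:
    dot_part = np.dot(w, x) + b            # standard dot-product pre-activation
    rbf_part = b - np.sum((x - w) ** 2)    # RBF-style pre-activation, centered at w
    pre = (1.0 - alpha) * dot_part + alpha * rbf_part
    return max(pre, 0.0)                   # ReLU; zero outside a bounded region when alpha > 0

# Training sketch, following the abstract: start from a pretrained standard network (alpha = 0)
# and raise alpha toward the desired value in small steps, fine-tuning after each increase.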


2021 ◽  
Vol 2021 (11) ◽  
Author(s):  
I.F. Kupryashkin

The results of ten-class classification of MSTAR objects using a VGG-type deep convolutional neural network with eight convolutional layers are presented. The maximum accuracy achieved by the network was 97.91%. In addition, the results of the MobileNetV1, Xception, InceptionV3, ResNet50, InceptionResNetV2, and DenseNet121 networks, prepared using the transfer learning technique, are presented. It is shown that in the problem under consideration, the use of the listed pretrained convolutional networks did not improve the classification accuracy, which ranged from 93.79% to 97.36%. It has been established that even visually unobservable local features of the terrain background near each type of object are capable of providing a classification accuracy of about 51% (rather than the 10% expected for a ten-alternative classification), even in the absence of the objects and their shadows. The procedure for preparing the training data is described, which ensures the elimination of the influence of the terrain background on the result of the neural network classification.
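
A sketch of the transfer-learning setup described above, assuming ImageNet-pretrained Keras backbones with a new ten-class head; the input size and the handling of single-channel SAR chips (replicated to three channels) are assumptions, not the paper's exact pipeline.

import tensorflow as tf

def build_transfer_model(backbone_cls, num_classes: int = 10) -> tf.keras.Model:
    # MSTAR chips are single-channel; they are assumed here to be replicated to three channels.
    backbone = backbone_cls(include_top=False, weights="imagenet",
                            input_shape=(128, 128, 3), pooling="avg")
    backbone.trainable = False  # freeze the pretrained convolutional features
    outputs = tf.keras.layers.Dense(num_classes, activation="softmax")(backbone.output)
    model = tf.keras.Model(backbone.input, outputs)
    model.compile("adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
    return model

# e.g. build_transfer_model(tf.keras.applications.MobileNet)
#      build_transfer_model(tf.keras.applications.ResNet50)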


Author(s):  
Henry Gouk ◽  
Eibe Frank ◽  
Bernhard Pfahringer ◽  
Michael J. Cree

We investigate the effect of explicitly enforcing the Lipschitz continuity of neural networks with respect to their inputs. To this end, we provide a simple technique for computing an upper bound to the Lipschitz constant (for multiple p-norms) of a feed-forward neural network composed of commonly used layer types. Our technique is then used to formulate training a neural network with a bounded Lipschitz constant as a constrained optimisation problem that can be solved using projected stochastic gradient methods. Our evaluation study shows that the performance of the resulting models exceeds that of models trained with other common regularisers. We also provide evidence that the hyperparameters are intuitive to tune, demonstrate how the choice of norm for computing the Lipschitz constant impacts the resulting model, and show that the performance gains provided by our method are particularly noticeable when only a small amount of training data is available.
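
A sketch of the idea for the 2-norm case: for a feed-forward network with 1-Lipschitz activations, the Lipschitz constant is upper-bounded by the product of the spectral norms of the weight matrices, and the constraint can be enforced by projecting (rescaling) each layer after every update. This illustrates the general principle, not the paper's exact algorithm or its bounds for other p-norms.

import numpy as np

def lipschitz_upper_bound(weight_matrices) -> float:
    """Upper bound on the 2-norm Lipschitz constant of a ReLU feed-forward network."""
    return float(np.prod([np.linalg.norm(w, ord=2) for w in weight_matrices]))

def project_layer(w: np.ndarray, max_norm: float) -> np.ndarray:
    """Rescale a weight matrix so that its spectral norm does not exceed max_norm."""
    s = np.linalg.norm(w, ord=2)
    return w if s <= max_norm else w * (max_norm / s)

# In projected stochastic gradient training, project_layer is applied to every layer after each
# update, keeping lipschitz_upper_bound(weights) <= max_norm ** len(weights).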

