Stronger convergence results for deep residual networks: network width scales linearly with training data size
Abstract
Deep neural networks are highly expressive machine learning models with the ability to interpolate arbitrary datasets. Deep nets are typically optimized via first-order methods, and the optimization process crucially depends on the characteristics of the network as well as the dataset. This work sheds light on the relation between the network size and the properties of the dataset, with an emphasis on deep residual networks (ResNets). Our main contribution is to show that, provided the network Jacobian is full rank, gradient descent for the quadratic loss with a smooth activation function converges to a global minimum even when the width $m$ of the ResNet scales only linearly with the sample size $n$ and logarithmically with the network depth $H$. Consequently, our work provides a theoretical convergence guarantee for deep neural networks in the $m = \varOmega(n \log H)$ regime.
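To make the setting concrete, the following is a minimal NumPy sketch of the regime described above: a depth-$H$ residual network with a smooth (tanh) activation, width chosen proportional to $n \log H$, and trained by full-batch gradient descent on the quadratic loss. This is illustrative only; the constant factor in the width, the choice of activation, the initialization scale, and the step size are assumptions of the sketch, not the paper's construction or proof conditions.

```python
import numpy as np

# Toy residual network with smooth (tanh) activation, trained by full-batch
# gradient descent on the quadratic loss 0.5 * ||f(X) - y||^2.
# Width heuristic m ~ n * log(H) mirrors the regime above; the constant,
# activation, step size, and initialization scale are illustrative assumptions.

rng = np.random.default_rng(0)

n, d, H = 32, 8, 16                           # samples, input dim, depth
m = int(np.ceil(n * np.log(H)))               # width ~ n log H (constant = 1, assumed)

X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

A = rng.standard_normal((d, m)) / np.sqrt(d)                              # input layer
Ws = [rng.standard_normal((m, m)) / (H * np.sqrt(m)) for _ in range(H)]   # residual blocks
b = rng.standard_normal(m) / np.sqrt(m)                                   # output layer

def forward(X, A, Ws, b):
    """Return all hidden states and the scalar-per-sample prediction."""
    hs = [np.tanh(X @ A)]
    for W in Ws:
        hs.append(hs[-1] + np.tanh(hs[-1] @ W))   # residual (skip) connection
    return hs, hs[-1] @ b

lr = 1e-2
for _ in range(200):
    hs, pred = forward(X, A, Ws, b)
    resid = pred - y                              # gradient of the quadratic loss w.r.t. pred
    # Backpropagate through the output layer and the residual blocks.
    g_h = np.outer(resid, b)                      # dL/dh_H, shape (n, m)
    g_b = hs[-1].T @ resid
    g_Ws = [None] * H
    for k in reversed(range(H)):
        pre = hs[k] @ Ws[k]
        g_pre = g_h * (1 - np.tanh(pre) ** 2)     # smooth activation derivative
        g_Ws[k] = hs[k].T @ g_pre
        g_h = g_h + g_pre @ Ws[k].T               # skip connection passes gradient through
    pre0 = X @ A
    g_A = X.T @ (g_h * (1 - np.tanh(pre0) ** 2))
    # Plain gradient descent update.
    A -= lr * g_A
    b -= lr * g_b
    for k in range(H):
        Ws[k] -= lr * g_Ws[k]

loss = 0.5 * np.sum((forward(X, A, Ws, b)[1] - y) ** 2)
print(f"final quadratic loss: {loss:.4f}")
```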