Learning dynamics of gradient descent optimization in deep neural networks

The momentum stochastic gradient descent (MSGD) algorithm has been widely applied to many nonconvex optimization problems in machine learning (e.g., training deep neural networks and variational Bayesian inference). Despite its empirical success, the convergence properties of MSGD still lack theoretical understanding. To fill this gap, we propose to analyze the algorithmic behavior of MSGD by diffusion approximations for nonconvex optimization problems with strict saddle points and isolated local optima. Our study shows that momentum helps escape from saddle points but hurts convergence within the neighborhood of optima (unless step size annealing or momentum annealing is used). Our theoretical discovery partially corroborates the empirical success of MSGD in training deep neural networks.
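The saddle-escape claim can be illustrated with a small simulation. The sketch below is not the diffusion analysis from the abstract; it assumes a heavy-ball form of the momentum update and a hypothetical toy objective f(x, y) = 0.5*x^2 - 0.5*y^2, which has a strict saddle at the origin. The names `msgd_step`, `noisy_grad`, and `steps_to_escape`, and all parameter values, are illustrative choices.

```python
import random


def msgd_step(x, v, grad, lr=0.01, momentum=0.9):
    """One heavy-ball update: v <- momentum * v - lr * grad, then x <- x + v."""
    v_new = [momentum * vi - lr * gi for vi, gi in zip(v, grad)]
    x_new = [xi + vi for xi, vi in zip(x, v_new)]
    return x_new, v_new


def noisy_grad(x, rng, noise=0.01):
    """Noisy gradient of the toy objective f(x, y) = 0.5*x^2 - 0.5*y^2.

    The origin is a strict saddle: stable along x, unstable along y.
    """
    gx = x[0] + rng.gauss(0.0, noise)
    gy = -x[1] + rng.gauss(0.0, noise)
    return [gx, gy]


def steps_to_escape(momentum, lr=0.01, radius=1.0, seed=0, max_steps=100000):
    """Count updates until the iterate leaves |y| <= radius, starting at the saddle."""
    rng = random.Random(seed)
    x, v = [0.0, 0.0], [0.0, 0.0]
    for t in range(1, max_steps + 1):
        x, v = msgd_step(x, v, noisy_grad(x, rng), lr, momentum)
        if abs(x[1]) > radius:
            return t
    return max_steps


if __name__ == "__main__":
    print("plain SGD (momentum=0.0):", steps_to_escape(0.0))
    print("MSGD      (momentum=0.9):", steps_to_escape(0.9))
```

On this toy problem the unstable direction grows by a larger factor per step when momentum is nonzero, so the momentum run should leave the neighborhood of the saddle in fewer updates. The slowdown near optima that the abstract also describes is not modeled here.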

THE USE OF CONTROL THEORY METHODS IN TRAINING NEURAL NETWORKS ON THE EXAMPLE OF TEETH RECOGNITION ON PANORAMIC X-RAY IMAGES

Automation technological and business processes ◽

10.15673/atbp.v13i2.2055 ◽

2021 ◽

Vol 13 (2) ◽

pp. 36-40

Author(s):

A. Smorodin

Keyword(s):

Neural Networks ◽

Control Theory ◽

Gradient Descent ◽

Deep Neural Networks ◽

Discrete Dynamical System ◽

Stochastic Gradient Descent ◽

Network Training ◽

Panoramic Images ◽

Important Field ◽

New Algorithms

The article investigates a modification of stochastic gradient descent (SGD) based on the previously developed stabilization theory of cycles in discrete dynamical systems. The relation between stabilizing cycles in discrete dynamical systems and finding extremum points allowed us to apply new control methods to accelerate gradient descent when approaching local minima. Gradient descent is often used in training deep neural networks, on a par with other iterative methods. Two gradient methods, SGD and Adam, were used in comparative experiments. All experiments were conducted while solving a practical problem of teeth recognition on 2-D panoramic images. Network training showed that the new method outperforms SGD and, for the parameters chosen, approaches the capabilities of Adam, which is a “state of the art” method. Thus, the practical utility of using control theory in the training of deep neural networks, and the possibility of expanding its applicability in creating new algorithms in this important field, are shown.

Gradient Descent Finds Global Minima for Generalizable Deep Neural Networks of Practical Sizes

2019 57th Annual Allerton Conference on Communication, Control, and Computing (Allerton) ◽

10.1109/allerton.2019.8919696 ◽

2019 ◽

Cited By ~ 1

Author(s):

Kenji Kawaguchi ◽

Jiaoyang Huang

Keyword(s):

Neural Networks ◽

Gradient Descent ◽

Deep Neural Networks ◽

Global Minima

An Enhanced Stochastic Gradient Descent Variance Reduced Ascension Optimization Algorithm for Deep Neural Networks

Applied Computer Vision and Image Processing - Advances in Intelligent Systems and Computing ◽

10.1007/978-981-15-4029-5_38 ◽

2020 ◽

pp. 378-385

Author(s):

Arifa Shikalgar ◽

Shefali Sonavane

Keyword(s):

Neural Networks ◽

Optimization Algorithm ◽

Gradient Descent ◽

Deep Neural Networks ◽

Stochastic Gradient ◽

Stochastic Gradient Descent

Archetypal landscapes for deep neural networks

Proceedings of the National Academy of Sciences ◽

10.1073/pnas.1919995117 ◽

2020 ◽

Vol 117 (36) ◽

pp. 21857-21864

Author(s):

Philipp C. Verpoort ◽

Alpha A. Lee ◽

David J. Wales

Keyword(s):

Neural Networks ◽

Learning Community ◽

Gradient Descent ◽

Deep Neural Networks ◽

Loss Functions ◽

Stochastic Gradient Descent ◽

High Dimensional ◽

Local Minima ◽

High Loss ◽

Optimization Schemes

The predictive capabilities of deep neural networks (DNNs) continue to evolve to increasingly impressive levels. However, it is still unclear how training procedures for DNNs succeed in finding parameters that produce good results for such high-dimensional and nonconvex loss functions. In particular, we wish to understand why simple optimization schemes, such as stochastic gradient descent, do not end up trapped in local minima with high loss values that would not yield useful predictions. We explain the optimizability of DNNs by characterizing the local minima and transition states of the loss-function landscape (LFL) along with their connectivity. We show that the LFL of a DNN in the shallow network or data-abundant limit is funneled, and thus easy to optimize. Crucially, in the opposite low-data/deep limit, although the number of minima increases, the landscape is characterized by many minima with similar loss values separated by low barriers. This organization is different from the hierarchical landscapes of structural glass formers and explains why minimization procedures commonly employed by the machine-learning community can navigate the LFL successfully and reach low-lying solutions.

Gradient Descent Analysis: On Visualizing the Training of Deep Neural Networks

Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications ◽

10.5220/0007583403380345 ◽

2019 ◽

Author(s):

Martin Becker ◽

Jens Lippel ◽

Thomas Zielke

Keyword(s):

Neural Networks ◽

Gradient Descent ◽

Deep Neural Networks

Characterizing Learning Dynamics of Deep Neural Networks via Complex Networks

10.1109/ictai52525.2021.00056 ◽

2021 ◽

Author(s):

Emanuele La Malfa ◽

Gabriele La Malfa ◽

Giuseppe Nicosia ◽

Vito Latora

Keyword(s):

Neural Networks ◽

Complex Networks ◽

Deep Neural Networks ◽

Learning Dynamics

Understanding approximate Fisher information for fast convergence of natural gradient descent in wide neural networks

Journal of Statistical Mechanics Theory and Experiment ◽

10.1088/1742-5468/ac3ae3 ◽

2021 ◽

Vol 2021 (12) ◽

pp. 124010

Author(s):

Ryo Karakida ◽

Kazuki Osawa

Keyword(s):

Neural Networks ◽

Function Space ◽

Fisher Information ◽

Gradient Descent ◽

Large Scale ◽

Deep Neural Networks ◽

Theoretical Perspective ◽

Computational Cost ◽

Fast Convergence ◽

Natural Gradient

Natural gradient descent (NGD) helps to accelerate the convergence of gradient descent dynamics, but it requires approximations in large-scale deep neural networks because of its high computational cost. Empirical studies have confirmed that some NGD methods with approximate Fisher information converge sufficiently fast in practice. Nevertheless, it remains unclear from the theoretical perspective why and under what conditions such heuristic approximations work well. In this work, we reveal that, under specific conditions, NGD with approximate Fisher information achieves the same fast convergence to global minima as exact NGD. We consider deep neural networks in the infinite-width limit, and analyze the asymptotic training dynamics of NGD in function space via the neural tangent kernel. In the function space, the training dynamics with the approximate Fisher information are identical to those with the exact Fisher information, and they converge quickly. The fast convergence holds in layer-wise approximations; for instance, in block diagonal approximation where each block corresponds to a layer as well as in block tri-diagonal and K-FAC approximations. We also find that a unit-wise approximation achieves the same fast convergence under some assumptions. All of these different approximations have an isotropic gradient in the function space, and this plays a fundamental role in achieving the same convergence properties in training. Thus, the current study gives a novel and unified theoretical foundation with which to understand NGD methods in deep learning.
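For readers unfamiliar with NGD, the basic update preconditions the gradient with the inverse Fisher information matrix. The sketch below shows exact NGD only, on a hypothetical two-parameter linear model with unit-variance Gaussian noise, where the Fisher matrix is the mean of x xᵀ. It does not reproduce the approximate Fisher variants or the infinite-width analysis discussed in the abstract; the function name `ngd_step` and the data are assumptions for illustration.

```python
def ngd_step(w, xs, ys, lr=1.0):
    """One exact natural-gradient step for the model y ~ w[0]*x[0] + w[1]*x[1].

    Assumes unit-variance Gaussian noise, so the Fisher matrix is mean(x x^T)
    and the loss is 0.5 * mean((w.x - y)^2).
    """
    n = len(xs)
    g = [0.0, 0.0]                      # gradient of the loss
    f = [[0.0, 0.0], [0.0, 0.0]]        # Fisher information matrix
    for x, y in zip(xs, ys):
        r = w[0] * x[0] + w[1] * x[1] - y   # residual for this sample
        for i in range(2):
            g[i] += r * x[i] / n
            for j in range(2):
                f[i][j] += x[i] * x[j] / n
    # Apply the explicit 2x2 inverse of F to the gradient: nat = F^{-1} g.
    det = f[0][0] * f[1][1] - f[0][1] * f[1][0]
    nat = [
        (f[1][1] * g[0] - f[0][1] * g[1]) / det,
        (-f[1][0] * g[0] + f[0][0] * g[1]) / det,
    ]
    return [w[0] - lr * nat[0], w[1] - lr * nat[1]]


if __name__ == "__main__":
    xs = [(1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (2.0, 1.0)]
    w_true = (3.0, -2.0)
    ys = [w_true[0] * a + w_true[1] * b for a, b in xs]
    print(ngd_step([0.0, 0.0], xs, ys))
```

For this quadratic loss, a single step with `lr=1.0` lands on the least-squares solution, which gives a concrete sense of the fast convergence that motivates NGD. In large networks the Fisher matrix is too large to invert directly, which is why the approximations studied in the abstract matter.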

Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/452 ◽

2020 ◽

Author(s):

Jinghui Chen ◽

Dongruo Zhou ◽

Yiqi Tang ◽

Ziyan Yang ◽

Yuan Cao ◽

...

Keyword(s):

Neural Networks ◽

Convergence Rate ◽

Gradient Descent ◽

Deep Neural Networks ◽

Gradient Methods ◽

Estimation Method ◽

Fast Convergence ◽

Stochastic Gradient Descent ◽

Adaptive Parameter ◽

Fast Convergence Rate

Adaptive gradient methods, which use historical gradient information to automatically adjust the learning rate, converge fast but have been observed to generalize worse than stochastic gradient descent (SGD) with momentum in training deep neural networks. How to close this generalization gap of adaptive gradient methods remains an open problem. In this work, we show that adaptive gradient methods such as Adam and Amsgrad are sometimes "over adapted". We design a new algorithm, called the Partially adaptive momentum estimation method, which unifies Adam/Amsgrad with SGD by introducing a partial adaptive parameter $p$, to achieve the best of both worlds. We also prove the convergence rate of our proposed algorithm to a stationary point in the stochastic nonconvex optimization setting. Experiments on standard benchmarks show that our proposed algorithm can maintain a fast convergence rate like Adam/Amsgrad while generalizing as well as SGD in training deep neural networks. These results suggest that practitioners can pick up adaptive gradient methods once again for faster training of deep neural networks.
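One way to picture the partial adaptive parameter $p$ is as an exponent on the second-moment estimate in an Amsgrad-style update. The sketch below assumes that form: with `p=0.5` the step is an Amsgrad-like update, and with `p=0` the denominator becomes 1 and the step reduces to SGD with a moving-average momentum term. The abstract does not spell out the update rule, so the exact formula, the placement of `eps`, and the name `padam_step` are assumptions rather than the authors' definition.

```python
def padam_step(theta, grad, state, lr=0.1, p=0.125,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One partially adaptive step (assumed form, see the note above).

    state holds per-coordinate lists: "m" (first moment), "v" (second moment),
    and "v_hat" (running maximum of v, as in Amsgrad).
    """
    m = [beta1 * mi + (1 - beta1) * gi for mi, gi in zip(state["m"], grad)]
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(state["v"], grad)]
    v_hat = [max(hi, vi) for hi, vi in zip(state["v_hat"], v)]
    # The exponent p interpolates between no adaptivity (p = 0) and
    # Amsgrad-style scaling (p = 0.5). eps is added inside the power so that
    # p = 0 gives a denominator of exactly 1.
    new_theta = [ti - lr * mi / (hi + eps) ** p
                 for ti, mi, hi in zip(theta, m, v_hat)]
    return new_theta, {"m": m, "v": v, "v_hat": v_hat}


def init_state(dim):
    """Zero-initialized optimizer state for a parameter vector of length dim."""
    return {"m": [0.0] * dim, "v": [0.0] * dim, "v_hat": [0.0] * dim}


if __name__ == "__main__":
    theta, state = [1.0], init_state(1)
    for p in (0.0, 0.125, 0.5):
        print(p, padam_step(theta, [2.0], state, p=p)[0])
```

Comparing the printed steps for a few values of `p` shows how the exponent controls the amount of per-coordinate rescaling applied to the momentum term.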

Learning dynamics of kernel-based deep neural networks in manifolds

Science China Information Sciences ◽

10.1007/s11432-020-3022-3 ◽

2021 ◽

Vol 64 (11) ◽

Author(s):

Wei Wu ◽

Xiaoyuan Jing ◽

Wencai Du ◽

Guoliang Chen

Keyword(s):

Neural Networks ◽

Deep Neural Networks ◽

Learning Dynamics
