Distributed Gradient Descent: Nonconvergence to Saddle Points and the Stable-Manifold Theorem

Author(s):  
Brian Swenson ◽  
Ryan Murray ◽  
H. Vincent Poor ◽  
Soummya Kar
2015 ◽  
Vol 83 (4) ◽  
pp. 2435-2452 ◽  
Author(s):  
Amey Deshpande ◽  
Varsha Daftardar-Gejji

2021 ◽  
Author(s):  
Tianyi Liu ◽  
Zhehui Chen ◽  
Enlu Zhou ◽  
Tuo Zhao

The momentum stochastic gradient descent (MSGD) algorithm has been widely applied to nonconvex optimization problems in machine learning (e.g., training deep neural networks and variational Bayesian inference). Despite its empirical success, the convergence properties of MSGD are still not well understood theoretically. To fill this gap, we propose to analyze the algorithmic behavior of MSGD through diffusion approximations for nonconvex optimization problems with strict saddle points and isolated local optima. Our study shows that momentum helps escape saddle points but hurts convergence within the neighborhood of optima (unless the step size or the momentum is annealed). This theoretical discovery partially corroborates the empirical success of MSGD in training deep neural networks.
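As a rough illustration of the dynamics this abstract describes, here is a minimal MSGD sketch on a toy strict-saddle problem; the objective, hyperparameters, and gradient-noise model are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def msgd(grad, x0, lr=0.005, momentum=0.9, noise=0.01, n_steps=4000, seed=0):
    """Momentum SGD: v <- momentum*v - lr*g, x <- x + v, with noisy gradients."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(n_steps):
        g = grad(x) + noise * rng.standard_normal(x.shape)  # stochastic gradient
        v = momentum * v - lr * g
        x = x + v
    return x

# f(x, y) = (x^2 - 1)^2 + y^2: strict saddle at (0, 0), isolated minima at (+-1, 0)
grad = lambda p: np.array([4 * p[0] * (p[0] ** 2 - 1.0), 2 * p[1]])

x = msgd(grad, [1e-3, 0.5])  # start near the saddle; momentum accelerates the escape
```

With a fixed step size, the iterate escapes the saddle quickly but keeps fluctuating around the minimum, which is the trade-off the abstract points to.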


2019 ◽  
Vol 2019 (752) ◽  
pp. 229-264 ◽  
Author(s):  
Stefano Luzzatto ◽  
Sina Tureli ◽  
Khadim War

Abstract We give new sufficient conditions for the integrability and unique integrability of continuous tangent subbundles on manifolds of arbitrary dimension, generalizing Frobenius’ classical theorem for $C^{1}$ subbundles. Using these conditions, we derive new criteria for uniqueness of solutions to ODEs and PDEs and for the integrability of invariant bundles in dynamical systems. In particular, we give a novel proof of the Stable Manifold Theorem and prove some integrability results for dynamically defined dominated splittings.
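For context, the classical Frobenius theorem that this abstract generalizes can be stated as follows (standard material, not taken from the paper):

```latex
% Classical Frobenius theorem in the C^1 setting.
Let $\Delta \subset TM$ be a $C^{1}$ tangent subbundle (distribution) on a
manifold $M$. Then $\Delta$ is integrable if and only if it is involutive:
\[
  X, Y \in \Gamma(\Delta) \;\Longrightarrow\; [X, Y] \in \Gamma(\Delta),
\]
i.e., the Lie bracket of any two sections of $\Delta$ is again a section of
$\Delta$. The paper's contribution is sufficient conditions that apply when
$\Delta$ is merely continuous, where the bracket condition is unavailable.
```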


Author(s):  
Ke Xue ◽  
Chao Qian ◽  
Ling Xu ◽  
Xudong Fei

Non-convex optimization, which arises in many artificial intelligence tasks, may have many saddle points and is NP-hard to solve in general. Evolutionary algorithms (EAs) are general-purpose derivative-free optimization algorithms with a good ability to find the global optimum, and can naturally be applied to non-convex optimization; their performance is, however, limited by low efficiency. Gradient descent (GD) runs efficiently, but converges only to a first-order stationary point, which may be a saddle point and thus arbitrarily bad. Some recent efforts have been put into combining EAs and GD, but previous works either utilized only a specific component of EAs or combined them heuristically without theoretical guarantees. In this paper, we propose an evolutionary GD (EGD) algorithm that combines typical components of EAs, i.e., population and mutation, with GD. We prove that EGD can converge to a second-order stationary point by escaping saddle points, and is more efficient than previous algorithms. Empirical results on non-convex synthetic functions as well as reinforcement learning (RL) tasks also show its superiority.
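A toy sketch of the population-plus-mutation-plus-GD idea (this is a simplified illustration, not the authors' EGD algorithm or its guarantees):

```python
import numpy as np

def evolutionary_gd(f, grad, x0, pop_size=8, lr=0.05, sigma=0.2, n_gens=300, seed=1):
    """Toy EA+GD loop: every individual takes a gradient step, mutated
    copies are added, and the best pop_size points survive."""
    rng = np.random.default_rng(seed)
    pop = np.asarray(x0, float) + sigma * rng.standard_normal((pop_size, len(x0)))
    for _ in range(n_gens):
        pop = pop - lr * np.apply_along_axis(grad, 1, pop)       # gradient step
        children = pop + sigma * rng.standard_normal(pop.shape)  # mutation
        combined = np.vstack([pop, children])
        fitness = np.apply_along_axis(f, 1, combined)
        pop = combined[np.argsort(fitness)[:pop_size]]           # selection
    return pop[0]  # best individual (pop is sorted by fitness after selection)

# Strict-saddle test function: saddle at (0, 0), minima at (+-1, 0).
f = lambda p: (p[0] ** 2 - 1.0) ** 2 + p[1] ** 2
grad = lambda p: np.array([4 * p[0] * (p[0] ** 2 - 1.0), 2 * p[1]])

best = evolutionary_gd(f, grad, [0.0, 0.0])  # started exactly at the saddle
```

Plain GD started exactly at the saddle would stay there forever; here mutation supplies the perturbation that lets the gradient steps pull the population into a basin.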


2021 ◽  
Vol 53 (2) ◽  
pp. 575-607
Author(s):  
Konstantinos Karatapanis

Abstract We consider stochastic differential equations of the form $dX_t = |f(X_t)|/t^{\gamma}\, dt + 1/t^{\gamma}\, dB_t$, where f(x) behaves comparably to $|x|^k$ in a neighborhood of the origin, for $k\in [1,\infty)$. We show that there exists a threshold value $\tilde{\gamma}$ for $\gamma$, depending on k, such that if $\gamma \in (1/2, \tilde{\gamma})$, then $\mathbb{P}(X_t\rightarrow 0) = 0$, while for the remaining permissible values of $\gamma$, $\mathbb{P}(X_t\rightarrow 0)>0$. These results extend to discrete processes satisfying $X_{n+1}-X_n = f(X_n)/n^\gamma +Y_n/n^\gamma$, where the $Y_{n+1}$ are almost surely bounded martingale differences. This result shows that for a function F whose second derivative at degenerate saddle points is of polynomial order, it is always possible to escape saddle points via the iteration $X_{n+1}-X_n =F'(X_n)/n^\gamma +Y_n/n^\gamma$ for a suitable choice of $\gamma$.
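A quick numerical illustration of the discrete iteration; the concrete choices $f(x)=x^3$ clipped to $[-1,1]$ (so $f(x)\sim|x|^3$ near the origin, i.e. $k=3$), $\gamma=0.55$, and bounded uniform noise are illustrative assumptions, not the paper's:

```python
import numpy as np

def escape_iteration(n_steps=2000, gamma=0.55, n_paths=200, seed=0):
    """Simulate X_{n+1} - X_n = f(X_n)/n^gamma + Y_n/n^gamma for many paths,
    all started at the degenerate saddle point 0."""
    rng = np.random.default_rng(seed)
    f = lambda x: np.clip(x, -1.0, 1.0) ** 3         # ~ |x|^3 near 0, bounded
    x = np.zeros(n_paths)                            # every path starts at 0
    for n in range(1, n_steps + 1):
        y = rng.uniform(-1.0, 1.0, n_paths)          # bounded martingale differences
        x = x + (f(x) + y) / n ** gamma
    return x

finals = escape_iteration()
# With gamma in the sub-threshold regime, the paths drift away from 0
# rather than converging to the saddle.
```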


1991 ◽  
Vol 02 (01n02) ◽  
pp. 35-46
Author(s):  
Yves Chauvin

The behavior of a constrained linear computing unit is analysed during “Hebbian” learning by gradient descent of a cost function corresponding to the sum of a variance-maximization term and a weight-normalization term. The n-dimensional landscape of this cost function is shown to be composed of one local maximum and of n saddle points plus one global minimum aligned with the principal components of the input patterns. Furthermore, the landscape can be described in terms of hyperspheres, hypercrests, and hypervalleys associated with each of these principal components. Using this description, it is shown that the learning trajectory converges to the global minimum of the landscape, corresponding to the main principal component of the input patterns, provided certain conditions on the starting weights and on the learning rate of the descent procedure hold. Extensions and implications of the algorithm are discussed.
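A small numerical sketch of a cost of this kind: a variance-maximization term plus a soft norm penalty (the specific cost $E(w) = -\frac{1}{2}w^{\top}Cw + \frac{\mu}{4}(\lVert w\rVert^2-1)^2$ and all constants here are assumptions for illustration, not necessarily the paper's exact cost):

```python
import numpy as np

def hebbian_gd(C, w0, lr=0.01, mu=10.0, n_steps=5000):
    """Gradient descent on E(w) = -1/2 w^T C w + mu/4 (||w||^2 - 1)^2:
    maximize projected variance while softly normalizing the weights."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        grad = -C @ w + mu * (w @ w - 1.0) * w  # dE/dw
        w = w - lr * grad
    return w

# Input-pattern covariance whose main principal component is the first axis.
C = np.diag([3.0, 1.0, 0.5])
w = hebbian_gd(C, [0.1, 0.5, 0.5])  # starting weights overlap the first axis
```

At the global minimum the weight vector aligns with the main principal component (here the first axis, with squared norm $1 + \lambda_1/\mu = 1.3$); the minor components correspond to the saddle points of the landscape.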

