Distributed Gradient Descent: Nonconvergence to Saddle Points and the Stable-Manifold Theorem

Author(s):  
Brian Swenson ◽  
Ryan Murray ◽  
H. Vincent Poor ◽  
Soummya Kar
2015 ◽  
Vol 83 (4) ◽  
pp. 2435-2452 ◽  
Author(s):  
Amey Deshpande ◽  
Varsha Daftardar-Gejji

2021 ◽  
Author(s):  
Tianyi Liu ◽  
Zhehui Chen ◽  
Enlu Zhou ◽  
Tuo Zhao

The momentum stochastic gradient descent (MSGD) algorithm has been widely applied to nonconvex optimization problems in machine learning (e.g., training deep neural networks and variational Bayesian inference). Despite its empirical success, the convergence properties of MSGD are still not well understood theoretically. To fill this gap, we propose to analyze the algorithmic behavior of MSGD through diffusion approximations for nonconvex optimization problems with strict saddle points and isolated local optima. Our study shows that momentum helps escape saddle points but hurts convergence within the neighborhood of optima (unless the step size or the momentum is annealed). This theoretical discovery partially corroborates the empirical success of MSGD in training deep neural networks.
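As a rough illustration of the dynamics this abstract describes, here is a minimal MSGD sketch on a toy strict-saddle problem; the objective, hyperparameters, and gradient-noise model are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def msgd(grad, x0, lr=0.005, momentum=0.9, noise=0.01, n_steps=4000, seed=0):
    """Momentum SGD: v <- momentum*v - lr*g, x <- x + v, with noisy gradients."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(n_steps):
        g = grad(x) + noise * rng.standard_normal(x.shape)  # stochastic gradient
        v = momentum * v - lr * g
        x = x + v
    return x

# f(x, y) = (x^2 - 1)^2 + y^2: strict saddle at (0, 0), isolated minima at (+-1, 0)
grad = lambda p: np.array([4 * p[0] * (p[0] ** 2 - 1.0), 2 * p[1]])

x = msgd(grad, [1e-3, 0.5])  # start near the saddle; momentum accelerates the escape
```

With a fixed step size, the iterate escapes the saddle quickly but keeps fluctuating around the minimum, which is the trade-off the abstract points to.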


2019 ◽  
Vol 2019 (752) ◽  
pp. 229-264 ◽  
Author(s):  
Stefano Luzzatto ◽  
Sina Tureli ◽  
Khadim War

Abstract We give new sufficient conditions for the integrability and unique integrability of continuous tangent subbundles on manifolds of arbitrary dimension, generalizing Frobenius’ classical theorem for $C^{1}$ subbundles. Using these conditions, we derive new criteria for uniqueness of solutions to ODEs and PDEs and for the integrability of invariant bundles in dynamical systems. In particular, we give a novel proof of the Stable Manifold Theorem and prove some integrability results for dynamically defined dominated splittings.
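For context, the classical Frobenius theorem that this abstract generalizes can be stated as follows (standard material, not taken from the paper):

```latex
% Classical Frobenius theorem in the C^1 setting.
Let $\Delta \subset TM$ be a $C^{1}$ tangent subbundle (distribution) on a
manifold $M$. Then $\Delta$ is integrable if and only if it is involutive:
\[
  X, Y \in \Gamma(\Delta) \;\Longrightarrow\; [X, Y] \in \Gamma(\Delta),
\]
i.e., the Lie bracket of any two sections of $\Delta$ is again a section of
$\Delta$. The paper's contribution is sufficient conditions that apply when
$\Delta$ is merely continuous, where the bracket condition is unavailable.
```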


Author(s):  
Ke Xue ◽  
Chao Qian ◽  
Ling Xu ◽  
Xudong Fei

Non-convex optimization, which arises in many artificial intelligence tasks, may have many saddle points and is NP-hard to solve in general. Evolutionary algorithms (EAs) are general-purpose derivative-free optimization algorithms with a good ability to find the global optimum, and can naturally be applied to non-convex optimization; their performance is, however, limited by low efficiency. Gradient descent (GD) runs efficiently, but converges only to a first-order stationary point, which may be a saddle point and thus arbitrarily bad. Some recent efforts have been put into combining EAs and GD, but previous works either utilized only a specific component of EAs or combined them heuristically without theoretical guarantees. In this paper, we propose an evolutionary GD (EGD) algorithm that combines typical components of EAs, i.e., population and mutation, with GD. We prove that EGD can converge to a second-order stationary point by escaping saddle points, and is more efficient than previous algorithms. Empirical results on non-convex synthetic functions as well as reinforcement learning (RL) tasks also show its superiority.
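A toy sketch of the population-plus-mutation-plus-GD idea (this is a simplified illustration, not the authors' EGD algorithm or its guarantees):

```python
import numpy as np

def evolutionary_gd(f, grad, x0, pop_size=8, lr=0.05, sigma=0.2, n_gens=300, seed=1):
    """Toy EA+GD loop: every individual takes a gradient step, mutated
    copies are added, and the best pop_size points survive."""
    rng = np.random.default_rng(seed)
    pop = np.asarray(x0, float) + sigma * rng.standard_normal((pop_size, len(x0)))
    for _ in range(n_gens):
        pop = pop - lr * np.apply_along_axis(grad, 1, pop)       # gradient step
        children = pop + sigma * rng.standard_normal(pop.shape)  # mutation
        combined = np.vstack([pop, children])
        fitness = np.apply_along_axis(f, 1, combined)
        pop = combined[np.argsort(fitness)[:pop_size]]           # selection
    return pop[0]  # best individual (pop is sorted by fitness after selection)

# Strict-saddle test function: saddle at (0, 0), minima at (+-1, 0).
f = lambda p: (p[0] ** 2 - 1.0) ** 2 + p[1] ** 2
grad = lambda p: np.array([4 * p[0] * (p[0] ** 2 - 1.0), 2 * p[1]])

best = evolutionary_gd(f, grad, [0.0, 0.0])  # started exactly at the saddle
```

Plain GD started exactly at the saddle would stay there forever; here mutation supplies the perturbation that lets the gradient steps pull the population into a basin.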


2021 ◽  
Vol 53 (2) ◽  
pp. 575-607
Author(s):  
Konstantinos Karatapanis

Abstract We consider stochastic differential equations of the form $dX_t = |f(X_t)|/t^{\gamma}\, dt + 1/t^{\gamma}\, dB_t$, where f(x) behaves comparably to $|x|^k$ in a neighborhood of the origin, for $k\in [1,\infty)$. We show that there exists a threshold value $\tilde{\gamma}$ for $\gamma$, depending on k, such that if $\gamma \in (1/2, \tilde{\gamma})$, then $\mathbb{P}(X_t\rightarrow 0) = 0$, while for the remaining permissible values of $\gamma$, $\mathbb{P}(X_t\rightarrow 0)>0$. These results extend to discrete processes satisfying $X_{n+1}-X_n = f(X_n)/n^\gamma +Y_n/n^\gamma$, where the $Y_{n+1}$ are almost surely bounded martingale differences. This result shows that for a function F whose second derivative at degenerate saddle points is of polynomial order, it is always possible to escape saddle points via the iteration $X_{n+1}-X_n =F'(X_n)/n^\gamma +Y_n/n^\gamma$ for a suitable choice of $\gamma$.
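A quick numerical illustration of the discrete iteration; the concrete choices $f(x)=x^3$ clipped to $[-1,1]$ (so $f(x)\sim|x|^3$ near the origin, i.e. $k=3$), $\gamma=0.55$, and bounded uniform noise are illustrative assumptions, not the paper's:

```python
import numpy as np

def escape_iteration(n_steps=2000, gamma=0.55, n_paths=200, seed=0):
    """Simulate X_{n+1} - X_n = f(X_n)/n^gamma + Y_n/n^gamma for many paths,
    all started at the degenerate saddle point 0."""
    rng = np.random.default_rng(seed)
    f = lambda x: np.clip(x, -1.0, 1.0) ** 3         # ~ |x|^3 near 0, bounded
    x = np.zeros(n_paths)                            # every path starts at 0
    for n in range(1, n_steps + 1):
        y = rng.uniform(-1.0, 1.0, n_paths)          # bounded martingale differences
        x = x + (f(x) + y) / n ** gamma
    return x

finals = escape_iteration()
# With gamma in the sub-threshold regime, the paths drift away from 0
# rather than converging to the saddle.
```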


1991 ◽  
Vol 02 (01n02) ◽  
pp. 35-46
Author(s):  
Yves Chauvin

The behavior of a constrained linear computing unit is analysed during “Hebbian” learning by gradient descent of a cost function corresponding to the sum of a variance-maximization term and a weight-normalization term. The n-dimensional landscape of this cost function is shown to be composed of one local maximum and of n saddle points plus one global minimum aligned with the principal components of the input patterns. Furthermore, the landscape can be described in terms of hyperspheres, hypercrests, and hypervalleys associated with each of these principal components. Using this description, it is shown that the learning trajectory converges to the global minimum of the landscape, corresponding to the main principal component of the input patterns, provided certain conditions on the starting weights and on the learning rate of the descent procedure hold. Extensions and implications of the algorithm are discussed.
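A small numerical sketch of a cost of this kind: a variance-maximization term plus a soft norm penalty (the specific cost $E(w) = -\frac{1}{2}w^{\top}Cw + \frac{\mu}{4}(\lVert w\rVert^2-1)^2$ and all constants here are assumptions for illustration, not necessarily the paper's exact cost):

```python
import numpy as np

def hebbian_gd(C, w0, lr=0.01, mu=10.0, n_steps=5000):
    """Gradient descent on E(w) = -1/2 w^T C w + mu/4 (||w||^2 - 1)^2:
    maximize projected variance while softly normalizing the weights."""
    w = np.asarray(w0, dtype=float)
    for _ in range(n_steps):
        grad = -C @ w + mu * (w @ w - 1.0) * w  # dE/dw
        w = w - lr * grad
    return w

# Input-pattern covariance whose main principal component is the first axis.
C = np.diag([3.0, 1.0, 0.5])
w = hebbian_gd(C, [0.1, 0.5, 0.5])  # starting weights overlap the first axis
```

At the global minimum the weight vector aligns with the main principal component (here the first axis, with squared norm $1 + \lambda_1/\mu = 1.3$); the minor components correspond to the saddle points of the landscape.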

