On the Convergence of (Stochastic) Gradient Descent with Extrapolation for Non-Convex Minimization

Author(s):  
Yi Xu ◽  
Zhuoning Yuan ◽  
Sen Yang ◽  
Rong Jin ◽  
Tianbao Yang

Extrapolation is a well-known technique for solving convex optimization problems and variational inequalities, and it has recently attracted attention for non-convex optimization. Several recent works have empirically shown its success in some machine learning tasks. However, it has not been analyzed for non-convex minimization, and a gap remains between theory and practice. In this paper, we analyze gradient descent and stochastic gradient descent with extrapolation for finding an approximate first-order stationary point of smooth non-convex optimization problems. Our convergence upper bounds show that the algorithms with extrapolation converge faster than their counterparts without extrapolation.
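
A minimal sketch of the extrapolation idea in the extragradient style: the gradient is evaluated at a look-ahead point rather than at the current iterate. The objective, step sizes, and iteration count below are illustrative assumptions, not the paper's exact scheme.

```python
import numpy as np

def gd_with_extrapolation(grad, x0, eta=0.05, gamma=0.05, n_iters=200):
    """Extragradient-style update: compute the gradient at an
    extrapolated (look-ahead) point, then step from the current iterate."""
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iters):
        y = x - gamma * grad(x)   # extrapolation (look-ahead) step
        x = x - eta * grad(y)     # update with the look-ahead gradient
    return x

# Illustrative smooth non-convex objective f(x) = x^2 + 3*sin(x)^2
grad_f = lambda x: 2.0 * x + 3.0 * np.sin(2.0 * x)
print(gd_with_extrapolation(grad_f, x0=2.0))  # approaches the stationary point at 0
```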

2021 ◽  
Vol 0 (0) ◽  
Author(s):  
Darina Dvinskikh ◽  
Alexander Gasnikov

Abstract We introduce primal and dual stochastic gradient oracle methods for decentralized convex optimization problems. For both primal and dual oracles, the proposed methods are optimal in terms of the number of communication steps. However, for all classes of objective, optimality in terms of the number of oracle calls per node holds only up to a logarithmic factor and up to the notion of smoothness. Using a mini-batching technique, we show that the proposed methods with a stochastic oracle can be additionally parallelized at each node. The considered algorithms can be applied to many data science problems and inverse problems.
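
As a rough illustration of how communication steps and per-node oracle calls separate in such methods, here is a sketch of one decentralized round combining gossip averaging with a mini-batched stochastic gradient step. The mixing matrix W, the stochastic_grad helper, and the update rule are assumptions for illustration; the paper's optimal primal and dual methods are considerably more elaborate.

```python
import numpy as np

def decentralized_sgd_step(X, W, stochastic_grad, eta, batch_size):
    """One round: gossip-average iterates over the network, then take a
    mini-batched stochastic gradient step at every node in parallel.
    X: (n_nodes, dim) local iterates; W: doubly stochastic mixing matrix."""
    X = W @ X                                      # communication step
    for i in range(X.shape[0]):                    # oracle calls at each node
        g = np.mean([stochastic_grad(i, X[i]) for _ in range(batch_size)],
                    axis=0)                        # mini-batching: parallelizable
        X[i] = X[i] - eta * g
    return X
```

The batch_size oracle calls inside each node are independent, which is what makes the extra parallelization at each node possible.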


2021 ◽  
Author(s):  
Tianyi Liu ◽  
Zhehui Chen ◽  
Enlu Zhou ◽  
Tuo Zhao

The momentum stochastic gradient descent (MSGD) algorithm has been widely applied to many nonconvex optimization problems in machine learning (e.g., training deep neural networks, variational Bayesian inference). Despite its empirical success, there is still a lack of theoretical understanding of the convergence properties of MSGD. To fill this gap, we analyze the algorithmic behavior of MSGD by diffusion approximations for nonconvex optimization problems with strict saddle points and isolated local optima. Our study shows that momentum helps escape from saddle points but hurts convergence within the neighborhood of optima (unless the step size or the momentum is annealed). Our theoretical discovery partially corroborates the empirical success of MSGD in training deep neural networks.
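
A minimal sketch of the MSGD update in its heavy-ball form, on a toy objective with a strict saddle point. The objective, noise model, and hyper-parameters are illustrative assumptions.

```python
import numpy as np

def msgd(grad, x0, eta=0.01, mu=0.9, noise=0.05, n_iters=2000, seed=0):
    """Heavy-ball momentum SGD: the velocity v accumulates past gradients,
    which helps push iterates away from strict saddle points but causes
    oscillation near optima unless eta or mu is annealed."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    v = np.zeros_like(x)
    for _ in range(n_iters):
        g = grad(x) + noise * rng.standard_normal(x.shape)  # stochastic gradient
        v = mu * v - eta * g
        x = x + v
    return x

# f(x, y) = (x^2 - 1)^2 + y^2: strict saddle at (0, 0), minima at (+-1, 0)
grad_f = lambda z: np.array([4.0 * z[0] * (z[0] ** 2 - 1.0), 2.0 * z[1]])
print(msgd(grad_f, x0=[1e-3, 0.5]))  # escapes the saddle toward (+-1, 0)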


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Jinhuan Duan ◽  
Xianxian Li ◽  
Shiqi Gao ◽  
Zili Zhong ◽  
Jinyan Wang

With the vigorous development of artificial intelligence technology, various engineering applications have been implemented one after another. The gradient descent method plays an important role in solving various optimization problems due to its simple structure, good stability, and easy implementation. However, in multinode machine learning systems, gradients usually need to be shared, which can cause privacy leakage, because attackers can infer training data from the gradient information. In this paper, to prevent gradient leakage while preserving the accuracy of the model, we propose the super stochastic gradient descent approach, which updates parameters by concealing the modulus length of each gradient vector, converting it into a unit vector. Furthermore, we analyze the security of the super stochastic gradient descent approach and demonstrate that our algorithm can defend against attacks on the gradient. Experimental results show that our approach is clearly superior to prevalent gradient descent approaches in terms of accuracy, robustness, and adaptability to large-scale batches. Interestingly, our algorithm can also resist model poisoning attacks to a certain extent.
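
A minimal sketch of the core normalization idea: each worker shares only the direction of its gradient, so the modulus length that gradient-inversion attacks exploit is concealed. The server-side aggregation rule shown here is an assumption for illustration; the paper's full protocol may differ.

```python
import numpy as np

def conceal_gradient(g, eps=1e-12):
    """Share only the gradient's direction: normalizing to a unit vector
    hides the modulus length exploited by gradient-inversion attacks."""
    return g / (np.linalg.norm(g) + eps)

def aggregate_step(theta, unit_grads, eta):
    """Assumed server-side rule: step along the average of the workers'
    shared unit gradient vectors."""
    return theta - eta * np.mean(unit_grads, axis=0)

# Each worker normalizes locally before sharing, e.g.:
# shared = [conceal_gradient(local_grad(w)) for w in workers]
# theta = aggregate_step(theta, shared, eta=0.01)
```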


Quantum ◽  
2020 ◽  
Vol 4 ◽  
pp. 314 ◽  
Author(s):  
Ryan Sweke ◽  
Frederik Wilde ◽  
Johannes Jakob Meyer ◽  
Maria Schuld ◽  
Paul K. Fährmann ◽  
...  

Within the context of hybrid quantum-classical optimization, gradient descent based optimizers typically require the evaluation of expectation values with respect to the outcome of parameterized quantum circuits. In this work, we explore the consequences of the prior observation that estimation of these quantities on quantum hardware results in a form of stochastic gradient descent optimization. We formalize this notion, which allows us to show that in many relevant cases, including VQE, QAOA, and certain quantum classifiers, estimating expectation values with k measurement outcomes results in optimization algorithms whose convergence properties can be rigorously well understood, for any value of k. In fact, even using single measurement outcomes for the estimation of expectation values is sufficient. Moreover, in many settings the required gradients can be expressed as linear combinations of expectation values (originating, e.g., from a sum over local terms of a Hamiltonian, a parameter shift rule, or a sum over data-set instances), and we show that in these cases k-shot expectation value estimation can be combined with sampling over terms of the linear combination to obtain "doubly stochastic" gradient descent optimizers. For all algorithms we prove convergence guarantees, providing a framework for the derivation of rigorous optimization results in the context of near-term quantum devices. Additionally, we numerically explore these methods on benchmark VQE, QAOA, and quantum-enhanced machine learning tasks and show that treating the stochastic settings as hyper-parameters allows for state-of-the-art results with significantly fewer circuit executions and measurements.
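
A minimal classical simulation of the k-shot idea on a toy single-qubit expectation ⟨Z⟩ = cos(θ), with the gradient obtained via the parameter-shift rule applied to shot-based estimates. Even k = 1 yields an unbiased stochastic gradient. The toy circuit and shot model are illustrative assumptions, not the paper's benchmarks.

```python
import numpy as np

rng = np.random.default_rng(0)

def expval_k_shots(theta, k):
    """Estimate <Z> = cos(theta) from k single-shot (+1/-1) outcomes:
    P(+1) = cos^2(theta/2), so the sample mean is unbiased for cos(theta)."""
    p_plus = np.cos(theta / 2) ** 2
    shots = np.where(rng.random(k) < p_plus, 1.0, -1.0)
    return shots.mean()

def shot_gradient(theta, k):
    """Parameter-shift rule: d<Z>/dtheta = (<Z>(theta + pi/2)
    - <Z>(theta - pi/2)) / 2, here evaluated with k-shot estimates."""
    return 0.5 * (expval_k_shots(theta + np.pi / 2, k)
                  - expval_k_shots(theta - np.pi / 2, k))

# SGD on E(theta) = <Z> using single-shot (k = 1) gradient estimates
theta, eta = 0.3, 0.1
for _ in range(500):
    theta -= eta * shot_gradient(theta, k=1)
print(theta)  # drifts toward pi, where cos(theta) is minimized
```

Sampling over terms of a Hamiltonian or data set on top of the k-shot estimate would add the second source of stochasticity behind the "doubly stochastic" name.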


2020 ◽  
Vol 34 (05) ◽  
pp. 7169-7178 ◽  
Author(s):  
Jemin George ◽  
Prudhvi Gurram

We develop a Distributed Event-Triggered Stochastic GRAdient Descent (DETSGRAD) algorithm for solving non-convex optimization problems typically encountered in distributed deep learning. We propose a novel communication triggering mechanism that allows the networked agents to update their model parameters aperiodically, and we provide sufficient conditions on the algorithm step-sizes that guarantee asymptotic mean-square convergence. The algorithm is applied to a distributed supervised-learning problem, in which a set of networked agents collaboratively train their individual neural networks to perform image classification, while aperiodically sharing the model parameters with their one-hop neighbors. Results indicate that all agents report similar performance, comparable to that of a centrally trained neural network, while the event-triggered communication significantly reduces inter-agent communication. Results also show that the proposed algorithm allows the individual agents to classify images even though the training data corresponding to all the classes are not locally available to each agent.
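
A minimal sketch of the event-triggering idea: an agent broadcasts its parameters only when they have drifted far enough from the last broadcast copy, so communication happens aperiodically. The trigger threshold and class structure are illustrative assumptions, not the paper's exact conditions.

```python
import numpy as np

class EventTriggeredAgent:
    """Broadcasts parameters aperiodically: only when the drift since the
    last broadcast exceeds a (typically decaying) threshold."""
    def __init__(self, theta0, threshold=0.1):
        self.theta = np.asarray(theta0, dtype=float)
        self.last_sent = self.theta.copy()
        self.threshold = threshold

    def local_step(self, stochastic_grad, eta):
        self.theta -= eta * stochastic_grad(self.theta)

    def maybe_broadcast(self, send):
        """Trigger condition: ||theta - last_sent|| > threshold."""
        if np.linalg.norm(self.theta - self.last_sent) > self.threshold:
            self.last_sent = self.theta.copy()
            send(self.theta)          # communicate to one-hop neighbors
            return True
        return False                  # skip communication this round
```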

