An analysis of stochastic variance reduced gradient for linear inverse problems

2021
Author(s): Bangti Jin, Zehui Zhou, Jun Zou

Abstract: Stochastic variance reduced gradient (SVRG) is a popular variance reduction technique for stochastic gradient descent (SGD). We provide a first analysis of the method for solving a class of linear inverse problems through the lens of classical regularization theory. We prove that, for a suitable constant step-size schedule, the method can achieve an optimal convergence rate in terms of the noise level (under a suitable regularity condition), and that the variance of the SVRG iterate error is smaller than that of SGD. These theoretical findings are corroborated by a set of numerical experiments.
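As a concrete illustration of the algorithm analyzed above (a minimal sketch of standard SVRG on a least-squares problem with one row per sample, not the paper's regularization-theoretic setting), the key idea is to correct each stochastic gradient with the same sample's gradient at a periodic snapshot plus the snapshot's full gradient:

```python
import numpy as np

def svrg_least_squares(A, b, step=0.02, epochs=100, rng=None):
    """SVRG for min_x (1/2n) ||A x - b||^2, treating each row of A as one sample."""
    rng = np.random.default_rng(rng)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(epochs):
        snapshot = x.copy()
        full_grad = A.T @ (A @ snapshot - b) / n        # full gradient at the snapshot
        for _ in range(n):
            i = rng.integers(n)
            gi = A[i] * (A[i] @ x - b[i])               # stochastic gradient at the iterate
            gi_snap = A[i] * (A[i] @ snapshot - b[i])   # same sample at the snapshot
            x -= step * (gi - gi_snap + full_grad)      # variance-reduced update
    return x
```

Because the correction term vanishes as both the iterate and the snapshot approach the solution, the update's variance shrinks to zero, which is what permits a constant step size.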

Author(s): Andrew Jacobsen, Matthew Schlegel, Cameron Linke, Thomas Degris, Adam White, ...

This paper investigates different vector step-size adaptation approaches for non-stationary, online, continual prediction problems. Vanilla stochastic gradient descent can be considerably improved by scaling the update with a vector of appropriately chosen step-sizes. Many methods, including AdaGrad, RMSProp, and AMSGrad, keep statistics about the learning process to approximate a second-order update: a vector approximation of the inverse Hessian. Another family of approaches uses meta-gradient descent to adapt the step-size parameters to minimize prediction error. These meta-descent strategies are promising for non-stationary problems, but have not been as extensively explored as quasi-second-order methods. We first derive a general, incremental meta-descent algorithm, called AdaGain, designed to be applicable to a much broader range of algorithms, including those with semi-gradient updates or even accelerations, such as RMSProp. We provide an empirical comparison of methods from both families. We conclude that methods from both families can perform well, but that in non-stationary prediction problems the meta-descent methods exhibit advantages. Our method is particularly robust across several prediction problems, and is competitive with the state-of-the-art method on a large-scale time-series prediction problem on real data from a mobile robot.
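A minimal sketch of the quasi-second-order family mentioned above (plain RMSProp, which the abstract names; AdaGain itself is not reproduced here): a running average of squared gradients yields a per-coordinate vector of step-sizes, a diagonal proxy for inverse curvature:

```python
import numpy as np

def rmsprop_step(w, grad, v, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update: per-coordinate step-sizes derived from a
    running average of squared gradients."""
    v = beta * v + (1 - beta) * grad ** 2        # per-coordinate gradient statistics
    w = w - lr * grad / (np.sqrt(v) + eps)       # vector of adapted step-sizes
    return w, v
```

On an ill-conditioned quadratic f(w) = ½ Σ c_i w_i², both coordinates make progress at roughly the same rate despite a 100x curvature gap, which is the point of vector (rather than scalar) step-sizes.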


2020, Vol 34 (06), pp. 10126-10135
Author(s): Artyom Gadetsky, Kirill Struminsky, Christopher Robinson, Novi Quadrianto, Dmitry Vetrov

Learning models with discrete latent variables using stochastic gradient descent remains a challenge due to the high variance of gradient estimates. Modern variance reduction techniques mostly consider categorical distributions and have limited applicability when the number of possible outcomes becomes large. In this work, we consider models with latent permutations and propose control variates for the Plackett-Luce distribution. In particular, the control variates allow us to optimize black-box functions over permutations using stochastic gradient descent. To illustrate the approach, we consider a variety of causal structure learning tasks for continuous and discrete data. We show that our method outperforms competitive relaxation-based optimization methods and is also applicable to non-differentiable score functions.
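For background on how control variates tame score-function gradients (a generic categorical sketch, not the paper's Plackett-Luce construction; the batch-mean baseline used here is a simple control variate and is slightly biased, a leave-one-out baseline would fix that):

```python
import numpy as np

def score_function_grads(logits, f, n_samples, rng, baseline=False):
    """Score-function (REINFORCE) gradient estimates of E_{k~softmax(logits)}[f(k)]
    w.r.t. the logits, optionally centered by a mean baseline."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    k = rng.choice(len(p), size=n_samples, p=p)
    fx = f(k)
    b = fx.mean() if baseline else 0.0           # control variate (batch mean)
    onehot = np.eye(len(p))[k]
    # d log p_k / d logits = one_hot(k) - p
    return (fx - b)[:, None] * (onehot - p)
```

Subtracting the baseline leaves the estimator's expectation (essentially) unchanged while shrinking the magnitude of the multiplier, hence the variance.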


2020, Vol 34 (04), pp. 5503-5510
Author(s): Zhaolin Ren, Zhengyuan Zhou, Linhai Qiu, Ajay Deshpande, Jayant Kalagnanam

In large-scale optimization problems, distributed asynchronous stochastic gradient descent (DASGD) is a commonly used algorithm. In most applications, a large number of computing nodes asynchronously compute gradient information, so the gradient received at a given iteration is often stale. In the presence of such delays, which can be unbounded, the convergence of DASGD is uncertain. The contribution of this paper is twofold. First, we propose a delay-adaptive variant of DASGD in which each iteration's step-size is adjusted based on the size of the delay, and prove asymptotic convergence of the algorithm on variationally coherent stochastic problems, a class of functions that properly includes convex, quasi-convex and star-convex functions. Second, we extend the convergence results of standard DASGD, usually stated for problems with bounded domains, to problems with unbounded domains. In this way, we extend the frontier of theoretical guarantees for distributed asynchronous optimization and provide new insights for practitioners working on large-scale optimization problems.
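The effect of delay-adaptive step-sizes can be seen in a toy one-dimensional simulation (a fixed pipeline delay and an assumed 1/(1 + delay) scaling, purely illustrative, not the paper's schedule): with a stale gradient, the unscaled update overshoots and oscillates, while the shrunken step remains stable.

```python
from collections import deque

def delayed_sgd(grad_fn, x0, base_lr=0.5, delay=5, steps=400, adaptive=True):
    """Gradient descent where each applied gradient was computed `delay`
    iterations ago. With adaptive=True the step-size is shrunk to
    base_lr / (1 + delay), a simple delay-adaptive scaling."""
    x = float(x0)
    pipeline = deque([grad_fn(x)] * delay, maxlen=delay)  # gradients in flight
    for _ in range(steps):
        g = pipeline.popleft()                 # oldest (stalest) gradient arrives
        lr = base_lr / (1 + delay) if adaptive else base_lr
        x -= lr * g
        pipeline.append(grad_fn(x))            # current gradient enters the pipeline
    return x
```

On f(x) = x² the adaptive run converges to the minimizer while the non-adaptive run with the same base step diverges, mirroring the instability that stale gradients cause in practice.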


Author(s): Yan Gao, Lingjun Zhou, Baifan Chen, Xiaobing Xing

The distributed stochastic gradient descent (DSGD) algorithm is one of the most popular algorithms for parallelizing matrix factorization. In parallel execution, however, the computing speed of each node can differ greatly because of load imbalance across the computing nodes. This article reduces the data skew among computing nodes during distributed execution to solve the resulting lock-waiting problem. The improved algorithm, named D-DSGD, reduces the running time of the algorithm and improves node utilization. Meanwhile, a dynamic step-size adjustment strategy is applied to improve the convergence rate. To guarantee non-negative matrix factorization, a non-negativity constraint is added to D-DSGD; the resulting algorithm is named D-NMF. Compared with existing methods, the proposed algorithms markedly reduce latency and improve the speed of convergence.
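The serial building block that DSGD-style methods parallelize is plain SGD matrix factorization, sketched below (illustrative only; the block partitioning, skew reduction, and dynamic step size of D-DSGD are not shown):

```python
import numpy as np

def sgd_mf(R, rank=2, lr=0.02, reg=0.001, epochs=500, rng=None):
    """Plain SGD matrix factorization R ~ P @ Q.T over fully observed entries.
    DSGD parallelizes this by partitioning R into blocks whose row/column
    sets are disjoint, so workers can update independently."""
    rng = np.random.default_rng(rng)
    n, m = R.shape
    P = 0.5 * rng.standard_normal((n, rank))
    Q = 0.5 * rng.standard_normal((m, rank))
    idx = [(i, j) for i in range(n) for j in range(m)]
    for _ in range(epochs):
        for t in rng.permutation(len(idx)):
            i, j = idx[t]
            e = R[i, j] - P[i] @ Q[j]                      # residual for this entry
            P[i], Q[j] = (P[i] + lr * (e * Q[j] - reg * P[i]),
                          Q[j] + lr * (e * P[i] - reg * Q[j]))
    return P, Q
```

The key structural fact exploited by DSGD is that an update to entry (i, j) touches only row i of P and row j of Q, so entries in non-overlapping blocks never contend for the same parameters.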


2020, Vol 29 (05), pp. 2050010
Author(s): Pola Lydia Lagari, Lefteri H. Tsoukalas, Isaac E. Lagaris

Stochastic Gradient Descent (SGD) is perhaps the most frequently used method for large-scale training. A common example is training a neural network over a large data set, which amounts to minimizing the corresponding mean squared error (MSE). Since the convergence of SGD is rather slow, acceleration techniques based on the notion of "mini-batches" have been developed. All of them, however, mimicking SGD, impose diminishing step-sizes to inhibit large variations in the MSE objective. In this article, we introduce random sets of mini-batches instead of individual mini-batches. We employ an objective function that minimizes the average MSE and its variance over these sets, thus eliminating the need for systematic step-size reduction. This approach permits the use of state-of-the-art optimization methods, far more efficient than gradient descent, and yields a significant performance enhancement.
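The set-based objective can be sketched for a linear model as the average per-batch MSE plus a variance penalty over the set of mini-batches (the weighting `lam` between the two terms is an assumption for illustration, not taken from the article):

```python
import numpy as np

def batch_set_objective(w, X, y, batches, lam=1.0):
    """Mean-plus-variance objective over a set of mini-batches:
    average per-batch MSE plus lam times the variance of the
    per-batch MSEs, for a linear model X @ w."""
    losses = np.array([np.mean((X[b] @ w - y[b]) ** 2) for b in batches])
    return losses.mean() + lam * losses.var()
```

Because this is an ordinary deterministic function of w once the set of batches is drawn, it can be handed to any standard optimizer without a diminishing-step-size schedule.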


2018, Vol 281, pp. 27-36
Author(s): Yuewei Ming, Yawei Zhao, Chengkun Wu, Kuan Li, Jianping Yin

1998, Vol 35 (02), pp. 395-406
Author(s): Jürgen Dippon

A stochastic gradient descent method is combined with a consistent auxiliary estimate to achieve global convergence of the recursion. Using step lengths that converge to zero more slowly than 1/n, together with averaging of the trajectories, yields the optimal convergence rate of 1/√n and the optimal variance of the asymptotic distribution. Possible applications can be found in maximum likelihood estimation, regression analysis, training of artificial neural networks, and stochastic optimization.


2018, Vol 10 (03), pp. 1850004
Author(s): Grant Sheen

Wireless recording and real-time classification of brain waves are essential steps towards future wearable devices to assist Alzheimer's patients in conveying their thoughts. This work is concerned with efficient computation of a dimension-reduced neural network (NN) model on Alzheimer's patient data recorded by a wireless headset. Because wireless recording uses far fewer sensors than the electrodes of a traditional wired cap, and an Alzheimer's patient has a shorter attention span than a healthy person, the data is much more restrictive than is typical in neural robotics and mind-controlled games. To overcome this challenge, an alternating minimization (AM) method is developed for network training. AM minimizes a nonsmooth and nonconvex objective function one variable at a time while fixing the rest. The sub-problem for each variable is piecewise convex with a finite number of minima. The overall iterative AM method is descending and free of the step size (learning rate) required by the standard gradient descent method. The proposed model, trained by the AM method, significantly outperforms the standard NN model trained by stochastic gradient descent in classifying four daily thoughts, reaching accuracies around 90% for Alzheimer's patients. Curved decision boundaries of the proposed model with multiple hidden neurons are derived analytically to establish the nonlinear nature of the classification.
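The alternating-minimization scheme, exactly minimizing over one variable at a time with no step size, can be sketched on a toy smooth convex objective (the paper's objective is nonsmooth and nonconvex, and its closed-form sub-problem solutions are specific to the network model):

```python
def alternating_minimization(argmin_x, argmin_y, x0, y0, iters=50):
    """Alternating minimization: exactly minimize over one variable while
    fixing the other. Each sweep is descending by construction, and no
    step size (learning rate) appears anywhere."""
    x, y = x0, y0
    for _ in range(iters):
        x = argmin_x(y)   # argmin over x of f(x, y), y fixed
        y = argmin_y(x)   # argmin over y of f(x, y), x fixed
    return x, y
```

For example, for f(x, y) = (x - 1)^2 + (y - 2)^2 + xy/2 the two sub-problems have the closed forms x = 1 - y/4 and y = 2 - x/4, and the iteration contracts to the unique minimizer (8/15, 28/15).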

