An analysis of stochastic variance reduced gradient for linear inverse problems

2021
Author(s): Bangti Jin, Zehui Zhou, Jun Zou

Abstract: Stochastic variance reduced gradient (SVRG) is a popular variance reduction technique for stochastic gradient descent (SGD). We provide a first analysis of the method for solving a class of linear inverse problems through the lens of classical regularization theory. We prove that, for a suitable constant step-size schedule, the method can achieve an optimal convergence rate in terms of the noise level (under a suitable regularity condition), and that the variance of the SVRG iterate error is smaller than that of SGD. These theoretical findings are corroborated by a set of numerical experiments.
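As a concrete illustration of the algorithm analyzed above (a minimal sketch of standard SVRG on a least-squares problem with one row per sample, not the paper's regularization-theoretic setting), the key idea is to correct each stochastic gradient with the same sample's gradient at a periodic snapshot plus the snapshot's full gradient:

```python
import numpy as np

def svrg_least_squares(A, b, step=0.02, epochs=100, rng=None):
    """SVRG for min_x (1/2n) ||A x - b||^2, treating each row of A as one sample."""
    rng = np.random.default_rng(rng)
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(epochs):
        snapshot = x.copy()
        full_grad = A.T @ (A @ snapshot - b) / n        # full gradient at the snapshot
        for _ in range(n):
            i = rng.integers(n)
            gi = A[i] * (A[i] @ x - b[i])               # stochastic gradient at the iterate
            gi_snap = A[i] * (A[i] @ snapshot - b[i])   # same sample at the snapshot
            x -= step * (gi - gi_snap + full_grad)      # variance-reduced update
    return x
```

Because the correction term vanishes as both the iterate and the snapshot approach the solution, the update's variance shrinks to zero, which is what permits a constant step size.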

Author(s): Andrew Jacobsen, Matthew Schlegel, Cameron Linke, Thomas Degris, Adam White, ...

This paper investigates different vector step-size adaptation approaches for non-stationary, online, continual prediction problems. Vanilla stochastic gradient descent can be considerably improved by scaling the update with a vector of appropriately chosen step-sizes. Many methods, including AdaGrad, RMSProp, and AMSGrad, keep statistics about the learning process to approximate a second-order update: a vector approximation of the inverse Hessian. Another family of approaches uses meta-gradient descent to adapt the step-size parameters to minimize prediction error. These meta-descent strategies are promising for non-stationary problems, but have not been as extensively explored as quasi-second-order methods. We first derive a general, incremental meta-descent algorithm, called AdaGain, designed to be applicable to a much broader range of algorithms, including those with semi-gradient updates or even accelerations, such as RMSProp. We provide an empirical comparison of methods from both families. We conclude that methods from both families can perform well, but that in non-stationary prediction problems the meta-descent methods exhibit advantages. Our method is particularly robust across several prediction problems, and is competitive with the state-of-the-art method on a large-scale time-series prediction problem on real data from a mobile robot.
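A minimal sketch of the quasi-second-order family mentioned above (plain RMSProp, which the abstract names; AdaGain itself is not reproduced here): a running average of squared gradients yields a per-coordinate vector of step-sizes, a diagonal proxy for inverse curvature:

```python
import numpy as np

def rmsprop_step(w, grad, v, lr=0.01, beta=0.9, eps=1e-8):
    """One RMSProp update: per-coordinate step-sizes derived from a
    running average of squared gradients."""
    v = beta * v + (1 - beta) * grad ** 2        # per-coordinate gradient statistics
    w = w - lr * grad / (np.sqrt(v) + eps)       # vector of adapted step-sizes
    return w, v
```

On an ill-conditioned quadratic f(w) = ½ Σ c_i w_i², both coordinates make progress at roughly the same rate despite a 100x curvature gap, which is the point of vector (rather than scalar) step-sizes.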


2020, Vol 34 (06), pp. 10126-10135
Author(s): Artyom Gadetsky, Kirill Struminsky, Christopher Robinson, Novi Quadrianto, Dmitry Vetrov

Learning models with discrete latent variables using stochastic gradient descent remains a challenge due to the high variance of gradient estimates. Modern variance reduction techniques mostly consider categorical distributions and have limited applicability when the number of possible outcomes becomes large. In this work, we consider models with latent permutations and propose control variates for the Plackett-Luce distribution. In particular, the control variates allow us to optimize black-box functions over permutations using stochastic gradient descent. To illustrate the approach, we consider a variety of causal structure learning tasks for continuous and discrete data. We show that our method outperforms competitive relaxation-based optimization methods and is also applicable to non-differentiable score functions.
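For background on how control variates tame score-function gradients (a generic categorical sketch, not the paper's Plackett-Luce construction; the batch-mean baseline used here is a simple control variate and is slightly biased, a leave-one-out baseline would fix that):

```python
import numpy as np

def score_function_grads(logits, f, n_samples, rng, baseline=False):
    """Score-function (REINFORCE) gradient estimates of E_{k~softmax(logits)}[f(k)]
    w.r.t. the logits, optionally centered by a mean baseline."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    k = rng.choice(len(p), size=n_samples, p=p)
    fx = f(k)
    b = fx.mean() if baseline else 0.0           # control variate (batch mean)
    onehot = np.eye(len(p))[k]
    # d log p_k / d logits = one_hot(k) - p
    return (fx - b)[:, None] * (onehot - p)
```

Subtracting the baseline leaves the estimator's expectation (essentially) unchanged while shrinking the magnitude of the multiplier, hence the variance.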


2020, Vol 34 (04), pp. 5503-5510
Author(s): Zhaolin Ren, Zhengyuan Zhou, Linhai Qiu, Ajay Deshpande, Jayant Kalagnanam

In large-scale optimization problems, distributed asynchronous stochastic gradient descent (DASGD) is a commonly used algorithm. In most applications, a large number of computing nodes asynchronously compute gradient information, so the gradient received at a given iteration is often stale. In the presence of such delays, which can be unbounded, the convergence of DASGD is uncertain. The contribution of this paper is twofold. First, we propose a delay-adaptive variant of DASGD in which each iteration's step-size is adjusted based on the size of the delay, and prove asymptotic convergence of the algorithm on variationally coherent stochastic problems, a class of functions that properly includes convex, quasi-convex and star-convex functions. Second, we extend the convergence results of standard DASGD, usually stated for problems with bounded domains, to problems with unbounded domains. In this way, we extend the frontier of theoretical guarantees for distributed asynchronous optimization and provide new insights for practitioners working on large-scale optimization problems.
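The effect of delay-adaptive step-sizes can be seen in a toy one-dimensional simulation (a fixed pipeline delay and an assumed 1/(1 + delay) scaling, purely illustrative, not the paper's schedule): with a stale gradient, the unscaled update overshoots and oscillates, while the shrunken step remains stable.

```python
from collections import deque

def delayed_sgd(grad_fn, x0, base_lr=0.5, delay=5, steps=400, adaptive=True):
    """Gradient descent where each applied gradient was computed `delay`
    iterations ago. With adaptive=True the step-size is shrunk to
    base_lr / (1 + delay), a simple delay-adaptive scaling."""
    x = float(x0)
    pipeline = deque([grad_fn(x)] * delay, maxlen=delay)  # gradients in flight
    for _ in range(steps):
        g = pipeline.popleft()                 # oldest (stalest) gradient arrives
        lr = base_lr / (1 + delay) if adaptive else base_lr
        x -= lr * g
        pipeline.append(grad_fn(x))            # current gradient enters the pipeline
    return x
```

On f(x) = x² the adaptive run converges to the minimizer while the non-adaptive run with the same base step diverges, mirroring the instability that stale gradients cause in practice.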


Author(s): Yan Gao, Lingjun Zhou, Baifan Chen, Xiaobing Xing

The distributed stochastic gradient descent (DSGD) algorithm is one of the most popular algorithms for parallelizing matrix factorization. In parallel execution, however, the computing speed of each node can differ greatly because of load imbalance across the computing nodes. This article reduces the data skew among computing nodes during distributed execution to solve the resulting lock-waiting problem. The improved algorithm, named D-DSGD, reduces the running time of the algorithm and improves node utilization. Meanwhile, a dynamic step-size adjustment strategy is applied to improve the convergence rate. To guarantee non-negative matrix factorization, a non-negativity constraint is added to D-DSGD; the resulting algorithm is named D-NMF. Compared with existing methods, the proposed algorithms markedly reduce latency and improve the speed of convergence.
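The serial building block that DSGD-style methods parallelize is plain SGD matrix factorization, sketched below (illustrative only; the block partitioning, skew reduction, and dynamic step size of D-DSGD are not shown):

```python
import numpy as np

def sgd_mf(R, rank=2, lr=0.02, reg=0.001, epochs=500, rng=None):
    """Plain SGD matrix factorization R ~ P @ Q.T over fully observed entries.
    DSGD parallelizes this by partitioning R into blocks whose row/column
    sets are disjoint, so workers can update independently."""
    rng = np.random.default_rng(rng)
    n, m = R.shape
    P = 0.5 * rng.standard_normal((n, rank))
    Q = 0.5 * rng.standard_normal((m, rank))
    idx = [(i, j) for i in range(n) for j in range(m)]
    for _ in range(epochs):
        for t in rng.permutation(len(idx)):
            i, j = idx[t]
            e = R[i, j] - P[i] @ Q[j]                      # residual for this entry
            P[i], Q[j] = (P[i] + lr * (e * Q[j] - reg * P[i]),
                          Q[j] + lr * (e * P[i] - reg * Q[j]))
    return P, Q
```

The key structural fact exploited by DSGD is that an update to entry (i, j) touches only row i of P and row j of Q, so entries in non-overlapping blocks never contend for the same parameters.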


2020, Vol 29 (05), pp. 2050010
Author(s): Pola Lydia Lagari, Lefteri H. Tsoukalas, Isaac E. Lagaris

Stochastic Gradient Descent (SGD) is perhaps the most frequently used method for large-scale training. A common example is training a neural network over a large data set, which amounts to minimizing the corresponding mean squared error (MSE). Since the convergence of SGD is rather slow, acceleration techniques based on the notion of "mini-batches" have been developed. All of them, however, mimicking SGD, impose diminishing step-sizes to inhibit large variations in the MSE objective. In this article, we introduce random sets of mini-batches instead of individual mini-batches. We employ an objective function that minimizes the average MSE and its variance over these sets, thus eliminating the need for systematic step-size reduction. This approach permits the use of state-of-the-art optimization methods, far more efficient than gradient descent, and yields a significant performance enhancement.
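The set-based objective can be sketched for a linear model as the average per-batch MSE plus a variance penalty over the set of mini-batches (the weighting `lam` between the two terms is an assumption for illustration, not taken from the article):

```python
import numpy as np

def batch_set_objective(w, X, y, batches, lam=1.0):
    """Mean-plus-variance objective over a set of mini-batches:
    average per-batch MSE plus lam times the variance of the
    per-batch MSEs, for a linear model X @ w."""
    losses = np.array([np.mean((X[b] @ w - y[b]) ** 2) for b in batches])
    return losses.mean() + lam * losses.var()
```

Because this is an ordinary deterministic function of w once the set of batches is drawn, it can be handed to any standard optimizer without a diminishing-step-size schedule.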


2018, Vol 281, pp. 27-36
Author(s): Yuewei Ming, Yawei Zhao, Chengkun Wu, Kuan Li, Jianping Yin

1998, Vol 35 (02), pp. 395-406
Author(s): Jürgen Dippon

A stochastic gradient descent method is combined with a consistent auxiliary estimate to achieve global convergence of the recursion. Using step lengths that converge to zero more slowly than 1/n, together with averaging of the trajectories, yields the optimal convergence rate of 1/√n and the optimal variance of the asymptotic distribution. Possible applications can be found in maximum likelihood estimation, regression analysis, training of artificial neural networks, and stochastic optimization.


2018, Vol 10 (03), pp. 1850004
Author(s): Grant Sheen

Wireless recording and real-time classification of brain waves are essential steps towards future wearable devices to assist Alzheimer's patients in conveying their thoughts. This work is concerned with efficient computation of a dimension-reduced neural network (NN) model on Alzheimer's patient data recorded by a wireless headset. Because wireless recording uses far fewer sensors than the electrodes of a traditional wired cap, and an Alzheimer's patient has a shorter attention span than a healthy person, the data is much more restrictive than is typical in neural robotics and mind-controlled games. To overcome this challenge, an alternating minimization (AM) method is developed for network training. AM minimizes a nonsmooth and nonconvex objective function one variable at a time while fixing the rest. The sub-problem for each variable is piecewise convex with a finite number of minima. The overall iterative AM method is descending and free of the step size (learning rate) required by the standard gradient descent method. The proposed model, trained by the AM method, significantly outperforms the standard NN model trained by stochastic gradient descent in classifying four daily thoughts, reaching accuracies around 90% for Alzheimer's patients. Curved decision boundaries of the proposed model with multiple hidden neurons are derived analytically to establish the nonlinear nature of the classification.
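The alternating-minimization scheme, exactly minimizing over one variable at a time with no step size, can be sketched on a toy smooth convex objective (the paper's objective is nonsmooth and nonconvex, and its closed-form sub-problem solutions are specific to the network model):

```python
def alternating_minimization(argmin_x, argmin_y, x0, y0, iters=50):
    """Alternating minimization: exactly minimize over one variable while
    fixing the other. Each sweep is descending by construction, and no
    step size (learning rate) appears anywhere."""
    x, y = x0, y0
    for _ in range(iters):
        x = argmin_x(y)   # argmin over x of f(x, y), y fixed
        y = argmin_y(x)   # argmin over y of f(x, y), x fixed
    return x, y
```

For example, for f(x, y) = (x - 1)^2 + (y - 2)^2 + xy/2 the two sub-problems have the closed forms x = 1 - y/4 and y = 2 - x/4, and the iteration contracts to the unique minimizer (8/15, 28/15).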

