Stochastic Gradient Langevin Dynamics with Variance Reduction

Author(s):  
Zhishen Huang ◽  
Stephen Becker
2020 ◽  
Vol 34 (06) ◽  
pp. 10126-10135
Author(s):  
Artyom Gadetsky ◽  
Kirill Struminsky ◽  
Christopher Robinson ◽  
Novi Quadrianto ◽  
Dmitry Vetrov

Learning models with discrete latent variables using stochastic gradient descent remains a challenge due to the high variance of gradient estimates. Modern variance reduction techniques mostly consider categorical distributions and have limited applicability when the number of possible outcomes becomes large. In this work, we consider models with latent permutations and propose control variates for the Plackett-Luce distribution. In particular, the control variates allow us to optimize black-box functions over permutations using stochastic gradient descent. To illustrate the approach, we consider a variety of causal structure learning tasks for continuous and discrete data. We show that our method outperforms competitive relaxation-based optimization methods and is also applicable to non-differentiable score functions.
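As a rough illustration of the score-function approach this abstract builds on, the sketch below optimizes a toy objective over permutations with a Plackett-Luce model, a REINFORCE-style gradient, and a simple moving-average baseline as the control variate. This is not the paper's estimator; the sampler, log-probability helper, and toy objective (sample_permutation, pl_log_prob, objective) are illustrative assumptions.

```python
# Minimal sketch: REINFORCE-style gradient for a Plackett-Luce model with a
# moving-average baseline as a (crude) control variate. Not the paper's method.
import torch

def sample_permutation(log_scores):
    # Gumbel trick: argsort of perturbed log-scores yields a Plackett-Luce sample.
    u = torch.rand_like(log_scores).clamp_min(1e-9)
    gumbel = -torch.log(-torch.log(u))
    return torch.argsort(log_scores + gumbel, descending=True)

def pl_log_prob(log_scores, perm):
    # log p(perm) = sum_i [ log s_{perm_i} - log sum_{j >= i} s_{perm_j} ]
    s = log_scores[perm]
    denom = torch.logcumsumexp(s.flip(0), dim=0).flip(0)
    return (s - denom).sum()

def objective(perm):
    # Toy black-box score: reward permutations close to the identity ordering.
    target = torch.arange(perm.numel(), dtype=torch.float32)
    return -torch.abs(perm.float() - target).sum()

log_scores = torch.zeros(6, requires_grad=True)
opt = torch.optim.Adam([log_scores], lr=0.1)
baseline = 0.0
for step in range(500):
    perm = sample_permutation(log_scores.detach())
    f = objective(perm).item()
    # Subtracting a baseline keeps the gradient unbiased (E[grad log p] = 0)
    # while lowering its variance.
    loss = -(f - baseline) * pl_log_prob(log_scores, perm)
    opt.zero_grad()
    loss.backward()
    opt.step()
    baseline = 0.9 * baseline + 0.1 * f

print("learned ordering:", torch.argsort(log_scores.detach(), descending=True).tolist())
```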


2018 ◽  
Vol 62 (1) ◽  
Author(s):  
Changyou Chen ◽  
Wenlin Wang ◽  
Yizhe Zhang ◽  
Qinliang Su ◽  
Lawrence Carin

Author(s):  
Derek Driggs ◽  
Matthias J. Ehrhardt ◽  
Carola-Bibiane Schönlieb

Abstract Variance reduction is a crucial tool for improving the slow convergence of stochastic gradient descent. Only a few variance-reduced methods, however, have yet been shown to directly benefit from Nesterov’s acceleration techniques to match the convergence rates of accelerated gradient methods. Such approaches rely on “negative momentum”, a technique for further variance reduction that is generally specific to the SVRG gradient estimator. In this work, we show for the first time that negative momentum is unnecessary for acceleration and develop a universal acceleration framework that allows all popular variance-reduced methods to achieve accelerated convergence rates. The constants appearing in these rates, including their dependence on the number of functions n, scale with the mean-squared-error and bias of the gradient estimator. In a series of numerical experiments, we demonstrate that versions of SAGA, SVRG, SARAH, and SARGE using our framework significantly outperform non-accelerated versions and compare favourably with algorithms using negative momentum.
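For context on the estimators the abstract refers to, here is a minimal SVRG loop on a least-squares problem. It shows only the standard variance-reduced gradient, not the accelerated framework proposed in the paper; the problem setup and all names are illustrative.

```python
# Minimal SVRG sketch for least squares: the estimator is unbiased and its
# variance shrinks as the iterate approaches the snapshot point.
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 10
A = rng.standard_normal((n, d))
b = A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

def grad_i(x, i):        # gradient of the i-th component f_i
    return A[i] * (A[i] @ x - b[i])

def full_grad(x):        # batch gradient (1/n) sum_i grad f_i
    return A.T @ (A @ x - b) / n

x = np.zeros(d)
lr, epochs, m = 0.01, 30, n
for _ in range(epochs):
    x_snap = x.copy()               # snapshot point
    g_snap = full_grad(x_snap)      # full gradient at the snapshot
    for _ in range(m):
        i = rng.integers(n)
        # SVRG estimator: grad f_i(x) - grad f_i(x_snap) + full gradient.
        v = grad_i(x, i) - grad_i(x_snap, i) + g_snap
        x -= lr * v

print("final objective:", 0.5 * np.mean((A @ x - b) ** 2))
```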


2018 ◽  
Vol 281 ◽  
pp. 27-36 ◽  
Author(s):  
Yuewei Ming ◽  
Yawei Zhao ◽  
Chengkun Wu ◽  
Kuan Li ◽  
Jianping Yin

2019 ◽  
Vol 108 (8-9) ◽  
pp. 1701-1727
Author(s):  
Zhize Li ◽  
Tianyi Zhang ◽  
Shuyu Cheng ◽  
Jun Zhu ◽  
Jian Li

2021 ◽  
Vol 0 (0) ◽  
pp. 0
Author(s):  
Ruilin Li ◽  
Xin Wang ◽  
Hongyuan Zha ◽  
Molei Tao

Many Markov Chain Monte Carlo (MCMC) methods leverage gradient information of the potential function of the target distribution to explore the sample space efficiently. However, computing gradients can be expensive for large-scale applications, such as those in contemporary machine learning. Stochastic Gradient (SG-)MCMC methods approximate gradients by stochastic ones, commonly via uniformly subsampled data points, and achieve improved computational efficiency, though at the price of introducing sampling error. We propose a non-uniform subsampling scheme to improve the sampling accuracy. The proposed exponentially weighted stochastic gradient (EWSG) is designed so that a non-uniform-SG-MCMC method mimics the statistical behavior of a batch-gradient-MCMC method, and hence the inaccuracy due to the SG approximation is reduced. EWSG differs from classical variance reduction (VR) techniques as it focuses on the entire distribution instead of just the variance; nevertheless, its reduced local variance is also proved. EWSG can also be viewed as an extension of the importance sampling idea, successful for stochastic-gradient-based optimization, to sampling tasks. In our practical implementation of EWSG, the non-uniform subsampling is performed efficiently via a Metropolis-Hastings chain on the data index, which is coupled to the MCMC algorithm. Numerical experiments are provided, not only to demonstrate EWSG's effectiveness, but also to guide hyperparameter choices and to validate our non-asymptotic global error bound despite approximations in the implementation. Notably, while statistical accuracy is improved, convergence speed can be comparable to the uniform version, which renders EWSG a practical alternative to VR (though EWSG and VR can also be combined).
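For reference, a minimal SGLD loop with uniform subsampling is sketched below; EWSG, as described in the abstract, would replace the uniform index draw with a Metropolis-Hastings chain over data indices, which is not implemented here. The Gaussian-mean model and all names are illustrative assumptions.

```python
# Minimal SGLD sketch with uniform subsampling (the baseline that EWSG improves
# on by choosing indices non-uniformly). Toy model: posterior over the mean of a
# Gaussian with known unit variance and a Gaussian prior.
import numpy as np

rng = np.random.default_rng(0)
n = 1000
data = rng.normal(2.0, 1.0, size=n)          # observations
prior_var = 10.0

def grad_log_post_i(theta, i):
    # Per-datum stochastic gradient of the log-posterior:
    # n * grad log p(x_i | theta) + grad log p(theta)
    return n * (data[i] - theta) - theta / prior_var

theta, step, batch = 0.0, 1e-4, 10
samples = []
for t in range(5000):
    idx = rng.integers(n, size=batch)        # uniform subsampling of data points
    g = np.mean([grad_log_post_i(theta, i) for i in idx])
    # Langevin update: half-step along the stochastic gradient plus matched noise.
    theta += 0.5 * step * g + np.sqrt(step) * rng.standard_normal()
    samples.append(theta)

print("posterior mean estimate:", np.mean(samples[1000:]))
```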

