Width of Minima Reached by Stochastic Gradient Descent is Influenced by Learning Rate to Batch Size Ratio

Hyperparameter-free optimizer of stochastic gradient descent that incorporates unit correction and moment estimation

10.1101/348557 ◽

2018 ◽

Author(s):

Kazunori D Yamada

Keyword(s):

Deep Learning ◽

Gradient Descent ◽

Mathematical Optimization ◽

Descent Method ◽

Learning Rate ◽

Stochastic Gradient ◽

Stochastic Gradient Descent ◽

Gradient Descent Method ◽

Moment Estimation ◽

Estimation System

In the deep learning era, stochastic gradient descent is the most common method for optimizing neural network parameters. Among the various mathematical optimization methods, gradient descent is the most naive. Quick convergence requires adjusting the learning rate, which is normally done manually with gradient descent. Many optimizers have been developed to control the learning rate and increase convergence speed. Generally, these optimizers adjust the learning rate automatically in response to learning status, and they have been gradually improved by incorporating the effective aspects of earlier methods. In this study, we developed a new optimizer: YamAdam. Our optimizer is based on Adam, which utilizes the first and second moments of previous gradients. In addition to the moment estimation system, we incorporated an advantageous part of AdaDelta, namely a unit correction system, into YamAdam. In benchmark tests on some common datasets, our optimizer converged as fast as or faster than the existing methods. YamAdam is an alternative optimizer option for deep learning.
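The abstract above builds on Adam's first- and second-moment estimates. As a rough illustration only, the sketch below implements the standard Adam update with NumPy; it is not YamAdam and does not include the AdaDelta-style unit correction, whose details are not given in this listing. The function names, the quadratic test problem, and the hyperparameter values are assumptions chosen for the example.

```python
import numpy as np


def adam_step(theta, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One step of standard Adam (not YamAdam). t is the 1-based step count."""
    # Exponential moving averages of the gradient and the squared gradient.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    # Bias correction compensates for the zero initialization of m and v.
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    # Parameter update scaled by the root of the second-moment estimate.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v


def minimize_quadratic(steps=2000, lr=0.01):
    """Minimize f(theta) = ||theta||^2 (a hypothetical toy problem) with Adam."""
    theta = np.array([3.0, -2.0])
    m = np.zeros_like(theta)
    v = np.zeros_like(theta)
    for t in range(1, steps + 1):
        grad = 2.0 * theta  # exact gradient of ||theta||^2
        theta, m, v = adam_step(theta, grad, m, v, t, lr=lr)
    return theta
```

On the first step the bias-corrected ratio is approximately the sign of the gradient, so each coordinate moves by about lr regardless of gradient magnitude. This per-coordinate rescaling is the behavior that moment-based optimizers of this family share.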


An effective learning rate scheduler for stochastic gradient descent-based deep learning model in healthcare diagnosis system

International Journal of Electronic Healthcare ◽

10.1504/ijeh.2022.119587 ◽

2022 ◽

Vol 12 (1) ◽

pp. 1

Author(s):

K. Sathyabama ◽

K. Saruladha

Keyword(s):

Deep Learning ◽

Gradient Descent ◽

Learning Model ◽

Learning Rate ◽

Stochastic Gradient ◽

Stochastic Gradient Descent ◽

Diagnosis System ◽

Effective Learning ◽

Deep Learning Model


Mutual Information Based Learning Rate Decay for Stochastic Gradient Descent Training of Deep Neural Networks

Entropy ◽

10.3390/e22050560 ◽

2020 ◽

Vol 22 (5) ◽

pp. 560

Author(s):

Shrihari Vasudevan

Keyword(s):

Neural Networks ◽

Mutual Information ◽

Gradient Descent ◽

Deep Neural Networks ◽

Learning Rate ◽

Stochastic Gradient ◽

Stochastic Gradient Descent ◽

Novel Approach ◽

The Neural Network ◽

Gradient Based

This paper demonstrates a novel approach to training deep neural networks using a Stochastic Gradient Descent (SGD) algorithm with a decaying Learning Rate (LR) driven by Mutual Information (MI). MI between the output of the neural network and the true outcomes is used to adaptively set the LR for the network in every epoch of the training cycle. This idea is extended to layer-wise setting of the LR, as MI naturally provides a layer-wise performance metric. An LR range test for determining the operating LR range is also proposed. Experiments compared this approach with popular gradient-based adaptive LR algorithms such as Adam, RMSprop, and LARS. The approach achieved comparable or better accuracy in comparable or better time, demonstrating the feasibility of the metric and the approach.


Hierarchical attributes learning for pedestrian re-identification via parallel stochastic gradient descent combined with momentum correction and adaptive learning rate

Neural Computing and Applications ◽

10.1007/s00521-019-04485-2 ◽

2019 ◽

Vol 32 (10) ◽

pp. 5695-5712 ◽

Cited By ~ 1

Author(s):

Keyang Cheng ◽

Fei Tao ◽

Yongzhao Zhan ◽

Maozhen Li ◽

Kenli Li

Keyword(s):

Adaptive Learning ◽

Gradient Descent ◽

Learning Rate ◽

Stochastic Gradient ◽

Stochastic Gradient Descent ◽

Adaptive Learning Rate ◽

Parallel Stochastic Gradient Descent


Cogra: Concept-Drift-Aware Stochastic Gradient Descent for Time-Series Forecasting

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33014594 ◽

2019 ◽

Vol 33 ◽

pp. 4594-4601 ◽

Cited By ~ 1

Author(s):

Kohei Miyaguchi ◽

Hiroshi Kajino

Keyword(s):

Time Series ◽

Gradient Descent ◽

Concept Drift ◽

Time Series Forecasting ◽

Learning Rate ◽

Stochastic Gradient ◽

Stochastic Gradient Descent ◽

Automatic Learning ◽

The Mean ◽

Real World Datasets

We approach the time-series forecasting problem in the presence of concept drift by automatic learning rate tuning of stochastic gradient descent (SGD). The SGD-based approach is preferable to other concept drift algorithms in that it can be applied to any model and can keep learning efficiently while predicting online. Among SGD algorithms, the variance-based SGD (vSGD) can successfully handle concept drift by automatic learning rate tuning, which reduces to an adaptive mean estimation problem. However, its performance is still limited by its heuristic mean estimator. In this paper, we present a concept-drift-aware stochastic gradient descent (Cogra), equipped with a more theoretically sound mean estimator called the sequential mean tracker (SMT). Our key contribution is that we define a goodness criterion for mean estimators; SMT is designed to be optimal according to this criterion. In comprehensive experiments, we find that (i) SMT estimates the mean better than vSGD's estimator in the presence of concept drift, and (ii) in terms of predictive performance, Cogra reduces the predictive loss by 16–67% on real-world datasets, indicating that SMT improves the prediction accuracy significantly.


Accelerating Stochastic Gradient Descent using Adaptive Mini-Batch Size

2019 2nd International Conference on new Trends in Computing Sciences (ICTCS) ◽

10.1109/ictcs.2019.8923046 ◽

2019 ◽

Author(s):

Muayyad Saleh Alsadi ◽

Rawan Ghnemat ◽

Arafat Awajan

Keyword(s):

Gradient Descent ◽

Stochastic Gradient ◽

Stochastic Gradient Descent ◽

Batch Size


An Effective Learning Rate Scheduler for Stochastic Gradient Descent Based Deep Learning Model in Healthcare Diagnosis System

International Journal of Electronic Healthcare ◽

10.1504/ijeh.2022.10041876 ◽

2022 ◽

Vol 12 (1) ◽

pp. 1

Author(s):

Sathyabama K ◽

K. Saruladha

Keyword(s):

Deep Learning ◽

Gradient Descent ◽

Learning Model ◽

Learning Rate ◽

Stochastic Gradient ◽

Stochastic Gradient Descent ◽

Diagnosis System ◽

Effective Learning ◽

Deep Learning Model


Stochastic Gradient Descent with Polyak’s Learning Rate

Journal of Scientific Computing ◽

10.1007/s10915-021-01628-3 ◽

2021 ◽

Vol 89 (1) ◽

Author(s):

Mariana Prazeres ◽

Adam M. Oberman

Keyword(s):

Gradient Descent ◽

Learning Rate ◽

Stochastic Gradient ◽

Stochastic Gradient Descent


Learning Rate Adaptation in Stochastic Gradient Descent

Nonconvex Optimization and Its Applications - Advances in Convex Analysis and Global Optimization ◽

10.1007/978-1-4613-0279-7_27 ◽

2001 ◽

pp. 433-444 ◽

Cited By ~ 20

Author(s):

V. P. Plagianakos ◽

G. D. Magoulas ◽

M. N. Vrahatis

Keyword(s):

Gradient Descent ◽

Rate Adaptation ◽

Learning Rate ◽

Stochastic Gradient ◽

Stochastic Gradient Descent


Analysis of stochastic gradient descent in continuous time

Statistics and Computing ◽

10.1007/s11222-021-10016-8 ◽

2021 ◽

Vol 31 (4) ◽

Author(s):

Jonas Latz

Keyword(s):

Dynamical System ◽

Continuous Time ◽

Gradient Descent ◽

Target Function ◽

Gradient Flow ◽

Learning Rate ◽

Stochastic Gradient ◽

Stochastic Gradient Descent ◽

Finite State ◽

Finite State Space

Stochastic gradient descent is an optimisation method that combines classical gradient descent with random subsampling within the target functional. In this work, we introduce the stochastic gradient process as a continuous-time representation of stochastic gradient descent. The stochastic gradient process is a dynamical system that is coupled with a continuous-time Markov process living on a finite state space. The dynamical system, a gradient flow, represents the gradient descent part; the process on the finite state space represents the random subsampling. Processes of this type are, for instance, used to model clonal populations in fluctuating environments. After introducing it, we study theoretical properties of the stochastic gradient process. We show that it converges weakly to the gradient flow with respect to the full target function as the learning rate approaches zero. We give conditions under which the stochastic gradient process with constant learning rate is exponentially ergodic in the Wasserstein sense. We then study the case where the learning rate goes to zero sufficiently slowly and the single target functions are strongly convex. In this case, the process converges weakly to the point mass concentrated in the global minimum of the full target function, indicating consistency of the method. We conclude with a discussion of discretisation strategies for the stochastic gradient process and numerical experiments.
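To make the coupling of a gradient flow with a randomly switching index concrete, here is a minimal simulation sketch in Python. It rests on assumptions that are not taken from the abstract: the single target functions are the hypothetical quadratics 0.5 * (theta - a_i)^2, the index is redrawn uniformly after exponentially distributed waiting times, and the mean waiting time eta plays the role of the learning rate. For these quadratics the gradient flow between switches has a closed-form solution, so no time discretisation is needed.

```python
import math
import random


def simulate_sgp(targets, theta0, eta, t_end, seed=0):
    """Simulate a switching gradient flow for Phi_i(theta) = 0.5*(theta - a_i)^2.

    targets : list of the minimizers a_i of the single target functions
    eta     : mean waiting time between index switches (assumed to act as
              the learning rate in this sketch)
    """
    rng = random.Random(seed)
    theta, t = theta0, 0.0
    while t < t_end:
        a = rng.choice(targets)  # random subsampling: pick one target function
        # Exponential waiting time with mean eta, truncated at the time horizon.
        tau = min(rng.expovariate(1.0 / eta), t_end - t)
        # Exact gradient flow d(theta)/dt = -(theta - a) over a duration tau.
        theta = a + (theta - a) * math.exp(-tau)
        t += tau
    return theta
```

With a small eta, for example simulate_sgp([-1.0, 0.0, 4.0], 5.0, 0.001, 10.0), the trajectory should end close to the minimizer of the averaged target, which here is the mean of the a_i. This is consistent with the small-learning-rate limit described in the abstract. With a large eta the state instead lingers near the minimizer of whichever single target is currently active.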
