Mutual Information Based Learning Rate Decay for Stochastic Gradient Descent Training of Deep Neural Networks

Entropy ◽  
2020 ◽  
Vol 22 (5) ◽  
pp. 560
Author(s):  
Shrihari Vasudevan

This paper demonstrates a novel approach to training deep neural networks using a Mutual Information (MI)-driven, decaying Learning Rate (LR), Stochastic Gradient Descent (SGD) algorithm. MI between the output of the neural network and the true outcomes is used to adaptively set the LR for the network in every epoch of the training cycle. This idea is extended to layer-wise setting of the LR, as MI naturally provides a layer-wise performance metric. An LR range test that determines the operating LR range is also proposed. Experiments compared this approach with popular alternatives such as gradient-based adaptive LR algorithms like Adam, RMSprop, and LARS. Accuracy outcomes that were competitive with or better than those alternatives, obtained in comparable or shorter training time, demonstrate the feasibility of the metric and the approach.
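As a rough illustration of the idea (not the paper's exact estimator or LR mapping), MI between predicted and true labels can be estimated from their joint distribution and then mapped onto the operating LR range found by the range test. The function names, the discrete MI estimator and the linear mapping below are assumptions:

    import numpy as np

    def mutual_information(pred_labels, true_labels, n_classes):
        # Discrete MI estimate (in nats) between predicted and true class labels.
        joint = np.zeros((n_classes, n_classes))
        for p, t in zip(pred_labels, true_labels):
            joint[p, t] += 1
        joint /= joint.sum()
        px = joint.sum(axis=1, keepdims=True)
        py = joint.sum(axis=0, keepdims=True)
        nz = joint > 0
        return float((joint[nz] * np.log(joint[nz] / (px @ py)[nz])).sum())

    def mi_decayed_lr(mi, lr_min, lr_max, mi_max):
        # Map the current MI onto the operating LR range: higher MI -> smaller LR.
        frac = min(mi / mi_max, 1.0)
        return lr_max - frac * (lr_max - lr_min)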

2021 ◽  
Author(s):  
Tianyi Liu ◽  
Zhehui Chen ◽  
Enlu Zhou ◽  
Tuo Zhao

The momentum stochastic gradient descent (MSGD) algorithm has been widely applied to many nonconvex optimization problems in machine learning (e.g., training deep neural networks, variational Bayesian inference, etc.). Despite its empirical success, there is still a lack of theoretical understanding of the convergence properties of MSGD. To fill this gap, we propose to analyze the algorithmic behavior of MSGD by diffusion approximations for nonconvex optimization problems with strict saddle points and isolated local optima. Our study shows that momentum helps escape from saddle points but hurts convergence within the neighborhood of optima (unless step size annealing or momentum annealing is applied). Our theoretical discovery partially corroborates the empirical success of MSGD in training deep neural networks.
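For reference, the update analyzed is the standard heavy-ball momentum SGD step; the sketch below is a generic implementation, with the learning rate and momentum coefficient chosen arbitrarily:

    import numpy as np

    def msgd_step(theta, velocity, grad, lr=0.01, momentum=0.9):
        # One heavy-ball momentum SGD update: v <- mu*v - eta*grad, theta <- theta + v.
        velocity = momentum * velocity - lr * grad
        theta = theta + velocity
        return theta, velocity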


Entropy ◽  
2020 ◽  
Vol 22 (1) ◽  
pp. 101
Author(s):  
Rita Fioresi ◽  
Pratik Chaudhari ◽  
Stefano Soatto

This paper is a step towards developing a geometric understanding of a popular algorithm for training deep neural networks, namely stochastic gradient descent (SGD). We build upon a recent result which observed that the noise in SGD while training typical networks is highly non-isotropic. That motivated a deterministic model in which the trajectories of our dynamical systems are described via geodesics of a family of metrics arising from a certain diffusion matrix, namely the covariance of the stochastic gradients in SGD. Our model is analogous to models in general relativity: the role of the electromagnetic field in the latter is played by the gradient of the loss function of a deep network in the former.
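The diffusion matrix in question can be estimated directly as the sample covariance of per-example gradients; a minimal PyTorch sketch, practical only for small models and with illustrative names, might look as follows:

    import torch

    def gradient_covariance(model, loss_fn, data, targets):
        # Sample covariance of per-example gradients (an estimate of the diffusion matrix).
        grads = []
        for x, y in zip(data, targets):
            model.zero_grad()
            loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)).backward()
            g = torch.cat([p.grad.flatten() for p in model.parameters() if p.grad is not None])
            grads.append(g)
        G = torch.stack(grads)               # (N, d) matrix of per-example gradients
        G = G - G.mean(dim=0, keepdim=True)  # center
        return G.T @ G / (G.shape[0] - 1)    # (d, d) covariance matrix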


2018 ◽  
Vol 4 (1) ◽  
pp. 3
Author(s):  
Rene Bidart ◽  
Alexander Wong

In this study, we explore the training of monolithic deep neural networks in an effective manner. One of the biggest challenges with training such networks to the desired level of accuracy is the difficulty in converging to a good solution using iterative optimization methods such as stochastic gradient descent, due to the enormous number of parameters that need to be learned. To achieve this, we introduce a partitioned training strategy, where proxy layers are connected to different partitions of a deep neural network to enable isolated training of a much smaller number of parameters to convergence. To illustrate the efficacy of this training strategy, we introduce MonolithNet, a massive residual deep neural network consisting of 437 million parameters. The trained MonolithNet was able to achieve a top-1 accuracy of 97% on the CIFAR10 image classification dataset, which demonstrates the feasibility of the proposed training strategy for training monolithic deep neural networks to high accuracies.
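A minimal sketch of the partitioned training idea, assuming a small proxy classifier head attached to a partition while earlier partitions stay frozen; the class names, pooling head and optimizer settings are assumptions, not MonolithNet details:

    import torch
    import torch.nn as nn

    class ProxyHead(nn.Module):
        # Small classifier attached to a partition so that partition can be trained in isolation.
        def __init__(self, in_channels, n_classes=10):
            super().__init__()
            self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                      nn.Linear(in_channels, n_classes))

        def forward(self, x):
            return self.head(x)

    def train_partition(partition, proxy, frozen_prefix, loader, epochs=1, lr=0.1):
        # Train one partition (plus its proxy head) while already-trained partitions stay frozen.
        opt = torch.optim.SGD(list(partition.parameters()) + list(proxy.parameters()), lr=lr)
        loss_fn = nn.CrossEntropyLoss()
        for _ in range(epochs):
            for x, y in loader:
                with torch.no_grad():
                    x = frozen_prefix(x)      # features from the frozen earlier partitions
                loss = loss_fn(proxy(partition(x)), y)
                opt.zero_grad()
                loss.backward()
                opt.step()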


Author(s):  
A.P. Karpenko ◽  
V.A. Ovchinnikov

The study aims to develop an algorithm, and then software, to synthesise noise that could be used to attack deep learning neural networks designed to classify images. We present the results of our analysis of methods for conducting this type of attack. The synthesis of attack noise is stated as a problem of multidimensional constrained optimization. The main features of the proposed attack noise synthesis algorithm are as follows: we employ the clip function to take constraints on noise into account; we use the top-1 and top-5 classification error ratings as attack noise efficiency criteria; we train our neural networks using backpropagation and the Adam gradient descent algorithm; stochastic gradient descent is employed to solve the optimisation problem indicated above; neural network training also makes use of the augmentation technique. The software was developed in Python using the PyTorch framework to dynamically differentiate the calculation graph and runs under Ubuntu 18.04 and CentOS 7. Our IDE was Visual Studio Code. We accelerated the computation via CUDA executed on an NVIDIA Titan XP GPU. The paper presents the results of a broad computational experiment in synthesising non-universal and universal attack noise types for eight deep neural networks. We show that the proposed attack algorithm is able to increase the neural network error by a factor of eight.
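A hedged sketch of this kind of noise synthesis loop, using SGD on the noise and the clip function to enforce the constraint; the norm bound, step count, learning rate and negative cross-entropy objective are assumptions, and images are assumed to lie in [0, 1]:

    import torch

    def synthesize_attack_noise(model, images, labels, eps=8 / 255, steps=40, lr=1e-2):
        # Optimize additive noise that raises the classification error, subject to a clip constraint.
        noise = torch.zeros_like(images, requires_grad=True)
        opt = torch.optim.SGD([noise], lr=lr)
        loss_fn = torch.nn.CrossEntropyLoss()
        model.eval()
        for _ in range(steps):
            logits = model(torch.clamp(images + noise, 0.0, 1.0))
            loss = -loss_fn(logits, labels)   # maximizing the error = minimizing the negative loss
            opt.zero_grad()
            loss.backward()
            opt.step()
            with torch.no_grad():
                noise.clamp_(-eps, eps)       # clip function keeps the noise within its constraint
        return noise.detach()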


Electronics ◽  
2021 ◽  
Vol 10 (22) ◽  
pp. 2761
Author(s):  
Vaios Ampelakiotis ◽  
Isidoros Perikos ◽  
Ioannis Hatzilygeroudis ◽  
George Tsihrintzis

In this paper, we present a handwritten character recognition (HCR) system that aims to recognize first-order logic handwritten formulas and create editable text files of the recognized formulas. Dense feedforward neural networks (NNs) are utilized, and their performance is examined under various training conditions and methods. More specifically, after three training algorithms (backpropagation, resilient propagation and stochastic gradient descent) had been tested, we created and trained an NN with the stochastic gradient descent algorithm, optimized by the Adam update rule, which proved to be the best, using a training set of 16,750 handwritten image samples of 28 × 28 pixels each and a test set of 7947 samples. The final accuracy achieved is 90.13%. The general methodology followed consists of two stages: the image processing and the NN design and training. Finally, an application has been created that implements the methodology and automatically recognizes handwritten logic formulas. An interesting feature of the application is that it allows for creating new, user-oriented training sets and parameter settings, and thus new NN models.
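A minimal sketch of a dense feedforward classifier of this kind, trained with the Adam update rule on 28 × 28 inputs; the layer sizes, learning rate and number of symbol classes are assumptions, not the values used in the paper:

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Flatten(),
                          nn.Linear(28 * 28, 256), nn.ReLU(),
                          nn.Linear(256, 128), nn.ReLU(),
                          nn.Linear(128, 36))   # one output per symbol class (class count assumed)
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    loss_fn = nn.CrossEntropyLoss()

    def train_epoch(loader):
        # One pass over the training set of 28x28 character images.
        for images, labels in loader:
            loss = loss_fn(model(images), labels)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()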


2018 ◽  
Author(s):  
Kazunori D Yamada

ABSTRACT: In the deep learning era, stochastic gradient descent is the most common method used for optimizing neural network parameters. Among the various mathematical optimization methods, the gradient descent method is the most naive. Adjustment of the learning rate is necessary for quick convergence, which is normally done manually with gradient descent. Many optimizers have been developed to control the learning rate and increase convergence speed. Generally, these optimizers adjust the learning rate automatically in response to learning status. These optimizers were gradually improved by incorporating the effective aspects of earlier methods. In this study, we developed a new optimizer: YamAdam. Our optimizer is based on Adam, which utilizes the first and second moments of previous gradients. In addition to the moment estimation system, we incorporated an advantageous part of AdaDelta, namely a unit correction system, into YamAdam. According to benchmark tests on some common datasets, our optimizer showed similar or faster convergence compared to existing methods. YamAdam is an option as an alternative optimizer for deep learning.
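For intuition only, the sketch below combines Adam-style moment estimates with an AdaDelta-style unit correction in place of a fixed learning rate; it is not the published YamAdam update rule, and all coefficients are assumptions:

    import numpy as np

    def yamadam_like_step(theta, grad, state, beta1=0.9, beta2=0.999, eps=1e-8):
        # Illustrative hybrid only, NOT the published YamAdam formula.
        m = state.get("m", np.zeros_like(theta))   # first moment (Adam-style)
        v = state.get("v", np.zeros_like(theta))   # second moment (Adam-style)
        u = state.get("u", np.zeros_like(theta))   # running RMS of past updates (AdaDelta-style)
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        update = np.sqrt(u + eps) / np.sqrt(v + eps) * m   # unit correction replaces a fixed LR
        u = beta2 * u + (1 - beta2) * update ** 2
        state.update(m=m, v=v, u=u)
        return theta - update, state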


2021 ◽  
Author(s):  
Ruthvik Vaila

Spiking neural networks are biologically plausible counterparts of artificial neural networks. Artificial neural networks are usually trained with stochastic gradient descent (SGD), and spiking neural networks are trained with bio-inspired spike timing dependent plasticity (STDP). Spiking networks could potentially help in reducing power usage owing to their binary activations. In this work, we use unsupervised STDP in the feature extraction layers of a neural network with instantaneous neurons to extract meaningful features. The extracted binary feature vectors are then classified using classification layers containing neurons with binary activations. Gradient descent (backpropagation) is used only on the output layer to perform training for classification. Surrogate gradients are proposed to perform backpropagation through the binary activations. The accuracies obtained for MNIST and the balanced EMNIST data set compare favorably with other approaches. The effect of the stochastic gradient descent (SGD) approximations on the learning capabilities of our network is also explored. We also studied catastrophic forgetting and its effect on spiking neural networks (SNNs). For the experiments regarding catastrophic forgetting, in the classification sections of the network we use a modified synaptic intelligence measure, which we refer to as the cost-per-synapse metric, as a regularizer to immunize the network against catastrophic forgetting in a Single-Incremental-Task (SIT) scenario. In the catastrophic forgetting experiments, we use the MNIST and EMNIST handwritten digits datasets, divided into five and ten incremental subtasks respectively. We also examine the behavior of the spiking neural network and empirically study the effect of various hyperparameters on its learning capabilities using the software tool SPYKEFLOW that we developed. We employ the MNIST, EMNIST and NMNIST data sets to produce our results.
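A minimal sketch of a binary activation with a surrogate gradient, the mechanism that lets backpropagation pass through binary classification layers; the particular surrogate shape below is an assumption:

    import torch

    class BinaryAct(torch.autograd.Function):
        # Binary (spike-like) activation whose backward pass uses a smooth surrogate gradient.
        @staticmethod
        def forward(ctx, x):
            ctx.save_for_backward(x)
            return (x > 0).float()                        # hard binary activation going forward

        @staticmethod
        def backward(ctx, grad_out):
            (x,) = ctx.saved_tensors
            surrogate = 1.0 / (1.0 + torch.abs(x)) ** 2   # smooth stand-in for the step's derivative
            return grad_out * surrogate

    binary_act = BinaryAct.apply   # usable inside the binary classification layers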

