Calibrated Stochastic Gradient Descent for Convolutional Neural Networks

Author(s):  
Li’an Zhuo ◽  
Baochang Zhang ◽  
Chen Chen ◽  
Qixiang Ye ◽  
Jianzhuang Liu ◽  
...  

In stochastic gradient descent (SGD) and its variants, the optimized gradient estimators may be as expensive to compute as the true gradient in many scenarios. This paper introduces a calibrated stochastic gradient descent (CSGD) algorithm for deep neural network optimization. A theorem is developed to prove that an unbiased estimator for the network variables can be obtained in a probabilistic way based on the Lipschitz hypothesis. Our work differs significantly from existing gradient optimization methods in that it provides a theoretical framework for unbiased variable estimation in the deep learning paradigm to optimize the model parameter calculation. In particular, we develop a generic gradient calibration layer which can be easily used to build convolutional neural networks (CNNs). Experimental results demonstrate that CNNs with our CSGD optimization scheme can improve on state-of-the-art performance for natural image classification, digit recognition, ImageNet object classification, and object detection tasks. This work opens new research directions for developing more efficient SGD updates and analyzing the backpropagation algorithm.

Electronics ◽  
2021 ◽  
Vol 10 (22) ◽  
pp. 2761
Author(s):  
Vaios Ampelakiotis ◽  
Isidoros Perikos ◽  
Ioannis Hatzilygeroudis ◽  
George Tsihrintzis

In this paper, we present a handwritten character recognition (HCR) system that aims to recognize handwritten first-order logic formulas and create editable text files of the recognized formulas. Dense feedforward neural networks (NNs) are utilized, and their performance is examined under various training conditions and methods. More specifically, after testing three training algorithms (backpropagation, resilient propagation and stochastic gradient descent), we created and trained an NN with the stochastic gradient descent algorithm, optimized by the Adam update rule, which proved to be the best, using a training set of 16,750 handwritten image samples of 28 × 28 each and a test set of 7947 samples. The final accuracy achieved is 90.13%. The general methodology consists of two stages: image processing, and NN design and training. Finally, an application has been created that implements the methodology and automatically recognizes handwritten logic formulas. An interesting feature of the application is that it allows for creating new, user-oriented training sets and parameter settings, and thus new NN models.
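As a rough illustration of what "stochastic gradient descent optimized by the Adam update rule" means, the sketch below applies the standard Adam step to a one-parameter toy problem. It is not the authors' network or training code: the function name `adam_minimize`, the noisy quadratic objective, and all hyperparameter values are assumptions chosen only to keep the example small.

```python
import math
import random


def adam_minimize(grad_fn, w0, steps=2000, lr=0.05,
                  beta1=0.9, beta2=0.999, eps=1e-8):
    """Minimize a scalar objective with the standard Adam update rule."""
    w, m, v = w0, 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(w)                           # stochastic gradient estimate
        m = beta1 * m + (1.0 - beta1) * g        # first-moment (mean) estimate
        v = beta2 * v + (1.0 - beta2) * g * g    # second-moment estimate
        m_hat = m / (1.0 - beta1 ** t)           # bias correction
        v_hat = v / (1.0 - beta2 ** t)
        w -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return w


# Hypothetical objective f(w) = (w - 3)^2, observed through a noisy gradient.
rng = random.Random(0)


def noisy_grad(w):
    return 2.0 * (w - 3.0) + rng.gauss(0.0, 0.1)


w_final = adam_minimize(noisy_grad, w0=0.0)
print(w_final)  # should end up close to the minimizer w = 3
```

In a real NN the same update is applied element-wise to every weight, with the gradient coming from backpropagation on a mini-batch rather than from a hand-written formula.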


2021 ◽  
Author(s):  
Tianyi Liu ◽  
Zhehui Chen ◽  
Enlu Zhou ◽  
Tuo Zhao

The momentum stochastic gradient descent (MSGD) algorithm has been widely applied to many nonconvex optimization problems in machine learning (e.g., training deep neural networks, variational Bayesian inference, etc.). Despite its empirical success, there is still a lack of theoretical understanding of the convergence properties of MSGD. To fill this gap, we propose to analyze the algorithmic behavior of MSGD by diffusion approximations for nonconvex optimization problems with strict saddle points and isolated local optima. Our study shows that momentum helps escape from saddle points but hurts convergence within the neighborhood of optima (in the absence of step size annealing or momentum annealing). Our theoretical discovery partially corroborates the empirical success of MSGD in training deep neural networks.
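The saddle-point claim can be probed numerically with a small sketch. The code below is not the paper's diffusion analysis; it runs the usual heavy-ball update on a hypothetical two-dimensional function with a strict saddle at the origin and counts the steps needed to leave the saddle region, with and without momentum. The test function, the starting point, the escape threshold of 0.5, and the noise level are all assumptions made for illustration.

```python
import random


def grad(x, y):
    # Gradient of the hypothetical function
    # f(x, y) = 0.5*x**2 + 0.25*y**4 - 0.5*y**2,
    # which has a strict saddle at (0, 0) and local minima at (0, 1) and (0, -1).
    return x, y ** 3 - y


def steps_to_escape(momentum, lr=0.01, noise_std=1e-3,
                    max_steps=10_000, seed=0):
    """Count heavy-ball SGD steps until |y| > 0.5, starting near the saddle."""
    rng = random.Random(seed)
    x, y = 1.0, 1e-3          # start almost on the saddle's stable direction
    vx, vy = 0.0, 0.0         # velocity (momentum buffer)
    for step in range(1, max_steps + 1):
        gx, gy = grad(x, y)
        gx += rng.gauss(0.0, noise_std)   # small gradient noise
        gy += rng.gauss(0.0, noise_std)
        vx = momentum * vx - lr * gx
        vy = momentum * vy - lr * gy
        x += vx
        y += vy
        if abs(y) > 0.5:
            return step
    return max_steps


plain_steps = steps_to_escape(momentum=0.0)
momentum_steps = steps_to_escape(momentum=0.9)
print(plain_steps, momentum_steps)
```

On this toy problem the run with momentum 0.9 leaves the saddle in noticeably fewer steps than plain SGD. The sketch says nothing about the second half of the claim, namely the behavior near optima.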


2018 ◽  
Vol 11 (1) ◽  
pp. 7 ◽  
Author(s):  
Matteo Grimaldi ◽  
Valerio Tenace ◽  
Andrea Calimera

Convolutional Neural Networks (CNNs) are brain-inspired computational models designed to recognize patterns. Recent advances demonstrate that CNNs are able to achieve, and often exceed, human capabilities in many application domains. With several million parameters, even the simplest CNN has a large model size. This characteristic is a serious concern for deployment on resource-constrained embedded systems, where compression stages are needed to meet stringent hardware constraints. In this paper, we introduce a novel accuracy-driven compressive training algorithm. It consists of a two-stage flow: first, layers are sorted by means of heuristic rules according to their significance; second, a modified stochastic gradient descent optimization is applied to less significant layers such that their representation is collapsed into a constrained subspace. Experimental results demonstrate that our approach achieves remarkable compression rates with low accuracy loss (<1%).


2021 ◽  
Author(s):  
Ruthvik Vaila

Spiking neural networks are biologically plausible counterparts of artificial neural networks. Artificial neural networks are usually trained with stochastic gradient descent (SGD) and spiking neural networks are trained with bioinspired spike timing dependent plasticity (STDP). Spiking networks could potentially help in reducing power usage owing to their binary activations. In this work, we use unsupervised STDP in the feature extraction layers of a neural network with instantaneous neurons to extract meaningful features. The extracted binary feature vectors are then classified using classification layers containing neurons with binary activations. Gradient descent (backpropagation) is used only on the output layer to perform training for classification. Surrogate gradients are proposed to perform backpropagation with binary gradients. The accuracies obtained for MNIST and the balanced EMNIST data set compare favorably with other approaches. The effect of the stochastic gradient descent (SGD) approximations on the learning capabilities of our network is also explored. We also study catastrophic forgetting and its effect on spiking neural networks (SNNs). For the experiments regarding catastrophic forgetting, in the classification sections of the network we use a modified synaptic intelligence, which we refer to as the cost per synapse metric, as a regularizer to immunize the network against catastrophic forgetting in a Single-Incremental-Task scenario (SIT). In the catastrophic forgetting experiments, we use the MNIST and EMNIST handwritten digits data sets, divided into five and ten incremental subtasks respectively. We also examine the behavior of the spiking neural network and empirically study the effect of various hyperparameters on its learning capabilities using the software tool SPYKEFLOW that we developed. We employ the MNIST, EMNIST and NMNIST data sets to produce our results.


2020 ◽  
Vol 34 (06) ◽  
pp. 10126-10135
Author(s):  
Artyom Gadetsky ◽  
Kirill Struminsky ◽  
Christopher Robinson ◽  
Novi Quadrianto ◽  
Dmitry Vetrov

Learning models with discrete latent variables using stochastic gradient descent remains a challenge due to the high variance of gradient estimates. Modern variance reduction techniques mostly consider categorical distributions and have limited applicability when the number of possible outcomes becomes large. In this work, we consider models with latent permutations and propose control variates for the Plackett-Luce distribution. In particular, the control variates allow us to optimize black-box functions over permutations using stochastic gradient descent. To illustrate the approach, we consider a variety of causal structure learning tasks for continuous and discrete data. We show that our method outperforms competitive relaxation-based optimization methods and is also applicable to non-differentiable score functions.
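For readers unfamiliar with the Plackett-Luce distribution over permutations, the sketch below shows two standard operations on it: drawing a permutation by perturbing log-scores with Gumbel noise and sorting, and evaluating the log-probability of a given permutation, which is the quantity a score-function gradient estimator differentiates. It does not implement the paper's control variates. The function names and the example scores are hypothetical.

```python
import itertools
import math
import random


def sample_plackett_luce(log_scores, rng):
    """Draw a permutation by adding Gumbel(0, 1) noise and sorting descending."""
    perturbed = []
    for s in log_scores:
        # Clamp the uniform draw away from 0 and 1 so both logs are finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        perturbed.append(s - math.log(-math.log(u)))
    return sorted(range(len(log_scores)),
                  key=lambda i: perturbed[i], reverse=True)


def log_prob_plackett_luce(log_scores, permutation):
    """Log-probability of a permutation: items are picked one at a time,
    each with probability proportional to exp(score) among those remaining."""
    ordered = [log_scores[i] for i in permutation]
    total = 0.0
    for k in range(len(ordered)):
        tail = ordered[k:]
        m = max(tail)  # subtract the max for a stable log-sum-exp
        log_norm = m + math.log(sum(math.exp(s - m) for s in tail))
        total += ordered[k] - log_norm
    return total


scores = [0.5, 0.0, -1.0]          # hypothetical log-scores for three items
rng = random.Random(0)
sample = sample_plackett_luce(scores, rng)

# Sanity check: probabilities of all 3! permutations should sum to 1.
total_prob = sum(math.exp(log_prob_plackett_luce(scores, p))
                 for p in itertools.permutations(range(len(scores))))
print(sample, total_prob)
```

Enumerating all permutations is only feasible for tiny examples. The factorial growth in the number of outcomes is exactly why variance reduction techniques designed for small categorical distributions do not carry over directly.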


2020 ◽  
Vol 2020 (12) ◽  
pp. 124010
Author(s):  
Sebastian Goldt ◽  
Madhu S Advani ◽  
Andrew M Saxe ◽  
Florent Krzakala ◽  
Lenka Zdeborová

Entropy ◽  
2020 ◽  
Vol 22 (5) ◽  
pp. 560
Author(s):  
Shrihari Vasudevan

This paper demonstrates a novel approach to training deep neural networks using a Mutual Information (MI)-driven, decaying Learning Rate (LR), Stochastic Gradient Descent (SGD) algorithm. MI between the output of the neural network and the true outcomes is used to adaptively set the LR for the network in every epoch of the training cycle. This idea is extended to layer-wise setting of the LR, as MI naturally provides a layer-wise performance metric. An LR range test for determining the operating LR range is also proposed. Experiments compared this approach with popular alternatives such as the gradient-based adaptive LR algorithms Adam, RMSprop, and LARS. Accuracy that is competitive with or better than these alternatives, obtained in competitive or better time, demonstrates the feasibility of the metric and approach.
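The quantity driving this scheme, MI between network outputs and true outcomes, can be estimated for discrete labels with a simple plug-in formula. The sketch below shows only that estimate, in nats. How the paper maps the MI value to a learning rate is not stated in the abstract and is not reproduced here. The function name `mutual_information` and the toy label lists are assumptions.

```python
import math
from collections import Counter


def mutual_information(predicted, actual):
    """Plug-in estimate (in nats) of the MI between two discrete label lists."""
    n = len(predicted)
    joint = Counter(zip(predicted, actual))
    pred_counts = Counter(predicted)
    true_counts = Counter(actual)
    mi = 0.0
    for (p, t), count in joint.items():
        # p(p, t) * log( p(p, t) / (p(p) * p(t)) ), written with raw counts
        mi += (count / n) * math.log(count * n / (pred_counts[p] * true_counts[t]))
    return mi


# Predictions that match the labels exactly carry the full label entropy.
perfect = mutual_information([0, 1, 0, 1], [0, 1, 0, 1])
# Predictions that are independent of the labels carry no information.
independent = mutual_information([0, 0, 1, 1], [0, 1, 0, 1])
print(perfect, independent)
```

For balanced binary labels the perfect case equals ln 2 and the independent case equals 0, so the estimate rises as the predictions become more informative about the true outcomes.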


Entropy ◽  
2020 ◽  
Vol 22 (1) ◽  
pp. 101
Author(s):  
Rita Fioresi ◽  
Pratik Chaudhari ◽  
Stefano Soatto

This paper is a step towards developing a geometric understanding of stochastic gradient descent (SGD), a popular algorithm for training deep neural networks. We build upon a recent result which observed that the noise in SGD while training typical networks is highly non-isotropic. This motivates a deterministic model in which the trajectories of our dynamical systems are described via geodesics of a family of metrics arising from a certain diffusion matrix, namely the covariance of the stochastic gradients in SGD. Our model is analogous to models in general relativity: the role of the electromagnetic field in the latter is played by the gradient of the loss function of a deep network in the former.

