THE USE OF CONTROL THEORY METHODS IN TRAINING NEURAL NETWORKS ON THE EXAMPLE OF TEETH RECOGNITION ON PANORAMIC X-RAY IMAGES

2021 ◽  
Vol 13 (2) ◽  
pp. 36-40
Author(s):  
A. Smorodin

The article investigates a modification of stochastic gradient descent (SGD) based on a previously developed theory of stabilizing cycles in discrete dynamical systems. The relation between cycle stabilization in discrete dynamical systems and finding extremum points allowed us to apply new control methods to accelerate gradient descent as it approaches local minima. Gradient descent is widely used, alongside other iterative methods, for training deep neural networks. We conducted comparative experiments with two gradient methods, SGD and Adam. All experiments were carried out while solving a practical problem: teeth recognition on 2-D panoramic X-ray images. Network training showed that the new method outperforms plain SGD and, for the parameters chosen, approaches the capabilities of Adam, a state-of-the-art method. This demonstrates the practical utility of control theory in training deep neural networks and the possibility of extending its applicability to the creation of new algorithms in this important field.
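
To make the comparison concrete, here is a minimal, hypothetical sketch of the baseline setup described above: the same network trained once with plain SGD and once with Adam in PyTorch. The cycle-stabilization modification itself is not specified in the abstract, so only the reference optimizers appear; the placeholder network, data loader, and hyperparameters are assumptions.

```python
import torch
import torch.nn as nn

def train(model, loader, optimizer, epochs=10):
    # Shared training loop used for both optimizers in the comparison.
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()

# Placeholder classifier standing in for the teeth-recognition network.
model_sgd = nn.Sequential(nn.Flatten(), nn.Linear(128 * 128, 32))
model_adam = nn.Sequential(nn.Flatten(), nn.Linear(128 * 128, 32))
opt_sgd = torch.optim.SGD(model_sgd.parameters(), lr=1e-2, momentum=0.9)
opt_adam = torch.optim.Adam(model_adam.parameters(), lr=1e-3)
# train(model_sgd, loader, opt_sgd); train(model_adam, loader, opt_adam)
```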

Author(s):  
A.P. Karpenko ◽  
V.A. Ovchinnikov

The study aims to develop an algorithm, and then software, for synthesising noise that can be used to attack deep learning neural networks designed to classify images. We present the results of our analysis of methods for conducting this type of attack. The synthesis of attack noise is stated as a problem of multidimensional constrained optimization. The main features of the proposed attack noise synthesis algorithm are as follows: we employ the clip function to take constraints on the noise into account; we use the top-1 and top-5 classification error ratings as attack noise efficiency criteria; we train our neural networks using backpropagation and the Adam gradient descent algorithm; stochastic gradient descent is employed to solve the optimisation problem indicated above; and neural network training also makes use of the augmentation technique. The software was developed in Python using the PyTorch framework for dynamic differentiation of the computation graph and runs under Ubuntu 18.04 and CentOS 7. Our IDE was Visual Studio Code. We accelerated the computation via CUDA executed on an NVIDIA Titan Xp GPU. The paper presents the results of a broad computational experiment in synthesising non-universal and universal attack noise for eight deep neural networks. We show that the proposed attack algorithm is able to increase the neural network error by a factor of eight.
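
For illustration, a minimal sketch (not the authors' code) of non-universal attack-noise synthesis along the lines described above: the noise tensor is the only trainable parameter, SGD drives up the classification loss, and torch.clamp plays the role of the clip function enforcing the noise constraint. The epsilon bound, step count, and learning rate are illustrative assumptions, not values from the paper.

```python
import torch
import torch.nn as nn

def synthesize_noise(model, image, label, eps=8 / 255, lr=1e-2, steps=100):
    model.eval()
    noise = torch.zeros_like(image, requires_grad=True)
    optimizer = torch.optim.SGD([noise], lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        optimizer.zero_grad()
        logits = model((image + noise).clamp(0, 1))
        loss = -loss_fn(logits, label)   # gradient ascent on the classification error
        loss.backward()
        optimizer.step()
        with torch.no_grad():
            noise.clamp_(-eps, eps)      # clip: constraint on the noise magnitude
    return noise.detach()
```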


2021 ◽  
Author(s):  
Tianyi Liu ◽  
Zhehui Chen ◽  
Enlu Zhou ◽  
Tuo Zhao

The momentum stochastic gradient descent (MSGD) algorithm has been widely applied to many nonconvex optimization problems in machine learning (e.g., training deep neural networks, variational Bayesian inference). Despite its empirical success, there is still a lack of theoretical understanding of the convergence properties of MSGD. To fill this gap, we propose to analyze the algorithmic behavior of MSGD by diffusion approximations for nonconvex optimization problems with strict saddle points and isolated local optima. Our study shows that momentum helps escape saddle points but hurts convergence within the neighborhood of optima (in the absence of step-size or momentum annealing). Our theoretical discovery partially corroborates the empirical success of MSGD in training deep neural networks.
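
For reference, the standard heavy-ball form of the MSGD update analysed above, written as a generic PyTorch sketch; the paper's own notation and step-size schedule may differ.

```python
import torch

def msgd_step(params, grads, velocities, lr=0.01, mu=0.9):
    # v accumulates an exponentially weighted history of stochastic gradients;
    # the momentum coefficient mu controls how strongly past directions
    # carry the iterate past saddle points.
    for p, g, v in zip(params, grads, velocities):
        v.mul_(mu).add_(g, alpha=-lr)   # v <- mu * v - lr * g
        p.add_(v)                       # x <- x + v
```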


2020 ◽  
Vol 117 (36) ◽  
pp. 21857-21864
Author(s):  
Philipp C. Verpoort ◽  
Alpha A. Lee ◽  
David J. Wales

The predictive capabilities of deep neural networks (DNNs) continue to evolve to increasingly impressive levels. However, it is still unclear how training procedures for DNNs succeed in finding parameters that produce good results for such high-dimensional and nonconvex loss functions. In particular, we wish to understand why simple optimization schemes, such as stochastic gradient descent, do not end up trapped in local minima with high loss values that would not yield useful predictions. We explain the optimizability of DNNs by characterizing the local minima and transition states of the loss-function landscape (LFL) along with their connectivity. We show that the LFL of a DNN in the shallow network or data-abundant limit is funneled, and thus easy to optimize. Crucially, in the opposite low-data/deep limit, although the number of minima increases, the landscape is characterized by many minima with similar loss values separated by low barriers. This organization is different from the hierarchical landscapes of structural glass formers and explains why minimization procedures commonly employed by the machine-learning community can navigate the LFL successfully and reach low-lying solutions.


Author(s):  
Jinghui Chen ◽  
Dongruo Zhou ◽  
Yiqi Tang ◽  
Ziyan Yang ◽  
Yuan Cao ◽  
...  

Adaptive gradient methods, which use historical gradient information to automatically adjust the learning rate, have been observed to generalize worse than stochastic gradient descent (SGD) with momentum when training deep neural networks, despite their fast convergence. How to close this generalization gap of adaptive gradient methods remains an open problem. In this work, we show that adaptive gradient methods such as Adam and Amsgrad are sometimes "over-adapted". We design a new algorithm, called the partially adaptive momentum estimation method, which unifies Adam/Amsgrad with SGD by introducing a partial adaptive parameter $p$, to achieve the best of both worlds. We also prove the convergence rate of our proposed algorithm to a stationary point in the stochastic nonconvex optimization setting. Experiments on standard benchmarks show that our proposed algorithm maintains the fast convergence of Adam/Amsgrad while generalizing as well as SGD in training deep neural networks. These results suggest that practitioners can pick up adaptive gradient methods once again for faster training of deep neural networks.
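
A sketch of the partially adaptive update described above, reconstructed from the abstract rather than the authors' code: with the partial adaptive parameter p = 1/2 the step reduces to an Amsgrad-like update, while p = 0 recovers SGD with momentum. Hyperparameter values are illustrative.

```python
import torch

def padam_step(param, grad, m, v, v_max, lr=0.1, beta1=0.9, beta2=0.999,
               p=0.125, eps=1e-8):
    m.mul_(beta1).add_(grad, alpha=1 - beta1)             # first-moment estimate
    v.mul_(beta2).addcmul_(grad, grad, value=1 - beta2)   # second-moment estimate
    v_max.copy_(torch.maximum(v_max, v))                  # Amsgrad-style running max
    param.addcdiv_(m, (v_max + eps).pow(p), value=-lr)    # partially adaptive step
```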


2015 ◽  
Vol 2015 ◽  
pp. 1-10 ◽  
Author(s):  
Chenghao Cai ◽  
Yanyan Xu ◽  
Dengfeng Ke ◽  
Kaile Su

We propose multistate activation functions (MSAFs) for deep neural networks (DNNs). These MSAFs are new kinds of activation functions which are capable of representing more than two states, including the N-order MSAFs and the symmetrical MSAF. DNNs with these MSAFs can be trained via conventional Stochastic Gradient Descent (SGD) as well as mean-normalised SGD. We also discuss how these MSAFs perform when used to resolve classification problems. Experimental results on the TIMIT corpus reveal that, on speech recognition tasks, DNNs with MSAFs perform better than conventional DNNs, achieving a relative improvement of 5.60% in phoneme error rate. Further experiments also reveal that mean-normalised SGD facilitates the training of DNNs with MSAFs, especially with large training sets. The models can also be trained directly without pretraining when the training set is sufficiently large, which results in a considerable relative improvement of 5.82% in word error rate.
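
As a hedged illustration of the idea, the snippet below models an N-order MSAF as a sum of N logistic sigmoids shifted by a fixed offset, which yields N + 1 roughly flat output states; the exact parameterisation used in the paper may differ.

```python
import torch
import torch.nn as nn

class NOrderMSAF(nn.Module):
    """Illustrative multistate activation: a sum of shifted sigmoids."""

    def __init__(self, n=2, offset=4.0):
        super().__init__()
        self.n = n
        self.offset = offset

    def forward(self, x):
        # Each shifted sigmoid contributes one additional output plateau.
        return sum(torch.sigmoid(x - k * self.offset) for k in range(self.n))
```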


Entropy ◽  
2020 ◽  
Vol 22 (5) ◽  
pp. 560
Author(s):  
Shrihari Vasudevan

This paper demonstrates a novel approach to training deep neural networks using a Mutual Information (MI)-driven, decaying Learning Rate (LR), Stochastic Gradient Descent (SGD) algorithm. The MI between the output of the neural network and the true outcomes is used to adaptively set the LR for the network in every epoch of the training cycle. This idea is extended to a layer-wise setting of the LR, as MI naturally provides a layer-wise performance metric. An LR range test determining the operating LR range is also proposed. Experiments compared this approach with popular alternatives such as gradient-based adaptive LR algorithms like Adam, RMSprop, and LARS. Accuracy outcomes that are competitive or better, obtained in competitive or better time, demonstrate the feasibility of the metric and approach.
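
One plausible (not necessarily the authors') way to realise such an MI-driven rule: estimate the mutual information between predicted and true labels at the end of each epoch and decay the SGD learning rate as the MI approaches the label entropy. The scaling rule and the base_lr and min_lr values below are assumptions for illustration only.

```python
import torch
from sklearn.metrics import mutual_info_score

def mi_learning_rate(preds, labels, base_lr=0.1, min_lr=1e-4):
    # MI between predicted and true class labels, estimated from this epoch.
    mi = mutual_info_score(labels.cpu().numpy(), preds.cpu().numpy())
    # MI(X, X) equals the entropy of the labels, the upper bound on mi.
    h = mutual_info_score(labels.cpu().numpy(), labels.cpu().numpy())
    return max(min_lr, base_lr * (1.0 - mi / max(h, 1e-12)))  # decay as mi -> h

# for g in optimizer.param_groups: g["lr"] = mi_learning_rate(preds, labels)
```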


Entropy ◽  
2020 ◽  
Vol 22 (12) ◽  
pp. 1379
Author(s):  
Betty Cortiñas-Lorenzo ◽  
Fernando Pérez-González

As training Deep Neural Networks (DNNs) becomes more expensive, interest in protecting the ownership of models with watermarking techniques increases. Uchida et al. proposed a digital watermarking algorithm that embeds a secret message into the model coefficients. However, despite its appeal, in this paper we show that its efficacy can be compromised by the optimization algorithm being used. In particular, we found through theoretical analysis that, as opposed to Stochastic Gradient Descent (SGD), the update direction given by Adam optimization strongly depends on the sign of a combination of columns of the projection matrix used for watermarking. Consequently, as observed in the empirical results, this makes the coefficients move in unison, giving rise to heavily spiked weight distributions that can be easily detected by adversaries. To solve this problem, we propose a new method called Block-Orthonormal Projections (BOP) that allows watermarking to be combined with Adam optimization with only a minor impact on the detectability of the watermark and with increased robustness.
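
For context, a minimal sketch of an Uchida-style embedding regulariser of the kind the analysis above concerns: the secret bits are pushed into a host weight tensor through a projection matrix X added to the task loss. The block-orthonormal construction of X proposed in the paper is not reproduced here; the regularisation weight lam and the choice of host tensor are assumptions.

```python
import torch
import torch.nn.functional as F

def watermark_loss(weight, X, bits, lam=0.01):
    # weight: host weight tensor; X: (num_bits, weight.numel()) projection matrix;
    # bits: secret message as a 0/1 tensor of length num_bits.
    w = weight.flatten()
    logits = X @ w                          # project coefficients onto the key
    return lam * F.binary_cross_entropy_with_logits(logits, bits.float())

# total_loss = task_loss + watermark_loss(conv.weight, X, bits)
```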


Entropy ◽  
2020 ◽  
Vol 22 (1) ◽  
pp. 101
Author(s):  
Rita Fioresi ◽  
Pratik Chaudhari ◽  
Stefano Soatto

This paper is a step towards developing a geometric understanding of stochastic gradient descent (SGD), a popular algorithm for training deep neural networks. We build upon a recent result which observed that the noise in SGD while training typical networks is highly non-isotropic. That result motivated a deterministic model in which the trajectories of our dynamical systems are described via geodesics of a family of metrics arising from a certain diffusion matrix, namely the covariance of the stochastic gradients in SGD. Our model is analogous to models in general relativity: the role played by the electromagnetic field in the latter is played in the former by the gradient of the loss function of a deep network.
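
A small numerical sketch of the central quantity above, the covariance (diffusion matrix) of per-minibatch gradients, whose eigenvalue spread exposes the non-isotropy of SGD noise. This brute-force estimate is only feasible for tiny models and is not the paper's methodology.

```python
import torch

def gradient_covariance(model, loss_fn, batches):
    # Collect the flattened gradient for each minibatch.
    grads = []
    for x, y in batches:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        grads.append(torch.cat([p.grad.flatten() for p in model.parameters()]))
    G = torch.stack(grads)        # (num_batches, num_params)
    return torch.cov(G.T)         # (num_params, num_params) gradient covariance

# eigvals = torch.linalg.eigvalsh(gradient_covariance(model, loss_fn, batches))
```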


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Stephen Whitelam ◽  
Viktor Selin ◽  
Sang-Won Park ◽  
Isaac Tamblyn

We show analytically that training a neural network by conditioned stochastic mutation or neuroevolution of its weights is equivalent, in the limit of small mutations, to gradient descent on the loss function in the presence of Gaussian white noise. Averaged over independent realizations of the learning process, neuroevolution is equivalent to gradient descent on the loss function. We use numerical simulation to show that this correspondence can be observed for finite mutations, for shallow and deep neural networks. Our results provide a connection between two families of neural-network training methods that are usually considered to be fundamentally different.
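
A toy sketch of the two procedures being related above: a conditioned mutation step that accepts a Gaussian weight perturbation only if the loss does not increase, next to a gradient-descent step with added Gaussian white noise. The mutation scale sigma, learning rate, and acceptance rule shown here are simplified assumptions.

```python
import torch

def mutation_step(params, loss_fn, sigma=1e-3):
    # Conditioned stochastic mutation: perturb all weights, keep only if no worse.
    with torch.no_grad():
        old = [p.clone() for p in params]
        old_loss = loss_fn()
        for p in params:
            p.add_(sigma * torch.randn_like(p))
        if loss_fn() > old_loss:          # reject mutations that worsen the loss
            for p, o in zip(params, old):
                p.copy_(o)

def noisy_gd_step(params, loss_fn, lr=1e-3, sigma=1e-3):
    # Gradient descent with Gaussian white noise added to the update.
    loss = loss_fn()
    loss.backward()
    with torch.no_grad():
        for p in params:
            p.add_(-lr * p.grad + sigma * torch.randn_like(p))
            p.grad = None
```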

