Improving Adversarial Attacks on Deep Neural Networks via Constricted Gradient-based Perturbations

Mutual Information Based Learning Rate Decay for Stochastic Gradient Descent Training of Deep Neural Networks

Entropy ◽

10.3390/e22050560 ◽

2020 ◽

Vol 22 (5) ◽

pp. 560

Author(s):

Shrihari Vasudevan

Keyword(s):

Neural Networks ◽

Mutual Information ◽

Gradient Descent ◽

Deep Neural Networks ◽

Learning Rate ◽

Stochastic Gradient ◽

Stochastic Gradient Descent ◽

Novel Approach ◽

The Neural Network ◽

Gradient Based

This paper demonstrates a novel approach to training deep neural networks using a Mutual Information (MI)-driven, decaying Learning Rate (LR), Stochastic Gradient Descent (SGD) algorithm. MI between the output of the neural network and true outcomes is used to adaptively set the LR for the network, in every epoch of the training cycle. This idea is extended to layer-wise setting of LR, as MI naturally provides a layer-wise performance metric. A LR range test determining the operating LR range is also proposed. Experiments compared this approach with popular alternatives such as gradient-based adaptive LR algorithms like Adam, RMSprop, and LARS. Competitive to better accuracy outcomes obtained in competitive to better time, demonstrate the feasibility of the metric and approach.

Download Full-text

Generalization Error Bounds of Gradient Descent for Learning Over-Parameterized Deep ReLU Networks

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v34i04.5736 ◽

2020 ◽

Vol 34 (04) ◽

pp. 3349-3356

Author(s):

Yuan Cao ◽

Quanquan Gu

Keyword(s):

Neural Networks ◽

Error Bounds ◽

Gradient Descent ◽

Deep Neural Networks ◽

Empirical Studies ◽

Training Data ◽

Generalization Error ◽

Generalization Performance ◽

Gradient Based ◽

Random Initialization

Empirical studies show that gradient-based methods can learn deep neural networks (DNNs) with very good generalization performance in the over-parameterization regime, where DNNs can easily fit a random labeling of the training data. Very recently, a line of work explains in theory that with over-parameterization and proper random initialization, gradient-based methods can find the global minima of the training loss for DNNs. However, existing generalization error bounds are unable to explain the good generalization performance of over-parameterized DNNs. The major limitation of most existing generalization bounds is that they are based on uniform convergence and are independent of the training algorithm. In this work, we derive an algorithm-dependent generalization error bound for deep ReLU networks, and show that under certain assumptions on the data distribution, gradient descent (GD) with proper random initialization is able to train a sufficiently over-parameterized DNN to achieve arbitrarily small generalization error. Our work sheds light on explaining the good generalization performance of over-parameterized deep neural networks.

Download Full-text

A Gradient-Based Algorithm to Deceive Deep Neural Networks

Communications in Computer and Information Science - Neural Information Processing ◽

10.1007/978-3-030-36808-1_7 ◽

2019 ◽

pp. 57-65

Author(s):

Tianying Xie ◽

Yantao Li

Keyword(s):

Neural Networks ◽

Deep Neural Networks ◽

Gradient Based

Download Full-text

Effects of depth, width, and initialization: A convergence analysis of layer-wise training for deep linear neural networks

Analysis and Applications ◽

10.1142/s0219530521500263 ◽

2021 ◽

pp. 1-47

Author(s):

Yeonjong Shin

Keyword(s):

Neural Networks ◽

Convergence Analysis ◽

Deep Neural Networks ◽

Computational Cost ◽

Back Propagation ◽

Single Layer ◽

Intermediate Layers ◽

Machine Learning Applications ◽

Gradient Based ◽

Accelerate Convergence

Deep neural networks have been used in various machine learning applications and achieved tremendous empirical successes. However, training deep neural networks is a challenging task. Many alternatives have been proposed in place of end-to-end back-propagation. Layer-wise training is one of them, which trains a single layer at a time, rather than trains the whole layers simultaneously. In this paper, we study a layer-wise training using a block coordinate gradient descent (BCGD) for deep linear networks. We establish a general convergence analysis of BCGD and found the optimal learning rate, which results in the fastest decrease in the loss. We identify the effects of depth, width, and initialization. When the orthogonal-like initialization is employed, we show that the width of intermediate layers plays no role in gradient-based training beyond a certain threshold. Besides, we found that the use of deep networks could drastically accelerate convergence when it is compared to those of a depth 1 network, even when the computational cost is considered. Numerical examples are provided to justify our theoretical findings and demonstrate the performance of layer-wise training by BCGD.

Download Full-text

Deep forest

National Science Review ◽

10.1093/nsr/nwy108 ◽

2018 ◽

Vol 6 (1) ◽

pp. 74-86 ◽

Cited By ~ 25

Author(s):

Zhi-Hua Zhou ◽

Ji Feng

Keyword(s):

Neural Networks ◽

Deep Learning ◽

Decision Trees ◽

Deep Neural Networks ◽

Machine Learning Techniques ◽

Model Complexity ◽

Layer By Layer ◽

Theoretical Understanding ◽

Deep Forest ◽

Gradient Based

Abstract Current deep-learning models are mostly built upon neural networks, i.e. multiple layers of parameterized differentiable non-linear modules that can be trained by backpropagation. In this paper, we explore the possibility of building deep models based on non-differentiable modules such as decision trees. After a discussion about the mystery behind deep neural networks, particularly by contrasting them with shallow neural networks and traditional machine-learning techniques such as decision trees and boosting machines, we conjecture that the success of deep neural networks owes much to three characteristics, i.e. layer-by-layer processing, in-model feature transformation and sufficient model complexity. On one hand, our conjecture may offer inspiration for theoretical understanding of deep learning; on the other hand, to verify the conjecture, we propose an approach that generates deep forest holding these characteristics. This is a decision-tree ensemble approach, with fewer hyper-parameters than deep neural networks, and its model complexity can be automatically determined in a data-dependent way. Experiments show that its performance is quite robust to hyper-parameter settings, such that in most cases, even across different data from different domains, it is able to achieve excellent performance by using the same default setting. This study opens the door to deep learning based on non-differentiable modules without gradient-based adjustment, and exhibits the possibility of constructing deep models without backpropagation.

Download Full-text

IWA: Integrated gradient‐based white‐box attacks for fooling deep neural networks

International Journal of Intelligent Systems ◽

10.1002/int.22720 ◽

2021 ◽

Author(s):

Yixiang Wang ◽

Jiqiang Liu ◽

Xiaolin Chang ◽

Jelena Mišić ◽

Vojislav B. Mišić

Keyword(s):

Neural Networks ◽

Deep Neural Networks ◽

Gradient Based

Download Full-text

Critical Point-Finding Methods Reveal Gradient-Flat Regions of Deep Network Losses

Neural Computation ◽

10.1162/neco_a_01388 ◽

2021 ◽

pp. 1-29

Author(s):

Charles G. Frye ◽

James Simon ◽

Neha S. Wadia ◽

Andrew Ligeralde ◽

Michael R. DeWeese ◽

...

Keyword(s):

Neural Network ◽

Neural Networks ◽

Critical Points ◽

Deep Neural Networks ◽

Local Minima ◽

Local Curvature ◽

Gradient Based ◽

Network Losses ◽

Pass Through ◽

Theoretical Results

Abstract Despite the fact that the loss functions of deep neural networks are highly nonconvex, gradient-based optimization algorithms converge to approximately the same performance from many random initial points. One thread of work has focused on explaining this phenomenon by numerically characterizing the local curvature near critical points of the loss function, where the gradients are near zero. Such studies have reported that neural network losses enjoy a no-bad-local-minima property, in disagreement with more recent theoretical results. We report here that the methods used to find these putative critical points suffer from a bad local minima problem of their own: they often converge to or pass through regions where the gradient norm has a stationary point. We call these gradient-flat regions, since they arise when the gradient is approximately in the kernel of the Hessian, such that the loss is locally approximately linear, or flat, in the direction of the gradient. We describe how the presence of these regions necessitates care in both interpreting past results that claimed to find critical points of neural network losses and in designing second-order methods for optimizing neural networks.

Download Full-text

Deep neural networks trained with heavier data augmentation learn features closer to representations in hIT

10.32470/ccn.2018.1046-0 ◽

2018 ◽

Cited By ~ 1

Author(s):

Alex Hernández-García ◽

Johannes Mehrer ◽

Nikolaus Kriegeskorte ◽

Peter König ◽

Tim C. Kietzmann

Keyword(s):

Neural Networks ◽

Deep Neural Networks ◽

Data Augmentation

Download Full-text

Representation of adversarial images in deep neural networks and the human brain

10.32470/ccn.2018.1066-0 ◽

2018 ◽

Author(s):

Chi Zhang ◽

Xiaohan Duan ◽

Ruyuan Zhang ◽

Li Tong

Keyword(s):

Neural Networks ◽

Human Brain ◽

Deep Neural Networks

Download Full-text

Study on Intelligent Security Camera Systems for the Automated Detection of Nighttime Snatching Incidents using Deep Neural Networks

IEEJ Transactions on Industry Applications ◽

10.1541/ieejias.136.727 ◽

2016 ◽

Vol 136 (10) ◽

pp. 727-734 ◽

Cited By ~ 2

Author(s):

Itaru Nagayama ◽

Akira Miyahara ◽

Koichi Shimabukuro

Keyword(s):

Neural Networks ◽

Deep Neural Networks ◽

Automated Detection ◽

Camera Systems ◽

Intelligent Security

Download Full-text