On the Neural Tangent Kernel of Deep Networks with Orthogonal Initialization

Author(s):  
Wei Huang ◽  
Weitao Du ◽  
Richard Yi Da Xu

The prevailing thinking is that orthogonal weights are crucial to enforcing dynamical isometry and speeding up training. The increase in learning speed that results from orthogonal initialization in linear networks has been well established. However, while the same is believed to also hold for nonlinear networks when the dynamical isometry condition is satisfied, the training dynamics behind this contention have not been thoroughly explored. In this work, we study the dynamics of ultra-wide networks across a range of architectures, including Fully Connected Networks (FCNs) and Convolutional Neural Networks (CNNs) with orthogonal initialization, via the Neural Tangent Kernel (NTK). Through a series of propositions and lemmas, we prove that two NTKs, one corresponding to Gaussian weights and one to orthogonal weights, are equal when the network width is infinite. Further, during training, the NTK of an orthogonally initialized infinite-width network remains theoretically constant. This suggests that orthogonal initialization cannot speed up training in the NTK (lazy training) regime, contrary to the prevailing belief. To explore under what circumstances orthogonality can accelerate training, we conduct a thorough empirical investigation outside the NTK regime. We find that when the hyper-parameters are set so that the nonlinear activations operate in a near-linear regime, orthogonal initialization can improve the learning speed at large learning rates or large depths.
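
As a rough empirical companion to the equality claim, the sketch below compares the empirical NTK of a two-layer ReLU network under Gaussian and (scaled) orthogonal first-layer initialization. It is a minimal illustration in the 1/sqrt(n) NTK parameterization, not the paper's construction; the width, the orthogonal scaling, and keeping the readout Gaussian are our assumptions.

```python
# Minimal sketch (not the paper's code): empirical NTK of a two-layer ReLU
# network, f(x) = w2 . relu(W1 x) / sqrt(n), under two initializations.
import numpy as np

rng = np.random.default_rng(0)

def empirical_ntk(W1, w2, x1, x2):
    """Theta(x1, x2) = <grad_theta f(x1), grad_theta f(x2)>."""
    n = W1.shape[0]
    h1, h2 = W1 @ x1, W1 @ x2
    a1, a2 = np.maximum(h1, 0), np.maximum(h2, 0)             # ReLU activations
    d1, d2 = (h1 > 0).astype(float), (h2 > 0).astype(float)   # ReLU derivatives
    # readout-weight contribution + first-layer-weight contribution
    return (a1 @ a2 + (w2 * d1) @ (w2 * d2) * (x1 @ x2)) / n

d, n = 10, 8192                          # input dimension, hidden width
x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
x1, x2 = x1 / np.linalg.norm(x1), x2 / np.linalg.norm(x2)

w2 = rng.standard_normal(n)              # readout kept Gaussian in both cases

W1_gauss = rng.standard_normal((n, d))
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))   # orthonormal columns
W1_orth = np.sqrt(n) * Q                 # rescaled so pre-activations match in scale

print("Gaussian   NTK(x1,x2):", empirical_ntk(W1_gauss, w2, x1, x2))
print("Orthogonal NTK(x1,x2):", empirical_ntk(W1_orth, w2, x1, x2))
# As n grows, the two values concentrate around the same infinite-width limit.
```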

Inventions ◽  
2021 ◽  
Vol 6 (4) ◽  
pp. 70
Author(s):  
Elena Solovyeva ◽  
Ali Abdullah

In this paper, the structure of a separable convolutional neural network consisting of an embedding layer, separable convolutional layers, a convolutional layer, and global average pooling is presented for binary and multiclass text classification. The advantage of the proposed structure is the absence of multiple fully connected layers, which are commonly used to increase classification accuracy but raise the computational cost. The combination of low-cost separable convolutional layers and a convolutional layer is proposed to attain high accuracy while reducing the complexity of the neural classifiers. The advantages are demonstrated on binary and multiclass classification of written texts using the proposed networks with sigmoid and softmax activation functions in the convolutional layer. For both binary and multiclass classification, the accuracy obtained by the separable convolutional neural networks is higher than that of several investigated types of recurrent neural networks and fully connected networks.
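
A minimal Keras sketch of the kind of architecture described above (embedding, separable 1-D convolutions, one ordinary convolution, global average pooling, and a sigmoid head for the binary case). The vocabulary size, sequence length, and filter counts are placeholder assumptions, not the authors' configuration.

```python
# Illustrative separable-convolution text classifier; sizes are made up.
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, seq_len, embed_dim = 20000, 200, 64

model = tf.keras.Sequential([
    tf.keras.Input(shape=(seq_len,), dtype="int32"),
    layers.Embedding(vocab_size, embed_dim),
    layers.SeparableConv1D(64, 5, padding="same", activation="relu"),
    layers.SeparableConv1D(64, 5, padding="same", activation="relu"),
    layers.Conv1D(32, 3, padding="same", activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation="sigmoid"),   # softmax Dense over classes for multiclass
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```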


Entropy ◽  
2020 ◽  
Vol 22 (7) ◽  
pp. 727 ◽  
Author(s):  
Hlynur Jónsson ◽  
Giovanni Cherubini ◽  
Evangelos Eleftheriou

Information theory concepts are leveraged with the goal of better understanding and improving Deep Neural Networks (DNNs). The information plane of neural networks describes the behavior, during training, of the mutual information at various depths between input/output and hidden-layer variables. Previous analysis revealed that, in networks where finiteness of the mutual information can be established, most of the training epochs are spent compressing the input. However, the estimation of mutual information is nontrivial for high-dimensional continuous random variables. Therefore, the computation of mutual information for DNNs and its visualization on the information plane have mostly focused on low-complexity fully connected networks. In fact, even the existence of the compression phase in complex DNNs has been questioned and viewed as an open problem. In this paper, we present the convergence of mutual information on the information plane for a high-dimensional VGG-16 Convolutional Neural Network (CNN) by resorting to Mutual Information Neural Estimation (MINE), thus confirming and extending the results obtained with low-dimensional fully connected networks. Furthermore, we demonstrate the benefits of regularizing a network, especially for a large number of training epochs, by adopting mutual information estimates as additional terms in the loss function characteristic of the network. Experimental results show that the regularization stabilizes the test accuracy and significantly reduces its variance.
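
For readers unfamiliar with MINE, the following is a minimal PyTorch sketch of the Donsker-Varadhan lower bound that MINE maximizes. The statistics network, sizes, and toy data are illustrative assumptions, not the estimator configuration used in the paper.

```python
# Donsker-Varadhan bound: I(X;Y) >= E_P[T(x,y)] - log E_{P_X x P_Y}[exp T(x,y)]
import math
import torch
import torch.nn as nn

class StatisticsNet(nn.Module):
    def __init__(self, dim_x, dim_y, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_x + dim_y, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=1))

def mine_lower_bound(T, x, y):
    joint = T(x, y).mean()
    y_shuffled = y[torch.randperm(y.size(0))]        # samples from the product measure
    marginal = torch.logsumexp(T(x, y_shuffled).squeeze(-1), dim=0) - math.log(x.size(0))
    return joint - marginal

# toy usage: strongly dependent Gaussians, maximize the bound over T
torch.manual_seed(0)
x = torch.randn(512, 5)
y = x + 0.1 * torch.randn(512, 5)
T = StatisticsNet(5, 5)
opt = torch.optim.Adam(T.parameters(), lr=1e-3)
for _ in range(300):
    loss = -mine_lower_bound(T, x, y)
    opt.zero_grad(); loss.backward(); opt.step()
print("MI lower-bound estimate (nats):", mine_lower_bound(T, x, y).item())
```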


2020 ◽  
Vol 1 ◽  
pp. 6
Author(s):  
Henning Petzka ◽  
Martin Trimmel ◽  
Cristian Sminchisescu

Symmetries in neural networks allow different weight configurations to lead to the same network function. For odd activation functions, the set of transformations mapping between such configurations has been studied extensively, but less is known for neural networks with ReLU activation functions. We give a complete characterization for fully-connected networks with two layers. Apart from two well-known transformations, only degenerate situations admit additional transformations that leave the network function unchanged. Reduction steps can remove only part of the degenerate cases. Finally, we present a non-degenerate situation for deep neural networks that leads to new transformations leaving the network function intact.
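
The two well-known transformations for ReLU networks are hidden-unit permutation and positive rescaling (ReLU is positively homogeneous). A quick numerical check of both for a generic two-layer network might look as follows; the sizes and the scaling constant are arbitrary choices of this sketch.

```python
# Verify that permutation and positive rescaling leave the network function unchanged.
import numpy as np

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(z, 0)

def f(x, W1, b1, W2, b2):
    return W2 @ relu(W1 @ x + b1) + b2

d, n, k = 4, 6, 3
W1, b1 = rng.standard_normal((n, d)), rng.standard_normal(n)
W2, b2 = rng.standard_normal((k, n)), rng.standard_normal(k)
x = rng.standard_normal(d)

# 1) permutation of hidden units
perm = rng.permutation(n)
out_perm = f(x, W1[perm], b1[perm], W2[:, perm], b2)

# 2) positive rescaling of hidden unit 0 by c > 0
c = 2.7
W1s, b1s, W2s = W1.copy(), b1.copy(), W2.copy()
W1s[0], b1s[0], W2s[:, 0] = c * W1[0], c * b1[0], W2[:, 0] / c

print(np.allclose(f(x, W1, b1, W2, b2), out_perm))                 # True
print(np.allclose(f(x, W1, b1, W2, b2), f(x, W1s, b1s, W2s, b2)))  # True
```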


2020 ◽  
Vol 34 (10) ◽  
pp. 13791-13792
Author(s):  
Liangzhu Ge ◽  
Yuexian Hou ◽  
Yaju Jiang ◽  
Shuai Yao ◽  
Chao Yang

Despite their widespread applications, deep neural networks often tend to overfit the training data. Here, we propose a measure called VECA (Variance of Eigenvalues of Covariance matrix of Activation matrix) and demonstrate that VECA is a good predictor of a network's generalization performance during the training process. Experiments performed on fully-connected networks and convolutional neural networks trained on benchmark image datasets show a strong correlation between test loss and VECA, which suggests that we can calculate VECA to estimate generalization performance without sacrificing training data for use as a validation set.
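
A minimal sketch of how such a measure can be computed from one layer's activation matrix; the exact normalization and how the measure is aggregated across layers are assumptions left open in this illustration.

```python
# VECA for a single layer: variance of the eigenvalues of the covariance
# matrix of the activation matrix.
import numpy as np

def veca(activations):
    """activations: (num_samples, num_units) activation matrix of one layer."""
    cov = np.cov(activations, rowvar=False)    # (units, units) covariance matrix
    eigvals = np.linalg.eigvalsh(cov)          # symmetric matrix -> real spectrum
    return np.var(eigvals)

# toy usage with random activations standing in for a trained layer
rng = np.random.default_rng(0)
print(veca(rng.standard_normal((1000, 64))))
```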


2011 ◽  
Vol 216 ◽  
pp. 39-44
Author(s):  
Shi Liang Lv ◽  
Jin Guo Liu ◽  
Ping Jia

The drift error of an inertial platform behaves as a high-order nonlinear dynamic system. Exploiting neural networks' ability to universally approximate differentiable trajectories and to capture system dynamics, this paper presents a scheme for identifying the drift error of an inertial platform based on an Elman network structure. First, the drift-error model of the inertial platform is established; after selecting the network inputs and outputs, a momentum and variable-learning-rate algorithm is used to speed up network convergence. On this basis, the extended nonlinear node function in the hidden layer not only improves the learning speed of the network but also satisfies the accuracy requirements of system identification. Training on drift-error data measured on an inertial platform shows that the scheme achieves satisfactory identification results.
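
An illustrative sketch only: an Elman-style recurrent network fitted to a synthetic drift-like sequence with gradient descent using momentum and a step-decayed ("variable speed") learning rate. The data, network sizes, and schedule are placeholder assumptions, not the identification scheme of the paper.

```python
# Toy sequence identification with an Elman recurrent network in PyTorch.
import torch
import torch.nn as nn

torch.manual_seed(0)

class ElmanIdentifier(nn.Module):
    def __init__(self, hidden=16):
        super().__init__()
        self.rnn = nn.RNN(input_size=1, hidden_size=hidden,
                          nonlinearity="tanh", batch_first=True)  # Elman-style recurrence
        self.out = nn.Linear(hidden, 1)
    def forward(self, u):
        h, _ = self.rnn(u)
        return self.out(h)

# synthetic stand-in for measured drift data: input sequence u -> drift y
t = torch.linspace(0, 10, 400).view(1, -1, 1)
u = torch.sin(t)
y = 0.5 * torch.sin(t) ** 3 + 0.1 * t            # some nonlinear drift trajectory

model = ElmanIdentifier()
opt = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=200, gamma=0.5)
loss_fn = nn.MSELoss()
for epoch in range(600):
    opt.zero_grad()
    loss = loss_fn(model(u), y)
    loss.backward()
    opt.step(); sched.step()
print("final identification MSE:", loss.item())
```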


2020 ◽  
Vol 32 (10) ◽  
pp. 1836-1862
Author(s):  
Wei Sun ◽  
Jeff Orchard

Predictive coding (PC) networks are a biologically interesting class of neural networks. Their layered hierarchy mimics the reciprocal connectivity pattern observed in the mammalian cortex, and they can be trained using local learning rules that approximate backpropagation (Bogacz, 2017). However, despite having feedback connections that enable information to flow down the network hierarchy, discriminative PC networks are not typically generative. Clamping the output class and running the network to equilibrium yields an input sample that usually does not resemble the training input. This letter studies this phenomenon and proposes a simple solution that promotes the generation of input samples that resemble the training inputs. Simple decay, a technique already in wide use in neural networks, pushes the PC network toward a unique minimum two-norm solution, and that unique solution provably (for linear networks) matches the training inputs. The method also vastly improves the samples generated for nonlinear networks, as we demonstrate on MNIST.
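
The linear-network claim can be illustrated numerically: with a small decay term, inverting a clamped output through a linear map selects the minimum two-norm input, i.e. the pseudoinverse solution. The sketch below is ours, not the letter's code, and the decay coefficient is an arbitrary small value.

```python
# Decay-regularized inversion of a linear map picks the minimum two-norm input.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 8))          # wide linear "network": many x give the same y
x_train = rng.standard_normal(8)
y = W @ x_train                          # clamp the output to a training target

lam = 1e-6                               # small decay coefficient
# minimizer of ||W x - y||^2 + lam ||x||^2  (ridge / decay-regularized inversion)
x_decay = np.linalg.solve(W.T @ W + lam * np.eye(8), W.T @ y)
x_minnorm = np.linalg.pinv(W) @ y        # minimum two-norm solution W^+ y

print(np.allclose(x_decay, x_minnorm, atol=1e-4))             # True: decay picks W^+ y
print(np.linalg.norm(x_minnorm) <= np.linalg.norm(x_train))   # never longer than x_train
```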


Author(s):  
Satoru Watanabe ◽  
Hayato Yamana

The inner representation of deep neural networks (DNNs) is indecipherable, which makes it difficult to tune DNN models, control their training process, and interpret their outputs. In this paper, we propose a novel approach to investigating the inner representation of DNNs through topological data analysis (TDA). Persistent homology (PH), one of the outstanding methods in TDA, was employed to investigate the complexities of trained DNNs. We constructed clique complexes on trained DNNs and calculated their one-dimensional PH. The PH reveals the combinational effects of multiple neurons in DNNs at different resolutions, which are difficult to capture without PH. Evaluations were conducted using fully connected networks (FCNs) and networks combining FCNs and convolutional neural networks (CNNs) trained on the MNIST and CIFAR-10 data sets. The evaluation results demonstrate that the PH of DNNs reflects both the excess of neurons and problem difficulty, making PH one of the prominent methods for investigating the inner representation of DNNs.
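
A toy sketch of the general recipe (turn connection strengths into a filtration and compute low-dimensional PH), here using the ripser package on a distance matrix derived from absolute weights. This particular weight-to-distance mapping and the single-matrix setting are our assumptions, not the authors' clique-complex construction.

```python
# Persistent homology of a "neuron graph" built from a stand-in weight matrix.
import numpy as np
from ripser import ripser          # pip install ripser

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 32))  # stand-in for a trained weight matrix

# stronger connection -> smaller distance between the two neurons
strength = np.abs(W) + np.abs(W).T          # symmetrize
dist = 1.0 / (strength + 1e-6)
np.fill_diagonal(dist, 0.0)

diagrams = ripser(dist, distance_matrix=True, maxdim=1)["dgms"]
print("H0 features:", len(diagrams[0]), " H1 features:", len(diagrams[1]))
```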


Author(s):  
Rafael Stahl ◽  
Alexander Hoffman ◽  
Daniel Mueller-Gritschneder ◽  
Andreas Gerstlauer ◽  
Ulf Schlichtmann

Performing inference of Convolutional Neural Networks (CNNs) on Internet of Things (IoT) edge devices ensures privacy of the input data and can reduce run time compared to a cloud solution. As most edge devices are memory- and compute-constrained, they cannot store and execute complex CNNs. Partitioning and distributing layer information across multiple edge devices to reduce the amount of computation and data on each device presents a solution to this problem. In this article, we propose DeeperThings, an approach that supports a full distribution of CNN inference tasks by partitioning fully-connected as well as both feature- and weight-intensive convolutional layers. Additionally, we jointly optimize memory, computation and communication demands. This is achieved using techniques to combine both feature and weight partitioning with a communication-aware layer fusion method, enabling holistic optimization across layers. For a given number of edge devices, the schemes are applied jointly using Integer Linear Programming (ILP) formulations to minimize the data exchanged between devices, to optimize run times and to find the entire model's minimal memory footprint. Experimental results from a real-world hardware setup running four different CNN models confirm that the scheme is able to evenly balance the memory footprint between devices. For six devices on 100 Mbit/s connections, the integration of layer fusion additionally leads to a reduction of communication demands by up to 28.8%. This results in a speed-up of the inference run time by up to 1.52x compared to layer partitioning without fusing.
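
As a flavor of what an ILP-based partitioning looks like, here is a deliberately tiny PuLP model that assigns weight blocks of one fully-connected layer to devices under a memory cap while minimizing a crude communication proxy. The costs, variables, and objective are toy assumptions and not the DeeperThings formulation.

```python
# Toy layer-partitioning ILP with PuLP (pip install pulp).
import pulp

blocks = range(8)            # weight blocks of one fully-connected layer
devices = range(4)
mem_per_block, mem_cap, feat_bytes = 3, 10, 100

prob = pulp.LpProblem("layer_partition", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", [(b, d) for b in blocks for d in devices], cat="Binary")
y = pulp.LpVariable.dicts("y", devices, cat="Binary")   # device receives the input features?

prob += feat_bytes * pulp.lpSum(y[d] for d in devices)  # objective: feature bytes sent
for b in blocks:                                        # every block placed exactly once
    prob += pulp.lpSum(x[(b, d)] for d in devices) == 1
for d in devices:
    prob += pulp.lpSum(mem_per_block * x[(b, d)] for b in blocks) <= mem_cap
    for b in blocks:                                    # using device d forces y[d] = 1
        prob += x[(b, d)] <= y[d]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("feature bytes sent:", pulp.value(prob.objective))
for d in devices:
    print(f"device {d}:", [b for b in blocks if pulp.value(x[(b, d)]) > 0.5])
```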


2020 ◽  
Author(s):  
Manik Dhingra ◽  
Sarthak Rawat ◽  
Jinan Fiaidhi

The work presented here aims at higher performance on an image recognition task using convolutional neural networks on the MNIST handwritten digits data-set. A range of techniques is compared for improvements with respect to time and accuracy, such as using one-shot Extreme Learning Machines (ELMs) in place of iteratively tuned fully-connected networks for classification, using transfer learning for faster convergence of image classification, and enlarging the data-set and building more robust models through image augmentation. The final implementation is hosted in the cloud as a web service for better visualization of the prediction results.
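
A minimal sketch of the one-shot ELM idea mentioned above: a fixed random hidden layer followed by a closed-form least-squares readout. The shapes assume flattened 28x28 images, and random data stand in for MNIST; nothing here is the authors' implementation.

```python
# One-shot Extreme Learning Machine: random hidden layer + pseudoinverse readout.
import numpy as np

rng = np.random.default_rng(0)

def elm_fit(X, Y, hidden=1000):
    """X: (N, 784) inputs, Y: (N, 10) one-hot labels."""
    W = rng.standard_normal((X.shape[1], hidden)) / np.sqrt(X.shape[1])
    b = rng.standard_normal(hidden)
    H = np.tanh(X @ W + b)                  # random, untrained hidden features
    beta = np.linalg.pinv(H) @ Y            # single least-squares solve, no iterations
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.argmax(np.tanh(X @ W + b) @ beta, axis=1)

# toy usage with random data standing in for MNIST
X = rng.standard_normal((2000, 784))
Y = np.eye(10)[rng.integers(0, 10, 2000)]
W, b, beta = elm_fit(X, Y)
print("train accuracy:", np.mean(elm_predict(X, W, b, beta) == Y.argmax(1)))
```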


2021 ◽  
Vol 29 (5) ◽  
pp. 775-798
Author(s):  
Sergey Glyzin ◽  
Andrey Kolesov

Nonlinear systems of differential equations with delay, which serve as mathematical models of fully connected networks of impulse neurons, are considered. The purpose of this work is to study the dynamical properties of one special class of solutions to these systems. Methods. Large-parameter methods are used to study the existence and stability, in the considered models, of special periodic motions, the so-called group dominance or k-dominance modes, where k ∈ N. Results. It is shown that each such regime is a relaxation cycle in which exactly k components perform synchronous impulse oscillations while all other components are asymptotically small. With an appropriate choice of parameters, the maximum number of stable coexisting group dominance cycles in the system is 2^m − 1, where m is the number of network elements. Conclusion. The considered model, with the maximum possible number of couplings, allows us to describe the most complex and diverse behavior that may be observed in biological neural associations. A feature of the k-dominance modes considered here is that some of the network neurons remain in a non-working (refractory) state. Each periodic k-dominance mode can be associated with a binary vector (α1, α2, ..., αm), where αj = 1 if the j-th neuron is active and αj = 0 otherwise. Taking this into account, we conclude that these modes can be used to build devices with associative memory based on artificial neural networks.
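
A one-line counting step behind the 2^m − 1 bound, assuming (as the binary-vector correspondence above indicates) that the stable group dominance cycles range over the nonzero activity patterns:

```latex
\sum_{k=1}^{m} \binom{m}{k} \;=\; 2^{m} - 1
```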

