On the Neural Tangent Kernel of Deep Networks with Orthogonal Initialization

Author(s):  
Wei Huang ◽  
Weitao Du ◽  
Richard Yi Da Xu

The prevailing thinking is that orthogonal weights are crucial to enforcing dynamical isometry and speeding up training. The increase in learning speed that results from orthogonal initialization in linear networks has been well established. However, while the same is believed to also hold for nonlinear networks when the dynamical isometry condition is satisfied, the training dynamics behind this contention have not been thoroughly explored. In this work, we study the dynamics of ultra-wide networks across a range of architectures, including Fully Connected Networks (FCNs) and Convolutional Neural Networks (CNNs) with orthogonal initialization, via the Neural Tangent Kernel (NTK). Through a series of propositions and lemmas, we prove that two NTKs, one corresponding to Gaussian weights and one to orthogonal weights, are equal when the network width is infinite. Further, during training, the NTK of an orthogonally initialized infinite-width network remains theoretically constant. This suggests that orthogonal initialization cannot speed up training in the NTK (lazy training) regime, contrary to the prevailing belief. To explore under what circumstances orthogonality can accelerate training, we conduct a thorough empirical investigation outside the NTK regime. We find that when the hyper-parameters are set so that the nonlinear activations operate in a near-linear regime, orthogonal initialization can improve the learning speed at large learning rates or large depths.
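
As a rough empirical companion to the equality claim, the sketch below compares the empirical NTK of a two-layer ReLU network under Gaussian and (scaled) orthogonal first-layer initialization. It is a minimal illustration in the 1/sqrt(n) NTK parameterization, not the paper's construction; the width, the orthogonal scaling, and keeping the readout Gaussian are our assumptions.

```python
# Minimal sketch (not the paper's code): empirical NTK of a two-layer ReLU
# network, f(x) = w2 . relu(W1 x) / sqrt(n), under two initializations.
import numpy as np

rng = np.random.default_rng(0)

def empirical_ntk(W1, w2, x1, x2):
    """Theta(x1, x2) = <grad_theta f(x1), grad_theta f(x2)>."""
    n = W1.shape[0]
    h1, h2 = W1 @ x1, W1 @ x2
    a1, a2 = np.maximum(h1, 0), np.maximum(h2, 0)             # ReLU activations
    d1, d2 = (h1 > 0).astype(float), (h2 > 0).astype(float)   # ReLU derivatives
    # readout-weight contribution + first-layer-weight contribution
    return (a1 @ a2 + (w2 * d1) @ (w2 * d2) * (x1 @ x2)) / n

d, n = 10, 8192                          # input dimension, hidden width
x1, x2 = rng.standard_normal(d), rng.standard_normal(d)
x1, x2 = x1 / np.linalg.norm(x1), x2 / np.linalg.norm(x2)

w2 = rng.standard_normal(n)              # readout kept Gaussian in both cases

W1_gauss = rng.standard_normal((n, d))
Q, _ = np.linalg.qr(rng.standard_normal((n, d)))   # orthonormal columns
W1_orth = np.sqrt(n) * Q                 # rescaled so pre-activations match in scale

print("Gaussian   NTK(x1,x2):", empirical_ntk(W1_gauss, w2, x1, x2))
print("Orthogonal NTK(x1,x2):", empirical_ntk(W1_orth, w2, x1, x2))
# As n grows, the two values concentrate around the same infinite-width limit.
```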

Inventions ◽  
2021 ◽  
Vol 6 (4) ◽  
pp. 70
Author(s):  
Elena Solovyeva ◽  
Ali Abdullah

In this paper, the structure of a separable convolutional neural network consisting of an embedding layer, separable convolutional layers, a convolutional layer, and global average pooling is presented for binary and multiclass text classification. The advantage of the proposed structure is the absence of multiple fully connected layers, which are commonly used to increase classification accuracy but raise the computational cost. The combination of low-cost separable convolutional layers and a convolutional layer is proposed to attain high accuracy while reducing the complexity of the neural classifiers. The advantages are demonstrated on binary and multiclass classification of written texts using the proposed networks with sigmoid and softmax activation functions in the convolutional layer. For both binary and multiclass classification, the accuracy obtained by the separable convolutional neural networks is higher than that of several investigated types of recurrent neural networks and fully connected networks.
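
A minimal Keras sketch of the kind of architecture described above (embedding, separable 1-D convolutions, one ordinary convolution, global average pooling, and a sigmoid head for the binary case). The vocabulary size, sequence length, and filter counts are placeholder assumptions, not the authors' configuration.

```python
# Illustrative separable-convolution text classifier; sizes are made up.
import tensorflow as tf
from tensorflow.keras import layers

vocab_size, seq_len, embed_dim = 20000, 200, 64

model = tf.keras.Sequential([
    tf.keras.Input(shape=(seq_len,), dtype="int32"),
    layers.Embedding(vocab_size, embed_dim),
    layers.SeparableConv1D(64, 5, padding="same", activation="relu"),
    layers.SeparableConv1D(64, 5, padding="same", activation="relu"),
    layers.Conv1D(32, 3, padding="same", activation="relu"),
    layers.GlobalAveragePooling1D(),
    layers.Dense(1, activation="sigmoid"),   # softmax Dense over classes for multiclass
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```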


Entropy ◽  
2020 ◽  
Vol 22 (7) ◽  
pp. 727 ◽  
Author(s):  
Hlynur Jónsson ◽  
Giovanni Cherubini ◽  
Evangelos Eleftheriou

Information theory concepts are leveraged with the goal of better understanding and improving Deep Neural Networks (DNNs). The information plane of neural networks describes the behavior, during training, of the mutual information at various depths between input/output and hidden-layer variables. Previous analysis revealed that, in networks where finiteness of the mutual information can be established, most of the training epochs are spent compressing the input. However, the estimation of mutual information is nontrivial for high-dimensional continuous random variables. Therefore, the computation of mutual information for DNNs and its visualization on the information plane have mostly focused on low-complexity fully connected networks. In fact, even the existence of the compression phase in complex DNNs has been questioned and viewed as an open problem. In this paper, we present the convergence of mutual information on the information plane for a high-dimensional VGG-16 Convolutional Neural Network (CNN) by resorting to Mutual Information Neural Estimation (MINE), thus confirming and extending the results obtained with low-dimensional fully connected networks. Furthermore, we demonstrate the benefits of regularizing a network, especially for a large number of training epochs, by adopting mutual information estimates as additional terms in the loss function characteristic of the network. Experimental results show that the regularization stabilizes the test accuracy and significantly reduces its variance.
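
For readers unfamiliar with MINE, the following is a minimal PyTorch sketch of the Donsker-Varadhan lower bound that MINE maximizes. The statistics network, sizes, and toy data are illustrative assumptions, not the estimator configuration used in the paper.

```python
# Donsker-Varadhan bound: I(X;Y) >= E_P[T(x,y)] - log E_{P_X x P_Y}[exp T(x,y)]
import math
import torch
import torch.nn as nn

class StatisticsNet(nn.Module):
    def __init__(self, dim_x, dim_y, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_x + dim_y, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )
    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=1))

def mine_lower_bound(T, x, y):
    joint = T(x, y).mean()
    y_shuffled = y[torch.randperm(y.size(0))]        # samples from the product measure
    marginal = torch.logsumexp(T(x, y_shuffled).squeeze(-1), dim=0) - math.log(x.size(0))
    return joint - marginal

# toy usage: strongly dependent Gaussians, maximize the bound over T
torch.manual_seed(0)
x = torch.randn(512, 5)
y = x + 0.1 * torch.randn(512, 5)
T = StatisticsNet(5, 5)
opt = torch.optim.Adam(T.parameters(), lr=1e-3)
for _ in range(300):
    loss = -mine_lower_bound(T, x, y)
    opt.zero_grad(); loss.backward(); opt.step()
print("MI lower-bound estimate (nats):", mine_lower_bound(T, x, y).item())
```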


2020 ◽  
Vol 1 ◽  
pp. 6
Author(s):  
Henning Petzka ◽  
Martin Trimmel ◽  
Cristian Sminchisescu

Symmetries in neural networks allow different weight configurations to lead to the same network function. For odd activation functions, the set of transformations mapping between such configurations has been studied extensively, but less is known for neural networks with ReLU activation functions. We give a complete characterization for fully-connected networks with two layers. Apart from two well-known transformations, only degenerate situations admit additional transformations that leave the network function unchanged. Reduction steps can remove only part of the degenerate cases. Finally, we present a non-degenerate situation for deep neural networks that leads to new transformations leaving the network function intact.
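
The two well-known transformations for ReLU networks are hidden-unit permutation and positive rescaling (ReLU is positively homogeneous). A quick numerical check of both for a generic two-layer network might look as follows; the sizes and the scaling constant are arbitrary choices of this sketch.

```python
# Verify that permutation and positive rescaling leave the network function unchanged.
import numpy as np

rng = np.random.default_rng(1)
relu = lambda z: np.maximum(z, 0)

def f(x, W1, b1, W2, b2):
    return W2 @ relu(W1 @ x + b1) + b2

d, n, k = 4, 6, 3
W1, b1 = rng.standard_normal((n, d)), rng.standard_normal(n)
W2, b2 = rng.standard_normal((k, n)), rng.standard_normal(k)
x = rng.standard_normal(d)

# 1) permutation of hidden units
perm = rng.permutation(n)
out_perm = f(x, W1[perm], b1[perm], W2[:, perm], b2)

# 2) positive rescaling of hidden unit 0 by c > 0
c = 2.7
W1s, b1s, W2s = W1.copy(), b1.copy(), W2.copy()
W1s[0], b1s[0], W2s[:, 0] = c * W1[0], c * b1[0], W2[:, 0] / c

print(np.allclose(f(x, W1, b1, W2, b2), out_perm))                 # True
print(np.allclose(f(x, W1, b1, W2, b2), f(x, W1s, b1s, W2s, b2)))  # True
```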


2020 ◽  
Vol 34 (10) ◽  
pp. 13791-13792
Author(s):  
Liangzhu Ge ◽  
Yuexian Hou ◽  
Yaju Jiang ◽  
Shuai Yao ◽  
Chao Yang

Despite their widespread applications, deep neural networks often tend to overfit the training data. Here, we propose a measure called VECA (Variance of Eigenvalues of Covariance matrix of Activation matrix) and demonstrate that VECA is a good predictor of a network's generalization performance during the training process. Experiments performed on fully-connected networks and convolutional neural networks trained on benchmark image datasets show a strong correlation between test loss and VECA, which suggests that we can calculate VECA to estimate generalization performance without sacrificing training data for use as a validation set.
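
A minimal sketch of how such a measure can be computed from one layer's activation matrix; the exact normalization and how the measure is aggregated across layers are assumptions left open in this illustration.

```python
# VECA for a single layer: variance of the eigenvalues of the covariance
# matrix of the activation matrix.
import numpy as np

def veca(activations):
    """activations: (num_samples, num_units) activation matrix of one layer."""
    cov = np.cov(activations, rowvar=False)    # (units, units) covariance matrix
    eigvals = np.linalg.eigvalsh(cov)          # symmetric matrix -> real spectrum
    return np.var(eigvals)

# toy usage with random activations standing in for a trained layer
rng = np.random.default_rng(0)
print(veca(rng.standard_normal((1000, 64))))
```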


2011 ◽  
Vol 216 ◽  
pp. 39-44
Author(s):  
Shi Liang Lv ◽  
Jin Guo Liu ◽  
Ping Jia

The drift error of an inertial platform behaves as a high-order nonlinear dynamic system. Exploiting neural networks' ability to universally approximate differentiable trajectories and to capture system dynamics, this paper presents a scheme for identifying the drift error of an inertial platform based on an Elman network structure. First, the drift-error model of the inertial platform is established; after selecting the network inputs and outputs, a momentum and variable-learning-rate algorithm is used to speed up network convergence. On this basis, the extended nonlinear node function in the hidden layer not only improves the learning speed of the network but also satisfies the accuracy requirements of system identification. Training on drift-error data measured on an inertial platform shows that the scheme achieves satisfactory identification results.
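
An illustrative sketch only: an Elman-style recurrent network fitted to a synthetic drift-like sequence with gradient descent using momentum and a step-decayed ("variable speed") learning rate. The data, network sizes, and schedule are placeholder assumptions, not the identification scheme of the paper.

```python
# Toy sequence identification with an Elman recurrent network in PyTorch.
import torch
import torch.nn as nn

torch.manual_seed(0)

class ElmanIdentifier(nn.Module):
    def __init__(self, hidden=16):
        super().__init__()
        self.rnn = nn.RNN(input_size=1, hidden_size=hidden,
                          nonlinearity="tanh", batch_first=True)  # Elman-style recurrence
        self.out = nn.Linear(hidden, 1)
    def forward(self, u):
        h, _ = self.rnn(u)
        return self.out(h)

# synthetic stand-in for measured drift data: input sequence u -> drift y
t = torch.linspace(0, 10, 400).view(1, -1, 1)
u = torch.sin(t)
y = 0.5 * torch.sin(t) ** 3 + 0.1 * t            # some nonlinear drift trajectory

model = ElmanIdentifier()
opt = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=200, gamma=0.5)
loss_fn = nn.MSELoss()
for epoch in range(600):
    opt.zero_grad()
    loss = loss_fn(model(u), y)
    loss.backward()
    opt.step(); sched.step()
print("final identification MSE:", loss.item())
```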


2020 ◽  
Vol 32 (10) ◽  
pp. 1836-1862
Author(s):  
Wei Sun ◽  
Jeff Orchard

Predictive coding (PC) networks are a biologically interesting class of neural networks. Their layered hierarchy mimics the reciprocal connectivity pattern observed in the mammalian cortex, and they can be trained using local learning rules that approximate backpropagation (Bogacz, 2017). However, despite having feedback connections that enable information to flow down the network hierarchy, discriminative PC networks are not typically generative. Clamping the output class and running the network to equilibrium yields an input sample that usually does not resemble the training input. This letter studies this phenomenon and proposes a simple solution that promotes the generation of input samples that resemble the training inputs. Simple decay, a technique already in wide use in neural networks, pushes the PC network toward a unique minimum two-norm solution, and that unique solution provably (for linear networks) matches the training inputs. The method also vastly improves the samples generated for nonlinear networks, as we demonstrate on MNIST.
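
The linear-network claim can be illustrated numerically: with a small decay term, inverting a clamped output through a linear map selects the minimum two-norm input, i.e. the pseudoinverse solution. The sketch below is ours, not the letter's code, and the decay coefficient is an arbitrary small value.

```python
# Decay-regularized inversion of a linear map picks the minimum two-norm input.
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 8))          # wide linear "network": many x give the same y
x_train = rng.standard_normal(8)
y = W @ x_train                          # clamp the output to a training target

lam = 1e-6                               # small decay coefficient
# minimizer of ||W x - y||^2 + lam ||x||^2  (ridge / decay-regularized inversion)
x_decay = np.linalg.solve(W.T @ W + lam * np.eye(8), W.T @ y)
x_minnorm = np.linalg.pinv(W) @ y        # minimum two-norm solution W^+ y

print(np.allclose(x_decay, x_minnorm, atol=1e-4))             # True: decay picks W^+ y
print(np.linalg.norm(x_minnorm) <= np.linalg.norm(x_train))   # never longer than x_train
```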


Author(s):  
Satoru Watanabe ◽  
Hayato Yamana

The inner representation of deep neural networks (DNNs) is indecipherable, which makes it difficult to tune DNN models, control their training process, and interpret their outputs. In this paper, we propose a novel approach to investigating the inner representation of DNNs through topological data analysis (TDA). Persistent homology (PH), one of the outstanding methods in TDA, was employed to investigate the complexities of trained DNNs. We constructed clique complexes on trained DNNs and calculated their one-dimensional PH. The PH reveals the combinational effects of multiple neurons in DNNs at different resolutions, which are difficult to capture without PH. Evaluations were conducted using fully connected networks (FCNs) and networks combining FCNs and convolutional neural networks (CNNs) trained on the MNIST and CIFAR-10 data sets. The evaluation results demonstrate that the PH of DNNs reflects both the excess of neurons and problem difficulty, making PH one of the prominent methods for investigating the inner representation of DNNs.
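
A toy sketch of the general recipe (turn connection strengths into a filtration and compute low-dimensional PH), here using the ripser package on a distance matrix derived from absolute weights. This particular weight-to-distance mapping and the single-matrix setting are our assumptions, not the authors' clique-complex construction.

```python
# Persistent homology of a "neuron graph" built from a stand-in weight matrix.
import numpy as np
from ripser import ripser          # pip install ripser

rng = np.random.default_rng(0)
W = rng.standard_normal((32, 32))  # stand-in for a trained weight matrix

# stronger connection -> smaller distance between the two neurons
strength = np.abs(W) + np.abs(W).T          # symmetrize
dist = 1.0 / (strength + 1e-6)
np.fill_diagonal(dist, 0.0)

diagrams = ripser(dist, distance_matrix=True, maxdim=1)["dgms"]
print("H0 features:", len(diagrams[0]), " H1 features:", len(diagrams[1]))
```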


Author(s):  
Rafael Stahl ◽  
Alexander Hoffman ◽  
Daniel Mueller-Gritschneder ◽  
Andreas Gerstlauer ◽  
Ulf Schlichtmann

Performing inference of Convolutional Neural Networks (CNNs) on Internet of Things (IoT) edge devices ensures privacy of the input data and can reduce run time compared to a cloud solution. As most edge devices are memory- and compute-constrained, they cannot store and execute complex CNNs. Partitioning and distributing layer information across multiple edge devices to reduce the amount of computation and data on each device presents a solution to this problem. In this article, we propose DeeperThings, an approach that supports a full distribution of CNN inference tasks by partitioning fully-connected as well as both feature- and weight-intensive convolutional layers. Additionally, we jointly optimize memory, computation and communication demands. This is achieved using techniques to combine both feature and weight partitioning with a communication-aware layer fusion method, enabling holistic optimization across layers. For a given number of edge devices, the schemes are applied jointly using Integer Linear Programming (ILP) formulations to minimize the data exchanged between devices, to optimize run times and to find the entire model's minimal memory footprint. Experimental results from a real-world hardware setup running four different CNN models confirm that the scheme is able to evenly balance the memory footprint between devices. For six devices on 100 Mbit/s connections, the integration of layer fusion additionally leads to a reduction of communication demands by up to 28.8%. This results in a speed-up of the inference run time by up to 1.52x compared to layer partitioning without fusing.
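
As a flavor of what an ILP-based partitioning looks like, here is a deliberately tiny PuLP model that assigns weight blocks of one fully-connected layer to devices under a memory cap while minimizing a crude communication proxy. The costs, variables, and objective are toy assumptions and not the DeeperThings formulation.

```python
# Toy layer-partitioning ILP with PuLP (pip install pulp).
import pulp

blocks = range(8)            # weight blocks of one fully-connected layer
devices = range(4)
mem_per_block, mem_cap, feat_bytes = 3, 10, 100

prob = pulp.LpProblem("layer_partition", pulp.LpMinimize)
x = pulp.LpVariable.dicts("x", [(b, d) for b in blocks for d in devices], cat="Binary")
y = pulp.LpVariable.dicts("y", devices, cat="Binary")   # device receives the input features?

prob += feat_bytes * pulp.lpSum(y[d] for d in devices)  # objective: feature bytes sent
for b in blocks:                                        # every block placed exactly once
    prob += pulp.lpSum(x[(b, d)] for d in devices) == 1
for d in devices:
    prob += pulp.lpSum(mem_per_block * x[(b, d)] for b in blocks) <= mem_cap
    for b in blocks:                                    # using device d forces y[d] = 1
        prob += x[(b, d)] <= y[d]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
print("feature bytes sent:", pulp.value(prob.objective))
for d in devices:
    print(f"device {d}:", [b for b in blocks if pulp.value(x[(b, d)]) > 0.5])
```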


2020 ◽  
Author(s):  
Manik Dhingra ◽  
Sarthak Rawat ◽  
Jinan Fiaidhi

The work presented here aims at higher performance on an image recognition task using convolutional neural networks on the MNIST handwritten digits data-set. A range of techniques is compared for improvements with respect to time and accuracy, such as using one-shot Extreme Learning Machines (ELMs) in place of iteratively tuned fully-connected networks for classification, using transfer learning for faster convergence of image classification, and enlarging the data-set and building more robust models through image augmentation. The final implementation is hosted in the cloud as a web service for better visualization of the prediction results.
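
A minimal sketch of the one-shot ELM idea mentioned above: a fixed random hidden layer followed by a closed-form least-squares readout. The shapes assume flattened 28x28 images, and random data stand in for MNIST; nothing here is the authors' implementation.

```python
# One-shot Extreme Learning Machine: random hidden layer + pseudoinverse readout.
import numpy as np

rng = np.random.default_rng(0)

def elm_fit(X, Y, hidden=1000):
    """X: (N, 784) inputs, Y: (N, 10) one-hot labels."""
    W = rng.standard_normal((X.shape[1], hidden)) / np.sqrt(X.shape[1])
    b = rng.standard_normal(hidden)
    H = np.tanh(X @ W + b)                  # random, untrained hidden features
    beta = np.linalg.pinv(H) @ Y            # single least-squares solve, no iterations
    return W, b, beta

def elm_predict(X, W, b, beta):
    return np.argmax(np.tanh(X @ W + b) @ beta, axis=1)

# toy usage with random data standing in for MNIST
X = rng.standard_normal((2000, 784))
Y = np.eye(10)[rng.integers(0, 10, 2000)]
W, b, beta = elm_fit(X, Y)
print("train accuracy:", np.mean(elm_predict(X, W, b, beta) == Y.argmax(1)))
```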


2021 ◽  
Vol 29 (5) ◽  
pp. 775-798
Author(s):  
Sergey Glyzin ◽  
Andrey Kolesov

Nonlinear systems of differential equations with delay, which serve as mathematical models of fully connected networks of impulse neurons, are considered. The purpose of this work is to study the dynamical properties of one special class of solutions to these systems. Methods. Large-parameter methods are used to study the existence and stability, in the considered models, of special periodic motions, the so-called group dominance or k-dominance modes, where k ∈ N. Results. It is shown that each such regime is a relaxation cycle in which exactly k components perform synchronous impulse oscillations while all other components are asymptotically small. With an appropriate choice of parameters, the maximum number of stable coexisting group dominance cycles in the system is 2^m − 1, where m is the number of network elements. Conclusion. The considered model, with the maximum possible number of couplings, allows us to describe the most complex and diverse behavior that may be observed in biological neural associations. A feature of the k-dominance modes considered here is that some of the network neurons remain in a non-working (refractory) state. Each periodic k-dominance mode can be associated with a binary vector (α1, α2, ..., αm), where αj = 1 if the j-th neuron is active and αj = 0 otherwise. Taking this into account, we conclude that these modes can be used to build devices with associative memory based on artificial neural networks.
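
A one-line counting step behind the 2^m − 1 bound, assuming (as the binary-vector correspondence above indicates) that the stable group dominance cycles range over the nonzero activity patterns:

```latex
\sum_{k=1}^{m} \binom{m}{k} \;=\; 2^{m} - 1
```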

