Learning deep linear neural networks: Riemannian gradient flows and convergence to global minimizers

Information and Inference A Journal of the IMA ◽

10.1093/imaiai/iaaa039 ◽

2021 ◽

Author(s):

Bubacarr Bah ◽

Holger Rauhut ◽

Ulrich Terstiege ◽

Michael Westdickenberg

Keyword(s):

Neural Networks ◽

Critical Point ◽

Global Minimum ◽

Gradient Flow ◽

Activation Function ◽

Riemannian Metric ◽

Gradient Flows ◽

Global Minimizers ◽

Network Layers ◽

Almost All

Abstract We study the convergence of gradient flows related to learning deep linear neural networks (where the activation function is the identity map) from data. In this case, the composition of the network layers amounts to simply multiplying the weight matrices of all layers together, resulting in an overparameterized problem. The gradient flow with respect to these factors can be re-interpreted as a Riemannian gradient flow on the manifold of rank-$r$ matrices endowed with a suitable Riemannian metric. We show that the flow always converges to a critical point of the underlying functional. Moreover, we establish that, for almost all initializations, the flow converges to a global minimum on the manifold of rank $k$ matrices for some $k\leq r$.

Download Full-text

Truly Sparse Neural Networks at Scale

10.21203/rs.3.rs-133395/v1 ◽

2021 ◽

Author(s):

Selima Curci ◽

Decebal Constantin Mocanu ◽

Mykola Pechenizkiy

Keyword(s):

Neural Networks ◽

Brain Size ◽

Gradient Flow ◽

Activation Function ◽

Full Potential ◽

Training Methods ◽

Dense Matrix ◽

Representational Power ◽

Learning Software ◽

Hidden Neurons

Abstract Recently, sparse training methods have started to be established as a de facto approach for training and inference efficiency in artificial neural networks. Yet, this efficiency is just in theory. In practice, everyone uses a binary mask to simulate sparsity since the typical deep learning software and hardware are optimized for dense matrix operations. In this paper, we take an orthogonal approach, and we show that we can train truly sparse neural networks to harvest their full potential. To achieve this goal, we introduce three novel contributions, specially designed for sparse neural networks: (1) a parallel training algorithm and its corresponding sparse implementation from scratch, (2) an activation function with non-trainable parameters to favour the gradient flow, and (3) a hidden neurons importance metric to eliminate redundancies. All in one, we are able to break the record and to train the largest neural network ever trained in terms of representational power -- reaching the bat brain size. The results show that our approach has state-of-the-art performance while opening the path for an environmentally friendly artificial intelligence era.

Download Full-text

Performance Comparison of CNN Models Using Gradient Flow Analysis

Informatics ◽

10.3390/informatics8030053 ◽

2021 ◽

Vol 8 (3) ◽

pp. 53

Author(s):

Seol-Hyun Noh

Keyword(s):

Neural Networks ◽

Language Processing ◽

Input Data ◽

Flow Analysis ◽

Gradient Flow ◽

Performance Comparison ◽

Superior Performance ◽

Gradient Flows ◽

The Neural Network ◽

Learning Techniques

Convolutional neural networks (CNNs) are widely used among the various deep learning techniques available because of their superior performance in the fields of computer vision and natural language processing. CNNs can effectively extract the locality and correlation of input data using structures in which convolutional layers are successively applied to the input data. In general, the performance of neural networks has improved as the depth of CNNs has increased. However, an increase in the depth of a CNN is not always accompanied by an increase in the accuracy of the neural network. This is because the gradient vanishing problem may arise, causing the weights of the weighted layers to fail to converge. Accordingly, the gradient flows of the VGGNet, ResNet, SENet, and DenseNet models were analyzed and compared in this study, and the reasons for the differences in the error rate performances of the models were derived.

Download Full-text

ModuloNET: Neural Networks Meet Modular Arithmetic for Efficient Hardware Masking

IACR Transactions on Cryptographic Hardware and Embedded Systems ◽

10.46586/tches.v2022.i1.506-556 ◽

2021 ◽

pp. 506-556

Author(s):

Anuj Dubey ◽

Afzal Ahmad ◽

Muhammad Adeel Pasha ◽

Rosario Cammarota ◽

Aydin Aysu

Keyword(s):

Neural Networks ◽

State Of The Art ◽

Activation Function ◽

Modular Arithmetic ◽

Side Channel ◽

First Order ◽

Major Threat ◽

Network Layers ◽

Inference Engines ◽

High Area

Intellectual Property (IP) thefts of trained machine learning (ML) models through side-channel attacks on inference engines are becoming a major threat. Indeed, several recent works have shown reverse engineering of the model internals using such attacks, but the research on building defenses is largely unexplored. There is a critical need to efficiently and securely transform those defenses from cryptography such as masking to ML frameworks. Existing works, however, revealed that a straightforward adaptation of such defenses either provides partial security or leads to high area overheads. To address those limitations, this work proposes a fundamentally new direction to construct neural networks that are inherently more compatible with masking. The key idea is to use modular arithmetic in neural networks and then efficiently realize masking, in either Boolean or arithmetic fashion, depending on the type of neural network layers. We demonstrate our approach on the edge-computing friendly binarized neural networks (BNN) and show how to modify the training and inference of such a network to work with modular arithmetic without sacrificing accuracy. We then design novel masking gadgets using Domain-Oriented Masking (DOM) to efficiently mask the unique operations of ML such as the activation function and the output layer classification, and we prove their security in the glitch-extended probing model. Finally, we implement fully masked neural networks on an FPGA, quantify that they can achieve a similar latency while reducing the FF and LUT costs over the state-of-the-art protected implementations by 34.2% and 42.6%, respectively, and demonstrate their first-order side-channel security with up to 1M traces.

Download Full-text

SEM Cold Stage Techniques for the Observation of Vegetative and Floral Apices of Oat (Avena sativa L.) Plants

Proceedings, annual meeting, Electron Microscopy Society of America ◽

10.1017/s0424820100080377 ◽

1977 ◽

Vol 35 ◽

pp. 590-591

Author(s):

B. K. Kirchoff ◽

L.F. Allard ◽

W.C. Bigelow

Keyword(s):

Critical Point ◽

Avena Sativa ◽

Substantial Improvement ◽

Fresh Tissue ◽

Cold Stage ◽

Critical Point Drying ◽

Almost All ◽

Cell Collapse ◽

Conductive Paint ◽

Carbon Based

In attempting to use the SEM to investigate the transition from the vegetative to the floral state in oat (Avena sativa L.) it was discovered that the procedures of fixation and critical point drying (CPD), and fresh tissue examination of the specimens gave unsatisfactory results. In most cases, by using these techniques, cells of the tissue were collapsed or otherwise visibly distorted. Figure 1 shows the results of fixation with 4.5% formaldehyde-gluteraldehyde followed by CPD. Almost all cellular detail has been obscured by the resulting shrinkage distortions. The larger cracks seen on the left of the picture may be due to dissection damage, rather than CPD. The results of observation of fresh tissue are seen in Fig. 2. Although there is a substantial improvement over CPD, some cell collapse still occurs.Due to these difficulties, it was decided to experiment with cold stage techniques. The specimens to be observed were dissected out and attached to the sample stub using a carbon based conductive paint in acetone.

Download Full-text

SCORING MODELING BASED ON NEURAL NETWORKS FOR DETERMINING A BANK BORROWER'S RATING

Economy of Ukraine ◽

10.15407/economyukr.2020.10.054 ◽

2020 ◽

Vol 2020 (10) ◽

pp. 54-62

Author(s):

Oleksii VASYLIEV ◽

Keyword(s):

Neural Network ◽

Neural Networks ◽

Network Architecture ◽

Statistical Data ◽

Activation Function ◽

Decision Making Process ◽

Neural Network Architecture ◽

Acceptable Accuracy ◽

The Neural Network ◽

Sigmoid Activation Function

The problem of applying neural networks to calculate ratings used in banking in the decision-making process on granting or not granting loans to borrowers is considered. The task is to determine the rating function of the borrower based on a set of statistical data on the effectiveness of loans provided by the bank. When constructing a regression model to calculate the rating function, it is necessary to know its general form. If so, the task is to calculate the parameters that are included in the expression for the rating function. In contrast to this approach, in the case of using neural networks, there is no need to specify the general form for the rating function. Instead, certain neural network architecture is chosen and parameters are calculated for it on the basis of statistical data. Importantly, the same neural network architecture can be used to process different sets of statistical data. The disadvantages of using neural networks include the need to calculate a large number of parameters. There is also no universal algorithm that would determine the optimal neural network architecture. As an example of the use of neural networks to determine the borrower's rating, a model system is considered, in which the borrower's rating is determined by a known non-analytical rating function. A neural network with two inner layers, which contain, respectively, three and two neurons and have a sigmoid activation function, is used for modeling. It is shown that the use of the neural network allows restoring the borrower's rating function with quite acceptable accuracy.

Download Full-text

The New Activation Function for Complex Valued Neural Networks: Complex Swish Function

4th International Symposium on Innovative Approaches in Engineering and Natural Sciences Proceedings ◽

10.36287/setsci.4.6.050 ◽

2019 ◽

Author(s):

Mehmet Çelebi ◽

Murat Ceylan

Keyword(s):

Neural Networks ◽

Activation Function ◽

Complex Valued

Download Full-text

Analysis of Non-Linear Activation Functions for Classification Tasks Using Convolutional Neural Networks

Recent Patents on Computer Science ◽

10.2174/2213275911666181025143029 ◽

2019 ◽

Vol 12 (3) ◽

pp. 156-161 ◽

Cited By ~ 3

Author(s):

Aman Dureja ◽

Payal Pahwa

Keyword(s):

Neural Networks ◽

Deep Neural Networks ◽

Activation Function ◽

Primary Objective ◽

Experimental Comparison ◽

Activation Functions ◽

Practical Applications ◽

Network Activation ◽

Non Linear ◽

Hidden Layer

Background: In making the deep neural network, activation functions play an important role. But the choice of activation functions also affects the network in term of optimization and to retrieve the better results. Several activation functions have been introduced in machine learning for many practical applications. But which activation function should use at hidden layer of deep neural networks was not identified. Objective: The primary objective of this analysis was to describe which activation function must be used at hidden layers for deep neural networks to solve complex non-linear problems. Methods: The configuration for this comparative model was used by using the datasets of 2 classes (Cat/Dog). The number of Convolutional layer used in this network was 3 and the pooling layer was also introduced after each layer of CNN layer. The total of the dataset was divided into the two parts. The first 8000 images were mainly used for training the network and the next 2000 images were used for testing the network. Results: The experimental comparison was done by analyzing the network by taking different activation functions on each layer of CNN network. The validation error and accuracy on Cat/Dog dataset were analyzed using activation functions (ReLU, Tanh, Selu, PRelu, Elu) at number of hidden layers. Overall the Relu gave best performance with the validation loss at 25th Epoch 0.3912 and validation accuracy at 25th Epoch 0.8320. Conclusion: It is found that a CNN model with ReLU hidden layers (3 hidden layers here) gives best results and improve overall performance better in term of accuracy and speed. These advantages of ReLU in CNN at number of hidden layers are helpful to effectively and fast retrieval of images from the databases.

Download Full-text

Hardware implementation of radial-basis neural networks with Gaussian activation functions on FPGA

Neural Computing and Applications ◽

10.1007/s00521-021-05706-3 ◽

2021 ◽

Author(s):

Volodymyr Shymkovych ◽

Sergii Telenyk ◽

Petro Kravets

Keyword(s):

Neural Networks ◽

Hardware Implementation ◽

Gaussian Function ◽

Activation Function ◽

Rbf Neural Networks ◽

Activation Functions ◽

Rbf Network ◽

Combination Scheme ◽

Radial Basis ◽

Hidden Layer

AbstractThis article introduces a method for realizing the Gaussian activation function of radial-basis (RBF) neural networks with their hardware implementation on field-programmable gaits area (FPGAs). The results of modeling of the Gaussian function on FPGA chips of different families have been presented. RBF neural networks of various topologies have been synthesized and investigated. The hardware component implemented by this algorithm is an RBF neural network with four neurons of the latent layer and one neuron with a sigmoid activation function on an FPGA using 16-bit numbers with a fixed point, which took 1193 logic matrix gate (LUTs—LookUpTable). Each hidden layer neuron of the RBF network is designed on an FPGA as a separate computing unit. The speed as a total delay of the combination scheme of the block RBF network was 101.579 ns. The implementation of the Gaussian activation functions of the hidden layer of the RBF network occupies 106 LUTs, and the speed of the Gaussian activation functions is 29.33 ns. The absolute error is ± 0.005. The Spartan 3 family of chips for modeling has been used to get these results. Modeling on chips of other series has been also introduced in the article. RBF neural networks of various topologies have been synthesized and investigated. Hardware implementation of RBF neural networks with such speed allows them to be used in real-time control systems for high-speed objects.

Download Full-text

Trigonometric Inference Providing Learning in Deep Neural Networks

Applied Sciences ◽

10.3390/app11156704 ◽

2021 ◽

Vol 11 (15) ◽

pp. 6704

Author(s):

Jingyong Cai ◽

Masashi Takemoto ◽

Yuming Qiu ◽

Hironori Nakajo

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Deep Neural Networks ◽

Activation Function ◽

Trigonometric Approximation ◽

Model Parameters ◽

Training Algorithms ◽

Activation Functions ◽

Classical Training ◽

Sum Formula

Despite being heavily used in the training of deep neural networks (DNNs), multipliers are resource-intensive and insufficient in many different scenarios. Previous discoveries have revealed the superiority when activation functions, such as the sigmoid, are calculated by shift-and-add operations, although they fail to remove multiplications in training altogether. In this paper, we propose an innovative approach that can convert all multiplications in the forward and backward inferences of DNNs into shift-and-add operations. Because the model parameters and backpropagated errors of a large DNN model are typically clustered around zero, these values can be approximated by their sine values. Multiplications between the weights and error signals are transferred to multiplications of their sine values, which are replaceable with simpler operations with the help of the product to sum formula. In addition, a rectified sine activation function is utilized for further converting layer inputs into sine values. In this way, the original multiplication-intensive operations can be computed through simple add-and-shift operations. This trigonometric approximation method provides an efficient training and inference alternative for devices with insufficient hardware multipliers. Experimental results demonstrate that this method is able to obtain a performance close to that of classical training algorithms. The approach we propose sheds new light on future hardware customization research for machine learning.

Download Full-text

Deep Neural Networks Classification via Binary Error-Detecting Output Codes

Applied Sciences ◽

10.3390/app11083563 ◽

2021 ◽

Vol 11 (8) ◽

pp. 3563

Author(s):

Martin Klimo ◽

Peter Lukáč ◽

Peter Tarábek

Keyword(s):

Neural Networks ◽

Categorical Data ◽

Coding Theory ◽

Hamming Distance ◽

Activation Function ◽

Block Codes ◽

Ease Of Use ◽

Linear Block Codes ◽

Binary Coding ◽

The One

One-hot encoding is the prevalent method used in neural networks to represent multi-class categorical data. Its success stems from its ease of use and interpretability as a probability distribution when accompanied by a softmax activation function. However, one-hot encoding leads to very high dimensional vector representations when the categorical data’s cardinality is high. The Hamming distance in one-hot encoding is equal to two from the coding theory perspective, which does not allow detection or error-correcting capabilities. Binary coding provides more possibilities for encoding categorical data into the output codes, which mitigates the limitations of the one-hot encoding mentioned above. We propose a novel method based on Zadeh fuzzy logic to train binary output codes holistically. We study linear block codes for their possibility of separating class information from the checksum part of the codeword, showing their ability not only to detect recognition errors by calculating non-zero syndrome, but also to evaluate the truth-value of the decision. Experimental results show that the proposed approach achieves similar results as one-hot encoding with a softmax function in terms of accuracy, reliability, and out-of-distribution performance. It suggests a good foundation for future applications, mainly classification tasks with a high number of classes.

Download Full-text