Effect of Initial Configuration of Weights on Training and Function of Artificial Neural Networks

Mathematics ◽  
2021 ◽  
Vol 9 (18) ◽  
pp. 2246
Author(s):  
Ricardo J. Jesus ◽  
Mário L. Antunes ◽  
Rui A. da Costa ◽  
Sergey N. Dorogovtsev ◽  
José F. F. Mendes ◽  
...  

The function and performance of neural networks are largely determined by the evolution of their weights and biases in the process of training, starting from the initial configuration of these parameters to one of the local minima of the loss function. We perform the quantitative statistical characterization of the deviation of the weights of two-hidden-layer feedforward ReLU networks of various sizes trained via Stochastic Gradient Descent (SGD) from their initial random configuration. We compare the evolution of the distribution function of this deviation with the evolution of the loss during training. We observed that successful training via SGD leaves the network in the close neighborhood of the initial configuration of its weights. For each initial weight of a link we measured the distribution function of the deviation from this value after training and found how the moments of this distribution and its peak depend on the initial weight. We explored the evolution of these deviations during training and observed an abrupt increase within the overfitting region. This jump occurs simultaneously with a similarly abrupt increase recorded in the evolution of the loss function. Our results suggest that SGD’s ability to efficiently find local minima is restricted to the vicinity of the random initial configuration of weights.
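As a rough illustration of the kind of measurement described above, the following sketch (an assumption of this summary, not the authors' code; layer sizes, data, and hyperparameters are made up) trains a small two-hidden-layer ReLU network with SGD and tracks how far the weights have drifted from their random initial configuration.

```python
# Illustrative sketch: measure per-weight deviation from the initial configuration
# during SGD training of a two-hidden-layer ReLU network on synthetic data.
import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(
    nn.Linear(20, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 1),
)
w0 = [p.detach().clone() for p in net.parameters()]   # initial configuration

x = torch.randn(512, 20)
y = torch.randn(512, 1)
opt = torch.optim.SGD(net.parameters(), lr=0.05)
loss_fn = nn.MSELoss()

for epoch in range(200):
    opt.zero_grad()
    loss = loss_fn(net(x), y)
    loss.backward()
    opt.step()
    if epoch % 50 == 0:
        # Deviation of every link weight from its initial value, flattened over all layers.
        dev = torch.cat([(p - p0).flatten() for p, p0 in zip(net.parameters(), w0)])
        print(f"epoch {epoch:3d}  loss {loss.item():.4f}  "
              f"mean|dw| {dev.abs().mean():.4f}  max|dw| {dev.abs().max():.4f}")
```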

2020 ◽  
Vol 117 (36) ◽  
pp. 21857-21864
Author(s):  
Philipp C. Verpoort ◽  
Alpha A. Lee ◽  
David J. Wales

The predictive capabilities of deep neural networks (DNNs) continue to evolve to increasingly impressive levels. However, it is still unclear how training procedures for DNNs succeed in finding parameters that produce good results for such high-dimensional and nonconvex loss functions. In particular, we wish to understand why simple optimization schemes, such as stochastic gradient descent, do not end up trapped in local minima with high loss values that would not yield useful predictions. We explain the optimizability of DNNs by characterizing the local minima and transition states of the loss-function landscape (LFL) along with their connectivity. We show that the LFL of a DNN in the shallow network or data-abundant limit is funneled, and thus easy to optimize. Crucially, in the opposite low-data/deep limit, although the number of minima increases, the landscape is characterized by many minima with similar loss values separated by low barriers. This organization is different from the hierarchical landscapes of structural glass formers and explains why minimization procedures commonly employed by the machine-learning community can navigate the LFL successfully and reach low-lying solutions.
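A crude way to probe this picture numerically, sketched below under assumptions not taken from the paper (tiny network, synthetic data, plain SGD), is to minimize the loss from many random starting points and inspect the spread of the final loss values; a funneled or low-barrier landscape should yield a narrow spread.

```python
# Minimal landscape probe: train the same tiny network from many random
# initializations and compare the losses of the minima that are reached.
import numpy as np
import torch
import torch.nn as nn

x = torch.randn(256, 10)
y = torch.randn(256, 1)

final_losses = []
for seed in range(20):
    torch.manual_seed(seed)
    net = nn.Sequential(nn.Linear(10, 16), nn.ReLU(), nn.Linear(16, 1))
    opt = torch.optim.SGD(net.parameters(), lr=0.05)
    for _ in range(2000):
        opt.zero_grad()
        loss = nn.functional.mse_loss(net(x), y)
        loss.backward()
        opt.step()
    final_losses.append(nn.functional.mse_loss(net(x), y).item())

# A narrow spread of final losses is consistent with a funneled / low-barrier landscape.
print(f"min {min(final_losses):.4f}  max {max(final_losses):.4f}  "
      f"std {np.std(final_losses):.4f}")
```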


2019 ◽  
Vol 117 (1) ◽  
pp. 161-170 ◽  
Author(s):  
Carlo Baldassi ◽  
Fabrizio Pittorino ◽  
Riccardo Zecchina

Learning in deep neural networks takes place by minimizing a nonconvex high-dimensional loss function, typically by a stochastic gradient descent (SGD) strategy. The learning process is observed to find good minimizers without getting stuck in local critical points, and such minimizers are often satisfactory at avoiding overfitting. How these two features can be kept under control in nonlinear devices composed of millions of tunable connections is a profound and far-reaching open question. In this paper we study basic nonconvex 1- and 2-layer neural network models that learn random patterns and derive a number of basic geometrical and algorithmic features which suggest some answers. We first show that the error loss function presents few extremely wide flat minima (WFM) which coexist with narrower minima and critical points. We then show that the minimizers of the cross-entropy loss function overlap with the WFM of the error loss. We also show examples of learning devices for which WFM do not exist. From the algorithmic perspective we derive entropy-driven greedy and message-passing algorithms that focus their search on wide flat regions of minimizers. In the case of SGD and cross-entropy loss, we show that a slow reduction of the norm of the weights along the learning process also leads to WFM. We corroborate the results by a numerical study of the correlations between the volumes of the minimizers, their Hessian, and their generalization performance on real data.
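The entropy-driven algorithms themselves are not reproduced here, but a simple proxy for the flatness of a trained minimizer can be sketched as follows (a hedged illustration; the perturbation scheme and noise scales are assumptions): perturb the trained weights with Gaussian noise of increasing amplitude and record the average loss increase, which grows slowly around wide flat minima.

```python
# Crude flatness probe around a trained minimizer via random weight perturbations.
import torch
import torch.nn as nn

def perturbed_loss(net, loss_fn, x, y, sigma, n_samples=20):
    """Average loss after adding N(0, sigma^2) noise to every trained weight."""
    base = [p.detach().clone() for p in net.parameters()]
    losses = []
    with torch.no_grad():
        for _ in range(n_samples):
            for p, b in zip(net.parameters(), base):
                p.copy_(b + sigma * torch.randn_like(b))
            losses.append(loss_fn(net(x), y).item())
        for p, b in zip(net.parameters(), base):   # restore the trained weights
            p.copy_(b)
    return sum(losses) / len(losses)

# Usage (assumes `net`, `x`, `y` come from a finished training run such as the one above):
# for sigma in (0.01, 0.05, 0.1):
#     print(sigma, perturbed_loss(net, nn.functional.mse_loss, x, y, sigma))
```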


2005 ◽  
Vol 15 (06) ◽  
pp. 435-443 ◽  
Author(s):  
XIAOMING CHEN ◽  
ZHENG TANG ◽  
CATHERINE VARIAPPAN ◽  
SONGSONG LI ◽  
TOSHIMI OKADA

The complex-valued backpropagation algorithm has been widely used in fields such as telecommunications, speech recognition, and image processing with Fourier transforms. However, the learning process often gets trapped in local minima. To solve this problem and to speed up learning, we propose a modified error function that adds to the conventional error function a term corresponding to the hidden-layer error. The simulation results show that the proposed algorithm is capable of preventing the learning process from becoming stuck in local minima and of speeding up learning.
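The abstract does not specify the exact form of the added hidden-layer term, so the sketch below is purely illustrative: it only shows the overall structure E_total = E_out + lambda * E_hidden for a small complex-valued network, with the hidden-layer term chosen arbitrarily for the example.

```python
# Illustrative only: the hidden-layer term e_hidden below (a penalty on saturated
# hidden activations) is an assumption used to show the structure of the modified error.
import numpy as np

def split_tanh(z):
    # A common complex-valued activation: tanh applied separately to real and imaginary parts.
    return np.tanh(z.real) + 1j * np.tanh(z.imag)

def modified_error(W1, W2, x, t, lam=0.1):
    h = split_tanh(W1 @ x)                               # hidden-layer output (complex)
    y = split_tanh(W2 @ h)                               # network output (complex)
    e_out = 0.5 * np.sum(np.abs(t - y) ** 2)             # conventional error term
    e_hidden = 0.5 * np.sum((np.abs(h) ** 2 - 1) ** 2)   # illustrative hidden-layer term
    return e_out + lam * e_hidden

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3)) + 1j * rng.normal(size=(4, 3))
W2 = rng.normal(size=(2, 4)) + 1j * rng.normal(size=(2, 4))
x = rng.normal(size=3) + 1j * rng.normal(size=3)
t = rng.normal(size=2) + 1j * rng.normal(size=2)
print(modified_error(W1, W2, x, t))
```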


Author(s):  
Christian Dwi Suhendra ◽  
Retantyo Wardoyo

A weakness of the backpropagation neural network is that it converges very slowly and suffers from the local minimum problem, which often leaves the artificial neural network (ANN) trapped in a local minimum. A good combination of architecture, initial weights, and initial biases is decisive for the learning ability of a backpropagation ANN and for overcoming these weaknesses. This study develops a method for determining that combination. Until now, it has typically been found by trial and error, testing candidate hidden-layer configurations, initial weights, and initial biases one by one. Here, the initial weights and biases serve as parameters in the evaluation of a fitness value: the quality of each individual is measured by its sum of squared errors (SSE), and the individual with the smallest SSE is the best. The best combination of architecture, initial weights, and initial biases is then used as the parameter set for backpropagation training. The result is an alternative solution to the recurring problem of choosing the learning parameters for backpropagation. The results show that the genetic algorithm can provide such a solution, yields better accuracy, and shortens training compared with parameters determined manually. Keywords: artificial neural network, genetic algorithm, backpropagation, SSE, local minima.
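A minimal sketch of the described scheme follows, under assumptions not taken from the paper (a fixed one-hidden-layer architecture, synthetic data, and simple selection/crossover/mutation operators): a genetic algorithm searches over initial weights and biases, with the SSE of each individual as its fitness, and the best individual would then seed ordinary backpropagation training.

```python
# Genetic algorithm over initial weights/biases of a small network, with SSE as fitness.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
T = rng.normal(size=(50, 1))
n_hidden = 6
n_genes = 4 * n_hidden + n_hidden + n_hidden + 1   # W1, b1, W2, b2

def sse(genes):
    W1 = genes[:4 * n_hidden].reshape(4, n_hidden)
    b1 = genes[4 * n_hidden:5 * n_hidden]
    W2 = genes[5 * n_hidden:6 * n_hidden].reshape(n_hidden, 1)
    b2 = genes[-1]
    H = np.tanh(X @ W1 + b1)
    Y = H @ W2 + b2
    return np.sum((T - Y) ** 2)

pop = rng.normal(scale=0.5, size=(30, n_genes))
for gen in range(100):
    fitness = np.array([sse(ind) for ind in pop])
    parents = pop[np.argsort(fitness)[:10]]             # selection: keep the 10 lowest-SSE individuals
    children = []
    for _ in range(len(pop) - len(parents)):
        a, b = parents[rng.integers(10, size=2)]
        mask = rng.random(n_genes) < 0.5                # uniform crossover
        child = np.where(mask, a, b) + rng.normal(scale=0.05, size=n_genes)  # mutation
        children.append(child)
    pop = np.vstack([parents, children])

best = pop[np.argmin([sse(ind) for ind in pop])]
print("best SSE:", sse(best))   # `best` would seed backpropagation training
```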


Author(s):  
Justin Sirignano ◽  
Konstantinos Spiliopoulos

We analyze multilayer neural networks in the asymptotic regime of simultaneously (a) large network sizes and (b) large numbers of stochastic gradient descent training iterations. We rigorously establish the limiting behavior of the multilayer neural network output. The limit procedure is valid for any number of hidden layers, and it naturally also describes the limiting behavior of the training loss. The ideas that we explore are to (a) take the limits of each hidden layer sequentially and (b) characterize the evolution of parameters in terms of their initialization. The limit satisfies a system of deterministic integro-differential equations. The proof uses methods from weak convergence and stochastic analysis. We show that, under suitable assumptions on the activation functions and the behavior for large times, the limit neural network recovers a global minimum (with zero loss for the objective function).


2018 ◽  
Vol 115 (33) ◽  
pp. E7665-E7671 ◽  
Author(s):  
Song Mei ◽  
Andrea Montanari ◽  
Phan-Minh Nguyen

Multilayer neural networks are among the most powerful models in machine learning, yet the fundamental reasons for this success defy mathematical understanding. Learning a neural network requires optimizing a nonconvex high-dimensional objective (risk function), a problem that is usually attacked using stochastic gradient descent (SGD). Does SGD converge to a global optimum of the risk or only to a local optimum? In the former case, does this happen because local minima are absent or because SGD somehow avoids them? In the latter, why do local minima reached by SGD have good generalization properties? In this paper, we consider a simple case, namely two-layer neural networks, and prove that—in a suitable scaling limit—SGD dynamics is captured by a certain nonlinear partial differential equation (PDE) that we call distributional dynamics (DD). We then consider several specific examples and show how DD can be used to prove convergence of SGD to networks with nearly ideal generalization error. This description allows for “averaging out” some of the complexities of the landscape of neural networks and can be used to prove a general convergence result for noisy SGD.
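The distributional-dynamics PDE itself is not reproduced here, but the setting it describes can be sketched as follows (hyperparameters and teacher data are assumptions): a two-layer network in the mean-field scaling, f(x) = (1/N) * sum_i a_i * ReLU(w_i . x), trained by online SGD; in the large-N limit, the evolving empirical distribution of the parameters (a_i, w_i) is the object the DD equation tracks.

```python
# Two-layer network in the mean-field scaling, trained with online SGD on a simple teacher.
import torch

N, d, lr = 1000, 10, 0.5
torch.manual_seed(0)
w = torch.randn(N, d, requires_grad=True)
a = torch.randn(N, requires_grad=True)

def f(x):
    # Mean-field scaling: average (not sum) over the N hidden neurons.
    return (a * torch.relu(x @ w.t())).mean(dim=1)

w_true = torch.randn(d)
for step in range(2000):
    x = torch.randn(64, d)            # fresh data each step (online SGD)
    y = torch.relu(x @ w_true)        # simple teacher target
    loss = ((f(x) - y) ** 2).mean()
    loss.backward()
    with torch.no_grad():
        w -= lr * w.grad; a -= lr * a.grad
        w.grad.zero_(); a.grad.zero_()
print("final loss:", loss.item())
```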


Author(s):  
Zhi-Gang Liu ◽  
Matthew Mattina

The Straight-Through Estimator (STE) is widely used for back-propagating gradients through the quantization function, but the STE technique lacks a complete theoretical understanding. We propose an alternative methodology called alpha-blending (AB), which quantizes neural networks to low precision using stochastic gradient descent (SGD). Our AB method avoids the STE approximation by replacing the quantized weight in the loss function with an affine combination of the quantized weight w_q and the corresponding full-precision weight w, with non-trainable scalar coefficients alpha and (1 - alpha). During training, alpha is gradually increased from 0 to 1; the gradient updates to the weights flow through the full-precision term, (1 - alpha) * w, of the affine combination; and the model is converted from full precision to low precision progressively. To evaluate the AB method, a 1-bit BinaryNet on the CIFAR10 dataset and 8-bit and 4-bit MobileNet v1 and ResNet_50 v1/2 on ImageNet are trained with the alpha-blending approach. The evaluation indicates that AB improves top-1 accuracy by 0.9%, 0.82%, and 2.93%, respectively, compared to STE-based quantization.
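The following sketch illustrates the alpha-blending idea as described in the abstract; the particular quantizer and the alpha schedule are assumptions made for the example, not details from the paper.

```python
# Alpha-blending: the loss sees alpha * w_q + (1 - alpha) * w, and because w_q is
# detached, gradients reach w only through the (1 - alpha) * w term.
import torch

def quantize_sym(w, bits=8):
    # Simple symmetric uniform quantizer, used here only for illustration.
    qmax = 2 ** (bits - 1) - 1
    scale = w.abs().max() / qmax + 1e-12
    return torch.round(w / scale).clamp(-qmax, qmax) * scale

def alpha_blend(w, alpha, bits=8):
    w_q = quantize_sym(w, bits).detach()        # no gradient through the quantized term
    return alpha * w_q + (1.0 - alpha) * w

# Toy usage: alpha ramps from 0 (full precision) to 1 (fully quantized) during training.
w = torch.randn(16, 16, requires_grad=True)
x, y = torch.randn(32, 16), torch.randn(32, 16)
opt = torch.optim.SGD([w], lr=0.1)
total_steps = 500
for step in range(total_steps):
    alpha = min(1.0, step / (0.8 * total_steps))
    loss = torch.nn.functional.mse_loss(x @ alpha_blend(w, alpha), y)
    opt.zero_grad(); loss.backward(); opt.step()
```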


2019 ◽  
Vol 12 (3) ◽  
pp. 156-161 ◽  
Author(s):  
Aman Dureja ◽  
Payal Pahwa

Background: Activation functions play an important role in the design of deep neural networks, and the choice of activation function affects both optimization and the quality of the results. Several activation functions have been introduced in machine learning for many practical applications, but it has not been established which activation function should be used in the hidden layers of deep neural networks. Objective: The primary objective of this analysis was to determine which activation function should be used in the hidden layers of deep neural networks to solve complex non-linear problems. Methods: The comparative model was configured on a two-class (Cat/Dog) dataset. The network used 3 convolutional layers, each followed by a pooling layer. The dataset was divided into two parts: the first 8000 images were used for training the network and the next 2000 images for testing it. Results: The experimental comparison was carried out by analyzing the network with different activation functions (ReLU, Tanh, SELU, PReLU, ELU) in the hidden layers and recording the validation error and accuracy on the Cat/Dog dataset. Overall, ReLU gave the best performance, with a validation loss of 0.3912 and a validation accuracy of 0.8320 at the 25th epoch. Conclusion: A CNN model with ReLU hidden layers (3 hidden layers here) gives the best results and improves overall performance in terms of both accuracy and speed. These advantages of ReLU in the hidden layers of a CNN are helpful for effective and fast retrieval of images from databases.
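A hedged sketch of the kind of model being compared is given below (channel counts, kernel sizes, and input resolution are assumptions; data loading and training are omitted): three convolutional layers, each followed by pooling, with the hidden activation function swapped among the candidates.

```python
# Build the same 3-conv-layer CNN with different hidden activations for comparison.
import torch
import torch.nn as nn

def make_cnn(act_cls):
    return nn.Sequential(
        nn.Conv2d(3, 32, 3, padding=1), act_cls(), nn.MaxPool2d(2),
        nn.Conv2d(32, 64, 3, padding=1), act_cls(), nn.MaxPool2d(2),
        nn.Conv2d(64, 128, 3, padding=1), act_cls(), nn.MaxPool2d(2),
        nn.Flatten(),
        nn.Linear(128 * 8 * 8, 128), act_cls(),
        nn.Linear(128, 1),                        # binary Cat/Dog logit
    )

# One model per candidate activation; each would be trained and validated identically.
for act_cls in (nn.ReLU, nn.Tanh, nn.SELU, nn.PReLU, nn.ELU):
    model = make_cnn(act_cls)
    out = model(torch.randn(4, 3, 64, 64))        # sanity check on 64x64 inputs
    print(act_cls.__name__, out.shape)
```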


Author(s):  
Volodymyr Shymkovych ◽  
Sergii Telenyk ◽  
Petro Kravets

This article introduces a method for realizing the Gaussian activation function of radial-basis-function (RBF) neural networks in hardware on field-programmable gate arrays (FPGAs). Results of modeling the Gaussian function on FPGA chips of different families are presented, and RBF neural networks of various topologies have been synthesized and investigated. The hardware component implemented by this algorithm is an RBF neural network with four hidden-layer neurons and one output neuron with a sigmoid activation function, realized on an FPGA using 16-bit fixed-point numbers and occupying 1193 look-up tables (LUTs). Each hidden-layer neuron of the RBF network is implemented on the FPGA as a separate computing unit. The total combinational delay of the RBF network block is 101.579 ns. The Gaussian activation functions of the hidden layer occupy 106 LUTs and have a delay of 29.33 ns, with an absolute error of ±0.005. These results were obtained by modeling on the Spartan 3 family of chips; modeling on chips of other series is also presented in the article. Hardware implementation of RBF neural networks at this speed allows them to be used in real-time control systems for high-speed objects.
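Before committing such a design to hardware, its arithmetic can be prototyped in software; the sketch below (centers, widths, weights, and the fixed-point format are assumptions) mimics the described structure of four Gaussian hidden neurons and one sigmoid output neuron with 16-bit fixed-point rounding.

```python
# Software model of an RBF network with Gaussian hidden neurons, a sigmoid output
# neuron, and 16-bit fixed-point rounding of intermediate values.
import numpy as np

FRAC_BITS = 12   # assumed fixed-point format: 16-bit words with 12 fractional bits

def to_fixed(x):
    return np.round(np.asarray(x) * 2 ** FRAC_BITS) / 2 ** FRAC_BITS

def rbf_forward(x, centers, widths, weights, bias):
    x = to_fixed(x)
    # Gaussian activation of each hidden neuron, followed by fixed-point rounding.
    h = to_fixed(np.exp(-np.sum((x - centers) ** 2, axis=1) / (2 * widths ** 2)))
    z = to_fixed(h @ weights + bias)
    return to_fixed(1.0 / (1.0 + np.exp(-z)))     # sigmoid output neuron

rng = np.random.default_rng(0)
centers = rng.uniform(-1, 1, size=(4, 2))   # 4 hidden neurons, 2 inputs
widths = np.full(4, 0.5)
weights = rng.normal(size=4)
print(rbf_forward([0.2, -0.3], centers, widths, weights, bias=0.1))
```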


2021 ◽  
Vol 5 (1) ◽  
Author(s):  
Joyce C. C. Santos ◽  
Mariana C. Prado ◽  
Helane L. O. Morais ◽  
Samuel M. Sousa ◽  
Elisangela Silva-Pinto ◽  
...  

The production of 2D material flakes in large quantities is a rapidly evolving field and a cornerstone for their industrial applicability. Although flake production has advanced at a fast pace, its statistical characterization is somewhat slower, with few examples in the literature, which may lack modelling uniformity and/or physical equivalence to actual flake dimensions. The present work introduces a methodology for 2D material flake characterization with a threefold target: (i) propose a set of morphological shape parameters that correctly map to actual and relevant flake dimensions; (ii) find a single distribution function that efficiently describes all of these parameter distributions; and (iii) suggest a representation system, topological vectors, that uniquely characterizes the statistical flake morphology within a given distribution. The applicability of this methodology is illustrated by the analysis of tens of thousands of flakes of graphene/graphite and talc that were submitted to different production protocols. The richness of information unveiled by this universal methodology may help the development of the standardization procedures needed by the imminent 2D-materials industry.
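The abstract does not name the single distribution function it proposes, so the sketch below is only an illustration of the kind of statistical characterization involved: fitting a candidate distribution (a lognormal is assumed here) to a shape-parameter sample and comparing quantiles of the fit against the data.

```python
# Fit an assumed distribution to a flake shape parameter and compare quantiles.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
flake_area_um2 = rng.lognormal(mean=1.0, sigma=0.8, size=10_000)   # synthetic stand-in data

shape, loc, scale = stats.lognorm.fit(flake_area_um2, floc=0)
fitted = stats.lognorm(shape, loc=loc, scale=scale)

for q in (0.25, 0.5, 0.75, 0.95):
    print(f"q={q:.2f}  data={np.quantile(flake_area_um2, q):7.2f}  "
          f"fit={fitted.ppf(q):7.2f}")
```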

