scholarly journals Effects of depth, width, and initialization: A convergence analysis of layer-wise training for deep linear neural networks

2021 ◽  
pp. 1-47
Author(s):  
Yeonjong Shin

Deep neural networks have been used in various machine learning applications and achieved tremendous empirical successes. However, training deep neural networks is a challenging task. Many alternatives have been proposed in place of end-to-end back-propagation. Layer-wise training is one of them, which trains a single layer at a time, rather than trains the whole layers simultaneously. In this paper, we study a layer-wise training using a block coordinate gradient descent (BCGD) for deep linear networks. We establish a general convergence analysis of BCGD and found the optimal learning rate, which results in the fastest decrease in the loss. We identify the effects of depth, width, and initialization. When the orthogonal-like initialization is employed, we show that the width of intermediate layers plays no role in gradient-based training beyond a certain threshold. Besides, we found that the use of deep networks could drastically accelerate convergence when it is compared to those of a depth 1 network, even when the computational cost is considered. Numerical examples are provided to justify our theoretical findings and demonstrate the performance of layer-wise training by BCGD.

Author(s):  
Siyu Liao ◽  
Bo Yuan

Deep neural networks (DNNs), especially deep convolutional neural networks (CNNs), have emerged as the powerful technique in various machine learning applications. However, the large model sizes of DNNs yield high demands on computation resource and weight storage, thereby limiting the practical deployment of DNNs. To overcome these limitations, this paper proposes to impose the circulant structure to the construction of convolutional layers, and hence leads to circulant convolutional layers (CircConvs) and circulant CNNs. The circulant structure and models can be either trained from scratch or re-trained from a pre-trained non-circulant model, thereby making it very flexible for different training environments. Through extensive experiments, such strong structureimposing approach is proved to be able to substantially reduce the number of parameters of convolutional layers and enable significant saving of computational cost by using fast multiplication of the circulant tensor.


2020 ◽  
Vol 34 (04) ◽  
pp. 5085-5092 ◽  
Author(s):  
Wan-Duo Kurt Ma ◽  
J. P. Lewis ◽  
W. Bastiaan Kleijn

We introduce the HSIC (Hilbert-Schmidt independence criterion) bottleneck for training deep neural networks. The HSIC bottleneck is an alternative to the conventional cross-entropy loss and backpropagation that has a number of distinct advantages. It mitigates exploding and vanishing gradients, resulting in the ability to learn very deep networks without skip connections. There is no requirement for symmetric feedback or update locking. We find that the HSIC bottleneck provides performance on MNIST/FashionMNIST/CIFAR10 classification comparable to backpropagation with a cross-entropy target, even when the system is not encouraged to make the output resemble the classification labels. Appending a single layer trained with SGD (without backpropagation) to reformat the information further improves performance.


2021 ◽  
Vol 12 (1) ◽  
Author(s):  
Florian Stelzer ◽  
André Röhm ◽  
Raul Vicente ◽  
Ingo Fischer ◽  
Serhiy Yanchuk

AbstractDeep neural networks are among the most widely applied machine learning tools showing outstanding performance in a broad range of tasks. We present a method for folding a deep neural network of arbitrary size into a single neuron with multiple time-delayed feedback loops. This single-neuron deep neural network comprises only a single nonlinearity and appropriately adjusted modulations of the feedback signals. The network states emerge in time as a temporal unfolding of the neuron’s dynamics. By adjusting the feedback-modulation within the loops, we adapt the network’s connection weights. These connection weights are determined via a back-propagation algorithm, where both the delay-induced and local network connections must be taken into account. Our approach can fully represent standard Deep Neural Networks (DNN), encompasses sparse DNNs, and extends the DNN concept toward dynamical systems implementations. The new method, which we call Folded-in-time DNN (Fit-DNN), exhibits promising performance in a set of benchmark tasks.


Author(s):  
Chen Qi ◽  
Shibo Shen ◽  
Rongpeng Li ◽  
Zhifeng Zhao ◽  
Qing Liu ◽  
...  

AbstractNowadays, deep neural networks (DNNs) have been rapidly deployed to realize a number of functionalities like sensing, imaging, classification, recognition, etc. However, the computational-intensive requirement of DNNs makes it difficult to be applicable for resource-limited Internet of Things (IoT) devices. In this paper, we propose a novel pruning-based paradigm that aims to reduce the computational cost of DNNs, by uncovering a more compact structure and learning the effective weights therein, on the basis of not compromising the expressive capability of DNNs. In particular, our algorithm can achieve efficient end-to-end training that transfers a redundant neural network to a compact one with a specifically targeted compression rate directly. We comprehensively evaluate our approach on various representative benchmark datasets and compared with typical advanced convolutional neural network (CNN) architectures. The experimental results verify the superior performance and robust effectiveness of our scheme. For example, when pruning VGG on CIFAR-10, our proposed scheme is able to significantly reduce its FLOPs (floating-point operations) and number of parameters with a proportion of 76.2% and 94.1%, respectively, while still maintaining a satisfactory accuracy. To sum up, our scheme could facilitate the integration of DNNs into the common machine-learning-based IoT framework and establish distributed training of neural networks in both cloud and edge.


Electronics ◽  
2021 ◽  
Vol 10 (13) ◽  
pp. 1511
Author(s):  
Taylor Simons ◽  
Dah-Jye Lee

There has been a recent surge in publications related to binarized neural networks (BNNs), which use binary values to represent both the weights and activations in deep neural networks (DNNs). Due to the bitwise nature of BNNs, there have been many efforts to implement BNNs on ASICs and FPGAs. While BNNs are excellent candidates for these kinds of resource-limited systems, most implementations still require very large FPGAs or CPU-FPGA co-processing systems. Our work focuses on reducing the computational cost of BNNs even further, making them more efficient to implement on FPGAs. We target embedded visual inspection tasks, like quality inspection sorting on manufactured parts and agricultural produce sorting. We propose a new binarized convolutional layer, called the neural jet features layer, that learns well-known classic computer vision kernels that are efficient to calculate as a group. We show that on visual inspection tasks, neural jet features perform comparably to standard BNN convolutional layers while using less computational resources. We also show that neural jet features tend to be more stable than BNN convolution layers when training small models.


2020 ◽  
Vol 20 (11) ◽  
pp. 6603-6608 ◽  
Author(s):  
Sung-Tae Lee ◽  
Suhwan Lim ◽  
Jong-Ho Bae ◽  
Dongseok Kwon ◽  
Hyeong-Su Kim ◽  
...  

Deep learning represents state-of-the-art results in various machine learning tasks, but for applications that require real-time inference, the high computational cost of deep neural networks becomes a bottleneck for the efficiency. To overcome the high computational cost of deep neural networks, spiking neural networks (SNN) have been proposed. Herein, we propose a hardware implementation of the SNN with gated Schottky diodes as synaptic devices. In addition, we apply L1 regularization for connection pruning of the deep spiking neural networks using gated Schottky diodes as synap-tic devices. Applying L1 regularization eliminates the need for a re-training procedure because it prunes the weights based on the cost function. The compressed hardware-based SNN is energy efficient while achieving a classification accuracy of 97.85% which is comparable to 98.13% of the software deep neural networks (DNN).


Automatica ◽  
2020 ◽  
Vol 122 ◽  
pp. 109222
Author(s):  
Jong Woo Kim ◽  
Tae Hoon Oh ◽  
Sang Hwan Son ◽  
Dong Hwi Jeong ◽  
Jong Min Lee

Sensors ◽  
2020 ◽  
Vol 20 (10) ◽  
pp. 2875 ◽  
Author(s):  
Vessela Krasteva ◽  
Sarah Ménétré ◽  
Jean-Philippe Didon ◽  
Irena Jekova

Deep neural networks (DNN) are state-of-the-art machine learning algorithms that can be learned to self-extract significant features of the electrocardiogram (ECG) and can generally provide high-output diagnostic accuracy if subjected to robust training and optimization on large datasets at high computational cost. So far, limited research and optimization of DNNs in shock advisory systems is found on large ECG arrhythmia databases from out-of-hospital cardiac arrests (OHCA). The objective of this study is to optimize the hyperparameters (HPs) of deep convolutional neural networks (CNN) for detection of shockable (Sh) and nonshockable (NSh) rhythms, and to validate the best HP settings for short and long analysis durations (2–10 s). Large numbers of (Sh + NSh) ECG samples were used for training (720 + 3170) and validation (739 + 5921) from Holters and defibrillators in OHCA. An end-to-end deep CNN architecture was implemented with one-lead raw ECG input layer (5 s, 125 Hz, 2.5 uV/LSB), configurable number of 5 to 23 hidden layers and output layer with diagnostic probability p ∈ [0: Sh,1: NSh]. The hidden layers contain N convolutional blocks × 3 layers (Conv1D (filters = Fi, kernel size = Ki), max-pooling (pool size = 2), dropout (rate = 0.3)), one global max-pooling and one dense layer. Random search optimization of HPs = {N, Fi, Ki}, i = 1, … N in a large grid of N = [1, 2, … 7], Fi = [5;50], Ki = [5;100] was performed. During training, the model with maximal balanced accuracy BAC = (Sensitivity + Specificity)/2 over 400 epochs was stored. The optimization principle is based on finding the common HPs space of a few top-ranked models and prediction of a robust HP setting by their median value. The optimal models for 1–7 CNN layers were trained with different learning rates LR = [10−5; 10−2] and the best model was finally validated on 2–10 s analysis durations. A number of 4216 random search models were trained. The optimal models with more than three convolutional layers did not exhibit substantial differences in performance BAC = (99.31–99.5%). Among them, the best model was found with {N = 5, Fi = {20, 15, 15, 10, 5}, Ki = {10, 10, 10, 10, 10}, 7521 trainable parameters} with maximal validation performance for 5-s analysis (BAC = 99.5%, Se = 99.6%, Sp = 99.4%) and tolerable drop in performance (<2% points) for very short 2-s analysis (BAC = 98.2%, Se = 97.6%, Sp = 98.7%). DNN application in future-generation shock advisory systems can improve the detection performance of Sh and NSh rhythms and can considerably shorten the analysis duration complying with resuscitation guidelines for minimal hands-off pauses.


2021 ◽  
Author(s):  
Alfonso Rojas-Domínguez ◽  
Ivvan Valdez ◽  
Manuel Ornelas-Rodríguez ◽  
Martín Carpio

Abstract Fostered by technological and theoretical developments, deep neural networks have achieved great success in many applications, but their training by means of mini-batch stochastic gradient descent (SGD) can be very costly due to the possibly tens of millions of parameters to be optimized and the large amounts of training examples that must be processed. Said computational cost is exacerbated by the inefficiency of the uniform sampling method typically used by SGD to form the training mini-batches: since not all training examples are equally relevant for training, sampling these under a uniform distribution is far from optimal. A better strategy is to form the mini-batches by sampling the training examples under a distribution where the probability of being selected is proportional to the relevance of each individual example. This can be achieved through Importance Sampling (IS), which also achieves the minimization of the gradients’ variance w.r.t. the network parameters, further improving convergence. In this paper, an IS-based adaptive sampling method is studied that exploits side information to construct the required probability distribution. Said method is modified to enable its application to deep neural networks, and the improved method is dubbed Regularized Adaptive Sampling (RAS). Experimental comparison (using deep convolutional networks for classification of the MNIST and CIFAR-10 datasets) of RAS against SGD and against another sampling method in the state of the art, shows that RAS achieves relative improvements of the training process, without incurring significant overhead or affecting the accuracy of the networks.


Sign in / Sign up

Export Citation Format

Share Document