Improved Training of Deep Convolutional Networks via Minimum-Variance Regularized Adaptive Sampling

Fostered by technological and theoretical developments, deep neural networks have achieved great success in many applications, but training them with mini-batch stochastic gradient descent (SGD) can be very costly because of the possibly tens of millions of parameters to be optimized and the large number of training examples that must be processed. This computational cost is exacerbated by the inefficiency of the uniform sampling that SGD typically uses to form the training mini-batches: since not all training examples are equally relevant for training, sampling them under a uniform distribution is far from optimal. A better strategy is to form the mini-batches by sampling the training examples under a distribution in which the probability of selecting each example is proportional to its relevance. This can be achieved through Importance Sampling (IS), which also minimizes the variance of the gradients with respect to the network parameters, further improving convergence. In this paper, an IS-based adaptive sampling method is studied that exploits side information to construct the required probability distribution. The method is modified to enable its application to deep neural networks, and the improved method is dubbed Regularized Adaptive Sampling (RAS). An experimental comparison of RAS against SGD and against another state-of-the-art sampling method, using deep convolutional networks for classification on the MNIST and CIFAR-10 datasets, shows that RAS achieves relative improvements in the training process without incurring significant overhead or affecting the accuracy of the networks.
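The sampling idea in this abstract can be illustrated with a minimal, generic importance-sampling sketch. It is not the RAS method itself: the `relevance` scores (for example, a per-example loss), the function name `sample_minibatch`, and the `1 / (n * p_i)` weighting are assumptions made for illustration.

```python
import random

def sample_minibatch(relevance, batch_size, rng):
    """Draw example indices with probability proportional to relevance.

    Returns (index, weight) pairs. Scaling each sampled gradient by its
    weight 1 / (n * p_i) keeps the mini-batch gradient estimate unbiased
    with respect to the uniform average over all n examples.
    """
    n = len(relevance)
    total = sum(relevance)
    probs = [r / total for r in relevance]  # hypothetical relevance scores, normalized
    indices = rng.choices(range(n), weights=probs, k=batch_size)
    return [(i, 1.0 / (n * probs[i])) for i in indices]

rng = random.Random(0)
relevance = [0.1, 0.1, 0.1, 5.0]  # the last example is assumed far more relevant
batch = sample_minibatch(relevance, batch_size=8, rng=rng)
```

Under uniform relevance every weight equals 1, so the sketch reduces to ordinary uniform mini-batch sampling with replacement.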

Learning-Rate Annealing Methods for Deep Neural Networks

Electronics ◽

10.3390/electronics10162029 ◽

2021 ◽

Vol 10 (16) ◽

pp. 2029

Author(s):

Kensuke Nakamura ◽

Bilel Derbel ◽

Kyoung-Jae Won ◽

Byung-Woo Hong

Keyword(s):

Neural Networks ◽

Deep Neural Networks ◽

Network Models ◽

Adaptive Methods ◽

Learning Rate ◽

Stochastic Gradient Descent ◽

Data Sets ◽

Great Success ◽

Sigmoid Function ◽

Neural Network Models

Deep neural networks (DNNs) have achieved great success in recent decades. DNNs are commonly optimized using stochastic gradient descent (SGD) with learning rate annealing, which outperforms adaptive methods in many tasks. However, there is no common choice of annealing schedule for SGD. This paper presents an empirical analysis of learning rate annealing based on experimental results on major image classification datasets, image classification being one of the key applications of DNNs. Our experiments combine recent deep neural network models with a variety of learning rate annealing methods. We also propose an annealing schedule that combines the sigmoid function with warmup, which is shown to outperform both the adaptive methods and the other existing schedules in accuracy in most cases with DNNs.
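A schedule of the kind described, linear warmup followed by a sigmoid-shaped decay, might look like the sketch below. The abstract does not give the paper's formula, so the function name `sigmoid_warmup_lr`, the parameters `base_lr`, `warmup_steps`, and `steepness`, and the exact shape of the curve are all assumptions.

```python
import math

def sigmoid_warmup_lr(step, total_steps, base_lr=0.1, warmup_steps=500, steepness=10.0):
    """Hypothetical learning rate at a given step (0-indexed)."""
    if step < warmup_steps:
        # Linear warmup from base_lr / warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup phase that has elapsed, in [0, 1).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    # Reversed sigmoid centred at the midpoint: near base_lr early, near 0 late.
    return base_lr / (1.0 + math.exp(steepness * (progress - 0.5)))

schedule = [sigmoid_warmup_lr(s, total_steps=10_000) for s in range(10_000)]
```

A larger `steepness` makes the drop around the midpoint sharper, while a smaller value makes the decay closer to linear.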

Nanosecond Photodynamics Simulations of a Cis-Trans Isomerization Are Enabled by Machine Learning

10.26434/chemrxiv.13047863 ◽

2020 ◽

Author(s):

Jingbai Li ◽

Patrick Reiser ◽

André Eberhard ◽

Pascal Friederich ◽

Steven Lopez

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Excited State ◽

Adaptive Sampling ◽

Computational Cost ◽

Ground Truth ◽

Absolute Error ◽

Photochemical Reactions ◽

Computational Techniques ◽

Full Potential

Photochemical reactions are being increasingly used to construct complex molecular architectures with mild and straightforward reaction conditions. Computational techniques are increasingly important to understand the reactivities and chemoselectivities of photochemical isomerization reactions because they offer molecular bonding information along the excited-state(s) of photodynamics. These photodynamics simulations are resource-intensive and are typically limited to 1–10 picoseconds and 1,000 trajectories due to high computational cost. Most organic photochemical reactions have excited-state lifetimes exceeding 1 picosecond, which places them outside possible computational studies. Westermayr et al. demonstrated that a machine learning approach could significantly lengthen photodynamics simulation times for a model system, methylenimmonium cation (CH2NH2+). We have developed a Python-based code, Python Rapid Artificial Intelligence Ab Initio Molecular Dynamics (PyRAI2MD), to accomplish the unprecedented 10 ns cis-trans photodynamics of trans-hexafluoro-2-butene (CF3–CH=CH–CF3) in 3.5 days. The same simulation would take approximately 58 years with ground-truth multiconfigurational dynamics. We proposed an innovative scheme combining Wigner sampling, geometrical interpolations, and short-time quantum chemical trajectories to effectively sample the initial data, facilitating the adaptive sampling to generate an informative and data-efficient training set with 6,232 data points. Our neural networks achieved chemical accuracy (mean absolute error of 0.032 eV). Our 4,814 trajectories reproduced the S1 half-life (60.5 fs) and the photochemical product ratio (trans:cis = 2.3:1), and autonomously discovered a pathway towards a carbene. The neural networks have also shown the capability of generalizing the full potential energy surface with chemically incomplete data (trans → cis but not cis → trans pathways), which may enable future automated photochemical reaction discoveries.

Analysis of Non-Linear Activation Functions for Classification Tasks Using Convolutional Neural Networks

Recent Patents on Computer Science ◽

10.2174/2213275911666181025143029 ◽

2019 ◽

Vol 12 (3) ◽

pp. 156-161 ◽

Cited By ~ 3

Author(s):

Aman Dureja ◽

Payal Pahwa

Keyword(s):

Neural Networks ◽

Deep Neural Networks ◽

Activation Function ◽

Primary Objective ◽

Experimental Comparison ◽

Activation Functions ◽

Practical Applications ◽

Network Activation ◽

Non Linear ◽

Hidden Layer

Background: Activation functions play an important role in building deep neural networks, and the choice of activation function affects both how well the network can be optimized and the quality of its results. Several activation functions have been introduced in machine learning for many practical applications, but which activation function should be used in the hidden layers of deep neural networks had not been identified. Objective: The primary objective of this analysis was to determine which activation function should be used in the hidden layers of deep neural networks to solve complex non-linear problems. Methods: The comparative model was configured using a two-class dataset (Cat/Dog). The network used three convolutional layers, each followed by a pooling layer. The dataset was divided into two parts: the first 8000 images were used for training the network and the next 2000 images for testing it. Results: The experimental comparison was carried out by applying different activation functions to each layer of the CNN. The validation error and accuracy on the Cat/Dog dataset were analyzed for the activation functions ReLU, Tanh, SELU, PReLU, and ELU in the hidden layers. Overall, ReLU gave the best performance, with a validation loss of 0.3912 and a validation accuracy of 0.8320 at the 25th epoch. Conclusion: A CNN model with ReLU hidden layers (three hidden layers here) gives the best results and improves overall performance in terms of accuracy and speed. These advantages of ReLU in the hidden layers of a CNN are helpful for effective and fast retrieval of images from databases.

An efficient pruning scheme of deep neural networks for Internet of Things applications

EURASIP Journal on Advances in Signal Processing ◽

10.1186/s13634-021-00744-4 ◽

2021 ◽

Vol 2021 (1) ◽

Author(s):

Chen Qi ◽

Shibo Shen ◽

Rongpeng Li ◽

Zhifeng Zhao ◽

Qing Liu ◽

...

Keyword(s):

Neural Network ◽

Neural Networks ◽

Internet Of Things ◽

Deep Neural Networks ◽

Computational Cost ◽

Superior Performance ◽

Compact Structure ◽

Resource Limited ◽

Benchmark Datasets ◽

Iot Devices

Nowadays, deep neural networks (DNNs) are being rapidly deployed to realize a number of functionalities such as sensing, imaging, classification, and recognition. However, the computationally intensive requirements of DNNs make them difficult to apply to resource-limited Internet of Things (IoT) devices. In this paper, we propose a novel pruning-based paradigm that aims to reduce the computational cost of DNNs by uncovering a more compact structure and learning the effective weights therein, without compromising the expressive capability of DNNs. In particular, our algorithm achieves efficient end-to-end training that directly transforms a redundant neural network into a compact one with a specifically targeted compression rate. We comprehensively evaluate our approach on various representative benchmark datasets and compare it with typical advanced convolutional neural network (CNN) architectures. The experimental results verify the superior performance and robust effectiveness of our scheme. For example, when pruning VGG on CIFAR-10, our proposed scheme reduces its FLOPs (floating-point operations) and number of parameters by 76.2% and 94.1%, respectively, while still maintaining a satisfactory accuracy. To sum up, our scheme could facilitate the integration of DNNs into the common machine-learning-based IoT framework and establish distributed training of neural networks in both cloud and edge.

Efficient Binarized Convolutional Layers for Visual Inspection Applications on Resource-Limited FPGAs and ASICs

Electronics ◽

10.3390/electronics10131511 ◽

2021 ◽

Vol 10 (13) ◽

pp. 1511

Author(s):

Taylor Simons ◽

Dah-Jye Lee

Keyword(s):

Neural Networks ◽

Visual Inspection ◽

Deep Neural Networks ◽

Computational Cost ◽

Quality Inspection ◽

Agricultural Produce ◽

Resource Limited ◽

Inspection Tasks ◽

Computational Resources ◽

Small Models

There has been a recent surge in publications related to binarized neural networks (BNNs), which use binary values to represent both the weights and activations in deep neural networks (DNNs). Due to the bitwise nature of BNNs, there have been many efforts to implement BNNs on ASICs and FPGAs. While BNNs are excellent candidates for these kinds of resource-limited systems, most implementations still require very large FPGAs or CPU-FPGA co-processing systems. Our work focuses on reducing the computational cost of BNNs even further, making them more efficient to implement on FPGAs. We target embedded visual inspection tasks, such as quality inspection sorting of manufactured parts and agricultural produce sorting. We propose a new binarized convolutional layer, called the neural jet features layer, that learns well-known classic computer vision kernels that are efficient to calculate as a group. We show that on visual inspection tasks, neural jet features perform comparably to standard BNN convolutional layers while using fewer computational resources. We also show that neural jet features tend to be more stable than BNN convolutional layers when training small models.
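The "bitwise nature" of BNNs mentioned above is commonly exploited by replacing multiply-accumulate with XNOR and popcount. The sketch below shows that standard trick for a single dot product. It is not the neural jet features layer, and the helper names `pack` and `binary_dot` as well as the bit encoding (bit 1 for +1, bit 0 for -1) are assumptions for illustration.

```python
def pack(values):
    """Pack a sequence of +1/-1 values into an int (bit i set means +1)."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
    return bits

def binary_dot(a_bits, b_bits, n):
    """Dot product of two length-n +1/-1 vectors stored as packed bits."""
    mask = (1 << n) - 1
    # XNOR marks positions where the signs agree; popcount counts them.
    agree = bin(~(a_bits ^ b_bits) & mask).count("1")
    # Each agreement contributes +1 and each disagreement -1.
    return 2 * agree - n

a = [1, -1, 1, 1]
b = [1, 1, -1, 1]
result = binary_dot(pack(a), pack(b), len(a))
```

In hardware the same computation maps to a row of XNOR gates followed by a population counter, which is one reason BNNs are attractive on FPGAs and ASICs.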

A Diffusion Approximation Theory of Momentum Stochastic Gradient Descent in Nonconvex Optimization

Stochastic Systems ◽

10.1287/stsy.2021.0083 ◽

2021 ◽

Author(s):

Tianyi Liu ◽

Zhehui Chen ◽

Enlu Zhou ◽

Tuo Zhao

Keyword(s):

Neural Networks ◽

Nonconvex Optimization ◽

Gradient Descent ◽

Deep Neural Networks ◽

Optimization Problems ◽

Saddle Points ◽

Stochastic Gradient ◽

Stochastic Gradient Descent ◽

Nonconvex Optimization Problems ◽

Empirical Success

The momentum stochastic gradient descent (MSGD) algorithm has been widely applied to many nonconvex optimization problems in machine learning (e.g., training deep neural networks and variational Bayesian inference). Despite its empirical success, there is still a lack of theoretical understanding of the convergence properties of MSGD. To fill this gap, we propose to analyze the algorithmic behavior of MSGD by diffusion approximations for nonconvex optimization problems with strict saddle points and isolated local optima. Our study shows that momentum helps escape from saddle points but hurts convergence within the neighborhood of optima (in the absence of step size annealing or momentum annealing). Our theoretical discovery partially corroborates the empirical success of MSGD in training deep neural networks.
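The saddle-escape claim can be illustrated on a toy problem. The sketch below runs the heavy-ball momentum update on f(x, y) = x^2 - y^2, which has a strict saddle at the origin. It is deterministic (no gradient noise) and is not the paper's diffusion analysis. The function name `run_msgd`, the test function, the starting point, and the hyperparameters are all assumptions.

```python
def run_msgd(momentum, lr=0.1, steps=20):
    """Heavy-ball updates on f(x, y) = x^2 - y^2, starting near the saddle."""
    x, y = 1.0, 1e-3          # tiny offset along the escape direction y
    vx, vy = 0.0, 0.0         # velocity (momentum buffer)
    for _ in range(steps):
        gx, gy = 2.0 * x, -2.0 * y        # gradient of f
        vx = momentum * vx - lr * gx      # accumulate velocity
        vy = momentum * vy - lr * gy
        x, y = x + vx, y + vy             # take the step
    return x, y

_, y_plain = run_msgd(momentum=0.0)     # plain gradient descent
_, y_momentum = run_msgd(momentum=0.9)  # momentum gradient descent
```

With these settings the momentum run ends with a larger |y| than the plain run, meaning it has moved further away from the saddle in the same number of steps.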

Pruning for Hardware-Based Deep Spiking Neural Networks Using Gated Schottky Diode as Synaptic Devices

Journal of Nanoscience and Nanotechnology ◽

10.1166/jnn.2020.18772 ◽

2020 ◽

Vol 20 (11) ◽

pp. 6603-6608 ◽

Cited By ~ 1

Author(s):

Sung-Tae Lee ◽

Suhwan Lim ◽

Jong-Ho Bae ◽

Dongseok Kwon ◽

Hyeong-Su Kim ◽

...

Keyword(s):

Neural Networks ◽

Deep Neural Networks ◽

Schottky Diodes ◽

Computational Cost ◽

Spiking Neural Networks ◽

Training Procedure ◽

Learning Tasks ◽

L1 Regularization ◽

The Cost ◽

High Computational Cost

Deep learning achieves state-of-the-art results in various machine learning tasks, but for applications that require real-time inference, the high computational cost of deep neural networks becomes an efficiency bottleneck. To overcome the high computational cost of deep neural networks, spiking neural networks (SNNs) have been proposed. Herein, we propose a hardware implementation of the SNN with gated Schottky diodes as synaptic devices. In addition, we apply L1 regularization for connection pruning of the deep spiking neural networks using gated Schottky diodes as synaptic devices. Applying L1 regularization eliminates the need for a re-training procedure because it prunes the weights based on the cost function. The compressed hardware-based SNN is energy efficient while achieving a classification accuracy of 97.85%, which is comparable to the 98.13% of the software deep neural network (DNN).
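One common way in which an L1 penalty prunes connections during training is the proximal (soft-thresholding) update, sketched below. The abstract does not say how the authors apply L1 regularization, so this is a stand-in technique, and the names `soft_threshold`, `l1_proximal_step`, and `l1_strength` are hypothetical.

```python
def soft_threshold(w, t):
    """Shrink w towards zero by t and clamp small values to exactly zero."""
    if w > t:
        return w - t
    if w < -t:
        return w + t
    return 0.0

def l1_proximal_step(weights, grads, lr, l1_strength):
    """One gradient step on the data loss followed by the L1 proximal map."""
    return [soft_threshold(w - lr * g, lr * l1_strength)
            for w, g in zip(weights, grads)]

weights = [0.5, 0.01, -0.3]
grads = [0.0, 0.0, 0.0]   # zero data gradient, to isolate the L1 effect
pruned = l1_proximal_step(weights, grads, lr=0.1, l1_strength=0.5)
```

Weights whose magnitude falls below `lr * l1_strength` become exactly zero, so the sparsity pattern emerges during training rather than in a separate prune-then-retrain stage.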

CircConv: A Structured Convolution with Low Complexity

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33014287 ◽

2019 ◽

Vol 33 ◽

pp. 4287-4294

Author(s):

Siyu Liao ◽

Bo Yuan

Keyword(s):

Machine Learning ◽

Neural Networks ◽

Deep Neural Networks ◽

Computational Cost ◽

Low Complexity ◽

Deep Convolutional Neural Networks ◽

Significant Saving ◽

Machine Learning Applications ◽

Fast Multiplication ◽

Large Model

Deep neural networks (DNNs), especially deep convolutional neural networks (CNNs), have emerged as a powerful technique in various machine learning applications. However, the large model sizes of DNNs place high demands on computation resources and weight storage, thereby limiting the practical deployment of DNNs. To overcome these limitations, this paper proposes to impose a circulant structure on the construction of convolutional layers, leading to circulant convolutional layers (CircConvs) and circulant CNNs. The circulant structure and models can be either trained from scratch or re-trained from a pre-trained non-circulant model, making the approach very flexible for different training environments. Through extensive experiments, this strong structure-imposing approach is shown to substantially reduce the number of parameters of convolutional layers and to enable significant savings in computational cost by using fast multiplication of the circulant tensor.
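The parameter and compute savings rest on a standard property of circulant matrices: an n x n circulant matrix is defined by n values, and multiplying by it is a circular convolution that can be evaluated in the Fourier domain. The sketch below checks that identity for a plain matrix-vector product. It is not the paper's CircConv layer, the function names are hypothetical, and the naive O(n^2) DFT stands in for a real FFT.

```python
import cmath

def circulant_matvec_dense(c, x):
    """Multiply by the circulant matrix whose first column is c (O(n^2))."""
    n = len(c)
    return [sum(c[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]

def dft(v, inverse=False):
    """Naive discrete Fourier transform; a real system would use an FFT."""
    n = len(v)
    sign = 1 if inverse else -1
    out = [sum(v[k] * cmath.exp(sign * 2j * cmath.pi * k * m / n) for k in range(n))
           for m in range(n)]
    return [z / n for z in out] if inverse else out

def circulant_matvec_fourier(c, x):
    """Same product computed as an elementwise multiply of the two spectra."""
    spectrum = [a * b for a, b in zip(dft(c), dft(x))]
    return [z.real for z in dft(spectrum, inverse=True)]

c = [1.0, 2.0, 3.0, 4.0]   # n stored values instead of n * n
x = [0.5, -1.0, 2.0, 0.0]
dense = circulant_matvec_dense(c, x)
fourier = circulant_matvec_fourier(c, x)
```

With an FFT in place of the naive transform, the product costs O(n log n) instead of O(n^2), which is the kind of fast multiplication the abstract refers to.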

Fully Convolutional Deep Neural Networks with Optimized Hyperparameters for Detection of Shockable and Non-Shockable Rhythms

Sensors ◽

10.3390/s20102875 ◽

2020 ◽

Vol 20 (10) ◽

pp. 2875 ◽

Cited By ~ 1

Author(s):

Vessela Krasteva ◽

Sarah Ménétré ◽

Jean-Philippe Didon ◽

Irena Jekova

Keyword(s):

Neural Networks ◽

Deep Neural Networks ◽

Random Search ◽

Computational Cost ◽

Dropout Rate ◽

Machine Learning Algorithms ◽

Dense Layer ◽

Max Pooling ◽

Resuscitation Guidelines ◽

Advisory Systems

Deep neural networks (DNNs) are state-of-the-art machine learning algorithms that can learn to self-extract significant features of the electrocardiogram (ECG) and can generally provide high-output diagnostic accuracy if subjected to robust training and optimization on large datasets at high computational cost. So far, limited research and optimization of DNNs in shock advisory systems is found on large ECG arrhythmia databases from out-of-hospital cardiac arrests (OHCA). The objective of this study is to optimize the hyperparameters (HPs) of deep convolutional neural networks (CNNs) for detection of shockable (Sh) and nonshockable (NSh) rhythms, and to validate the best HP settings for short and long analysis durations (2–10 s). Large numbers of (Sh + NSh) ECG samples were used for training (720 + 3170) and validation (739 + 5921) from Holters and defibrillators in OHCA. An end-to-end deep CNN architecture was implemented with a one-lead raw ECG input layer (5 s, 125 Hz, 2.5 µV/LSB), a configurable number of 5 to 23 hidden layers, and an output layer with diagnostic probability p ∈ [0: Sh, 1: NSh]. The hidden layers contain N convolutional blocks × 3 layers (Conv1D (filters = Fi, kernel size = Ki), max-pooling (pool size = 2), dropout (rate = 0.3)), one global max-pooling layer, and one dense layer. Random search optimization of HPs = {N, Fi, Ki}, i = 1, … N in a large grid of N = [1, 2, … 7], Fi = [5;50], Ki = [5;100] was performed. During training, the model with maximal balanced accuracy BAC = (Sensitivity + Specificity)/2 over 400 epochs was stored. The optimization principle is based on finding the common HP space of a few top-ranked models and predicting a robust HP setting by their median value. The optimal models for 1–7 CNN layers were trained with different learning rates LR = [10^-5; 10^-2] and the best model was finally validated on 2–10 s analysis durations. In total, 4216 random search models were trained. The optimal models with more than three convolutional layers did not exhibit substantial differences in performance (BAC = 99.31–99.5%). Among them, the best model was found with {N = 5, Fi = {20, 15, 15, 10, 5}, Ki = {10, 10, 10, 10, 10}, 7521 trainable parameters}, with maximal validation performance for 5-s analysis (BAC = 99.5%, Se = 99.6%, Sp = 99.4%) and a tolerable drop in performance (<2% points) for very short 2-s analysis (BAC = 98.2%, Se = 97.6%, Sp = 98.7%). DNN application in future-generation shock advisory systems can improve the detection performance of Sh and NSh rhythms and can considerably shorten the analysis duration, complying with resuscitation guidelines for minimal hands-off pauses.
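Two figures in this abstract can be reproduced with simple arithmetic: the balanced accuracy and the trainable parameter count of the best model. The sketch below assumes the usual Conv1D parameter formula (filters × (kernel size × input channels + 1)), a single input channel for the one-lead ECG, no parameters in the pooling and dropout layers, and a dense layer with one output unit fed by the global max-pooling output. The function names are hypothetical.

```python
def balanced_accuracy(sensitivity, specificity):
    """BAC = (Sensitivity + Specificity) / 2, as defined in the abstract."""
    return (sensitivity + specificity) / 2.0

def conv1d_params(in_channels, filters, kernel_size):
    """Weights plus one bias per filter for a standard 1-D convolution."""
    return filters * (kernel_size * in_channels + 1)

def count_trainable_params(filters, kernels, input_channels=1, dense_units=1):
    """Sum the Conv1D blocks, then add the final dense layer."""
    total, channels = 0, input_channels
    for f, k in zip(filters, kernels):
        total += conv1d_params(channels, f, k)
        channels = f  # the next block sees f feature channels
    # Global max-pooling yields one value per channel of the last block.
    total += dense_units * (channels + 1)
    return total

bac_5s = balanced_accuracy(99.6, 99.4)   # approximately 99.5
bac_2s = balanced_accuracy(97.6, 98.7)   # approximately 98.15, reported as 98.2
n_params = count_trainable_params([20, 15, 15, 10, 5], [10] * 5)  # 7521
```

Under these assumptions the count matches the 7521 trainable parameters reported for the best model, with the five convolutional blocks contributing 7515 and the dense layer 6.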

Interpreting Deep Neural Networks Beyond Attribution Methods: Quantifying Global Importance of Genomic Features

10.1101/2020.02.19.956896 ◽

2020 ◽

Cited By ~ 1

Author(s):

Peter K. Koo ◽

Matt Ploenzke

Keyword(s):

Neural Networks ◽

Deep Neural Networks ◽

Population Level ◽

Computational Genomics ◽

Great Success ◽

Single Nucleotide Variants ◽

Single Nucleotide ◽

Genomic Features ◽

High Performing ◽

Importance Analysis

Despite deep neural networks (DNNs) having found great success at improving performance on various prediction tasks in computational genomics, it remains difficult to understand why they make any given prediction. In genomics, the main approaches to interpret a high-performing DNN are to visualize learned representations via weight visualizations and attribution methods. While these methods can be informative, each has strong limitations. For instance, attribution methods only uncover the independent contribution of single nucleotide variants in a given sequence. Here we discuss and argue for global importance analysis, which can quantify population-level importance of putative features and their interactions learned by a DNN. We highlight recent work that has benefited from this interpretability approach and then discuss connections between global importance analysis and causality.
