Inefficiency of K-FAC for Large Batch Size Training

2020 ◽  
Vol 34 (04) ◽  
pp. 5053-5060
Author(s):  
Linjian Ma ◽  
Gabe Montague ◽  
Jiayu Ye ◽  
Zhewei Yao ◽  
Amir Gholami ◽  
...  

Several recent works have claimed record times for ImageNet training. These are achieved by using large batch sizes during training to leverage parallel resources and produce faster wall-clock times per training epoch. However, such solutions often require massive hyper-parameter tuning, an important cost that is frequently ignored. In this work, we perform an extensive analysis of large batch size training for two popular methods: Stochastic Gradient Descent (SGD) and the Kronecker-Factored Approximate Curvature (K-FAC) method. We evaluate the performance of these methods in terms of both wall-clock time and aggregate computational cost, and study their hyper-parameter sensitivity by performing more than 512 experiments per batch size for each method. We run experiments with multiple models on two datasets, CIFAR-10 and SVHN. The results show that beyond a critical batch size both K-FAC and SGD significantly deviate from ideal strong-scaling behaviour, and that, despite common belief, K-FAC does not exhibit improved large-batch scalability compared to SGD.
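To make the strong-scaling criterion concrete: under ideal strong scaling, doubling the batch size halves the number of steps per epoch at roughly constant time per step, so time per epoch falls like 1/B. A minimal Python sketch of how such an efficiency could be computed from measured timings; the batch sizes and timings below are hypothetical, not the paper's measurements:

```python
# Minimal sketch (not from the paper): compare measured time per epoch against
# the ideal 1/B scaling extrapolated from a reference batch size.
ref_batch, ref_epoch_time = 128, 100.0          # hypothetical reference point
measured = {256: 52.0, 512: 30.0, 2048: 14.0}   # hypothetical timings (seconds)

for batch, epoch_time in measured.items():
    ideal = ref_epoch_time * ref_batch / batch  # perfect 1/B scaling
    efficiency = ideal / epoch_time             # 1.0 = ideal, <1.0 = deviation
    print(f"batch {batch}: measured {epoch_time:.1f}s, ideal {ideal:.1f}s, "
          f"strong-scaling efficiency {efficiency:.2f}")
```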

Author(s):  
Yikai Zhang ◽  
Hui Qu ◽  
Chao Chen ◽  
Dimitris Metaxas

Deep learning architectures are usually proposed with millions of parameters, resulting in memory issues when training deep neural networks with stochastic gradient descent type methods and large batch sizes. However, training with small batch sizes tends to produce low-quality solutions due to the large variance of the stochastic gradients. In this paper, we tackle this problem by proposing a new framework for training deep neural networks with small batches and noisy gradients. During optimization, our method iteratively applies a proximal-type regularizer to make the loss function strongly convex. Such a regularizer stabilizes the gradients, leading to better training performance. We prove that our algorithm achieves a convergence rate comparable to vanilla SGD even with small batch sizes. Our framework is simple to implement and can potentially be combined with many existing optimization algorithms. Empirical results show that our method outperforms SGD and Adam when the batch size is small. Our implementation is available at https://github.com/huiqu18/TRAlgorithm.
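A minimal NumPy sketch of the general idea: an outer loop adds a proximal term anchored at the previous iterate, so the inner small-batch SGD problem is strongly convex. The quadratic loss and all hyper-parameters here are illustrative assumptions, not the authors' implementation (see the linked repository for that):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 20)), rng.normal(size=1000)   # toy regression data
w, anchor = np.zeros(20), np.zeros(20)
lam, lr, batch = 0.1, 0.01, 8                               # illustrative hyper-parameters

for outer in range(50):
    # Inner SGD steps on the regularized loss f(w) + lam/2 * ||w - anchor||^2,
    # which is strongly convex in w.
    for _ in range(20):
        idx = rng.integers(0, len(y), size=batch)
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch + lam * (w - anchor)
        w -= lr * grad
    anchor = w.copy()                                        # move the proximal anchor
```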


2018 ◽  
Vol 30 (7) ◽  
pp. 2005-2023 ◽  
Author(s):  
Tomoumi Takase ◽  
Satoshi Oyama ◽  
Masahito Kurihara

We present a comprehensive framework of search methods, such as simulated annealing and batch training, for solving nonconvex optimization problems. These methods search a wider range by gradually decreasing the randomness added to the standard gradient descent method. The formulation that we define on the basis of this framework can be directly applied to neural network training. This produces an effective approach that gradually increases batch size during training. We also explain why large batch training degrades generalization performance, which previous studies have not clarified.
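A minimal sketch of the resulting approach: a schedule that grows the batch size as training proceeds, so the gradient noise is reduced gradually. The base size, growth factor, and cap are illustrative assumptions, not the paper's settings:

```python
def batch_size_at(epoch, base=32, growth=2, every=10, cap=1024):
    """Double the batch size every `every` epochs, capped at `cap`."""
    return min(base * growth ** (epoch // every), cap)

# e.g. epochs 0-9 use 32, 10-19 use 64, ..., until the cap of 1024 is reached
print([batch_size_at(e) for e in range(0, 60, 10)])   # [32, 64, 128, 256, 512, 1024]
```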


2021 ◽  
Author(s):  
Nishchal J ◽  
neel bhandari

Information is mounting exponentially, and the world is turning to Big Data in its hunt for knowledge. Labelled data is used for automated learning and data analysis, a process termed Machine Learning. Linear Regression is a statistical method for predictive analysis, and Gradient Descent is the process of using the gradients of a cost function to minimize the mean squared error. This work presents an insight into the different types of gradient descent algorithms, namely Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent, which are implemented on a linear regression dataset to determine their computational complexity as well as the factors, such as learning rate, batch size, and number of iterations, that affect the efficiency of each algorithm.
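A minimal NumPy sketch of the three variants on a synthetic linear-regression problem; the dataset, learning rate, and batch sizes are illustrative assumptions, not the ones used in the paper. The only difference between the variants is how many samples enter each gradient step:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=500)

def gradient_descent(batch_size, lr=0.05, epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        perm = rng.permutation(len(y))
        for start in range(0, len(y), batch_size):
            idx = perm[start:start + batch_size]
            grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)  # MSE gradient
            w -= lr * grad
    return w

print(gradient_descent(len(y)))  # batch gradient descent: all samples per step
print(gradient_descent(1))       # stochastic gradient descent: one sample per step
print(gradient_descent(32))      # mini-batch gradient descent: 32 samples per step
```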


Author(s):  
K J Joseph ◽  
Vamshi Teja R ◽  
Krishnakant Singh ◽  
Vineeth N Balasubramanian

Mini-batch gradient descent based methods are the de facto algorithms for training neural network architectures today. We introduce a mini-batch selection strategy based on submodular function maximization. Our novel submodular formulation captures the informativeness of each sample and the diversity of the whole subset. We design an efficient greedy algorithm that gives high-quality solutions to this NP-hard combinatorial optimization problem. Our extensive experiments on standard datasets show that deep models trained using the proposed batch selection strategy generalize better than those trained with Stochastic Gradient Descent as well as a popular baseline sampling strategy, across different learning rates, batch sizes, and distance metrics.
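A minimal NumPy sketch of the flavor of such a strategy: greedy maximization of a facility-location objective, a classic monotone submodular function that rewards a batch for being similar to (covering) every sample in the pool. The embeddings and similarity kernel here are placeholders, not the authors' formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 16))                        # hypothetical sample embeddings
d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
sim = np.exp(-d2 / d2.mean())                                # non-negative RBF similarities

def greedy_facility_location(sim, batch_size):
    """Greedily maximize F(S) = sum_i max_{j in S} sim[i, j] (monotone submodular)."""
    coverage = np.zeros(len(sim))                            # current max similarity per sample
    selected = []
    for _ in range(batch_size):
        # Marginal gain of each candidate j: how much it raises the total coverage
        gains = np.maximum(sim, coverage[:, None]).sum(axis=0) - coverage.sum()
        j = int(np.argmax(gains))
        selected.append(j)
        coverage = np.maximum(coverage, sim[:, j])
    return selected

print(greedy_facility_location(sim, batch_size=8))           # indices of the chosen mini-batch
```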


2011 ◽  
Vol 2011 ◽  
pp. 1-18 ◽  
Author(s):  
Tristan van Leeuwen ◽  
Aleksandr Y. Aravkin ◽  
Felix J. Herrmann

We explore the use of stochastic optimization methods for seismic waveform inversion. The basic principle of such methods, which goes back to the 1950s, is to randomly draw a batch of realizations of a given misfit function; the ultimate goal is to dramatically reduce the computational cost of evaluating the misfit. Following earlier work, we introduce stochasticity into the waveform inversion problem in a rigorous way via a technique called randomized trace estimation. We then review theoretical results that underlie recent developments in the use of stochastic methods for waveform inversion. We present numerical experiments to illustrate the behavior of different types of stochastic optimization methods and investigate the sensitivity to the batch size and the noise level in the data. We find that it is possible to reproduce results that are qualitatively similar to the solution of the full problem with modest batch sizes, even on noisy data. Each iteration of the corresponding stochastic methods requires an order of magnitude fewer PDE solves than a comparable deterministic method applied to the full problem, which may lead to an order of magnitude speedup for waveform inversion in practice.
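A minimal NumPy sketch of randomized trace estimation applied to a least-squares data misfit over many sources; the residual matrix here is random and stands in for the wave-equation modelling in the paper, and the probe count K plays the role of the batch size:

```python
import numpy as np

rng = np.random.default_rng(0)
n_receivers, n_sources = 50, 100
# Placeholder residual R = D_obs - D_pred, one column per seismic source;
# in practice each column of D_pred costs one PDE solve.
R = rng.normal(size=(n_receivers, n_sources))

full_misfit = np.linalg.norm(R, "fro") ** 2      # requires all n_sources columns

# Randomized trace estimation: ||R||_F^2 = trace(R^T R) ~ (1/K) sum_k ||R w_k||^2
# with E[w_k w_k^T] = I (here Rademacher probes); only K "simultaneous sources" needed.
K = 10                                           # batch size
W = rng.choice([-1.0, 1.0], size=(n_sources, K))
estimate = (np.linalg.norm(R @ W, axis=0) ** 2).mean()

print(full_misfit, estimate)
```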


2018 ◽  
Vol 2018 ◽  
pp. 1-10 ◽  
Author(s):  
Keke Zhang ◽  
Qiufeng Wu ◽  
Anwang Liu ◽  
Xiangyan Meng

This paper applies deep convolutional neural networks (CNNs) to identify tomato leaf disease via transfer learning, using AlexNet, GoogLeNet, and ResNet as backbones. The best-performing combination was then used to vary the network structure, aiming to explore the performance of full training versus fine-tuning of the CNN. The highest accuracy for identifying tomato leaf disease, 97.28%, was achieved by the optimal model, ResNet with stochastic gradient descent (SGD), a batch size of 16, 4992 iterations, and training layers from the 37th layer to the fully connected layer (denoted "fc"). The experimental results show that the proposed technique is effective in identifying tomato leaf disease and could be generalized to identify other plant diseases.
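A minimal PyTorch-style sketch of the fine-tuning setup described here (freeze everything before a chosen layer, train the rest with SGD); the placeholder model, layer index, and learning rate are illustrative assumptions rather than the paper's ResNet configuration:

```python
import torch
import torch.nn as nn

# Placeholder stack standing in for a pretrained backbone plus a final "fc" layer.
model = nn.Sequential(*[nn.Linear(64, 64) for _ in range(50)], nn.Linear(64, 10))

train_from = 37                                  # fine-tune from this layer up to "fc"
for i, layer in enumerate(model):
    if i < train_from:
        for p in layer.parameters():
            p.requires_grad = False              # freeze the earlier layers

optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3, momentum=0.9)
# A DataLoader with batch_size=16 would then feed the training loop.
```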


2018 ◽  
Vol 4 ◽  
pp. e167 ◽  
Author(s):  
Iam Palatnik de Sousa

A learning algorithm is proposed for the task of Arabic handwritten character and digit recognition. The architecture consists of an ensemble of different Convolutional Neural Networks. The proposed training algorithm uses adaptive gradient descent in the first epochs and regular stochastic gradient descent in the last epochs to facilitate convergence. Different validation strategies are tested, namely Monte Carlo Cross-Validation and K-fold Cross-Validation. Hyper-parameter tuning was done using the MADbase digits dataset. State-of-the-art validation and testing classification accuracies were achieved, with average values of 99.74% and 99.47%, respectively. The same algorithm was then trained and tested on the AHCD character dataset, also yielding state-of-the-art validation and testing classification accuracies: 98.60% and 98.42%, respectively.
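A minimal PyTorch-style sketch of that optimizer hand-off, switching from an adaptive method to plain SGD after a fixed number of epochs; the model, the choice of Adam as the adaptive method, the switch epoch, and the learning rates are illustrative assumptions, not the paper's exact recipe:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 28))       # placeholder classifier
adam = torch.optim.Adam(model.parameters(), lr=1e-3)              # adaptive early phase
sgd = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)  # plain SGD late phase
switch_epoch, total_epochs = 10, 30                               # illustrative schedule

for epoch in range(total_epochs):
    optimizer = adam if epoch < switch_epoch else sgd
    # for images, labels in train_loader:                         # training loop omitted
    #     loss = nn.functional.cross_entropy(model(images), labels)
    #     optimizer.zero_grad(); loss.backward(); optimizer.step()
```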

