Inefficiency of K-FAC for Large Batch Size Training

2020 ◽  
Vol 34 (04) ◽  
pp. 5053-5060
Author(s):  
Linjian Ma ◽  
Gabe Montague ◽  
Jiayu Ye ◽  
Zhewei Yao ◽  
Amir Gholami ◽  
...  

Several recent works have claimed record times for ImageNet training. These are achieved by using large batch sizes during training to leverage parallel resources and produce faster wall-clock times per training epoch. However, such solutions often require massive hyper-parameter tuning, an important cost that is frequently ignored. In this work, we perform an extensive analysis of large batch size training for two popular methods: Stochastic Gradient Descent (SGD) and the Kronecker-Factored Approximate Curvature (K-FAC) method. We evaluate the performance of these methods in terms of both wall-clock time and aggregate computational cost, and study their hyper-parameter sensitivity by performing more than 512 experiments per batch size for each method. We run experiments with multiple models on two datasets, CIFAR-10 and SVHN. The results show that beyond a critical batch size both K-FAC and SGD significantly deviate from ideal strong-scaling behaviour, and that, despite common belief, K-FAC does not exhibit improved large-batch scalability compared to SGD.
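To make the strong-scaling criterion concrete: under ideal strong scaling, doubling the batch size halves the number of steps per epoch at roughly constant time per step, so time per epoch falls like 1/B. A minimal Python sketch of how such an efficiency could be computed from measured timings; the batch sizes and timings below are hypothetical, not the paper's measurements:

```python
# Minimal sketch (not from the paper): compare measured time per epoch against
# the ideal 1/B scaling extrapolated from a reference batch size.
ref_batch, ref_epoch_time = 128, 100.0          # hypothetical reference point
measured = {256: 52.0, 512: 30.0, 2048: 14.0}   # hypothetical timings (seconds)

for batch, epoch_time in measured.items():
    ideal = ref_epoch_time * ref_batch / batch  # perfect 1/B scaling
    efficiency = ideal / epoch_time             # 1.0 = ideal, <1.0 = deviation
    print(f"batch {batch}: measured {epoch_time:.1f}s, ideal {ideal:.1f}s, "
          f"strong-scaling efficiency {efficiency:.2f}")
```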

Author(s):  
Yikai Zhang ◽  
Hui Qu ◽  
Chao Chen ◽  
Dimitris Metaxas

Deep learning architectures are usually proposed with millions of parameters, resulting in memory issues when training deep neural networks with stochastic gradient descent type methods and large batch sizes. However, training with small batch sizes tends to produce low-quality solutions due to the large variance of the stochastic gradients. In this paper, we tackle this problem by proposing a new framework for training deep neural networks with small batches and noisy gradients. During optimization, our method iteratively applies a proximal-type regularizer to make the loss function strongly convex. Such a regularizer stabilizes the gradients, leading to better training performance. We prove that our algorithm achieves a convergence rate comparable to vanilla SGD even with small batch sizes. Our framework is simple to implement and can potentially be combined with many existing optimization algorithms. Empirical results show that our method outperforms SGD and Adam when the batch size is small. Our implementation is available at https://github.com/huiqu18/TRAlgorithm.
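A minimal NumPy sketch of the general idea: an outer loop adds a proximal term anchored at the previous iterate, so the inner small-batch SGD problem is strongly convex. The quadratic loss and all hyper-parameters here are illustrative assumptions, not the authors' implementation (see the linked repository for that):

```python
import numpy as np

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 20)), rng.normal(size=1000)   # toy regression data
w, anchor = np.zeros(20), np.zeros(20)
lam, lr, batch = 0.1, 0.01, 8                               # illustrative hyper-parameters

for outer in range(50):
    # Inner SGD steps on the regularized loss f(w) + lam/2 * ||w - anchor||^2,
    # which is strongly convex in w.
    for _ in range(20):
        idx = rng.integers(0, len(y), size=batch)
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / batch + lam * (w - anchor)
        w -= lr * grad
    anchor = w.copy()                                        # move the proximal anchor
```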


2018 ◽  
Vol 30 (7) ◽  
pp. 2005-2023 ◽  
Author(s):  
Tomoumi Takase ◽  
Satoshi Oyama ◽  
Masahito Kurihara

We present a comprehensive framework of search methods, such as simulated annealing and batch training, for solving nonconvex optimization problems. These methods search a wider range by gradually decreasing the randomness added to the standard gradient descent method. The formulation that we define on the basis of this framework can be directly applied to neural network training. This produces an effective approach that gradually increases batch size during training. We also explain why large batch training degrades generalization performance, which previous studies have not clarified.
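A minimal sketch of the resulting approach: a schedule that grows the batch size as training proceeds, so the gradient noise is reduced gradually. The base size, growth factor, and cap are illustrative assumptions, not the paper's settings:

```python
def batch_size_at(epoch, base=32, growth=2, every=10, cap=1024):
    """Double the batch size every `every` epochs, capped at `cap`."""
    return min(base * growth ** (epoch // every), cap)

# e.g. epochs 0-9 use 32, 10-19 use 64, ..., until the cap of 1024 is reached
print([batch_size_at(e) for e in range(0, 60, 10)])   # [32, 64, 128, 256, 512, 1024]
```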


2021 ◽  
Author(s):  
Nishchal J ◽  
neel bhandari

Information is mounting exponentially, and the world is turning to Big Data in its hunt for knowledge. Labelled data is used for automated learning and data analysis, a process termed Machine Learning. Linear Regression is a statistical method for predictive analysis, and Gradient Descent is the process of using the gradients of a cost function to minimize the mean squared error. This work presents an insight into the different types of gradient descent algorithms, namely Batch Gradient Descent, Stochastic Gradient Descent, and Mini-Batch Gradient Descent, which are implemented on a linear regression dataset to determine their computational complexity as well as the factors, such as learning rate, batch size, and number of iterations, that affect the efficiency of each algorithm.
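A minimal NumPy sketch of the three variants on a synthetic linear-regression problem; the dataset, learning rate, and batch sizes are illustrative assumptions, not the ones used in the paper. The only difference between the variants is how many samples enter each gradient step:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=500)

def gradient_descent(batch_size, lr=0.05, epochs=100):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        perm = rng.permutation(len(y))
        for start in range(0, len(y), batch_size):
            idx = perm[start:start + batch_size]
            grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)  # MSE gradient
            w -= lr * grad
    return w

print(gradient_descent(len(y)))  # batch gradient descent: all samples per step
print(gradient_descent(1))       # stochastic gradient descent: one sample per step
print(gradient_descent(32))      # mini-batch gradient descent: 32 samples per step
```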


Author(s):  
K J Joseph ◽  
Vamshi Teja R ◽  
Krishnakant Singh ◽  
Vineeth N Balasubramanian

Mini-batch gradient descent based methods are the de facto algorithms for training neural network architectures today. We introduce a mini-batch selection strategy based on submodular function maximization. Our novel submodular formulation captures the informativeness of each sample and the diversity of the whole subset. We design an efficient greedy algorithm that gives high-quality solutions to this NP-hard combinatorial optimization problem. Our extensive experiments on standard datasets show that deep models trained using the proposed batch selection strategy generalize better than those trained with Stochastic Gradient Descent as well as a popular baseline sampling strategy, across different learning rates, batch sizes, and distance metrics.
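A minimal NumPy sketch of the flavor of such a strategy: greedy maximization of a facility-location objective, a classic monotone submodular function that rewards a batch for being similar to (covering) every sample in the pool. The embeddings and similarity kernel here are placeholders, not the authors' formulation:

```python
import numpy as np

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 16))                        # hypothetical sample embeddings
d2 = ((features[:, None, :] - features[None, :, :]) ** 2).sum(-1)
sim = np.exp(-d2 / d2.mean())                                # non-negative RBF similarities

def greedy_facility_location(sim, batch_size):
    """Greedily maximize F(S) = sum_i max_{j in S} sim[i, j] (monotone submodular)."""
    coverage = np.zeros(len(sim))                            # current max similarity per sample
    selected = []
    for _ in range(batch_size):
        # Marginal gain of each candidate j: how much it raises the total coverage
        gains = np.maximum(sim, coverage[:, None]).sum(axis=0) - coverage.sum()
        j = int(np.argmax(gains))
        selected.append(j)
        coverage = np.maximum(coverage, sim[:, j])
    return selected

print(greedy_facility_location(sim, batch_size=8))           # indices of the chosen mini-batch
```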


2011 ◽  
Vol 2011 ◽  
pp. 1-18 ◽  
Author(s):  
Tristan van Leeuwen ◽  
Aleksandr Y. Aravkin ◽  
Felix J. Herrmann

We explore the use of stochastic optimization methods for seismic waveform inversion. The basic principle of such methods, which goes back to the 1950s, is to randomly draw a batch of realizations of a given misfit function; the ultimate goal is to dramatically reduce the computational cost of evaluating the misfit. Following earlier work, we introduce stochasticity into the waveform inversion problem in a rigorous way via a technique called randomized trace estimation. We then review theoretical results that underlie recent developments in the use of stochastic methods for waveform inversion. We present numerical experiments to illustrate the behavior of different types of stochastic optimization methods and investigate the sensitivity to the batch size and the noise level in the data. We find that it is possible to reproduce results that are qualitatively similar to the solution of the full problem with modest batch sizes, even on noisy data. Each iteration of the corresponding stochastic methods requires an order of magnitude fewer PDE solves than a comparable deterministic method applied to the full problem, which may lead to an order of magnitude speedup for waveform inversion in practice.
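A minimal NumPy sketch of randomized trace estimation applied to a least-squares data misfit over many sources; the residual matrix here is random and stands in for the wave-equation modelling in the paper, and the probe count K plays the role of the batch size:

```python
import numpy as np

rng = np.random.default_rng(0)
n_receivers, n_sources = 50, 100
# Placeholder residual R = D_obs - D_pred, one column per seismic source;
# in practice each column of D_pred costs one PDE solve.
R = rng.normal(size=(n_receivers, n_sources))

full_misfit = np.linalg.norm(R, "fro") ** 2      # requires all n_sources columns

# Randomized trace estimation: ||R||_F^2 = trace(R^T R) ~ (1/K) sum_k ||R w_k||^2
# with E[w_k w_k^T] = I (here Rademacher probes); only K "simultaneous sources" needed.
K = 10                                           # batch size
W = rng.choice([-1.0, 1.0], size=(n_sources, K))
estimate = (np.linalg.norm(R @ W, axis=0) ** 2).mean()

print(full_misfit, estimate)
```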


2018 ◽  
Vol 2018 ◽  
pp. 1-10 ◽  
Author(s):  
Keke Zhang ◽  
Qiufeng Wu ◽  
Anwang Liu ◽  
Xiangyan Meng

This paper applies deep convolutional neural networks (CNNs) to identify tomato leaf disease via transfer learning, using AlexNet, GoogLeNet, and ResNet as backbones. The best-performing combination was then used to vary the network structure, aiming to explore the performance of full training versus fine-tuning of the CNN. The highest accuracy for identifying tomato leaf disease, 97.28%, was achieved by the optimal model, ResNet with stochastic gradient descent (SGD), a batch size of 16, 4992 iterations, and training layers from the 37th layer to the fully connected layer (denoted "fc"). The experimental results show that the proposed technique is effective in identifying tomato leaf disease and could be generalized to identify other plant diseases.
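A minimal PyTorch-style sketch of the fine-tuning setup described here (freeze everything before a chosen layer, train the rest with SGD); the placeholder model, layer index, and learning rate are illustrative assumptions rather than the paper's ResNet configuration:

```python
import torch
import torch.nn as nn

# Placeholder stack standing in for a pretrained backbone plus a final "fc" layer.
model = nn.Sequential(*[nn.Linear(64, 64) for _ in range(50)], nn.Linear(64, 10))

train_from = 37                                  # fine-tune from this layer up to "fc"
for i, layer in enumerate(model):
    if i < train_from:
        for p in layer.parameters():
            p.requires_grad = False              # freeze the earlier layers

optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad), lr=1e-3, momentum=0.9)
# A DataLoader with batch_size=16 would then feed the training loop.
```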


2018 ◽  
Vol 4 ◽  
pp. e167 ◽  
Author(s):  
Iam Palatnik de Sousa

A learning algorithm is proposed for the task of Arabic handwritten character and digit recognition. The architecture consists of an ensemble of different Convolutional Neural Networks. The proposed training algorithm uses adaptive gradient descent in the first epochs and regular stochastic gradient descent in the last epochs to facilitate convergence. Different validation strategies are tested, namely Monte Carlo Cross-Validation and K-fold Cross-Validation. Hyper-parameter tuning was done using the MADbase digits dataset. State-of-the-art validation and testing classification accuracies were achieved, with average values of 99.74% and 99.47%, respectively. The same algorithm was then trained and tested on the AHCD character dataset, also yielding state-of-the-art validation and testing classification accuracies: 98.60% and 98.42%, respectively.
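A minimal PyTorch-style sketch of that optimizer hand-off, switching from an adaptive method to plain SGD after a fixed number of epochs; the model, the choice of Adam as the adaptive method, the switch epoch, and the learning rates are illustrative assumptions, not the paper's exact recipe:

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32, 28))       # placeholder classifier
adam = torch.optim.Adam(model.parameters(), lr=1e-3)              # adaptive early phase
sgd = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)  # plain SGD late phase
switch_epoch, total_epochs = 10, 30                               # illustrative schedule

for epoch in range(total_epochs):
    optimizer = adam if epoch < switch_epoch else sgd
    # for images, labels in train_loader:                         # training loop omitted
    #     loss = nn.functional.cross_entropy(model(images), labels)
    #     optimizer.zero_grad(); loss.backward(); optimizer.step()
```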

