Quantized Adam with Error Feedback

2021 ◽ Vol 12 (5) ◽ pp. 1-26
Author(s): Congliang Chen, Li Shen, Haozhi Huang, Wei Liu

In this article, we present a distributed variant of an adaptive stochastic gradient method for training deep neural networks in the parameter-server model. To reduce the communication cost between the workers and the server, we incorporate two types of quantization schemes, i.e., gradient quantization and weight quantization, into the proposed distributed Adam. In addition, to reduce the bias introduced by quantization operations, we propose an error-feedback technique to compensate for the quantized gradient. Theoretically, in the stochastic nonconvex setting, we show that the distributed adaptive gradient method with gradient quantization and error feedback converges to a first-order stationary point, and that the distributed adaptive gradient method with weight quantization and error feedback converges to a point determined by the quantization level, in both the single-worker and multi-worker modes. Finally, we apply the proposed distributed adaptive gradient methods to train deep neural networks. Experimental results demonstrate the efficacy of our methods.
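
To make the error-feedback idea concrete, here is a minimal NumPy sketch of the worker-side compensation step, assuming a simple scaled-sign quantizer as a stand-in for the paper's gradient quantization scheme; the function names and the toy loop are illustrative, not the authors' implementation.

```python
import numpy as np

def quantize_sign(v):
    # Scaled 1-bit (sign) quantizer; a stand-in for the paper's
    # gradient quantizer, not its exact scheme.
    return np.mean(np.abs(v)) * np.sign(v)

def worker_step(grad, error_buffer):
    # Error feedback: add the residual left over from the previous
    # round before quantizing, then keep the new residual.
    corrected = grad + error_buffer
    q = quantize_sign(corrected)
    return q, corrected - q

# Toy usage: the residual carries forward what quantization discarded.
rng = np.random.default_rng(0)
err = np.zeros(4)
for _ in range(3):
    q, err = worker_step(rng.normal(size=4), err)
```

On the server side, the quantized gradients would be averaged and fed into the adaptive (Adam-style) update; that part is omitted from this sketch.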

2021
Author(s): Tianyi Liu, Zhehui Chen, Enlu Zhou, Tuo Zhao

The momentum stochastic gradient descent (MSGD) algorithm has been widely applied to many nonconvex optimization problems in machine learning (e.g., training deep neural networks and variational Bayesian inference). Despite its empirical success, there is still a lack of theoretical understanding of the convergence properties of MSGD. To fill this gap, we propose to analyze the algorithmic behavior of MSGD by diffusion approximations for nonconvex optimization problems with strict saddle points and isolated local optima. Our study shows that momentum helps escape saddle points but hurts convergence within the neighborhood of optima (unless the step size or the momentum is annealed). Our theoretical discovery partially corroborates the empirical success of MSGD in training deep neural networks.
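
For reference, the MSGD update analyzed here takes the standard heavy-ball form sketched below; the quadratic toy objective and the hyperparameter values are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def msgd_step(x, velocity, grad_fn, lr=0.01, momentum=0.9):
    # Heavy-ball momentum SGD: the velocity accumulates past gradients
    # and the iterate moves along it.
    velocity = momentum * velocity - lr * grad_fn(x)
    return x + velocity, velocity

# Toy usage on f(x) = 0.5 * ||x||^2, whose gradient is x.
x, v = np.ones(3), np.zeros(3)
for _ in range(100):
    x, v = msgd_step(x, v, lambda z: z)
```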


Author(s): Hao Yu, Sen Yang, Shenghuo Zhu

In distributed training of deep neural networks, parallel mini-batch SGD is widely used to speed up the training process with multiple workers: the workers sample local stochastic gradients in parallel, a single server aggregates them to obtain the average, and each worker's local model is updated with an SGD step using the averaged gradient. Ideally, parallel mini-batch SGD can achieve a linear speed-up of the training time (with respect to the number of workers) compared with SGD on a single worker. In practice, however, such linear scalability is significantly limited by the growing demand for gradient communication as more workers are involved. Model averaging, which periodically averages the individual models trained on parallel workers, is another common practice for distributed training of deep neural networks since (Zinkevich et al. 2010; McDonald, Hall, and Mann 2010). Compared with parallel mini-batch SGD, the communication overhead of model averaging is significantly reduced. Extensive experimental work has verified that model averaging can still achieve a good speed-up of the training time as long as the averaging interval is carefully controlled. However, it remains a mystery in theory why such a simple heuristic works so well. This paper provides a thorough and rigorous theoretical study of why model averaging can work as well as parallel mini-batch SGD with significantly less communication overhead.
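
A minimal sketch of the model-averaging (local SGD) scheme described above, assuming a toy quadratic objective and a noisy gradient oracle; the worker count, step size, and averaging interval are placeholder choices, not the paper's settings.

```python
import numpy as np

def averaging_round(models, grad_fn, lr=0.05, local_steps=4):
    # Each worker runs `local_steps` SGD updates on its own copy of the
    # model, then all copies are averaged once (one communication round).
    for w in range(len(models)):
        for _ in range(local_steps):
            models[w] = models[w] - lr * grad_fn(models[w])
    averaged = np.mean(models, axis=0)
    return [averaged.copy() for _ in models]

# Toy usage: 4 workers minimizing f(x) = 0.5 * ||x||^2 with noisy gradients.
rng = np.random.default_rng(0)
workers = [np.ones(3) for _ in range(4)]
for _ in range(10):
    workers = averaging_round(workers, lambda x: x + 0.1 * rng.normal(size=x.shape))
```

Compared with parallel mini-batch SGD, communication happens once every `local_steps` updates rather than at every step, which is where the savings come from.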


Author(s): Derya Soydaner

In recent years, we have witnessed the rise of deep learning. Deep neural networks have proven successful in many areas. However, optimizing these networks has become more difficult as neural networks go deeper and datasets grow larger. Therefore, more advanced optimization algorithms have been proposed in recent years. In this study, widely used optimization algorithms for deep learning are examined in detail. To this end, these algorithms, known as adaptive gradient methods, are implemented for both supervised and unsupervised tasks. The behavior of the algorithms during training and their results on four image datasets, namely MNIST, CIFAR-10, Kaggle Flowers, and Labeled Faces in the Wild, are compared, highlighting their differences from basic optimization algorithms.
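
As a concrete example of the adaptive gradient methods compared in the study, here is the standard Adam update with bias correction; the default hyperparameters and the toy quadratic objective are illustrative, not the settings used in the experiments.

```python
import numpy as np

def adam_step(x, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    # Standard Adam: exponential moving averages of the gradient (m) and
    # its square (v), with bias correction before the update.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    return x - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# Toy usage on f(x) = 0.5 * ||x||^2, whose gradient is x.
x = np.ones(3)
m = v = np.zeros(3)
for t in range(1, 200):
    x, m, v = adam_step(x, m, v, x, t)
```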

