Accelerating deep neural network training with inconsistent stochastic gradient descent

2017 ◽  
Vol 93 ◽  
pp. 219-229 ◽  
Author(s):  
Linnan Wang ◽  
Yi Yang ◽  
Renqiang Min ◽  
Srimat Chakradhar
2020 ◽  
pp. 1-41 ◽  
Author(s):  
Benny Avelin ◽  
Kaj Nyström

In this paper, we prove that, in the deep limit, stochastic gradient descent on a ResNet-type deep neural network, where each layer shares the same weight matrix, converges to stochastic gradient descent for a Neural ODE, and that the corresponding value/loss functions converge. Our result gives, in the context of minimization by stochastic gradient descent, a theoretical foundation for considering Neural ODEs as the deep limit of ResNets. Our proof is based on certain decay estimates for associated Fokker–Planck equations.
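As a rough illustration of the correspondence described above (a sketch under assumed notation, not the authors' construction): a residual block with a single shared weight matrix applied N times with step size 1/N is the forward-Euler discretization of a Neural ODE, so its forward pass approaches the ODE trajectory as N grows.

import numpy as np

# Illustrative sketch (not the authors' code): a ResNet whose layers all share
# one weight matrix W, with residual step size 1/N, is the forward-Euler
# discretization of the Neural ODE  dx/dt = tanh(W x)  on t in [0, 1].

def resnet_forward(x, W, N):
    """N shared-weight residual layers: x <- x + (1/N) * tanh(W x)."""
    h = x.copy()
    for _ in range(N):
        h = h + (1.0 / N) * np.tanh(W @ h)
    return h

def neural_ode_forward(x, W, steps=10000):
    """Fine-grained Euler solve of dx/dt = tanh(W x) on t in [0, 1]."""
    h = x.copy()
    dt = 1.0 / steps
    for _ in range(steps):
        h = h + dt * np.tanh(W @ h)
    return h

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(4, 4))
x0 = rng.normal(size=4)

# The gap shrinks as the discrete network gets deeper.
for N in (4, 16, 64, 256):
    gap = np.linalg.norm(resnet_forward(x0, W, N) - neural_ode_forward(x0, W))
    print(f"N={N:4d}  ||ResNet - ODE|| = {gap:.6f}")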


Author(s):  
Shuheng Shen ◽  
Linli Xu ◽  
Jingchang Liu ◽  
Xianfeng Liang ◽  
Yifei Cheng

With the increase in the amount of data and the expansion of model scale, distributed parallel training has become an important and successful technique for addressing the resulting optimization challenges. Nevertheless, although distributed stochastic gradient descent (SGD) algorithms can achieve a linear iteration speedup, in practice they are limited significantly by communication cost, making it difficult to achieve a linear time speedup. In this paper, we propose a computation and communication decoupled stochastic gradient descent (CoCoD-SGD) algorithm that runs computation and communication in parallel to reduce the communication cost. We prove that CoCoD-SGD has a linear iteration speedup with respect to the total computation capability of the hardware resources. In addition, it has lower communication complexity and better time speedup compared with traditional distributed SGD algorithms. Experiments on deep neural network training demonstrate the significant improvements of CoCoD-SGD: when training ResNet18 and VGG16 with 16 GeForce GTX 1080Ti GPUs, CoCoD-SGD is up to 2-3x faster than traditional synchronous SGD.
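A toy, single-process sketch of the decoupling pattern described above (the simulated all-reduce, merge rule, and problem setup are assumptions for illustration, not the paper's CoCoD-SGD implementation): each round launches the averaging of a model snapshot in a background thread and keeps taking local SGD steps while that "communication" is in flight, merging the averaged model with the local progress once it arrives.

import threading
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(32, 8))          # synthetic least-squares problem
b = rng.normal(size=32)               # loss(w) = 0.5 * ||A w - b||^2

def grad(w, idx):
    """Mini-batch gradient on the rows in idx."""
    return A[idx].T @ (A[idx] @ w - b[idx])

def average(models, out):
    """Stands in for an all-reduce across workers."""
    out["avg"] = np.mean(models, axis=0)

n_workers, local_steps, rounds, lr = 2, 5, 20, 0.01
workers = [np.zeros(8) for _ in range(n_workers)]

for r in range(rounds):
    # Kick off "communication" on the current models in the background.
    snapshot = [w.copy() for w in workers]
    result = {}
    comm = threading.Thread(target=average, args=(snapshot, result))
    comm.start()

    # Meanwhile, keep computing local SGD steps (the decoupled part).
    local_progress = []
    for w in workers:
        start = w.copy()
        for _ in range(local_steps):
            idx = rng.choice(len(b), size=8, replace=False)
            w = w - lr * grad(w, idx)
        local_progress.append(w - start)

    # Merge: averaged snapshot plus each worker's local progress since it.
    comm.join()
    workers = [result["avg"] + d for d in local_progress]

print("final loss:", 0.5 * np.linalg.norm(A @ workers[0] - b) ** 2)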


2018 ◽  
Vol 4 (1) ◽  
pp. 3
Author(s):  
Rene Bidart ◽  
Alexander Wong

In this study, we explore the training of monolithic deep neural networks in an effective manner. One of the biggest challenges with training such networks to the desired level of accuracy is the difficulty in converging to a good solution using iterative optimization methods such as stochastic gradient descent, due to the enormous number of parameters that need to be learned. To address this, we introduce a partitioned training strategy, where proxy layers are connected to different partitions of a deep neural network to enable isolated training of a much smaller number of parameters to convergence. To illustrate the efficacy of this training strategy, we introduce MonolithNet, a massive residual deep neural network consisting of 437 million parameters. The trained MonolithNet was able to achieve a top-1 accuracy of 97% on the CIFAR10 image classification dataset, which demonstrates the feasibility of the proposed training strategy for training monolithic deep neural networks to high accuracies.
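A minimal sketch of the partitioned-training idea described above (hypothetical layer sizes, random stand-in data, and a throwaway proxy head; not the MonolithNet architecture or code): freeze the network, unfreeze one partition, attach a small proxy classifier to that partition's output, and optimize only those parameters before moving to the next partition.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Three partitions of a larger network; out_dims records each partition's width.
partitions = nn.ModuleList([
    nn.Sequential(nn.Linear(32, 64), nn.ReLU()),
    nn.Sequential(nn.Linear(64, 64), nn.ReLU()),
    nn.Sequential(nn.Linear(64, 10)),
])
out_dims = [64, 64, 10]

def forward_to(x, i):
    """Run the (frozen) earlier partitions without grad, then partition i."""
    with torch.no_grad():
        for part in partitions[:i]:
            x = part(x)
    return partitions[i](x)

def train_partition(i, steps=200, lr=0.1):
    # Freeze everything, then unfreeze only partition i.
    for p in partitions.parameters():
        p.requires_grad_(False)
    for p in partitions[i].parameters():
        p.requires_grad_(True)

    proxy = nn.Linear(out_dims[i], 10)            # throwaway proxy head
    opt = torch.optim.SGD(
        list(partitions[i].parameters()) + list(proxy.parameters()), lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(steps):
        x = torch.randn(64, 32)                   # stand-in mini-batch
        y = torch.randint(0, 10, (64,))
        loss = loss_fn(proxy(forward_to(x, i)), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

for i in range(len(partitions)):
    print(f"partition {i}: final proxy loss {train_partition(i):.3f}")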

