Accelerating deep neural network training with inconsistent stochastic gradient descent

2017 ◽  
Vol 93 ◽  
pp. 219-229 ◽  
Author(s):  
Linnan Wang ◽  
Yi Yang ◽  
Renqiang Min ◽  
Srimat Chakradhar
2020 ◽  
pp. 1-41 ◽  
Author(s):  
Benny Avelin ◽  
Kaj Nyström

In this paper, we prove that, in the deep limit, stochastic gradient descent on a ResNet-type deep neural network, where each layer shares the same weight matrix, converges to stochastic gradient descent for a Neural ODE, and that the corresponding value/loss functions converge. Our result gives, in the context of minimization by stochastic gradient descent, a theoretical foundation for considering Neural ODEs as the deep limit of ResNets. Our proof is based on certain decay estimates for associated Fokker–Planck equations.
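As a rough illustration of the correspondence described above (a sketch under assumed notation, not the authors' construction): a residual block with a single shared weight matrix applied N times with step size 1/N is the forward-Euler discretization of a Neural ODE, so its forward pass approaches the ODE trajectory as N grows.

import numpy as np

# Illustrative sketch (not the authors' code): a ResNet whose layers all share
# one weight matrix W, with residual step size 1/N, is the forward-Euler
# discretization of the Neural ODE  dx/dt = tanh(W x)  on t in [0, 1].

def resnet_forward(x, W, N):
    """N shared-weight residual layers: x <- x + (1/N) * tanh(W x)."""
    h = x.copy()
    for _ in range(N):
        h = h + (1.0 / N) * np.tanh(W @ h)
    return h

def neural_ode_forward(x, W, steps=10000):
    """Fine-grained Euler solve of dx/dt = tanh(W x) on t in [0, 1]."""
    h = x.copy()
    dt = 1.0 / steps
    for _ in range(steps):
        h = h + dt * np.tanh(W @ h)
    return h

rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(4, 4))
x0 = rng.normal(size=4)

# The gap shrinks as the discrete network gets deeper.
for N in (4, 16, 64, 256):
    gap = np.linalg.norm(resnet_forward(x0, W, N) - neural_ode_forward(x0, W))
    print(f"N={N:4d}  ||ResNet - ODE|| = {gap:.6f}")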


Author(s):  
Shuheng Shen ◽  
Linli Xu ◽  
Jingchang Liu ◽  
Xianfeng Liang ◽  
Yifei Cheng

With the increase in the amount of data and the expansion of model scale, distributed parallel training has become an important and successful technique for addressing the resulting optimization challenges. Nevertheless, although distributed stochastic gradient descent (SGD) algorithms can achieve a linear iteration speedup, in practice they are limited significantly by communication cost, making it difficult to achieve a linear time speedup. In this paper, we propose a computation and communication decoupled stochastic gradient descent (CoCoD-SGD) algorithm that runs computation and communication in parallel to reduce the communication cost. We prove that CoCoD-SGD has a linear iteration speedup with respect to the total computation capability of the hardware resources. In addition, it has lower communication complexity and better time speedup compared with traditional distributed SGD algorithms. Experiments on deep neural network training demonstrate the significant improvements of CoCoD-SGD: when training ResNet18 and VGG16 with 16 GeForce GTX 1080Ti GPUs, CoCoD-SGD is up to 2-3x faster than traditional synchronous SGD.
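A toy, single-process sketch of the decoupling pattern described above (the simulated all-reduce, merge rule, and problem setup are assumptions for illustration, not the paper's CoCoD-SGD implementation): each round launches the averaging of a model snapshot in a background thread and keeps taking local SGD steps while that "communication" is in flight, merging the averaged model with the local progress once it arrives.

import threading
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(32, 8))          # synthetic least-squares problem
b = rng.normal(size=32)               # loss(w) = 0.5 * ||A w - b||^2

def grad(w, idx):
    """Mini-batch gradient on the rows in idx."""
    return A[idx].T @ (A[idx] @ w - b[idx])

def average(models, out):
    """Stands in for an all-reduce across workers."""
    out["avg"] = np.mean(models, axis=0)

n_workers, local_steps, rounds, lr = 2, 5, 20, 0.01
workers = [np.zeros(8) for _ in range(n_workers)]

for r in range(rounds):
    # Kick off "communication" on the current models in the background.
    snapshot = [w.copy() for w in workers]
    result = {}
    comm = threading.Thread(target=average, args=(snapshot, result))
    comm.start()

    # Meanwhile, keep computing local SGD steps (the decoupled part).
    local_progress = []
    for w in workers:
        start = w.copy()
        for _ in range(local_steps):
            idx = rng.choice(len(b), size=8, replace=False)
            w = w - lr * grad(w, idx)
        local_progress.append(w - start)

    # Merge: averaged snapshot plus each worker's local progress since it.
    comm.join()
    workers = [result["avg"] + d for d in local_progress]

print("final loss:", 0.5 * np.linalg.norm(A @ workers[0] - b) ** 2)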


2018 ◽  
Vol 4 (1) ◽  
pp. 3
Author(s):  
Rene Bidart ◽  
Alexander Wong

In this study, we explore the training of monolithic deep neural networks in an effective manner. One of the biggest challenges with training such networks to the desired level of accuracy is the difficulty in converging to a good solution using iterative optimization methods such as stochastic gradient descent, due to the enormous number of parameters that need to be learned. To address this, we introduce a partitioned training strategy, where proxy layers are connected to different partitions of a deep neural network to enable isolated training of a much smaller number of parameters to convergence. To illustrate the efficacy of this training strategy, we introduce MonolithNet, a massive residual deep neural network consisting of 437 million parameters. The trained MonolithNet was able to achieve a top-1 accuracy of 97% on the CIFAR10 image classification dataset, which demonstrates the feasibility of the proposed training strategy for training monolithic deep neural networks to high accuracies.
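A minimal sketch of the partitioned-training idea described above (hypothetical layer sizes, random stand-in data, and a throwaway proxy head; not the MonolithNet architecture or code): freeze the network, unfreeze one partition, attach a small proxy classifier to that partition's output, and optimize only those parameters before moving to the next partition.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Three partitions of a larger network; out_dims records each partition's width.
partitions = nn.ModuleList([
    nn.Sequential(nn.Linear(32, 64), nn.ReLU()),
    nn.Sequential(nn.Linear(64, 64), nn.ReLU()),
    nn.Sequential(nn.Linear(64, 10)),
])
out_dims = [64, 64, 10]

def forward_to(x, i):
    """Run the (frozen) earlier partitions without grad, then partition i."""
    with torch.no_grad():
        for part in partitions[:i]:
            x = part(x)
    return partitions[i](x)

def train_partition(i, steps=200, lr=0.1):
    # Freeze everything, then unfreeze only partition i.
    for p in partitions.parameters():
        p.requires_grad_(False)
    for p in partitions[i].parameters():
        p.requires_grad_(True)

    proxy = nn.Linear(out_dims[i], 10)            # throwaway proxy head
    opt = torch.optim.SGD(
        list(partitions[i].parameters()) + list(proxy.parameters()), lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(steps):
        x = torch.randn(64, 32)                   # stand-in mini-batch
        y = torch.randint(0, 10, (64,))
        loss = loss_fn(proxy(forward_to(x, i)), y)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return loss.item()

for i in range(len(partitions)):
    print(f"partition {i}: final proxy loss {train_partition(i):.3f}")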

