Parallel Restarted SGD with Faster Convergence and Less Communication: Demystifying Why Model Averaging Works for Deep Learning

Author(s):  
Hao Yu ◽  
Sen Yang ◽  
Shenghuo Zhu

In distributed training of deep neural networks, parallel mini-batch SGD is widely used to speed up training with multiple workers. Each worker samples local stochastic gradients in parallel, a single server aggregates all gradients to obtain their average, and each worker's local model is updated with an SGD step using the averaged gradient. Ideally, parallel mini-batch SGD achieves a linear speed-up of training time (with respect to the number of workers) compared with SGD on a single worker. In practice, however, this linear scalability is significantly limited by the growing demand for gradient communication as more workers are involved. Model averaging, which periodically averages the individual models trained on parallel workers, is another common practice for distributed training of deep neural networks since (Zinkevich et al. 2010; McDonald, Hall, and Mann 2010). Compared with parallel mini-batch SGD, model averaging significantly reduces communication overhead. Impressively, extensive experimental work has verified that model averaging can still achieve a good training-time speed-up as long as the averaging interval is carefully controlled. However, it remains a mystery in theory why such a simple heuristic works so well. This paper provides a thorough and rigorous theoretical study of why model averaging can work as well as parallel mini-batch SGD with significantly less communication overhead.
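For concreteness, the local-SGD/model-averaging pattern the paper analyzes can be sketched in a few lines. The toy least-squares objective, worker count, and averaging interval below are illustrative choices, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: minimize ||X w - y||^2 / n.
d, n = 10, 1000
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def stochastic_grad(w, batch):
    xb, yb = X[batch], y[batch]
    return 2 * xb.T @ (xb @ w - yb) / len(batch)

K = 4          # number of workers
I = 10         # averaging interval: local steps between communications
rounds = 50    # number of communication rounds
lr = 0.01

# Every worker starts from the same model.
workers = [np.zeros(d) for _ in range(K)]

for _ in range(rounds):
    # Each worker runs I local SGD steps on its own mini-batches ...
    for k in range(K):
        for _ in range(I):
            batch = rng.choice(n, size=32, replace=False)
            workers[k] -= lr * stochastic_grad(workers[k], batch)
    # ... then all local models are averaged (one communication round),
    # instead of communicating a gradient at every single step.
    avg = np.mean(workers, axis=0)
    workers = [avg.copy() for _ in range(K)]

print("distance to w_true:", np.linalg.norm(workers[0] - w_true))
```

Communication happens once every I steps rather than every step, which is exactly the overhead reduction the paper quantifies against parallel mini-batch SGD (the I = 1 special case).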

2021 ◽  
Vol 14 (5) ◽  
pp. 771-784
Author(s):  
Jayashree Mohan ◽  
Amar Phanishayee ◽  
Ashish Raniwala ◽  
Vijay Chidambaram

Training Deep Neural Networks (DNNs) is resource-intensive and time-consuming. While prior research has explored many different ways of reducing DNN training time, the impact of the input data pipeline, i.e., fetching raw data items from storage and performing data pre-processing in memory, has been relatively unexplored. This paper makes the following contributions: (1) We present the first comprehensive analysis of how the input data pipeline affects the training time of widely-used computer vision and audio Deep Neural Networks (DNNs), which typically involve complex data pre-processing. We analyze nine different models across three tasks and four datasets while varying factors such as the amount of memory, number of CPU threads, storage device, and GPU generation on servers that are part of a large production cluster at Microsoft. We find that in many cases, DNN training time is dominated by data stall time: time spent waiting for data to be fetched and pre-processed. (2) We build a tool, DS-Analyzer, to precisely measure data stalls using a differential technique, and perform predictive what-if analysis on data stalls. (3) Finally, based on the insights from our analysis, we design and implement three simple but effective techniques in a data-loading library, CoorDL, to mitigate data stalls. Our experiments on a range of DNN tasks, models, datasets, and hardware configurations show that when PyTorch uses CoorDL instead of the state-of-the-art DALI data-loading library, DNN training time is reduced significantly (by as much as 5X on a single server).
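The differential idea behind measuring data stalls can be approximated as follows: time an epoch with the full input pipeline, then time the same number of training steps on a single cached in-memory batch; the gap estimates the stall time. This is a hedged sketch of the idea, not DS-Analyzer's actual implementation; the dataset and its sleep-based pre-processing cost are stand-ins:

```python
import time
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

class SlowImages(Dataset):
    """Stand-in dataset; sleep emulates storage fetch + CPU pre-processing."""
    def __len__(self):
        return 512
    def __getitem__(self, i):
        time.sleep(0.002)
        return torch.randn(3, 32, 32), i % 10

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def train_step(x, y):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

loader = DataLoader(SlowImages(), batch_size=64)

# Run 1: full pipeline (fetch + pre-process + train).
t0 = time.perf_counter()
for x, y in loader:
    train_step(x, y)
t_full = time.perf_counter() - t0

# Run 2: same number of steps on one cached in-memory batch (compute only).
x, y = next(iter(loader))
t0 = time.perf_counter()
for _ in range(len(loader)):
    train_step(x, y)
t_compute = time.perf_counter() - t0

print(f"full: {t_full:.2f}s  compute-only: {t_compute:.2f}s  "
      f"estimated data stall: {t_full - t_compute:.2f}s")
```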


2017 ◽  
Vol 109 (1) ◽  
pp. 29-38 ◽  
Author(s):  
Valentin Deyringer ◽  
Alexander Fraser ◽  
Helmut Schmid ◽  
Tsuyoshi Okita

Abstract Neural networks are prevalent in today's NLP research. Despite their success on different tasks, training time is relatively long. We use Hogwild! to counteract this phenomenon and show that it is a suitable method to speed up the training of neural networks of different architectures and complexity. For POS tagging and translation we report considerable training speedups, especially for the latter. We show that Hogwild! can be an important tool for training complex NLP architectures.
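Hogwild! runs SGD from several processes over shared parameters with no locking, tolerating the occasional racy update. A minimal PyTorch sketch of that pattern follows; the toy linearly-separable task is an assumption for illustration, not the paper's POS-tagging or translation setup:

```python
import torch
import torch.nn as nn
import torch.multiprocessing as mp

def worker(model, seed, steps=200, lr=0.05):
    """Each process updates the SHARED parameters without any locking."""
    g = torch.Generator().manual_seed(seed)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        x = torch.randn(32, 20, generator=g)
        y = (x.sum(dim=1) > 0).long()        # toy separable task
        opt.zero_grad()
        nn.functional.cross_entropy(model(x), y).backward()
        opt.step()                           # races are tolerated by design

if __name__ == "__main__":
    model = nn.Linear(20, 2)
    model.share_memory()                     # parameters live in shared memory
    procs = [mp.Process(target=worker, args=(model, s)) for s in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    x = torch.randn(1000, 20)
    acc = (model(x).argmax(1) == (x.sum(1) > 0).long()).float().mean()
    print(f"accuracy after lock-free training: {acc:.2f}")
```

Because sparse or small gradient updates rarely collide, the lost-update races cost little accuracy while removing all synchronization overhead.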


Author(s):  
Chunyuan Li ◽  
Changyou Chen ◽  
Yunchen Pu ◽  
Ricardo Henao ◽  
Lawrence Carin

Learning probability distributions over the weights of neural networks has recently proven beneficial in many applications. Bayesian methods such as Stochastic Gradient Markov Chain Monte Carlo (SG-MCMC) offer an elegant framework to reason about model uncertainty in neural networks. However, these advantages usually come at a high computational cost. We propose accelerating SG-MCMC under the master-worker framework: workers asynchronously and in parallel share responsibility for gradient computations, while the master collects the final samples. To reduce communication overhead, two protocols (downpour and elastic) are developed to allow periodic interaction between the master and the workers. We provide a theoretical analysis of the finite-time estimation consistency of posterior expectations and establish connections to sample thinning. Our experiments on various neural networks demonstrate that the proposed algorithms can greatly reduce training time while achieving comparable (or better) test accuracy/log-likelihood, relative to traditional SG-MCMC. When applied to reinforcement learning, the approach naturally provides exploration for asynchronous policy optimization, with encouraging performance improvements.
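The basic kernel each worker would run is an SG-MCMC update such as Stochastic Gradient Langevin Dynamics (SGLD), which adds Gaussian noise to a stochastic-gradient step so the iterates sample from the posterior instead of collapsing to a mode. Below is a serial sketch on a toy Bayesian linear regression; the downpour/elastic master-worker protocols themselves are not reproduced here, and all problem sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy Bayesian linear regression: posterior over w given (X, y).
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
sigma = 0.3                                  # observation noise std
y = X @ w_true + sigma * rng.normal(size=n)

def grad_log_post(w, batch, prior_var=10.0):
    """Stochastic gradient of the log posterior, rescaled to the full dataset."""
    xb, yb = X[batch], y[batch]
    grad_lik = (n / len(batch)) * xb.T @ (yb - xb @ w) / sigma**2
    grad_prior = -w / prior_var
    return grad_lik + grad_prior

eps = 1e-4                                   # step size
w = np.zeros(d)
samples = []
for t in range(5000):
    batch = rng.choice(n, size=32, replace=False)
    noise = rng.normal(scale=np.sqrt(eps), size=d)
    # SGLD: half a gradient step plus injected Gaussian noise.
    w = w + 0.5 * eps * grad_log_post(w, batch) + noise
    if t > 1000:                             # discard burn-in
        samples.append(w.copy())

print("posterior mean estimate:", np.mean(samples, axis=0))
```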


2018 ◽  
Vol 173 ◽  
pp. 01009 ◽  
Author(s):  
Gennady Ososkov ◽  
Pavel Goncharov

The paper demonstrates the advantages of deep learning networks over ordinary neural networks through their comparative application to image classification. An autoassociative neural network is used as a standalone autoencoder to first extract the most informative features of the input data, which the networks being compared then use as classifiers. Most of the effort in working with deep learning networks goes into the painstaking work of optimizing network structures and their components, such as activation functions and weights, as well as the procedures for minimizing the loss function, in order to improve performance and speed up learning. It is also shown that deep autoencoders, after dedicated training, develop a remarkable ability to denoise images. Convolutional Neural Networks are also applied to a topical problem in protein genetics, using durum wheat classification as an example. The results of our comparative study demonstrate the clear advantage of deep networks, as well as the denoising power of autoencoders. In our work we use both GPUs and cloud services to speed up the calculations.
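As a sketch of the autoencoder-based feature extraction and denoising described above: train a small autoencoder to reconstruct clean data from corrupted inputs, then reuse its encoder output as compact features for a downstream classifier. Layer sizes, noise level, and the synthetic data below are illustrative assumptions:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A small denoising autoencoder: corrupt the input, reconstruct the original.
class DenoisingAE(nn.Module):
    def __init__(self, dim=64, code=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, code))
        self.dec = nn.Sequential(nn.Linear(code, 32), nn.ReLU(), nn.Linear(32, dim))
    def forward(self, x):
        return self.dec(self.enc(x))

model = DenoisingAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

clean = torch.rand(2048, 64)                        # stand-in "images"
for epoch in range(20):
    noisy = clean + 0.2 * torch.randn_like(clean)   # corrupt the inputs
    loss = nn.functional.mse_loss(model(noisy), clean)  # target is the CLEAN data
    opt.zero_grad()
    loss.backward()
    opt.step()

# The encoder output now serves as a compact feature vector for classifiers.
features = model.enc(clean)
print("code shape:", features.shape, "final loss:", float(loss))
```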


Sensors ◽  
2021 ◽  
Vol 21 (2) ◽  
pp. 501
Author(s):  
Marcin Malesa ◽  
Piotr Rajkiewicz

Product quality control is currently the leading trend in industrial production, moving toward exact analysis of each product before it reaches the end customer. Every stage of production control is of particular importance in the food and pharmaceutical industries, where, apart from visual issues, additional safety regulations apply. Many production processes can be controlled completely contactlessly through the use of machine vision cameras and advanced image processing techniques. The most dynamically growing class of image analysis methods comprises solutions based on deep neural networks, whose major advantages are fast performance, robustness, and applicability even to complicated classification problems. However, the use of machine learning methods on high-performance production lines may be limited by inference time or, in the case of multi-format production lines, training time. The article presents a novel data pre-processing (or calibration) method that uses prior knowledge about the optical system, enabling the use of a lightweight Convolutional Neural Network (CNN) model for product quality control of polyethylene terephthalate (PET) bottle caps. The combination of pre-processing with the lightweight CNN model resulted in at least a five-fold reduction in prediction and training time compared to the lighter standard models benchmarked on ImageNet, without loss of accuracy.
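The abstract does not detail the optics-based calibration step, so the sketch below covers only the other half of the recipe: a deliberately lightweight CNN of the kind that becomes viable once inputs are normalized by such pre-processing. The architecture and input size are generic examples, not the paper's model:

```python
import torch
import torch.nn as nn

# A tiny CNN for binary accept/reject classification of pre-processed
# (calibrated) cap crops; the calibration itself is specific to the
# paper's optical setup and is not reproduced here.
class TinyCapNet(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, n_classes)
    def forward(self, x):
        return self.head(self.features(x).flatten(1))

model = TinyCapNet()
x = torch.randn(4, 3, 64, 64)       # a batch of calibrated crops
print(model(x).shape)               # -> torch.Size([4, 2])
print(sum(p.numel() for p in model.parameters()), "parameters")
```

A network this small trains and infers orders of magnitude faster than ImageNet-scale backbones, which is the trade-off the calibration step is meant to unlock.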


2021 ◽  
Vol 11 (18) ◽  
pp. 8441
Author(s):  
Anh-Cang Phan ◽  
Ngoc-Hoang-Quyen Nguyen ◽
Thanh-Ngoan Trieu ◽  
Thuong-Cang Phan

Drowsy driving is one of the common causes of road accidents, resulting in injuries, even death, and significant economic losses for drivers, road users, families, and society. Many studies have attempted to detect drowsiness for alert systems. However, a majority of them focus on determining eyelid and mouth movements, which have revealed many limitations for drowsiness detection. Moreover, studies based on physiological measures may not be feasible in practice because the measuring devices are often not available in vehicles and are often uncomfortable for drivers. In this research, we therefore propose two efficient methods with three scenarios for drowsiness alert systems. The former applies facial landmarks to detect blinks and yawns based on thresholds appropriate for each driver. The latter uses deep learning techniques with two adaptive deep neural networks based on MobileNet-V2 and ResNet-50V2; this second method analyzes the videos and detects the driver's activities in every frame to learn all features automatically. We leverage transfer learning to train the proposed networks on our training dataset, which addresses the problem of limited training data, provides fast training time, and retains the advantages of deep neural networks. Experiments were conducted to test the effectiveness of our methods compared with other methods. Empirical results demonstrate that the proposed method using deep learning techniques can achieve a high accuracy of 97%. This study provides meaningful practical solutions to prevent automobile accidents caused by drowsiness.
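One common way to implement landmark-based blink detection with per-driver thresholds is the eye aspect ratio (EAR); the abstract does not name the exact metric, so treat the following as an illustrative formulation rather than the paper's method. The landmark coordinates below are synthetic stand-ins:

```python
import numpy as np

def eye_aspect_ratio(eye):
    """eye: six (x, y) landmarks around one eye, ordered p1..p6 as in the
    common 68-point facial-landmark convention."""
    p1, p2, p3, p4, p5, p6 = eye
    vertical = np.linalg.norm(p2 - p6) + np.linalg.norm(p3 - p5)
    horizontal = np.linalg.norm(p1 - p4)
    return vertical / (2.0 * horizontal)

def calibrate_threshold(open_eye_ears, factor=0.7):
    """Per-driver threshold: a fraction of that driver's own open-eye EAR."""
    return factor * np.mean(open_eye_ears)

# Illustrative frames: EAR drops well below the threshold during a blink.
open_eye = np.array([[0, 2], [2, 3], [4, 3], [6, 2], [4, 1], [2, 1]], float)
closed_eye = np.array([[0, 2], [2, 2.2], [4, 2.2], [6, 2], [4, 1.8], [2, 1.8]], float)

thr = calibrate_threshold([eye_aspect_ratio(open_eye)])
for name, eye in [("open", open_eye), ("closed", closed_eye)]:
    ear = eye_aspect_ratio(eye)
    print(f"{name}: EAR={ear:.2f} blink={'yes' if ear < thr else 'no'}")
```

Calibrating the threshold per driver, as the first method does, avoids a fixed cutoff misfiring on drivers with naturally narrower or wider eyes.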

