Parallel Restarted SGD with Faster Convergence and Less Communication: Demystifying Why Model Averaging Works for Deep Learning

Author(s):  
Hao Yu ◽  
Sen Yang ◽  
Shenghuo Zhu

In distributed training of deep neural networks, parallel mini-batch SGD is widely used to speed up training with multiple workers. Each worker samples local stochastic gradients in parallel, a single server aggregates all gradients to obtain their average, and each worker's local model is updated with an SGD step using the averaged gradient. Ideally, parallel mini-batch SGD achieves a linear speed-up of training time (with respect to the number of workers) compared with SGD on a single worker. In practice, however, this linear scalability is significantly limited by the growing demand for gradient communication as more workers are involved. Model averaging, which periodically averages the individual models trained on parallel workers, is another common practice for distributed training of deep neural networks since (Zinkevich et al. 2010; McDonald, Hall, and Mann 2010). Compared with parallel mini-batch SGD, model averaging significantly reduces communication overhead. Impressively, extensive experimental work has verified that model averaging can still achieve a good training-time speed-up as long as the averaging interval is carefully controlled. However, it remains a mystery in theory why such a simple heuristic works so well. This paper provides a thorough and rigorous theoretical study of why model averaging can work as well as parallel mini-batch SGD with significantly less communication overhead.
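For concreteness, the local-SGD/model-averaging pattern the paper analyzes can be sketched in a few lines. The toy least-squares objective, worker count, and averaging interval below are illustrative choices, not the paper's experimental setup:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy least-squares problem: minimize ||X w - y||^2 / n.
d, n = 10, 1000
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = X @ w_true + 0.1 * rng.normal(size=n)

def stochastic_grad(w, batch):
    xb, yb = X[batch], y[batch]
    return 2 * xb.T @ (xb @ w - yb) / len(batch)

K = 4          # number of workers
I = 10         # averaging interval: local steps between communications
rounds = 50    # number of communication rounds
lr = 0.01

# Every worker starts from the same model.
workers = [np.zeros(d) for _ in range(K)]

for _ in range(rounds):
    # Each worker runs I local SGD steps on its own mini-batches ...
    for k in range(K):
        for _ in range(I):
            batch = rng.choice(n, size=32, replace=False)
            workers[k] -= lr * stochastic_grad(workers[k], batch)
    # ... then all local models are averaged (one communication round),
    # instead of communicating a gradient at every single step.
    avg = np.mean(workers, axis=0)
    workers = [avg.copy() for _ in range(K)]

print("distance to w_true:", np.linalg.norm(workers[0] - w_true))
```

Communication happens once every I steps rather than every step, which is exactly the overhead reduction the paper quantifies against parallel mini-batch SGD (the I = 1 special case).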

2021 ◽  
Vol 14 (5) ◽  
pp. 771-784
Author(s):  
Jayashree Mohan ◽  
Amar Phanishayee ◽  
Ashish Raniwala ◽  
Vijay Chidambaram

Training Deep Neural Networks (DNNs) is resource-intensive and time-consuming. While prior research has explored many different ways of reducing DNN training time, the impact of the input data pipeline, i.e., fetching raw data items from storage and performing data pre-processing in memory, has been relatively unexplored. This paper makes the following contributions: (1) We present the first comprehensive analysis of how the input data pipeline affects the training time of widely-used computer vision and audio Deep Neural Networks (DNNs), which typically involve complex data pre-processing. We analyze nine different models across three tasks and four datasets while varying factors such as the amount of memory, number of CPU threads, storage device, and GPU generation on servers that are part of a large production cluster at Microsoft. We find that in many cases, DNN training time is dominated by data stall time: time spent waiting for data to be fetched and pre-processed. (2) We build a tool, DS-Analyzer, to precisely measure data stalls using a differential technique, and perform predictive what-if analysis on data stalls. (3) Finally, based on the insights from our analysis, we design and implement three simple but effective techniques in a data-loading library, CoorDL, to mitigate data stalls. Our experiments on a range of DNN tasks, models, datasets, and hardware configurations show that when PyTorch uses CoorDL instead of the state-of-the-art DALI data-loading library, DNN training time is reduced significantly (by as much as 5X on a single server).
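The differential idea behind measuring data stalls can be approximated as follows: time an epoch with the full input pipeline, then time the same number of training steps on a single cached in-memory batch; the gap estimates the stall time. This is a hedged sketch of the idea, not DS-Analyzer's actual implementation; the dataset and its sleep-based pre-processing cost are stand-ins:

```python
import time
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

class SlowImages(Dataset):
    """Stand-in dataset; sleep emulates storage fetch + CPU pre-processing."""
    def __len__(self):
        return 512
    def __getitem__(self, i):
        time.sleep(0.002)
        return torch.randn(3, 32, 32), i % 10

model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))
opt = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

def train_step(x, y):
    opt.zero_grad()
    loss_fn(model(x), y).backward()
    opt.step()

loader = DataLoader(SlowImages(), batch_size=64)

# Run 1: full pipeline (fetch + pre-process + train).
t0 = time.perf_counter()
for x, y in loader:
    train_step(x, y)
t_full = time.perf_counter() - t0

# Run 2: same number of steps on one cached in-memory batch (compute only).
x, y = next(iter(loader))
t0 = time.perf_counter()
for _ in range(len(loader)):
    train_step(x, y)
t_compute = time.perf_counter() - t0

print(f"full: {t_full:.2f}s  compute-only: {t_compute:.2f}s  "
      f"estimated data stall: {t_full - t_compute:.2f}s")
```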


2017 ◽  
Vol 109 (1) ◽  
pp. 29-38 ◽  
Author(s):  
Valentin Deyringer ◽  
Alexander Fraser ◽  
Helmut Schmid ◽  
Tsuyoshi Okita

Abstract Neural networks are prevalent in today's NLP research. Despite their success on different tasks, training time is relatively long. We use Hogwild! to counteract this phenomenon and show that it is a suitable method to speed up the training of neural networks of different architectures and complexity. For POS tagging and translation we report considerable training speedups, especially for the latter. We show that Hogwild! can be an important tool for training complex NLP architectures.
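Hogwild! runs SGD from several processes over shared parameters with no locking, tolerating the occasional racy update. A minimal PyTorch sketch of that pattern follows; the toy linearly-separable task is an assumption for illustration, not the paper's POS-tagging or translation setup:

```python
import torch
import torch.nn as nn
import torch.multiprocessing as mp

def worker(model, seed, steps=200, lr=0.05):
    """Each process updates the SHARED parameters without any locking."""
    g = torch.Generator().manual_seed(seed)
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(steps):
        x = torch.randn(32, 20, generator=g)
        y = (x.sum(dim=1) > 0).long()        # toy separable task
        opt.zero_grad()
        nn.functional.cross_entropy(model(x), y).backward()
        opt.step()                           # races are tolerated by design

if __name__ == "__main__":
    model = nn.Linear(20, 2)
    model.share_memory()                     # parameters live in shared memory
    procs = [mp.Process(target=worker, args=(model, s)) for s in range(4)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    x = torch.randn(1000, 20)
    acc = (model(x).argmax(1) == (x.sum(1) > 0).long()).float().mean()
    print(f"accuracy after lock-free training: {acc:.2f}")
```

Because sparse or small gradient updates rarely collide, the lost-update races cost little accuracy while removing all synchronization overhead.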


Author(s):  
Chunyuan Li ◽  
Changyou Chen ◽  
Yunchen Pu ◽  
Ricardo Henao ◽  
Lawrence Carin

Learning probability distributions over the weights of neural networks has recently proven beneficial in many applications. Bayesian methods such as Stochastic Gradient Markov Chain Monte Carlo (SG-MCMC) offer an elegant framework to reason about model uncertainty in neural networks. However, these advantages usually come at a high computational cost. We propose accelerating SG-MCMC under the master-worker framework: workers asynchronously and in parallel share responsibility for gradient computations, while the master collects the final samples. To reduce communication overhead, two protocols (downpour and elastic) are developed to allow periodic interaction between the master and the workers. We provide a theoretical analysis of the finite-time estimation consistency of posterior expectations and establish connections to sample thinning. Our experiments on various neural networks demonstrate that the proposed algorithms can greatly reduce training time while achieving comparable (or better) test accuracy/log-likelihood, relative to traditional SG-MCMC. When applied to reinforcement learning, the approach naturally provides exploration for asynchronous policy optimization, with encouraging performance improvements.
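The basic kernel each worker would run is an SG-MCMC update such as Stochastic Gradient Langevin Dynamics (SGLD), which adds Gaussian noise to a stochastic-gradient step so the iterates sample from the posterior instead of collapsing to a mode. Below is a serial sketch on a toy Bayesian linear regression; the downpour/elastic master-worker protocols themselves are not reproduced here, and all problem sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy Bayesian linear regression: posterior over w given (X, y).
n, d = 200, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
sigma = 0.3                                  # observation noise std
y = X @ w_true + sigma * rng.normal(size=n)

def grad_log_post(w, batch, prior_var=10.0):
    """Stochastic gradient of the log posterior, rescaled to the full dataset."""
    xb, yb = X[batch], y[batch]
    grad_lik = (n / len(batch)) * xb.T @ (yb - xb @ w) / sigma**2
    grad_prior = -w / prior_var
    return grad_lik + grad_prior

eps = 1e-4                                   # step size
w = np.zeros(d)
samples = []
for t in range(5000):
    batch = rng.choice(n, size=32, replace=False)
    noise = rng.normal(scale=np.sqrt(eps), size=d)
    # SGLD: half a gradient step plus injected Gaussian noise.
    w = w + 0.5 * eps * grad_log_post(w, batch) + noise
    if t > 1000:                             # discard burn-in
        samples.append(w.copy())

print("posterior mean estimate:", np.mean(samples, axis=0))
```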


2018 ◽  
Vol 173 ◽  
pp. 01009 ◽  
Author(s):  
Gennady Ososkov ◽  
Pavel Goncharov

The paper demonstrates the advantages of deep learning networks over ordinary neural networks through their comparative application to image classification. An autoassociative neural network is used as a standalone autoencoder to first extract the most informative features of the input data, which the networks being compared then use as classifiers. Most of the effort in working with deep learning networks goes into the painstaking work of optimizing network structures and their components, such as activation functions and weights, as well as the procedures for minimizing the loss function, in order to improve performance and speed up learning. It is also shown that deep autoencoders, after dedicated training, develop a remarkable ability to denoise images. Convolutional Neural Networks are also applied to a topical problem in protein genetics, using durum wheat classification as an example. The results of our comparative study demonstrate the clear advantage of deep networks, as well as the denoising power of autoencoders. In our work we use both GPUs and cloud services to speed up the calculations.
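As a sketch of the autoencoder-based feature extraction and denoising described above: train a small autoencoder to reconstruct clean data from corrupted inputs, then reuse its encoder output as compact features for a downstream classifier. Layer sizes, noise level, and the synthetic data below are illustrative assumptions:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A small denoising autoencoder: corrupt the input, reconstruct the original.
class DenoisingAE(nn.Module):
    def __init__(self, dim=64, code=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim, 32), nn.ReLU(), nn.Linear(32, code))
        self.dec = nn.Sequential(nn.Linear(code, 32), nn.ReLU(), nn.Linear(32, dim))
    def forward(self, x):
        return self.dec(self.enc(x))

model = DenoisingAE()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

clean = torch.rand(2048, 64)                        # stand-in "images"
for epoch in range(20):
    noisy = clean + 0.2 * torch.randn_like(clean)   # corrupt the inputs
    loss = nn.functional.mse_loss(model(noisy), clean)  # target is the CLEAN data
    opt.zero_grad()
    loss.backward()
    opt.step()

# The encoder output now serves as a compact feature vector for classifiers.
features = model.enc(clean)
print("code shape:", features.shape, "final loss:", float(loss))
```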


Sensors ◽  
2021 ◽  
Vol 21 (2) ◽  
pp. 501
Author(s):  
Marcin Malesa ◽  
Piotr Rajkiewicz

Product quality control is currently the leading trend in industrial production, moving toward exact analysis of each product before it reaches the end customer. Every stage of production control is of particular importance in the food and pharmaceutical industries, where, apart from visual issues, additional safety regulations apply. Many production processes can be controlled completely contactlessly through the use of machine vision cameras and advanced image processing techniques. The most dynamically growing class of image analysis methods comprises solutions based on deep neural networks, whose major advantages are fast performance, robustness, and applicability even to complicated classification problems. However, the use of machine learning methods on high-performance production lines may be limited by inference time or, in the case of multi-format production lines, training time. The article presents a novel data pre-processing (or calibration) method that uses prior knowledge about the optical system, enabling the use of a lightweight Convolutional Neural Network (CNN) model for product quality control of polyethylene terephthalate (PET) bottle caps. The combination of pre-processing with the lightweight CNN model resulted in at least a five-fold reduction in prediction and training time compared to the lighter standard models benchmarked on ImageNet, without loss of accuracy.
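The abstract does not detail the optics-based calibration step, so the sketch below covers only the other half of the recipe: a deliberately lightweight CNN of the kind that becomes viable once inputs are normalized by such pre-processing. The architecture and input size are generic examples, not the paper's model:

```python
import torch
import torch.nn as nn

# A tiny CNN for binary accept/reject classification of pre-processed
# (calibrated) cap crops; the calibration itself is specific to the
# paper's optical setup and is not reproduced here.
class TinyCapNet(nn.Module):
    def __init__(self, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(32, n_classes)
    def forward(self, x):
        return self.head(self.features(x).flatten(1))

model = TinyCapNet()
x = torch.randn(4, 3, 64, 64)       # a batch of calibrated crops
print(model(x).shape)               # -> torch.Size([4, 2])
print(sum(p.numel() for p in model.parameters()), "parameters")
```

A network this small trains and infers orders of magnitude faster than ImageNet-scale backbones, which is the trade-off the calibration step is meant to unlock.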


2021 ◽  
Vol 11 (18) ◽  
pp. 8441
Author(s):  
Anh-Cang Phan ◽  
Ngoc-Hoang-Quyen Nguyen ◽
Thanh-Ngoan Trieu ◽  
Thuong-Cang Phan

Drowsy driving is one of the common causes of road accidents, resulting in injuries, even death, and significant economic losses for drivers, road users, families, and society. Many studies have attempted to detect drowsiness for alert systems. However, a majority of them focus on determining eyelid and mouth movements, which have revealed many limitations for drowsiness detection. Moreover, studies based on physiological measures may not be feasible in practice because the measuring devices are often not available in vehicles and are often uncomfortable for drivers. In this research, we therefore propose two efficient methods with three scenarios for drowsiness alert systems. The former applies facial landmarks to detect blinks and yawns based on thresholds appropriate for each driver. The latter uses deep learning techniques with two adaptive deep neural networks based on MobileNet-V2 and ResNet-50V2; this second method analyzes the videos and detects the driver's activities in every frame to learn all features automatically. We leverage transfer learning to train the proposed networks on our training dataset, which addresses the problem of limited training data, provides fast training time, and retains the advantages of deep neural networks. Experiments were conducted to test the effectiveness of our methods compared with other methods. Empirical results demonstrate that the proposed method using deep learning techniques can achieve a high accuracy of 97%. This study provides meaningful practical solutions to prevent automobile accidents caused by drowsiness.
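One common way to implement landmark-based blink detection with per-driver thresholds is the eye aspect ratio (EAR); the abstract does not name the exact metric, so treat the following as an illustrative formulation rather than the paper's method. The landmark coordinates below are synthetic stand-ins:

```python
import numpy as np

def eye_aspect_ratio(eye):
    """eye: six (x, y) landmarks around one eye, ordered p1..p6 as in the
    common 68-point facial-landmark convention."""
    p1, p2, p3, p4, p5, p6 = eye
    vertical = np.linalg.norm(p2 - p6) + np.linalg.norm(p3 - p5)
    horizontal = np.linalg.norm(p1 - p4)
    return vertical / (2.0 * horizontal)

def calibrate_threshold(open_eye_ears, factor=0.7):
    """Per-driver threshold: a fraction of that driver's own open-eye EAR."""
    return factor * np.mean(open_eye_ears)

# Illustrative frames: EAR drops well below the threshold during a blink.
open_eye = np.array([[0, 2], [2, 3], [4, 3], [6, 2], [4, 1], [2, 1]], float)
closed_eye = np.array([[0, 2], [2, 2.2], [4, 2.2], [6, 2], [4, 1.8], [2, 1.8]], float)

thr = calibrate_threshold([eye_aspect_ratio(open_eye)])
for name, eye in [("open", open_eye), ("closed", closed_eye)]:
    ear = eye_aspect_ratio(eye)
    print(f"{name}: EAR={ear:.2f} blink={'yes' if ear < thr else 'no'}")
```

Calibrating the threshold per driver, as the first method does, avoids a fixed cutoff misfiring on drivers with naturally narrower or wider eyes.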

