scholarly journals A Convergence Analysis of Distributed SGD with Communication-Efficient Gradient Sparsification

Author(s):  
Shaohuai Shi ◽  
Kaiyong Zhao ◽  
Qiang Wang ◽  
Zhenheng Tang ◽  
Xiaowen Chu

Gradient sparsification is a promising technique to significantly reduce the communication overhead in decentralized synchronous stochastic gradient descent (S-SGD) algorithms. Yet, many existing gradient sparsification schemes (e.g., Top-k sparsification) have a communication complexity of O(kP), where k is the number of selected gradients by each worker and P is the number of workers. Recently, the gTop-k sparsification scheme has been proposed to reduce the communication complexity from O(kP) to O(k logP), which significantly boosts the system scalability. However, it remains unclear whether the gTop-k sparsification scheme can converge in theory. In this paper, we first provide theoretical proofs on the convergence of the gTop-k scheme for non-convex objective functions under certain analytic assumptions. We then derive the convergence rate of gTop-k S-SGD, which is at the same order as the vanilla mini-batch SGD. Finally, we conduct extensive experiments on different machine learning models and data sets to verify the soundness of the assumptions and theoretical results, and discuss the impact of the compression ratio on the convergence performance.

2021 ◽  
pp. 1-18
Author(s):  
Angeliki Koutsimpela ◽  
Konstantinos D. Koutroumbas

Several well known clustering algorithms have their own online counterparts, in order to deal effectively with the big data issue, as well as with the case where the data become available in a streaming fashion. However, very few of them follow the stochastic gradient descent philosophy, despite the fact that the latter enjoys certain practical advantages (such as the possibility of (a) running faster than their batch processing counterparts and (b) escaping from local minima of the associated cost function), while, in addition, strong theoretical convergence results have been established for it. In this paper a novel stochastic gradient descent possibilistic clustering algorithm, called O- PCM 2 is introduced. The algorithm is presented in detail and it is rigorously proved that the gradient of the associated cost function tends to zero in the L 2 sense, based on general convergence results established for the family of the stochastic gradient descent algorithms. Furthermore, an additional discussion is provided on the nature of the points where the algorithm may converge. Finally, the performance of the proposed algorithm is tested against other related algorithms, on the basis of both synthetic and real data sets.


Author(s):  
Simone Göttlich ◽  
Claudia Totzeck

AbstractWe propose a neural network approach to model general interaction dynamics and an adjoint-based stochastic gradient descent algorithm to calibrate its parameters. The parameter calibration problem is considered as optimal control problem that is investigated from a theoretical and numerical point of view. We prove the existence of optimal controls, derive the corresponding first-order optimality system and formulate a stochastic gradient descent algorithm to identify parameters for given data sets. To validate the approach, we use real data sets from traffic and crowd dynamics to fit the parameters. The results are compared to forces corresponding to well-known interaction models such as the Lighthill–Whitham–Richards model for traffic and the social force model for crowd motion.


2014 ◽  
Vol 687-691 ◽  
pp. 1342-1345 ◽  
Author(s):  
Jie Ding ◽  
Li Peng Zhu ◽  
Bin Hu ◽  
Ren Long Hang ◽  
Yu Bao Sun

With the rapid advance of data collection and storage technique, it is easy to acquire tens of millions or even billions of data sets. How to explore and exploit the useful or interesting information for human beings from these data sets has become an urgent issue. Traditional k-means clustering algorithm has been widely used in data mining community. First, randomly initialize k clustering centres. Then, all instances are classified into k different classes according to their distances to clustering centres. Lastly, update the clustering centres by the mean of its corresponding constituent instances. This whole process will be iterated until convergence. Obviously, at each iteration, distance matrix from all instances to k clustering centres must be calculated which will cost so much time when encounter large scale data sets. To address this issue, in this paper, we proposed a fast optimization algorithm based on stochastic gradient descent (SGD). At each iteration, randomly choose an instance, search its corresponding clustering centre and then update it immediately. Experimental results show that our proposed method achieves a competitive clustering results with less time cost.


2020 ◽  
Vol 21 (2) ◽  
pp. 323-336
Author(s):  
Zhen Lu ◽  
Meng Lu ◽  
Yan Liang

The application of deep learning in industry often needs to train large-scale neural networks and use large-scale data sets. However, larger networks and larger data sets lead to longer training time, which hinders the research of algorithms and the progress of actual engineering development. Data-parallel distributed training is a commonly used solution, but it is still in the stage of technical exploration. In this paper, we study how to improve the training accuracy and speed of distributed training, and propose a distributed training strategy based on hybrid gradient computing. Specifically, in the gradient descent stage, we propose a hybrid method, which combines a new warmup scheme with the linear-scaling stochastic gradient descent (SGD) algorithm to effectively improve the training accuracy and convergence rate. At the same time, we adopt the mixed precision gradient computing. In the single-GPU gradient computing and inter-GPU gradient synchronization, we use the mixed numerical precision of single precision (FP32) and half precision (FP16), which not only improves the training speed of single-GPU, but also improves the speed of inter-GPU communication. Through the integration of various training strategies and system engineering implementation, we finished ResNet-50 training in 20 minutes on a cluster of 24 V100 GPUs, with 75.6% Top-1 accuracy, and 97.5% GPU scaling efficiency. In addition, this paper proposes a new criterion for the evaluation of the distributed training efficiency, that is, the actual average single-GPU training time, which can evaluate the improvement of training methods in a more reasonable manner than just the improved performance due to the increased number of GPUs. In terms of this criterion, our method outperforms those existing methods.


Author(s):  
Shuheng Shen ◽  
Linli Xu ◽  
Jingchang Liu ◽  
Xianfeng Liang ◽  
Yifei Cheng

With the increase in the amount of data and the expansion of model scale, distributed parallel training becomes an important and successful technique to address the optimization challenges. Nevertheless, although distributed stochastic gradient descent (SGD) algorithms can achieve a linear iteration speedup, they are limited significantly in practice by the communication cost, making it difficult to achieve a linear time speedup. In this paper, we propose a computation and communication decoupled stochastic gradient descent (CoCoD-SGD) algorithm to run computation and communication in parallel to reduce the communication cost. We prove that CoCoD-SGD has a linear iteration speedup with respect to the total computation capability of the hardware resources. In addition, it has a lower communication complexity and better time speedup comparing with traditional distributed SGD algorithms. Experiments on deep neural network training demonstrate the significant improvements of CoCoD-SGD: when training ResNet18 and VGG16 with 16 Geforce GTX 1080Ti GPUs, CoCoD-SGD is up to 2-3 x faster than traditional synchronous SGD.


2019 ◽  
Vol 23 ◽  
pp. 310-337 ◽  
Author(s):  
Stephan Clémençon ◽  
Patrice Bertail ◽  
Emilie Chautru ◽  
Guillaume Papa

Iterative stochastic approximation methods are widely used to solve M-estimation problems, in the context of predictive learning in particular. In certain situations that shall be undoubtedly more and more common in the Big Data era, the datasets available are so massive that computing statistics over the full sample is hardly feasible, if not unfeasible. A natural and popular approach to gradient descent in this context consists in substituting the “full data” statistics with their counterparts based on subsamples picked at random of manageable size. It is the main purpose of this paper to investigate the impact of survey sampling with unequal inclusion probabilities on stochastic gradient descent-based M-estimation methods. Precisely, we prove that, in presence of some a priori information, one may significantly increase statistical accuracy in terms of limit variance, when choosing appropriate first order inclusion probabilities. These results are described by asymptotic theorems and are also supported by illustrative numerical experiments.


2020 ◽  
Vol 39 (3) ◽  
pp. 4677-4688
Author(s):  
Weimin Ding ◽  
Shengli Wu

Stacking is one of the major types of ensemble learning techniques in which a set of base classifiers contributes their outputs to the meta-level classifier, and the meta-level classifier combines them so as to produce more accurate classifications. In this paper, we propose a new stacking algorithm that defines the cross-entropy as the loss function for the classification problem. The training process is conducted by using a neural network with the stochastic gradient descent technique. One major characteristic of our method is its treatment of each meta instance as a whole with one optimization model, which is different from some other stacking methods such as stacking with multi-response linear regression and stacking with multi-response model trees. In these methods each meta instance is divided into a set of sub-instances. Multiple models apply to those sub-instances and each for a class label. There is no connection between different models. It is very likely that our treatment is a better choice for finding suitable weights. Experiments with 22 data sets from the UCI machine learning repository show that the proposed stacking approach performs well. It outperforms all three base classifiers, several state-of-the-art stacking algorithms, and some other representative ensemble learning methods on average.


2020 ◽  
Vol 32 (3) ◽  
pp. 779-789
Author(s):  
Yaqin Zhou ◽  
Shaojie Tang

The rich data used to train learning models increasingly tend to be distributed and private. It is important to efficiently perform learning tasks without compromising individual users’ privacy even considering untrusted learning applications and, furthermore, understand how privacy-preservation mechanisms impact the learning process. To address the problem, we design a differentially private distributed algorithm based on the stochastic variance reduced gradient (SVRG) algorithm, which prevents the learning server from accessing and inferring private training data with a theoretical guarantee. We quantify the impact of the adopted privacy-preservation measure on the learning process in terms of convergence rate, by which it indicates noises added at each gradient update results in a bounded deviation from the optimum. To further evaluate the impact on the trained models, we compare the proposed algorithm with SVRG and stochastic gradient descent using logistic regression and neural nets. The experimental results on benchmark data sets show that the proposed algorithm has minor impact on the accuracy of trained models under a moderate amount of privacy budget.


2020 ◽  
Vol 10 (2) ◽  
pp. 124-151
Author(s):  
Justin Sirignano ◽  
Konstantinos Spiliopoulos

Stochastic gradient descent in continuous time (SGDCT) provides a computationally efficient method for the statistical learning of continuous-time models, which are widely used in science, engineering, and finance. The SGDCT algorithm follows a (noisy) descent direction along a continuous stream of data. The parameter updates occur in continuous time and satisfy a stochastic differential equation. This paper analyzes the asymptotic convergence rate of the SGDCT algorithm by proving a central limit theorem for strongly convex objective functions and, under slightly stronger conditions, for nonconvex objective functions as well. An [Formula: see text] convergence rate is also proven for the algorithm in the strongly convex case. The mathematical analysis lies at the intersection of stochastic analysis and statistical learning.


Author(s):  
Thanh Hiên Nguyễn ◽  
Thi Tran ◽  
Hieu Duong ◽  
Hoai Tran

Runoff prediction has recently become an essential task with respect to assessing the impact of climate change to people’s livelihoods and production. However, the runoff time series always exhibits nonlinear and non-stationary features, which makes it very difficult to be accurately predicted. Machine learning have been recently proved to be a powerful tool in helping society adapt to a changing climate and its subfield, deep learning, showed the power in approximate nonlinear functions. In this study, we propose a method based on deep belief networks (DBN) for runoff prediction. In order to evaluate the proposed method, we collected runoff datasets from Srepok and Dak Nong rivers located in mountain regions of the Central Highland of Vietnam in the periods of 2001-2007 at Dak Nong hydrology station and 1990-2011 at Buon Don hydrology station, respectively. Experimental results show that DBN outperforms, respectively, LSTM, BiLSTM, multi-layer perceptron (MLP) trained by particle swarm optimization (PSO) and MLP trained by stochastic gradient descent (SGD) in which gradients are computed using the backpropagation (BP) procedure. The results also confirm that DBN is suitable to employ for the task of runoff prediction.


Sign in / Sign up

Export Citation Format

Share Document