A Distributed Neural Network Training Method Based on Hybrid Gradient Computing

2020 ◽  
Vol 21 (2) ◽  
pp. 323-336
Author(s):  
Zhen Lu ◽  
Meng Lu ◽  
Yan Liang

The application of deep learning in industry often requires training large-scale neural networks on large-scale data sets. However, larger networks and larger data sets lead to longer training times, which hinders both algorithm research and practical engineering development. Data-parallel distributed training is a commonly used solution, but it is still at the stage of technical exploration. In this paper, we study how to improve the accuracy and speed of distributed training, and propose a distributed training strategy based on hybrid gradient computing. Specifically, in the gradient descent stage, we propose a hybrid method that combines a new warmup scheme with the linear-scaling stochastic gradient descent (SGD) algorithm to effectively improve training accuracy and convergence rate. At the same time, we adopt mixed-precision gradient computing: in single-GPU gradient computation and inter-GPU gradient synchronization, we use a mix of single precision (FP32) and half precision (FP16), which improves both the training speed on a single GPU and the speed of inter-GPU communication. Through the integration of these training strategies and their system-level engineering implementation, we finished ResNet-50 training in 20 minutes on a cluster of 24 V100 GPUs, with 75.6% Top-1 accuracy and 97.5% GPU scaling efficiency. In addition, this paper proposes a new criterion for evaluating distributed training efficiency, namely the actual average single-GPU training time, which assesses the improvement contributed by the training method itself more reasonably than the speedup obtained merely by increasing the number of GPUs. In terms of this criterion, our method outperforms existing methods.
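As a rough illustration of the linear-scaling rule with warmup described above (not the paper's exact schedule; the base learning rate, the reference batch size of 256, and the warmup length are assumptions of this sketch):

```python
def scaled_lr_schedule(step, steps_per_epoch, num_gpus, per_gpu_batch=256,
                       base_lr=0.1, ref_batch=256, warmup_epochs=5):
    """Linear-scaling SGD learning rate with linear warmup (illustrative only).

    The target rate grows linearly with the global batch size, and the first
    `warmup_epochs` ramp up from a small value to avoid early divergence.
    """
    target_lr = base_lr * (per_gpu_batch * num_gpus) / ref_batch  # linear scaling rule
    warmup_steps = warmup_epochs * steps_per_epoch
    if step < warmup_steps:
        return target_lr * (step + 1) / warmup_steps              # linear warmup
    return target_lr
```

In mixed-precision training one would additionally keep FP32 master weights and apply loss scaling around the FP16 backward pass; that part is omitted from the sketch.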

2014 ◽  
Vol 687-691 ◽  
pp. 1342-1345 ◽  
Author(s):  
Jie Ding ◽  
Li Peng Zhu ◽  
Bin Hu ◽  
Ren Long Hang ◽  
Yu Bao Sun

With the rapid advance of data collection and storage techniques, it is easy to acquire data sets with tens of millions or even billions of instances. How to explore and exploit the useful or interesting information in these data sets has become an urgent issue. The traditional k-means clustering algorithm has been widely used in the data mining community. First, k clustering centres are randomly initialized. Then, all instances are assigned to one of the k classes according to their distances to the clustering centres. Lastly, each clustering centre is updated to the mean of its constituent instances. This process is iterated until convergence. Obviously, at each iteration the distance matrix from all instances to the k clustering centres must be calculated, which is very time-consuming on large-scale data sets. To address this issue, in this paper we propose a fast optimization algorithm based on stochastic gradient descent (SGD): at each iteration, an instance is chosen at random, its corresponding clustering centre is found, and that centre is updated immediately. Experimental results show that the proposed method achieves competitive clustering results at a lower time cost.
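A minimal sketch of the single-sample update described above, assuming NumPy and a per-centre step size of 1/n_j (a common choice for stochastic k-means, not necessarily the one used in the paper):

```python
import numpy as np

def sgd_kmeans(X, k, n_updates=100_000, seed=0):
    """Stochastic-gradient k-means: update one centre per sampled instance."""
    rng = np.random.default_rng(seed)
    centres = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    counts = np.zeros(k)                                   # how often each centre was chosen
    for _ in range(n_updates):
        x = X[rng.integers(len(X))]                        # draw one random instance
        j = np.argmin(((centres - x) ** 2).sum(axis=1))    # nearest centre
        counts[j] += 1
        centres[j] += (x - centres[j]) / counts[j]         # move that centre towards x
    return centres
```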


Author(s):  
Andrew Jacobsen ◽  
Matthew Schlegel ◽  
Cameron Linke ◽  
Thomas Degris ◽  
Adam White ◽  
...  

This paper investigates different vector step-size adaptation approaches for non-stationary, online, continual prediction problems. Vanilla stochastic gradient descent can be considerably improved by scaling the update with a vector of appropriately chosen step-sizes. Many methods, including AdaGrad, RMSProp, and AMSGrad, keep statistics about the learning process to approximate a second-order update, i.e., a vector approximation of the inverse Hessian. Another family of approaches uses meta-gradient descent to adapt the step-size parameters to minimize prediction error. These meta-descent strategies are promising for non-stationary problems, but have not been as extensively explored as quasi-second-order methods. We first derive a general, incremental meta-descent algorithm, called AdaGain, designed to be applicable to a much broader range of algorithms, including those with semi-gradient updates or even accelerations, such as RMSProp. We then provide an empirical comparison of methods from both families. We conclude that methods from both families can perform well, but that in non-stationary prediction problems the meta-descent methods exhibit advantages. Our method is particularly robust across several prediction problems, and is competitive with the state-of-the-art method on a large-scale time-series prediction problem using real data from a mobile robot.
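For concreteness, a classic member of the meta-descent family is IDBD (Sutton, 1992), which adapts a per-feature step-size for linear prediction by a meta-gradient on the prediction error; the sketch below shows that update (it is not the AdaGain algorithm itself):

```python
import numpy as np

def idbd_update(w, beta, h, x, y, meta_lr=0.01):
    """One IDBD step: adapt log step-sizes `beta` via a meta-gradient on the error."""
    delta = y - w @ x                      # prediction error for this sample
    beta += meta_lr * delta * x * h        # meta-gradient step on the log step-sizes
    alpha = np.exp(beta)                   # per-feature step-sizes
    w += alpha * delta * x                 # base learner update
    h = h * np.clip(1.0 - alpha * x * x, 0.0, None) + alpha * delta * x  # memory trace
    return w, beta, h
```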


2021 ◽  
Vol 251 ◽  
pp. 02054
Author(s):  
Olga Sunneborn Gudnadottir ◽  
Daniel Gedon ◽  
Colin Desmarais ◽  
Karl Bengtsson Bernander ◽  
Raazesh Sainudiin ◽  
...  

In recent years, machine-learning methods have become increasingly important for the experiments at the Large Hadron Collider (LHC). They are utilised in everything from trigger systems to reconstruction and data analysis. The recent UCluster method is a general model providing unsupervised clustering of particle physics data that can easily be modified to provide solutions for a variety of different decision problems. In the current paper, we improve on the UCluster method by adding the option of training the model in a scalable and distributed fashion, thereby extending its utility to learning from arbitrarily large data sets. UCluster combines a graph-based neural network called ABCnet with a clustering step, using a combined loss function in the training phase. The original code is publicly available in TensorFlow v1.14 and has previously been trained on a single GPU; it shows a clustering accuracy of 81% when applied to the problem of multi-class classification of simulated jet events. Our implementation adds distributed training functionality by utilising the Horovod distributed training framework, which necessitated a migration of the code to TensorFlow v2. Together with the use of Parquet files for splitting the data across compute nodes, distributed training makes the model scalable to any amount of input data, something that will be essential for use with real LHC data sets. We find that the model is well suited for distributed training, with the training time decreasing in direct relation to the number of GPUs used. However, a more exhaustive, and possibly distributed, hyper-parameter search is still required to reach the accuracy reported for the original UCluster method.
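The general Horovod-on-Keras pattern used for this kind of migration looks roughly as follows; this is a generic sketch, not the actual UCluster training script, and `build_ucluster_model`, `combined_loss`, and `shard_of_parquet_dataset` are placeholders for code that the paper's repository would provide:

```python
import tensorflow as tf
import horovod.tensorflow.keras as hvd

hvd.init()                                              # one process per GPU
gpus = tf.config.list_physical_devices('GPU')
if gpus:
    tf.config.set_visible_devices(gpus[hvd.local_rank()], 'GPU')

model = build_ucluster_model()                          # placeholder: ABCnet + clustering head
opt = tf.keras.optimizers.Adam(learning_rate=1e-3 * hvd.size())  # scale LR with worker count
opt = hvd.DistributedOptimizer(opt)                     # allreduce gradients across workers

model.compile(optimizer=opt, loss=combined_loss)        # placeholder: combined clustering loss
model.fit(
    shard_of_parquet_dataset(hvd.rank(), hvd.size()),   # each worker reads its own Parquet shard
    epochs=10,
    callbacks=[hvd.callbacks.BroadcastGlobalVariablesCallback(0)],  # sync initial weights
    verbose=1 if hvd.rank() == 0 else 0,
)
```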


2021 ◽  
pp. 1-18
Author(s):  
Angeliki Koutsimpela ◽  
Konstantinos D. Koutroumbas

Several well-known clustering algorithms have online counterparts, in order to deal effectively with the big data issue as well as with the case where the data become available in a streaming fashion. However, very few of them follow the stochastic gradient descent philosophy, despite the fact that the latter enjoys certain practical advantages (such as the possibility of (a) running faster than batch-processing counterparts and (b) escaping from local minima of the associated cost function), while, in addition, strong theoretical convergence results have been established for it. In this paper a novel stochastic gradient descent possibilistic clustering algorithm, called O-PCM2, is introduced. The algorithm is presented in detail, and it is rigorously proved that the gradient of the associated cost function tends to zero in the L2 sense, based on general convergence results established for the family of stochastic gradient descent algorithms. Furthermore, an additional discussion is provided on the nature of the points to which the algorithm may converge. Finally, the performance of the proposed algorithm is tested against related algorithms on both synthetic and real data sets.
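A generic online possibilistic c-means update of this flavour might look as follows; this is a sketch only, using the standard PCM typicality formula with an assumed step-size handling, and is not the exact O-PCM2 update rule:

```python
import numpy as np

def online_pcm_step(x, centres, eta, lr, m=2.0):
    """One stochastic-gradient step on the PCM cost for a single sample x.

    `eta` holds the per-cluster scale parameters; the factor 2 from the
    squared-distance gradient is absorbed into the learning rate `lr`.
    """
    d2 = ((centres - x) ** 2).sum(axis=1)                 # squared distances to the centres
    u = 1.0 / (1.0 + (d2 / eta) ** (1.0 / (m - 1.0)))     # PCM typicality degrees
    centres += lr * (u[:, None] ** m) * (x - centres)     # gradient step on the centres
    return centres, u
```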


Symmetry ◽  
2021 ◽  
Vol 13 (9) ◽  
pp. 1652
Author(s):  
Wanida Panup ◽  
Rabian Wangkeeree

In this paper, we propose a stochastic gradient descent algorithm, called the stochastic gradient descent method-based generalized pinball support vector machine (SG-GPSVM), to solve data classification problems. The approach is developed by replacing the hinge loss function in the conventional support vector machine (SVM) with a generalized pinball loss function. We show that SG-GPSVM is convergent and that it approximates the conventional generalized pinball support vector machine (GPSVM). Further, the symmetric kernel method is adopted to evaluate the performance of SG-GPSVM as a nonlinear classifier. According to the experimental results, the suggested algorithm surpasses existing methods in terms of noise insensitivity, resampling stability, and accuracy in large-scale data scenarios.
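To make the role of the loss concrete, here is a sketch of primal SGD on a linear SVM with a pinball-style loss; the simple two-slope form below omits the insensitivity-zone parameters of the full generalized pinball loss, and the decaying step size is a standard Pegasos-style choice rather than anything taken from the paper:

```python
import numpy as np

def sgd_pinball_svm(X, y, tau1=1.0, tau2=0.5, lam=1e-3, epochs=10, seed=0):
    """Primal SGD for a linear SVM with loss max(tau1*(1-m), tau2*(m-1), 0) on margin m."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w, b, t = np.zeros(d), 0.0, 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            lr = 1.0 / (lam * t)                 # Pegasos-style decaying step size
            m = y[i] * (X[i] @ w + b)            # signed margin of sample i
            if 1.0 - m > 0:                      # hinge side, slope tau1
                g = -tau1 * y[i]
            elif m - 1.0 > 0:                    # pinball penalty on the other side, slope tau2
                g = tau2 * y[i]
            else:
                g = 0.0
            w -= lr * (lam * w + g * X[i])       # subgradient step on the regularized loss
            b -= lr * g
    return w, b
```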


2020 ◽  
Vol 34 (04) ◽  
pp. 6861-6868 ◽  
Author(s):  
Yikai Zhang ◽  
Hui Qu ◽  
Dimitris Metaxas ◽  
Chao Chen

Regularization plays an important role in the generalization of deep learning. In this paper, we study the generalization power of an unbiased regularizer for training algorithms in deep learning. We focus on a training method called Locally Regularized Stochastic Gradient Descent (LRSGD). LRSGD leverages a proximal-type penalty in its gradient descent steps to regularize SGD during training. We show that, by carefully choosing the relevant parameters, LRSGD generalizes better than SGD. Our thorough theoretical analysis is supported by experimental evidence. It advances our theoretical understanding of deep learning and provides new perspectives on designing training algorithms. The code is available at https://github.com/huiqu18/LRSGD.
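A minimal sketch of what a proximal-type penalty in an SGD step can look like, assuming the penalty pulls the iterate towards an anchor point `w_anchor` with weight `gamma`; the anchor choice and its refresh schedule below are guesses for illustration, not the scheme from the paper or the linked code:

```python
import numpy as np

def locally_regularized_sgd_step(w, grad, w_anchor, lr=0.01, gamma=0.1):
    """One SGD step on loss(w) + (gamma / 2) * ||w - w_anchor||^2."""
    return w - lr * (grad + gamma * (w - w_anchor))

def train(grad_fn, w0, steps=1000, period=50, lr=0.01, gamma=0.1):
    """Illustrative usage: refresh the proximal anchor every `period` steps."""
    w, w_anchor = w0.copy(), w0.copy()
    for t in range(steps):
        if t % period == 0:
            w_anchor = w.copy()                  # re-centre the proximal penalty
        w = locally_regularized_sgd_step(w, grad_fn(w), w_anchor, lr, gamma)
    return w
```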


2021 ◽  
Vol 2021 ◽  
pp. 1-11
Author(s):  
Jinhuan Duan ◽  
Xianxian Li ◽  
Shiqi Gao ◽  
Zili Zhong ◽  
Jinyan Wang

With the vigorous development of artificial intelligence technology, various engineering applications have been implemented one after another. The gradient descent method plays an important role in solving various optimization problems due to its simple structure, good stability, and easy implementation. However, in multi-node machine learning systems, gradients usually need to be shared, which can cause privacy leakage, because attackers can infer training data from the gradient information. In this paper, to prevent gradient leakage while maintaining model accuracy, we propose the super stochastic gradient descent approach, which updates parameters while concealing the modulus length of the gradient vectors by converting each of them into a unit vector. Furthermore, we analyze the security of the super stochastic gradient descent approach and demonstrate that our algorithm can defend against attacks on the gradient. Experimental results show that our approach is clearly superior to prevalent gradient descent approaches in terms of accuracy, robustness, and adaptability to large-scale batches. Interestingly, our algorithm can also resist model poisoning attacks to a certain extent.
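As a sketch of the core idea of hiding the modulus length, each node could normalize its gradient to a unit vector before sharing it; whether the normalization is applied per layer or over the concatenated gradient, as here, is an assumption of this sketch:

```python
import numpy as np

def unit_gradient(grads, eps=1e-12):
    """Rescale a list of gradient arrays so the concatenated vector has norm 1."""
    norm = np.sqrt(sum(float((g * g).sum()) for g in grads)) + eps
    return [g / norm for g in grads]

# Each node would share only unit_gradient(local_grads), concealing the true
# gradient magnitude from the other nodes while preserving its direction.
```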

