Flexible data parallel training of neural networks using MIMD-Computers
Author(s): M. Besch ◽ H.W. Pohl

Computing ◽ 2021
Author(s): Sergio Barrachina ◽ Adrián Castelló ◽ Mar Catalán ◽ Manuel F. Dolz ◽ Jose I. Mestre
Abstract: In this work, we build a general piece-wise model to analyze the data-parallel (DP) training costs of convolutional neural networks (CNNs) on clusters of GPUs. The model is based on i) multi-layer perceptrons (MLPs) that model the NVIDIA cuDNN/cuBLAS library kernels involved in training several state-of-the-art CNNs; and ii) an analytical model of the NVIDIA NCCL Allreduce collective primitive using the Ring algorithm. The CNN training scalability study performed with this model, in combination with the Roofline technique, across varying batch sizes, node (floating-point) arithmetic performance, node memory bandwidth, network link bandwidth, and cluster size unveils crucial bottlenecks at both the GPU and cluster levels. To support this analysis, we validate the accuracy of the proposed model against a Python library for distributed deep learning training.
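The two analytical ingredients named in the abstract, a cost model for the Ring Allreduce algorithm and the Roofline technique, rest on well-known closed forms. The Python sketch below only illustrates those standard formulas and is not the authors' model: the function names, the latency/bandwidth parameters, and the example numbers are assumptions.

def ring_allreduce_time(message_bytes, num_procs, link_bandwidth, latency=0.0):
    """Estimate Allreduce time under the Ring algorithm.

    Each of the 2*(p-1) steps sends message_bytes/p over one link;
    `latency` models the per-step start-up cost (assumed parameter).
    """
    p = num_procs
    if p == 1:
        return 0.0
    return 2 * (p - 1) * (latency + (message_bytes / p) / link_bandwidth)


def roofline_time(flops, bytes_moved, peak_flops, mem_bandwidth):
    """Roofline bound: a kernel is limited either by arithmetic or by memory traffic."""
    return max(flops / peak_flops, bytes_moved / mem_bandwidth)


if __name__ == "__main__":
    # Illustrative numbers (assumptions): 100 MB of gradients over 8 nodes
    # with 10 GB/s links; a 2 GFLOP kernel moving 50 MB on a 10 TFLOP/s,
    # 900 GB/s GPU.
    t_comm = ring_allreduce_time(100e6, 8, 10e9)
    t_kernel = roofline_time(2e9, 50e6, 10e12, 900e9)
    print(f"Ring Allreduce: {t_comm * 1e3:.2f} ms, kernel (Roofline): {t_kernel * 1e3:.2f} ms")

In the piece-wise model described by the abstract, the per-kernel times would come from the trained MLPs rather than a Roofline bound; the bound is included here only because the scalability study combines the model with the Roofline technique.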


2021
Author(s): Daniel Coquelin ◽ Charlotte Debus ◽ Markus Götz ◽ Fabrice von der Lehr ◽ James Kahn ◽ ...

Abstract: With increasing data and model complexity, the time required to train neural networks has become prohibitively large. To address the exponential rise in training time, users are turning to data-parallel neural networks (DPNNs) and large-scale distributed resources on computer clusters. Current DPNN approaches implement the network parameter updates by synchronizing and averaging gradients across all processes with blocking communication operations after each forward-backward pass. This synchronization is the central algorithmic bottleneck. We introduce the Distributed Asynchronous and Selective Optimization (DASO) method, which leverages multi-GPU compute-node architectures to accelerate network training while maintaining accuracy. DASO uses a hierarchical and asynchronous communication scheme composed of node-local and global networks, while adjusting the global synchronization rate during the learning process. We show that DASO yields a reduction in training time of up to 34% on classical and state-of-the-art networks, compared to current optimized data-parallel training methods.
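The abstract describes DASO's hierarchical scheme only at a high level: frequent, cheap node-local synchronization combined with less frequent, asynchronous global synchronization. The Python sketch below illustrates such a schedule with torch.distributed; it is not the DASO/HeAT implementation, and the function name, the group handles, and the fixed global_every interval are assumptions (DASO additionally adapts the global synchronization rate during training).

import torch.distributed as dist


def hierarchical_grad_sync(model, local_group, global_group, step, global_every=4):
    """Average gradients node-locally every step and globally every k steps.

    `local_group` would hold the ranks sharing one compute node (e.g. built
    with dist.new_group); `global_group` spans all processes. Both handles
    are assumptions of this sketch.
    """
    pending = []
    for param in model.parameters():
        if param.grad is None:
            continue

        # Fast intra-node average over the GPUs of a single node.
        dist.all_reduce(param.grad, op=dist.ReduceOp.SUM, group=local_group)
        param.grad /= dist.get_world_size(group=local_group)

        if step % global_every == 0:
            # Less frequent cross-node average, issued asynchronously so the
            # reductions of different parameters can overlap; completed below.
            work = dist.all_reduce(param.grad, op=dist.ReduceOp.SUM,
                                   group=global_group, async_op=True)
            pending.append((param, work))

    for param, work in pending:
        work.wait()
        # Averaging the per-node means over all ranks yields the global mean.
        param.grad /= dist.get_world_size(group=global_group)

Keeping the cross-node reduction off the critical path of most steps is the basic trade-off such a hierarchical scheme exploits.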

