Distributed Newton Methods for Deep Neural Networks

Deep learning involves a difficult nonconvex optimization problem with a large number of weights between any two adjacent layers of a deep structure. To handle large data sets or complicated networks, distributed training is needed, but the calculation of function, gradient, and Hessian is expensive. In particular, the communication and the synchronization cost may become a bottleneck. In this letter, we focus on situations where the model is distributedly stored and propose a novel distributed Newton method for training deep neural networks. By variable and feature-wise data partitions and some careful designs, we are able to explicitly use the Jacobian matrix for matrix-vector products in the Newton method. Some techniques are incorporated to reduce the running time as well as memory consumption. First, to reduce the communication cost, we propose a diagonalization method such that an approximate Newton direction can be obtained without communication between machines. Second, we consider subsampled Gauss-Newton matrices for reducing the running time as well as the communication cost. Third, to reduce the synchronization cost, we terminate the process of finding an approximate Newton direction even though some nodes have not finished their tasks. Details of some implementation issues in distributed environments are thoroughly investigated. Experiments demonstrate that the proposed method is effective for the distributed training of deep neural networks. Compared with stochastic gradient methods, it is more robust and may give better test accuracy.

Download Full-text

Parallel Restarted SGD with Faster Convergence and Less Communication: Demystifying Why Model Averaging Works for Deep Learning

Proceedings of the AAAI Conference on Artificial Intelligence ◽

10.1609/aaai.v33i01.33015693 ◽

2019 ◽

Vol 33 ◽

pp. 5693-5700 ◽

Cited By ~ 16

Author(s):

Hao Yu ◽

Sen Yang ◽

Shenghuo Zhu

Keyword(s):

Neural Networks ◽

Deep Neural Networks ◽

Model Averaging ◽

Communication Overhead ◽

Single Server ◽

Training Time ◽

Distributed Training ◽

Speed Up ◽

Experimental Works ◽

Single Worker

In distributed training of deep neural networks, parallel minibatch SGD is widely used to speed up the training process by using multiple workers. It uses multiple workers to sample local stochastic gradients in parallel, aggregates all gradients in a single server to obtain the average, and updates each worker’s local model using a SGD update with the averaged gradient. Ideally, parallel mini-batch SGD can achieve a linear speed-up of the training time (with respect to the number of workers) compared with SGD over a single worker. However, such linear scalability in practice is significantly limited by the growing demand for gradient communication as more workers are involved. Model averaging, which periodically averages individual models trained over parallel workers, is another common practice used for distributed training of deep neural networks since (Zinkevich et al. 2010) (McDonald, Hall, and Mann 2010). Compared with parallel mini-batch SGD, the communication overhead of model averaging is significantly reduced. Impressively, tremendous experimental works have verified that model averaging can still achieve a good speed-up of the training time as long as the averaging interval is carefully controlled. However, it remains a mystery in theory why such a simple heuristic works so well. This paper provides a thorough and rigorous theoretical study on why model averaging can work as well as parallel mini-batch SGD with significantly less communication overhead.

Download Full-text

A Comparison of Optimization Algorithms for Deep Learning

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001420520138 ◽

2020 ◽

Vol 34 (13) ◽

pp. 2052013 ◽

Cited By ~ 3

Author(s):

Derya Soydaner

Keyword(s):

Neural Networks ◽

Deep Learning ◽

Deep Neural Networks ◽

Optimization Algorithms ◽

Gradient Methods ◽

The Past ◽

Basic Optimization ◽

In The Wild ◽

Image Datasets

In recent years, we have witnessed the rise of deep learning. Deep neural networks have proved their success in many areas. However, the optimization of these networks has become more difficult as neural networks going deeper and datasets becoming bigger. Therefore, more advanced optimization algorithms have been proposed over the past years. In this study, widely used optimization algorithms for deep learning are examined in detail. To this end, these algorithms called adaptive gradient methods are implemented for both supervised and unsupervised tasks. The behavior of the algorithms during training and results on four image datasets, namely, MNIST, CIFAR-10, Kaggle Flowers and Labeled Faces in the Wild are compared by pointing out their differences against basic optimization algorithms.

Download Full-text

Prediction of Causative Genes in Inherited Retinal Disorders from Spectral-Domain Optical Coherence Tomography Utilizing Deep Learning Techniques

Journal of Ophthalmology ◽

10.1155/2019/1691064 ◽

2019 ◽

Vol 2019 ◽

pp. 1-7 ◽

Cited By ~ 10

Author(s):

Yu Fujinami-Yokokawa ◽

Nikolas Pontikos ◽

Lizhu Yang ◽

Kazushige Tsunoda ◽

Kazutoshi Yoshitake ◽

...

Keyword(s):

Neural Networks ◽

Optical Coherence Tomography ◽

Classification Accuracy ◽

Deep Neural Networks ◽

Test Accuracy ◽

Test Sets ◽

The Mean ◽

Per Gene ◽

Sd Oct ◽

Gene Category

Purpose. To illustrate a data-driven deep learning approach to predicting the gene responsible for the inherited retinal disorder (IRD) in macular dystrophy caused by ABCA4 and RP1L1 gene aberration in comparison with retinitis pigmentosa caused by EYS gene aberration and normal subjects. Methods. Seventy-five subjects with IRD or no ocular diseases have been ascertained from the database of Japan Eye Genetics Consortium; 10 ABCA4 retinopathy, 20 RP1L1 retinopathy, 28 EYS retinopathy, and 17 normal patients/subjects. Horizontal/vertical cross-sectional scans of optical coherence tomography (SD-OCT) at the central fovea were cropped/adjusted to a resolution of 400 pixels/inch with a size of 750 × 500 pix2 for learning. Subjects were randomly split following a 3 : 1 ratio into training and test sets. The commercially available learning tool, Medic mind was applied to this four-class classification program. The classification accuracy, sensitivity, and specificity were calculated during the learning process. This process was repeated four times with random assignment to training and test sets to control for selection bias. For each training/testing process, the classification accuracy was calculated per gene category. Results. A total of 178 images from 75 subjects were included in this study. The mean training accuracy was 98.5%, ranging from 90.6 to 100.0. The mean overall test accuracy was 90.9% (82.0–97.6). The mean test accuracy per gene category was 100% for ABCA4, 78.0% for RP1L1, 89.8% for EYS, and 93.4% for Normal. Test accuracy of RP1L1 and EYS was not high relative to the training accuracy which suggests overfitting. Conclusion. This study highlighted a novel application of deep neural networks in the prediction of the causative gene in IRD retinopathies from SD-OCT, with a high prediction accuracy. It is anticipated that deep neural networks will be integrated into general screening to support clinical/genetic diagnosis, as well as enrich the clinical education.

Download Full-text

Modified Deep Neural Networks for Dog Breeds Identification

10.20944/preprints201812.0232.v1 ◽

2018 ◽

Cited By ~ 1

Author(s):

Aydin Ayanzadeh ◽

Sahand Vahidnia

Keyword(s):

Neural Networks ◽

Deep Neural Networks ◽

State Of The Art ◽

The State ◽

Fine Tuning ◽

Test Accuracy ◽

Data Sets ◽

Data Set

In this paper, we leverage state of the art models on Imagenet data-sets. We use the pre-trained model and learned weighs to extract the feature from the Dog breeds identification data-set. Afterwards, we applied fine-tuning and dataaugmentation to increase the performance of our test accuracy in classification of dog breeds datasets. The performance of the proposed approaches are compared with the state of the art models of Image-Net datasets such as ResNet-50, DenseNet-121, DenseNet-169 and GoogleNet. we achieved 89.66% , 85.37% 84.01% and 82.08% test accuracy respectively which shows thesuperior performance of proposed method to the previous works on Stanford dog breeds datasets.

Download Full-text

A Flexible Research-Oriented Framework for Distributed Training of Deep Neural Networks

2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) ◽

10.1109/ipdpsw52791.2021.00110 ◽

2021 ◽

Author(s):

Sergio Barrachina ◽

Adrian Castello ◽

Mar Catalan ◽

Manuel F. Dolz ◽

Jose I. Mestre

Keyword(s):

Neural Networks ◽

Deep Neural Networks ◽

Distributed Training

Download Full-text

PSO-PS:Parameter Synchronization with Particle Swarm Optimization for Distributed Training of Deep Neural Networks

2020 International Joint Conference on Neural Networks (IJCNN) ◽

10.1109/ijcnn48605.2020.9207698 ◽

2020 ◽

Author(s):

Qing Ye ◽

Yuxuan Han ◽

Yanan Sun ◽

Jiancheng Lv

Keyword(s):

Neural Networks ◽

Particle Swarm Optimization ◽

Deep Neural Networks ◽

Particle Swarm ◽

Swarm Optimization ◽

Distributed Training

Download Full-text

Detection of Melanoma Skin Cancer with Deep Neural Networks

Medical & Clinical Research ◽

10.33140/mcr.04.04.05 ◽

2019 ◽

Vol 4 (4) ◽

Keyword(s):

Neural Networks ◽

Skin Cancer ◽

High Performance ◽

Deep Neural Networks ◽

Histopathological Examination ◽

Skin Lesions ◽

Image Features ◽

Test Accuracy ◽

Data Sets ◽

Deep Convolutional Neural Networks

Detection of skin cancer involves several steps of examinations first being visual diagnosis that is followed by dermoscopic analysis, a biopsy, and histopathological examination. The classification of skin lesions in the first step is critical and challenging as classes vary by minute appearance in skin lesions. Deep convolutional neural networks (CNNs) have great potential in multicategory image-based classification by considering coarse-to-fine image features. This study aims to demonstrate how to classify skin lesions, in particular, melanoma, using CNN trained on data sets with disease labels. We developed and trained our own CNN model using a subset of the images from International Skin Imaging Collaboration (ISIC) Dermoscopic Archive. To test the performance of the proposed model, we used a different subset of images from the same archive as the test set. Our model is trained to classify images into two categories: malignant melanoma and nevus and is shown to achieve excellent classification results with high test accuracy (91.16%) and high performance as measured by various metrics. Our study demonstrated the potential of using deep neural networks to assist early detection of melanoma and thereby improve the patient survival rate from this aggressive skin cancer.

Download Full-text

Closing the Generalization Gap of Adaptive Gradient Methods in Training Deep Neural Networks

Proceedings of the Twenty-Ninth International Joint Conference on Artificial Intelligence ◽

10.24963/ijcai.2020/452 ◽

2020 ◽

Author(s):

Jinghui Chen ◽

Dongruo Zhou ◽

Yiqi Tang ◽

Ziyan Yang ◽

Yuan Cao ◽

...

Keyword(s):

Neural Networks ◽

Convergence Rate ◽

Gradient Descent ◽

Deep Neural Networks ◽

Gradient Methods ◽

Estimation Method ◽

Fast Convergence ◽

Stochastic Gradient Descent ◽

Adaptive Parameter ◽

Fast Convergence Rate

Adaptive gradient methods, which adopt historical gradient information to automatically adjust the learning rate, despite the nice property of fast convergence, have been observed to generalize worse than stochastic gradient descent (SGD) with momentum in training deep neural networks. This leaves how to close the generalization gap of adaptive gradient methods an open problem. In this work, we show that adaptive gradient methods such as Adam, Amsgrad, are sometimes "over adapted". We design a new algorithm, called Partially adaptive momentum estimation method, which unifies the Adam/Amsgrad with SGD by introducing a partial adaptive parameter $p$, to achieve the best from both worlds. We also prove the convergence rate of our proposed algorithm to a stationary point in the stochastic nonconvex optimization setting. Experiments on standard benchmarks show that our proposed algorithm can maintain fast convergence rate as Adam/Amsgrad while generalizing as well as SGD in training deep neural networks. These results would suggest practitioners pick up adaptive gradient methods once again for faster training of deep neural networks.

Download Full-text