Comparative evaluation of performance and scalability of convolutional neural network implementations on a multisystem HPC architecture

2021, Vol 2062 (1), pp. 012008
Author(s): Sunil Pandey, Naresh Kumar Nagwani, Shrish Verma

Abstract The convolutional neural network training algorithm has been implemented for a central processing unit based high performance multisystem architecture machine. The multisystem, or multicomputer, is a parallel machine model which is essentially an abstraction of distributed memory parallel machines; in practice, this model corresponds to high performance computing clusters. The proposed implementation of the convolutional neural network training algorithm is based on modeling the convolutional neural network as a computational pipeline. The various functions or tasks of the convolutional neural network pipeline have been mapped onto the multiple nodes of a central processing unit based high performance computing cluster for task parallelism. The pipeline implementation provides a first-level performance gain through pipeline parallelism. Further performance gains are obtained by distributing the convolutional neural network training across the different nodes of the compute cluster. The two gains are multiplicative. In this work, the authors have carried out a comparative evaluation of the computational performance and scalability of this pipeline implementation of convolutional neural network training against a distributed neural network software program which is based on conventional multi-model training and makes use of a centralized server. The dataset considered for this work is Northeastern University's hot-rolled steel strip surface defect imaging dataset. In both cases, the convolutional neural networks have been trained to classify the different defects on hot-rolled steel strips on the basis of the input image. One hundred images per defect class have been used for training in order to keep the training times manageable. The hyperparameters of both convolutional neural networks were kept identical and the programs were run on the same computational cluster to enable a fair comparison. Both convolutional neural network implementations have been observed to train to nearly 80% training accuracy in 200 epochs. In effect, therefore, the comparison is on the time taken to complete the training epochs.
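The stage-to-node mapping described above can be pictured with a minimal, forward-pass-only sketch using MPI, in which each cluster node (MPI rank) runs one stage of the CNN pipeline and streams its output to the next rank. This is not the authors' implementation: the stage contents, image shapes, and batch counts below are illustrative assumptions (the six classes correspond to the NEU surface-defect categories).

```python
# Minimal, forward-pass-only sketch of CNN pipeline parallelism on a cluster,
# assuming mpi4py and three illustrative stages (not the authors' implementation).
# Run with, e.g.:  mpirun -n 3 python cnn_pipeline_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

NUM_BATCHES = 4              # assumed number of mini-batches streamed through the pipeline
BATCH_SHAPE = (8, 32, 32)    # assumed (batch, height, width) of grayscale defect images
NUM_CLASSES = 6              # the NEU dataset has six surface-defect classes

def conv_stage(x):
    """Stand-in for the convolutional layers: a 3x3 mean filter (valid padding)."""
    h, w = x.shape[1], x.shape[2]
    return sum(x[:, i:h - 2 + i, j:w - 2 + j] for i in range(3) for j in range(3)) / 9.0

def pool_stage(x):
    """Stand-in for pooling: 2x2 subsampling."""
    return x[:, ::2, ::2]

def classify_stage(x):
    """Stand-in for the dense classifier: fixed random projection to class scores."""
    flat = x.reshape(x.shape[0], -1)
    rng = np.random.default_rng(0)
    weights = rng.standard_normal((flat.shape[1], NUM_CLASSES)) / flat.shape[1]
    return flat @ weights

if rank == 0:                        # node 1: convolution stage feeds the pipeline
    for _ in range(NUM_BATCHES):
        batch = np.random.rand(*BATCH_SHAPE)
        comm.send(conv_stage(batch), dest=1)
elif rank == 1:                      # node 2: pooling stage
    for _ in range(NUM_BATCHES):
        comm.send(pool_stage(comm.recv(source=0)), dest=2)
elif rank == 2:                      # node 3: classification stage drains the pipeline
    for b in range(NUM_BATCHES):
        scores = classify_stage(comm.recv(source=1))
        print(f"batch {b}: class scores shape {scores.shape}")
```

Once the pipeline is full, the three stages work on different mini-batches concurrently, which is the pipeline parallelism the abstract refers to; the further gain from distributing the training workload over additional nodes sits on top of this.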

2021, Vol 2062 (1), pp. 012016
Author(s): Sunil Pandey, Naresh Kumar Nagwani, Shrish Verma

Abstract The training of deep learning convolutional neural networks is extremely compute intensive and takes a long time to complete on all but small datasets. This is a major limitation inhibiting the widespread adoption of convolutional neural networks in real-world applications despite their better image classification performance in comparison with other techniques. Multidirectional research and development efforts are therefore being pursued with the objective of boosting the computational performance of convolutional neural networks. Against this background, the development of parallel and scalable deep learning convolutional neural network implementations for multisystem high performance computing architectures is important. Prior analysis based on computational experiments indicates that a combination of pipeline and task parallelism results in significant convolutional neural network performance gains of up to 18 times. This paper discusses the aspects that are important for implementing parallel and scalable convolutional neural networks on central processing unit based multisystem high performance computing architectures: computational pipelines, convolutional neural networks, convolutional neural network pipelines, multisystem high performance computing architectures, and parallel programming models.
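Why the pipeline and task-parallel gains combine multiplicatively can be pictured with a small back-of-the-envelope calculation. The stage count, batch count, node count, and efficiency below are arbitrary illustrative values, not figures from the paper.

```python
# Back-of-the-envelope estimate of the combined speedup from pipeline and
# task parallelism; all stage/node counts and efficiencies are assumed values.
def pipeline_speedup(num_stages, num_batches):
    """Ideal pipeline speedup: num_batches batches finish in
    (num_stages + num_batches - 1) stage-times instead of num_stages * num_batches."""
    return (num_stages * num_batches) / (num_stages + num_batches - 1)

def combined_speedup(num_stages, num_batches, num_nodes, efficiency=0.9):
    """The two gains multiply: pipelining along the stage chain, times the
    distribution of the training workload over num_nodes cluster nodes."""
    return pipeline_speedup(num_stages, num_batches) * num_nodes * efficiency

if __name__ == "__main__":
    # Example: a 3-stage pipeline streaming 100 batches, spread over 4 nodes.
    print(f"pipeline factor : {pipeline_speedup(3, 100):.1f}x")
    print(f"combined factor : {combined_speedup(3, 100, 4):.1f}x")
```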


2020, Vol 92 (1), pp. 517-527
Author(s): Timothy Clements, Marine A. Denolle

Abstract We introduce SeisNoise.jl, a library for high-performance ambient seismic noise cross-correlation, written entirely in the computing language Julia. Julia is a new language, with syntax and a learning curve similar to MATLAB (see Data and Resources), R, or Python, and performance close to Fortran or C. SeisNoise.jl is compatible with high-performance computing resources, using both the central processing unit and the graphics processing unit. SeisNoise.jl is a modular toolbox, giving researchers common tools and data structures to design custom ambient seismic cross-correlation workflows in Julia.
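SeisNoise.jl itself is written in Julia; the following is only a conceptual NumPy sketch of the core operation an ambient-noise workflow performs, frequency-domain cross-correlation of two preprocessed noise traces, and does not use or reflect the SeisNoise.jl API. The sampling rate, trace length, and lag window are assumed values.

```python
# Conceptual sketch (NumPy, NOT the SeisNoise.jl API) of frequency-domain
# cross-correlation of two ambient-noise traces; all parameters are assumed.
import numpy as np
from numpy.fft import rfft, irfft

def cross_correlate(trace_a, trace_b, max_lag_samples):
    """Cross-correlate two equal-length traces via the frequency domain and
    return lags (in samples) and the correlation restricted to +/- max_lag."""
    n = len(trace_a)
    nfft = 2 * n  # zero-pad so the circular correlation behaves like a linear one
    spec_a = rfft(trace_a, nfft)
    spec_b = rfft(trace_b, nfft)
    corr = irfft(spec_a * np.conj(spec_b), nfft)
    corr = np.roll(corr, n)[n - max_lag_samples : n + max_lag_samples + 1]
    lags = np.arange(-max_lag_samples, max_lag_samples + 1)
    return lags, corr

if __name__ == "__main__":
    fs = 20.0                      # assumed sampling rate in Hz
    rng = np.random.default_rng(1)
    noise = rng.standard_normal(4096)
    shifted = np.roll(noise, 40)   # second "station" sees the same noise 2 s later
    lags, corr = cross_correlate(noise, shifted, max_lag_samples=int(10 * fs))
    # the correlation peak recovers the imposed 2 s offset (sign depends on convention)
    print(f"peak lag: {lags[np.argmax(corr)] / fs:.2f} s")
```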


2020, Vol 12 (5), pp. 1-15
Author(s): Zhenghao Han, Li Li, Weiqi Jin, Xia Wang, Gangcheng Jiao, ...

Author(s): Ana Moreton–Fernandez, Hector Ortega–Arranz, Arturo Gonzalez–Escribano

Nowadays the use of hardware accelerators, such as graphics processing units or Xeon Phi coprocessors, is key in solving computationally costly problems that require high performance computing. However, programming an efficient deployment for these kinds of devices is a very complex task that relies on the manual management of memory transfers and configuration parameters. The programmer has to study in detail which data need to be computed at each moment on each computing platform, while also taking architectural details into account. We introduce the controller concept as an abstract entity that allows the programmer to easily manage the communication and kernel-launch details on hardware accelerators in a transparent way. This model also makes it possible to define and launch central processing unit kernels on multi-core processors with the same abstraction and methodology used for the accelerators. It internally combines different native programming models and technologies to exploit the potential of each kind of device. Additionally, the model allows the programmer to simplify the selection of values for the configuration parameters used when a kernel is launched. This is done through a qualitative characterization of the kernel code to be executed. Finally, we present the implementation of the controller model in a prototype library, together with its application in several case studies. Its use has reduced development and porting costs, with very low execution-time overheads compared with manually programmed and optimized solutions that directly use CUDA and OpenMP.
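The controller idea can be pictured with a small, purely illustrative Python sketch (not the authors' prototype library, which combines CUDA and OpenMP natively): a controller object owns data placement and kernel-launch details, while the calling code registers kernels once and launches them by name regardless of which backend executes them. All class and function names below are hypothetical.

```python
# Purely illustrative sketch of a "controller"-style abstraction in Python
# (not the authors' prototype library); it hides data transfers and backend
# selection behind a single interface.
import numpy as np

class Controller:
    def __init__(self, backend="cpu"):
        self.backend = backend        # "cpu" here; a GPU backend could plug in the same way
        self._kernels = {}
        self._buffers = {}

    def register_kernel(self, name, fn, tuning=None):
        """Register a kernel with optional launch/tuning hints."""
        self._kernels[name] = (fn, tuning or {})

    def to_device(self, name, host_array):
        """'Transfer' host data to the device; on the CPU backend this is a copy."""
        self._buffers[name] = np.array(host_array, copy=True)

    def launch(self, kernel_name, *buffer_names):
        """Launch a kernel on device-resident buffers; the controller supplies
        the configuration parameters from the kernel's tuning hints."""
        fn, tuning = self._kernels[kernel_name]
        return fn(*(self._buffers[b] for b in buffer_names), **tuning)

    def to_host(self, result):
        """Bring a result back to host memory (trivial on the CPU backend)."""
        return np.asarray(result)

# -- usage --------------------------------------------------------------------
def saxpy(x, y, alpha=2.0):           # example "kernel": alpha * x + y
    return alpha * x + y

ctrl = Controller(backend="cpu")
ctrl.register_kernel("saxpy", saxpy, tuning={"alpha": 3.0})
ctrl.to_device("x", np.arange(5, dtype=np.float32))
ctrl.to_device("y", np.ones(5, dtype=np.float32))
print(ctrl.to_host(ctrl.launch("saxpy", "x", "y")))   # [ 1.  4.  7. 10. 13.]
```

Swapping the CPU backend for an accelerator backend would only change the controller internals (how buffers are allocated and how a launch maps to a device kernel); the calling code stays the same, which is the portability the abstract describes.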

