Analyzing and mitigating data stalls in DNN training

2021, Vol. 14 (5), pp. 771-784
Author(s): Jayashree Mohan, Amar Phanishayee, Ashish Raniwala, Vijay Chidambaram

Training Deep Neural Networks (DNNs) is resource-intensive and time-consuming. While prior research has explored many different ways of reducing DNN training time, the impact of the input data pipeline, i.e., fetching raw data items from storage and performing data pre-processing in memory, has been relatively unexplored. This paper makes the following contributions: (1) We present the first comprehensive analysis of how the input data pipeline affects the training time of widely used computer vision and audio DNNs, which typically involve complex data pre-processing. We analyze nine different models across three tasks and four datasets while varying factors such as the amount of memory, number of CPU threads, storage device, and GPU generation on servers that are part of a large production cluster at Microsoft. We find that in many cases, DNN training time is dominated by data stall time: time spent waiting for data to be fetched and pre-processed. (2) We build a tool, DS-Analyzer, to precisely measure data stalls using a differential technique and to perform predictive what-if analysis on data stalls. (3) Finally, based on the insights from our analysis, we design and implement three simple but effective techniques in a data-loading library, CoorDL, to mitigate data stalls. Our experiments on a range of DNN tasks, models, datasets, and hardware configurations show that when PyTorch uses CoorDL instead of the state-of-the-art DALI data-loading library, DNN training time is reduced significantly (by as much as 5X on a single server).
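The sketch below illustrates the basic idea of separating data stall time from compute time in a PyTorch-style training loop; it is not the authors' DS-Analyzer or CoorDL, and `model`, `loader`, `criterion`, and `optimizer` are assumed placeholders supplied by the caller.

```python
import time
import torch

def measure_data_stalls(model, loader, criterion, optimizer, device="cuda"):
    """Roughly split each training iteration into data-wait time and compute time.

    Any time spent blocked on the DataLoader (fetch + pre-processing) is counted
    as stall time; the remainder of the iteration is counted as compute time.
    """
    stall_time, compute_time = 0.0, 0.0
    it = iter(loader)
    while True:
        t0 = time.perf_counter()
        try:
            inputs, targets = next(it)          # blocks on fetch + pre-processing
        except StopIteration:
            break
        stall_time += time.perf_counter() - t0

        t1 = time.perf_counter()
        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = criterion(model(inputs), targets)
        loss.backward()
        optimizer.step()
        if torch.cuda.is_available():
            torch.cuda.synchronize()            # make GPU time observable on the host
        compute_time += time.perf_counter() - t1
    return stall_time, compute_time
```

A high ratio of stall time to compute time indicates that the input pipeline, not the GPU, is the bottleneck for that configuration.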

Author(s): Hao Yu, Sen Yang, Shenghuo Zhu

In distributed training of deep neural networks, parallel mini-batch SGD is widely used to speed up the training process by using multiple workers. It uses multiple workers to sample local stochastic gradients in parallel, aggregates all gradients on a single server to obtain their average, and updates each worker's local model using an SGD step with the averaged gradient. Ideally, parallel mini-batch SGD can achieve a linear speed-up of the training time (with respect to the number of workers) compared with SGD over a single worker. However, such linear scalability in practice is significantly limited by the growing demand for gradient communication as more workers are involved. Model averaging, which periodically averages individual models trained over parallel workers, is another common practice used for distributed training of deep neural networks since (Zinkevich et al. 2010; McDonald, Hall, and Mann 2010). Compared with parallel mini-batch SGD, the communication overhead of model averaging is significantly reduced. Impressively, a large body of experimental work has verified that model averaging can still achieve a good speed-up of the training time as long as the averaging interval is carefully controlled. However, it remains a mystery in theory why such a simple heuristic works so well. This paper provides a thorough and rigorous theoretical study of why model averaging can work as well as parallel mini-batch SGD with significantly less communication overhead.
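A minimal sketch of the model-averaging (local SGD) scheme described above, assuming a torch.distributed process group has already been initialized; `avg_interval` is the averaging interval the paper's analysis is concerned with.

```python
import torch
import torch.distributed as dist

def local_sgd_step(model, loss, optimizer, step, avg_interval):
    """One worker's update in local SGD with periodic model averaging.

    Each worker runs plain SGD on its own mini-batches; every `avg_interval`
    steps the worker models are averaged, instead of averaging gradients on
    every step as parallel mini-batch SGD does.
    """
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    if (step + 1) % avg_interval == 0:
        world_size = dist.get_world_size()
        for p in model.parameters():
            dist.all_reduce(p.data, op=dist.ReduceOp.SUM)  # sum across workers
            p.data /= world_size                           # average the models
```

Setting `avg_interval = 1` recovers (up to the gradient-vs-parameter distinction) the communication cost of parallel mini-batch SGD; larger intervals trade communication for the averaging error the paper analyzes.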


Sensors, 2021, Vol. 21 (3), pp. 676
Author(s): Andrej Zgank

Animal activity acoustic monitoring is becoming one of the necessary tools in agriculture, including beekeeping, as it can assist in the control of beehives in remote locations. It is possible to classify bee swarm activity from audio signals using such approaches. An IoT-based acoustic swarm classification system using deep neural networks is proposed in this paper. Audio recordings were obtained from the Open Source Beehive project. Mel-frequency cepstral coefficient (MFCC) features were extracted from the audio signal. The lossless WAV and lossy MP3 audio formats were compared for IoT-based solutions. An analysis was made of the impact of the deep neural network parameters on the classification results. The best overall classification accuracy with uncompressed audio was 94.09%, but MP3 compression degraded the DNN accuracy by over 10%. The evaluation of the proposed IoT-based deep neural network bee activity acoustic classification showed improved results compared to the previous hidden Markov model system.
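A rough sketch of the MFCC-feature front end plus a small neural classifier, assuming librosa for feature extraction; the file path, feature averaging, and classifier shape are illustrative assumptions, not the paper's exact configuration.

```python
import librosa
import numpy as np
import torch
import torch.nn as nn

def mfcc_features(wav_path, n_mfcc=13):
    """Load a recording and return a time-averaged MFCC feature vector."""
    signal, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    return mfcc.mean(axis=1)  # one vector per recording

# Small feed-forward classifier over MFCC vectors (e.g., swarm vs. non-swarm activity).
classifier = nn.Sequential(
    nn.Linear(13, 64), nn.ReLU(),
    nn.Linear(64, 2),
)

# Example use (hypothetical file):
# features = torch.tensor(mfcc_features("hive_recording.wav"), dtype=torch.float32)
# logits = classifier(features.unsqueeze(0))
```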


2018, Vol. 28 (4), pp. 735-744
Author(s): Michał Koziarski, Bogusław Cyganek

Due to the advances made in recent years, methods based on deep neural networks have been able to achieve state-of-the-art performance in various computer vision problems. In some tasks, such as image recognition, neural-based approaches have even been able to surpass human performance. However, the benchmarks on which neural networks achieve these impressive results usually consist of fairly high-quality data. On the other hand, in practical applications we are often faced with images of low quality, affected by factors such as low resolution, the presence of noise, or a small dynamic range. It is unclear how resilient deep neural networks are to the presence of such factors. In this paper we experimentally evaluate the impact of low resolution on the classification accuracy of several notable neural architectures of recent years. Furthermore, we examine the possibility of improving neural networks' performance in the task of low-resolution image recognition by applying super-resolution prior to classification. The results of our experiments indicate that contemporary neural architectures remain significantly affected by low image resolution. By applying super-resolution prior to classification we were able to alleviate this issue to a large extent, as long as the resolution of the images did not decrease too severely. However, in the case of very low-resolution images the classification accuracy remained considerably affected.
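A hedged sketch of the evaluation pipeline implied above: degrade an image to low resolution, restore it before classification, and classify. Here plain bicubic interpolation stands in for a learned super-resolution model, and the ResNet-50 classifier and resolutions are assumptions for illustration.

```python
import torch
import torch.nn.functional as F
from torchvision import models

classifier = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()

def classify_low_res(image, low_res=56, target=224):
    """Simulate a low-resolution input, upsample it back, then classify.

    `image` is a (1, 3, H, W) float tensor; bicubic upsampling is a stand-in
    for the super-resolution step applied prior to classification.
    """
    degraded = F.interpolate(image, size=low_res, mode="bicubic", align_corners=False)
    restored = F.interpolate(degraded, size=target, mode="bicubic", align_corners=False)
    with torch.no_grad():
        logits = classifier(restored)
    return logits.argmax(dim=1)
```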


Author(s): Yuzuru Okajima, Kunihiko Sadamasa

Deep neural networks achieve high predictive accuracy by learning latent representations of complex data. However, the reasoning behind their decisions is difficult for humans to understand. On the other hand, rule-based approaches are able to justify decisions by showing the decision rules leading to them, but they have relatively low accuracy. To improve the interpretability of neural networks, several techniques provide post-hoc explanations of decisions made by neural networks, but they cannot guarantee that the decisions are always explained in a simple form like decision rules, because their explanations are generated after the decisions are made by the neural networks. In this paper, to balance the accuracy of neural networks and the interpretability of decision rules, we propose a hybrid technique called rule-constrained networks: neural networks that make decisions by selecting decision rules from a given ruleset. Because the networks are forced to make decisions based on decision rules, it is guaranteed that every decision is supported by a decision rule. Furthermore, we propose a technique to jointly optimize the neural network and the ruleset from which the network selects rules. The log-likelihood of correct classifications is maximized under a model with hyperparameters for the ruleset size and the prior probabilities of rules being selected. This makes it possible to limit the ruleset size or to prioritize human-made rules over automatically acquired rules, promoting the interpretability of the output. Experiments on time-series and sentiment classification datasets showed that rule-constrained networks achieved accuracy as high as that of the original neural networks and significantly higher than that of existing rule-based models, while presenting decision rules supporting the decisions.
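A loose sketch of the rule-selection idea, not the authors' exact model or training objective: the network scores a fixed ruleset, each rule carries a class label, and the class probability is the total probability of selecting a rule that predicts that class, so every decision can be traced back to a selectable rule.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RuleConstrainedNet(nn.Module):
    """Illustrative rule-constrained classifier (assumed, simplified formulation)."""

    def __init__(self, in_dim, rule_classes, n_classes):
        super().__init__()
        # Network that scores each rule in the given ruleset.
        self.scorer = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                    nn.Linear(64, len(rule_classes)))
        # Class label predicted by each rule (one entry per rule).
        self.register_buffer("rule_classes", torch.tensor(rule_classes))
        self.n_classes = n_classes

    def forward(self, x):
        select_prob = self.scorer(x).softmax(dim=-1)                  # P(select rule r | x)
        rule_to_class = F.one_hot(self.rule_classes, self.n_classes).float()
        class_prob = select_prob @ rule_to_class                      # P(class c | x)
        return class_prob  # train by maximizing log class_prob of the correct class
```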


2021, Vol. 3 (4), pp. 966-989
Author(s): Vanessa Buhrmester, David Münch, Michael Arens

Deep Learning is a state-of-the-art technique for making inferences on extensive or complex data. Due to their multilayer nonlinear structure, Deep Neural Networks act as black-box models and are often criticized as being non-transparent, with predictions that are not traceable by humans. Furthermore, the models learn from artificially generated datasets, which often do not reflect reality. By basing decision-making algorithms on Deep Neural Networks, prejudice and unfairness may be promoted unknowingly due to a lack of transparency. Hence, several so-called explanators, or explainers, have been developed. Explainers try to give insight into the inner structure of machine learning black boxes by analyzing the connection between the input and the output. In this survey, we present the mechanisms and properties of explaining systems for Deep Neural Networks in Computer Vision tasks. We give a comprehensive overview of the taxonomy of related studies and compare several survey papers that deal with explainability in general. We work out the drawbacks and gaps and summarize further research ideas.
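As one concrete instance of the input-output analysis such explainers perform, a vanilla gradient saliency map is sketched below; this is a generic example of the class of methods surveyed, not a method proposed in the survey itself.

```python
import torch

def saliency_map(model, image, target_class):
    """Vanilla gradient saliency: per-pixel sensitivity of the target-class score.

    `image` is a (1, 3, H, W) tensor; the returned map highlights pixels whose
    change most affects the score, giving a simple input-output explanation.
    """
    image = image.clone().requires_grad_(True)
    score = model(image)[0, target_class]
    score.backward()
    return image.grad.abs().max(dim=1)[0]  # (1, H, W) importance, max over channels
```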


Sensors, 2021, Vol. 21 (2), pp. 501
Author(s): Marcin Malesa, Piotr Rajkiewicz

Product quality control is currently the leading trend in industrial production. It is heading towards the exact analysis of each product before it reaches the end customer. Every stage of production control is of particular importance in the food and pharmaceutical industries, where, apart from visual issues, additional safety regulations are demanded. Many production processes can be controlled completely contactlessly through the use of machine vision cameras and advanced image processing techniques. The most dynamically growing class of image analysis methods comprises solutions based on deep neural networks. Their major advantages are fast performance, robustness, and the fact that they can be exploited even in complicated classification problems. However, the use of machine learning methods on high-performance production lines may be limited by inference time or, in the case of multi-format production lines, by training time. This article presents a novel data preprocessing (or calibration) method. It uses prior knowledge about the optical system, which enables the use of a lightweight Convolutional Neural Network (CNN) model for product quality control of polyethylene terephthalate (PET) bottle caps. The combination of preprocessing with the lightweight CNN model resulted in at least a five-fold reduction in prediction and training time compared to the lighter standard models tested on ImageNet, without loss of accuracy.
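The abstract does not specify the CNN; the block below is a purely hypothetical example of what a lightweight binary quality-control classifier could look like once the calibration step has already cropped and rectified the cap region.

```python
import torch.nn as nn

# Hypothetical lightweight CNN for binary cap classification (good / defective).
# Few filters and a small input resolution keep inference and training fast.
lightweight_cnn = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 2),
)
```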


2021, Vol. 15
Author(s): Stefano Brivio, Denys R. B. Ly, Elisa Vianello, Sabina Spiga

Spiking neural networks (SNNs) are a computational tool in which information is coded into spikes, as in some parts of the brain, unlike conventional neural networks (NNs) that compute over real numbers. SNNs can therefore implement intelligent information extraction in real time at the edge of data acquisition and represent a complementary solution to conventional NNs used for cloud computing. Both NN classes face hardware constraints due to limited computing parallelism and the separation of logic and memory. Emerging memory devices, such as resistive switching memories, phase change memories, or memristive devices in general, are strong candidates to remove these hurdles for NN applications. The well-established training procedures of conventional NNs helped in defining the desiderata for memristive device dynamics implementing synaptic units. The generally agreed requirements are a linear evolution of the memristive conductance upon stimulation with trains of identical pulses and a symmetric conductance change for conductance increase and decrease. Conversely, little work has been done to understand the main properties of memristive devices supporting efficient SNN operation. The reason lies in the lack of a background theory for their training. As a consequence, requirements for NNs have been taken as a reference to develop memristive devices for SNNs. In the present work, we show that, for efficient CMOS/memristive SNNs, the requirements for synaptic memristive dynamics are very different from the needs of a conventional NN. System-level simulations of an SNN trained to classify hand-written digit images through a spike-timing-dependent plasticity protocol are performed considering various plausible linear and non-linear synaptic memristive dynamics. We consider memristive dynamics bounded by artificial hard conductance values and dynamics limited by the natural evolution toward asymptotic values (soft boundaries). We quantitatively analyze the impact of the resolution and non-linearity properties of the synapses on the network training and classification performance. Finally, we demonstrate that non-linear synapses with hard boundary values enable higher classification performance and realize the best trade-off between classification accuracy and required training time. With reference to the obtained results, we discuss how memristive devices with non-linear dynamics constitute a technologically convenient solution for the development of on-line SNN training.
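To make the two classes of synaptic dynamics concrete, here is a minimal sketch of a hard-bounded linear update and a commonly used soft-bounded non-linear update per programming pulse; the specific functional form and parameter names are assumptions, not the paper's exact device models.

```python
import numpy as np

def linear_update(g, dg, g_min=0.0, g_max=1.0):
    """Linear conductance change per pulse, clipped at artificial hard boundaries."""
    return np.clip(g + dg, g_min, g_max)

def soft_bounded_update(g, alpha, g_min=0.0, g_max=1.0, potentiate=True):
    """Non-linear update with soft boundaries: the step shrinks as the conductance
    approaches its asymptotic value, as in many real memristive devices."""
    if potentiate:
        return g + alpha * (g_max - g)
    return g - alpha * (g - g_min)
```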


Author(s): Maria Refinetti, Stéphane d'Ascoli, Ruben Ohana, Sebastian Goldt

Direct Feedback Alignment (DFA) is emerging as an efficient and biologically plausible alternative to backpropagation for training deep neural networks. Despite relying on random feedback weights for the backward pass, DFA successfully trains state-of-the-art models such as Transformers. On the other hand, it notoriously fails to train convolutional networks. An understanding of the inner workings of DFA that explains these diverging results remains elusive. Here, we propose a theory of feedback alignment algorithms. We first show that learning in shallow networks proceeds in two steps: an alignment phase, where the model adapts its weights to align the approximate gradient with the true gradient of the loss function, is followed by a memorisation phase, where the model focuses on fitting the data. This two-step process has a degeneracy-breaking effect: out of all the low-loss solutions in the landscape, a network trained with DFA naturally converges to the solution which maximises gradient alignment. We also identify a key quantity underlying alignment in deep linear networks: the conditioning of the alignment matrices. The latter enables a detailed understanding of the impact of data structure on alignment, and suggests a simple explanation for the well-known failure of DFA to train convolutional neural networks. Numerical experiments on MNIST and CIFAR-10 clearly demonstrate degeneracy breaking in deep non-linear networks and show that the align-then-memorise process occurs sequentially from the bottom layers of the network to the top.
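A minimal NumPy sketch of DFA on a two-layer network, under assumed layer sizes, a tanh hidden layer, and a squared-error loss: the output error is projected to the hidden layer through a fixed random matrix B rather than through the transpose of the forward weights as in backpropagation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_out, lr = 784, 256, 10, 0.1

W1 = rng.normal(0, 1 / np.sqrt(n_in), (n_hidden, n_in))
W2 = rng.normal(0, 1 / np.sqrt(n_hidden), (n_out, n_hidden))
B = rng.normal(0, 1 / np.sqrt(n_out), (n_hidden, n_out))   # fixed random feedback

def dfa_step(x, y):
    """One DFA update for a two-layer tanh network with squared-error loss."""
    global W1, W2
    h = np.tanh(W1 @ x)                       # hidden activations
    y_hat = W2 @ h                            # network output
    e = y_hat - y                             # output error
    delta_hidden = (B @ e) * (1 - h ** 2)     # DFA: random projection of the error
    W2 -= lr * np.outer(e, h)                 # same update as backprop for the top layer
    W1 -= lr * np.outer(delta_hidden, x)      # hidden update driven by B, not W2.T
```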

