Optimal memory-aware backpropagation of deep join networks

Author(s):  
Olivier Beaumont ◽  
Julien Herrmann ◽  
Guillaume Pallez (Aupy) ◽  
Alena Shilova

The memory needs of deep learning training can prevent the user from considering large models and large batch sizes. In this work, we propose to use techniques from memory-aware scheduling and automatic differentiation (AD) to execute a backpropagation graph with a bounded memory requirement, at the cost of extra recomputations. The case of a single homogeneous chain, i.e. the case of a network whose stages are all identical and form a chain, is well understood and optimal solutions have been proposed in the AD literature. The networks encountered in practice in the context of deep learning are much more diverse, both in terms of shape and heterogeneity. In this work, we define the class of backpropagation graphs and extend the class of graphs on which a solution minimizing the total number of recomputations can be computed in polynomial time. In particular, we consider join graphs, which correspond to models such as siamese or cross-modal networks. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.
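
As an illustration of the memory/recomputation trade-off (not the join-graph algorithm proposed in the paper), the sketch below applies PyTorch's checkpoint_sequential to a homogeneous chain: only activations at segment boundaries are kept, and the others are recomputed during the backward pass. The 32-layer toy chain, layer width and segment count are arbitrary assumptions.

```python
# Minimal checkpointing sketch for a homogeneous chain (illustrative only).
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# Toy chain of 32 identical stages.
layers = nn.Sequential(*[nn.Sequential(nn.Linear(512, 512), nn.ReLU())
                         for _ in range(32)])
x = torch.randn(64, 512, requires_grad=True)

# Store activations only at 4 segment boundaries instead of all 32 stages;
# the missing activations are recomputed when their gradients are needed.
y = checkpoint_sequential(layers, 4, x)
loss = y.sum()
loss.backward()   # bounded memory, at the cost of extra forward recomputations
```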

2019 ◽  
Author(s):  
Eva Malta ◽  
Charles Rodamilans ◽  
Sandra Avila ◽  
Edson Borin

This paper analyzes the cost-benefit of using EC2 instances, specifically the p2 and p3 virtual machine types, which have GPU accelerators, to execute a machine learning algorithm. The analysis includes the runtime of convolutional neural network executions and takes into consideration the time necessary to stabilize the accuracy value with different batch sizes. We also measure the cost of using each machine type and define a relation between this cost and the execution time for each virtual machine. The results show that, although the price per hour of the p3 instance is three times higher, it is faster and costs almost the same as the p2 instance type to train the deep learning algorithm.
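
The cost/runtime relation described above boils down to price per hour times training time; the figures in the following sketch are placeholders, not the measured values from the paper.

```python
# Back-of-the-envelope cost comparison of two GPU instance types.
def training_cost(price_per_hour, runtime_hours):
    """Total cost of one training run on a given instance type."""
    return price_per_hour * runtime_hours

# Hypothetical prices (USD/h) and runtimes (h); p3 is ~3x the price but faster.
instances = {"p2": (0.90, 10.0), "p3": (3.06, 3.2)}

for name, (price, hours) in instances.items():
    print(f"{name}: {training_cost(price, hours):.2f} USD")
```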


Symmetry ◽  
2020 ◽  
Vol 12 (12) ◽  
pp. 1939
Author(s):  
Jun Wei Chen ◽  
Xanno K. Sigalingging ◽  
Jenq-Shiou Leu ◽  
Jun-Ichi Takada

In recent years, Chinese has become one of the most popular languages globally, and the demand for automatic Chinese sentence correction has gradually increased. This research can be applied to Chinese language learning to reduce the cost of learning and the feedback time, and to help writers check for wrong words. The traditional way to perform Chinese sentence correction is to check whether each word exists in a predefined dictionary; however, this kind of method cannot deal with semantic errors. As deep learning has become popular, an artificial neural network can be applied to understand a sentence’s context and correct semantic errors. However, several issues remain open: in particular, the accuracy and the computation time required to correct a sentence are still insufficient, so deep-learning-based Chinese sentence correction may not yet be ready for large-scale commercial applications. Our goal is to obtain a model with better accuracy and computation time. Combining a recurrent neural network with Bidirectional Encoder Representations from Transformers (BERT), a recently popular model known for its high accuracy but slow inference, we introduce a hybrid model for Chinese sentence correction that improves both accuracy and inference speed. Among the results, BERT-GRU obtains the highest BLEU score in all experiments. The inference speed of the original transformer-based model can be improved by 1131% with beam search decoding in the 128-word experiment, and by 452% with greedy decoding. The longer the sequence, the larger the improvement.
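
The abstract does not spell out the BERT-GRU wiring, so the following PyTorch sketch is only one plausible arrangement: a pretrained Chinese BERT encoder feeding a GRU that stands in for the slower transformer decoder. The model name, hidden size and single decoder layer are illustrative assumptions.

```python
# Hypothetical BERT-encoder + GRU-decoder corrector (architecture assumed).
import torch.nn as nn
from transformers import BertModel, BertTokenizer

class BertGRUCorrector(nn.Module):
    def __init__(self, vocab_size, hidden=768):
        super().__init__()
        self.encoder = BertModel.from_pretrained("bert-base-chinese")
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, input_ids, attention_mask):
        enc = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state
        dec, _ = self.decoder(enc)   # GRU replaces the transformer decoder
        return self.out(dec)         # per-token logits over the vocabulary

tok = BertTokenizer.from_pretrained("bert-base-chinese")
batch = tok(["今天天氣很好"], return_tensors="pt")
model = BertGRUCorrector(vocab_size=tok.vocab_size)
logits = model(batch["input_ids"], batch["attention_mask"])
```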


Author(s):  
Tony Hey ◽  
Keith Butler ◽  
Sam Jackson ◽  
Jeyarajan Thiyagalingam

This paper reviews some of the challenges posed by the huge growth of experimental data generated by the new generation of large-scale experiments at UK national facilities at the Rutherford Appleton Laboratory (RAL) site at Harwell near Oxford. Such ‘Big Scientific Data’ comes from the Diamond Light Source and Electron Microscopy Facilities, the ISIS Neutron and Muon Facility and the UK's Central Laser Facility. Increasingly, scientists are required to use advanced machine learning and other AI technologies both to automate parts of the data pipeline and to help find new scientific discoveries in the analysis of their data. For commercially important applications, such as object recognition, natural language processing and automatic translation, deep learning has made dramatic breakthroughs. Google's DeepMind has now used deep learning technology to develop its AlphaFold tool to make predictions for protein folding, and remarkably has achieved some spectacular results for this specific scientific problem. Can deep learning be similarly transformative for other scientific problems? After a brief review of some initial applications of machine learning at the RAL, we focus on challenges and opportunities for AI in advancing materials science. Finally, we discuss the importance of developing some realistic machine learning benchmarks using Big Scientific Data coming from several different scientific domains. We conclude with some initial examples of our ‘scientific machine learning’ benchmark suite and of the research challenges these benchmarks will enable. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.


Author(s):  
Erin Carson ◽  
Zdeněk Strakoš

With exascale-level computation on the horizon, the art of predicting the cost of computations has acquired a renewed focus. This task is especially challenging in the case of iterative methods, for which convergence behaviour often cannot be determined with certainty a priori (unless we are satisfied with potentially outrageous overestimates) and which typically suffer from performance bottlenecks at scale due to synchronization cost. Moreover, the amplification of rounding errors can substantially affect the practical performance, in particular for methods with short recurrences. In this article, we focus on what we consider to be key points which are crucial to understanding the cost of iteratively solving linear algebraic systems. This naturally leads us to questions on the place of numerical analysis in relation to mathematics, computer science and sciences, in general. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.
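
As a concrete reference point for the short-recurrence methods whose rounding-error behaviour the article discusses, here is a textbook conjugate gradient sketch in NumPy; it is not the authors' analysis, only an illustration of how little is carried between iterations.

```python
# Textbook conjugate gradient: short recurrences, minimal stored state.
import numpy as np

def cg(A, b, tol=1e-10, maxit=1000):
    x = np.zeros_like(b)
    r = b - A @ x
    p = r.copy()
    rs = r @ r
    for k in range(maxit):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p          # only x, r, p are carried between iterations
        r -= alpha * Ap         # recursively updated residual
        rs_new = r @ r
        if np.sqrt(rs_new) < tol * np.linalg.norm(b):
            return x, k + 1
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x, maxit

# In finite precision the recursively updated r can drift away from the
# true residual b - A @ x, one source of the effects discussed above.
n = 200
Q, _ = np.linalg.qr(np.random.randn(n, n))
A = Q @ np.diag(np.logspace(0, 6, n)) @ Q.T   # ill-conditioned SPD test matrix
b = np.random.randn(n)
x, iters = cg(A, b)
```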


Author(s):  
Jia Lu ◽  
Wei Qi Yan

As the cost of security monitoring facilities such as cameras decreases, video surveillance has been widely applied to public security and safety settings such as banks, transportation and shopping malls, allowing police to monitor abnormal events. Through deep learning, the authors achieve high performance in human behavior detection and recognition by training and testing models. This chapter uses the public Weizmann and KTH datasets to train deep learning models. Four deep learning models were investigated for human behavior recognition. Results show that the YOLOv3 model performs best, achieving 96.29% mAP on the Weizmann dataset and 84.58% mAP on the KTH dataset. The chapter conducts human behavior recognition using deep learning and evaluates the outcomes of the different approaches on these datasets.


Algorithms ◽  
2019 ◽  
Vol 12 (8) ◽  
pp. 154 ◽  
Author(s):  
Mário P. Véstias

The convolutional neural network (CNN) is one of the most used deep learning models for image detection and classification, due to its high accuracy compared to other machine learning algorithms. CNNs achieve better results at the cost of higher computing and memory requirements, so inference of convolutional neural networks is usually done on centralized high-performance platforms. However, many applications based on CNNs are migrating to edge devices near the source of data, owing to the unreliability of the transmission channel used to exchange data with a central server, channel latency that many applications cannot tolerate, and security and data privacy concerns. While advantageous, deep learning on the edge is quite challenging because edge devices are usually limited in terms of performance, cost and energy. Reconfigurable computing is being considered for inference on the edge due to its high performance and energy efficiency, while keeping a high hardware flexibility that allows easy adaptation of the target computing platform to the CNN model. In this paper, we describe the features of the most common CNNs, the capabilities of reconfigurable computing for running CNNs, the state of the art of reconfigurable computing implementations proposed to run CNN models, and the trends and challenges for future edge reconfigurable platforms.
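
To make the performance and energy constraints of edge inference concrete, the sketch below quantizes a single convolution to int8 with a wide integer accumulator, the kind of reduced-precision arithmetic that reconfigurable accelerators exploit; the shapes, scales and naive loop nest are arbitrary, and no specific platform is assumed.

```python
# Illustrative int8 convolution with an int32 accumulator (no platform assumed).
import numpy as np

def quantize(x, scale):
    return np.clip(np.round(x / scale), -128, 127).astype(np.int8)

def conv2d_int8(img_q, ker_q, s_img, s_ker):
    H, W = img_q.shape
    k = ker_q.shape[0]
    out = np.zeros((H - k + 1, W - k + 1), dtype=np.int32)  # wide accumulator
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img_q[i:i+k, j:j+k].astype(np.int32)
                               * ker_q.astype(np.int32))
    return out * (s_img * s_ker)   # dequantize back to floating point

img = np.random.rand(28, 28).astype(np.float32)
ker = np.random.randn(3, 3).astype(np.float32)
s_img, s_ker = img.max() / 127, np.abs(ker).max() / 127
approx = conv2d_int8(quantize(img, s_img), quantize(ker, s_ker), s_img, s_ker)
```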


Author(s):  
Katherine Yelick ◽  
Aydın Buluç ◽  
Muaaz Awan ◽  
Ariful Azad ◽  
Benjamin Brock ◽  
...  

Genomic datasets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share these data with the research community, but some of these genomic data analysis problems require large-scale computational platforms to meet both the memory and computational requirements. These applications differ from scientific simulations that dominate the workload on high-end parallel systems today and place different requirements on programming support, software libraries and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high-performance genomics analysis, including alignment, profiling, clustering and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or ‘motifs’ that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.
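
As a toy illustration of the ‘hashing’ motif singled out above, the sketch below counts k-mers with a hash table, a data-dependent access pattern quite unlike regular simulation workloads. It is deliberately sequential; the distributed, asynchronous version of this update is where the parallelization challenges described in the article arise.

```python
# Sequential k-mer counting: the hashing motif in its simplest form.
from collections import Counter

def count_kmers(reads, k):
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1   # one hash-table update per k-mer
    return counts

reads = ["ACGTACGTGGA", "TTACGTACGAA"]   # made-up reads, not real data
print(count_kmers(reads, k=4).most_common(3))
```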


Author(s):  
Jack Dongarra ◽  
Laura Grigori ◽  
Nicholas J. Higham

A number of features of today’s high-performance computers make it challenging to exploit these machines fully for computational science. These include increasing core counts but stagnant clock frequencies; the high cost of data movement; use of accelerators (GPUs, FPGAs, coprocessors), making architectures increasingly heterogeneous; and multiple precisions of floating-point arithmetic, including half-precision. Moreover, as well as maximizing speed and accuracy, minimizing energy consumption is an important criterion. New generations of algorithms are needed to tackle these challenges. We discuss some approaches that we can take to develop numerical algorithms for high-performance computational science, with a view to exploiting the next generation of supercomputers. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.
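
One well-known algorithmic response to multiple precisions is mixed-precision iterative refinement: factorize once in low precision, then correct the solution with residuals computed in the working precision. The NumPy/SciPy sketch below uses float32 as a stand-in for half precision and is purely illustrative, not an algorithm taken from the article.

```python
# Mixed-precision iterative refinement sketch (float32 standing in for half).
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def mixed_precision_solve(A, b, iters=5):
    lu = lu_factor(A.astype(np.float32))          # factorize once, low precision
    x = lu_solve(lu, b.astype(np.float32)).astype(np.float64)
    for _ in range(iters):
        r = b - A @ x                             # residual in working precision
        x += lu_solve(lu, r.astype(np.float32)).astype(np.float64)
    return x

n = 500
A = np.random.randn(n, n) + n * np.eye(n)         # well-conditioned test matrix
b = np.random.randn(n)
x = mixed_precision_solve(A, b)
print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))
```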


2021 ◽  
Vol 47 (2) ◽  
pp. 1-28
Author(s):  
Goran Flegar ◽  
Hartwig Anzt ◽  
Terry Cojean ◽  
Enrique S. Quintana-Ortí

The use of mixed precision in numerical algorithms is a promising strategy for accelerating scientific applications. In particular, the adoption of specialized hardware and data formats for low-precision arithmetic in high-end GPUs (graphics processing units) has motivated numerous efforts aiming at carefully reducing the working precision in order to speed up the computations. For algorithms whose performance is bound by the memory bandwidth, the idea of compressing the data before (and after) memory accesses has received considerable attention. One idea is to store an approximate operator, such as a preconditioner, in lower than working precision, hopefully without impacting the algorithm output. We realize the first high-performance implementation of an adaptive precision block-Jacobi preconditioner, which selects the precision format used to store the preconditioner data on the fly, taking into account the numerical properties of the individual preconditioner blocks. We implement the adaptive block-Jacobi preconditioner as production-ready functionality in the Ginkgo linear algebra library, considering not only the precision formats that are part of the IEEE standard, but also customized formats which optimize the length of the exponent and significand to the characteristics of the preconditioner blocks. Experiments run on a state-of-the-art GPU accelerator show that our implementation offers attractive runtime savings.
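
A conceptual sketch of the adaptive-precision idea follows; it is not Ginkgo's implementation and uses only IEEE float16/float32/float64 rather than customized exponent/significand formats. Each diagonal block's inverse is stored in a precision chosen from its condition number and promoted back to working precision when the preconditioner is applied; block size and thresholds are arbitrary.

```python
# Toy adaptive-precision block-Jacobi preconditioner (conceptual sketch only).
import numpy as np

def build_adaptive_block_jacobi(A, block_size=4):
    blocks = []
    for start in range(0, A.shape[0], block_size):
        sl = slice(start, min(start + block_size, A.shape[0]))
        D = A[sl, sl]
        kappa = np.linalg.cond(D)
        # Arbitrary thresholds: well-conditioned blocks tolerate less precision.
        dtype = (np.float16 if kappa < 1e2
                 else np.float32 if kappa < 1e5
                 else np.float64)
        blocks.append((sl, np.linalg.inv(D).astype(dtype)))
    return blocks

def apply_preconditioner(blocks, r):
    z = np.zeros_like(r)
    for sl, Dinv in blocks:
        z[sl] = Dinv.astype(np.float64) @ r[sl]   # promote for the application
    return z

A = np.random.randn(16, 16) + 16 * np.eye(16)
blocks = build_adaptive_block_jacobi(A)
z = apply_preconditioner(blocks, np.random.randn(16))
```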

