mixed precision Latest Research Papers

Field-programmable Gate Array (FPGA) is a high-performance computing platform for Convolution Neural Networks (CNNs) inference. Winograd algorithm, weight pruning, and quantization are widely adopted to reduce the storage and arithmetic overhead of CNNs on FPGAs. Recent studies strive to prune the weights in the Winograd domain, however, resulting in irregular sparse patterns and leading to low parallelism and reduced utilization of resources. Besides, there are few works to discuss a suitable quantization scheme for Winograd. In this article, we propose a regular sparse pruning pattern in the Winograd-based CNN, namely, Sub-row-balanced Sparsity (SRBS) pattern, to overcome the challenge of the irregular sparse pattern. Then, we develop a two-step hardware co-optimization approach to improve the model accuracy using the SRBS pattern. Based on the pruned model, we implement a mixed precision quantization to further reduce the computational complexity of bit operations. Finally, we design an FPGA accelerator that takes both the advantage of the SRBS pattern to eliminate low-parallelism computation and the irregular memory accesses, as well as the mixed precision quantization to get a layer-wise bit width. Experimental results on VGG16/VGG-nagadomi with CIFAR-10 and ResNet-18/34/50 with ImageNet show up to 11.8×/8.67× and 8.17×/8.31×/10.6× speedup, 12.74×/9.19× and 8.75×/8.81×/11.1× energy efficiency improvement, respectively, compared with the state-of-the-art dense Winograd accelerator [20] with negligible loss of model accuracy. We also show that our design has 4.11× speedup compared with the state-of-the-art sparse Winograd accelerator [19] on VGG16.

Download Full-text

Language Models for the Prediction of SARS-CoV-2 Inhibitors

10.1101/2021.12.10.471928 ◽

2021 ◽

Author(s):

Andrew E Blanchard ◽

John Gounley ◽

Debsindhu Bhowmik ◽

Mayanka Chandra Shekar ◽

Isaac Lyngaas ◽

...

Keyword(s):

Deep Learning ◽

Binding Affinity ◽

Language Model ◽

Specific Protein ◽

Language Models ◽

Peak Performance ◽

Protein Targets ◽

Training Time ◽

Mixed Precision ◽

Learning Language

The COVID-19 pandemic highlights the need for computational tools to automate and accelerate drug design for novel protein targets. We leverage deep learning language models to generate and score drug candidates based on predicted protein binding affinity. We pre-trained a deep learning language model (BERT) on ~9.6 billion molecules and achieved peak performance of 603 petaflops in mixed precision. Our work reduces pre-training time from days to hours, compared to previous efforts with this architecture, while also increasing the dataset size by nearly an order of magnitude. For scoring, we fine-tuned the language model using an assembled set of thousands of protein targets with binding affinity data and searched for inhibitors of specific protein targets, SARS-CoV-2 Mpro and PLpro. We utilized a genetic algorithm approach for finding optimal candidates using the generation and scoring capabilities of the language model. Our generalizable models accelerate the identification of inhibitors for emerging therapeutic targets.

Download Full-text

PhotoNs-GPU: A GPU accelerated cosmological simulation code

Research in Astronomy and Astrophysics ◽

10.1088/1674-4527/21/11/281 ◽

2021 ◽

Vol 21 (11) ◽

pp. 281

Author(s):

Qiao Wang ◽

Chen Meng

Keyword(s):

Special Functions ◽

Fast Multipole Method ◽

Kernel Functions ◽

Peak Performance ◽

Double Precision ◽

Simulation Code ◽

Small Noise ◽

Mixed Precision ◽

Speed Up ◽

Different Levels

Abstract We present a GPU-accelerated cosmological simulation code, PhotoNs-GPU, based on an algorithm of Particle Mesh Fast Multipole Method (PM-FMM), and focus on the GPU utilization and optimization. A proper interpolated method for truncated gravity is introduced to speed up the special functions in kernels. We verify the GPU code in mixed precision and different levels of theinterpolated method on GPU. A run with single precision is roughly two times faster than double precision for current practical cosmological simulations. But it could induce an unbiased small noise in power spectrum. Compared with the CPU version of PhotoNs and Gadget-2, the efficiency of the new code is significantly improved. Activated all the optimizations on the memory access, kernel functions and concurrency management, the peak performance of our test runs achieves 48% of the theoretical speed and the average performance approaches to ∼35% on GPU.

Download Full-text

Mixed precision s ‐step Lanczos and conjugate gradient algorithms

Numerical Linear Algebra with Applications ◽

10.1002/nla.2425 ◽

2021 ◽

Author(s):

Erin Carson ◽

Tomáš Gergelits ◽

Ichitaro Yamazaki

Keyword(s):

Conjugate Gradient ◽

Conjugate Gradient Algorithms ◽

Gradient Algorithms ◽

Mixed Precision

Download Full-text

An Energy-Efficient Deep Reinforcement Learning FPGA Accelerator for Online Fast Adaptation with Selective Mixed-precision Re-training

10.1109/a-sscc53895.2021.9634810 ◽

2021 ◽

Author(s):

Wooyoung Jo ◽

Juhyoung Lee ◽

Seunghyun Park ◽

Hoi-Jun Yoo

Keyword(s):

Reinforcement Learning ◽

Energy Efficient ◽

Mixed Precision ◽

Fast Adaptation

Download Full-text

Mixed Precision Level Based Sparse Grid Quadrature Filter for Nonlinear System

10.1109/rcae53607.2021.9638932 ◽

2021 ◽

Author(s):

Chen Qian ◽

Huikun Pan ◽

Yuan Qian ◽

Qingwei Chen

Keyword(s):

Nonlinear System ◽

Sparse Grid ◽

Mixed Precision ◽

Precision Level ◽

Sparse Grid Quadrature ◽

Quadrature Filter

Download Full-text

Iterative methods with mixed-precision preconditioning for ill-conditioned linear systems in multiphase CFD simulations

10.1109/scala54577.2021.00006 ◽

2021 ◽

Author(s):

T. Ina ◽

Y. Idomura ◽

T. Imamura ◽

S. Yamashita ◽

N. Onodera

Keyword(s):

Linear Systems ◽

Iterative Methods ◽

Cfd Simulations ◽

Mixed Precision ◽

Multiphase Cfd

Download Full-text

Regime Inference for Sound Floating-Point Optimizations

ACM Transactions on Embedded Computing Systems ◽

10.1145/3477012 ◽

2021 ◽

Vol 20 (5s) ◽

pp. 1-23

Author(s):

Robert Rabe ◽

Anastasiia Izycheva ◽

Eva Darulova

Keyword(s):

Floating Point ◽

Improve Performance ◽

Rounding Errors ◽

Proper Functioning ◽

Mixed Precision ◽

Input Domain ◽

Different Parts

Efficient numerical programs are required for proper functioning of many systems. Today’s tools offer a variety of optimizations to generate efficient floating-point implementations that are specific to a program’s input domain. However, sound optimizations are of an “all or nothing” fashion with respect to this input domain—if an optimizer cannot improve a program on the specified input domain, it will conclude that no optimization is possible. In general, though, different parts of the input domain exhibit different rounding errors and thus have different optimization potential. We present the first regime inference technique for sound optimizations that automatically infers an effective subdivision of a program’s input domain such that individual sub-domains can be optimized more aggressively. Our algorithm is general; we have instantiated it with mixed-precision tuning and rewriting optimizations to improve performance and accuracy, respectively. Our evaluation on a standard benchmark set shows that with our inferred regimes, we can, on average, improve performance by 65% and accuracy by 54% with respect to whole-domain optimizations.

Download Full-text

HW-FlowQ: A Multi-Abstraction Level HW-CNN Co-design Quantization Methodology

ACM Transactions on Embedded Computing Systems ◽

10.1145/3476997 ◽

2021 ◽

Vol 20 (5s) ◽

pp. 1-25

Author(s):

Nael Fasfous ◽

Manoj Rohit Vemparala ◽

Alexander Frickenstein ◽

Emanuele Valpreda ◽

Driton Salihu ◽

...

Keyword(s):

Design Space ◽

Solution Space ◽

Search Space ◽

Processing Unit ◽

Cost Models ◽

Levels Of Abstraction ◽

Nsga Ii ◽

Abstraction Level ◽

Model Compression ◽

Mixed Precision

Model compression through quantization is commonly applied to convolutional neural networks (CNNs) deployed on compute and memory-constrained embedded platforms. Different layers of the CNN can have varying degrees of numerical precision for both weights and activations, resulting in a large search space. Together with the hardware (HW) design space, the challenge of finding the globally optimal HW-CNN combination for a given application becomes daunting. To this end, we propose HW-FlowQ, a systematic approach that enables the co-design of the target hardware platform and the compressed CNN model through quantization. The search space is viewed at three levels of abstraction, allowing for an iterative approach for narrowing down the solution space before reaching a high-fidelity CNN hardware modeling tool, capable of capturing the effects of mixed-precision quantization strategies on different hardware architectures (processing unit counts, memory levels, cost models, dataflows) and two types of computation engines (bit-parallel vectorized, bit-serial). To combine both worlds, a multi-objective non-dominated sorting genetic algorithm (NSGA-II) is leveraged to establish a Pareto-optimal set of quantization strategies for the target HW-metrics at each abstraction level. HW-FlowQ detects optima in a discrete search space and maximizes the task-related accuracy of the underlying CNN while minimizing hardware-related costs. The Pareto-front approach keeps the design space open to a range of non-dominated solutions before refining the design to a more detailed level of abstraction. With equivalent prediction accuracy, we improve the energy and latency by 20% and 45% respectively for ResNet56 compared to existing mixed-precision search methods.

Download Full-text

Multiple-Object Detection and Segmentation Based on Deep Learning in High-Resolution Video Using Mask-RCNN

International Journal of Pattern Recognition and Artificial Intelligence ◽

10.1142/s0218001421500385 ◽

2021 ◽

Author(s):

Shaikh Shakil Abdul Rajjak ◽

A. K. Kureshi

Keyword(s):

Deep Learning ◽

High Resolution ◽

Object Detection ◽

Data Set ◽

Training Time ◽

Multiple Object ◽

Proposed Model ◽

Mixed Precision ◽

Multiple Object Detection ◽

Imaging Sensors

Imaging sensors with higher resolution and higher frame rates are becoming more popular for wide-area video surveillance (VS) and other applications as technology advances Using Mask-RCNN, we proposed Multiple-Object Detection and Segmentation in High-Resolution Video based on Deep Learning. The ResNet-50 ResNet-101 is used as the backbone in the proposed R-CNN Mask FPN model. The deep residual network’s design overcomes the problem of lower learning efficiency due to the network’s deepening. To reach the objective of the smallest overall error, the deep residual network divided the training series into one training block, minimizing the error of each block. It is roughly divided into five convolutional layer stages. The output scale is cut in half at each point. We used mixed precision FP16 and FP32 for training the model and achieved great speed in training time reduction in inference time for object. The COCO 2014 data set is used to train and validate the proposed model with mixed precision, leading to faster performance. The results of the experiments show that the proposed model can run at 30–48 frames per second with 85% accuracy.

Download Full-text

mixed precision
Recently Published Documents

TOTAL DOCUMENTS

H-INDEX

BISWSRBS: A Winograd-based CNN Accelerator with a Fine-grained Regular Sparsity Pattern and Mixed Precision Quantization

Language Models for the Prediction of SARS-CoV-2 Inhibitors

PhotoNs-GPU: A GPU accelerated cosmological simulation code

Mixed precision s ‐step Lanczos and conjugate gradient algorithms

An Energy-Efficient Deep Reinforcement Learning FPGA Accelerator for Online Fast Adaptation with Selective Mixed-precision Re-training

Mixed Precision Level Based Sparse Grid Quadrature Filter for Nonlinear System

Iterative methods with mixed-precision preconditioning for ill-conditioned linear systems in multiphase CFD simulations

Regime Inference for Sound Floating-Point Optimizations

HW-FlowQ: A Multi-Abstraction Level HW-CNN Co-design Quantization Methodology

Multiple-Object Detection and Segmentation Based on Deep Learning in High-Resolution Video Using Mask-RCNN

Export Citation Format

mixed precisionRecently Published Documents

TOTAL DOCUMENTS

H-INDEX

BISWSRBS: A Winograd-based CNN Accelerator with a Fine-grained Regular Sparsity Pattern and Mixed Precision Quantization

Language Models for the Prediction of SARS-CoV-2 Inhibitors

PhotoNs-GPU: A GPU accelerated cosmological simulation code

Mixed precision s ‐step Lanczos and conjugate gradient algorithms

An Energy-Efficient Deep Reinforcement Learning FPGA Accelerator for Online Fast Adaptation with Selective Mixed-precision Re-training

Mixed Precision Level Based Sparse Grid Quadrature Filter for Nonlinear System

Iterative methods with mixed-precision preconditioning for ill-conditioned linear systems in multiphase CFD simulations

Regime Inference for Sound Floating-Point Optimizations

HW-FlowQ: A Multi-Abstraction Level HW-CNN Co-design Quantization Methodology

Multiple-Object Detection and Segmentation Based on Deep Learning in High-Resolution Video Using Mask-RCNN

mixed precision
Recently Published Documents