Overflow Aware Quantization: Accelerating Neural Network Inference by Low-bit Multiply-Accumulate Operations

Author(s):  
Hongwei Xie, Yafei Song, Ling Cai, Mingyang Li

The inherent heavy computation of deep neural networks prevents their widespread application. A widely used method for accelerating model inference is quantization, which replaces the floating-point input operands of a network with fixed-point values. The majority of the computation cost then lies in integer multiply-accumulate operations. A high-bit accumulator wastes part of the available computation, while a low-bit one typically suffers from numerical overflow. To address this problem, we propose an overflow-aware quantization method based on a trainable, adaptive fixed-point representation that optimizes the number of bits for each input tensor while preventing numerical overflow during computation. With the proposed method, we can fully utilize the computing power to minimize the quantization loss and obtain optimized inference performance. To verify the effectiveness of our method, we conduct image classification, object detection, and semantic segmentation tasks on the ImageNet, Pascal VOC, and COCO datasets, respectively. Experimental results demonstrate that the proposed method achieves performance comparable to state-of-the-art quantization methods while accelerating inference by about 2 times.
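The overflow constraint behind this idea can be illustrated with a short numpy sketch. This is an assumed, simplified illustration, not the paper's trainable adaptive representation: the helpers (`quantize`, `overflow_safe`) and the chosen bit widths are hypothetical, and the only point is that once the operand bit widths and the reduction length K are fixed so that the worst-case accumulated magnitude fits in the accumulator, the integer multiply-accumulate can never overflow.

```python
# Illustrative sketch only (hypothetical helpers, not the paper's method):
# symmetric fixed-point quantization plus a check that a 16-bit accumulator
# cannot overflow for the chosen operand bit widths and reduction length K.
import numpy as np

def quantize(x, bits):
    """Symmetric per-tensor quantization of a float array to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1
    m = float(np.max(np.abs(x)))
    scale = m / qmax if m > 0 else 1.0
    return np.clip(np.round(x / scale), -qmax, qmax).astype(np.int64), scale

def overflow_safe(bits_a, bits_w, k, acc_bits=16):
    """True if accumulating k products of bits_a-bit and bits_w-bit operands fits in acc_bits."""
    worst = (2 ** (bits_a - 1) - 1) * (2 ** (bits_w - 1) - 1) * k
    return worst <= 2 ** (acc_bits - 1) - 1

K = 64
a = np.random.randn(K).astype(np.float32)   # activations
w = np.random.randn(K).astype(np.float32)   # weights

bits_a, bits_w = 5, 5                       # low-bit operands chosen per tensor
assert overflow_safe(bits_a, bits_w, K)     # 15 * 15 * 64 = 14400 <= 32767

qa, sa = quantize(a, bits_a)
qw, sw = quantize(w, bits_w)
acc = int(np.sum(qa * qw))                  # integer multiply-accumulate, no overflow possible
approx = acc * sa * sw                      # de-quantize the accumulated result
print(float(np.dot(a, w)), approx)          # close, up to quantization error
```

In this toy setting, 5-bit operands are safe in a 16-bit accumulator for any reduction length up to 145; per the abstract, the proposed method instead learns such per-tensor bit allocations rather than fixing them by hand.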

Author(s):  
Lucas Prado Osco, Keiller Nogueira, Ana Paula Marques Ramos, Mayara Maezano Faita Pinheiro, Danielle Elis Garcia Furuya, ...

Author(s):  
Ye Lin, Yanyang Li, Tengbo Liu, Tong Xiao, Tongran Liu, ...

8-bit integer inference, as a promising direction for reducing both the latency and storage of deep neural networks, has made great progress recently. However, previous systems still rely on 32-bit floating point for certain functions in complex models (e.g., Softmax in the Transformer) and make heavy use of quantization and de-quantization. In this work, we show that after a principled modification of the Transformer architecture, dubbed Integer Transformer, an (almost) fully 8-bit integer inference algorithm, Scale Propagation, can be derived. De-quantization is adopted only when necessary, which makes the network more efficient. Our experiments on the WMT16 En<->Ro, WMT14 En<->De and En->Fr translation tasks, as well as the WikiText-103 language modelling task, show that the fully 8-bit Transformer system achieves performance comparable to the floating-point baseline while requiring a nearly 4x smaller memory footprint.
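The idea that an integer result can carry its scale forward, rather than being de-quantized after every operation, can be sketched as follows. This is an assumed numpy illustration, not the paper's Scale Propagation algorithm or Integer Transformer code; the helpers (`quantize_int8`, `int8_matmul`) are hypothetical.

```python
# Illustrative sketch only (hypothetical helpers, not the paper's Scale Propagation):
# an int8 matrix multiply keeps its result in integers and carries a scale forward;
# de-quantization happens only where a float function (softmax) is unavoidable.
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization; returns the int8 tensor and its scale."""
    m = float(np.max(np.abs(x)))
    scale = m / 127.0 if m > 0 else 1.0
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale

def int8_matmul(qa, sa, qb, sb):
    """Integer matmul accumulated in int32; the output scale is propagated, not applied."""
    return qa.astype(np.int32) @ qb.astype(np.int32), sa * sb

def softmax(x, axis=-1):
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

# toy attention-style step: scores = Q @ K^T, followed by a float softmax
Q = np.random.randn(4, 8).astype(np.float32)
K = np.random.randn(4, 8).astype(np.float32)

qQ, sQ = quantize_int8(Q)
qK, sK = quantize_int8(K)
acc, scale = int8_matmul(qQ, sQ, qK.T, sK)            # stays in the integer domain
probs = softmax(acc.astype(np.float32) * scale)       # de-quantize only here
print(np.max(np.abs(probs - softmax(Q @ K.T))))       # small quantization error
```

Only the float-only step (softmax here) forces a de-quantization, which matches the abstract's point that de-quantization is adopted only when necessary.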


2021
Author(s):
Rodrigo Leite Prates, Wilfrido Gomez-Flores, Wagner Pereira
