Overflow Aware Quantization: Accelerating Neural Network Inference by Low-bit Multiply-Accumulate Operations

Author(s):  
Hongwei Xie, Yafei Song, Ling Cai, Mingyang Li

The inherent heavy computation of deep neural networks prevents their widespread application. A widely used method for accelerating model inference is quantization, which replaces the floating-point input operands of a network with fixed-point values. The majority of the computation cost then lies in integer multiply-accumulate operations. A high-bit accumulator wastes part of the available computation, while a low-bit one typically suffers from numerical overflow. To address this problem, we propose an overflow-aware quantization method based on a trainable, adaptive fixed-point representation that optimizes the number of bits for each input tensor while preventing numerical overflow during computation. With the proposed method, we can fully utilize the computing power to minimize the quantization loss and obtain optimized inference performance. To verify the effectiveness of our method, we conduct image classification, object detection, and semantic segmentation tasks on the ImageNet, Pascal VOC, and COCO datasets, respectively. Experimental results demonstrate that the proposed method achieves performance comparable to state-of-the-art quantization methods while accelerating inference by about 2 times.
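The overflow constraint behind this idea can be illustrated with a short numpy sketch. This is an assumed, simplified illustration, not the paper's trainable adaptive representation: the helpers (`quantize`, `overflow_safe`) and the chosen bit widths are hypothetical, and the only point is that once the operand bit widths and the reduction length K are fixed so that the worst-case accumulated magnitude fits in the accumulator, the integer multiply-accumulate can never overflow.

```python
# Illustrative sketch only (hypothetical helpers, not the paper's method):
# symmetric fixed-point quantization plus a check that a 16-bit accumulator
# cannot overflow for the chosen operand bit widths and reduction length K.
import numpy as np

def quantize(x, bits):
    """Symmetric per-tensor quantization of a float array to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1
    m = float(np.max(np.abs(x)))
    scale = m / qmax if m > 0 else 1.0
    return np.clip(np.round(x / scale), -qmax, qmax).astype(np.int64), scale

def overflow_safe(bits_a, bits_w, k, acc_bits=16):
    """True if accumulating k products of bits_a-bit and bits_w-bit operands fits in acc_bits."""
    worst = (2 ** (bits_a - 1) - 1) * (2 ** (bits_w - 1) - 1) * k
    return worst <= 2 ** (acc_bits - 1) - 1

K = 64
a = np.random.randn(K).astype(np.float32)   # activations
w = np.random.randn(K).astype(np.float32)   # weights

bits_a, bits_w = 5, 5                       # low-bit operands chosen per tensor
assert overflow_safe(bits_a, bits_w, K)     # 15 * 15 * 64 = 14400 <= 32767

qa, sa = quantize(a, bits_a)
qw, sw = quantize(w, bits_w)
acc = int(np.sum(qa * qw))                  # integer multiply-accumulate, no overflow possible
approx = acc * sa * sw                      # de-quantize the accumulated result
print(float(np.dot(a, w)), approx)          # close, up to quantization error
```

In this toy setting, 5-bit operands are safe in a 16-bit accumulator for any reduction length up to 145; per the abstract, the proposed method instead learns such per-tensor bit allocations rather than fixing them by hand.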

Author(s):  
Lucas Prado Osco, Keiller Nogueira, Ana Paula Marques Ramos, Mayara Maezano Faita Pinheiro, Danielle Elis Garcia Furuya, ...

Author(s):  
Ye Lin, Yanyang Li, Tengbo Liu, Tong Xiao, Tongran Liu, ...

8-bit integer inference, as a promising direction for reducing both the latency and storage of deep neural networks, has made great progress recently. However, previous systems still rely on 32-bit floating point for certain functions in complex models (e.g., Softmax in the Transformer) and make heavy use of quantization and de-quantization. In this work, we show that after a principled modification of the Transformer architecture, dubbed Integer Transformer, an (almost) fully 8-bit integer inference algorithm, Scale Propagation, can be derived. De-quantization is adopted only when necessary, which makes the network more efficient. Our experiments on the WMT16 En<->Ro, WMT14 En<->De and En->Fr translation tasks, as well as the WikiText-103 language modelling task, show that the fully 8-bit Transformer system achieves performance comparable to the floating-point baseline while requiring a nearly 4x smaller memory footprint.
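The idea that an integer result can carry its scale forward, rather than being de-quantized after every operation, can be sketched as follows. This is an assumed numpy illustration, not the paper's Scale Propagation algorithm or Integer Transformer code; the helpers (`quantize_int8`, `int8_matmul`) are hypothetical.

```python
# Illustrative sketch only (hypothetical helpers, not the paper's Scale Propagation):
# an int8 matrix multiply keeps its result in integers and carries a scale forward;
# de-quantization happens only where a float function (softmax) is unavoidable.
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor int8 quantization; returns the int8 tensor and its scale."""
    m = float(np.max(np.abs(x)))
    scale = m / 127.0 if m > 0 else 1.0
    return np.clip(np.round(x / scale), -127, 127).astype(np.int8), scale

def int8_matmul(qa, sa, qb, sb):
    """Integer matmul accumulated in int32; the output scale is propagated, not applied."""
    return qa.astype(np.int32) @ qb.astype(np.int32), sa * sb

def softmax(x, axis=-1):
    z = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(z)
    return e / np.sum(e, axis=axis, keepdims=True)

# toy attention-style step: scores = Q @ K^T, followed by a float softmax
Q = np.random.randn(4, 8).astype(np.float32)
K = np.random.randn(4, 8).astype(np.float32)

qQ, sQ = quantize_int8(Q)
qK, sK = quantize_int8(K)
acc, scale = int8_matmul(qQ, sQ, qK.T, sK)            # stays in the integer domain
probs = softmax(acc.astype(np.float32) * scale)       # de-quantize only here
print(np.max(np.abs(probs - softmax(Q @ K.T))))       # small quantization error
```

Only the float-only step (softmax here) forces a de-quantization, which matches the abstract's point that de-quantization is adopted only when necessary.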


2021
Author(s):
Rodrigo Leite Prates, Wilfrido Gomez-Flores, Wagner Pereira
