Model compression for on-device inference

Deep convolutional neural networks (DNNs) have demonstrated phenomenal success and are widely used in many computer vision tasks. However, their enormous model size and high computational complexity prohibit wide deployment on resource-limited embedded systems, such as FPGAs and mGPUs. The two most widely adopted model compression techniques, weight pruning and quantization, compress a DNN model by introducing weight sparsity (i.e., forcing some weights to zero) and by quantizing weights into limited bit-width values, respectively. Although prior works have attempted to combine weight pruning and quantization, we still observe disharmony between the two, especially when more aggressive compression schemes (e.g., structured pruning and low bit-width quantization) are used. In this work, taking the FPGA as the test computing platform and Processing Elements (PEs) as the basic parallel computing unit, we first propose a PE-wise structured pruning scheme, which introduces weight sparsification that takes the PE architecture into account. In addition, we integrate it with an optimized weight ternarization approach that quantizes weights into ternary values ({-1, 0, +1}), thus converting the dominant convolution operations in the DNN from multiplication-and-accumulation (MAC) to addition-only, and compressing the original model (from 32-bit floating point to 2-bit ternary representation) by at least 16 times. We then investigate and solve the coexistence issue between PE-wise structured pruning and ternarization by proposing a Weight Penalty Clipping (WPC) technique with a self-adapting threshold. Our experiments show that the fusion of our proposed techniques achieves a state-of-the-art ∼21× PE-wise structured compression rate with merely 1.74%/0.94% (top-1/top-5) accuracy degradation for ResNet-18 on the ImageNet dataset.
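To make the ternarization idea concrete, here is a minimal sketch in plain Python. It is not the optimized approach from the abstract: the threshold rule (`delta_scale` times the mean absolute weight) is an assumed, commonly used heuristic, and the function names and sample values are hypothetical. It shows how ternary weights turn a dot product into additions and subtractions, and why the storage ratio is 32 bits / 2 bits = 16.

```python
def ternarize(weights, delta_scale=0.7):
    """Map each weight to -1, 0, or +1 using a magnitude threshold.

    Assumption: threshold = delta_scale * mean(|w|). This is a simple
    heuristic, not the optimized method described in the abstract.
    """
    if not weights:
        return []
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    delta = delta_scale * mean_abs
    return [0 if abs(w) <= delta else (1 if w > 0 else -1) for w in weights]


def ternary_dot(ternary_weights, activations):
    """Dot product with ternary weights: additions and subtractions only."""
    acc = 0.0
    for t, a in zip(ternary_weights, activations):
        if t == 1:
            acc += a
        elif t == -1:
            acc -= a
        # t == 0: the weight is pruned, so the activation is skipped
    return acc


# Hypothetical weights and activations, for illustration only.
weights = [0.9, -0.05, -0.6, 0.02, 0.4, -1.1]
ternary = ternarize(weights)
result = ternary_dot(ternary, [1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Storage ratio: 32-bit floats replaced by 2-bit ternary codes.
compression_ratio = 32 / 2
```

Zeros produced by the threshold also act as unstructured sparsity. The PE-wise structured pruning in the abstract is a separate step that this sketch does not model.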

3D Model Compression using Connectivity-guided Adaptive Lifting Transform

2007 IEEE 15th Signal Processing and Communications Applications ◽ 10.1109/siu.2007.4298811 ◽ 2007 ◽ Cited By ~ 1

Author(s): Kivanc Kose ◽ A. Enis Cetin ◽ Ugur Gudukbay ◽ Levent Onural

Keyword(s): 3D Model ◽ Model Compression

A novel trigeminal neuropathic pain model: Compression of the trigeminal nerve root produces prolonged nociception in rats

Progress in Neuro-Psychopharmacology and Biological Psychiatry ◽ 10.1016/j.pnpbp.2012.03.002 ◽ 2012 ◽ Vol 38 (2) ◽ pp. 149-158 ◽ Cited By ~ 13

Author(s): Hye J. Jeon ◽ Seung R. Han ◽ Min K. Park ◽ Kui Y. Yang ◽ Yong C. Bae ◽ ...

Keyword(s): Neuropathic Pain ◽ Trigeminal Nerve ◽ Nerve Root ◽ Pain Model ◽ Trigeminal Neuropathic Pain ◽ Model Compression ◽ Trigeminal Nerve Root ◽ Neuropathic Pain Model

Combine-Net: An Improved Filter Pruning Algorithm

Information ◽ 10.3390/info12070264 ◽ 2021 ◽ Vol 12 (7) ◽ pp. 264

Author(s): Jinghan Wang ◽ Guangyue Li ◽ Wenzhao Zhang

Keyword(s): Structured Model ◽ Compression Algorithms ◽ Pruning Algorithm ◽ Model Compression ◽ The Neural Network ◽ Empirical Determination ◽ Knowledge Distillation ◽ Resource Constrained Devices ◽ Constrained Devices

The powerful performance of deep learning is evident to all. As research has deepened, neural networks have become more complex and are not easily deployed on resource-constrained devices. The emergence of a series of model compression algorithms makes artificial intelligence on the edge possible. Among them, structured model pruning is widely used because of its versatility. Structured pruning prunes the neural network itself, discarding relatively unimportant structures to reduce the model's size. However, previous pruning work still suffers from problems such as errors in network evaluation, empirical determination of the pruning rate, and low retraining efficiency. Therefore, we propose an accurate, objective, and efficient pruning algorithm, Combine-Net, which introduces Adaptive BN to eliminate evaluation errors, the Kneedle algorithm to determine the pruning rate objectively, and knowledge distillation to improve the efficiency of retraining. Results show that, without precision loss, Combine-Net achieves 95% parameter compression and 83% computation compression for VGG16 on CIFAR10, and 71% parameter compression and 41% computation compression for ResNet50 on CIFAR100. Experiments on different datasets and models show that Combine-Net can efficiently compress a neural network's parameters and computation.
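The abstract does not say how Combine-Net applies Kneedle to choose a pruning rate, so the following is only a simplified sketch of the core idea behind knee detection on a concave, increasing curve. It normalizes both axes and picks the point farthest above the diagonal. The full Kneedle algorithm also smooths the curve and applies a sensitivity threshold, both omitted here. The function name and the sample curve are hypothetical.

```python
def find_knee(xs, ys):
    """Return the x value where a concave, increasing curve bends most.

    Simplified knee detection: normalize x and y to [0, 1], then pick the
    point with the largest gap above the diagonal y = x. This is not the
    full Kneedle algorithm (no smoothing, no sensitivity parameter).
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    if x_max == x_min or y_max == y_min:
        raise ValueError("curve must span a non-zero range on both axes")

    best_x, best_gap = xs[0], float("-inf")
    for x, y in zip(xs, ys):
        x_norm = (x - x_min) / (x_max - x_min)
        y_norm = (y - y_min) / (y_max - y_min)
        gap = y_norm - x_norm
        if gap > best_gap:
            best_x, best_gap = x, gap
    return best_x


# Hypothetical curve with diminishing returns, for illustration only.
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8]
ys = [0.0, 4.0, 6.5, 8.0, 8.8, 9.3, 9.6, 9.8, 10.0]
knee = find_knee(xs, ys)
```

Picking the knee replaces a hand-tuned cutoff with a rule that depends only on the shape of the curve, which is presumably what the abstract means by determining the pruning rate objectively.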

Data-Free Ensemble Knowledge Distillation for Privacy-conscious Multimedia Model Compression

10.1145/3474085.3475329 ◽ 2021

Author(s): Zhiwei Hao ◽ Yong Luo ◽ Han Hu ◽ Jianping An ◽ Yonggang Wen

Keyword(s): Multimedia Model ◽ Model Compression ◽ Knowledge Distillation

Revisiting knowledge distillation for light-weight visual object detection

Transactions of the Institute of Measurement and Control ◽ 10.1177/01423312211022877 ◽ 2021 ◽ Vol 43 (13) ◽ pp. 2888-2898

Author(s): Tianze Gao ◽ Yunfeng Gao ◽ Yu Li ◽ Peiyuan Qin

Keyword(s): Object Detection ◽ Essential Element ◽ Detection Algorithm ◽ Positive Sample ◽ Detection Methods ◽ Visual Object ◽ Light Weight ◽ Model Compression ◽ Novel Approach ◽ Knowledge Distillation

An essential element for intelligent perception in mechatronic and robotic systems (M&RS) is the visual object detection algorithm. With the continuing advance of artificial neural networks (ANNs), researchers have proposed numerous ANN-based visual object detection methods that have proven effective. However, networks with cumbersome structures do not suit the real-time scenarios in M&RS, which makes model compression techniques necessary. In this paper, a novel approach to training light-weight visual object detection networks is developed by revisiting knowledge distillation. Traditional knowledge distillation methods are oriented towards image classification and are not compatible with object detection. Therefore, a variant of knowledge distillation is developed and adapted to a state-of-the-art keypoint-based visual detection method. Two strategies, named positive sample retaining and early distribution softening, are employed to yield a natural adaptation. The mutual consistency between the teacher model and the student model is further promoted through hint-based distillation. Extensive controlled experiments show that the proposed method enhances the light-weight network's performance by a large margin.
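For context, the classification-oriented distillation that the abstract calls "traditional" can be sketched as follows. This is the classic soft-target loss, not the detection variant proposed in the paper: teacher and student logits are softened with a temperature, and the student is penalized by the KL divergence between the two distributions. The function names, default temperature, and sample logits are assumptions for illustration.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over logits divided by a temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=4.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor is the usual convention for keeping gradient magnitudes
    comparable across temperatures; it is assumed here, not taken from
    the abstract.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl


# Hypothetical logits for a 3-class problem.
teacher = [6.0, 2.0, -1.0]
student = [3.0, 2.5, 0.0]
loss_mismatch = distillation_loss(student, teacher)
loss_match = distillation_loss(teacher, teacher)
```

A higher temperature flattens both distributions, so the student also learns from the teacher's relative scores on non-target classes.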

Pruning by leveraging training dynamics

AI Communications ◽ 10.3233/aic-210127 ◽ 2021 ◽ pp. 1-21

Author(s): Andrei C. Apostol ◽ Maarten C. Stol ◽ Patrick Forré

Keyword(s): State Of The Art ◽ Object Classification ◽ Compression Technique ◽ Pruning Method ◽ Model Compression ◽ Art Performance

We propose a novel pruning method that determines a weight's saliency from the oscillations around 0, i.e. sign flips, that the weight has undergone during training. Our method can perform pruning before the network has converged, requires little tuning effort because its hyperparameters have good default values, and can directly target the level of sparsity desired by the user. Our experiments, performed on a variety of object classification architectures, show that it is competitive with existing methods and achieves state-of-the-art performance at sparsity levels of 99.6% and above for 2 of the 3 architectures tested. Moreover, we demonstrate that our method is compatible with quantization, another model compression technique. For reproducibility, we release our code at https://github.com/AndreiXYZ/flipout.
