weight pruning
Recently Published Documents

TOTAL DOCUMENTS: 55 (FIVE YEARS: 44)
H-INDEX: 5 (FIVE YEARS: 3)

Processes ◽ 2021 ◽ Vol 10 (1) ◽ pp. 44
Author(s): Yuan Liu ◽ Takahiro Kawaguchi ◽ Song Xu ◽ Seiji Hashimoto

Recurrent Neural Networks (RNNs) have been widely applied in various fields. In real-world applications, however, most devices such as mobile phones have limited storage capacity for processing real-time information, so an over-parameterized model slows the system down and is unsuitable for deployment. In our proposed temperature control system, an RNN-based control model processes real-time temperature signals. When system resources are limited, the trained model must be compressed, with an acceptable loss of control performance, before it can be implemented on the actual controller. Inspired by layer-wise neuron pruning, in this paper we apply a nonlinear reconstruction error (NRE) guided layer-wise weight pruning method to the RNN-based temperature control system. The control system is built in MATLAB/Simulink. To compress the model and save memory on the temperature controller, we first validate the proposed reference-model (ref-model) guided RNN model for real-time online data processing on an actual temperature object; the corresponding experiments are implemented on a digital signal processor. On this basis, we then verify the NRE guided layer-wise weight pruning method on the well-trained temperature control model. Compared with a classical pruning method, experimental results indicate that the control model pruned with NRE guided layer-wise weight pruning effectively maintains high accuracy at the targeted network sparsity.
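As an illustration (not the authors' exact procedure), the following PyTorch-style sketch prunes one fully connected layer by a simple magnitude-based saliency and then measures the nonlinear reconstruction error (NRE) of its output on a calibration batch. The layer sizes, the tanh nonlinearity, the saliency proxy, and the calibration data are assumptions made for the example.

import torch
import torch.nn as nn

def prune_layer_by_reconstruction(layer: nn.Linear, calib_x: torch.Tensor, sparsity: float):
    with torch.no_grad():
        # Reference output of the unpruned layer through its nonlinearity (tanh assumed here).
        ref_out = torch.tanh(layer(calib_x))
        w = layer.weight.data
        # Simple saliency proxy: |weight| scaled by the mean |input| feeding that weight.
        saliency = w.abs() * calib_x.abs().mean(dim=0)
        k = int(sparsity * w.numel())
        drop = torch.topk(saliency.flatten(), k, largest=False).indices
        mask = torch.ones(w.numel(), device=w.device)
        mask[drop] = 0.0
        layer.weight.data = w * mask.view_as(w)
        # Nonlinear reconstruction error (NRE) introduced by pruning this layer.
        nre = (torch.tanh(layer(calib_x)) - ref_out).pow(2).mean()
    return nre

layer = nn.Linear(64, 32)
calib_x = torch.randn(256, 64)   # illustrative calibration batch
print(prune_layer_by_reconstruction(layer, calib_x, sparsity=0.8))

In a layer-wise scheme, each layer would be pruned in turn while monitoring the NRE it introduces, so that layers whose outputs are hard to reconstruct keep more of their weights.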


2021 ◽ Author(s): Tianyun Zhang ◽ Xiaolong Ma ◽ Zheng Zhan ◽ Shanglin Zhou ◽ Caiwen Ding ◽ ...

2021 ◽ Vol 20 (5s) ◽ pp. 1-25
Author(s): Chanyoung Oh ◽ Junhyuk So ◽ Sumin Kim ◽ Youngmin Yi

Over the past several years, the need for on-device deep learning has been rapidly increasing, and efficient CNN inference on mobile platforms has been actively researched. Sparsity exploitation has been one of the most active research themes, but most studies focus on weight sparsity obtained by weight pruning. Activation sparsity, by contrast, requires compression at runtime for every input tensor. Hence, research on activation sparsity mainly targets NPUs that can process it efficiently with their own hardware logic. In this paper, we observe that it is difficult to accelerate CNN inference on mobile GPUs with natural activation sparsity and that the widely used CSR-based sparse convolution is not sufficiently effective due to the compression overhead. We propose several novel sparsification methods that boost activation sparsity without harming accuracy. In particular, we selectively sparsify some layers to an extremely high sparsity and adopt sparse or dense convolution depending on the layer. Further, we present an efficient sparse convolution method that requires no compression and demonstrate that it can be faster than the CSR implementation. With ResNet-50, we achieved a 1.88× speedup over TFLite on a Mali-G76 GPU.
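A minimal sketch of the two ideas described above, boosting activation sparsity and then choosing a kernel per layer based on the measured sparsity; the threshold values and the 90% switch point are illustrative assumptions, not the paper's settings.

import torch
import torch.nn as nn

class ThresholdReLU(nn.Module):
    # ReLU variant that also zeroes small positive activations to boost sparsity.
    def __init__(self, tau: float = 0.1):
        super().__init__()
        self.tau = tau
    def forward(self, x):
        return torch.where(x > self.tau, x, torch.zeros_like(x))

def choose_kernel(act: torch.Tensor, sparse_threshold: float = 0.9) -> str:
    # Pick sparse or dense convolution for the next layer from the measured activation sparsity.
    sparsity = (act == 0).float().mean().item()
    return "sparse" if sparsity >= sparse_threshold else "dense"

act = ThresholdReLU(tau=0.2)(torch.randn(1, 64, 56, 56))
print(choose_kernel(act))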


2021 ◽ Author(s): George Retsinas ◽ Athena Elafrou ◽ Georgios Goumas ◽ Petros Maragos

2021 ◽ Author(s): Yael Ben-Guigui ◽ Jacob Goldberger ◽ Tammy Riklin-Raviv

Author(s): Yijue Wang ◽ Chenghong Wang ◽ Zigeng Wang ◽ Shanglin Zhou ◽ Hang Liu ◽ ...

The large model size, high computational cost, and vulnerability to membership inference attacks (MIA) have impeded the adoption of deep neural networks (DNNs), especially on mobile devices. To address these challenges, we envision that the weight pruning technique can help defend DNNs against MIA while reducing model storage and computation. In this work, we propose a pruning algorithm and show that it can find a subnetwork that prevents privacy leakage from MIA while achieving accuracy competitive with the original DNN. We also verify our theoretical insights with experiments. Our experimental results illustrate that the attack accuracy against the compressed model is up to 13.6% and 10% lower than that of the baseline and the Min-Max game, respectively.
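The abstract does not spell out the pruning algorithm, so the sketch below uses plain global magnitude pruning as a stand-in for obtaining a sparse subnetwork; in the paper's setting, the attack accuracy of a membership inference attack would then be measured on this pruned model.

import torch
import torch.nn as nn

def global_magnitude_prune(model: nn.Module, sparsity: float):
    # Zero the globally smallest-magnitude weights, leaving a sparse subnetwork.
    weights = [p for p in model.parameters() if p.dim() > 1]
    all_w = torch.cat([w.detach().abs().flatten() for w in weights])
    k = max(1, int(sparsity * all_w.numel()))
    threshold = torch.kthvalue(all_w, k).values
    with torch.no_grad():
        for w in weights:
            w.mul_((w.abs() > threshold).float())

model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
global_magnitude_prune(model, sparsity=0.9)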


Author(s): Xuan Shen ◽ Geng Yuan ◽ Wei Niu ◽ Xiaolong Ma ◽ Jiexiong Guan ◽ ...

The rapid development of autonomous driving, abnormal behavior detection, and behavior recognition is creating increasing demand for applications based on multi-person pose estimation, especially on mobile platforms. However, to achieve high accuracy, state-of-the-art methods tend to have large model sizes and complex post-processing algorithms, which incur intensive computation and long end-to-end latency. To solve this problem, we propose an architecture optimization and weight pruning framework that accelerates inference of multi-person pose estimation on mobile devices. With our optimization framework, we achieve up to 2.51X faster model inference with higher accuracy than representative lightweight multi-person pose estimators.
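The abstract does not detail the pruning scheme; as an illustrative stand-in, the sketch below performs structured (channel) pruning, the form of weight pruning that translates most directly into latency savings on mobile runtimes. The following layer's input channels would also need adjusting, which is omitted here; the keep ratio and layer shapes are placeholders.

import torch
import torch.nn as nn

def prune_conv_channels(conv: nn.Conv2d, keep_ratio: float) -> nn.Conv2d:
    # Keep only the output channels with the largest L1 weight norms, so the layer
    # itself shrinks and the saved FLOPs show up as real latency reduction.
    n_keep = max(1, int(keep_ratio * conv.out_channels))
    norms = conv.weight.detach().abs().sum(dim=(1, 2, 3))
    keep = torch.topk(norms, n_keep).indices.sort().values
    new_conv = nn.Conv2d(conv.in_channels, n_keep, conv.kernel_size,
                         stride=conv.stride, padding=conv.padding,
                         bias=conv.bias is not None)
    with torch.no_grad():
        new_conv.weight.copy_(conv.weight[keep])
        if conv.bias is not None:
            new_conv.bias.copy_(conv.bias[keep])
    return new_conv

slim = prune_conv_channels(nn.Conv2d(64, 128, 3, padding=1), keep_ratio=0.5)
print(slim)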


Author(s): Changsheng Zhao ◽ Ting Hua ◽ Yilin Shen ◽ Qian Lou ◽ Hongxia Jin

Pre-trained language models such as BERT have shown remarkable effectiveness in various natural language processing tasks. However, these models usually contain millions of parameters, which prevents their practical deployment on resource-constrained devices. Knowledge distillation, weight pruning, and quantization are the main directions in model compression. However, compact models obtained through knowledge distillation may suffer from a significant accuracy drop even at a relatively small compression ratio. On the other hand, only a few quantization approaches are designed for natural language processing tasks, and they usually require manual setting of hyper-parameters. In this paper, we propose an automatic mixed-precision quantization framework for BERT that conducts quantization and pruning simultaneously. Specifically, our method leverages Differentiable Neural Architecture Search to automatically assign scale and precision to the parameters in each sub-group, and at the same time prunes out redundant groups of parameters. Extensive evaluations on BERT downstream tasks show that our method outperforms the baselines, matching their performance with a much smaller model size. We also show that an extremely lightweight model can be obtained by combining our solution with orthogonal methods such as DistilBERT.
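A hedged sketch of the core idea, mixed-precision selection by differentiable architecture search: each parameter group mixes candidate bit-widths (including 0 bits, i.e., the group is pruned) with softmax-weighted architecture logits. The candidate bit-widths, the uniform fake-quantizer, and the lack of a straight-through estimator for the rounding step are simplifications, not the paper's exact formulation.

import torch
import torch.nn as nn
import torch.nn.functional as F

def fake_quantize(w: torch.Tensor, bits: int) -> torch.Tensor:
    # Uniform fake-quantization of a weight tensor to the given bit-width.
    if bits == 0:                       # 0 bits == the whole group is pruned
        return torch.zeros_like(w)
    scale = w.abs().max() / (2 ** (bits - 1) - 1 + 1e-8)
    return torch.round(w / scale).clamp(-(2 ** (bits - 1)), 2 ** (bits - 1) - 1) * scale

class MixedPrecisionGroup(nn.Module):
    # A parameter group whose bit-width is chosen by differentiable architecture search:
    # the forward pass is a softmax-weighted mix of the candidate quantized versions.
    def __init__(self, weight: torch.Tensor, candidates=(0, 2, 4, 8)):
        super().__init__()
        self.weight = nn.Parameter(weight)
        self.candidates = candidates
        self.alpha = nn.Parameter(torch.zeros(len(candidates)))  # architecture logits
    def forward(self):
        probs = F.softmax(self.alpha, dim=0)
        return sum(p * fake_quantize(self.weight, b) for p, b in zip(probs, self.candidates))

group = MixedPrecisionGroup(torch.randn(64, 64))
mixed_w = group()   # used in place of the original weight during the search phase
print(mixed_w.shape)

After the search converges, each group would be fixed to its highest-probability candidate; groups that settle on 0 bits are removed entirely, which is how quantization and pruning happen in one pass.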


Author(s): Deniz Gurevin ◽ Mikhail Bragin ◽ Caiwen Ding ◽ Shanglin Zhou ◽ Lynn Pepin ◽ ...

Network pruning is a widely used technique to reduce the computation cost and model size of deep neural networks. However, the typical three-stage pipeline, i.e., training, pruning, and retraining (fine-tuning), significantly increases the overall training time. In this paper, we develop a systematic weight-pruning optimization approach based on Surrogate Lagrangian Relaxation (SLR), which is tailored to overcome the difficulties caused by the discrete nature of the weight-pruning problem while ensuring fast convergence. We further accelerate the convergence of SLR by using quadratic penalties. Model parameters obtained by SLR during the training phase are much closer to their optimal values than those obtained by other state-of-the-art methods. We evaluate the proposed method on image classification using CIFAR-10 and ImageNet, object detection using COCO 2014, and lane detection with Ultra-Fast-Lane-Detection on the TuSimple dataset. Experimental results demonstrate that our SLR-based weight-pruning optimization approach achieves a higher compression rate than state-of-the-art methods under the same accuracy requirement. It also achieves high model accuracy even at the hard-pruning stage without retraining, reducing the traditional three-stage pruning pipeline to two stages. Given a limited budget of retraining epochs, our approach quickly recovers the model accuracy.
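A rough sketch of pruning posed as constrained optimization with Lagrange multipliers and a quadratic penalty (an ADMM-style simplification, not the exact SLR update rules from the paper); the learning rate, penalty coefficient rho, toy loss, and tensor shapes are placeholders.

import torch

def project_to_sparse(w: torch.Tensor, sparsity: float) -> torch.Tensor:
    # Euclidean projection onto the set of tensors with the target sparsity:
    # keep the largest-magnitude entries, zero the rest.
    k = int(sparsity * w.numel())
    if k == 0:
        return w.clone()
    thresh = torch.kthvalue(w.abs().flatten(), k).values
    return w * (w.abs() > thresh).float()

def penalized_pruning_step(w, z, lam, loss_fn, lr=1e-2, rho=1e-3, sparsity=0.9):
    # One iteration: minimize loss(w) + <lam, w - z> + (rho/2)||w - z||^2 over w,
    # then re-project onto the sparse set and update the multipliers.
    loss = loss_fn(w) + (lam * (w - z)).sum() + 0.5 * rho * (w - z).pow(2).sum()
    grad = torch.autograd.grad(loss, w)[0]
    with torch.no_grad():
        w = (w - lr * grad).requires_grad_(True)
        z = project_to_sparse(w.detach(), sparsity)   # auxiliary sparse copy
        lam = lam + rho * (w.detach() - z)            # multiplier update
    return w, z, lam

w = torch.randn(32, 32, requires_grad=True)
z, lam = project_to_sparse(w.detach(), 0.9), torch.zeros_like(w)
loss_fn = lambda p: (p ** 2).mean()                   # stand-in for the network loss
for _ in range(10):
    w, z, lam = penalized_pruning_step(w, z, lam, loss_fn)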


2021 ◽ Vol 17 (4) ◽ pp. 1-16
Author(s): Qing Yang ◽ Jiachen Mao ◽ Zuoguan Wang ◽ Hai “Helen” Li

When deploying deep neural networks in embedded systems, it is crucial to decrease the model size and computational complexity to improve execution speed and efficiency. In addition to conventional compression techniques, e.g., weight pruning and quantization, removing unimportant activations can also dramatically reduce the amount of data communication and the computation cost. Unlike weight parameters, the pattern of activations is directly related to the input data and therefore changes dynamically. To regulate this dynamic activation sparsity (DAS), in this work we propose a generic low-cost approach based on a winners-take-all (WTA) dropout technique. The network enhanced by the proposed WTA dropout, namely DASNet, features structured activation sparsity with an improved sparsity level. Compared to static feature-map pruning methods, DASNets provide better computation cost reduction. The WTA dropout technique can easily be applied to deep neural networks without incurring additional training variables. More importantly, DASNet can be seamlessly integrated with other compression techniques, such as weight pruning and quantization, without compromising accuracy. Our experiments on various networks and datasets show significant runtime speedups with negligible accuracy losses.
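A minimal sketch of a winners-take-all activation dropout: for each input sample, only the top-k feature-map channels survive, yielding structured, dynamically determined activation sparsity. The channel-level ranking by mean absolute activation and the keep ratio are illustrative choices, not necessarily the paper's exact criterion.

import torch
import torch.nn as nn

class WTADropout(nn.Module):
    # Winners-take-all dropout: per sample, keep only the top-k channels
    # (ranked by mean absolute activation) and zero the rest.
    def __init__(self, keep_ratio: float = 0.5):
        super().__init__()
        self.keep_ratio = keep_ratio
    def forward(self, x):                              # x: (N, C, H, W)
        n, c = x.shape[:2]
        k = max(1, int(self.keep_ratio * c))
        scores = x.abs().mean(dim=(2, 3))              # per-channel importance, (N, C)
        topk = torch.topk(scores, k, dim=1).indices
        mask = torch.zeros(n, c, device=x.device).scatter_(1, topk, 1.0)
        return x * mask[:, :, None, None]

x = torch.relu(torch.randn(8, 64, 28, 28))
y = WTADropout(keep_ratio=0.25)(x)
print((y == 0).float().mean())                         # measured dynamic sparsity

Because whole channels are zeroed, the resulting sparsity is structured, which is what lets downstream kernels skip computation without per-element bookkeeping.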

