Optimized Compression for Implementing Convolutional Neural Networks on FPGA

Min Zhang; Linpeng Li; Hai Wang; Yan Liu; Hongbo Qin; Wei Zhao

doi:10.3390/electronics8030295

Optimized Compression for Implementing Convolutional Neural Networks on FPGA

Electronics ◽

10.3390/electronics8030295 ◽

2019 ◽

Vol 8 (3) ◽

pp. 295 ◽

Cited By ~ 15

Author(s):

Min Zhang ◽

Linpeng Li ◽

Hai Wang ◽

Yan Liu ◽

Hongbo Qin ◽

...

Keyword(s):

Neural Network ◽

Neural Networks ◽

Short Term Memory ◽

Graphics Processing Unit ◽

Processing Unit ◽

Central Processing ◽

Pruning Strategy ◽

Field Programmable ◽

Storage Format ◽

Evaluation Board

Field programmable gate array (FPGA) is widely considered as a promising platform for convolutional neural network (CNN) acceleration. However, the large numbers of parameters of CNNs cause heavy computing and memory burdens for FPGA-based CNN implementation. To solve this problem, this paper proposes an optimized compression strategy, and realizes an accelerator based on FPGA for CNNs. Firstly, a reversed-pruning strategy is proposed which reduces the number of parameters of AlexNet by a factor of 13× without accuracy loss on the ImageNet dataset. Peak-pruning is further introduced to achieve better compressibility. Moreover, quantization gives another 4× with negligible loss of accuracy. Secondly, an efficient storage technique, which aims for the reduction of the whole overhead cache of the convolutional layer and the fully connected layer, is presented respectively. Finally, the effectiveness of the proposed strategy is verified by an accelerator implemented on a Xilinx ZCU104 evaluation board. By improving existing pruning techniques and the storage format of sparse data, we significantly reduce the size of AlexNet by 28×, from 243 MB to 8.7 MB. In addition, the overall performance of our accelerator achieves 9.73 fps for the compressed AlexNet. Compared with the central processing unit (CPU) and graphics processing unit (GPU) platforms, our implementation achieves 182.3× and 1.1× improvements in latency and throughput, respectively, on the convolutional (CONV) layers of AlexNet, with an 822.0× and 15.8× improvement for energy efficiency, separately. This novel compression strategy provides a reference for other neural network applications, including CNNs, long short-term memory (LSTM), and recurrent neural networks (RNNs).

Recent Results on the Implementation of a Burst Error and Burst Erasure Channel Emulator Using an FPGA Architecture

Journal of Communications Software and Systems ◽

10.24138/jcomss.v16i1.766 ◽

2020 ◽

Vol 16 (1) ◽

pp. 19-29

Author(s):

Caterina Travan ◽

Francesca Vatta ◽

Fulvio Babich

Keyword(s):

Graphics Processing Unit ◽

General Purpose ◽

Processing Unit ◽

Transmission Channel ◽

Central Processing ◽

Fpga Architecture ◽

Burst Error ◽

Field Programmable ◽

Erasure Channel ◽

Burst Erasure

The behaviour of a transmission channel may be simulated using the performance abilities of current generation multiprocessing hardware, namely, a multicore Central Processing Unit (CPU), a general purpose Graphics Processing Unit (GPU), or a Field Programmable Gate Array (FPGA). These were investigated by Cullinan et al. in a recent paper (published in 2012) where these three devices capabilities were compared to determine which device would be best suited towards which specific task. In particular, it was shown that, for the application which is objective of our work (i.e., for a transmission channel simulation), the FPGA is 26.67 times faster than the GPU and 10.76 times faster than the CPU. Motivated by these results, in this paper we propose and present a direct hardware emulation. In particular, a Cyclone II FPGA architecture is implemented to simulate a burst error channel behaviour, in which errors are clustered together, and a burst erasure channel behaviour, in which the erasures are clustered together. The results presented in the paper are valid for any FPGA architecture that may be considered for this scope.

SWIRL: High-performance many-core CPU code generation for deep neural networks

The International Journal of High Performance Computing Applications ◽

10.1177/1094342019866247 ◽

2019 ◽

Vol 33 (6) ◽

pp. 1275-1289 ◽

Cited By ~ 3

Author(s):

Anand Venkat ◽

Tharindu Rusira ◽

Raj Barik ◽

Mary Hall ◽

Leonard Truong

Keyword(s):

Neural Networks ◽

Language Processing ◽

Code Generation ◽

High Performance ◽

Deep Neural Networks ◽

Graphics Processing Unit ◽

Processing Unit ◽

Data Movement ◽

Central Processing ◽

The Status

Deep neural networks (DNNs) have demonstrated effectiveness in many domains including object recognition, speech recognition, natural language processing, and health care. Typically, the computations involved in DNN training and inferencing are time consuming and require efficient implementations. Existing frameworks such as TensorFlow, Theano, Torch, Cognitive Tool Kit (CNTK), and Caffe enable Graphics Processing Unit (GPUs) as the status quo devices for DNN execution, leaving Central Processing Unit (CPUs) behind. Moreover, existing frameworks forgo or limit cross layer optimization opportunities that have the potential to improve performance by significantly reducing data movement through the memory hierarchy. In this article, we describe an alternative approach called SWIRL, a compiler that provides high-performance CPU implementations for DNNs. SWIRL is built on top of the existing domain-specific language (DSL) for DNNs called LATTE. SWIRL separates DNN specification and its schedule using predefined transformation recipes for tensors and layers commonly found in DNN layers. These recipes synergize with DSL constructs to generate high-quality fused, vectorized, and parallelized code for CPUs. On an Intel Xeon Platinum 8180M CPU, SWIRL achieves performance comparable with Tensorflow integrated with MKL-DNN; on average 1.00× of Tensorflow inference and 0.99× of Tensorflow training. It also outperforms the original LATTE compiler on average by 1.22× and 1.30× on inference and training, respectively.

SatEC: A 5G Satellite Edge Computing Framework Based on Microservice Architecture

Sensors ◽

10.3390/s19040831 ◽

2019 ◽

Vol 19 (4) ◽

pp. 831 ◽

Cited By ~ 12

Author(s):

Lei Yan ◽

Suzhi Cao ◽

Yongsheng Gong ◽

Hao Han ◽

Junyong Wei ◽

...

Keyword(s):

Graphics Processing Unit ◽

Edge Computing ◽

Satellite Network ◽

Processing Unit ◽

Central Processing ◽

5G Network ◽

Field Programmable ◽

High Bandwidth ◽

Basic Services ◽

Computing Framework

As outlined in the 3Gpp Release 16, 5G satellite access is important for 5G network development in the future. A terrestrial-satellite network integrated with 5G has the characteristics of low delay, high bandwidth, and ubiquitous coverage. A few researchers have proposed integrated schemes for such a network; however, these schemes do not consider the possibility of achieving optimization of the delay characteristic by changing the computing mode of the 5G satellite network. We propose a 5G satellite edge computing framework (5GsatEC), which aims to reduce delay and expand network coverage. This framework consists of embedded hardware platforms and edge computing microservices in satellites. To increase the flexibility of the framework in complex scenarios, we unify the resource management of the central processing unit (CPU), graphics processing unit (GPU), and field-programmable gate array (FPGA); we divide the services into three types: system services, basic services, and user services. In order to verify the performance of the framework, we carried out a series of experiments. The results show that 5GsatEC has a broader coverage than the ground 5G network. The results also show that 5GsatEC has lower delay, a lower packet loss rate, and lower bandwidth consumption than the 5G satellite network.

A TensorFlow Extension Framework for Optimized Generation of Hardware CNN Inference Engines

Technologies ◽

10.3390/technologies8010006 ◽

2020 ◽

Vol 8 (1) ◽

pp. 6 ◽

Cited By ~ 1

Author(s):

Vasileios Leon ◽

Spyridon Mouselinos ◽

Konstantina Koliogeorgi ◽

Sotirios Xydis ◽

Dimitrios Soudris ◽

...

Keyword(s):

Integrated Circuit ◽

High Performance ◽

Graphics Processing Unit ◽

Inference Engine ◽

Efficient Solutions ◽

Processing Unit ◽

Central Processing ◽

Inference Engines ◽

Field Programmable ◽

Hardware Description

The workloads of Convolutional Neural Networks (CNNs) exhibit a streaming nature that makes them attractive for reconfigurable architectures such as the Field-Programmable Gate Arrays (FPGAs), while their increased need for low-power and speed has established Application-Specific Integrated Circuit (ASIC)-based accelerators as alternative efficient solutions. During the last five years, the development of Hardware Description Language (HDL)-based CNN accelerators, either for FPGA or ASIC, has seen huge academic interest due to their high-performance and room for optimizations. Towards this direction, we propose a library-based framework, which extends TensorFlow, the well-established machine learning framework, and automatically generates high-throughput CNN inference engines for FPGAs and ASICs. The framework allows software developers to exploit the benefits of FPGA/ASIC acceleration without requiring any expertise on HDL development and low-level design. Moreover, it provides a set of optimization knobs concerning the model architecture and the inference engine generation, allowing the developer to tune the accelerator according to the requirements of the respective use case. Our framework is evaluated by optimizing the LeNet CNN model on the MNIST dataset, and implementing FPGA- and ASIC-based accelerators using the generated inference engine. The optimal FPGA-based accelerator on Zynq-7000 delivers 93% less memory footprint and 54% less Look-Up Table (LUT) utilization, and up to 10× speedup on the inference execution vs. different Graphics Processing Unit (GPU) and Central Processing Unit (CPU) implementations of the same model, in exchange for a negligible accuracy loss, i.e., 0.89%. For the same accuracy drop, the 45 nm standard-cell-based ASIC accelerator provides an implementation which operates at 520 MHz and occupies an area of 0.059 mm 2 , while the power consumption is ∼7.5 mW.

Artificial Intelligence & Machine Learning in Computer Vision Applications

Embedded Selforganising Systems ◽

10.14464/ess71432 ◽

2020 ◽

Vol 7 (1) ◽

pp. 2-3

Author(s):

Shadi Saleh

Keyword(s):

Artificial Intelligence ◽

Machine Learning ◽

Neural Networks ◽

Computer Vision ◽

Deep Learning ◽

Large Scale ◽

Short Term Memory ◽

Graphics Processing Unit ◽

Multimedia Data ◽

Processing Unit

Deep learning and machine learning innovations are at the core of the ongoing revolution in Artificial Intelligence for the interpretation and analysis of multimedia data. The convergence of large-scale datasets and more affordable Graphics Processing Unit (GPU) hardware has enabled the development of neural networks for data analysis problems that were previously handled by traditional handcrafted features. Several deep learning architectures such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), Long Short Term Memory (LSTM)/Gated Recurrent Unit (GRU), Deep Believe Networks (DBN), and Deep Stacking Networks (DSNs) have been used with new open source software and libraries options to shape an entirely new scenario in computer vision processing.

Design of Power-Efficient Training Accelerator for Convolution Neural Networks

Electronics ◽

10.3390/electronics10070787 ◽

2021 ◽

Vol 10 (7) ◽

pp. 787

Author(s):

JiUn Hong ◽

Saad Arslan ◽

TaeGeon Lee ◽

HyungWon Kim

Keyword(s):

Neural Network ◽

Neural Networks ◽

Energy Consumption ◽

Low Power ◽

Resource Sharing ◽

Graphics Processing Unit ◽

Processing Unit ◽

Real Time Processing ◽

Power Efficient ◽

Low Area

To realize deep learning techniques, a type of deep neural network (DNN) called a convolutional neural networks (CNN) is among the most widely used models aimed at image recognition applications. However, there is growing demand for light-weight and low-power neural network accelerators, not only for inference but also for training process. In this paper, we propose a training accelerator that provides low power and compact chip size targeted for mobile and edge computing applications. It accelerates to achieve the real-time processing of both inference and training using concurrent floating-point data paths. The proposed accelerator can be externally controlled and employs resource sharing and an integrated convolution-pooling block to achieve low area and low energy consumption. We implemented the proposed training accelerator in an FPGA (Field Programmable Gate Array) and evaluated its training performance using an MNIST CNN example in comparison with a PC with GPU (Graphics Processing Unit). While both methods achieved a similar training accuracy of 95.1%, the proposed accelerator, when implemented in a silicon chip, reduced the energy consumption by 480 times compared to the counterpart. Additionally, when implemented on an FPGA, an energy reduction of over 4.5 times was achieved compared to the existing FPGA training accelerator for the MNIST dataset. Therefore, the proposed accelerator is more suitable for deployment in mobile/edge nodes compared to the existing software and hardware accelerators.

Real-time Detection of Aortic Valve in Echocardiography using Convolutional Neural Networks

Current Medical Imaging Formerly Current Medical Imaging Reviews ◽

10.2174/1573405615666190114151255 ◽

2020 ◽

Vol 16 (5) ◽

pp. 584-591 ◽

Cited By ~ 1

Author(s):

Muhammad Hanif Ahmad Nizar ◽

Chow Khuen Chan ◽

Azira Khalil ◽

Ahmad Khairuddin Mohamed Yusof ◽

Khin Wee Lai

Keyword(s):

Neural Network ◽

Neural Networks ◽

Heart Disease ◽

Aortic Valve ◽

Real Time ◽

Convolutional Neural Networks ◽

Valvular Heart Disease ◽

Detection System ◽

Processing Unit ◽

Real Time Detection

Background: Valvular heart disease is a serious disease leading to mortality and increasing medical care cost. The aortic valve is the most common valve affected by this disease. Doctors rely on echocardiogram for diagnosing and evaluating valvular heart disease. However, the images from echocardiogram are poor in comparison to Computerized Tomography and Magnetic Resonance Imaging scan. This study proposes the development of Convolutional Neural Networks (CNN) that can function optimally during a live echocardiographic examination for detection of the aortic valve. An automated detection system in an echocardiogram will improve the accuracy of medical diagnosis and can provide further medical analysis from the resulting detection. Methods: Two detection architectures, Single Shot Multibox Detector (SSD) and Faster Regional based Convolutional Neural Network (R-CNN) with various feature extractors were trained on echocardiography images from 33 patients. Thereafter, the models were tested on 10 echocardiography videos. Results: Faster R-CNN Inception v2 had shown the highest accuracy (98.6%) followed closely by SSD Mobilenet v2. In terms of speed, SSD Mobilenet v2 resulted in a loss of 46.81% in framesper- second (fps) during real-time detection but managed to perform better than the other neural network models. Additionally, SSD Mobilenet v2 used the least amount of Graphic Processing Unit (GPU) but the Central Processing Unit (CPU) usage was relatively similar throughout all models. Conclusion: Our findings provide a foundation for implementing a convolutional detection system to echocardiography for medical purposes.

Classification of Skin Disease Using Deep Learning Neural Networks with MobileNet V2 and LSTM

Sensors ◽

10.3390/s21082852 ◽

2021 ◽

Vol 21 (8) ◽

pp. 2852

Author(s):

Parvathaneni Naga Srinivasu ◽

Jalluri Gnana SivaSai ◽

Muhammad Fazal Ijaz ◽

Akash Kumar Bhoi ◽

Wonjoon Kim ◽

...

Keyword(s):

Neural Network ◽

Neural Networks ◽

Deep Learning ◽

Convolutional Neural Network ◽

Skin Disease ◽

Network Architecture ◽

Large Scale ◽

Short Term Memory ◽

Convolutional Networks ◽

Occurrence Matrix

Deep learning models are efficient in learning the features that assist in understanding complex patterns precisely. This study proposed a computerized process of classifying skin disease through deep learning based MobileNet V2 and Long Short Term Memory (LSTM). The MobileNet V2 model proved to be efficient with a better accuracy that can work on lightweight computational devices. The proposed model is efficient in maintaining stateful information for precise predictions. A grey-level co-occurrence matrix is used for assessing the progress of diseased growth. The performance has been compared against other state-of-the-art models such as Fine-Tuned Neural Networks (FTNN), Convolutional Neural Network (CNN), Very Deep Convolutional Networks for Large-Scale Image Recognition developed by Visual Geometry Group (VGG), and convolutional neural network architecture that expanded with few changes. The HAM10000 dataset is used and the proposed method has outperformed other methods with more than 85% accuracy. Its robustness in recognizing the affected region much faster with almost 2× lesser computations than the conventional MobileNet model results in minimal computational efforts. Furthermore, a mobile application is designed for instant and proper action. It helps the patient and dermatologists identify the type of disease from the affected region’s image at the initial stage of the skin disease. These findings suggest that the proposed system can help general practitioners efficiently and effectively diagnose skin conditions, thereby reducing further complications and morbidity.

Numerical simulation of flattened heat pipe with double heat sources for CPU and GPU cooling application in laptop computers

Journal of Computational Design and Engineering ◽

10.1093/jcde/qwaa091 ◽

2020 ◽

Author(s):

Wisoot Sanhan ◽

Kambiz Vafai ◽

Niti Kammuang-Lue ◽

Pradit Terdtoon ◽

Phrut Sakulchangsatjatai

Keyword(s):

Experimental Data ◽

Heat Pipe ◽

Graphics Processing Unit ◽

Processing Unit ◽

Heat Sources ◽

Final Thickness ◽

Laptop Computers ◽

Central Processing ◽

Graphics Processing ◽

Good Agreement

Abstract An investigation of the effect of the thermal performance of the flattened heat pipe on its double heat sources acting as central processing unit and graphics processing unit in laptop computers is presented in this work. A finite element method is used for predicting the flattening effect of the heat pipe. The cylindrical heat pipe with a diameter of 6 mm and the total length of 200 mm is flattened into three final thicknesses of 2, 3, and 4 mm. The heat pipe is placed under a horizontal configuration and heated with heater 1 and heater 2, 40 W in combination. The numerical model shows good agreement compared with the experimental data with the standard deviation of 1.85%. The results also show that flattening the cylindrical heat pipe to 66.7 and 41.7% of its original diameter could reduce its normalized thermal resistance by 5.2%. The optimized final thickness or the best design final thickness for the heat pipe is found to be 2.5 mm.

A Parallel-Computing Approach for Vector Road-Network Matching Using GPU Architecture

ISPRS International Journal of Geo-Information ◽

10.3390/ijgi7120472 ◽

2018 ◽

Vol 7 (12) ◽

pp. 472 ◽

Cited By ~ 1

Author(s):

Bo Wan ◽

Lin Yang ◽

Shunping Zhou ◽

Run Wang ◽

Dezhi Wang ◽

...

Keyword(s):

Road Network ◽

Large Scale ◽

Graphics Processing Unit ◽

Road Networks ◽

Processing Unit ◽

Data Partition ◽

Matching Method ◽

The Road ◽

Central Processing ◽

Relaxation Matching

The road-network matching method is an effective tool for map integration, fusion, and update. Due to the complexity of road networks in the real world, matching methods often contain a series of complicated processes to identify homonymous roads and deal with their intricate relationship. However, traditional road-network matching algorithms, which are mainly central processing unit (CPU)-based approaches, may have performance bottleneck problems when facing big data. We developed a particle-swarm optimization (PSO)-based parallel road-network matching method on graphics-processing unit (GPU). Based on the characteristics of the two main stages (similarity computation and matching-relationship identification), data-partition and task-partition strategies were utilized, respectively, to fully use GPU threads. Experiments were conducted on datasets with 14 different scales. Results indicate that the parallel PSO-based matching algorithm (PSOM) could correctly identify most matching relationships with an average accuracy of 84.44%, which was at the same level as the accuracy of a benchmark—the probability-relaxation-matching (PRM) method. The PSOM approach significantly reduced the road-network matching time in dealing with large amounts of data in comparison with the PRM method. This paper provides a common parallel algorithm framework for road-network matching algorithms and contributes to integration and update of large-scale road-networks.