scholarly journals Hardware Implementation and Quantization of Tiny-Yolo-v2 using Open CL

The trend of increasingly model size in Deep Neural Network (DNN) algorithms boost the performance of visual recognition tasks. These gains in performance have come at a cost of increase in computational complexity and memory bandwidth. Recent studies have explored the fixed-point implementation of DNN algorithms such as AlexNet and VGG on Field Programmable Gate Array (FPGA) to facilitate the potential of deployment on embedded system. However, there are still lacking research on DNN object detection algorithms on FPGA. Consequently, we propose the implementation of Tiny-Yolo-v2 on Cyclone V PCIe FPGA board using the High-Level Synthesis Tool: Intel FPGA Software Development Kit (SDK) for OpenCL. In this work, a systematic approach is proposed to convert the floating point Tiny-Yolo-v2 algorithms into 8-bit fixed-point. Our experiments show that the 8-bit fixed-point Tiny-Yolo-v2 have significantly reduce the hardware consumption with only 0.3% loss in accuracy. Finally, our implementation achieves peak performance of 31.34 Giga Operation per Second (GOPS) and comparable performance density of 0.28GOPs/DSP to prior works under 120MHz working frequency.

Electronics ◽  
2021 ◽  
Vol 10 (22) ◽  
pp. 2859
Author(s):  
Mannhee Cho ◽  
Youngmin Kim

Convolutional neural networks (CNNs) are widely used in modern applications for their versatility and high classification accuracy. Field-programmable gate arrays (FPGAs) are considered to be suitable platforms for CNNs based on their high performance, rapid development, and reconfigurability. Although many studies have proposed methods for implementing high-performance CNN accelerators on FPGAs using optimized data types and algorithm transformations, accelerators can be optimized further by investigating more efficient uses of FPGA resources. In this paper, we propose an FPGA-based CNN accelerator using multiple approximate accumulation units based on a fixed-point data type. We implemented the LeNet-5 CNN architecture, which performs classification of handwritten digits using the MNIST handwritten digit dataset. The proposed accelerator was implemented, using a high-level synthesis tool on a Xilinx FPGA. The proposed accelerator applies an optimized fixed-point data type and loop parallelization to improve performance. Approximate operation units are implemented using FPGA logic resources instead of high-precision digital signal processing (DSP) blocks, which are inefficient for low-precision data. Our accelerator model achieves 66% less memory usage and approximately 50% reduced network latency, compared to a floating point design and its resource utilization is optimized to use 78% fewer DSP blocks, compared to general fixed-point designs.


2020 ◽  
Vol 63 (8) ◽  
pp. 1216-1230 ◽  
Author(s):  
Wei Guo ◽  
Sujuan Qin ◽  
Jun Lu ◽  
Fei Gao ◽  
Zhengping Jin ◽  
...  

Abstract For a high level of data availability and reliability, a common strategy for cloud service providers is to rely on replication, i.e. storing several replicas onto different servers. To provide cloud users with a strong guarantee that all replicas required by them are actually stored, many multi-replica integrity auditing schemes were proposed. However, most existing solutions are not resource economical since users need to create and upload replicas of their files by themselves. A multi-replica solution called Mirror is presented to overcome the problems, but we find that it is vulnerable to storage saving attack, by which a dishonest provider can considerably save storage costs compared to the costs of storing all the replicas honestly—while still can pass any challenge successfully. In addition, we also find that Mirror is easily subject to substitution attack and forgery attack, which pose new security risks for cloud users. To address the problems, we propose some simple yet effective countermeasures and an improved proofs of retrievability and replication scheme, which can resist the aforesaid attacks and maintain the advantages of Mirror, such as economical bandwidth and efficient verification. Experimental results show that our scheme exhibits comparable performance with Mirror while achieving high security.


Electronics ◽  
2021 ◽  
Vol 10 (4) ◽  
pp. 469
Author(s):  
Hyun Woo Oh ◽  
Ji Kwang Kim ◽  
Gwan Beom Hwang ◽  
Seung Eun Lee

Recently, advances in technology have enabled embedded systems to be adopted for a variety of applications. Some of these applications require real-time 2D graphics processing running on limited design specifications such as low power consumption and a small area. In order to satisfy such conditions, including a specific 2D graphics accelerator in the embedded system is an effective method. This method reduces the workload of the processor in the embedded system by exploiting the accelerator. The accelerator assists the system to perform 2D graphics processing in real-time. Therefore, a variety of applications that require 2D graphics processing can be implemented with an embedded processor. In this paper, we present a 2D graphics accelerator for tiny embedded systems. The accelerator includes an optimized line-drawing operation based on Bresenham’s algorithm. The optimized operation enables the accelerator to deal with various kinds of 2D graphics processing and to perform the line-drawing instead of the system processor. Moreover, the accelerator also distributes the workload of the processor core by removing the need for the core to access the frame buffer memory. We measure the performance of the accelerator by implementing the processor, including the accelerator, on a field-programmable gate array (FPGA), and ascertaining the possibility of realization by synthesizing using the 180 nm CMOS process.


Nanophotonics ◽  
2020 ◽  
Vol 9 (10) ◽  
pp. 3271-3278 ◽  
Author(s):  
Qian Ma ◽  
Qiao Ru Hong ◽  
Xin Xin Gao ◽  
Hong Bo Jing ◽  
Che Liu ◽  
...  

AbstractFor the intelligence of metamaterials, the -sensing mechanism and programmable reaction units are two important components for self-recognition and -determination. However, their realization still face great challenges. Here, we propose a smart sensing metasurface to achieve self-defined functions in the framework of digital coding metamaterials. A sensing unit that can simultaneously process the sensing channel and realize phase-programmable capability is designed by integrating radio frequency (RF) power detector and PIN diodes. Four sensing units distributed on the metasurface aperture can detect the microwave incidences in the x- and y-polarizations, while the other elements can modulate the reflected phase patterns under the control of a field programmable gate array (FPGA). To validate the performance, three schemes containing six coding patterns are presented and simulated, after which two of them are measured, showing good agreements with designs. We envision that this work may motivate studies on smart metamaterials with high-level recognition and manipulation.


2020 ◽  
Vol 34 (04) ◽  
pp. 6623-6630
Author(s):  
Li Yang ◽  
Zhezhi He ◽  
Deliang Fan

Deep convolutional neural network (DNN) has demonstrated phenomenal success and been widely used in many computer vision tasks. However, its enormous model size and high computing complexity prohibits its wide deployment into resource limited embedded system, such as FPGA and mGPU. As the two most widely adopted model compression techniques, weight pruning and quantization compress DNN model through introducing weight sparsity (i.e., forcing partial weights as zeros) and quantizing weights into limited bit-width values, respectively. Although there are works attempting to combine the weight pruning and quantization, we still observe disharmony between weight pruning and quantization, especially when more aggressive compression schemes (e.g., Structured pruning and low bit-width quantization) are used. In this work, taking FPGA as the test computing platform and Processing Elements (PE) as the basic parallel computing unit, we first propose a PE-wise structured pruning scheme, which introduces weight sparsification with considering of the architecture of PE. In addition, we integrate it with an optimized weight ternarization approach which quantizes weights into ternary values ({-1,0,+1}), thus converting the dominant convolution operations in DNN from multiplication-and-accumulation (MAC) to addition-only, as well as compressing the original model (from 32-bit floating point to 2-bit ternary representation) by at least 16 times. Then, we investigate and solve the coexistence issue between PE-wise Structured pruning and ternarization, through proposing a Weight Penalty Clipping (WPC) technique with self-adapting threshold. Our experiment shows that the fusion of our proposed techniques can achieve the best state-of-the-art ∼21× PE-wise structured compression rate with merely 1.74%/0.94% (top-1/top-5) accuracy degradation of ResNet-18 on ImageNet dataset.


2010 ◽  
Vol E93-C (3) ◽  
pp. 361-368
Author(s):  
Benjamin CARRION SCHAFER ◽  
Yusuke IGUCHI ◽  
Wataru TAKAHASHI ◽  
Shingo NAGATANI ◽  
Kazutoshi WAKABAYASHI

2016 ◽  
Vol 78 (7-5) ◽  
Author(s):  
Muhammad Amin Hashim ◽  
Yuan Wen Hau ◽  
Rabia Baktheri

This paper studies two different Electrocardiography (ECG) preprocessing algorithms, namely Pan and Tompkins (PT) and Derivative Based (DB) algorithm, which is crucial of QRS complex detection in cardiovascular disease detection. Both algorithms are compared in terms of QRS detection accuracy and computation timing performance, with implementation on System-on-Chip (SoC) based embedded system that prototype on Altera DE2-115 Field Programmable Gate Array (FPGA) platform as embedded software. Both algorithms are tested with 30 minutes ECG data from each of 48 different patient records obtain from MIT-BIH arrhythmia database. Results show that PT algorithm achieve 98.15% accuracy with 56.33 seconds computation while DB algorithm achieve 96.74% with only 22.14 seconds processing time. Based on the study, an optimized PT algorithm with improvement on Moving Windows Integrator (MWI) has been proposed to accelerate its computation. Result shows that the proposed optimized Moving Windows Integrator algorithm achieves 9.5 times speed up than original MWI while retaining its QRS detection accuracy. 


2016 ◽  
Vol 25 (04) ◽  
pp. 1650029 ◽  
Author(s):  
Adam Ziebinski ◽  
Stanwlaw Swierc

Currently embedded system designs aim to improve areas such as speed, energy efficiency and the cost of an application. Application-specific instruction set extensions on reconfigurable hardware provide such opportunities. The article presents a new approach for generating soft core processors that are optimized for specific tasks. In this work, we describe an automatic method for selecting custom instructions for generating software core processors that are based on the machine code of the application program. As the result, a soft core processor will contain the logic that is absolutely necessary. This solution requires fewer gates to be synthesized in the field programmable gate arrays (FPGA) and has a potential to increase the speed of the information processing that is performed by the system in the target FPGA. Experiments have confirmed the correct operation of the method that was used. After the reduction mechanism was enabled, the total number of slices blocks that were occupied decreased to 47% of its initial value in the best case for the Xilinx Spartan3 (xc3s200) and the maximum frequency increased approximately 44% in the best case for Xilinx Spartan6 (xc6slx4).


Sign in / Sign up

Export Citation Format

Share Document