A Hybrid Vision Processing Unit with a Pipelined Workflow for Convolutional Neural Network Accelerating and Image Signal Processing

Peng Liu; Yan Song

doi:10.3390/electronics10232989

Electronics ◽

10.3390/electronics10232989 ◽

2021 ◽

Vol 10 (23) ◽

pp. 2989

Author(s):

Peng Liu ◽

Yan Song

Keyword(s):

Signal Processing ◽

Data Transmission ◽

High Speed ◽

High Efficiency ◽

State Of The Art ◽

Vision System ◽

Processing Unit ◽

Processing Elements ◽

Vision Processing ◽

Field Programmable

Vision processing chips are widely used in image processing and recognition tasks. They are conventionally designed around image signal processing (ISP) units directly connected to the sensors. In recent years, convolutional neural networks (CNNs) have become the dominant tools for many state-of-the-art vision processing tasks. However, a conventional vision processing unit (VPU) cannot process CNNs at high speed. Conversely, CNN processing units cannot process the RAW images from the sensors directly, so an ISP unit is still required. This makes a vision system inefficient, with a large amount of data transmission and redundant hardware resources. Additionally, many CNN processing units offer little flexibility across different CNN operations. To solve these problems, this paper proposes an efficient vision processing unit based on a hybrid array of processing elements for both CNN acceleration and ISP. Resources are highly shared in this VPU, and a pipelined workflow is introduced to accelerate vision tasks. The proposed VPU is implemented on a Field-Programmable Gate Array (FPGA) platform, and various vision tasks are tested on it. The results show that this VPU achieves high efficiency for both CNN processing and ISP, and significantly reduces energy consumption for vision tasks that combine CNNs and ISP. For various CNN tasks, it maintains an average multiply-accumulator utilization of over 94% and achieves a performance of 163.2 GOPS at a frequency of 200 MHz.
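
The throughput figures quoted in this abstract can be put in per-cycle terms with a little arithmetic. The sketch below only divides the reported 163.2 GOPS by the reported 200 MHz clock. The function names are hypothetical, and the convention that one multiply-accumulate counts as two operations is an assumption that the abstract does not confirm.

```python
def ops_per_cycle(gops: float, freq_mhz: float) -> float:
    """Operations completed per clock cycle for a given throughput and clock."""
    return (gops * 1e9) / (freq_mhz * 1e6)


def implied_macs_per_cycle(gops: float, freq_mhz: float, ops_per_mac: int = 2) -> float:
    """MACs per cycle, ASSUMING each MAC is counted as `ops_per_mac` operations."""
    return ops_per_cycle(gops, freq_mhz) / ops_per_mac


if __name__ == "__main__":
    # Figures taken from the abstract: 163.2 GOPS at 200 MHz.
    print(f"{ops_per_cycle(163.2, 200.0):.1f} operations per cycle")
    print(f"{implied_macs_per_cycle(163.2, 200.0):.1f} MACs per cycle (2 ops per MAC assumed)")
```

Whether 163.2 GOPS is a peak or a sustained figure is not stated here, so the result should be read as a rough scale rather than as the size of the processing array.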

High speed multi-channel data acquisition technique for efficient hardware utilization using quad data rate approach

International Journal of Engineering & Technology ◽

10.14419/ijet.v7i4.16061 ◽

2018 ◽

Vol 7 (4) ◽

pp. 2569

Author(s):

Priyanka Chauhan ◽

Dippal Israni ◽

Karan Jasani ◽

Ashwin Makwana

Keyword(s):

Power Consumption ◽

Data Acquisition ◽

Resource Utilization ◽

Field Programmable Gate Array ◽

High Speed ◽

State Of The Art ◽

Data Rate ◽

Field Programmable ◽

Sensor Signals ◽

Acquisition Technique

Data acquisition is a demanding application that involves acquiring and monitoring various sensor signals. The data received are processed in a real-time environment. This paper proposes a novel Data Acquisition (DAQ) technique for better resource utilization with lower power consumption. The work designs an advanced Quad Data Rate (QDR) technique and compares it with the traditional Dual Data Rate (DDR) technique in terms of resource utilization and power consumption on Field Programmable Gate Array (FPGA) hardware. Xilinx ISE is used to verify the FPGA resource utilization of QDR against the state-of-the-art DDR approach. The paper concludes that the QDR technique outperforms the traditional DDR technique in terms of FPGA resource utilization.

BPR-TCAM—Block and Partial Reconfiguration based TCAM on Xilinx FPGAs

Electronics ◽

10.3390/electronics9020353 ◽

2020 ◽

Vol 9 (2) ◽

pp. 353 ◽

Cited By ~ 1

Author(s):

Anees Ullah ◽

Ali Zahir ◽

Noaman A. Khan ◽

Waleed Ahmad ◽

Alexis Ramos ◽

...

Keyword(s):

Resource Utilization ◽

High Speed ◽

State Of The Art ◽

Field Programmable Gate Arrays ◽

Partial Reconfiguration ◽

Gate Arrays ◽

Content Addressable Memories ◽

Field Programmable ◽

Programmable Gate Arrays

Field Programmable Gate Array (FPGA) based Ternary Content Addressable Memories (TCAMs) are widely used in high-speed networking applications. However, TCAMs are not present on state-of-the-art FPGAs and need to be emulated on SRAM-based memories (i.e., LUTRAMs and Block RAMs), which requires a large amount of FPGA resources. In this paper, we present an efficient methodology to implement FPGA-based TCAMs with significant resource savings compared to existing schemes. The proposed methodology exploits the fracturable nature of Look Up Tables (LUTs) and the built-in slice carry-chains to map two rules and their matching logic simultaneously to a single FPGA slice. Multiple slices can be stacked together to build deeper and wider TCAMs in a modular way. The combination of all these techniques results in significant savings in resource utilization compared to existing approaches.
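
For readers unfamiliar with ternary matching, the following is a minimal software model of what a TCAM lookup computes. It is not the LUT and carry-chain mapping described in the abstract. The `(value, care_mask)` rule encoding, the first-match priority order, and the function name are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# A rule is (value, care_mask): a mask bit of 1 means "this bit must match",
# a mask bit of 0 means "don't care" (the ternary X state).
Rule = Tuple[int, int]


def tcam_lookup(rules: List[Rule], key: int) -> Optional[int]:
    """Return the index of the first rule that matches `key`, or None.

    Hardware compares all rules in parallel and a priority encoder picks
    the winner; this loop models the same result sequentially.
    """
    for index, (value, care_mask) in enumerate(rules):
        if ((key ^ value) & care_mask) == 0:
            return index
    return None


example_rules: List[Rule] = [
    (0b1010, 0b1111),  # matches exactly 1010
    (0b1000, 0b1100),  # matches 10xx
    (0b0000, 0b0000),  # matches anything (all bits don't care)
]

if __name__ == "__main__":
    for key in (0b1010, 0b1011, 0b0111):
        print(f"{key:04b} -> rule {tcam_lookup(example_rules, key)}")
```

Each rule needs both a stored value and a per-bit mask, which is the reason emulating a TCAM in ordinary SRAM costs more than one memory bit per rule bit.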

ICE-Based Custom Full-Mesh Network for the CHIME High Bandwidth Radio Astronomy Correlator

Journal of Astronomical Instrumentation ◽

10.1142/s225117171641004x ◽

2016 ◽

Vol 05 (04) ◽

pp. 1641004 ◽

Cited By ~ 6

Author(s):

K. Bandura ◽

J. F. Cliche ◽

M. A. Dobbs ◽

A. J. Gilbert ◽

D. Ittah ◽

...

Keyword(s):

High Speed ◽

Graphics Processing Unit ◽

Mesh Network ◽

Processing Unit ◽

Intensity Mapping ◽

Large Bandwidth ◽

Data Links ◽

Data Rates ◽

Combining Data ◽

Field Programmable

New generation radio interferometers encode signals from thousands of antenna feeds across large bandwidth. Channelizing and correlating this data requires networking capabilities that can handle unprecedented data rates with reasonable cost. The Canadian Hydrogen Intensity Mapping Experiment (CHIME) correlator processes 8-bits from [Formula: see text] digitizer inputs across 400 MHz of bandwidth. Measured in [Formula: see text] bandwidth, it is the largest radio correlator that is currently commissioning. Its digital back-end must exchange and reorganize the 6.6 terabit/s produced by its 128 digitizing and channelizing nodes, and feed it to the 256 graphics processing unit (GPU) node spatial correlator in a way that each node obtains data from all digitizer inputs but across a small fraction of the bandwidth (i.e. ‘corner-turn’). In order to maximize performance and reliability of the corner-turn system while minimizing cost, a custom networking solution has been implemented. The system makes use of Field Programmable Gate Array (FPGA) transceivers to implement direct, passive copper, full-mesh, high speed serial connections between sixteen circuit boards in a crate, to exchange data between crates, and to offload the data to a cluster of 256 GPU nodes using standard 10 Gbit/s Ethernet links. The GPU nodes complete the corner-turn by combining data from all crates and then computing visibilities. Eye diagrams and frame error counters confirm error-free operation of the corner-turn network in both the currently operating CHIME Pathfinder telescope (a prototype for the full CHIME telescope) and a representative fraction of the full CHIME hardware providing an end-to-end system validation. An analysis of an equivalent corner-turn system built with Ethernet switches instead of custom passive data links is provided.
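
The corner-turn described in this abstract is, in data-layout terms, a redistribution from "each source node holds all frequency channels for some inputs" to "each destination node holds all inputs for some channels". The sketch below shows that reshuffle on a toy array. The function name, the list-of-lists layout, and the equal-sized contiguous frequency bands are assumptions, and the code says nothing about how the real mesh network schedules the transfers.

```python
from typing import List, TypeVar

T = TypeVar("T")


def corner_turn(node_data: List[List[T]], num_output_nodes: int) -> List[List[List[T]]]:
    """Redistribute data so each output node sees every input, but only one band.

    node_data[i][f] is the sample from source node i at frequency channel f.
    The result[g][i] is the list of channels in band g that came from source i.
    """
    num_channels = len(node_data[0])
    if any(len(row) != num_channels for row in node_data):
        raise ValueError("every source node must provide the same number of channels")
    if num_channels % num_output_nodes != 0:
        raise ValueError("channels must divide evenly among output nodes")
    band = num_channels // num_output_nodes
    return [
        [row[g * band:(g + 1) * band] for row in node_data]
        for g in range(num_output_nodes)
    ]


if __name__ == "__main__":
    # Three toy source nodes, four channels each, labelled for readability.
    sources = [[f"in{i}_ch{f}" for f in range(4)] for i in range(3)]
    for g, block in enumerate(corner_turn(sources, num_output_nodes=2)):
        print(f"output node {g}: {block}")
```

Every source node contributes a slice to every destination node, which is the all-to-all traffic pattern that motivates a full-mesh interconnect.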

Application of FPGAs to High-Speed Condition Based Maintenance of Rolling Element Bearings

Volume 3 ◽

10.1115/esda2004-58372 ◽

2004 ◽

Author(s):

Mark Harriman ◽

Farbod Zorriassatine ◽

Rob Parkin ◽

Mike Jackson ◽

Jo Coy

Keyword(s):

Signal Processing ◽

Condition Monitoring ◽

Execution Time ◽

Field Programmable Gate Array ◽

High Speed ◽

Mechanical Engineering ◽

Electronic Engineering ◽

Bearing Condition Monitoring ◽

Field Programmable ◽

Gate Array

Field-Programmable Gate Array (FPGA) technology has been applied widely in the electronic engineering and computing industries, but it has not had the same level of reception in other disciplines, including mechanical engineering [1]. The purpose of this paper is to examine FPGA implementations of signal processing techniques that are used in the context of bearing condition monitoring. As the number of bearings can be large, sparse sensor arrays are used to locate them and detect their condition. The demands of real-time process monitoring [2] [3] can place a heavy burden upon the monitoring system. FPGA technology [4] in this application makes it possible to implement more sophisticated algorithms that exploit its high-speed, parallel, reconfigurable architecture, bringing the advantages of FPGA technology to condition monitoring. The techniques covered are: cross-correlation, digital signal processing (DSP) Infinite Impulse Response (IIR) filters, neural networks and signature matching. The implemented designs are optimised for both execution time and the amount of logic area consumed. Results were obtained from each technique and were assessed and compared in terms of execution time and the amount of logic consumed on the FPGA. Over the past 15 years FPGA technology has been applied extensively in electronic engineering, but its scope has not been as wide in mechanical engineering. The objective of this paper was to examine an application in mechanical engineering. Ideally this would be done with an approach compatible with mechanical engineering, giving rise to a methodology that would allow FPGA programming [5] to become a transferable skill.

High-speed optical signal processing and data transmission

Proceedings of 2004 6th International Conference on Transparent Optical Networks (IEEE Cat. No.04EX804) ◽

10.1109/icton.2004.1360271 ◽

2005 ◽

Author(s):

H.G. Weber ◽

R. Ludwig ◽

S. Ferber ◽

C. Boerner ◽

C. Schubert ◽

...

Keyword(s):

Signal Processing ◽

Data Transmission ◽

High Speed ◽

Optical Signal Processing ◽

Optical Signal

Wireless Power and Data Transmission System of Submarine Cable-Inspecting Robot Fish and Its Time-Sharing Multiplexing Method

Electronics ◽

10.3390/electronics8080838 ◽

2019 ◽

Vol 8 (8) ◽

pp. 838 ◽

Cited By ~ 3

Author(s):

Chen ◽

Sun ◽

Huang ◽

Zhou ◽

Meng ◽

...

Keyword(s):

Data Transmission ◽

Orthogonal Frequency Division Multiplexing ◽

High Speed ◽

High Efficiency ◽

Transmission Mode ◽

Modulation Method ◽

Wireless Charging ◽

Time Sharing ◽

Submarine Cable ◽

Robot Fish

In this paper, a hybrid system topology with a one-way wireless charging function and a bidirectional data communication function is proposed to address the problems of electric energy replenishment and data transmission faced by robot fish during autonomous submarine cable inspection. Three working modes of the system and the time-sharing multiplexing method are studied. In the power transmission mode, high-efficiency wireless charging is realized by utilizing the transmission characteristics of a series–series (SS)-type resonant network, which involves series resonant networks on both the primary side and the secondary side. In the alignment detection and handshake communication mode, the charging platform distance recognition and the handshake signal transmission are implemented through a series–parallel (SP)-type resonant network based on the ASK (amplitude shift keying) modulation method. In the high-speed data transmission mode, the reverse (secondary to primary) high-speed transmission of the inspection data is achieved through an SP-type resonant network based on the OFDM (orthogonal frequency division multiplexing) modulation method. The three modes share the same coupled coils via a reconfigurable resonant network. The working principle of the system is expounded, the system characteristics under each working mode are analyzed, and the time-division multiplexing control strategy is given. The rationality and effectiveness of the scheme are verified by experiments.

3D Body Scanning Measurement System Associated with RF Imaging, Zero-padding and Parallel Processing

Measurement Science Review ◽

10.1515/msr-2016-0011 ◽

2016 ◽

Vol 16 (2) ◽

pp. 77-86 ◽

Cited By ~ 1

Author(s):

Hyung Tae Kim ◽

Kyung Chan Jin ◽

Seung Taek Kim ◽

Jongseok Kim ◽

Seung-Bok Choi

Keyword(s):

Signal Processing ◽

Fourier Transform ◽

Fast Fourier Transform ◽

Measurement System ◽

High Speed ◽

Peak Frequency ◽

Processing Unit ◽

Zero Padding ◽

Scanning Time ◽

Rf Antenna

This work presents a novel signal processing method for high-speed 3D body measurements using millimeter waves with a general processing unit (GPU) and zero-padding fast Fourier transform (ZPFFT). The proposed measurement system consists of a radio-frequency (RF) antenna array for a penetrable measurement, a high-speed analog-to-digital converter (ADC) for significant data acquisition, and a general processing unit for fast signal processing. The RF waves of the transmitter and the receiver are converted to real and imaginary signals that are sampled by a high-speed ADC and synchronized with the kinematic positions of the scanner. Because the distance between the surface and the antenna is related to the peak frequency of the conjugate signals, a fast Fourier transform (FFT) is applied to the signal processing after the sampling. The sampling time is finite owing to a short scanning time, and the physical resolution needs to be increased; therefore, zero-padding is applied to interpolate the spectra of the sampled signals to consider a 1/m floating point frequency. The GPU and a parallel algorithm are applied to accelerate the ZPFFT because of the large number of additional mathematical operations it requires. 3D body images are finally obtained from spectrograms, which are the arrangement of the ZPFFT results in a 3D space.
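
The effect of zero-padding on peak-frequency estimation, which this abstract relies on, can be seen in a small experiment. The sketch below uses a direct discrete Fourier transform in plain Python rather than an FFT or a GPU, so it illustrates the interpolation effect only and not the authors' implementation. The 32-sample window, the 5.3-cycle test tone, and the function name are arbitrary assumptions.

```python
import cmath
import math
from typing import List


def peak_frequency(samples: List[complex], padded_length: int) -> float:
    """Estimate the dominant frequency, in cycles per original window.

    Evaluating the DFT on `padded_length` bins is equivalent to appending
    zeros to `samples` before transforming: the spectrum is sampled on a
    finer grid, although no new information is added.
    """
    n = len(samples)
    best_bin, best_mag = 0, -1.0
    for k in range(padded_length):
        acc = sum(
            x * cmath.exp(-2j * math.pi * k * i / padded_length)
            for i, x in enumerate(samples)
        )
        if abs(acc) > best_mag:
            best_bin, best_mag = k, abs(acc)
    return best_bin * n / padded_length


WINDOW = 32
TRUE_FREQ = 5.3  # cycles per window, deliberately between two DFT bins
tone = [cmath.exp(2j * math.pi * TRUE_FREQ * i / WINDOW) for i in range(WINDOW)]

coarse = peak_frequency(tone, WINDOW)      # bin spacing of 1 cycle
fine = peak_frequency(tone, 8 * WINDOW)    # bin spacing of 1/8 cycle

if __name__ == "__main__":
    print(f"true {TRUE_FREQ}, unpadded estimate {coarse}, 8x padded estimate {fine}")
```

The padded transform evaluates eight times as many bins for the same input, which is the extra arithmetic the abstract cites as the reason for parallel acceleration.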

Design of High Speed and Low Area Confined Multiplier on FPGA

Revista Gestão Inovação e Tecnologias ◽

10.47059/revistageintec.v11i4.2315 ◽

2021 ◽

Vol 11 (4) ◽

pp. 2736-2746

Author(s):

Kandagatla Ravi Kumar ◽

Cheeli Priyadarshini ◽

Kanakam Bhavani ◽

Ankam Varun Sundar Kumar ◽

Palanki Naga Nanda Sai

Keyword(s):

Signal Processing ◽

Digital Signal Processing ◽

Field Programmable Gate Array ◽

High Speed ◽

Digital Signal ◽

Low Area ◽

Field Programmable ◽

Dsp Applications ◽

Advanced Applications ◽

Main Factor

Technology plays a major role in today's world, and developments in electronics in particular have had a large impact on quality of life. Among advanced applications, Digital Signal Processing (DSP) ranks first. Multipliers are the most basic elements widely used in DSP applications, so the design of the multiplier is the main factor in the performance of the device. Using RTL simulation and a Field Programmable Gate Array (FPGA), we compare the performance of a serial multiplier with an advanced multiplier. In this project, many single-bit adders are removed and replaced with multiplexers, so that the FPGA is used more efficiently and the design occupies fewer slices. The resulting multiplier architecture yields significant reductions in FPGA resources, latency, area, and power. The multiplier designs are simulated at RTL level in the Xilinx ISE simulator and synthesized in Xilinx ISE 14.7. Finally, the design is implemented on a Spartan 3E FPGA.
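
As a point of reference for the serial multiplier mentioned in this abstract, the following is a minimal behavioural model of a shift-and-add multiplier, which processes one multiplier bit per step. It is a generic textbook scheme written as an assumption about what "serial multiplier" means here. It does not model the multiplexer-based architecture that the authors propose, and the function name and 8-bit default width are hypothetical.

```python
def shift_add_multiply(a: int, b: int, width: int = 8) -> int:
    """Multiply two unsigned `width`-bit integers one multiplier bit at a time.

    Each loop iteration corresponds to one clock cycle of a serial multiplier:
    if the current bit of `b` is set, a shifted copy of `a` is accumulated.
    """
    limit = 1 << width
    if not (0 <= a < limit and 0 <= b < limit):
        raise ValueError(f"operands must fit in {width} unsigned bits")
    product = 0
    for bit in range(width):
        if (b >> bit) & 1:
            product += a << bit  # partial product for this bit position
    return product


if __name__ == "__main__":
    print(shift_add_multiply(13, 11))
```

The conditional accumulate in the loop body is a select between adding the partial product and adding nothing, the kind of choice a multiplexer implements in hardware.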

Design Space Exploration for High-Speed Implementation of the MISTY1 Block Cipher

Mathematical Problems in Engineering ◽

10.1155/2021/2599500 ◽

2021 ◽

Vol 2021 ◽

pp. 1-14

Author(s):

Raza Hasan ◽

Yasir Khizar ◽

Salman Mahmood ◽

Muhammad Kashif Sheikh

Keyword(s):

High Speed ◽

Design Space Exploration ◽

Block Cipher ◽

Design Space ◽

High Efficiency ◽

Space Exploration ◽

Wireless Applications ◽

Low Area ◽

Field Programmable ◽

Very High

This paper proposes 2 × unrolled high-speed architectures of the MISTY1 block cipher for wireless applications including sensor networks and image encryption. Design space exploration is carried out for 8-round MISTY1 utilizing dual-edge trigger (DET) and single-edge trigger (SET) pipelines to analyze the tradeoff between speed and area. The design is primarily based on the optimized implementation of lookup tables (LUTs) for MISTY1 and its core transformation functions. The LUTs are designed by logically formulating S9/S7 s-boxes and FI and {FO + 32-bit XOR} functions with the fine placement of pipelines. Highly efficient and high-speed MISTY1 architectures are thus obtained and implemented on the field-programmable gate array (FPGA), Virtex-7, XC7VX690T. The high-speed/very high-speed MISTY1 architectures achieve throughput values of 25.2/43 Gbps covering an area of 1331/1509 CLB slices, respectively. The proposed MISTY1 architecture outperforms all previous MISTY1 implementations, combining high speed with low area to achieve a high efficiency value. The proposed architecture also has higher efficiency values than the existing AES and Camellia architectures. This reflects the optimizations made for the proposed high-speed MISTY1 architectures.

Implementation of a CRC (Cyclic Redundancy Check) Generator Circuit on FPGA (Field Programmable Gate Array)

IJEIS (Indonesian Journal of Electronics and Instrumentation Systems) ◽

10.22146/ijeis.43906 ◽

2019 ◽

Vol 9 (1) ◽

pp. 65

Author(s):

Nia Gella Augoestien ◽

Ryan Aditya

Keyword(s):

Data Transmission ◽

High Speed ◽

Clock Cycle ◽

Computational Time ◽

Storage Process ◽

Cyclic Redundancy Check ◽

Field Programmable ◽

High Speed Data ◽

And Storage ◽

Speed Data Transmission

Data integrity in high-speed data transmission is a major requirement that cannot be ignored, because high-speed data transmission is prone to data errors. CRC (Cyclic Redundancy Check) is a mechanism that is often used to detect errors in data transmission and storage. When a CRC is implemented in embedded software or on a processor, it requires many clock cycles. If the CRC generator is implemented in dedicated hardware, the computation time is reduced so that the requirements of high-speed communication systems can be met. This paper proposes the design and implementation of a CRC generator on an FPGA that is capable of minimizing computation time. The method reduces calculation latency by separating the coefficients of certain digits and directly calculating the result of the modulo by the key polynomial. The CRC generator in this paper was implemented on a Xilinx Spartan®-6 Series device (XC6LX16-CS324). The modeling results show that the computation completes in 1 clock cycle. The hardware efficiency achieved is 0.38 Gbps/Slice, while the throughput is 3.758 Gbps.
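
The contrast this abstract draws between many-cycle software CRC and a faster dedicated computation can be illustrated in software. The sketch below shows a bit-serial CRC next to a table-driven version that handles eight bits per step. The abstract does not name a polynomial, so the CRC-16 parameters used here (polynomial 0x1021, initial value 0xFFFF, no bit reflection) are an assumption, and the table lookup is a stand-in for, not a reproduction of, the paper's single-cycle hardware method.

```python
from typing import List

POLY = 0x1021  # assumed generator polynomial (x^16 + x^12 + x^5 + 1)


def crc16_bitwise(data: bytes, init: int = 0xFFFF) -> int:
    """Bit-serial CRC: one shift (and conditional XOR) per message bit."""
    crc = init
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ POLY) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


# Precompute the effect of eight consecutive shifts for every possible byte.
CRC_TABLE: List[int] = [crc16_bitwise(bytes([value]), init=0) for value in range(256)]


def crc16_table(data: bytes, init: int = 0xFFFF) -> int:
    """Table-driven CRC: one lookup per byte instead of eight shift steps."""
    crc = init
    for byte in data:
        crc = ((crc << 8) & 0xFFFF) ^ CRC_TABLE[((crc >> 8) ^ byte) & 0xFF]
    return crc


if __name__ == "__main__":
    message = b"123456789"
    print(hex(crc16_bitwise(message)), hex(crc16_table(message)))
```

Both functions return the same value for any input. The table version trades memory for fewer sequential steps, a software counterpart to spending logic area to cut clock cycles in hardware.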
