Low-precision data representation is important to reduce storage size and memory access for convolutional neural networks (CNNs). Yet, existing methods have two major limitations: (1) requiring re-training to maintain accuracy for
CNNs and (2) needing 16-bit floating-point or 8-bit fixed-point for a good accuracy. In this article, we propose a low-precision (8-bit) floating-point (LPFP) quantization method for FPGA-based acceleration to overcome the above limitations. Without any re-training, LPFP finds an optimal 8-bit data representation with negligible top-1/top-5 accuracy loss (within 0.5%/0.3% in our experiments, respectively, and significantly better than existing methods for
CNNs). Furthermore, we implement one 8-bit LPFP multiplication by one 4-bit multiply-adder and one 3-bit adder, and therefore implement
8-bit LPFP multiplications using one DSP48E1 of Xilinx Kintex-7 family or DSP48E2 of Xilinx Ultrascale/Ultrascale+ family, whereas one DSP can implement only
8-bit fixed-point multiplications. Experiments on six typical CNNs for inference show that on average, we improve throughput by
over existing FPGA accelerators. Particularly for VGG16 and YOLO, compared to six recent FPGA accelerators, we improve average throughput by 3.5
and average throughput per DSP by 4.1
Abstract: In this paper, we present the design and implementation of Floating point addition and Floating point Multiplication. There are many multipliers in existence in which Floating point Multiplication and Floating point addition offers a high precision and more accuracy for the data representation of the image. This project is designed and simulated on Xilinx ISE 14.7 version software using verilog. Simulation results show area reduction and delay reduction as compared to the conventional method. Keywords: FIR Filter, Floating point Addition, Floating point Multiplication, Carry Look Ahead Adder
Under IEEE-754 standard, for the current situation of excessive time and power consumption of multiplication operations in single-precision floating-point operations, the expanded boothwallace algorithm is used, and the partial product caused by booth coding is rounded and predicted with the symbolic expansion idea, and the partial product caused by single-precision floating-point multiplication and the accumulation of partial products are optimized, and the flowing water is used to improve the throughput. Based on this, a series of verification and synthesis simulations are performed using the SMIC-7 nm standard cell process. It is verified that the new single-precision floating-point multiplier can achieve a smaller power share compared to the conventional single-precision floating-point multiplier.
The increased demand for better accuracy and precision and wider data size has strained current the floating point system and motivated the development of the POSIT system. The POSIT system supports flexible formats and tapered precision and provides equivalent accuracy with fewer bits. This paper examines the POSIT and floating point systems, comparing the performance of 32-bit POSIT and 32-bit floating point systems using IIR notch filter implementation. Given that the bulk of the calculations in the filter are multiplication operations, an Enhanced Radix-4 Modified Booth Multiplier (ERMBM) is implemented to increase the calculation speed and efficiency. ERMBM enhances area, speed, power, and energy compared to the POSIT regular multiplier by 26.80%, 51.97%, 0.54%, and 52.22%, respectively, without affecting the accuracy. Moreover, the Taylor series technique is adopted to implement the division operation along with cosine arithmetic unit for POSIT numbers. After comparing POSIT with floating point, the accuracy of POSIT is 92.31%, which is better than floating point’s accuracy of 23.08%. Moreover, POSIT reduces area by 21.77% while increasing the delay. However, when the ERMBM is utilized instead of the POSIT regular multiplier in implementing the filter, POSIT outperforms floating point in all the performance metrics including area, speed, power, and energy by 35.68%, 20.66%, 31.49%, and 45.64%, respectively.
This paper describes three different approaches for the implementation of an online signature verification system on a low-cost FPGA. The system is based on an algorithm, which operates on real numbers using the double-precision floating-point IEEE 754 format. The double-precision computations are replaced by simpler formats, without affecting the biometrics performance, in order to permit efficient implementations on low-cost FPGA families. The first approach is an embedded system based on MicroBlaze, a 32-bit soft-core microprocessor designed for Xilinx FPGAs, which can be configured by including a single-precision floating-point unit (FPU). The second implementation attaches a hardware accelerator to the embedded system to reduce the execution time on floating-point vectors. The last approach is a custom computing system, which is built from a large set of arithmetic circuits that replace the floating-point data with a more efficient representation based on fixed-point format. The latter system provides a very high runtime acceleration factor at the expense of using a large number of FPGA resources, a complex development cycle and no flexibility since it cannot be adapted to other biometric algorithms. By contrast, the first system provides just the opposite features, while the second approach is a mixed solution between both of them. The experimental results show that both the hardware accelerator and the custom computing system reduce the execution time by a factor ×7.6 and ×201 but increase the logic FPGA resources by a factor ×2.3 and ×5.2, respectively, in comparison with the MicroBlaze embedded system.
In the face of a complex observation environment, the solution of the reference station of the ambiguity of network real-time kinematic (RTK) will be affected. The joint solution of multiple systems makes the ambiguity dimension increase steeply, which makes it difficult to estimate all the ambiguity. In addition, when receiving satellite observation signals in the environment with many occlusions, the received satellite observation values are prone to gross errors, resulting in obvious deviations in the solution. In this paper, a new network RTK fixation algorithm for partial ambiguity among the reference stations is proposed. It first estimates the floating-point ambiguity using the robust extended Kalman filtering (EKF) technique based on mean estimation, then finds the optimal ambiguity subset by the optimized partial ambiguity solving method. Finally, fixing the floating-point solution by the least-squares ambiguity decorrelation adjustment (LAMBDA) algorithm and the joint test of ratio (R-ratio) and bootstrapping success rate index solver. The experimental results indicate that the new method can significantly improve the fixation rate of ambiguity among network RTK reference stations and thus effectively improve the reliability of positioning results.