floating point Latest Research Papers

Low-precision Floating-point Arithmetic for High-performance FPGA-based CNN Acceleration

Low-precision data representation is important for reducing storage size and memory access in convolutional neural networks (CNNs). Yet existing methods have two major limitations: (1) they require re-training to maintain accuracy for deep CNNs, and (2) they need 16-bit floating-point or 8-bit fixed-point for good accuracy. In this article, we propose a low-precision (8-bit) floating-point (LPFP) quantization method for FPGA-based acceleration to overcome these limitations. Without any re-training, LPFP finds an optimal 8-bit data representation with negligible top-1/top-5 accuracy loss (within 0.5%/0.3% in our experiments, respectively, and significantly better than existing methods for deep CNNs). Furthermore, we implement one 8-bit LPFP multiplication with one 4-bit multiply-adder and one 3-bit adder, and therefore implement four 8-bit LPFP multiplications using one DSP48E1 of the Xilinx Kintex-7 family or one DSP48E2 of the Xilinx Ultrascale/Ultrascale+ family, whereas one DSP can implement only two 8-bit fixed-point multiplications. Experiments on six typical CNNs for inference show that, on average, we improve throughput over existing FPGA accelerators. Particularly for VGG16 and YOLO, compared to six recent FPGA accelerators, we improve average throughput by 3.5 and 27.5 and average throughput per DSP by 4.1 and 5, respectively.
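As a rough illustration of what quantizing to an 8-bit floating-point format involves, the following Python sketch rounds a value to the nearest number representable with 1 sign bit, `exp_bits` exponent bits, and `man_bits` fraction bits. The 1/4/3 split, the bias, the absence of inf/NaN codes, and the name `quantize_minifloat` are assumptions made for this example; the abstract does not state the actual LPFP bit layout or how it is selected.

```python
import math

def quantize_minifloat(x: float, exp_bits: int = 4, man_bits: int = 3) -> float:
    """Round x to the nearest value of a small float format (hypothetical layout)."""
    if x == 0.0:
        return 0.0
    bias = 2 ** (exp_bits - 1) - 1
    e_min = 1 - bias                     # smallest normal exponent
    e_max = (2 ** exp_bits - 1) - bias   # assumption: no codes reserved for inf/NaN
    sign = -1.0 if x < 0 else 1.0
    mag = abs(x)
    _, e = math.frexp(mag)               # mag = m * 2**e with 0.5 <= m < 1
    e -= 1                               # so that mag = s * 2**e with 1 <= s < 2
    e = max(e, e_min)                    # below e_min the spacing stays fixed (subnormals)
    step = 2.0 ** (e - man_bits)         # spacing between representable values
    q = round(mag / step) * step         # round to nearest, ties to even
    max_val = (2.0 - 2.0 ** -man_bits) * 2.0 ** e_max
    return sign * min(q, max_val)        # saturate instead of overflowing

print(quantize_minifloat(0.3))     # 0.3125
print(quantize_minifloat(1000.0))  # 480.0 (saturated)
```

Changing `exp_bits` and `man_bits` (keeping their sum at 7) trades dynamic range against precision, which is the kind of choice an 8-bit format has to make.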

Design and Implementation of Floating-Point Addition and Floating-Point Multiplication

International Journal for Research in Applied Science and Engineering Technology ◽ 10.22214/ijraset.2022.39742 ◽ 2022 ◽ Vol 10 (1) ◽ pp. 98-101

Author(s): Nagireddy Kavya

Keyword(s): High Precision ◽ Conventional Method ◽ Data Representation ◽ Floating Point ◽ Delay Reduction ◽ Point Multiplication ◽ Design And Implementation ◽ Area Reduction ◽ Look Ahead ◽ Simulation Results

Abstract: In this paper, we present the design and implementation of floating-point addition and floating-point multiplication. Among the many multipliers in existence, floating-point multiplication and floating-point addition offer high precision and greater accuracy for the data representation of images. The design is written in Verilog and simulated in Xilinx ISE 14.7. Simulation results show area reduction and delay reduction compared to the conventional method.

Keywords: FIR Filter, Floating-point Addition, Floating-point Multiplication, Carry Look Ahead Adder

Logic Design and Power Optimization of Floating-Point Multipliers

Computational Intelligence and Neuroscience ◽ 10.1155/2022/6949846 ◽ 2022 ◽ Vol 2022 ◽ pp. 1-10

Author(s): Na Bai ◽ Hang Li ◽ Jiming Lv ◽ Shuai Yang ◽ Yaohua Xu

Keyword(s): Power Consumption ◽ Power Optimization ◽ Current Situation ◽ Floating Point ◽ Cell Process ◽ Logic Design ◽ Flowing Water ◽ Partial Product ◽ Standard Cell ◽ Single Precision

Under the IEEE 754 standard, multiplication in single-precision floating-point arithmetic currently consumes excessive time and power. To address this, an extended Booth-Wallace algorithm is used: the partial products produced by Booth encoding are rounded and predicted using the sign-extension idea, the generation and accumulation of partial products in single-precision floating-point multiplication are optimized, and pipelining is used to improve throughput. Based on this, a series of verification and synthesis simulations is performed using the SMIC-7 nm standard cell process. The results verify that the new single-precision floating-point multiplier achieves a smaller power share than the conventional single-precision floating-point multiplier.
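The accumulation of partial products mentioned in this abstract can be pictured with a textbook Wallace-style reduction, sketched below in Python. Each carry-save step compresses three addends into two without propagating carries, and one ordinary addition finishes the job. This is a generic illustration for unsigned operands, not the paper's optimized design; the function names are made up for the example, and Booth encoding and pipelining are left out.

```python
def carry_save_add(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 compressor: returns (sum, carry) with a + b + c == sum + carry."""
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return s, carry

def partial_products(x: int, y: int, bits: int) -> list[int]:
    """One shifted copy of x for every set bit of the unsigned multiplier y."""
    return [(x << i) if (y >> i) & 1 else 0 for i in range(bits)]

def reduce_carry_save(terms: list[int]) -> int:
    """Compress groups of three addends to two until only two remain."""
    terms = list(terms)
    while len(terms) > 2:
        nxt = []
        for i in range(0, len(terms) - 2, 3):
            nxt.extend(carry_save_add(*terms[i:i + 3]))
        nxt.extend(terms[len(terms) - len(terms) % 3:])  # leftovers pass through
        terms = nxt
    return sum(terms)  # final carry-propagate addition

def wallace_multiply(x: int, y: int, bits: int = 24) -> int:
    """Multiply two unsigned `bits`-wide integers (24 = binary32 significand)."""
    return reduce_carry_save(partial_products(x, y, bits))

print(wallace_multiply(0xC00000, 0xA00000) == 0xC00000 * 0xA00000)  # True
```

The default width of 24 matches the significand of a single-precision number (23 stored bits plus the hidden bit); exponent handling, normalization, and rounding are outside this sketch.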

POSIT vs. Floating Point in Implementing IIR Notch Filter by Enhancing Radix-4 Modified Booth Multiplier

Electronics ◽ 10.3390/electronics11010163 ◽ 2022 ◽ Vol 11 (1) ◽ pp. 163

Author(s): Anwar A. Esmaeel ◽ Sa’ed Abed ◽ Bassam J. Mohd ◽ Abbas A. Fairouz

Keyword(s): Performance Metrics ◽ Notch Filter ◽ Floating Point ◽ Point System ◽ Booth Multiplier ◽ Accuracy And Precision ◽ Power And Energy ◽ Division Operation ◽ Series Technique ◽ Better Than

The increased demand for better accuracy and precision and wider data size has strained the current floating point system and motivated the development of the POSIT system. The POSIT system supports flexible formats and tapered precision and provides equivalent accuracy with fewer bits. This paper examines the POSIT and floating point systems, comparing the performance of 32-bit POSIT and 32-bit floating point systems using IIR notch filter implementation. Given that the bulk of the calculations in the filter are multiplication operations, an Enhanced Radix-4 Modified Booth Multiplier (ERMBM) is implemented to increase the calculation speed and efficiency. ERMBM enhances area, speed, power, and energy compared to the POSIT regular multiplier by 26.80%, 51.97%, 0.54%, and 52.22%, respectively, without affecting the accuracy. Moreover, the Taylor series technique is adopted to implement the division operation along with cosine arithmetic unit for POSIT numbers. After comparing POSIT with floating point, the accuracy of POSIT is 92.31%, which is better than floating point’s accuracy of 23.08%. Moreover, POSIT reduces area by 21.77% while increasing the delay. However, when the ERMBM is utilized instead of the POSIT regular multiplier in implementing the filter, POSIT outperforms floating point in all the performance metrics including area, speed, power, and energy by 35.68%, 20.66%, 31.49%, and 45.64%, respectively.
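For readers unfamiliar with radix-4 modified Booth multiplication, the Python sketch below shows the standard recoding step on which such multipliers are built. The multiplier is rewritten as digits in {-2, -1, 0, 1, 2}, which halves the number of partial products. This is the textbook scheme only; the enhancements that make up ERMBM are not described in the abstract and are not reproduced here, and the function names are hypothetical.

```python
# Digit for each overlapping triplet (y[i+1], y[i], y[i-1]):
# digit = -2*y[i+1] + y[i] + y[i-1]
BOOTH_TABLE = [0, 1, 1, 2, -2, -1, -1, 0]

def booth_radix4_digits(y: int, bits: int) -> list[int]:
    """Recode a signed `bits`-wide multiplier into radix-4 digits, LSB first."""
    if bits % 2 or not -(1 << (bits - 1)) <= y < (1 << (bits - 1)):
        raise ValueError("need an even width and a multiplier that fits in it")
    pattern = (y & ((1 << bits) - 1)) << 1  # two's complement, plus y[-1] = 0
    return [BOOTH_TABLE[(pattern >> i) & 0b111] for i in range(0, bits, 2)]

def booth_multiply(x: int, y: int, bits: int = 8) -> int:
    """Sum one partial product (0, +-x or +-2x, shifted) per Booth digit."""
    digits = booth_radix4_digits(y, bits)
    return sum((d * x) << (2 * k) for k, d in enumerate(digits))

print(booth_radix4_digits(7, 8))  # [-1, 2, 0, 0]
print(booth_multiply(-13, 7))     # -91
```

An 8-bit multiplier yields four partial products instead of eight, and each is obtained from `x` by shifting and negation only.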

Reducing rational polynomial: a proposition of a strategy to deal with floating point numbers using singular value decomposition

Soft Computing ◽ 10.1007/s00500-021-06451-4 ◽ 2022

Author(s): Ahmad Deeb ◽ Rafik Belarbi

Keyword(s): Singular Value Decomposition ◽ Singular Value ◽ Floating Point ◽ Rational Polynomial ◽ Value Decomposition ◽ Floating Point Numbers

Problems of the Commutative and Grouping Properties of the Addition of Floating Point Numbers in Modern Programming Languages

Springer Proceedings in Earth and Environmental Sciences - Proceedings of the 3rd International Conference on BioGeoSciences ◽ 10.1007/978-3-030-88919-7_18 ◽ 2022 ◽ pp. 237-246

Author(s): Vladimir Mochalov ◽ Anastasia Mochalova

Keyword(s): Programming Languages ◽ Floating Point ◽ Floating Point Numbers
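No abstract is listed for this paper, but the general phenomenon its title names is easy to reproduce. The short Python sketch below assumes IEEE 754 double precision, which CPython uses for `float`. It shows that regrouping or reordering a sum can change the result, even though swapping the two operands of a single addition does not. It illustrates the topic only and is not taken from the paper.

```python
import math

a, b, c = 0.1, 0.2, 0.3

left = (a + b) + c    # 0.6000000000000001
right = a + (b + c)   # 0.6
print(left == right)  # False: grouping changes the rounded result

print(a + b == b + a)  # True: a single addition is commutative

# Reordering a longer sum changes the grouping and hence the result.
values = [1e16, 1.0, -1e16]
naive = sum(values)                  # 0.0, the 1.0 is absorbed by 1e16
reordered = sum([1e16, -1e16, 1.0])  # 1.0
exact = math.fsum(values)            # 1.0, correctly rounded sum
print(naive, reordered, exact)
```

`math.fsum` tracks the intermediate rounding errors, so its result does not depend on the order of the inputs.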

Clock Skew Compensation Algorithm Immune to Floating-Point Precision Loss

IEEE Communications Letters ◽ 10.1109/lcomm.2022.3142904 ◽ 2022 ◽ pp. 1-1

Author(s): Kyeong Soo Kim ◽ Seungyeop Kang

Keyword(s): Floating Point ◽ Clock Skew ◽ Compensation Algorithm

IEEE754 Binary32 Floating-Point Logarithmic Algorithms based on Taylor-Series Expansion with Mantissa Region Conversion and Division

IEICE Transactions on Fundamentals of Electronics Communications and Computer Sciences ◽ 10.1587/transfun.2021eap1076 ◽ 2022

Author(s): Jianglin WEI ◽ Anna KUWANA ◽ Haruo KOBAYASHI ◽ Kazuyoshi KUBO

Keyword(s): Series Expansion ◽ Taylor Series ◽ Taylor Series Expansion ◽ Floating Point
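Only the title of this paper is listed, so the following Python sketch shows just the general idea it points to: split a binary32 value into exponent and mantissa, then evaluate the logarithm of the mantissa with a Taylor series. The single range-reduction step around the square root of 2 and the term count are assumptions chosen for the example; the paper's own mantissa region conversion and division scheme is not described here and may differ.

```python
import math
import struct

def split_binary32(x: float) -> tuple[int, float]:
    """Return (unbiased exponent, mantissa in [1, 2)) of x stored as binary32."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    biased = (bits >> 23) & 0xFF
    if bits >> 31 or biased in (0, 0xFF):
        raise ValueError("sketch handles positive normal numbers only")
    return biased - 127, 1.0 + (bits & 0x7FFFFF) / 2 ** 23

def ln_taylor(x: float, terms: int = 30) -> float:
    """ln(x) = e*ln(2) + ln(m), with ln(m) from the series of ln(1 + t)."""
    e, m = split_binary32(x)
    if m > math.sqrt(2.0):   # keep |t| small so the series converges quickly
        m /= 2.0
        e += 1
    t = m - 1.0
    series = sum((-1) ** (k + 1) * t ** k / k for k in range(1, terms + 1))
    return e * math.log(2.0) + series

print(abs(ln_taylor(10.0) - math.log(10.0)) < 1e-9)  # True
```

After the reduction step `t` lies roughly in [-0.30, 0.42], so 30 terms are more than enough for this tolerance; a hardware design would presumably use far fewer terms over narrower regions.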

Online Signature Verification Systems on a Low-Cost FPGA

Applied Sciences ◽ 10.3390/app12010378 ◽ 2021 ◽ Vol 12 (1) ◽ pp. 378

Author(s): Enrique Cantó Navarro ◽ Rafael Ramos Lara ◽ Mariano López García

Keyword(s): Embedded System ◽ Execution Time ◽ Low Cost ◽ Computing System ◽ Floating Point ◽ Signature Verification ◽ Double Precision ◽ Hardware Accelerator ◽ Online Signature ◽ Online Signature Verification

This paper describes three different approaches for the implementation of an online signature verification system on a low-cost FPGA. The system is based on an algorithm, which operates on real numbers using the double-precision floating-point IEEE 754 format. The double-precision computations are replaced by simpler formats, without affecting the biometrics performance, in order to permit efficient implementations on low-cost FPGA families. The first approach is an embedded system based on MicroBlaze, a 32-bit soft-core microprocessor designed for Xilinx FPGAs, which can be configured by including a single-precision floating-point unit (FPU). The second implementation attaches a hardware accelerator to the embedded system to reduce the execution time on floating-point vectors. The last approach is a custom computing system, which is built from a large set of arithmetic circuits that replace the floating-point data with a more efficient representation based on fixed-point format. The latter system provides a very high runtime acceleration factor at the expense of using a large number of FPGA resources, a complex development cycle and no flexibility since it cannot be adapted to other biometric algorithms. By contrast, the first system provides just the opposite features, while the second approach is a mixed solution between both of them. The experimental results show that both the hardware accelerator and the custom computing system reduce the execution time by a factor ×7.6 and ×201 but increase the logic FPGA resources by a factor ×2.3 and ×5.2, respectively, in comparison with the MicroBlaze embedded system.
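The third approach replaces floating-point data with a fixed-point representation. A minimal Python sketch of that idea follows, assuming a signed format with 16 fractional bits. The word format, the constant `FRAC_BITS`, and the helper names are illustrative assumptions; the abstract does not give the format actually used.

```python
FRAC_BITS = 16            # assumed number of fractional bits
ONE = 1 << FRAC_BITS      # fixed-point representation of 1.0

def to_fixed(x: float) -> int:
    """Scale and round a real value to a fixed-point integer."""
    return round(x * ONE)

def to_float(q: int) -> float:
    """Convert a fixed-point integer back to a float for inspection."""
    return q / ONE

def fixed_mul(a: int, b: int) -> int:
    """Multiply two fixed-point values; the shift drops the extra scale factor."""
    return (a * b) >> FRAC_BITS   # arithmetic shift, rounds toward -infinity

def fixed_dot(xs: list[int], ys: list[int]) -> int:
    """Accumulate full-width products and rescale once at the end."""
    return sum(x * y for x, y in zip(xs, ys)) >> FRAC_BITS

print(to_float(fixed_mul(to_fixed(1.5), to_fixed(2.25))))  # 3.375
```

Every operation is integer addition, multiplication, or shifting, which is why this representation maps to simpler arithmetic circuits than a floating-point unit. The cost is a fixed range and a fixed absolute resolution of 2**-16.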

A Fixed Algorithm of Ambiguity among the Network RTK Reference Stations

Sensors ◽ 10.3390/s22010165 ◽ 2021 ◽ Vol 22 (1) ◽ pp. 165

Author(s): Shouhua Wang ◽ Zhiqi You ◽ Xiyan Sun

Keyword(s): Satellite Observation ◽ Reference Station ◽ Floating Point ◽ Extended Kalman Filtering ◽ Fixation Rate ◽ Multiple Systems ◽ Network Rtk ◽ R Ratio ◽ Point Solution ◽ Ambiguity Decorrelation

In a complex observation environment, ambiguity resolution among the reference stations of network real-time kinematic (RTK) positioning is degraded. The joint solution of multiple systems makes the ambiguity dimension increase steeply, which makes it difficult to estimate all the ambiguities. In addition, when satellite observation signals are received in an environment with many occlusions, the observed values are prone to gross errors, resulting in obvious deviations in the solution. In this paper, a new network RTK fixation algorithm for partial ambiguity among the reference stations is proposed. It first estimates the floating-point ambiguity using the robust extended Kalman filtering (EKF) technique based on mean estimation, then finds the optimal ambiguity subset by the optimized partial ambiguity solving method. Finally, it fixes the floating-point solution using the least-squares ambiguity decorrelation adjustment (LAMBDA) algorithm together with a joint test of the ratio (R-ratio) and the bootstrapping success rate index. The experimental results indicate that the new method can significantly improve the fixation rate of ambiguity among network RTK reference stations and thus effectively improve the reliability of positioning results.
