Low-precision Floating-point Arithmetic for High-performance FPGA-based CNN Acceleration
Low-precision data representation is important for reducing storage size and memory access in convolutional neural networks (CNNs). Yet, existing methods have two major limitations: (1) they require re-training to maintain accuracy for deep CNNs, and (2) they need 16-bit floating-point or 8-bit fixed-point representations to achieve good accuracy. In this article, we propose a low-precision (8-bit) floating-point (LPFP) quantization method for FPGA-based acceleration to overcome the above limitations. Without any re-training, LPFP finds an optimal 8-bit data representation with negligible top-1/top-5 accuracy loss (within 0.5%/0.3% in our experiments, respectively, and significantly better than existing methods for deep CNNs). Furthermore, we implement one 8-bit LPFP multiplication with one 4-bit multiply-adder and one 3-bit adder, and can therefore implement four 8-bit LPFP multiplications using a single DSP48E1 of the Xilinx Kintex-7 family or DSP48E2 of the Xilinx UltraScale/UltraScale+ family, whereas one DSP can implement only two 8-bit fixed-point multiplications. Experiments on six typical CNNs for inference show that, on average, we improve throughput over existing FPGA accelerators. Particularly for VGG16 and YOLO, compared to six recent FPGA accelerators, we improve average throughput by 3.5× and 27.5× and average throughput per DSP by 4.1× and 5×, respectively.
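To make the multiplier decomposition concrete, the sketch below illustrates how an 8-bit floating-point product can be formed from a small mantissa multiply plus a short exponent add, which is what allows several such multiplications to be packed into one DSP slice. The field widths (1-bit sign, 3-bit exponent, 4-bit mantissa), the absence of an implicit leading bit, and the names `lpfp8` and `lpfp_mul` are illustrative assumptions chosen to match the 4-bit multiplier and 3-bit adder mentioned above; the paper's exact LPFP encoding may differ.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical 8-bit LPFP layout, for illustration only:
 * 1-bit sign | 3-bit exponent | 4-bit mantissa.
 * Value is approximately (-1)^sign * mantissa * 2^exponent. */
typedef struct {
    uint8_t sign;      /* 1 bit  */
    uint8_t exponent;  /* 3 bits */
    uint8_t mantissa;  /* 4 bits */
} lpfp8;

/* Multiply two LPFP values the way a DSP-friendly datapath might:
 * a 4-bit x 4-bit mantissa multiply and a 3-bit exponent add,
 * then truncate back into the 8-bit format. */
static lpfp8 lpfp_mul(lpfp8 a, lpfp8 b) {
    lpfp8 r;
    uint8_t prod = (uint8_t)(a.mantissa * b.mantissa);        /* up to 8 bits */
    uint8_t exp  = (uint8_t)((a.exponent + b.exponent) & 0x7); /* 3-bit add (wraps; real HW would saturate) */

    /* Renormalize: keep the top 4 bits of the product, bump the exponent. */
    while (prod > 0xF) {
        prod >>= 1;
        exp = (uint8_t)((exp + 1) & 0x7);
    }
    r.sign     = a.sign ^ b.sign;
    r.exponent = exp;
    r.mantissa = prod;
    return r;
}

int main(void) {
    lpfp8 x = {0, 2, 5};   /*  5 * 2^2 =  20 */
    lpfp8 y = {1, 1, 3};   /* -(3 * 2^1) = -6 */
    lpfp8 z = lpfp_mul(x, y);
    printf("sign=%u exp=%u mant=%u  (~ %s%u)\n",
           z.sign, z.exponent, z.mantissa,
           z.sign ? "-" : "", (unsigned)(z.mantissa << z.exponent));
    return 0;
}
```

The key point of the sketch is that the wide integer multiply needed by fixed-point arithmetic is replaced by one narrow (4-bit) multiply and one narrow (3-bit) add, which is why a single DSP48E1/DSP48E2 can host more LPFP multiplications than 8-bit fixed-point ones.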