A General-Purpose Method for Faithfully Rounded Floating-Point Function Approximation in FPGAs

Author(s):  
David B. Thomas
AIAA Journal ◽  
10.2514/2.337 ◽  
1998 ◽  
Vol 36 (12) ◽  
pp. 2269-2275 ◽  
Author(s):  
Suqiang Xu ◽  
Ramana V. Grandhi

2011 ◽  
Vol 58-60 ◽  
pp. 1037-1042
Author(s):  
Sheng Long Li ◽  
Zhao Lin Li ◽  
Qing Wei Zheng

Double precision floating point matrix operations are wildly used in a variety of engineering and scientific computing applications. However, it’s inefficient to achieve these operations using software approaches on general purpose processors. In order to reduce the processing time and satisfy the real-time demand, a reconfigurable coprocessor for double precision floating point matrix algorithms is proposed in this paper. The coprocessor is embedded in a Multi-Processor System on Chip (MPSoC), cooperates with an ARM core and a DSP core for high-performance control and calculation. One algorithm in GPS applications is taken for example to illustrate the efficiency of the coprocessor proposed in this paper. The experiment result shows that the coprocessor can achieve speedup a factor of 50 for the quaternion algorithm of attitude solution in inertial navigation application compare with software execution time of a TI C6713 DSP. The coprocessor is implemented in SMIC 0.13μm CMOS technology, the synthesis time delay is 9.75ns, and the power consumption is 63.69 mW when it works at 100MHz.


2017 ◽  
Vol 27 (03n04) ◽  
pp. 1750006 ◽  
Author(s):  
Farhad Merchant ◽  
Anupam Chattopadhyay ◽  
Soumyendu Raha ◽  
S. K. Nandy ◽  
Ranjani Narayan

Basic Linear Algebra Subprograms (BLAS) and Linear Algebra Package (LAPACK) form basic building blocks for several High Performance Computing (HPC) applications and hence dictate performance of the HPC applications. Performance in such tuned packages is attained through tuning of several algorithmic and architectural parameters such as number of parallel operations in the Directed Acyclic Graph of the BLAS/LAPACK routines, sizes of the memories in the memory hierarchy of the underlying platform, bandwidth of the memory, and structure of the compute resources in the underlying platform. In this paper, we closely investigate the impact of the Floating Point Unit (FPU) micro-architecture for performance tuning of BLAS and LAPACK. We present theoretical analysis for pipeline depth of different floating point operations like multiplier, adder, square root, and divider followed by characterization of BLAS and LAPACK to determine several parameters required in the theoretical framework for deciding optimum pipeline depth of the floating operations. A simple design of a Processing Element (PE) is presented and shown that the PE outperforms the most recent custom realizations of BLAS and LAPACK by 1.1X to 1.5X in GFlops/W, and 1.9X to 2.1X in Gflops/mm2. Compared to multicore, General Purpose Graphics Processing Unit (GPGPU), Field Programmable Gate Array (FPGA), and ClearSpeed CSX700, performance improvement of 1.8-80x is reported in PE.


AIAA Journal ◽  
1998 ◽  
Vol 36 ◽  
pp. 2269-2275 ◽  
Author(s):  
Suqiang Xu ◽  
Ramana V. Grandhi

Sign in / Sign up

Export Citation Format

Share Document