Design of high performance double precision hybrid ALU for SoC applications

Author(s):  
S. Ravi ◽  
Adig ◽  
Harish M. Kittur
Electronics ◽  
2021 ◽  
Vol 10 (16) ◽  
pp. 1984
Author(s):  
Wei Zhang ◽  
Zihao Jiang ◽  
Zhiguang Chen ◽  
Nong Xiao ◽  
Yang Ou

Double-precision general matrix multiplication (DGEMM) is an essential kernel for measuring the potential performance of an HPC platform. ARMv8-based systems-on-chip (SoCs) have become candidates for next-generation HPC systems thanks to their highly competitive performance and energy efficiency. It is therefore worthwhile to design high-performance DGEMM for ARMv8-based SoCs. However, as ARMv8-based SoCs integrate more and more cores, modern CPUs use non-uniform memory access (NUMA), which restricts the performance and scalability of DGEMM when many threads access remote NUMA domains. This makes it challenging to develop high-performance DGEMM on multi-NUMA architectures. We present a NUMA-aware method that reduces the number of cross-die and cross-chip memory accesses. The key enabler for NUMA-aware DGEMM is to exploit two levels of parallelism, between and within nodes, in a purely threaded implementation, which allows task independence and data localization across NUMA nodes. We implemented NUMA-aware DGEMM in OpenBLAS and evaluated it on a dual-socket server with 48-core processors based on the Kunpeng 920 architecture. The results show that NUMA-aware DGEMM effectively reduces cross-die and cross-chip memory accesses, significantly enhancing the scalability of DGEMM and increasing its performance by 17.1% on average, with the largest improvement being 21.9%.
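
The data-localization idea can be made concrete with a small sketch. The C program below is a minimal illustration assuming Linux with libnuma and pthreads (compile with -lnuma -lpthread); the row-panel split and the naive inner kernel are placeholders, not the paper's tuned OpenBLAS implementation. One worker thread is pinned per NUMA node, and that node's panels of A and C are allocated in node-local memory, so the inner kernel never touches a remote domain.

```c
/* Minimal sketch of NUMA-aware data placement for a blocked GEMM.
 * Assumes Linux with libnuma; the kernel is a naive placeholder. */
#include <numa.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024                    /* problem size, chosen arbitrarily */

typedef struct {
    int node;                     /* NUMA node this worker is pinned to */
    int rows;                     /* height of this node's row panel    */
    double *A, *B, *C;            /* A, C panels node-local; B shared   */
} task_t;

static void *worker(void *arg) {
    task_t *t = arg;
    numa_run_on_node(t->node);    /* pin the thread to its home node */
    /* naive kernel placeholder: every access to A and C is node-local */
    for (int i = 0; i < t->rows; i++)
        for (int k = 0; k < N; k++)
            for (int j = 0; j < N; j++)
                t->C[i*N + j] += t->A[i*N + k] * t->B[k*N + j];
    return NULL;
}

int main(void) {
    if (numa_available() < 0) { fprintf(stderr, "no NUMA\n"); return 1; }
    int nodes = numa_max_node() + 1;
    int rows  = N / nodes;        /* assume nodes divides N, for brevity */
    double *B = calloc((size_t)N * N, sizeof *B);  /* shared, read-only */
    pthread_t tid[nodes];
    task_t    task[nodes];
    for (int n = 0; n < nodes; n++) {
        /* allocate each node's A and C panels in its own memory so the
         * inner kernel never crosses a die or chip boundary */
        task[n] = (task_t){ .node = n, .rows = rows, .B = B,
            .A = numa_alloc_onnode((size_t)rows * N * sizeof(double), n),
            .C = numa_alloc_onnode((size_t)rows * N * sizeof(double), n) };
        pthread_create(&tid[n], NULL, worker, &task[n]);
    }
    for (int n = 0; n < nodes; n++) pthread_join(tid[n], NULL);
    /* real code would fill A/B, gather C, and free with numa_free() */
    return 0;
}
```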


2014 ◽  
Vol 550 ◽  
pp. 126-136
Author(s):  
N. Ramya Rani

Floating-point arithmetic plays a major role in scientific and embedded computing applications, but the performance of field-programmable gate arrays (FPGAs) on floating-point workloads is poor due to the complexity of floating-point arithmetic. Implementing floating-point units on FPGAs consumes a large amount of resources, which has motivated the development of embedded floating-point units in FPGAs. Embedded applications such as multimedia, communication, and DSP algorithms use floating-point arithmetic for graphics processing, Fourier transforms, coding, etc. In this paper, methodologies are presented for the implementation of embedded floating-point units on FPGAs. The work aims at achieving high computation speed and reduced power consumption when evaluating expressions. An application that demands high-performance floating-point computation can achieve better speed and density by incorporating embedded floating-point units. Additionally, this paper presents a comparative study of single-precision and double-precision pipelined floating-point arithmetic units for expression evaluation. The modules are designed in VHDL, simulated with Xilinx software, and implemented on Virtex and Spartan FPGAs.
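
The single- versus double-precision trade-off studied here is easy to demonstrate in software. The toy C program below (illustrative only; the paper's designs are VHDL hardware units) evaluates the same accumulation expression in both widths and shows the accuracy gap that the extra cost of a double-precision unit buys.

```c
/* Toy illustration of the accuracy gap between single and double
 * precision on the same expression: ten million additions of 0.1. */
#include <stdio.h>

int main(void) {
    float  s32 = 0.0f;
    double s64 = 0.0;
    for (int i = 0; i < 10000000; i++) {
        s32 += 0.1f;              /* ~7 significant decimal digits  */
        s64 += 0.1;               /* ~16 significant decimal digits */
    }
    printf("single: %.2f\n", s32);   /* drifts far from 1000000 */
    printf("double: %.2f\n", s64);   /* stays close to 1000000  */
    return 0;
}
```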


2011 ◽  
Vol 58-60 ◽  
pp. 1037-1042
Author(s):  
Sheng Long Li ◽  
Zhao Lin Li ◽  
Qing Wei Zheng

Double-precision floating-point matrix operations are widely used in a variety of engineering and scientific computing applications. However, it is inefficient to perform these operations in software on general-purpose processors. In order to reduce processing time and satisfy real-time demands, a reconfigurable coprocessor for double-precision floating-point matrix algorithms is proposed in this paper. The coprocessor is embedded in a multi-processor system-on-chip (MPSoC) and cooperates with an ARM core and a DSP core for high-performance control and computation. An algorithm from GPS applications is taken as an example to illustrate the efficiency of the proposed coprocessor. The experimental results show that the coprocessor achieves a speedup of a factor of 50 on the quaternion algorithm for attitude solution in inertial navigation, compared with the software execution time on a TI C6713 DSP. The coprocessor is implemented in SMIC 0.13 μm CMOS technology; the synthesized delay is 9.75 ns, and the power consumption is 63.69 mW at 100 MHz.
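
For reference, the quaternion attitude-update kernel named in the abstract has this general shape. The C sketch below is a generic first-order propagation of the attitude quaternion from body angular rates, q_dot = 0.5 * q ⊗ (0, w), in double precision; the step size and gyro sample are invented, and this is not the authors' coprocessor implementation. Compile with -lm.

```c
/* Generic double-precision quaternion attitude update, the kind of
 * kernel the paper offloads to its coprocessor. */
#include <math.h>
#include <stdio.h>

/* propagate unit quaternion q by body rates w (rad/s) over dt seconds */
static void quat_update(double q[4], const double w[3], double dt) {
    double dq[4] = {                     /* 0.5 * q ⊗ (0, w) */
        0.5 * (-q[1]*w[0] - q[2]*w[1] - q[3]*w[2]),
        0.5 * ( q[0]*w[0] + q[2]*w[2] - q[3]*w[1]),
        0.5 * ( q[0]*w[1] - q[1]*w[2] + q[3]*w[0]),
        0.5 * ( q[0]*w[2] + q[1]*w[1] - q[2]*w[0]),
    };
    double n = 0.0;
    for (int i = 0; i < 4; i++) q[i] += dq[i] * dt;   /* Euler step  */
    for (int i = 0; i < 4; i++) n += q[i] * q[i];
    n = sqrt(n);                                      /* renormalize */
    for (int i = 0; i < 4; i++) q[i] /= n;
}

int main(void) {
    double q[4] = {1, 0, 0, 0};          /* identity attitude   */
    double w[3] = {0.01, -0.02, 0.03};   /* made-up gyro sample */
    for (int k = 0; k < 100; k++) quat_update(q, w, 0.005);
    printf("q = [%.6f %.6f %.6f %.6f]\n", q[0], q[1], q[2], q[3]);
    return 0;
}
```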


2000 ◽  
Vol 5 (1) ◽  
pp. 44-54 ◽  
Author(s):  
J. Dongarra ◽  
J. Waśniewski

LAPACK95 is a set of Fortran 95 subroutines which interfaces Fortran 95 with LAPACK. All LAPACK driver subroutines (including expert drivers) and some LAPACK computational routines have both generic LAPACK95 interfaces and generic LAPACK77 interfaces. The remaining computational routines have only generic LAPACK77 interfaces. In both types of interfaces, no distinction is made between single and double precision or between real and complex data types.
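
To see what this genericity removes, recall that in the FORTRAN 77-style interface the precision and data type are encoded in the routine name (SGESV, DGESV, CGESV, ZGESV), whereas LAPACK95 exposes a single generic LA_GESV. The C program below illustrates the precision-specific naming through the standard LAPACKE bindings (used here instead of Fortran, since this page carries no Fortran source; link with -llapacke -llapack).

```c
/* Calling the precision-specific dgesv through LAPACKE: the 'd'
 * prefix selects double precision, a distinction LAPACK95's generic
 * LA_GESV hides from the caller. */
#include <lapacke.h>
#include <stdio.h>

int main(void) {
    double a[4] = {2.0, 1.0,
                   1.0, 3.0};            /* 2x2 system, row-major */
    double b[2] = {3.0, 5.0};            /* right-hand side       */
    lapack_int ipiv[2];
    /* the single-precision call would be LAPACKE_sgesv with float
       arrays; everything else is unchanged */
    lapack_int info = LAPACKE_dgesv(LAPACK_ROW_MAJOR, 2, 1,
                                    a, 2, ipiv, b, 1);
    if (info == 0)
        printf("x = [%f, %f]\n", b[0], b[1]);  /* expect [0.8, 1.4] */
    return (int)info;
}
```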


Author(s):  
Michael Hopkins ◽  
Mantas Mikaitis ◽  
Dave R. Lester ◽  
Steve Furber

Although double-precision floating-point arithmetic currently dominates high-performance computing, there is increasing interest in smaller and simpler arithmetic types. The main reasons are potential improvements in energy efficiency and in memory footprint and bandwidth. However, simply switching to lower-precision types typically results in increased numerical errors. We investigate approaches to improving the accuracy of reduced-precision fixed-point arithmetic types, using examples from an important domain for numerical computation in neuroscience: the solution of ordinary differential equations (ODEs). The Izhikevich neuron model is used to demonstrate that rounding plays an important role in producing accurate spike timings from explicit ODE solution algorithms. In particular, fixed-point arithmetic with stochastic rounding consistently results in smaller errors than single-precision floating-point and fixed-point arithmetic with round-to-nearest, across a range of neuron behaviours and ODE solvers. A computationally much cheaper alternative is also investigated, inspired by the concept of dither, a widely understood mechanism for providing resolution below the least significant bit in digital signal processing. These results have implications for the solution of ODEs in other subject areas, and should also be directly relevant to the huge range of practical problems that are represented by partial differential equations. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.
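
The stochastic rounding mechanism at the heart of the result is simple to state in code. The C sketch below rounds the exact product of two fixed-point values up with probability proportional to the discarded residue, which makes the rounded result unbiased in expectation. The s16.15 format matches the accum type used on SpiNNaker; rand() stands in for a hardware RNG and is not the authors' generator.

```c
/* Sketch of stochastic rounding for fixed-point multiplication. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define FRAC_BITS 15                   /* s16.15 fixed point */

/* multiply two s16.15 values, rounding the exact s33.30 product back
 * to s16.15 stochastically; non-negative operands assumed here */
static int32_t fx_mul_sr(int32_t a, int32_t b) {
    int64_t  p       = (int64_t)a * b;
    uint32_t residue = (uint32_t)(p & ((1 << FRAC_BITS) - 1));
    uint32_t draw    = (uint32_t)rand() & ((1 << FRAC_BITS) - 1);
    /* round up with probability residue / 2^15, so the result is
       unbiased: E[result] equals the exact product */
    return (int32_t)((p >> FRAC_BITS) + (draw < residue ? 1 : 0));
}

int main(void) {
    srand(1);
    int32_t a = (int32_t)(0.1 * (1 << FRAC_BITS));   /* ~0.1 */
    double mean = 0.0;
    for (int i = 0; i < 100000; i++)
        mean += (double)fx_mul_sr(a, a) / (1 << FRAC_BITS);
    mean /= 100000.0;
    /* truncation would always return ~0.009979; the SR mean converges
       to the exact quantized product ~0.009995 */
    printf("mean of SR products: %.8f\n", mean);
    return 0;
}
```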

