Multiple-Precision BLAS Library for Graphics Processing Units

Author(s):  
Konstantin Isupov ◽  
Vladimir Knyazkov

The binary32 and binary64 floating-point formats provide good performance on current hardware, but they also introduce a rounding error in almost every arithmetic operation, and the accumulation of these errors in large computations can cause accuracy problems. One way to prevent such problems is to use multiple-precision floating-point arithmetic. This preprint, submitted to Russian Supercomputing Days 2020, presents a new library of basic linear algebra operations with multiple precision for graphics processing units (GPUs). The library is written in CUDA C/C++ and uses the residue number system to represent the multiple-precision significands of floating-point numbers. The supported data types, memory layout, and main features of the library are described, and experimental results showing the performance of the library are presented.
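The representation is only summarized above; as a rough illustration of the idea (the type name, fields, and toy moduli set below are assumptions for the sketch, not the library's actual API), a multiple-precision float with an RNS significand could look like this, with significand multiplication needing no carries between residue channels:

    // Sketch of a multiple-precision float whose significand is held in the
    // residue number system (RNS). Toy moduli for illustration only; a real
    // configuration uses larger, pairwise coprime moduli.
    #include <cstdint>

    constexpr int RNS_MODULI_SIZE = 8;
    __constant__ uint32_t RNS_MODULI[RNS_MODULI_SIZE] =
        {101, 103, 107, 109, 113, 127, 131, 137};

    struct mp_float_t {
        int      sign;                     // 0 = positive, 1 = negative
        int      exp;                      // exponent
        uint32_t digits[RNS_MODULI_SIZE];  // significand residues: s mod m_i
    };

    // Multiplying significands needs no carry propagation between residue
    // channels, which is what makes RNS attractive for GPU parallelism.
    __device__ void mp_mul(mp_float_t* r, const mp_float_t* a, const mp_float_t* b) {
        for (int i = 0; i < RNS_MODULI_SIZE; ++i)
            r->digits[i] = (uint32_t)(((uint64_t)a->digits[i] * b->digits[i])
                                      % RNS_MODULI[i]);
        r->sign = a->sign ^ b->sign;
        r->exp  = a->exp + b->exp;
    }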


Computation ◽  
2021 ◽  
Vol 9 (2) ◽  
pp. 9
Author(s):  
Konstantin Isupov

The residue number system (RNS) is known for its parallel arithmetic and has been used in recent decades in various important applications, from digital signal processing and deep neural networks to cryptography and high-precision computation. However, comparison, sign identification, overflow detection, and division remain hard to implement in RNS. For such operations, most methods proposed in the literature support only small dynamic ranges (up to several tens of bits), so they are suitable only for low-precision applications. We recently proposed a method that supports arbitrary moduli sets with cryptographically sized dynamic ranges of up to several thousand bits. The practical advantage of our method over existing ones is that it relies only on very fast standard floating-point operations, so it is suitable for multiple-precision applications and can be efficiently implemented on many general-purpose platforms that support IEEE 754 arithmetic. In this paper, we make further improvements to this method and demonstrate that it can be successfully applied to implement efficient data-parallel primitives operating in the RNS domain, namely, finding the maximum element of an array of RNS numbers on graphics processing units (GPUs). Our experimental results on an NVIDIA RTX 2080 GPU show that, for random residues and a 128-moduli set with a 2048-bit dynamic range, the proposed implementation reduces the running time by a factor of 39 and the memory consumption by a factor of 13 compared to an implementation based on mixed-radix conversion.
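The floating-point core of such a comparison can be sketched as follows (hypothetical function names; the published method additionally carries interval error bounds, which are omitted here). The relative value X/M of an RNS number X = (x_1, ..., x_n) with dynamic range M = m_1 * ... * m_n can be estimated as the fractional part of sum_i x_i * w_i, where w_i = ((M/m_i)^{-1} mod m_i) / m_i, and two numbers compare as their fractional estimates do whenever the estimates are accurate enough:

    // Relative (fractional) value of an RNS number X = (x_1, ..., x_n):
    //   X / M = frac( sum_i x_i * w_i ),  w_i = ((M/m_i)^{-1} mod m_i) / m_i.
    // The weights w_i are precomputed on the host. Comparing fractional values
    // compares the numbers themselves, provided the floating-point error stays
    // below the gap between distinct values (tracked by interval bounds in the
    // published method, omitted in this sketch).
    __device__ double rns_frac(const int* x, const double* w, int n) {
        double s = 0.0;
        for (int i = 0; i < n; ++i)
            s += x[i] * w[i];
        return s - floor(s);              // keep only the fractional part
    }

    // Comparator usable inside a max-reduction kernel over RNS numbers.
    __device__ bool rns_greater(const int* a, const int* b,
                                const double* w, int n) {
        return rns_frac(a, w, n) > rns_frac(b, w, n);
    }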


2020 ◽  
Vol 11 (3) ◽  
pp. 61-84
Author(s):  
Konstantin Isupov ◽  
Vladimir Knyazkov

We consider a parallel implementation of matrix-vector multiplication (GEMV, Level 2 of the BLAS) for graphics processing units (GPUs) using multiple-precision arithmetic based on the residue number system. In our GEMV implementation, element-wise operations on multiple-precision vectors and matrices are split into several parts, each of which is computed by a separate CUDA kernel. This eliminates branch divergence when performing the sequential parts of multiple-precision operations and allows full utilization of the GPU’s resources. An efficient data structure for storing arrays with multiple-precision entries provides a coalesced access pattern to GPU global memory. We perform a rounding error analysis and derive error bounds for the proposed GEMV implementation. Experimental results show the high efficiency of the proposed solution compared to existing high-precision packages deployed on GPUs.
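As an illustration of the layout idea (names and details are guesses, not the paper's actual code), a structure-of-arrays scheme in which residue j of all elements lies contiguously lets consecutive threads read consecutive words, and a digit-multiplication kernel then contains no divergent branches:

    // Structure-of-arrays storage for an n-element multiple-precision vector.
    // digits is residue-major: digits[j * n + i] holds residue j of element i,
    // so threads i, i+1, ... read adjacent words (coalesced global memory).
    #include <cstdint>

    constexpr int MODULI = 8;                        // assumed moduli count
    __constant__ uint32_t MOD[MODULI] =              // toy moduli
        {101, 103, 107, 109, 113, 127, 131, 137};

    struct mp_array_t {
        int*      sign;    // n signs
        int*      exp;     // n exponents
        uint32_t* digits;  // MODULI * n residues, residue-major
    };

    // One kernel of the element-wise pipeline: residue products only, with no
    // divergent branches (sign and exponent handling run as separate kernels).
    __global__ void mul_digits(mp_array_t r, mp_array_t a, mp_array_t b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        for (int j = 0; j < MODULI; ++j) {
            uint64_t p = (uint64_t)a.digits[j * n + i] * b.digits[j * n + i];
            r.digits[j * n + i] = (uint32_t)(p % MOD[j]);
        }
    }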


Author(s):  
Anastasia S. Korzhavina ◽  
Vladimir S. Knyazkov

Introduction. Solving simulation problems that are sensitive to rounding errors, including problems of computational mathematics, mathematical physics, optimal control, biochemistry, quantum mechanics, mathematical programming, and cryptography, requires accuracies of 100 to 1,000 decimal digits or more. The main drawback of high-precision software libraries is a significant loss of speed that is unacceptable for practical problems, in particular when performing multiplication. One way to increase the performance of computations on very long numbers is to use the residue number system. In this work, we discuss a new fast multiplication method with scaling of the result, based on an original hybrid residue-positional interval logarithmic floating-point number representation. Materials and Methods. To increase the speed of calculations, we developed a new way of organizing numerical information: a residue-positional interval logarithmic representation in which the mantissa is stored in the residue number system and information on the absolute value (the characteristic) is stored in an interval logarithmic number system, which makes it possible to accelerate comparison and scaling. To compare modular numbers, techniques of interval analysis are used; to scale modular numbers, properties of the logarithmic number system and approximate interval calculations based on the Chinese remainder theorem are used. Results. A new fast method for multiplying floating-point numbers represented in the residue number system has been developed and patented; we evaluated its speed and compared it with classical and pipelined methods for multiplying long numbers. Discussion and Conclusion. The developed method is 2.4–4.0 times faster than the pipelined multiplication method and 6.4–12.9 times faster than classical multiplication methods.
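As a rough sketch of such a hybrid format (field names and widths are illustrative assumptions; the patented layout is not reproduced here):

    // Hypothetical hybrid format: mantissa in the RNS (fast multiplication),
    // characteristic as an interval of logarithms (fast comparison/scaling).
    #include <cstdint>

    constexpr int MODULI = 8;            // assumed moduli count

    struct rns_interval_log_t {
        int      sign;
        uint32_t mantissa[MODULI];       // residues of the mantissa
        float    log_lo, log_hi;         // interval enclosing log2 of |x|
    };

    // Magnitude comparison can often be decided from the characteristic alone,
    // without touching the residues; overlapping intervals need a slower path.
    inline bool surely_less(const rns_interval_log_t& a,
                            const rns_interval_log_t& b) {
        return a.log_hi < b.log_lo;
    }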


Author(s):  
Mário Pereira Vestias

IEEE 754-2008 extended the standard with decimal floating-point arithmetic. Human-centric applications, such as financial and commercial ones, depend on decimal arithmetic, since the results must match exactly those obtained by human calculations without being subject to errors caused by decimal-to-binary conversions. Decimal multiplication is a fundamental operation used in many algorithms and is specified in IEEE 754-2008. It has an inherent difficulty associated with representing decimal numbers in a binary number system: both bit and digit carries, as well as invalid results, must be handled in order to produce the correct result. This article focuses on algorithms for the hardware implementation of decimal multiplication. Both decimal fixed-point and floating-point multiplication are described, including iterative and parallel solutions.
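As a concrete illustration of the digit-carry issue in software terms (a simple schoolbook routine, not one of the hardware algorithms discussed): each binary product of two decimal digits can reach 81 and must be split into a decimal digit and a decimal carry.

    #include <stdint.h>

    // Schoolbook decimal multiplication of numbers stored as decimal digit
    // arrays (least significant digit first). Each binary product a[i]*b[j]
    // can reach 81; reducing it modulo 10 and propagating a decimal carry is
    // exactly the digit correction a hardware decimal multiplier must perform.
    void dec_mul(const uint8_t* a, int na, const uint8_t* b, int nb,
                 uint8_t* r /* na + nb digits, zero-initialized */) {
        for (int i = 0; i < na; ++i) {
            int carry = 0;
            for (int j = 0; j < nb; ++j) {
                int t = r[i + j] + a[i] * b[j] + carry;  // binary arithmetic
                r[i + j] = (uint8_t)(t % 10);            // decimal digit
                carry    = t / 10;                       // decimal carry
            }
            r[i + nb] = (uint8_t)carry;
        }
    }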


Author(s):  
Mário Pereira Vestias

IEEE 754-2008 extended the standard with decimal floating-point arithmetic. Human-centric applications, such as financial and commercial ones, depend on decimal arithmetic, since the results must match exactly those obtained by human calculations without being subject to errors caused by decimal-to-binary conversions. Decimal multiplication is a fundamental operation used in many algorithms and is specified in IEEE 754-2008. It has an inherent difficulty associated with representing decimal numbers in a binary number system: both bit and digit carries, as well as invalid results, must be handled in order to produce the correct result. This chapter focuses on algorithms for the hardware implementation of decimal multiplication. Both decimal fixed-point and floating-point multiplication are described, including iterative and parallel solutions.


Author(s):  
J. Vijay Kumar ◽  
B. Naga Raju ◽  
M. Vasu Babu ◽  
T. Ramanjappa

This article presents the implementation of a low-power pipelined 64-bit RISC processor on an Altera MAX V CPLD device. The design is verified for arithmetic operations on both fixed- and floating-point numbers and for the branch and logical functions of the RISC processor. For all jump instructions, the processor automatically flushes the data in the pipeline to avoid incorrect behavior. The processor contains an FPU that accurately supports double-precision IEEE 754 operations. The simulation results were verified using ModelSim. The results of ALU operations and double-precision floating-point arithmetic are displayed on 7-segment displays. The code is written in Verilog HDL.
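For reference, the binary64 operand format such an FPU must handle packs a 64-bit word as 1 sign bit, 11 biased exponent bits, and 52 fraction bits; a small host-side decoding sketch (illustrative only, unrelated to the article's Verilog code):

    #include <stdint.h>
    #include <string.h>

    // Split a double into its IEEE-754 binary64 fields:
    // 1 sign bit, 11 exponent bits (bias 1023), 52 fraction bits.
    void decode_binary64(double x, int* sign, int* exp, uint64_t* frac) {
        uint64_t bits;
        memcpy(&bits, &x, sizeof bits);
        *sign = (int)(bits >> 63);
        *exp  = (int)((bits >> 52) & 0x7FF) - 1023;   // unbiased exponent
        *frac = bits & 0xFFFFFFFFFFFFFULL;            // fraction, hidden bit excluded
    }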

