Multiple-Precision BLAS Library for Graphics Processing Units

Author(s):  
Konstantin Isupov ◽  
Vladimir Knyazkov

The binary32 and binary64 floating-point formats provide good performance on current hardware, but they also introduce a rounding error in almost every arithmetic operation, and the accumulation of these errors in large computations can cause accuracy problems. One way to prevent such problems is to use multiple-precision floating-point arithmetic. This preprint, submitted to Russian Supercomputing Days 2020, presents a new library of basic linear algebra operations with multiple precision for graphics processing units (GPUs). The library is written in CUDA C/C++ and uses the residue number system to represent the multiple-precision significands of floating-point numbers. The supported data types, memory layout, and main features of the library are described, and experimental results showing the performance of the library are presented.
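The representation is only summarized above; as a rough illustration of the idea (the type name, fields, and toy moduli set below are assumptions for the sketch, not the library's actual API), a multiple-precision float with an RNS significand could look like this, with significand multiplication needing no carries between residue channels:

    // Sketch of a multiple-precision float whose significand is held in the
    // residue number system (RNS). Toy moduli for illustration only; a real
    // configuration uses larger, pairwise coprime moduli.
    #include <cstdint>

    constexpr int RNS_MODULI_SIZE = 8;
    __constant__ uint32_t RNS_MODULI[RNS_MODULI_SIZE] =
        {101, 103, 107, 109, 113, 127, 131, 137};

    struct mp_float_t {
        int      sign;                     // 0 = positive, 1 = negative
        int      exp;                      // exponent
        uint32_t digits[RNS_MODULI_SIZE];  // significand residues: s mod m_i
    };

    // Multiplying significands needs no carry propagation between residue
    // channels, which is what makes RNS attractive for GPU parallelism.
    __device__ void mp_mul(mp_float_t* r, const mp_float_t* a, const mp_float_t* b) {
        for (int i = 0; i < RNS_MODULI_SIZE; ++i)
            r->digits[i] = (uint32_t)(((uint64_t)a->digits[i] * b->digits[i])
                                      % RNS_MODULI[i]);
        r->sign = a->sign ^ b->sign;
        r->exp  = a->exp + b->exp;
    }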


Computation ◽  
2021 ◽  
Vol 9 (2) ◽  
pp. 9
Author(s):  
Konstantin Isupov

The residue number system (RNS) is known for its parallel arithmetic and has been used in recent decades in various important applications, from digital signal processing and deep neural networks to cryptography and high-precision computation. However, comparison, sign identification, overflow detection, and division remain hard to implement in RNS. For such operations, most methods proposed in the literature support only small dynamic ranges (up to several tens of bits), so they are suitable only for low-precision applications. We recently proposed a method that supports arbitrary moduli sets with cryptographically sized dynamic ranges of up to several thousand bits. The practical advantage of our method over existing ones is that it relies only on very fast standard floating-point operations, so it is suitable for multiple-precision applications and can be efficiently implemented on many general-purpose platforms that support IEEE 754 arithmetic. In this paper, we make further improvements to this method and demonstrate that it can be successfully applied to implement efficient data-parallel primitives operating in the RNS domain, namely, finding the maximum element of an array of RNS numbers on graphics processing units (GPUs). Our experimental results on an NVIDIA RTX 2080 GPU show that, for random residues and a 128-moduli set with a 2048-bit dynamic range, the proposed implementation reduces the running time by a factor of 39 and the memory consumption by a factor of 13 compared to an implementation based on mixed-radix conversion.
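The floating-point core of such a comparison can be sketched as follows (hypothetical function names; the published method additionally carries interval error bounds, which are omitted here). The relative value X/M of an RNS number X = (x_1, ..., x_n) with dynamic range M = m_1 * ... * m_n can be estimated as the fractional part of sum_i x_i * w_i, where w_i = ((M/m_i)^{-1} mod m_i) / m_i, and two numbers compare as their fractional estimates do whenever the estimates are accurate enough:

    // Relative (fractional) value of an RNS number X = (x_1, ..., x_n):
    //   X / M = frac( sum_i x_i * w_i ),  w_i = ((M/m_i)^{-1} mod m_i) / m_i.
    // The weights w_i are precomputed on the host. Comparing fractional values
    // compares the numbers themselves, provided the floating-point error stays
    // below the gap between distinct values (tracked by interval bounds in the
    // published method, omitted in this sketch).
    __device__ double rns_frac(const int* x, const double* w, int n) {
        double s = 0.0;
        for (int i = 0; i < n; ++i)
            s += x[i] * w[i];
        return s - floor(s);              // keep only the fractional part
    }

    // Comparator usable inside a max-reduction kernel over RNS numbers.
    __device__ bool rns_greater(const int* a, const int* b,
                                const double* w, int n) {
        return rns_frac(a, w, n) > rns_frac(b, w, n);
    }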


2020 ◽  
Vol 11 (3) ◽  
pp. 61-84
Author(s):  
Konstantin Isupov ◽  
Vladimir Knyazkov

We consider a parallel implementation of matrix-vector multiplication (GEMV, Level 2 of the BLAS) for graphics processing units (GPUs) using multiple-precision arithmetic based on the residue number system. In our GEMV implementation, element-wise operations on multiple-precision vectors and matrices are split into several parts, each of which is computed by a separate CUDA kernel. This eliminates branch divergence when performing the sequential parts of multiple-precision operations and allows full utilization of the GPU’s resources. An efficient data structure for storing arrays with multiple-precision entries provides a coalesced access pattern to GPU global memory. We perform a rounding error analysis and derive error bounds for the proposed GEMV implementation. Experimental results show the high efficiency of the proposed solution compared to existing high-precision packages deployed on GPUs.
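As an illustration of the layout idea (names and details are guesses, not the paper's actual code), a structure-of-arrays scheme in which residue j of all elements lies contiguously lets consecutive threads read consecutive words, and a digit-multiplication kernel then contains no divergent branches:

    // Structure-of-arrays storage for an n-element multiple-precision vector.
    // digits is residue-major: digits[j * n + i] holds residue j of element i,
    // so threads i, i+1, ... read adjacent words (coalesced global memory).
    #include <cstdint>

    constexpr int MODULI = 8;                        // assumed moduli count
    __constant__ uint32_t MOD[MODULI] =              // toy moduli
        {101, 103, 107, 109, 113, 127, 131, 137};

    struct mp_array_t {
        int*      sign;    // n signs
        int*      exp;     // n exponents
        uint32_t* digits;  // MODULI * n residues, residue-major
    };

    // One kernel of the element-wise pipeline: residue products only, with no
    // divergent branches (sign and exponent handling run as separate kernels).
    __global__ void mul_digits(mp_array_t r, mp_array_t a, mp_array_t b, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        for (int j = 0; j < MODULI; ++j) {
            uint64_t p = (uint64_t)a.digits[j * n + i] * b.digits[j * n + i];
            r.digits[j * n + i] = (uint32_t)(p % MOD[j]);
        }
    }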


Author(s):  
Anastasia S. Korzhavina ◽  
Vladimir S. Knyazkov

Introduction. Solving simulation problems that are sensitive to rounding errors, including problems of computational mathematics, mathematical physics, optimal control, biochemistry, quantum mechanics, mathematical programming, and cryptography, requires accuracies of 100 to 1,000 decimal digits or more. The main drawback of high-precision software libraries is a significant loss of speed that is unacceptable for practical problems, in particular when performing multiplication. One way to increase the performance of computations on very long numbers is to use the residue number system. In this work, we discuss a new fast multiplication method with scaling of the result, based on an original hybrid residue-positional interval logarithmic floating-point number representation. Materials and Methods. To increase the speed of calculations, we developed a new way of organizing numerical information: a residue-positional interval logarithmic representation in which the mantissa is stored in the residue number system and information on the absolute value (the characteristic) is stored in an interval logarithmic number system, which makes it possible to accelerate comparison and scaling. To compare modular numbers, techniques of interval analysis are used; to scale modular numbers, properties of the logarithmic number system and approximate interval calculations based on the Chinese remainder theorem are used. Results. A new fast method for multiplying floating-point numbers represented in the residue number system has been developed and patented; we evaluated its speed and compared it with classical and pipelined methods for multiplying long numbers. Discussion and Conclusion. The developed method is 2.4–4.0 times faster than the pipelined multiplication method and 6.4–12.9 times faster than classical multiplication methods.
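As a rough sketch of such a hybrid format (field names and widths are illustrative assumptions; the patented layout is not reproduced here):

    // Hypothetical hybrid format: mantissa in the RNS (fast multiplication),
    // characteristic as an interval of logarithms (fast comparison/scaling).
    #include <cstdint>

    constexpr int MODULI = 8;            // assumed moduli count

    struct rns_interval_log_t {
        int      sign;
        uint32_t mantissa[MODULI];       // residues of the mantissa
        float    log_lo, log_hi;         // interval enclosing log2 of |x|
    };

    // Magnitude comparison can often be decided from the characteristic alone,
    // without touching the residues; overlapping intervals need a slower path.
    inline bool surely_less(const rns_interval_log_t& a,
                            const rns_interval_log_t& b) {
        return a.log_hi < b.log_lo;
    }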


Author(s):  
Mário Pereira Vestias

IEEE 754-2008 extended the standard with decimal floating-point arithmetic. Human-centric applications, such as financial and commercial ones, depend on decimal arithmetic, since the results must match exactly those obtained by human calculations without being subject to errors caused by decimal-to-binary conversions. Decimal multiplication is a fundamental operation used in many algorithms and is specified in IEEE 754-2008. It has an inherent difficulty associated with representing decimal numbers in a binary number system: both bit and digit carries, as well as invalid results, must be handled in order to produce the correct result. This article focuses on algorithms for the hardware implementation of decimal multiplication. Both decimal fixed-point and floating-point multiplication are described, including iterative and parallel solutions.
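As a concrete illustration of the digit-carry issue in software terms (a simple schoolbook routine, not one of the hardware algorithms discussed): each binary product of two decimal digits can reach 81 and must be split into a decimal digit and a decimal carry.

    #include <stdint.h>

    // Schoolbook decimal multiplication of numbers stored as decimal digit
    // arrays (least significant digit first). Each binary product a[i]*b[j]
    // can reach 81; reducing it modulo 10 and propagating a decimal carry is
    // exactly the digit correction a hardware decimal multiplier must perform.
    void dec_mul(const uint8_t* a, int na, const uint8_t* b, int nb,
                 uint8_t* r /* na + nb digits, zero-initialized */) {
        for (int i = 0; i < na; ++i) {
            int carry = 0;
            for (int j = 0; j < nb; ++j) {
                int t = r[i + j] + a[i] * b[j] + carry;  // binary arithmetic
                r[i + j] = (uint8_t)(t % 10);            // decimal digit
                carry    = t / 10;                       // decimal carry
            }
            r[i + nb] = (uint8_t)carry;
        }
    }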


Author(s):  
Mário Pereira Vestias

IEEE 754-2008 extended the standard with decimal floating-point arithmetic. Human-centric applications, such as financial and commercial ones, depend on decimal arithmetic, since the results must match exactly those obtained by human calculations without being subject to errors caused by decimal-to-binary conversions. Decimal multiplication is a fundamental operation used in many algorithms and is specified in IEEE 754-2008. It has an inherent difficulty associated with representing decimal numbers in a binary number system: both bit and digit carries, as well as invalid results, must be handled in order to produce the correct result. This chapter focuses on algorithms for the hardware implementation of decimal multiplication. Both decimal fixed-point and floating-point multiplication are described, including iterative and parallel solutions.


Author(s):  
J. Vijay Kumar ◽  
B. Naga Raju ◽  
M. Vasu Babu ◽  
T. Ramanjappa

This article presents the implementation of a low-power pipelined 64-bit RISC processor on an Altera MAX V CPLD device. The design is verified for arithmetic operations on both fixed- and floating-point numbers and for the branch and logical functions of the RISC processor. For all jump instructions, the processor automatically flushes the data in the pipeline to avoid incorrect behavior. The processor contains an FPU that accurately supports double-precision IEEE 754 operations. The simulation results were verified using ModelSim. The results of ALU operations and double-precision floating-point arithmetic are displayed on 7-segment displays. The code is written in Verilog HDL.
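For reference, the binary64 operand format such an FPU must handle packs a 64-bit word as 1 sign bit, 11 biased exponent bits, and 52 fraction bits; a small host-side decoding sketch (illustrative only, unrelated to the article's Verilog code):

    #include <stdint.h>
    #include <string.h>

    // Split a double into its IEEE-754 binary64 fields:
    // 1 sign bit, 11 exponent bits (bias 1023), 52 fraction bits.
    void decode_binary64(double x, int* sign, int* exp, uint64_t* frac) {
        uint64_t bits;
        memcpy(&bits, &x, sizeof bits);
        *sign = (int)(bits >> 63);
        *exp  = (int)((bits >> 52) & 0x7FF) - 1023;   // unbiased exponent
        *frac = bits & 0xFFFFFFFFFFFFFULL;            // fraction, hidden bit excluded
    }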

