Analysis of floating-point round-off error in linear algebra routines for graph clustering

Author(s):  
L. Minah Yang ◽  
Alyson Fox

2017 ◽  
Vol 27 (03n04) ◽  
pp. 1750006 ◽  
Author(s):  
Farhad Merchant ◽  
Anupam Chattopadhyay ◽  
Soumyendu Raha ◽  
S. K. Nandy ◽  
Ranjani Narayan

Basic Linear Algebra Subprograms (BLAS) and the Linear Algebra Package (LAPACK) form basic building blocks for several High Performance Computing (HPC) applications and hence dictate the performance of those applications. Performance in such tuned packages is attained by tuning several algorithmic and architectural parameters, such as the number of parallel operations in the Directed Acyclic Graph of the BLAS/LAPACK routines, the sizes of the memories in the memory hierarchy, the memory bandwidth, and the structure of the compute resources of the underlying platform. In this paper, we closely investigate the impact of the Floating Point Unit (FPU) micro-architecture on performance tuning of BLAS and LAPACK. We present a theoretical analysis of the pipeline depth of the different floating-point units, such as the multiplier, adder, square root, and divider, followed by a characterization of BLAS and LAPACK to determine several parameters required by the theoretical framework for deciding the optimum pipeline depth of the floating-point operations. A simple Processing Element (PE) design is presented and shown to outperform the most recent custom realizations of BLAS and LAPACK by 1.1x to 1.5x in GFlops/W and 1.9x to 2.1x in GFlops/mm2. Compared to multicore CPUs, General Purpose Graphics Processing Units (GPGPUs), Field Programmable Gate Arrays (FPGAs), and the ClearSpeed CSX700, the PE achieves performance improvements of 1.8x to 80x.
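The central trade-off studied in this paper is that a deeper FPU pipeline permits a faster clock but lengthens the stalls caused by dependent operations in a routine's DAG. The sketch below illustrates that trade-off with a textbook throughput model; it is not the paper's framework, and the function `throughput`, its parameters, and all numeric values are assumptions chosen for this example.

```python
# A minimal sketch (not the paper's model) of the pipeline-depth trade-off:
# deeper pipelines shorten the cycle time but lengthen dependency stalls.
# All parameter names and values below are illustrative assumptions.

def throughput(depth, logic_delay_ns=10.0, latch_delay_ns=0.1,
               hazard_fraction=0.2):
    """Operations per nanosecond for a floating-point unit split into
    `depth` pipeline stages.

    logic_delay_ns  -- combinational delay of the unpipelined unit
    latch_delay_ns  -- per-stage register overhead added by pipelining
    hazard_fraction -- fraction of operations that wait on a dependency,
                       a workload parameter one would characterize from
                       the dependency structure of BLAS/LAPACK routines
    """
    cycle_ns = logic_delay_ns / depth + latch_delay_ns
    # A dependent operation waits roughly `depth - 1` extra cycles.
    cycles_per_op = 1.0 + hazard_fraction * (depth - 1)
    return 1.0 / (cycle_ns * cycles_per_op)

best_depth = max(range(1, 33), key=throughput)
print("optimum depth under these assumptions:", best_depth)
```

Maximizing such a function over candidate depths mirrors the paper's goal: pick an optimum pipeline depth per floating-point unit once the workload parameters (here, the hazard fraction) have been characterized.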


2021 ◽  
Vol 47 (3) ◽  
pp. 1-23
Author(s):  
Ahmad Abdelfattah ◽  
Timothy Costa ◽  
Jack Dongarra ◽  
Mark Gates ◽  
Azzam Haidar ◽  
...  

This article describes a standard API for a set of Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The matrices are grouped together in uniformly sized groups, with just one group if all the matrices are of equal size. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance many-core platforms. These include multicore and many-core CPU processors, GPUs and coprocessors, and other hardware accelerators with floating-point compute facilities. As well as the standard single and double precision types, we also include half and quadruple precision in the standard. In particular, half precision is used in many very large-scale applications, such as those associated with machine learning.
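To make the grouped-batch semantics concrete, the following NumPy sketch emulates the behavior described above: each group holds uniformly sized matrices sharing one set of GEMM parameters, and one call processes all groups. The function name `gemm_batch` and its signature are hypothetical illustrations and do not reproduce the standard's C interface.

```python
# A NumPy emulation (an illustration, not the standard's C API) of grouped
# batched GEMM: within a group, all matrices share one size and one set of
# scaling factors, and a single call walks every group.
import numpy as np

def gemm_batch(group_sizes, alphas, betas, A, B, C):
    """For each group g and each matrix i in it:
       C[g][i] = alphas[g] * A[g][i] @ B[g][i] + betas[g] * C[g][i]"""
    for g, count in enumerate(group_sizes):
        for i in range(count):
            C[g][i] = alphas[g] * A[g][i] @ B[g][i] + betas[g] * C[g][i]
    return C

# One group of three 4x4 problems and one group of two 8x8 problems.
rng = np.random.default_rng(0)
A = [[rng.standard_normal((4, 4)) for _ in range(3)],
     [rng.standard_normal((8, 8)) for _ in range(2)]]
B = [[rng.standard_normal((4, 4)) for _ in range(3)],
     [rng.standard_normal((8, 8)) for _ in range(2)]]
C = [[np.zeros((4, 4)) for _ in range(3)],
     [np.zeros((8, 8)) for _ in range(2)]]
gemm_batch([3, 2], alphas=[1.0, 1.0], betas=[0.0, 0.0], A=A, B=B, C=C)
```

A real Batched BLAS implementation would dispatch the whole batch to one optimized kernel rather than loop as above; the gain over calling BLAS once per matrix comes from amortizing launch and argument-processing overhead across many small problems.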


2014 ◽  
Vol 77 (1-2) ◽  
pp. 169-190 ◽  
Author(s):  
Ardavan Pedram ◽  
John D. McCalpin ◽  
Andreas Gerstlauer

Author(s):  
Anastasiia Izycheva ◽  
Eva Darulova ◽  
Helmut Seidl

Abstract: We present an automated procedure for synthesizing sound inductive invariants for floating-point numerical loops. Our procedure generates invariants in the form of a convex polynomial inequality that tightly bounds the values of loop variables. Such invariants are a prerequisite for reasoning about the safety and roundoff errors of floating-point programs. Unlike previous approaches that rely on policy iteration, linear algebra, or semidefinite programming, we propose a heuristic procedure based on simulation and counterexample-guided refinement. We observe that this combination is remarkably effective and general: it can handle linear and nonlinear loop bodies, nondeterministic values, and conditional statements. Our evaluation shows that our approach efficiently synthesizes loop invariants for existing benchmarks from the literature, and that it also finds invariants for nonlinear loops that today's tools cannot handle.
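The simulation-plus-refinement idea can be illustrated on a toy loop. In the sketch below, the loop `x = 0.9*x + c` with `c` nondeterministic in [-0.1, 0.1], the interval shape of the invariant, and every constant are assumptions made for this example; the authors' procedure synthesizes convex polynomial invariants and additionally accounts for roundoff error.

```python
# A heuristic sketch (not the authors' tool) of simulation followed by
# counterexample-guided refinement for an interval invariant |x| <= bound
# of the assumed toy loop  x = 0.9*x + c,  c nondeterministic in [-0.1, 0.1].
import random

def step(x, c):
    return 0.9 * x + c  # one loop iteration

def simulate_bound(runs=200, iters=200):
    """Phase 1: estimate a candidate bound from random executions."""
    worst = 0.0
    for _ in range(runs):
        x = random.uniform(-1.0, 1.0)  # assumed initial range
        for _ in range(iters):
            x = step(x, random.uniform(-0.1, 0.1))
            worst = max(worst, abs(x))
    return worst

def refine(bound, tries=10_000, slack=1.01):
    """Phase 2: search for counterexamples to inductiveness, i.e. states
    inside the candidate invariant whose successor escapes it; each one
    found enlarges the bound slightly, and the search repeats."""
    changed = True
    while changed:
        changed = False
        for _ in range(tries):
            x = random.uniform(-bound, bound)
            nxt = step(x, random.uniform(-0.1, 0.1))
            if abs(nxt) > bound:
                bound = abs(nxt) * slack
                changed = True
    return bound

print("inductive bound (heuristic):", refine(simulate_bound()))
```

Under these assumptions the search settles just above 1.0, which is indeed inductive for the toy loop: 0.9*b + 0.1 <= b holds whenever b >= 1.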

