Analysis of floating-point round-off error in linear algebra routines for graph clustering

Author(s):  
L. Minah Yang ◽  
Alyson Fox

2017 ◽  
Vol 27 (03n04) ◽  
pp. 1750006 ◽  
Author(s):  
Farhad Merchant ◽  
Anupam Chattopadhyay ◽  
Soumyendu Raha ◽  
S. K. Nandy ◽  
Ranjani Narayan

Basic Linear Algebra Subprograms (BLAS) and the Linear Algebra Package (LAPACK) form basic building blocks for several High Performance Computing (HPC) applications and hence dictate the performance of those applications. Performance in such tuned packages is attained by tuning several algorithmic and architectural parameters, such as the number of parallel operations in the Directed Acyclic Graph of the BLAS/LAPACK routines, the sizes of the memories in the memory hierarchy, the memory bandwidth, and the structure of the compute resources of the underlying platform. In this paper, we closely investigate the impact of the Floating Point Unit (FPU) micro-architecture on performance tuning of BLAS and LAPACK. We present a theoretical analysis of the pipeline depth of the different floating-point units, such as the multiplier, adder, square root, and divider, followed by a characterization of BLAS and LAPACK to determine several parameters required by the theoretical framework for deciding the optimum pipeline depth of the floating-point operations. A simple Processing Element (PE) design is presented and shown to outperform the most recent custom realizations of BLAS and LAPACK by 1.1x to 1.5x in GFlops/W and 1.9x to 2.1x in GFlops/mm2. Compared to multicore CPUs, General Purpose Graphics Processing Units (GPGPUs), Field Programmable Gate Arrays (FPGAs), and the ClearSpeed CSX700, the PE achieves performance improvements of 1.8x to 80x.
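The central trade-off studied in this paper is that a deeper FPU pipeline permits a faster clock but lengthens the stalls caused by dependent operations in a routine's DAG. The sketch below illustrates that trade-off with a textbook throughput model; it is not the paper's framework, and the function `throughput`, its parameters, and all numeric values are assumptions chosen for this example.

```python
# A minimal sketch (not the paper's model) of the pipeline-depth trade-off:
# deeper pipelines shorten the cycle time but lengthen dependency stalls.
# All parameter names and values below are illustrative assumptions.

def throughput(depth, logic_delay_ns=10.0, latch_delay_ns=0.1,
               hazard_fraction=0.2):
    """Operations per nanosecond for a floating-point unit split into
    `depth` pipeline stages.

    logic_delay_ns  -- combinational delay of the unpipelined unit
    latch_delay_ns  -- per-stage register overhead added by pipelining
    hazard_fraction -- fraction of operations that wait on a dependency,
                       a workload parameter one would characterize from
                       the dependency structure of BLAS/LAPACK routines
    """
    cycle_ns = logic_delay_ns / depth + latch_delay_ns
    # A dependent operation waits roughly `depth - 1` extra cycles.
    cycles_per_op = 1.0 + hazard_fraction * (depth - 1)
    return 1.0 / (cycle_ns * cycles_per_op)

best_depth = max(range(1, 33), key=throughput)
print("optimum depth under these assumptions:", best_depth)
```

Maximizing such a function over candidate depths mirrors the paper's goal: pick an optimum pipeline depth per floating-point unit once the workload parameters (here, the hazard fraction) have been characterized.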


2021 ◽  
Vol 47 (3) ◽  
pp. 1-23
Author(s):  
Ahmad Abdelfattah ◽  
Timothy Costa ◽  
Jack Dongarra ◽  
Mark Gates ◽  
Azzam Haidar ◽  
...  

This article describes a standard API for a set of Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The matrices are grouped together in uniformly sized groups, with just one group if all the matrices are of equal size. The aim is to provide more efficient, but portable, implementations of algorithms on high-performance many-core platforms. These include multicore and many-core CPU processors, GPUs and coprocessors, and other hardware accelerators with floating-point compute facilities. As well as the standard single and double precision types, we also include half and quadruple precision in the standard. In particular, half precision is used in many very large-scale applications, such as those associated with machine learning.
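To make the grouped-batch semantics concrete, the following NumPy sketch emulates the behavior described above: each group holds uniformly sized matrices sharing one set of GEMM parameters, and one call processes all groups. The function name `gemm_batch` and its signature are hypothetical illustrations and do not reproduce the standard's C interface.

```python
# A NumPy emulation (an illustration, not the standard's C API) of grouped
# batched GEMM: within a group, all matrices share one size and one set of
# scaling factors, and a single call walks every group.
import numpy as np

def gemm_batch(group_sizes, alphas, betas, A, B, C):
    """For each group g and each matrix i in it:
       C[g][i] = alphas[g] * A[g][i] @ B[g][i] + betas[g] * C[g][i]"""
    for g, count in enumerate(group_sizes):
        for i in range(count):
            C[g][i] = alphas[g] * A[g][i] @ B[g][i] + betas[g] * C[g][i]
    return C

# One group of three 4x4 problems and one group of two 8x8 problems.
rng = np.random.default_rng(0)
A = [[rng.standard_normal((4, 4)) for _ in range(3)],
     [rng.standard_normal((8, 8)) for _ in range(2)]]
B = [[rng.standard_normal((4, 4)) for _ in range(3)],
     [rng.standard_normal((8, 8)) for _ in range(2)]]
C = [[np.zeros((4, 4)) for _ in range(3)],
     [np.zeros((8, 8)) for _ in range(2)]]
gemm_batch([3, 2], alphas=[1.0, 1.0], betas=[0.0, 0.0], A=A, B=B, C=C)
```

A real Batched BLAS implementation would dispatch the whole batch to one optimized kernel rather than loop as above; the gain over calling BLAS once per matrix comes from amortizing launch and argument-processing overhead across many small problems.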


2014 ◽  
Vol 77 (1-2) ◽  
pp. 169-190 ◽  
Author(s):  
Ardavan Pedram ◽  
John D. McCalpin ◽  
Andreas Gerstlauer

Author(s):  
Anastasiia Izycheva ◽  
Eva Darulova ◽  
Helmut Seidl

Abstract: We present an automated procedure for synthesizing sound inductive invariants for floating-point numerical loops. Our procedure generates invariants in the form of a convex polynomial inequality that tightly bounds the values of loop variables. Such invariants are a prerequisite for reasoning about the safety and roundoff errors of floating-point programs. Unlike previous approaches that rely on policy iteration, linear algebra, or semidefinite programming, we propose a heuristic procedure based on simulation and counterexample-guided refinement. We observe that this combination is remarkably effective and general: it can handle linear and nonlinear loop bodies, nondeterministic values, and conditional statements. Our evaluation shows that our approach efficiently synthesizes loop invariants for existing benchmarks from the literature, and that it also finds invariants for nonlinear loops that today's tools cannot handle.
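The simulation-plus-refinement idea can be illustrated on a toy loop. In the sketch below, the loop `x = 0.9*x + c` with `c` nondeterministic in [-0.1, 0.1], the interval shape of the invariant, and every constant are assumptions made for this example; the authors' procedure synthesizes convex polynomial invariants and additionally accounts for roundoff error.

```python
# A heuristic sketch (not the authors' tool) of simulation followed by
# counterexample-guided refinement for an interval invariant |x| <= bound
# of the assumed toy loop  x = 0.9*x + c,  c nondeterministic in [-0.1, 0.1].
import random

def step(x, c):
    return 0.9 * x + c  # one loop iteration

def simulate_bound(runs=200, iters=200):
    """Phase 1: estimate a candidate bound from random executions."""
    worst = 0.0
    for _ in range(runs):
        x = random.uniform(-1.0, 1.0)  # assumed initial range
        for _ in range(iters):
            x = step(x, random.uniform(-0.1, 0.1))
            worst = max(worst, abs(x))
    return worst

def refine(bound, tries=10_000, slack=1.01):
    """Phase 2: search for counterexamples to inductiveness, i.e. states
    inside the candidate invariant whose successor escapes it; each one
    found enlarges the bound slightly, and the search repeats."""
    changed = True
    while changed:
        changed = False
        for _ in range(tries):
            x = random.uniform(-bound, bound)
            nxt = step(x, random.uniform(-0.1, 0.1))
            if abs(nxt) > bound:
                bound = abs(nxt) * slack
                changed = True
    return bound

print("inductive bound (heuristic):", refine(simulate_bound()))
```

Under these assumptions the search settles just above 1.0, which is indeed inductive for the toy loop: 0.9*b + 0.1 <= b holds whenever b >= 1.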

