Basic Linear Algebra Subprograms
Recently Published Documents


TOTAL DOCUMENTS: 43 (five years: 2)

H-INDEX: 15 (five years: 0)

2021 · Vol 47 (3) · pp. 1-23
Author(s): Ahmad Abdelfattah, Timothy Costa, Jack Dongarra, Mark Gates, Azzam Haidar, ...

This article describes a standard API for a set of Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The matrices are grouped into uniformly sized groups, with just one group if all the matrices are of equal size. The aim is to provide more efficient, yet portable, implementations of algorithms on high-performance many-core platforms. These include multicore and many-core CPU processors, GPUs and coprocessors, and other hardware accelerators with floating-point compute capability. In addition to the standard single and double precisions, the standard also includes half and quadruple precision. In particular, half precision is used in many very large-scale applications, such as those associated with machine learning.
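To make the grouping concrete, the sketch below shows the baseline that a batched routine replaces: a plain loop of independent small DGEMM calls of one uniform size. The function name and the batch_count/n parameters are illustrative assumptions, not part of the proposed standard; a Batched BLAS routine would perform the work of the whole loop in a single library call.

```cpp
// Illustrative baseline only: a loop of independent small GEMMs
// (C_i = alpha * A_i * B_i + beta * C_i). A Batched BLAS routine
// replaces this whole loop with a single call so the library can
// schedule all the small products at once. batch_count, n, and the
// pointer arrays are hypothetical names, not part of the standard.
#include <cblas.h>
#include <vector>

void batched_dgemm_baseline(int batch_count, int n,
                            const std::vector<const double*>& A,
                            const std::vector<const double*>& B,
                            const std::vector<double*>& C) {
    const double alpha = 1.0, beta = 0.0;
    for (int i = 0; i < batch_count; ++i) {
        // One small n x n product per batch entry; all entries share
        // the same size here, i.e. a single uniformly sized group.
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, alpha, A[i], n, B[i], n, beta, C[i], n);
    }
}
```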


Electronics · 2018 · Vol 7 (12) · pp. 359
Author(s): Xing Su, Fei Lei

The Basic Linear Algebra Subprograms (BLAS) is a fundamental numerical software library, and GEneral Matrix Multiply (GEMM) is the most important computational kernel routine in the BLAS library. On multi-core and many-core processors, the whole GEMM workload is partitioned and scheduled across multiple threads to exploit the parallel hardware. Generally, the workload is partitioned equally among threads, and all threads are expected to complete their work in roughly the same time. However, this is not the case on Non-Uniform Memory Access (NUMA) architectures. The NUMA effect may cause threads to run at different speeds, and the overall execution time of GEMM is determined by the slowest thread. In this paper, we propose a hybrid-grained dynamic load-balancing method that mitigates the NUMA effect by allowing fast threads to steal work from slow ones. We evaluate the proposed method on Phytium 2000+, an emerging 64-core high-performance processor based on Arm's AArch64 architecture. Results show that our method reduces the synchronization overhead by 51.5% and improves GEMM performance by 1.9%.
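The following is a minimal sketch, assuming a tiled GEMM whose work is split into many more tiles than threads, of how dynamic scheduling lets fast threads pick up tiles left behind by slow ones. It illustrates the load-balancing idea only; it is not the hybrid-grained scheme described in the paper, and run_dynamic, num_tiles, and process_tile are hypothetical names.

```cpp
// Minimal sketch of dynamic load balancing over a shared atomic counter.
// A thread slowed down by remote NUMA accesses claims fewer tiles, while
// faster threads keep claiming the remaining ones instead of waiting idle.
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

void run_dynamic(int num_tiles, int num_threads,
                 const std::function<void(int)>& process_tile) {
    std::atomic<int> next_tile{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            // Each thread keeps claiming tiles until none remain.
            for (int tile = next_tile.fetch_add(1); tile < num_tiles;
                 tile = next_tile.fetch_add(1)) {
                process_tile(tile);  // compute one block of the GEMM output
            }
        });
    }
    for (auto& w : workers) w.join();
}
```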


2018 · Vol 2018 · pp. 1-10
Author(s): Youhong Sun, Xuqi Zhao, Yongping Yu, Shaopeng Zheng

This paper shows how the reanalysis method can be used to speed up the optimization process in topology optimization of vibrating structures with simple and multiple eigenfrequencies. The block combined approximation with shifting (BCAS) method is used to reduce the computational effort of the repeated solution of the eigenvalue problem, which dominates the CPU time, especially for large problems. By utilizing Level 3 Basic Linear Algebra Subprograms (BLAS), the computational efficiency of the BCAS method is improved. To achieve an accurate optimal result, two indicators are presented to control the approximate reanalysis procedure. The effectiveness of the proposed method is demonstrated by three numerical examples.
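As a rough illustration of why Level 3 BLAS helps, the sketch below applies a matrix to a block of basis vectors with one DGEMM call instead of repeated matrix-vector products; the block form reuses the matrix across all columns, which is where the Level-3 efficiency comes from. This is a simplified example under stated assumptions, not the BCAS reanalysis procedure itself, and K, V, and W are hypothetical names.

```cpp
// Level-3 idea only: W = K * V computed as one DGEMM over a block of
// s vectors, rather than s separate DGEMV calls that each re-read K.
#include <cblas.h>

void apply_to_block(int n, int s,
                    const double* K,   // n x n matrix, row-major
                    const double* V,   // n x s block of basis vectors
                    double* W) {       // n x s result
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, s, n, 1.0, K, n, V, s, 0.0, W, s);
}
```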


2012 · Vol 10 (4)
Author(s): J Progsch, Y Ineichen, A Adelmann

Vector operations play an important role in high-performance computing and are typically provided by highly optimized libraries that implement the Basic Linear Algebra Subprograms (BLAS) interface. In C++, templates and operator overloading allow these vector operations to be implemented as expression templates, which construct custom loops at compile time and provide a more abstract interface. Unfortunately, existing expression template libraries lack the performance of fast BLAS implementations. This paper presents a new approach - Statically Accelerated Loop Templates (SALT) - that closes this performance gap by combining expression templates with an aggressive loop unrolling technique. Benchmarks were conducted using the Intel C++ compiler and the GNU Compiler Collection to assess the performance of our library relative to Intel's Math Kernel Library as well as the Eigen template library. The results show that the approach provides optimization comparable to the fastest available BLAS implementations, while retaining the convenience and flexibility of a template library.
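For readers unfamiliar with the technique, the sketch below is a minimal textbook-style expression template, not the SALT library: operator+ builds a lightweight expression node instead of a temporary vector, and assignment evaluates the whole expression in a single loop generated at compile time. The Expr/Add/Vec names are illustrative.

```cpp
// Minimal expression-template sketch: a = b + c produces one fused loop
// with no temporary vectors, because operator+ only records the operands.
#include <cstddef>
#include <vector>

template <class E>                    // CRTP base marking expression types
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

template <class L, class R>
struct Add : Expr<Add<L, R>> {        // node representing l[i] + r[i]
    const L& l; const R& r;
    Add(const L& l, const R& r) : l(l), r(r) {}
    double operator[](std::size_t i) const { return l[i] + r[i]; }
    std::size_t size() const { return l.size(); }
};

template <class L, class R>
Add<L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
    return Add<L, R>(l.self(), r.self());
}

struct Vec : Expr<Vec> {
    std::vector<double> data;
    explicit Vec(std::size_t n) : data(n) {}
    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }
    std::size_t size() const { return data.size(); }

    template <class E>
    Vec& operator=(const Expr<E>& e) {  // single loop, no temporaries
        for (std::size_t i = 0; i < size(); ++i) data[i] = e.self()[i];
        return *this;
    }
};
// Usage: Vec a(n), b(n), c(n);  a = b + c;  // one loop over n elements
```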


Author(s): Thilo Kielmann, Sergei Gorlatch, Utpal Banerjee, Rocco De Nicola, Jack Dongarra, ...

