Basic Linear Algebra Subprograms
Recently Published Documents


TOTAL DOCUMENTS: 43 (five years: 2)

H-INDEX: 15 (five years: 0)

2021 · Vol 47 (3) · pp. 1-23
Author(s): Ahmad Abdelfattah, Timothy Costa, Jack Dongarra, Mark Gates, Azzam Haidar, ...

This article describes a standard API for a set of Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The matrices are grouped into uniformly sized groups, with just one group if all the matrices are of equal size. The aim is to provide more efficient, yet portable, implementations of algorithms on high-performance many-core platforms. These include multicore and many-core CPU processors, GPUs and coprocessors, and other hardware accelerators with floating-point compute capability. In addition to the standard single and double precisions, the standard also includes half and quadruple precision. In particular, half precision is used in many very large-scale applications, such as those associated with machine learning.
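To make the grouping concrete, the sketch below shows the baseline that a batched routine replaces: a plain loop of independent small DGEMM calls of one uniform size. The function name and the batch_count/n parameters are illustrative assumptions, not part of the proposed standard; a Batched BLAS routine would perform the work of the whole loop in a single library call.

```cpp
// Illustrative baseline only: a loop of independent small GEMMs
// (C_i = alpha * A_i * B_i + beta * C_i). A Batched BLAS routine
// replaces this whole loop with a single call so the library can
// schedule all the small products at once. batch_count, n, and the
// pointer arrays are hypothetical names, not part of the standard.
#include <cblas.h>
#include <vector>

void batched_dgemm_baseline(int batch_count, int n,
                            const std::vector<const double*>& A,
                            const std::vector<const double*>& B,
                            const std::vector<double*>& C) {
    const double alpha = 1.0, beta = 0.0;
    for (int i = 0; i < batch_count; ++i) {
        // One small n x n product per batch entry; all entries share
        // the same size here, i.e. a single uniformly sized group.
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, alpha, A[i], n, B[i], n, beta, C[i], n);
    }
}
```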


Electronics · 2018 · Vol 7 (12) · pp. 359
Author(s): Xing Su, Fei Lei

The Basic Linear Algebra Subprograms (BLAS) is a fundamental numerical software library, and GEneral Matrix Multiply (GEMM) is the most important computational kernel routine in the BLAS library. On multi-core and many-core processors, the whole GEMM workload is partitioned and scheduled across multiple threads to exploit the parallel hardware. Generally, the workload is partitioned equally among threads, and all threads are expected to complete their work in roughly the same time. However, this is not the case on Non-Uniform Memory Access (NUMA) architectures. The NUMA effect may cause threads to run at different speeds, and the overall execution time of GEMM is determined by the slowest thread. In this paper, we propose a hybrid-grained dynamic load-balancing method that mitigates the NUMA effect by allowing fast threads to steal work from slow ones. We evaluate the proposed method on Phytium 2000+, an emerging 64-core high-performance processor based on Arm's AArch64 architecture. Results show that our method reduces the synchronization overhead by 51.5% and improves GEMM performance by 1.9%.
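The following is a minimal sketch, assuming a tiled GEMM whose work is split into many more tiles than threads, of how dynamic scheduling lets fast threads pick up tiles left behind by slow ones. It illustrates the load-balancing idea only; it is not the hybrid-grained scheme described in the paper, and run_dynamic, num_tiles, and process_tile are hypothetical names.

```cpp
// Minimal sketch of dynamic load balancing over a shared atomic counter.
// A thread slowed down by remote NUMA accesses claims fewer tiles, while
// faster threads keep claiming the remaining ones instead of waiting idle.
#include <atomic>
#include <functional>
#include <thread>
#include <vector>

void run_dynamic(int num_tiles, int num_threads,
                 const std::function<void(int)>& process_tile) {
    std::atomic<int> next_tile{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&] {
            // Each thread keeps claiming tiles until none remain.
            for (int tile = next_tile.fetch_add(1); tile < num_tiles;
                 tile = next_tile.fetch_add(1)) {
                process_tile(tile);  // compute one block of the GEMM output
            }
        });
    }
    for (auto& w : workers) w.join();
}
```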


2018 · Vol 2018 · pp. 1-10
Author(s): Youhong Sun, Xuqi Zhao, Yongping Yu, Shaopeng Zheng

This paper shows how the reanalysis method can be used to speed up the optimization process in topology optimization of vibrating structures with simple and multiple eigenfrequencies. The block combined approximation with shifting (BCAS) method is used to reduce the computational effort of the repeated solution of the eigenvalue problem, which dominates the CPU time, especially for large problems. By utilizing Level 3 Basic Linear Algebra Subprograms (BLAS), the computational efficiency of the BCAS method is improved. To achieve an accurate optimal result, two indicators are presented to control the approximate reanalysis procedure. The effectiveness of the proposed method is demonstrated by three numerical examples.
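As a rough illustration of why Level 3 BLAS helps, the sketch below applies a matrix to a block of basis vectors with one DGEMM call instead of repeated matrix-vector products; the block form reuses the matrix across all columns, which is where the Level-3 efficiency comes from. This is a simplified example under stated assumptions, not the BCAS reanalysis procedure itself, and K, V, and W are hypothetical names.

```cpp
// Level-3 idea only: W = K * V computed as one DGEMM over a block of
// s vectors, rather than s separate DGEMV calls that each re-read K.
#include <cblas.h>

void apply_to_block(int n, int s,
                    const double* K,   // n x n matrix, row-major
                    const double* V,   // n x s block of basis vectors
                    double* W) {       // n x s result
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, s, n, 1.0, K, n, V, s, 0.0, W, s);
}
```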


2012 · Vol 10 (4)
Author(s): J Progsch, Y Ineichen, A Adelmann

Vector operations play an important role in high-performance computing and are typically provided by highly optimized libraries that implement the Basic Linear Algebra Subprograms (BLAS) interface. In C++, templates and operator overloading allow these vector operations to be implemented as expression templates, which construct custom loops at compile time and provide a more abstract interface. Unfortunately, existing expression template libraries lack the performance of fast BLAS implementations. This paper presents a new approach - Statically Accelerated Loop Templates (SALT) - that closes this performance gap by combining expression templates with an aggressive loop unrolling technique. Benchmarks were conducted using the Intel C++ compiler and the GNU Compiler Collection to assess the performance of our library relative to Intel's Math Kernel Library as well as the Eigen template library. The results show that the approach provides optimization comparable to the fastest available BLAS implementations, while retaining the convenience and flexibility of a template library.
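For readers unfamiliar with the technique, the sketch below is a minimal textbook-style expression template, not the SALT library: operator+ builds a lightweight expression node instead of a temporary vector, and assignment evaluates the whole expression in a single loop generated at compile time. The Expr/Add/Vec names are illustrative.

```cpp
// Minimal expression-template sketch: a = b + c produces one fused loop
// with no temporary vectors, because operator+ only records the operands.
#include <cstddef>
#include <vector>

template <class E>                    // CRTP base marking expression types
struct Expr {
    const E& self() const { return static_cast<const E&>(*this); }
};

template <class L, class R>
struct Add : Expr<Add<L, R>> {        // node representing l[i] + r[i]
    const L& l; const R& r;
    Add(const L& l, const R& r) : l(l), r(r) {}
    double operator[](std::size_t i) const { return l[i] + r[i]; }
    std::size_t size() const { return l.size(); }
};

template <class L, class R>
Add<L, R> operator+(const Expr<L>& l, const Expr<R>& r) {
    return Add<L, R>(l.self(), r.self());
}

struct Vec : Expr<Vec> {
    std::vector<double> data;
    explicit Vec(std::size_t n) : data(n) {}
    double  operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i)       { return data[i]; }
    std::size_t size() const { return data.size(); }

    template <class E>
    Vec& operator=(const Expr<E>& e) {  // single loop, no temporaries
        for (std::size_t i = 0; i < size(); ++i) data[i] = e.self()[i];
        return *this;
    }
};
// Usage: Vec a(n), b(n), c(n);  a = b + c;  // one loop over n elements
```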


Author(s): Thilo Kielmann, Sergei Gorlatch, Utpal Banerjee, Rocco De Nicola, Jack Dongarra, ...

