LAPACK95 - HIGH PERFORMANCE LINEAR ALGEBRA PACKAGE

2000 ◽  
Vol 5 (1) ◽  
pp. 44-54 ◽  
Author(s):  
J. Dongarra ◽  
J. Waśniewski

LAPACK95 is a set of Fortran 95 subroutines that interfaces Fortran 95 with LAPACK. All LAPACK driver subroutines (including expert drivers) and some LAPACK computational routines have both generic LAPACK95 interfaces and generic LAPACK77 interfaces; the remaining computational routines have only generic LAPACK77 interfaces. In both types of interface, no distinction is made between single and double precision or between real and complex data types.
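As a loose illustration of the generic-interface idea (one call site covering all precisions and real/complex types), the following Python sketch uses SciPy's LAPACK wrappers, which dispatch on the array dtype in a way that is analogous to LAPACK95's generic interfaces such as LA_GESV. The analogy, not LAPACK95 itself, is what the code shows.

```python
# Hedged analogy: SciPy's LAPACK wrappers pick the precision/type variant of a
# routine from the array dtype, much as LAPACK95's generic interfaces (e.g.
# LA_GESV) cover single/double precision and real/complex data under one name.
import numpy as np
from scipy import linalg

for dtype in (np.float32, np.float64, np.complex64, np.complex128):
    a = np.asarray([[4, 1], [1, 3]], dtype=dtype)
    b = np.asarray([1, 2], dtype=dtype)
    x = linalg.solve(a, b)          # same call site for all four data types
    print(dtype.__name__, x.dtype, x)
```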

2021 ◽  
Vol 47 (3) ◽  
pp. 1-23
Author(s):  
Ahmad Abdelfattah ◽  
Timothy Costa ◽  
Jack Dongarra ◽  
Mark Gates ◽  
Azzam Haidar ◽  
...  

This article describes a standard API for a set of Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The matrices are grouped into uniformly sized groups, with just one group if all the matrices are of equal size. The aim is to provide more efficient, yet portable, implementations of algorithms on high-performance many-core platforms. These include multicore and many-core CPU processors, GPUs and coprocessors, and other hardware accelerators with floating-point compute facilities. As well as the standard single and double precisions, the standard also includes half and quadruple precision. In particular, half precision is used in many very large-scale applications, such as those associated with machine learning.
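A minimal sketch of the batched idea described above: many small, equally sized matrix products handled by a single routine call. NumPy's matmul over stacked 3-D arrays stands in for a Batched BLAS routine here; it is an analogy for the grouping concept, not the BBLAS API itself.

```python
# One uniform group of many tiny GEMMs, all processed by a single routine call.
import numpy as np

batch, m, k, n = 1000, 8, 8, 8            # 1000 independent 8x8 matrix products
A = np.random.rand(batch, m, k)
B = np.random.rand(batch, k, n)

C = np.matmul(A, B)                        # the whole batch in one call
assert C.shape == (batch, m, n)
```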


Author(s):  
James Morrison ◽  
David Christie ◽  
Charles Greenwood ◽  
Ruairi Maciver ◽  
Arne Vogler

This paper presents a set of software tools for interrogating and processing time series data. The functionality of this toolset is demonstrated using data from a specific deployment involving multiple sensors deployed for a specific time period. The approach was developed initially for Datawell Waverider MKII/MKII buoys [1] and expanded to include data from acoustic devices, in this case Nortek AWACs. Tools of this nature are important because they address a specific lack of features in the sensor manufacturers' own tools; they also help to establish standard approaches for dealing with anomalous data from sensors. The software tools build upon an effective modern interpreted programming language, in this case Python, which has access to high-performance low-level libraries. The paper demonstrates the use of these tools applied to a sensor network off the north-west coast of Scotland, as described in [2,3]. Examples show computationally complex quantities, such as monthly averages, being calculated easily, and analysis down to a wave-by-wave basis is also demonstrated from the same source dataset. The tools make use of a flexible data structure called a DataFrame, which supports mixed data types, hierarchical and time indexing, and integration with modern plotting libraries. This allows sub-second querying and dynamic plotting of large datasets. By using modern compression techniques and file formats it is possible to process datasets that are larger than memory without the need for a traditional relational database. The software library should be of use to a wide variety of industries involved in offshore engineering, along with any scientists interested in the coastal environment.
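To illustrate the kind of query the abstract describes (monthly averages computed from a time-indexed DataFrame), here is a small Python/pandas sketch. The column names (Hm0, Tp) and the synthetic half-hourly records are illustrative assumptions, not the toolset's actual schema.

```python
# Monthly averages from half-hourly wave records held in a time-indexed DataFrame.
# The columns and values are synthetic placeholders for illustration only.
import numpy as np
import pandas as pd

index = pd.date_range("2019-01-01", "2019-06-30", freq="30min")
df = pd.DataFrame({
    "Hm0": np.random.gamma(2.0, 1.0, len(index)),    # significant wave height (m)
    "Tp": np.random.uniform(4.0, 14.0, len(index)),  # peak period (s)
}, index=index)

monthly = df.resample("MS").mean()    # monthly means computed in one expression
print(monthly)
```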


2021 ◽  
Vol 47 (2) ◽  
pp. 1-28
Author(s):  
Goran Flegar ◽  
Hartwig Anzt ◽  
Terry Cojean ◽  
Enrique S. Quintana-Ortí

The use of mixed precision in numerical algorithms is a promising strategy for accelerating scientific applications. In particular, the adoption of specialized hardware and data formats for low-precision arithmetic in high-end GPUs (graphics processing units) has motivated numerous efforts aimed at carefully reducing the working precision in order to speed up computations. For algorithms whose performance is bound by the memory bandwidth, the idea of compressing data before (and after) memory accesses has received considerable attention. One idea is to store an approximate operator, such as a preconditioner, in lower than working precision, hopefully without impacting the algorithm's output. We realize the first high-performance implementation of an adaptive precision block-Jacobi preconditioner, which selects the precision format used to store the preconditioner data on the fly, taking into account the numerical properties of the individual preconditioner blocks. We implement the adaptive block-Jacobi preconditioner as production-ready functionality in the Ginkgo linear algebra library, considering not only the precision formats that are part of the IEEE standard but also customized formats that tailor the lengths of the exponent and significand to the characteristics of the preconditioner blocks. Experiments run on a state-of-the-art GPU accelerator show that our implementation offers attractive runtime savings.
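A much-simplified Python sketch of the adaptive-precision idea follows: each inverted diagonal block is stored in the lowest precision its conditioning appears to tolerate and converted back to working precision when applied. The thresholds and the float16/float32/float64 menu are assumptions for illustration; the paper's Ginkgo implementation additionally uses customized non-IEEE formats and runs on the GPU.

```python
import numpy as np

def choose_dtype(block_inv):
    """Pick a storage precision from the block's conditioning (illustrative thresholds)."""
    kappa = np.linalg.cond(block_inv)
    if kappa < 1e2:
        return np.float16
    if kappa < 1e6:
        return np.float32
    return np.float64

def build_adaptive_block_jacobi(A, block_size):
    """Invert each diagonal block and store it in an adaptively chosen precision."""
    blocks = []
    for start in range(0, A.shape[0], block_size):
        end = min(start + block_size, A.shape[0])
        inv = np.linalg.inv(A[start:end, start:end])
        blocks.append(inv.astype(choose_dtype(inv)))
    return blocks

def apply_preconditioner(blocks, r):
    """Apply z = M^{-1} r block by block, computing in working (double) precision."""
    z = np.empty_like(r)
    start = 0
    for inv in blocks:
        end = start + inv.shape[0]
        z[start:end] = inv.astype(np.float64) @ r[start:end]
        start = end
    return z

A = np.diag(np.arange(1.0, 13.0)) + 0.01 * np.random.rand(12, 12)
blocks = build_adaptive_block_jacobi(A, block_size=4)
print([b.dtype for b in blocks])
print(apply_preconditioner(blocks, np.random.rand(12)))
```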


2014 ◽  
Vol 40 (10) ◽  
pp. 559-573 ◽  
Author(s):  
Li Tan ◽  
Shashank Kothapalli ◽  
Longxiang Chen ◽  
Omar Hussaini ◽  
Ryan Bissiri ◽  
...  

2012 ◽  
Author(s):  
Marty Kraimer ◽  
John Dalesio

Electronics ◽  
2021 ◽  
Vol 10 (16) ◽  
pp. 1984
Author(s):  
Wei Zhang ◽  
Zihao Jiang ◽  
Zhiguang Chen ◽  
Nong Xiao ◽  
Yang Ou

Double-precision general matrix multiplication (DGEMM) is an essential kernel for measuring the potential performance of an HPC platform. ARMv8-based system-on-chips (SoCs) have become candidates for next-generation HPC systems thanks to their highly competitive performance and energy efficiency, so it is meaningful to design high-performance DGEMM for ARMv8-based SoCs. However, as ARMv8-based SoCs integrate an increasing number of cores, modern CPUs adopt non-uniform memory access (NUMA). NUMA restricts the performance and scalability of DGEMM when many threads access remote NUMA domains, which poses a challenge for developing high-performance DGEMM on multi-NUMA architectures. We present a NUMA-aware method that reduces the number of cross-die and cross-chip memory access events. The critical enabler for NUMA-aware DGEMM is to leverage two levels of parallelism, between and within nodes, in a purely threaded implementation, which allows task independence and data localization across NUMA nodes. We have implemented NUMA-aware DGEMM in OpenBLAS and evaluated it on a dual-socket server with 48-core processors based on the Kunpeng920 architecture. The results show that NUMA-aware DGEMM effectively reduces the number of cross-die and cross-chip memory accesses, significantly enhancing the scalability of DGEMM and increasing its performance by 17.1% on average, with the most remarkable improvement being 21.9%.
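As a conceptual sketch of the two-level partitioning (not the paper's OpenBLAS code), the following Python snippet splits the rows of the output matrix first across hypothetical NUMA nodes and then across the threads of each node, so that each thread's panel of A can stay node-local. Real implementations also pin threads and allocate buffers node-locally, which plain Python cannot express.

```python
# Conceptual two-level work partitioning for C = A @ B on a multi-NUMA machine:
# rows of C are split across NUMA nodes first, then across threads within each
# node, so the A panel a thread reads can live in node-local memory.
def numa_aware_gemm_plan(m, num_nodes, threads_per_node):
    """Map (node, thread) -> [lo, hi) row range of C owned by that thread."""
    plan = {}
    rows_per_node = (m + num_nodes - 1) // num_nodes
    for node in range(num_nodes):
        node_lo = node * rows_per_node
        node_hi = min(node_lo + rows_per_node, m)
        node_rows = node_hi - node_lo
        rows_per_thread = (node_rows + threads_per_node - 1) // threads_per_node
        for t in range(threads_per_node):
            lo = min(node_lo + t * rows_per_thread, node_hi)
            hi = min(lo + rows_per_thread, node_hi)
            plan[(node, t)] = (lo, hi)
    return plan

# Example: 48 rows of C, 2 NUMA nodes (e.g. two sockets), 4 threads per node.
print(numa_aware_gemm_plan(m=48, num_nodes=2, threads_per_node=4))
```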


2021 ◽  
Vol 4 ◽  
pp. 78-87
Author(s):  
Yury Yuschenko

In the Address Programming Language (1955), the concept of indirect addressing of higher ranks (pointers) was introduced, which allows arbitrary connections between the computer's RAM cells. These connections are based on the standard sequence of cell addresses in RAM and on addressing sequences determined by the programmer through indirect addressing. The two types of sequences allow programmers to establish arbitrary connections between RAM cells holding arbitrary content: data, addresses, subroutines, program labels, etc. The connections formed between cells can therefore refer to one another. The result of connecting cells of arbitrary content and arbitrary structure is called a tree-shaped format. Tree-shaped formats allow programmers to combine data into complex data structures that resemble abstract data types. For tree-shaped formats, the concept of a "review scheme" is defined, which is similar to the concept of tree traversal. Programmers can define multiple review schemes for one tree-shaped format, and can create tree-shaped formats over connected cells to define the desired review schemes for those cells. The work gives a modern interpretation of the concept of tree-shaped formats in Address Programming. Tree-shaped formats are based on the "stroke-operation" (pointer dereference), which was implemented in hardware in the instruction set of the computer "Kyiv". Group operations over addresses in the modernized computer "Kyiv" accelerate the processing of tree-shaped formats and are organized as cycles, like those in high-level imperative programming languages. Thanks to the operations with indirect addressing, the commands of the computer "Kyiv" have more capabilities than the first high-level programming language, Plankalkül. Machine commands of the computer "Kyiv" allow direct access to the i-th element of a "list" by its serial number, in the same way that the i-th element of an array is accessed by its index. The given examples of singly linked lists show the features of tree-shaped formats and their differences from abstract data types. The article opens a new branch of theoretical research whose purpose is to analyse the expediency of partially including Address Programming in modern programming languages.
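To make the linked-list discussion concrete in modern terms, here is a small Python sketch of cells connected by "addresses" (references) and traversed by dereferencing, loosely mirroring the "stroke-operation". It is a present-day paraphrase for illustration, not the Address Language or the "Kyiv" instruction set.

```python
# A modern paraphrase of the singly linked list example: each cell holds
# arbitrary content plus the "address" (reference) of the next cell, and the
# i-th element is reached by repeated dereferencing of that address.
class Cell:
    def __init__(self, content, nxt=None):
        self.content = content   # arbitrary content: data, an address, a label, ...
        self.next = nxt          # indirect address (reference) of the next cell

def ith(head, i):
    """Return the content of the i-th cell by chasing 'next' addresses i times."""
    cell = head
    for _ in range(i):
        cell = cell.next
    return cell.content

lst = Cell("a", Cell("b", Cell("c")))
print(ith(lst, 2))   # "c" -- reached by address chasing rather than array indexing
```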

