Performance engineering for real and complex tall & skinny matrix multiplication kernels on GPUs

General matrix-matrix multiplications with double-precision real and complex entries (DGEMM and ZGEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often perform poorly for tall & skinny matrices, which are much taller than they are wide. In this case, NVIDIA’s current CUBLAS implementation delivers only a fraction of the potential performance indicated by the roofline model. We describe the challenges and key characteristics of an implementation that can achieve close to optimal performance. We further evaluate different strategies of parallelization and thread distribution and devise a flexible, configurable mapping scheme. To ensure flexibility and allow for highly tailored implementations, we use code generation combined with autotuning. For a large range of matrix sizes in the domain of interest, we achieve at least 2/3 of the roofline performance and often substantially outperform state-of-the-art CUBLAS results on an NVIDIA Volta GPGPU.
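
The roofline bound mentioned in this abstract can be illustrated with a minimal sketch. It assumes the product C = A^T B with A of size K x m and B of size K x n, where K is much larger than m and n, so that memory traffic is dominated by reading A and B once. The function names and the hardware numbers (7000 GFLOP/s peak, 800 GB/s bandwidth) are hypothetical placeholders, not values taken from the paper.

```python
def arithmetic_intensity(m, n, complex_entries=False):
    """Flop per byte for C = A^T B with A (K x m) and B (K x n), K >> m, n.

    Assumption: each row of A and B is loaded exactly once and the small
    m x n result stays on chip, so its traffic is neglected.
    """
    # One multiply-add is 2 flops for reals; a complex multiply-add is 8 flops.
    flops_per_row = (8 if complex_entries else 2) * m * n
    # 8 bytes per double, 16 bytes per double-precision complex number.
    bytes_per_row = (16 if complex_entries else 8) * (m + n)
    return flops_per_row / bytes_per_row


def roofline(intensity, peak_gflops, bandwidth_gbs):
    """Attainable GFLOP/s: the lower of the compute and the memory ceiling."""
    return min(peak_gflops, intensity * bandwidth_gbs)


# Hypothetical device parameters, for illustration only.
PEAK_GFLOPS = 7000.0
BANDWIDTH_GBS = 800.0

for width in (1, 8, 64):
    i_real = arithmetic_intensity(width, width)
    bound = roofline(i_real, PEAK_GFLOPS, BANDWIDTH_GBS)
    print(f"m = n = {width:3d}: {i_real:6.3f} flop/byte, bound {bound:7.1f} GFLOP/s")
```

Under these assumptions, narrow matrices sit on the bandwidth-limited part of the roofline, and only wider ones reach the compute ceiling.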

NUMA-Aware DGEMM Based on 64-Bit ARMv8 Multicore Processors Architecture

Electronics ◽

10.3390/electronics10161984 ◽

2021 ◽

Vol 10 (16) ◽

pp. 1984

Author(s):

Wei Zhang ◽

Zihao Jiang ◽

Zhiguang Chen ◽

Nong Xiao ◽

Yang Ou

Keyword(s):

Energy Efficiency ◽

High Performance ◽

Multicore Processors ◽

Matrix Multiplication ◽

Memory Access ◽

Double Precision ◽

Competitive Performance ◽

General Matrix ◽

Remarkable Improvement ◽

Task Independence

Double-precision general matrix multiplication (DGEMM) is an essential kernel for measuring the potential performance of an HPC platform. ARMv8-based system-on-chips (SoCs) have become candidates for next-generation HPC systems thanks to their highly competitive performance and energy efficiency. It is therefore worthwhile to design a high-performance DGEMM for ARMv8-based SoCs. However, as ARMv8-based SoCs integrate an increasing number of cores, modern CPUs use non-uniform memory access (NUMA). NUMA restricts the performance and scalability of DGEMM when many threads access remote NUMA domains, which makes it challenging to develop a high-performance DGEMM on multi-NUMA architectures. We present a NUMA-aware method that reduces the number of cross-die and cross-chip memory access events. The critical enabler for NUMA-aware DGEMM is to leverage two levels of parallelism, between and within nodes, in a purely threaded implementation, which provides task independence and data localization for the NUMA nodes. We have implemented NUMA-aware DGEMM in OpenBLAS and evaluated it on a dual-socket server with 48-core processors based on the Kunpeng920 architecture. The results show that NUMA-aware DGEMM effectively reduces the number of cross-die and cross-chip memory accesses, which significantly enhances the scalability of DGEMM and increases its performance by 17.1% on average, with the largest improvement being 21.9%.
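
The two levels of parallelism described in this abstract can be sketched as a work partition: rows of the result are first divided among NUMA nodes, and each node's block is then divided among that node's threads. This is only an illustrative sketch; the function names are hypothetical, and the row-wise split is an assumption rather than the scheme implemented in OpenBLAS by the authors.

```python
def split(start, stop, parts):
    """Split [start, stop) into `parts` contiguous ranges.

    Range sizes differ by at most one element.
    """
    size, extra = divmod(stop - start, parts)
    ranges = []
    lo = start
    for i in range(parts):
        hi = lo + size + (1 if i < extra else 0)
        ranges.append((lo, hi))
        lo = hi
    return ranges


def two_level_partition(rows, nodes, threads_per_node):
    """Map each NUMA node to the row ranges handled by its threads.

    Level 1: every node gets one contiguous block of rows, so its data can
    be placed in local memory. Level 2: the block is divided among the
    threads pinned to that node, so no thread needs rows owned by a
    remote node.
    """
    plan = {}
    for node, (node_lo, node_hi) in enumerate(split(0, rows, nodes)):
        plan[node] = split(node_lo, node_hi, threads_per_node)
    return plan


# Small hypothetical example: 10 rows, 2 NUMA nodes, 2 threads per node.
for node, thread_ranges in two_level_partition(10, 2, 2).items():
    print(f"node {node}: {thread_ranges}")
```

Because each node works on its own block, the tasks assigned to different nodes are independent of one another in this sketch.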

Register-based implementation of the sparse general matrix-matrix multiplication on GPUs

ACM SIGPLAN Notices ◽

10.1145/3200691.3178529 ◽

2018 ◽

Vol 53 (1) ◽

pp. 407-408

Author(s):

Junhong Liu ◽

Xin He ◽

Weifeng Liu ◽

Guangming Tan

Keyword(s):

Matrix Multiplication ◽

General Matrix

Analysis of the Practical Implementation of Flicker Measurement Coprocessor for AMI Meters

Energies ◽

10.3390/en14061589 ◽

2021 ◽

Vol 14 (6) ◽

pp. 1589

Author(s):

Krzysztof Kołek ◽

Andrzej Firlit ◽

Krzysztof Piątek ◽

Krzysztof Chmielowiec

Keyword(s):

Code Generation ◽

State Of The Art ◽

Power Grids ◽

Practical Implementation ◽

System A ◽

Measuring Devices ◽

Point Representation ◽

Measurement Algorithm ◽

On Chip ◽

Vhdl Code

Monitoring power quality (PQ) indicators is an important part of modern power grids’ maintenance. Among different PQ indicators, flicker severity coefficients Pst and Plt are measures of voltage fluctuations. In state-of-the-art PQ measuring devices, the flicker measurement channel is usually implemented as a dedicated processor subsystem. Implementation of the IEC 61000-4-15 compliant flicker measurement algorithm requires a significant amount of computational power. In typical PQ analysers, the flicker measurement is usually implemented as a part of the meter’s algorithm performed by the main processor. This paper considers the implementation of the flicker measurement as an FPGA module to offload the processor subsystem or operate as an IP core in FPGA-based system-on-chip units. The measurement algorithm is developed and validated as a Simulink diagram, which is then converted to a fixed-point representation. Parts of the diagram are applied for automatic VHDL code generation, and the classifier block is implemented as a local soft-processor system. A simple eight-bit processor operates within the flicker measurement coprocessor and performs statistical operations. Finally, an IP module is created that can be considered as a flicker coprocessor module. When using the coprocessor, the main processor’s only role is to trigger the coprocessor and read the results, while the coprocessor independently calculates the flicker coefficients.

TRANSFORMATION OF NESTED LOOPS WITH MODULO INDEXING TO AFFINE RECURRENCES

Parallel Processing Letters ◽

10.1142/s0129626494000260 ◽

1994 ◽

Vol 04 (03) ◽

pp. 271-280 ◽

Cited By ~ 6

Author(s):

FLORIN BALASA ◽

FRANK H.M. FRANSSEN ◽

FRANCKY V.M. CATTHOOR ◽

HUGO J. DE MAN

Keyword(s):

Code Generation ◽

State Of The Art ◽

Transformation Method ◽

Control Flow ◽

Code Optimization ◽

Transformation Techniques ◽

Hermite Normal Form ◽

Nested Loops ◽

Affine Functions ◽

Systems Transformation

For multi-dimensional (M-D) signal and data processing systems, transformation of algorithmic specifications is a major instrument both in code optimization and code generation for parallelizing compilers and in control flow optimization as a preprocessor for architecture synthesis. State-of-the-art transformation techniques are limited to affine index expressions. This is, however, not sufficient for many important applications in image, speech and numerical processing. In this paper, a novel transformation method is introduced, oriented to the subclass of algorithm specifications that contain modulo expressions of affine functions to index M-D signals. The method makes extensive use of the concept of Hermite normal form. The transformation can be carried out in polynomial time using only integer arithmetic.

Comparing the Performance of General Matrix Multiplication Routine on Heterogeneous Computing Systems

Journal of Parallel and Distributed Computing ◽

10.1016/j.jpdc.2021.10.002 ◽

2021 ◽

Author(s):

Aleksei Sorokin ◽

Sergey Malkovsky ◽

Georgiy Tsoy

Keyword(s):

Heterogeneous Computing ◽

Matrix Multiplication ◽

Computing Systems ◽

General Matrix ◽

Heterogeneous Computing Systems

Insertion-based Decoding with Automatically Inferred Generation Order

Transactions of the Association for Computational Linguistics ◽

10.1162/tacl_a_00292 ◽

2019 ◽

Vol 7 ◽

pp. 661-676 ◽

Cited By ~ 3

Author(s):

Jiatao Gu ◽

Qi Liu ◽

Kyunghyun Cho

Keyword(s):

Machine Translation ◽

Real World ◽

Word Order ◽

Code Generation ◽

State Of The Art ◽

Generation Model ◽

Beam Search ◽

Input Information ◽

Sequence Generation ◽

Image Caption

Conventional neural autoregressive decoding commonly assumes a fixed left-to-right generation order, which may be sub-optimal. In this work, we propose a novel decoding algorithm, InDIGO, which supports flexible sequence generation in arbitrary orders through insertion operations. We extend Transformer, a state-of-the-art sequence generation model, to efficiently implement the proposed approach, enabling it to be trained with either a pre-defined generation order or adaptive orders obtained from beam search. Experiments on four real-world tasks, including word order recovery, machine translation, image caption, and code generation, demonstrate that our algorithm can generate sequences following arbitrary orders, while achieving competitive or even better performance compared with conventional left-to-right generation. The generated sequences show that InDIGO adopts adaptive generation orders based on input information.
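
Generation through insertion operations can be illustrated with a small sketch that builds a sentence in a non-left-to-right order. Each step is assumed to supply a token and an absolute slot index. This is a simplification chosen for illustration and is not the position representation used by InDIGO; the function name is hypothetical.

```python
def apply_insertions(operations):
    """Build a sequence from (token, slot) insertion operations.

    `slot` is the index at which the token is inserted into the partial
    sequence produced so far; 0 means "before everything".
    """
    sequence = []
    for token, slot in operations:
        if not 0 <= slot <= len(sequence):
            raise ValueError(f"slot {slot} is outside the current sequence")
        sequence.insert(slot, token)
    return sequence


# The content words are generated first and the function words are filled in later.
steps = [("dream", 0), ("I", 0), ("a", 1), ("have", 1)]
print(" ".join(apply_insertions(steps)))  # prints "I have a dream"
```

A left-to-right decoder is the special case in which every token is inserted at the end of the partial sequence.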

Adaptation of Double-Precision Matrix Multiplication to the Cell Broadband Engine Architecture

Parallel Processing and Applied Mathematics - Lecture Notes in Computer Science ◽

10.1007/978-3-642-14390-8_56 ◽

2010 ◽

pp. 535-546 ◽

Cited By ~ 1

Author(s):

Krzysztof Rojek ◽

Łukasz Szustak

Keyword(s):

Matrix Multiplication ◽

Double Precision ◽

Precision Matrix ◽

Cell Broadband Engine

Effects of Data's Bit-Width on Digital Logic Design

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.411-414.1670 ◽

2013 ◽

Vol 411-414 ◽

pp. 1670-1673

Author(s):

Sheng Chang ◽

Heng Cai ◽

Hao Wang ◽

Jin He ◽

Qi Jun Huang

Keyword(s):

Circuit Design ◽

Matrix Multiplication ◽

Optimization Method ◽

Digital Logic ◽

Logic Design ◽

Double Precision ◽

Single Precision ◽

Obvious Effect ◽

Multiplication Operation ◽

Resource Cost

Single precision can only achieve 6-7 decimal places, which does not meet the accuracy demands of many calculations. Double precision can achieve 13-14 decimal places, but its resource cost is high. In this paper, the effects of data bit-width on digital logic design are studied. The accuracy of different bit-widths is determined. Then, addition, multiplication and matrix multiplication with different bit-widths are tested on an FPGA platform. The results show that bit-width and circuit design platform have a clear effect on resource cost and circuit efficiency. Finally, a bit-width-based circuit design optimization method is proposed.
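
The accuracy gap between the two bit-widths can be made concrete in software. The sketch below is an illustration only and does not reproduce the FPGA experiments of the paper: it emulates single precision by rounding every intermediate result to a 32-bit float through Python's `struct` module, then compares the accumulated error of a repeated addition against ordinary double precision. The helper names and the test values (adding 0.1 a thousand times) are assumptions made for the example.

```python
import struct


def to_float32(x):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def accumulate(value, count, single_precision):
    """Add `value` to a running total `count` times.

    With single_precision=True, the operand and every partial sum are
    rounded to 32 bits, which emulates a single-precision adder.
    """
    step = to_float32(value) if single_precision else value
    total = 0.0
    for _ in range(count):
        total += step
        if single_precision:
            total = to_float32(total)
    return total


# The mathematically exact result is 100.
err32 = abs(accumulate(0.1, 1000, single_precision=True) - 100.0)
err64 = abs(accumulate(0.1, 1000, single_precision=False) - 100.0)
print(f"single precision error: {err32:.3e}")
print(f"double precision error: {err64:.3e}")
```

The wider format pays for its smaller error with more storage and arithmetic resources, which is the trade-off this abstract examines.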

Dedicated architecture for double precision matrix multiplication in supercomputing environment

2007 IEEE Design and Diagnostics of Electronic Circuits and Systems ◽

10.1109/ddecs.2007.4295303 ◽

2007 ◽

Cited By ~ 1

Author(s):

P. Russek ◽

K. Wiatr

Keyword(s):

Matrix Multiplication ◽

Double Precision ◽

Precision Matrix

Performance Driven Development Framework for Web Applications

Global Journal of Enterprise Information System ◽

10.18311/gjeis/2017/15870 ◽

2017 ◽

Vol 9 (1) ◽

pp. 75

Author(s):

K. S. Shailesh ◽

P. V. Suresh

Keyword(s):

Customer Loyalty ◽

Performance Optimization ◽

Web Search ◽

Web Applications ◽

State Of The Art ◽

End User ◽

Web Performance ◽

Performance Engineering ◽

Development Framework ◽

Performance Patterns

The performance of web applications is of paramount importance, as it can affect end-user experience and business revenue. Web Performance Optimization (WPO) deals with front-end performance engineering. Web performance affects customer loyalty, SEO, web search ranking, site traffic, repeat visitors and overall online revenue. In this paper, we survey state-of-the-art tools, techniques and methodologies for various aspects of web performance optimization. We identify key web performance patterns and propose a novel web performance driven development framework. We elaborate on various techniques related to the different phases of this framework.
