Performance engineering for real and complex tall & skinny matrix multiplication kernels on GPUs

Author(s):  
Dominik Ernst ◽  
Georg Hager ◽  
Jonas Thies ◽  
Gerhard Wellein

General matrix-matrix multiplications with double-precision real and complex entries (DGEMM and ZGEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often perform poorly for tall & skinny matrices, which are much taller than they are wide. In this case, NVIDIA’s current CUBLAS implementation delivers only a fraction of the potential performance indicated by the roofline model. We describe the challenges and key characteristics of an implementation that can achieve close to optimal performance. We further evaluate different strategies of parallelization and thread distribution and devise a flexible, configurable mapping scheme. To ensure flexibility and allow for highly tailored implementations, we use code generation combined with autotuning. For a large range of matrix sizes in the domain of interest, we achieve at least 2/3 of the roofline performance and often substantially outperform state-of-the-art CUBLAS results on an NVIDIA Volta GPGPU.
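The roofline limit mentioned in the abstract can be illustrated with a minimal sketch. It assumes a product of the form C = A^T B, where A (k x m) and B (k x n) are tall & skinny (k much larger than m and n), both inputs are streamed from memory exactly once, and traffic for the small result C is negligible. The function names and the machine numbers are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
def tsmm_intensity(m, n, bytes_per_element=8, flops_per_madd=2):
    """Arithmetic intensity (flop/byte) of C = A^T B, A (k x m), B (k x n), k >> m, n.

    Assumption: A and B are each streamed from memory exactly once and the
    small m x n result C causes negligible traffic.
    """
    flops_per_row = flops_per_madd * m * n        # m*n multiply-adds per row of A and B
    bytes_per_row = bytes_per_element * (m + n)   # one row of A plus one row of B
    return flops_per_row / bytes_per_row


def roofline_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline limit: the lower of the compute ceiling and the bandwidth ceiling."""
    return min(peak_gflops, intensity * bandwidth_gbs)


# Illustrative machine numbers only (assumed, not measured and not from the paper).
PEAK_GFLOPS = 7000.0
BANDWIDTH_GBS = 800.0

# Real double precision: 8 bytes per element, 2 flops per multiply-add.
real_limit = roofline_gflops(tsmm_intensity(8, 8), PEAK_GFLOPS, BANDWIDTH_GBS)

# Complex double precision: 16 bytes per element, 8 flops per complex multiply-add.
complex_limit = roofline_gflops(tsmm_intensity(8, 8, 16, 8), PEAK_GFLOPS, BANDWIDTH_GBS)

print(real_limit, complex_limit)
```

Under these assumptions the intensity grows with the matrix width, so narrow matrices are limited by memory bandwidth rather than by peak arithmetic throughput, which is why a comparison against the roofline limit is more informative here than a comparison against peak performance.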

Electronics ◽  
2021 ◽  
Vol 10 (16) ◽  
pp. 1984
Author(s):  
Wei Zhang ◽  
Zihao Jiang ◽  
Zhiguang Chen ◽  
Nong Xiao ◽  
Yang Ou

Double-precision general matrix multiplication (DGEMM) is an essential kernel for measuring the potential performance of an HPC platform. ARMv8-based system-on-chips (SoCs) have become candidates for next-generation HPC systems thanks to their highly competitive performance and energy efficiency. It is therefore worthwhile to design high-performance DGEMM for ARMv8-based SoCs. However, as ARMv8-based SoCs integrate an increasing number of cores, modern CPUs adopt non-uniform memory access (NUMA). NUMA restricts the performance and scalability of DGEMM when many threads access remote NUMA domains, which makes it challenging to develop high-performance DGEMM on multi-NUMA architectures. We present a NUMA-aware method to reduce the number of cross-die and cross-chip memory access events. The critical enabler for NUMA-aware DGEMM is to leverage two levels of parallelism, between and within nodes, in a purely threaded implementation, which enables task independence and data locality across NUMA nodes. We have implemented NUMA-aware DGEMM in OpenBLAS and evaluated it on a dual-socket server with 48-core processors based on the Kunpeng920 architecture. The results show that NUMA-aware DGEMM effectively reduces the number of cross-die and cross-chip memory accesses, significantly enhancing the scalability of DGEMM and increasing its performance by 17.1% on average, with a maximum improvement of 21.9%.


2018 ◽  
Vol 53 (1) ◽  
pp. 407-408
Author(s):  
Junhong Liu ◽  
Xin He ◽  
Weifeng Liu ◽  
Guangming Tan

Energies ◽  
2021 ◽  
Vol 14 (6) ◽  
pp. 1589
Author(s):  
Krzysztof Kołek ◽  
Andrzej Firlit ◽  
Krzysztof Piątek ◽  
Krzysztof Chmielowiec

Monitoring power quality (PQ) indicators is an important part of maintaining modern power grids. Among different PQ indicators, the flicker severity coefficients Pst and Plt are measures of voltage fluctuations. In state-of-the-art PQ measuring devices, the flicker measurement channel is usually implemented as a dedicated processor subsystem. Implementation of the IEC 61000-4-15 compliant flicker measurement algorithm requires a significant amount of computational power. In typical PQ analysers, the flicker measurement is usually implemented as a part of the meter’s algorithm performed by the main processor. This paper considers the implementation of the flicker measurement as an FPGA module to offload the processor subsystem or operate as an IP core in FPGA-based system-on-chip units. The measurement algorithm is developed and validated as a Simulink diagram, which is then converted to a fixed-point representation. Parts of the diagram are used for automatic VHDL code generation, and the classifier block is implemented as a local soft-processor system. A simple eight-bit processor operates within the flicker measurement coprocessor and performs statistical operations. Finally, an IP module is created that can be considered a flicker coprocessor module. When using the coprocessor, the main processor’s only role is to trigger the coprocessor and read the results, while the coprocessor independently calculates the flicker coefficients.


1994 ◽  
Vol 04 (03) ◽  
pp. 271-280 ◽  
Author(s):  
FLORIN BALASA ◽  
FRANK H.M. FRANSSEN ◽  
FRANCKY V.M. CATTHOOR ◽  
HUGO J. DE MAN

For multi-dimensional (M-D) signal and data processing systems, transformation of algorithmic specifications is a major instrument both in code optimization and code generation for parallelizing compilers and in control flow optimization as a preprocessor for architecture synthesis. State-of-the-art transformation techniques are limited to affine index expressions. This is, however, not sufficient for many important applications in image, speech and numerical processing. In this paper, a novel transformation method is introduced, oriented to the subclass of algorithm specifications that contains modulo expressions of affine functions to index M-D signals. The method makes extensive use of the concept of Hermite normal form. The transformation method can be carried out in polynomial time, applying only integer arithmetic.


2019 ◽  
Vol 7 ◽  
pp. 661-676 ◽  
Author(s):  
Jiatao Gu ◽  
Qi Liu ◽  
Kyunghyun Cho

Conventional neural autoregressive decoding commonly assumes a fixed left-to-right generation order, which may be sub-optimal. In this work, we propose a novel decoding algorithm, InDIGO, which supports flexible sequence generation in arbitrary orders through insertion operations. We extend Transformer, a state-of-the-art sequence generation model, to efficiently implement the proposed approach, enabling it to be trained with either a pre-defined generation order or adaptive orders obtained from beam search. Experiments on four real-world tasks, including word order recovery, machine translation, image captioning, and code generation, demonstrate that our algorithm can generate sequences following arbitrary orders, while achieving competitive or even better performance compared with conventional left-to-right generation. The generated sequences show that InDIGO adopts adaptive generation orders based on input information.
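The idea of building a sequence through insertion operations, rather than strictly left to right, can be sketched in a few lines. This is a simplified illustration and not the InDIGO algorithm itself: it uses absolute slot indices supplied by hand, whereas the abstract describes a learned model that chooses the order. The function name `generate_by_insertion` and the example sentence are hypothetical.

```python
def generate_by_insertion(steps):
    """Build a sequence from (token, slot) insertion steps.

    Each step inserts `token` so that it ends up at index `slot` of the
    partial sequence built so far. The slots are given by hand here; this
    is an assumption made purely for illustration.
    """
    seq = []
    for token, slot in steps:
        if not 0 <= slot <= len(seq):
            raise ValueError(f"slot {slot} is out of range for length {len(seq)}")
        seq.insert(slot, token)
    return seq


# Left-to-right generation is the special case where every slot is the end.
left_to_right = generate_by_insertion(
    [("I", 0), ("have", 1), ("a", 2), ("dream", 3)]
)

# An arbitrary order that produces the same sentence.
arbitrary = generate_by_insertion(
    [("dream", 0), ("a", 0), ("I", 0), ("have", 1)]
)

print(left_to_right, arbitrary)
```

Both calls yield the same final sequence, which shows that the generation order is a free choice once insertion is allowed.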


2013 ◽  
Vol 411-414 ◽  
pp. 1670-1673
Author(s):  
Sheng Chang ◽  
Heng Cai ◽  
Hao Wang ◽  
Jin He ◽  
Qi Jun Huang

Single precision can only achieve 6-7 decimal places, which does not satisfy the accuracy demands of many calculations. Double precision can achieve 13-14 decimal places, but the resource cost is high. In this paper, the effects of data bit-width on digital logic design are studied. The accuracy achievable with different bit-widths is determined. Then, addition, multiplication and matrix multiplication with different bit-widths are tested on an FPGA platform. The results show that bit-width and circuit design platform have a clear effect on resource cost and circuit efficiency. Finally, a bit-width based circuit design optimization method is proposed.
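The link between bit-width and decimal accuracy can be estimated from the significand width alone. The sketch below assumes IEEE 754 formats (24 significand bits for single precision, 53 for double precision, implicit bit included) and gives theoretical upper bounds; the abstract quotes somewhat lower figures for double precision. The helper names are hypothetical.

```python
import math
import struct


def decimal_digits(significand_bits):
    """Upper bound on decimal digits representable with the given significand width."""
    return significand_bits * math.log10(2)


def round_to_float32(x):
    """Round a Python float (double precision) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


single_digits = decimal_digits(24)   # IEEE 754 single precision
double_digits = decimal_digits(53)   # IEEE 754 double precision

# Error introduced by storing 0.1 in single precision instead of double precision.
error = abs(round_to_float32(0.1) - 0.1)

print(f"single: {single_digits:.2f} digits, double: {double_digits:.2f} digits")
print(f"rounding error of 0.1 in single precision: {error:.3e}")
```

The same function can be applied to any custom bit-width, which is the kind of trade-off between accuracy and resource cost that the abstract investigates on FPGAs.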


2017 ◽  
Vol 9 (1) ◽  
pp. 75
Author(s):  
K. S. Shailesh ◽  
P. V. Suresh

The performance of web applications is of paramount importance as it can impact end-user experience and business revenue. Web Performance Optimization (WPO) deals with front-end performance engineering. Web performance affects customer loyalty, SEO, web search ranking, site traffic, repeat visitors and overall online revenue. In this paper, we survey state-of-the-art tools, techniques and methodologies for various aspects of web performance optimization. We identify key web performance patterns and propose a novel web-performance-driven development framework. We elaborate on various techniques related to the different phases of this framework.

