roofline model
Recently Published Documents


TOTAL DOCUMENTS: 54 (FIVE YEARS 21)

H-INDEX: 9 (FIVE YEARS 3)

Author(s):  
Zhengbo Chen ◽  
Fang Zheng ◽  
Qi Yu ◽  
Rujun Sun ◽  
Feng Guo ◽  
...  

Author(s):  
Diogo Marques ◽  
Aleksandar Ilic ◽  
Leonel Sousa

Continuous enhancements and growing diversity in modern multi-core hardware, such as wider and deeper core pipelines and memory subsystems, make it hard to model upper-bound capabilities and to identify the main application bottlenecks. Insightful roofline models are widely used for this purpose, but existing approaches abstract away too much of the micro-architectural complexity, yielding unrealistic performance bounds and a misleading characterization of real-world applications. To address this problem, the Mansard Roofline Model (MaRM), proposed in this work, identifies a minimum set of architectural features that must be considered to model performance upper bounds for modern processors in a way that is insightful yet accurate and realistic. By encapsulating the retirement constraints imposed by the number of retirement slots and by the Reorder Buffer and Physical Register File sizes, the proposed model accurately captures the capabilities of a real platform (average rRMSE of 5.4%) and characterizes 12 application kernels from standard benchmark suites. Following the MaRM interpretation methodology and guidelines proposed here, speed-ups of up to 5× are obtained when optimizing a real-world bioinformatics application, as well as a super-linear speed-up of 18.5× when it is parallelized.
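
For orientation, the classic roofline bound that models such as MaRM refine can be sketched in a few lines. This is not the MaRM formulation, which adds retirement, Reorder Buffer and register file constraints. It is only the textbook two-roof model, and the function names and the machine numbers (`PEAK_GFLOPS`, `BANDWIDTH_GBS`) are hypothetical placeholders chosen for illustration.

```python
def roofline_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Classic roofline bound in GFLOP/s.

    intensity     -- arithmetic intensity of the kernel, in flop/byte
    peak_gflops   -- peak compute throughput of the machine (compute roof)
    bandwidth_gbs -- sustained memory bandwidth in GB/s (slope of the memory roof)
    """
    return min(peak_gflops, bandwidth_gbs * intensity)


def ridge_point(peak_gflops, bandwidth_gbs):
    """Intensity (flop/byte) at which the memory roof meets the compute roof."""
    return peak_gflops / bandwidth_gbs


# Hypothetical machine, not taken from any of the listed papers.
PEAK_GFLOPS = 100.0
BANDWIDTH_GBS = 25.0

if __name__ == "__main__":
    for ai in (0.5, 4.0, 16.0):
        bound = roofline_gflops(ai, PEAK_GFLOPS, BANDWIDTH_GBS)
        kind = "memory-bound" if ai < ridge_point(PEAK_GFLOPS, BANDWIDTH_GBS) else "compute-bound"
        print(f"intensity {ai:5.1f} flop/byte -> at most {bound:6.1f} GFLOP/s ({kind})")
```

A kernel to the left of the ridge point is limited by memory bandwidth and one to the right by peak compute. The abstract's criticism is that these two roofs alone can overstate what a real core can retire.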


2021 ◽  
pp. 1-1
Author(s):  
Marco Siracusa ◽  
Emanuele Del Sozzo ◽  
Marco Rabozzi ◽  
Lorenzo Di Tucci ◽  
Samuel Williams ◽  
...  

Author(s):  
Marco Siracusa ◽  
Lorenzo Di Tucci ◽  
Marco Rabozzi ◽  
Samuel Williams ◽  
Emanuele Del Sozzo ◽  
...  

Author(s):  
Konstantin Volovich

The article addresses methods for calculating and evaluating the operating efficiency of hybrid computing systems. It proposes a method for calculating the workload from peak values of cluster performance. The roofline model is used to analyze the results and the quality of operation of cloud scientific services for high-performance computing.


Author(s):  
Dominik Ernst ◽  
Georg Hager ◽  
Jonas Thies ◽  
Gerhard Wellein

General matrix-matrix multiplications with double-precision real and complex entries (DGEMM and ZGEMM) in vendor-supplied BLAS libraries are best optimized for square matrices but often perform poorly for tall & skinny matrices, which are much taller than they are wide. In this case, NVIDIA’s current CUBLAS implementation delivers only a fraction of the potential performance indicated by the roofline model. We describe the challenges and key characteristics of an implementation that can achieve close to optimal performance. We further evaluate different strategies of parallelization and thread distribution and devise a flexible, configurable mapping scheme. To ensure flexibility and allow for highly tailored implementations, we use code generation combined with autotuning. For a large range of matrix sizes in the domain of interest, we achieve at least 2/3 of the roofline performance and often substantially outperform state-of-the-art CUBLAS results on an NVIDIA Volta GPGPU.
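
To see why tall & skinny products sit on the memory-bound side of the roofline, the arithmetic intensity can be estimated by hand. The sketch below assumes the product has the form C = AᵀB with A of size K×M and B of size K×N, K much larger than M and N, each input element read exactly once, and the traffic for the small result C ignored. These assumptions, the function names, and the peak and bandwidth figures are illustrative and are not taken from the paper.

```python
def tsmm_intensity(m, n, bytes_per_element=8, flops_per_madd=2):
    """Arithmetic intensity (flop/byte) of C = A^T B for tall & skinny A, B.

    Per row of A (m entries) and B (n entries) the kernel performs m*n
    multiply-adds and loads m+n elements. The long dimension K cancels out.
    Defaults model double-precision real data (DGEMM-like); for double
    complex (ZGEMM-like) use bytes_per_element=16 and flops_per_madd=8.
    """
    return flops_per_madd * m * n / ((m + n) * bytes_per_element)


def roofline_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Classic roofline bound: the lower of the compute and memory roofs."""
    return min(peak_gflops, bandwidth_gbs * intensity)


# Hypothetical GPU figures for illustration only (assumed, not measured).
PEAK_GFLOPS = 7000.0
BANDWIDTH_GBS = 800.0

if __name__ == "__main__":
    for width in (4, 8, 16, 64):
        ai = tsmm_intensity(width, width)
        bound = roofline_gflops(ai, PEAK_GFLOPS, BANDWIDTH_GBS)
        # The abstract reports reaching at least 2/3 of the roofline bound.
        print(f"M=N={width:3d}: {ai:5.2f} flop/byte, "
              f"bound {bound:7.1f} GFLOP/s, 2/3 of bound {2 * bound / 3:7.1f}")
```

Under these assumptions the intensity for M = N in double precision is M/8 flop/byte, so narrow matrices stay far below the ridge point of any current GPU and memory bandwidth sets the bound.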

