Profiling high performance dense linear algebra algorithms on multicore architectures for power and energy efficiency

2011 ◽  
Vol 27 (4) ◽  
pp. 277-287 ◽  
Author(s):  
Hatem Ltaief ◽  
Piotr Luszczek ◽  
Jack Dongarra

2014 ◽  
Vol 40 (10) ◽  
pp. 559-573 ◽  
Author(s):  
Li Tan ◽  
Shashank Kothapalli ◽  
Longxiang Chen ◽  
Omar Hussaini ◽  
Ryan Bissiri ◽  
...  

Computation ◽  
2020 ◽  
Vol 8 (1) ◽  
pp. 20 ◽  
Author(s):  
Enrico Calore ◽  
Alessandro Gabbana ◽  
Sebastiano Fabio Schifano ◽  
Raffaele Tripiccione

In recent years, the energy efficiency of HPC systems has become increasingly important for environmental, technical, and economic reasons. Several projects have investigated different processors and accelerators in the quest to build systems that achieve high energy efficiency for data centers and HPC installations. In this context, the Arm CPU architecture has received a lot of attention given its wide use in low-power and energy-limited applications, but server-grade processors have appeared on the market only recently. In this study, we targeted the Marvell ThunderX2, one of the latest Arm-based processors developed to fit the requirements of high performance computing applications. Our interest is mainly in assessing it in the context of large HPC installations, and thus we evaluated both computing performance and energy efficiency, using the ERT benchmark and two production-ready HPC applications. Finally, we compared the results with other processors commonly used in large parallel systems and highlighted the characteristics of applications that could benefit from the ThunderX2 architecture, in terms of both computing performance and energy efficiency. To this end, we also describe how ERT has been modified and optimized for ThunderX2, and how to monitor power drain while running applications on this processor.
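The abstract does not expand "ERT"; assuming it refers to a roofline-style benchmark, the bound such a tool characterizes can be sketched as below. The function names and the peak and bandwidth figures are hypothetical illustrations, not values measured on the ThunderX2.

```python
def attainable_gflops(peak_gflops: float,
                      bandwidth_gbs: float,
                      arithmetic_intensity: float) -> float:
    """Roofline bound: performance is capped either by the compute peak
    or by memory bandwidth times arithmetic intensity (FLOP per byte)."""
    if min(peak_gflops, bandwidth_gbs, arithmetic_intensity) <= 0:
        raise ValueError("all inputs must be positive")
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)


def ridge_point(peak_gflops: float, bandwidth_gbs: float) -> float:
    """Arithmetic intensity at which a kernel stops being memory-bound."""
    return peak_gflops / bandwidth_gbs


if __name__ == "__main__":
    # Hypothetical machine: 1000 GFLOP/s peak, 100 GB/s memory bandwidth.
    for intensity in (0.5, 2.0, 10.0, 20.0):
        bound = attainable_gflops(1000.0, 100.0, intensity)
        print(f"intensity={intensity:5.1f} FLOP/byte -> bound={bound:7.1f} GFLOP/s")
```

Kernels whose arithmetic intensity lies below the ridge point are limited by memory bandwidth, which is one way to reason about which applications might benefit from a given processor.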


2014 ◽  
Vol 22 (4) ◽  
pp. 273-283 ◽  
Author(s):  
Robert Schöne ◽  
Jan Treibig ◽  
Manuel F. Dolz ◽  
Carla Guillen ◽  
Carmen Navarrete ◽  
...  

Energy costs nowadays represent a significant share of the total cost of ownership of High Performance Computing (HPC) systems. In this paper we provide an overview of different aspects of energy efficiency measurement and optimization. This includes metrics that define energy efficiency and a description of common power and energy measurement tools. We discuss performance measurement and analysis suites that use these tools and let users analyze energy efficiency weaknesses in their code. We also demonstrate how the obtained power and performance data can be used to locate inefficient resource usage or to create a model to predict optimal operation points. We further present interfaces in these suites that allow automated tuning for energy efficiency and describe how these interfaces are used. We finally discuss how a hard power limit will change our view on energy efficient HPC in the future.
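The abstract does not list the metrics it covers; as a minimal sketch, three metrics commonly used in this area (energy-to-solution, GFLOP/s per watt, and energy-delay product) can be computed from sampled power readings as follows. The function names and the sample values are hypothetical.

```python
from typing import Sequence


def energy_joules(timestamps_s: Sequence[float],
                  power_w: Sequence[float]) -> float:
    """Energy-to-solution: integrate sampled power over time
    with the trapezoidal rule."""
    if len(timestamps_s) != len(power_w):
        raise ValueError("timestamps and power samples must have equal length")
    if len(timestamps_s) < 2:
        raise ValueError("need at least two samples")
    total = 0.0
    for i in range(1, len(timestamps_s)):
        dt = timestamps_s[i] - timestamps_s[i - 1]
        total += 0.5 * (power_w[i] + power_w[i - 1]) * dt
    return total


def gflops_per_watt(flop_count: float, runtime_s: float,
                    energy_j: float) -> float:
    """Sustained GFLOP/s divided by average power in watts."""
    average_power_w = energy_j / runtime_s
    return (flop_count / runtime_s / 1e9) / average_power_w


def energy_delay_product(energy_j: float, runtime_s: float) -> float:
    """Energy-delay product: penalizes runs that save energy only
    by taking much longer."""
    return energy_j * runtime_s


if __name__ == "__main__":
    # Hypothetical power trace: three samples, one second apart.
    t = [0.0, 1.0, 2.0]
    p = [100.0, 100.0, 100.0]
    e = energy_joules(t, p)
    print("energy [J]:", e)
    print("GFLOP/s per W:", gflops_per_watt(2e11, 2.0, e))
    print("EDP [J*s]:", energy_delay_product(e, 2.0))
```

The power samples would in practice come from one of the measurement tools the paper describes; the sketch deliberately leaves that source abstract.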


2015 ◽  
Vol 85 ◽  
pp. 32-46 ◽  
Author(s):  
Mathieu Faverge ◽  
Julien Herrmann ◽  
Julien Langou ◽  
Bradley Lowery ◽  
Yves Robert ◽  
...  

2010 ◽  
Vol 18 (1) ◽  
pp. 35-50 ◽  
Author(s):  
Hatem Ltaief ◽  
Jakub Kurzak ◽  
Jack Dongarra ◽  
Rosa M. Badia

The objective of this paper is to describe, in the context of multicore architectures, three different scheduler implementations for the two-sided linear algebra transformations, in particular the Hessenberg and Bidiagonal reductions, which are the first steps for the standard eigenvalue problems and the singular value decompositions, respectively. State-of-the-art dense linear algebra software, such as the LAPACK and ScaLAPACK libraries, suffers performance losses on multicore processors due to its inability to fully exploit thread-level parallelism. At the same time, the fine-grain dataflow model is gaining popularity as a paradigm for programming multicore architectures. Buttari et al. (Parallel Comput. Syst. Appl. 35 (2009), 38–53) introduced the concept of tile algorithms, in which parallelism is no longer hidden inside Basic Linear Algebra Subprograms but is brought to the fore to yield much better performance. Along with efficient scheduling mechanisms for data-driven execution, these tile two-sided reductions achieve high performance by reaching up to 75% of the DGEMM peak on a 12000×12000 matrix with 16 Intel Tigerton 2.4 GHz processors. The main drawback of the tile algorithms approach for two-sided transformations is that the full reduction cannot be obtained in one stage. Other methods have to be considered to further reduce the band matrices to the required forms.
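To illustrate the tile idea in general terms, the sketch below breaks a matrix product into tile-sized tasks and hands them to a thread pool. It uses plain matrix multiplication rather than the Hessenberg or bidiagonal reductions from the paper, and it is not any of the three schedulers the paper describes; all names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List

Matrix = List[List[float]]


def update_output_tile(c: Matrix, a: Matrix, b: Matrix,
                       i0: int, j0: int, nb: int) -> None:
    """One task: accumulate every tile product that contributes to the
    output tile whose top-left corner is (i0, j0)."""
    n = len(a)
    for k0 in range(0, n, nb):
        for i in range(i0, min(i0 + nb, n)):
            for j in range(j0, min(j0 + nb, n)):
                acc = 0.0
                for k in range(k0, min(k0 + nb, n)):
                    acc += a[i][k] * b[k][j]
                c[i][j] += acc


def tiled_matmul(a: Matrix, b: Matrix, nb: int = 2, workers: int = 4) -> Matrix:
    """Multiply two square matrices by scheduling one task per output tile.
    Tasks write disjoint tiles of C, so they need no synchronization."""
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(update_output_tile, c, a, b, i0, j0, nb)
                   for i0 in range(0, n, nb)
                   for j0 in range(0, n, nb)]
        for future in futures:
            future.result()  # re-raise any exception from a task
    return c


if __name__ == "__main__":
    n = 5
    a = [[float(i + j) for j in range(n)] for i in range(n)]
    b = [[float(i * j) for j in range(n)] for i in range(n)]
    c = tiled_matmul(a, b, nb=2)
    print(c[1][2])
```

Because of CPython's global interpreter lock, this pure-Python version shows the task decomposition only and should not be expected to run faster than a serial loop; production tile libraries run compiled kernels under a dependency-aware scheduler.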

