Scheduling Two-Sided Transformations Using Tile Algorithms on Multicore Architectures

2010 ◽  
Vol 18 (1) ◽  
pp. 35-50 ◽  
Author(s):  
Hatem Ltaief ◽  
Jakub Kurzak ◽  
Jack Dongarra ◽  
Rosa M. Badia

The objective of this paper is to describe, in the context of multicore architectures, three different scheduler implementations for the two-sided linear algebra transformations, in particular the Hessenberg and bidiagonal reductions, which are the first steps for the standard eigenvalue problem and the singular value decomposition, respectively. State-of-the-art dense linear algebra software, such as the LAPACK and ScaLAPACK libraries, suffers performance losses on multicore processors due to its inability to fully exploit thread-level parallelism. At the same time, the fine-grain dataflow model is gaining popularity as a paradigm for programming multicore architectures. Buttari et al. (Parallel Comput. Syst. Appl. 35 (2009), 38–53) introduced the concept of tile algorithms, in which parallelism is no longer hidden inside the Basic Linear Algebra Subprograms but is brought to the fore to yield much better performance. Along with efficient scheduling mechanisms for data-driven execution, these tile two-sided reductions achieve high performance, reaching up to 75% of the DGEMM peak on a 12000×12000 matrix with 16 Intel Tigerton 2.4 GHz processors. The main drawback of the tile-algorithm approach for two-sided transformations is that the full reduction cannot be obtained in one stage; other methods have to be considered to further reduce the band matrices to the required forms.
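A toy sketch of the data-driven execution idea behind tile algorithms: each tile task fires as soon as all of its dependencies complete. The QR-style task names (GEQRT/UNMQR) and the tiny graph below are purely illustrative, not the paper's actual Hessenberg/bidiagonal DAG or scheduler.

```python
from collections import deque

def dataflow_order(deps):
    """Return an execution order in which every task runs only after all
    of its dependencies have completed (Kahn's algorithm). A real tile
    runtime dispatches ready tasks to worker threads instead of
    appending them to a list.

    deps: dict mapping task -> iterable of prerequisite tasks.
    """
    remaining = {t: set(d) for t, d in deps.items()}
    dependents = {}
    for t, d in deps.items():
        for p in d:
            dependents.setdefault(p, []).append(t)
            remaining.setdefault(p, set())
    ready = deque(t for t, d in remaining.items() if not d)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for s in dependents.get(t, ()):
            remaining[s].discard(t)
            if not remaining[s]:       # last dependency just finished
                ready.append(s)
    if len(order) != len(remaining):
        raise ValueError("cycle in task graph")
    return order

# Tiny tile-style DAG: a panel task must finish before the trailing
# updates that consume its tile (task names are illustrative only).
graph = {
    "GEQRT0": [],
    "UNMQR0_1": ["GEQRT0"],
    "UNMQR0_2": ["GEQRT0"],
    "GEQRT1": ["UNMQR0_1"],
    "UNMQR1_2": ["GEQRT1", "UNMQR0_2"],
}
```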

2019 ◽  
Vol 29 (2) ◽  
pp. 407-419
Author(s):  
Beata Bylina ◽  
Jarosław Bylina

Abstract The aim of this paper is to investigate dense linear algebra algorithms on shared-memory multicore architectures. The design and implementation of a parallel tiled WZ factorization algorithm that can fully exploit such architectures are presented. Three parallel implementations of the algorithm are studied. The first relies only on exploiting multithreaded BLAS (basic linear algebra subprograms) operations. The second, in addition to BLAS operations, employs the OpenMP standard to exploit loop-level parallelism. The third, in addition to BLAS operations, employs the OpenMP task directive with the depend clause. We report the computational performance and the speedup of the parallel tiled WZ factorization algorithm on shared-memory multicore architectures for dense square diagonally dominant matrices. We then compare our parallel implementations with the respective LU factorization from a vendor-implemented LAPACK library, and we also analyze the numerical accuracy. Two of our implementations achieve close to the maximal theoretical speedup implied by Amdahl's law.
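For readers unfamiliar with the WZ factorization itself, a minimal sequential reference version can be written in a few lines of NumPy; this is a sketch of the unblocked algorithm, not the authors' tiled, parallel implementation. Diagonal dominance keeps the 2×2 pivot systems nonsingular.

```python
import numpy as np

def wz_factorize(A):
    """Sequential WZ factorization A = W @ Z for even-order matrices.

    At step k, the pivot rows k and n-1-k are used to zero columns k
    and n-1-k of all rows in between, giving Z its characteristic
    Z-shaped sparsity and W its butterfly shape.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    assert n % 2 == 0, "this sketch assumes an even matrix order"
    W = np.eye(n)
    Z = A.copy()
    for k in range(n // 2 - 1):
        m = n - 1 - k                           # mirror pivot row
        P = np.array([[Z[k, k], Z[m, k]],
                      [Z[k, m], Z[m, m]]])      # 2x2 pivot system
        for i in range(k + 1, m):
            # multipliers that zero Z[i, k] and Z[i, m]
            w = np.linalg.solve(P, [Z[i, k], Z[i, m]])
            W[i, k], W[i, m] = w
            Z[i, :] -= w[0] * Z[k, :] + w[1] * Z[m, :]
    return W, Z
```

As with LU, the pivot rows are final when they are used, so the eliminated updates compose exactly into A = W·Z.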


2021 ◽  
Vol 18 (3) ◽  
pp. 1-24
Author(s):  
Yashuai Lü ◽  
Hui Guo ◽  
Libo Huang ◽  
Qi Yu ◽  
Li Shen ◽  
...  

Due to massive thread-level parallelism, GPUs have become an attractive platform for accelerating large-scale data-parallel computations such as graph processing. However, achieving high performance for graph processing on GPUs is non-trivial: processing graphs on GPUs introduces several problems, such as load imbalance, low utilization of hardware units, and memory divergence. Although previous work has proposed several software strategies to optimize graph processing on GPUs, several issues remain beyond the capability of software techniques to address. In this article, we present GraphPEG, a graph processing engine for efficient graph processing on GPUs. Inspired by the observation that many graph algorithms share a common graph-traversal pattern, GraphPEG improves the performance of graph processing by coupling automatic edge gathering with fine-grain work distribution. GraphPEG can also adapt to various input graph datasets and simplify the software design of graph processing with hardware-assisted graph traversal. Simulation results show that, in comparison with two representative, highly efficient GPU graph processing software frameworks, Gunrock and SEP-Graph, GraphPEG improves graph processing throughput by 2.8× and 2.5× on average, and up to 7.3× and 7.0×, for six graph algorithm benchmarks on six graph datasets, with marginal hardware cost.
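The common traversal pattern that GraphPEG assists in hardware (gather the edges of the current frontier, then distribute the resulting work) can be illustrated by a plain level-synchronous BFS over a CSR graph. The code below is a generic software sketch, not GraphPEG's design.

```python
def bfs_levels(row_ptr, col_idx, src):
    """Level-synchronous BFS over a CSR graph.

    Each iteration gathers the edges of the current frontier and builds
    the next frontier from unvisited neighbors; on a GPU, the inner
    loops are exactly where load imbalance and memory divergence arise.
    """
    n = len(row_ptr) - 1
    level = [-1] * n
    level[src] = 0
    frontier, depth = [src], 0
    while frontier:
        depth += 1
        nxt = []
        for u in frontier:                                # work distribution
            for v in col_idx[row_ptr[u]:row_ptr[u + 1]]:  # edge gathering
                if level[v] < 0:
                    level[v] = depth
                    nxt.append(v)
        frontier = nxt
    return level
```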


2021 ◽  
Vol 47 (2) ◽  
pp. 1-28
Author(s):  
Goran Flegar ◽  
Hartwig Anzt ◽  
Terry Cojean ◽  
Enrique S. Quintana-Ortí

The use of mixed precision in numerical algorithms is a promising strategy for accelerating scientific applications. In particular, the adoption of specialized hardware and data formats for low-precision arithmetic in high-end GPUs (graphics processing units) has motivated numerous efforts aiming at carefully reducing the working precision in order to speed up computations. For algorithms whose performance is bound by memory bandwidth, the idea of compressing their data before (and after) memory accesses has received considerable attention. One idea is to store an approximate operator, such as a preconditioner, in lower than working precision, hopefully without impacting the algorithm output. We realize the first high-performance implementation of an adaptive-precision block-Jacobi preconditioner, which selects the precision format used to store the preconditioner data on the fly, taking into account the numerical properties of the individual preconditioner blocks. We implement the adaptive block-Jacobi preconditioner as production-ready functionality in the Ginkgo linear algebra library, considering not only the precision formats that are part of the IEEE standard, but also customized formats that optimize the lengths of the exponent and significand to the characteristics of the preconditioner blocks. Experiments run on a state-of-the-art GPU accelerator show that our implementation offers attractive runtime savings.
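The core mechanism can be sketched as follows: store each diagonal block in the cheapest precision its conditioning tolerates, and convert back to working precision when applying the preconditioner. The thresholds and the condition-number test below are illustrative stand-ins, not Ginkgo's actual selection heuristics, and the customized non-IEEE formats are omitted.

```python
import numpy as np

def adaptive_block_jacobi(A, block, tol_single=1e4, tol_half=1e2):
    """Build a block-Jacobi preconditioner whose inverted diagonal
    blocks are stored in the lowest IEEE precision their conditioning
    tolerates (float16 / float32 / float64). Thresholds are illustrative.
    """
    n = A.shape[0]
    blocks = []
    for s in range(0, n, block):
        B = A[s:s + block, s:s + block]
        Binv = np.linalg.inv(B)
        kappa = np.linalg.cond(B)
        if kappa < tol_half:            # well conditioned: half precision
            Binv = Binv.astype(np.float16)
        elif kappa < tol_single:        # moderate: single precision
            Binv = Binv.astype(np.float32)
        blocks.append((s, Binv))        # badly conditioned blocks stay double

    def apply(x):
        """y = M^{-1} x, accumulated in working (double) precision."""
        y = np.empty_like(x, dtype=np.float64)
        for s, Binv in blocks:
            b = Binv.shape[0]
            y[s:s + b] =Inv = Binv.astype(np.float64) @ x[s:s + b]
        return y
    return apply, blocks
```

Only the *storage* format is reduced; the application still happens in the working precision, which is what makes the compression transparent to the surrounding solver.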


2012 ◽  
Vol 9 ◽  
pp. 966-975
Author(s):  
Ferdinando Alessi ◽  
Annalisa Massini ◽  
Roberto Basili

Geophysics ◽  
2021 ◽  
pp. 1-71
Author(s):  
Hongwei Liu ◽  
Yi Luo

The finite-difference solution of the second-order acoustic wave equation is a fundamental algorithm in seismic exploration for seismic forward modeling, imaging, and inversion. Unlike the standard explicit finite-difference (EFD) methods, which usually suffer from the so-called "saturation effect", implicit FD methods can obtain much higher accuracy with relatively short operator lengths. Unfortunately, these implicit methods are not widely used because band matrices need to be solved implicitly, which is not suitable for most high-performance computer architectures. We introduce an explicit method to overcome this limitation by applying explicit causal and anti-causal integrations. We prove, both analytically and numerically, that the explicit solution is equivalent to the traditional implicit LU-decomposition method. In addition, we compare the accuracy of the new method with traditional EFD methods up to 32nd order, and numerical results indicate that the new method is more accurate. In terms of computational cost, the newly proposed method amounts to a standard 8th-order EFD scheme plus two causal and anti-causal integrations, which can be applied recursively, and no extra memory is needed. In summary, compared to the standard EFD methods, the new method has spectral-like accuracy; compared to the traditional LU-decomposition implicit methods, it is explicit, and thus more suitable for high-performance computing without losing any accuracy.
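The structural point, that an implicit banded solve is itself a causal sweep followed by an anti-causal sweep, is easy to see in the tridiagonal case: the Thomas (LU) algorithm below consists of exactly one forward and one backward first-order recursion. This is a generic sketch of that equivalence, not the authors' specific high-order FD operator.

```python
import numpy as np

def tridiag_solve(a, b, c, d):
    """Solve a tridiagonal system T x = d via LU decomposition:
    a causal (forward) elimination recursion followed by an anti-causal
    (backward) substitution recursion.

    a: sub-diagonal (length n-1), b: diagonal (length n),
    c: super-diagonal (length n-1), d: right-hand side (length n).
    """
    n = len(b)
    cp = np.empty(n - 1)
    dp = np.empty(n)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                      # causal recursion
        m = b[i] - a[i - 1] * cp[i - 1]
        if i < n - 1:
            cp[i] = c[i] / m
        dp[i] = (d[i] - a[i - 1] * dp[i - 1]) / m
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # anti-causal recursion
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

Both sweeps are first-order recursions over the grid, which is why they can be applied like explicit integrations with no extra memory beyond the solution vector.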


2014 ◽  
Vol 40 (10) ◽  
pp. 559-573 ◽  
Author(s):  
Li Tan ◽  
Shashank Kothapalli ◽  
Longxiang Chen ◽  
Omar Hussaini ◽  
Ryan Bissiri ◽  
...  

2012 ◽  
Vol 198-199 ◽  
pp. 523-527
Author(s):  
Fang Yuan Chen ◽  
Dong Song Zhang ◽  
Zhi Ying Wang

Worst-Case Execution Time (WCET) analysis is crucial in real-time systems and is very challenging on multicore processors due to possible runtime inter-thread interferences caused by shared resources. This paper proposes a novel approach to analyzing runtime inter-core interferences for consecutive or inconsecutive concurrent programs. Our approach can reasonably estimate runtime inter-core interferences in the shared cache by introducing lifetime and instruction-fetch timing-relation analysis into the address mapping method. Compared with a method based on lifetime alone, our proposed approach efficiently improves the tightness of the WCET estimation.
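A greatly simplified sketch of the kind of reasoning involved: two concurrent tasks can evict each other's lines in a shared cache only if their accesses map to overlapping cache sets and their lifetimes overlap in time. The cache geometry, the task model, and the pruning rule below are illustrative only; the paper's analysis additionally models instruction-fetch timing.

```python
def cache_sets(addresses, line_size=64, num_sets=1024):
    """Map byte addresses to cache-set indices: (addr // line) % sets."""
    return {(a // line_size) % num_sets for a in addresses}

def may_interfere(task_a, task_b, **geom):
    """Conservative check for shared-cache interference between two
    concurrent tasks. Each task is (addresses, (start, end)), where the
    interval is the task's lifetime. Interference is possible only when
    both the lifetimes and the touched cache sets overlap; either test
    failing lets a WCET analysis prune the interference.
    """
    (addrs_a, (s_a, e_a)), (addrs_b, (s_b, e_b)) = task_a, task_b
    overlap_time = s_a < e_b and s_b < e_a
    overlap_sets = bool(cache_sets(addrs_a, **geom) & cache_sets(addrs_b, **geom))
    return overlap_time and overlap_sets
```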


Electronics ◽  
2021 ◽  
Vol 10 (16) ◽  
pp. 1984
Author(s):  
Wei Zhang ◽  
Zihao Jiang ◽  
Zhiguang Chen ◽  
Nong Xiao ◽  
Yang Ou

Double-precision general matrix multiplication (DGEMM) is an essential kernel for measuring the potential performance of an HPC platform. ARMv8-based systems-on-chip (SoCs) have become candidates for next-generation HPC systems thanks to their highly competitive performance and energy efficiency, so it is meaningful to design high-performance DGEMM for ARMv8-based SoCs. However, as ARMv8-based SoCs integrate more and more cores, modern CPUs adopt non-uniform memory access (NUMA). NUMA restricts the performance and scalability of DGEMM when many threads access remote NUMA domains, which poses a challenge for developing high-performance DGEMM on multi-NUMA architectures. We present a NUMA-aware method that reduces the number of cross-die and cross-chip memory access events. The critical enabler for NUMA-aware DGEMM is to leverage two levels of parallelism, between and within nodes, in a purely threaded implementation, which allows for task independence and data localization across NUMA nodes. We have implemented NUMA-aware DGEMM in OpenBLAS and evaluated it on a dual-socket server with 48-core processors based on the Kunpeng 920 architecture. The results show that NUMA-aware DGEMM effectively reduces the number of cross-die and cross-chip memory accesses, significantly enhancing the scalability of DGEMM and increasing its performance by 17.1% on average, with the most remarkable improvement being 21.9%.
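The two-level parallelization idea (split the work across NUMA nodes first, then across threads within each node, so each node's tasks touch only their own row panel) can be sketched with plain threads. This toy version does not pin threads or control page placement the way a real NUMA-aware OpenBLAS kernel must.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def numa_aware_gemm(A, B, nodes=2, threads_per_node=2):
    """Two-level parallel C = A @ B: rows of C are partitioned across
    'nodes' (level 1, independent tasks on disjoint row panels of A),
    then across threads within each node (level 2). NumPy releases the
    GIL inside the matmul, so the threads genuinely overlap."""
    m = A.shape[0]
    C = np.zeros((m, B.shape[1]))

    def row_chunk(rows):                 # level 2: one thread's share
        C[rows, :] = A[rows, :] @ B      # disjoint rows, no locking needed

    def node_task(rows):                 # level 1: one NUMA node's panel
        chunks = np.array_split(rows, threads_per_node)
        with ThreadPoolExecutor(threads_per_node) as pool:
            list(pool.map(row_chunk, chunks))

    with ThreadPoolExecutor(nodes) as pool:
        list(pool.map(node_task, np.array_split(np.arange(m), nodes)))
    return C
```

Because every node-level task reads only its own rows of A (and writes only its own rows of C), a real implementation can allocate each panel in the node's local memory and avoid cross-die traffic entirely; only B is shared.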

