Scheduling Two-Sided Transformations Using Tile Algorithms on Multicore Architectures

2010 ◽  
Vol 18 (1) ◽  
pp. 35-50 ◽  
Author(s):  
Hatem Ltaief ◽  
Jakub Kurzak ◽  
Jack Dongarra ◽  
Rosa M. Badia

The objective of this paper is to describe, in the context of multicore architectures, three different scheduler implementations for the two-sided linear algebra transformations, in particular the Hessenberg and bidiagonal reductions, which are the first steps for the standard eigenvalue problem and the singular value decomposition, respectively. State-of-the-art dense linear algebra software, such as the LAPACK and ScaLAPACK libraries, suffers performance losses on multicore processors due to its inability to fully exploit thread-level parallelism. At the same time, the fine-grain dataflow model is gaining popularity as a paradigm for programming multicore architectures. Buttari et al. (Parallel Comput. Syst. Appl. 35 (2009), 38–53) introduced the concept of tile algorithms, in which parallelism is no longer hidden inside the Basic Linear Algebra Subprograms but is brought to the fore to yield much better performance. Along with efficient scheduling mechanisms for data-driven execution, these tile two-sided reductions achieve high performance, reaching up to 75% of the DGEMM peak on a 12000×12000 matrix with 16 Intel Tigerton 2.4 GHz processors. The main drawback of the tile-algorithm approach for two-sided transformations is that the full reduction cannot be obtained in one stage; other methods have to be considered to further reduce the band matrices to the required forms.
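A toy sketch of the data-driven execution idea behind tile algorithms: each tile task fires as soon as all of its dependencies complete. The QR-style task names (GEQRT/UNMQR) and the tiny graph below are purely illustrative, not the paper's actual Hessenberg/bidiagonal DAG or scheduler.

```python
from collections import deque

def dataflow_order(deps):
    """Return an execution order in which every task runs only after all
    of its dependencies have completed (Kahn's algorithm). A real tile
    runtime dispatches ready tasks to worker threads instead of
    appending them to a list.

    deps: dict mapping task -> iterable of prerequisite tasks.
    """
    remaining = {t: set(d) for t, d in deps.items()}
    dependents = {}
    for t, d in deps.items():
        for p in d:
            dependents.setdefault(p, []).append(t)
            remaining.setdefault(p, set())
    ready = deque(t for t, d in remaining.items() if not d)
    order = []
    while ready:
        t = ready.popleft()
        order.append(t)
        for s in dependents.get(t, ()):
            remaining[s].discard(t)
            if not remaining[s]:       # last dependency just finished
                ready.append(s)
    if len(order) != len(remaining):
        raise ValueError("cycle in task graph")
    return order

# Tiny tile-style DAG: a panel task must finish before the trailing
# updates that consume its tile (task names are illustrative only).
graph = {
    "GEQRT0": [],
    "UNMQR0_1": ["GEQRT0"],
    "UNMQR0_2": ["GEQRT0"],
    "GEQRT1": ["UNMQR0_1"],
    "UNMQR1_2": ["GEQRT1", "UNMQR0_2"],
}
```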

2019 ◽  
Vol 29 (2) ◽  
pp. 407-419
Author(s):  
Beata Bylina ◽  
Jarosław Bylina

Abstract The aim of this paper is to investigate dense linear algebra algorithms on shared-memory multicore architectures. The design and implementation of a parallel tiled WZ factorization algorithm that can fully exploit such architectures are presented. Three parallel implementations of the algorithm are studied. The first relies only on exploiting multithreaded BLAS (basic linear algebra subprograms) operations. The second, in addition to BLAS operations, employs the OpenMP standard to exploit loop-level parallelism. The third, in addition to BLAS operations, employs the OpenMP task directive with the depend clause. We report the computational performance and the speedup of the parallel tiled WZ factorization algorithm on shared-memory multicore architectures for dense square diagonally dominant matrices. We then compare our parallel implementations with the respective LU factorization from a vendor-implemented LAPACK library, and we also analyze the numerical accuracy. Two of our implementations achieve close to the maximal theoretical speedup implied by Amdahl's law.
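For readers unfamiliar with the WZ factorization itself, a minimal sequential reference version can be written in a few lines of NumPy; this is a sketch of the unblocked algorithm, not the authors' tiled, parallel implementation. Diagonal dominance keeps the 2×2 pivot systems nonsingular.

```python
import numpy as np

def wz_factorize(A):
    """Sequential WZ factorization A = W @ Z for even-order matrices.

    At step k, the pivot rows k and n-1-k are used to zero columns k
    and n-1-k of all rows in between, giving Z its characteristic
    Z-shaped sparsity and W its butterfly shape.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    assert n % 2 == 0, "this sketch assumes an even matrix order"
    W = np.eye(n)
    Z = A.copy()
    for k in range(n // 2 - 1):
        m = n - 1 - k                           # mirror pivot row
        P = np.array([[Z[k, k], Z[m, k]],
                      [Z[k, m], Z[m, m]]])      # 2x2 pivot system
        for i in range(k + 1, m):
            # multipliers that zero Z[i, k] and Z[i, m]
            w = np.linalg.solve(P, [Z[i, k], Z[i, m]])
            W[i, k], W[i, m] = w
            Z[i, :] -= w[0] * Z[k, :] + w[1] * Z[m, :]
    return W, Z
```

As with LU, the pivot rows are final when they are used, so the eliminated updates compose exactly into A = W·Z.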


2021 ◽  
Vol 18 (3) ◽  
pp. 1-24
Author(s):  
Yashuai Lü ◽  
Hui Guo ◽  
Libo Huang ◽  
Qi Yu ◽  
Li Shen ◽  
...  

Due to massive thread-level parallelism, GPUs have become an attractive platform for accelerating large-scale data-parallel computations such as graph processing. However, achieving high performance for graph processing on GPUs is non-trivial: processing graphs on GPUs introduces several problems, such as load imbalance, low utilization of hardware units, and memory divergence. Although previous work has proposed several software strategies to optimize graph processing on GPUs, several issues remain beyond the capability of software techniques to address. In this article, we present GraphPEG, a graph processing engine for efficient graph processing on GPUs. Inspired by the observation that many graph algorithms share a common graph-traversal pattern, GraphPEG improves the performance of graph processing by coupling automatic edge gathering with fine-grain work distribution. GraphPEG can also adapt to various input graph datasets and simplify the software design of graph processing with hardware-assisted graph traversal. Simulation results show that, in comparison with two representative, highly efficient GPU graph processing software frameworks, Gunrock and SEP-Graph, GraphPEG improves graph processing throughput by 2.8× and 2.5× on average, and up to 7.3× and 7.0×, for six graph algorithm benchmarks on six graph datasets, with marginal hardware cost.
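The common traversal pattern that GraphPEG assists in hardware (gather the edges of the current frontier, then distribute the resulting work) can be illustrated by a plain level-synchronous BFS over a CSR graph. The code below is a generic software sketch, not GraphPEG's design.

```python
def bfs_levels(row_ptr, col_idx, src):
    """Level-synchronous BFS over a CSR graph.

    Each iteration gathers the edges of the current frontier and builds
    the next frontier from unvisited neighbors; on a GPU, the inner
    loops are exactly where load imbalance and memory divergence arise.
    """
    n = len(row_ptr) - 1
    level = [-1] * n
    level[src] = 0
    frontier, depth = [src], 0
    while frontier:
        depth += 1
        nxt = []
        for u in frontier:                                # work distribution
            for v in col_idx[row_ptr[u]:row_ptr[u + 1]]:  # edge gathering
                if level[v] < 0:
                    level[v] = depth
                    nxt.append(v)
        frontier = nxt
    return level
```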


2021 ◽  
Vol 47 (2) ◽  
pp. 1-28
Author(s):  
Goran Flegar ◽  
Hartwig Anzt ◽  
Terry Cojean ◽  
Enrique S. Quintana-Ortí

The use of mixed precision in numerical algorithms is a promising strategy for accelerating scientific applications. In particular, the adoption of specialized hardware and data formats for low-precision arithmetic in high-end GPUs (graphics processing units) has motivated numerous efforts aiming at carefully reducing the working precision in order to speed up computations. For algorithms whose performance is bound by memory bandwidth, the idea of compressing their data before (and after) memory accesses has received considerable attention. One idea is to store an approximate operator, such as a preconditioner, in lower than working precision, hopefully without impacting the algorithm output. We realize the first high-performance implementation of an adaptive-precision block-Jacobi preconditioner, which selects the precision format used to store the preconditioner data on the fly, taking into account the numerical properties of the individual preconditioner blocks. We implement the adaptive block-Jacobi preconditioner as production-ready functionality in the Ginkgo linear algebra library, considering not only the precision formats that are part of the IEEE standard, but also customized formats that optimize the lengths of the exponent and significand to the characteristics of the preconditioner blocks. Experiments run on a state-of-the-art GPU accelerator show that our implementation offers attractive runtime savings.
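The core mechanism can be sketched as follows: store each diagonal block in the cheapest precision its conditioning tolerates, and convert back to working precision when applying the preconditioner. The thresholds and the condition-number test below are illustrative stand-ins, not Ginkgo's actual selection heuristics, and the customized non-IEEE formats are omitted.

```python
import numpy as np

def adaptive_block_jacobi(A, block, tol_single=1e4, tol_half=1e2):
    """Build a block-Jacobi preconditioner whose inverted diagonal
    blocks are stored in the lowest IEEE precision their conditioning
    tolerates (float16 / float32 / float64). Thresholds are illustrative.
    """
    n = A.shape[0]
    blocks = []
    for s in range(0, n, block):
        B = A[s:s + block, s:s + block]
        Binv = np.linalg.inv(B)
        kappa = np.linalg.cond(B)
        if kappa < tol_half:            # well conditioned: half precision
            Binv = Binv.astype(np.float16)
        elif kappa < tol_single:        # moderate: single precision
            Binv = Binv.astype(np.float32)
        blocks.append((s, Binv))        # badly conditioned blocks stay double

    def apply(x):
        """y = M^{-1} x, accumulated in working (double) precision."""
        y = np.empty_like(x, dtype=np.float64)
        for s, Binv in blocks:
            b = Binv.shape[0]
            y[s:s + b] =Inv = Binv.astype(np.float64) @ x[s:s + b]
        return y
    return apply, blocks
```

Only the *storage* format is reduced; the application still happens in the working precision, which is what makes the compression transparent to the surrounding solver.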


2012 ◽  
Vol 9 ◽  
pp. 966-975
Author(s):  
Ferdinando Alessi ◽  
Annalisa Massini ◽  
Roberto Basili

Geophysics ◽  
2021 ◽  
pp. 1-71
Author(s):  
Hongwei Liu ◽  
Yi Luo

The finite-difference solution of the second-order acoustic wave equation is a fundamental algorithm in seismic exploration for seismic forward modeling, imaging, and inversion. Unlike the standard explicit finite-difference (EFD) methods, which usually suffer from the so-called "saturation effect", implicit FD methods can obtain much higher accuracy with relatively short operator lengths. Unfortunately, these implicit methods are not widely used because band matrices need to be solved implicitly, which is not suitable for most high-performance computer architectures. We introduce an explicit method to overcome this limitation by applying explicit causal and anti-causal integrations. We prove, both analytically and numerically, that the explicit solution is equivalent to the traditional implicit LU-decomposition method. In addition, we compare the accuracy of the new method with traditional EFD methods up to 32nd order, and numerical results indicate that the new method is more accurate. In terms of computational cost, the newly proposed method amounts to a standard 8th-order EFD scheme plus two causal and anti-causal integrations, which can be applied recursively, and no extra memory is needed. In summary, compared to the standard EFD methods, the new method has spectral-like accuracy; compared to the traditional LU-decomposition implicit methods, it is explicit, and thus more suitable for high-performance computing without losing any accuracy.
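The structural point, that an implicit banded solve is itself a causal sweep followed by an anti-causal sweep, is easy to see in the tridiagonal case: the Thomas (LU) algorithm below consists of exactly one forward and one backward first-order recursion. This is a generic sketch of that equivalence, not the authors' specific high-order FD operator.

```python
import numpy as np

def tridiag_solve(a, b, c, d):
    """Solve a tridiagonal system T x = d via LU decomposition:
    a causal (forward) elimination recursion followed by an anti-causal
    (backward) substitution recursion.

    a: sub-diagonal (length n-1), b: diagonal (length n),
    c: super-diagonal (length n-1), d: right-hand side (length n).
    """
    n = len(b)
    cp = np.empty(n - 1)
    dp = np.empty(n)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):                      # causal recursion
        m = b[i] - a[i - 1] * cp[i - 1]
        if i < n - 1:
            cp[i] = c[i] / m
        dp[i] = (d[i] - a[i - 1] * dp[i - 1]) / m
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):             # anti-causal recursion
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x
```

Both sweeps are first-order recursions over the grid, which is why they can be applied like explicit integrations with no extra memory beyond the solution vector.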


2014 ◽  
Vol 40 (10) ◽  
pp. 559-573 ◽  
Author(s):  
Li Tan ◽  
Shashank Kothapalli ◽  
Longxiang Chen ◽  
Omar Hussaini ◽  
Ryan Bissiri ◽  
...  

2012 ◽  
Vol 198-199 ◽  
pp. 523-527
Author(s):  
Fang Yuan Chen ◽  
Dong Song Zhang ◽  
Zhi Ying Wang

Worst-Case Execution Time (WCET) analysis is crucial in real-time systems and is very challenging on multicore processors due to possible runtime inter-thread interferences caused by shared resources. This paper proposes a novel approach to analyzing runtime inter-core interferences for consecutive or inconsecutive concurrent programs. Our approach can reasonably estimate runtime inter-core interferences in the shared cache by introducing lifetime and instruction-fetch timing-relation analysis into the address mapping method. Compared with a method based on lifetime alone, our proposed approach efficiently improves the tightness of the WCET estimation.
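A greatly simplified sketch of the kind of reasoning involved: two concurrent tasks can evict each other's lines in a shared cache only if their accesses map to overlapping cache sets and their lifetimes overlap in time. The cache geometry, the task model, and the pruning rule below are illustrative only; the paper's analysis additionally models instruction-fetch timing.

```python
def cache_sets(addresses, line_size=64, num_sets=1024):
    """Map byte addresses to cache-set indices: (addr // line) % sets."""
    return {(a // line_size) % num_sets for a in addresses}

def may_interfere(task_a, task_b, **geom):
    """Conservative check for shared-cache interference between two
    concurrent tasks. Each task is (addresses, (start, end)), where the
    interval is the task's lifetime. Interference is possible only when
    both the lifetimes and the touched cache sets overlap; either test
    failing lets a WCET analysis prune the interference.
    """
    (addrs_a, (s_a, e_a)), (addrs_b, (s_b, e_b)) = task_a, task_b
    overlap_time = s_a < e_b and s_b < e_a
    overlap_sets = bool(cache_sets(addrs_a, **geom) & cache_sets(addrs_b, **geom))
    return overlap_time and overlap_sets
```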


Electronics ◽  
2021 ◽  
Vol 10 (16) ◽  
pp. 1984
Author(s):  
Wei Zhang ◽  
Zihao Jiang ◽  
Zhiguang Chen ◽  
Nong Xiao ◽  
Yang Ou

Double-precision general matrix multiplication (DGEMM) is an essential kernel for measuring the potential performance of an HPC platform. ARMv8-based systems-on-chip (SoCs) have become candidates for next-generation HPC systems thanks to their highly competitive performance and energy efficiency, so it is meaningful to design high-performance DGEMM for ARMv8-based SoCs. However, as ARMv8-based SoCs integrate more and more cores, modern CPUs adopt non-uniform memory access (NUMA). NUMA restricts the performance and scalability of DGEMM when many threads access remote NUMA domains, which poses a challenge for developing high-performance DGEMM on multi-NUMA architectures. We present a NUMA-aware method that reduces the number of cross-die and cross-chip memory access events. The critical enabler for NUMA-aware DGEMM is to leverage two levels of parallelism, between and within nodes, in a purely threaded implementation, which allows for task independence and data localization across NUMA nodes. We have implemented NUMA-aware DGEMM in OpenBLAS and evaluated it on a dual-socket server with 48-core processors based on the Kunpeng 920 architecture. The results show that NUMA-aware DGEMM effectively reduces the number of cross-die and cross-chip memory accesses, significantly enhancing the scalability of DGEMM and increasing its performance by 17.1% on average, with the most remarkable improvement being 21.9%.
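The two-level parallelization idea (split the work across NUMA nodes first, then across threads within each node, so each node's tasks touch only their own row panel) can be sketched with plain threads. This toy version does not pin threads or control page placement the way a real NUMA-aware OpenBLAS kernel must.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def numa_aware_gemm(A, B, nodes=2, threads_per_node=2):
    """Two-level parallel C = A @ B: rows of C are partitioned across
    'nodes' (level 1, independent tasks on disjoint row panels of A),
    then across threads within each node (level 2). NumPy releases the
    GIL inside the matmul, so the threads genuinely overlap."""
    m = A.shape[0]
    C = np.zeros((m, B.shape[1]))

    def row_chunk(rows):                 # level 2: one thread's share
        C[rows, :] = A[rows, :] @ B      # disjoint rows, no locking needed

    def node_task(rows):                 # level 1: one NUMA node's panel
        chunks = np.array_split(rows, threads_per_node)
        with ThreadPoolExecutor(threads_per_node) as pool:
            list(pool.map(row_chunk, chunks))

    with ThreadPoolExecutor(nodes) as pool:
        list(pool.map(node_task, np.array_split(np.arange(m), nodes)))
    return C
```

Because every node-level task reads only its own rows of A (and writes only its own rows of C), a real implementation can allocate each panel in the node's local memory and avoid cross-die traffic entirely; only B is shared.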

