Hybrid-Grained Dynamic Load Balanced GEMM on NUMA Architectures

Electronics ◽  
2018 ◽  
Vol 7 (12) ◽  
pp. 359 ◽  
Author(s):  
Xing Su ◽  
Fei Lei

The Basic Linear Algebra Subprograms (BLAS) is a fundamental numerical software library, and GEneral Matrix Multiply (GEMM) is the most important computational kernel routine in the BLAS library. On multi-core and many-core processors, the whole workload of GEMM is partitioned and scheduled across multiple threads to exploit the parallel hardware. Generally, the workload is partitioned equally among threads, and all threads are expected to accomplish their work in roughly the same time. However, this is not the case on Non-Uniform Memory Access (NUMA) architectures. The NUMA effect may cause threads to run at different speeds, and the overall execution time of GEMM is determined by the slowest thread. In this paper, we propose a hybrid-grained dynamic load-balancing method that mitigates the NUMA effect by allowing fast threads to steal work from slow ones. We evaluate the proposed method on Phytium 2000+, an emerging 64-core high-performance processor based on Arm’s AArch64 architecture. Results show that our method reduces the synchronization overhead by 51.5% and improves GEMM performance by 1.9%.
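To make the work-stealing idea concrete, here is a minimal C sketch of coarse-grained stealing over GEMM column-panel tasks. It is not the paper's hybrid-grained scheme; names such as gemm_block, NUM_THREADS, and BLOCKS_PER_THREAD are illustrative assumptions. Each thread first drains its own block range through an atomic counter, then claims leftover blocks from owners that are still behind.

```c
/* Minimal work-stealing sketch (assumptions, not the paper's code). */
#include <stdatomic.h>
#include <stdio.h>
#include <omp.h>

#define NUM_THREADS 4          /* illustrative; Phytium 2000+ would use 64 */
#define BLOCKS_PER_THREAD 16   /* column panels initially owned per thread */

static atomic_int next_block[NUM_THREADS];  /* next unclaimed block per owner */

/* Stub standing in for the real macro-kernel on one column panel of C. */
static void gemm_block(int owner, int block)
{
    printf("thread %d computes block %d owned by thread %d\n",
           omp_get_thread_num(), block, owner);
}

int main(void)
{
    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int tid = omp_get_thread_num();
        int b;

        /* Phase 1: consume the blocks this thread owns. */
        while ((b = atomic_fetch_add(&next_block[tid], 1)) < BLOCKS_PER_THREAD)
            gemm_block(tid, b);

        /* Phase 2: steal any blocks that slower owners have not claimed yet. */
        for (int victim = 0; victim < NUM_THREADS; victim++) {
            if (victim == tid) continue;
            while ((b = atomic_fetch_add(&next_block[victim], 1)) < BLOCKS_PER_THREAD)
                gemm_block(victim, b);
        }
    }
    return 0;
}
```

Compile with `gcc -fopenmp`; in a real kernel the stolen unit would be a cache-sized macro-kernel tile rather than a whole column panel.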

2015 ◽  
Vol 2015 ◽  
pp. 1-20 ◽  
Author(s):  
Nhat-Phuong Tran ◽  
Myungho Lee ◽  
Dong Hoon Choi

The Aho-Corasick (AC) algorithm is a multiple-pattern string matching algorithm commonly used in computer and network security and bioinformatics, among many other fields. In order to meet the highly demanding computational requirements imposed on these applications, achieving high performance for the AC algorithm is crucial. In this paper, we present a high-performance parallelization of the AC algorithm on many-core accelerator chips such as the Graphics Processing Unit (GPU) from Nvidia and the Intel Xeon Phi. Our parallelization approach significantly improves the cache locality of the AC algorithm by partitioning a given set of string patterns into multiple smaller pattern sets in a space-efficient way. Using the multiple pattern sets, intensive pattern matching operations are conducted concurrently against the whole input text. Compared with previous approaches, which partition the input data among multiple threads rather than partitioning the pattern set, our approach significantly improves performance. Experimental results show that our approach yields up to a 2.73-times speedup on the Nvidia K20 GPU and a 2.00-times speedup on the Intel Xeon Phi compared with the previous approach. Our parallel implementation delivers up to 693 Gbps of throughput on the K20.
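A rough illustration of partitioning the pattern set (rather than splitting the input text) can be sketched in plain C. The AC automaton is replaced here by a naive strstr() scan purely to keep the example short, and the round-robin split, group count, and pattern list are assumptions, not the paper's space-efficient partitioning.

```c
/* Pattern-set partitioning sketch: each group matches the WHOLE input. */
#include <stdio.h>
#include <string.h>
#include <omp.h>

#define NUM_GROUPS 4   /* illustrative group count */

static const char *patterns[] = { "virus", "worm", "trojan", "exploit",
                                  "rootkit", "phish", "botnet", "keylog" };
#define NUM_PATTERNS (sizeof(patterns) / sizeof(patterns[0]))

int main(void)
{
    const char *text = "a worm dropped a keylogger after the phishing mail";

    /* Group g owns patterns g, g+NUM_GROUPS, ... (round-robin split).
     * In the real approach each group's AC automaton fits in fast memory. */
    #pragma omp parallel for num_threads(NUM_GROUPS)
    for (int g = 0; g < NUM_GROUPS; g++) {
        for (size_t p = g; p < NUM_PATTERNS; p += NUM_GROUPS) {
            if (strstr(text, patterns[p]))
                printf("group %d matched \"%s\"\n", g, patterns[p]);
        }
    }
    return 0;
}
```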


2021 ◽  
Vol 47 (3) ◽  
pp. 1-23
Author(s):  
Ahmad Abdelfattah ◽  
Timothy Costa ◽  
Jack Dongarra ◽  
Mark Gates ◽  
Azzam Haidar ◽  
...  

This article describes a standard API for a set of Batched Basic Linear Algebra Subprograms (Batched BLAS or BBLAS). The focus is on many independent BLAS operations on small matrices that are grouped together and processed by a single routine, called a Batched BLAS routine. The matrices are grouped into uniformly sized groups, with just one group if all the matrices are of equal size. The aim is to provide more efficient, yet portable, implementations of algorithms on high-performance many-core platforms. These include multicore and many-core CPU processors, GPUs and coprocessors, and other hardware accelerators with floating-point compute capability. In addition to the standard single and double precision types, the standard also includes half and quadruple precision. In particular, half precision is used in many very large scale applications, such as those associated with machine learning.
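The grouped semantics can be pictured with a naive reference routine. This is only a sketch of the calling convention's shape (groups that share one problem size, with pointer arrays selecting the individual matrices); the names dgemm_ref and dgemm_batch_ref are hypothetical and are not the standardized API.

```c
/* Naive reference sketch of grouped Batched BLAS semantics (assumption). */
#include <stdio.h>

/* C := alpha*A*B + beta*C for one m x n x k problem, column-major. */
static void dgemm_ref(int m, int n, int k, double alpha,
                      const double *A, int lda, const double *B, int ldb,
                      double beta, double *C, int ldc)
{
    for (int j = 0; j < n; j++)
        for (int i = 0; i < m; i++) {
            double acc = 0.0;
            for (int p = 0; p < k; p++)
                acc += A[i + p * lda] * B[p + j * ldb];
            C[i + j * ldc] = alpha * acc + beta * C[i + j * ldc];
        }
}

/* Grouped batched GEMM: matrices within group g share m/n/k and scalars;
 * the pointer arrays A, B, C list every matrix in batch order. */
static void dgemm_batch_ref(int group_count, const int *group_size,
                            const int *m, const int *n, const int *k,
                            const double *alpha, const double *beta,
                            const double **A, const int *lda,
                            const double **B, const int *ldb,
                            double **C, const int *ldc)
{
    int idx = 0;
    for (int g = 0; g < group_count; g++)
        for (int s = 0; s < group_size[g]; s++, idx++)
            dgemm_ref(m[g], n[g], k[g], alpha[g], A[idx], lda[g],
                      B[idx], ldb[g], beta[g], C[idx], ldc[g]);
}

int main(void)
{
    /* One group of two 2x2 problems as a toy usage example. */
    double A1[4] = {1,2,3,4}, B1[4] = {1,0,0,1}, C1[4] = {0,0,0,0};
    double A2[4] = {2,0,0,2}, B2[4] = {1,1,1,1}, C2[4] = {0,0,0,0};
    const double *A[2] = {A1, A2}, *B[2] = {B1, B2};
    double *C[2] = {C1, C2};
    int gs[1] = {2}, m[1] = {2}, n[1] = {2}, k[1] = {2};
    int lda[1] = {2}, ldb[1] = {2}, ldc[1] = {2};
    double alpha[1] = {1.0}, beta[1] = {0.0};

    dgemm_batch_ref(1, gs, m, n, k, alpha, beta, A, lda, B, ldb, C, ldc);
    printf("C1[0]=%g C2[3]=%g\n", C1[0], C2[3]);
    return 0;
}
```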


1996 ◽  
Vol 22 (7) ◽  
pp. 969-989 ◽  
Author(s):  
Mark A. Franklin ◽  
Vasudha Govindan

2009 ◽  
Vol 10 (04) ◽  
pp. 391-419 ◽  
Author(s):  
ELIE EL AJALTOUNI ◽  
MING ZHANG ◽  
AZZEDINE BOUKERCHE ◽  
ROBSON EDUARDO DE GRANDE

Dynamic load balancing is a key factor in achieving high performance for large-scale distributed simulations on grid infrastructures. In a grid environment, the available resources and the simulation's computation and communication behavior may experience critical run-time imbalances. Consequently, an initial static partitioning should be combined with a dynamic load-balancing scheme to ensure high performance of the distributed simulation. In this paper, we propose a dynamic load-balancing scheme for distributed simulations on a grid infrastructure. Our scheme is composed of an online network analysis service coupled with monitoring agents and a run-time model repartitioning service. We present a hierarchical, scalable, and adaptive JXTA-based service scheme and use simulation experiments to demonstrate that our proposed scheme exhibits better performance in terms of simulation execution time. Furthermore, we extend our algorithm from a local intra-cluster algorithm to a global inter-cluster algorithm and consider the proposed global design through a formalized Discrete Event System Specification (DEVS) model.
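A heavily simplified sketch of the monitoring-plus-repartitioning decision described above: monitoring agents report a per-node load, and the balancer triggers a migration when the measured imbalance exceeds a threshold. The node count, threshold, and function names are assumptions for illustration and do not reflect the paper's JXTA-based services.

```c
/* Threshold-based rebalancing trigger (illustrative assumptions only). */
#include <stdio.h>

#define NUM_NODES 4
#define IMBALANCE_THRESHOLD 0.25   /* rebalance when max load is >25% above mean */

/* Decide whether to rebalance and, if so, pick source and destination nodes. */
static int should_rebalance(const double load[], int n, int *src, int *dst)
{
    double sum = 0.0;
    int hi = 0, lo = 0;
    for (int i = 0; i < n; i++) {
        sum += load[i];
        if (load[i] > load[hi]) hi = i;
        if (load[i] < load[lo]) lo = i;
    }
    double mean = sum / n;
    if (mean > 0.0 && (load[hi] - mean) / mean > IMBALANCE_THRESHOLD) {
        *src = hi;
        *dst = lo;
        return 1;
    }
    return 0;
}

int main(void)
{
    /* Loads as reported by the monitoring agents (toy numbers). */
    double load[NUM_NODES] = { 0.9, 0.4, 0.5, 0.2 };
    int src, dst;
    if (should_rebalance(load, NUM_NODES, &src, &dst))
        printf("repartition: migrate work from node %d to node %d\n", src, dst);
    else
        printf("load within tolerance; no repartitioning\n");
    return 0;
}
```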

