Scaling Up Parallel Computation of Tiled QR Factorizations by a Distributed Scheduling Runtime System and Analytical Modeling

Parallel Processing Letters ◽

10.1142/s0129626418500044 ◽

2018 ◽

Vol 28 (01) ◽

pp. 1850004 ◽

Author(s):

Weijian Zheng ◽

Fengguang Song ◽

Lan Lin ◽

Zizhong Chen

Keyword(s):

Dynamic Scheduling ◽

Parallel Implementation ◽

Optimal Number ◽

Performance Model ◽

Analytical Performance ◽

Distributed Scheduling ◽

QR Factorization ◽

Runtime Systems ◽

Runtime System ◽

Factorization Algorithms

Implementing parallel QR factorization software that scales on massively parallel manycore systems requires a comprehensive design that combines algorithm redesign, efficient runtime systems, reduced synchronization and communication, and analytical performance modeling. This paper presents tiled communication-avoiding QR factorization software that scales efficiently for matrices of general dimensions. We design a tiled communication-avoiding QR factorization algorithm and implement it with a fully distributed dynamic scheduling runtime system to minimize both synchronization and communication. The whole class of communication-avoiding QR factorization algorithms depends on an important parameter D (the number of domains), whose optimal value is not known in advance and currently has to be found by manual tuning and empirical search. To that end, we introduce a simplified analytical performance model to determine an optimal number of domains D. The experimental results show that our new parallel implementation is up to 30% faster than a state-of-the-art multicore-based numerical library, and up to 30 times faster than ScaLAPACK on thousands of CPU cores. Furthermore, using the new analytical model to predict an optimal number of domains is competitive with exhaustive search, with an average performance difference of 1%.
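
The role of the domain count D can be illustrated with a minimal sketch, assuming a tall-and-skinny matrix and a flat two-level reduction (often called TSQR). This is not the paper's tiled algorithm or its runtime system; `tsqr_r` and its arguments are hypothetical names, and NumPy's dense QR stands in for the tile kernels.

```python
import numpy as np


def tsqr_r(a, num_domains):
    """Return the R factor of `a` via a two-level communication-avoiding QR.

    `num_domains` plays the role of D: the number of row blocks that are
    factored independently before their R factors are combined.
    """
    # Split the rows into D contiguous domains (sizes may differ by one row).
    blocks = np.array_split(a, num_domains, axis=0)
    # Level 1: independent local QR per domain; only the small R factors are kept.
    local_rs = [np.linalg.qr(block, mode="r") for block in blocks]
    # Level 2: stack the local R factors and factor the stack once more.
    r = np.linalg.qr(np.vstack(local_rs), mode="r")
    # QR is unique only up to row signs, so make the diagonal non-negative.
    signs = np.sign(np.diag(r))
    signs[signs == 0] = 1.0
    return signs[:, None] * r


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.standard_normal((64, 4))
    for d in (1, 2, 4, 8):
        r = tsqr_r(a, d)
        # R^T R must reproduce A^T A regardless of the number of domains.
        print(d, np.allclose(r.T @ r, a.T @ a))
```

Every value of D gives the same R factor; what changes is how much work is independent in level 1 versus how large the combining step in level 2 becomes, which is the trade-off a performance model for D has to capture.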

OpenMP and StarPU Abreast: the Impact of Runtime in Task-Based Block QR Factorization Performance

10.5753/wscad.2019.8654 ◽

2019 ◽

Author(s):

Marcelo Cogo Miletto ◽

Lucas Schnorr

Keyword(s):

Directed Acyclic Graph ◽

Parallel Applications ◽

QR Factorization ◽

Runtime Systems ◽

Runtime System ◽

Application Performance ◽

Programming Paradigm ◽

High Level Abstraction

A directed acyclic graph (DAG) is a high-level abstraction for describing the activities of parallel applications. In the task-based programming paradigm, a DAG contains tasks (nodes) and dependencies (edges). Application performance depends on the choice of runtime system. Our work evaluates and compares the performance of three runtime systems, GCC/libgomp, LLVM/libomp, and StarPU, for a task-based dense block QR factorization. The results show that while GCC/libgomp achieves up to 5.4% better performance in the best case, it has scalability problems for fine-grained problems with large DAGs. LLVM/libomp and StarPU are more scalable, and StarPU is much faster in task creation and submission than the other runtimes.
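
How such a DAG arises can be sketched as follows, assuming the commonly described tile QR kernel sequence (GEQRT, ORMQR, TSQRT, TSMQR) on a square grid of tiles and whole-tile dependency tracking. The function `tile_qr_dag` and its bookkeeping are hypothetical and only mimic how a task runtime might infer edges from declared data accesses; they do not reproduce the behavior of libgomp, libomp, or StarPU.

```python
from collections import defaultdict


def tile_qr_dag(nt):
    """Build (tasks, edges) for a tile QR factorization on an nt x nt tile grid.

    Tasks are submitted in sequential program order; an edge (a, b) means
    task b must wait for task a because both touch the same tile and at
    least one of them modifies it.
    """
    tasks, edges = [], set()
    last_writer = {}              # tile -> task that last modified it
    readers = defaultdict(list)   # tile -> tasks that read it since that write

    def submit(name, reads, writes):
        tasks.append(name)
        for tile in reads:        # read-after-write
            if tile in last_writer:
                edges.add((last_writer[tile], name))
            readers[tile].append(name)
        for tile in writes:       # tiles in `writes` are read and modified
            if tile in last_writer:
                edges.add((last_writer[tile], name))
            for reader in readers[tile]:   # write-after-read
                edges.add((reader, name))
            readers[tile] = []
            last_writer[tile] = name

    for k in range(nt):
        submit(("GEQRT", k), [], [(k, k)])
        for j in range(k + 1, nt):
            submit(("ORMQR", k, j), [(k, k)], [(k, j)])
        for i in range(k + 1, nt):
            submit(("TSQRT", i, k), [], [(k, k), (i, k)])
            for j in range(k + 1, nt):
                submit(("TSMQR", i, j, k), [(i, k)], [(k, j), (i, j)])
    return tasks, edges


if __name__ == "__main__":
    tasks, edges = tile_qr_dag(3)
    print(len(tasks), "tasks,", len(edges), "dependencies")
```

The number of tasks grows cubically with the number of tile rows, which is why smaller tiles (finer granularity) put more pressure on task creation and submission in the runtime.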

A family of parallel QR factorization algorithms

Concurrency Practice and Experience ◽

10.1002/(sici)1096-9128(199607)8:6<461::aid-cpe256>3.0.co;2-h ◽

1996 ◽

Vol 8 (6) ◽

pp. 461-473 ◽

Author(s):

Gerard G.L. Meyer ◽

Mike Pascale

Keyword(s):

QR Factorization ◽

Factorization Algorithms

An analytical performance model of robotic storage libraries

Performance Evaluation ◽

10.1016/0166-5316(96)00034-x ◽

1996 ◽

Vol 27-28 (1) ◽

pp. 231-251

Author(s):

T Johnson

Keyword(s):

Performance Model ◽

Analytical Performance

Analytical performance model for FPGA-based reconfigurable computing

Microprocessors and Microsystems ◽

10.1016/j.micpro.2015.09.009 ◽

2015 ◽

Vol 39 (8) ◽

pp. 796-806 ◽

Author(s):

Hossein Mehri ◽

Bijan Alizadeh

Keyword(s):

Reconfigurable Computing ◽

Performance Model ◽

Analytical Performance

An Analytical Performance Model for the Spidergon NoC

21st International Conference on Advanced Networking and Applications (AINA '07) ◽

10.1109/aina.2007.31 ◽

2007 ◽

Author(s):

Mahmoud Moadeli ◽

Ali Shahrabi ◽

Wim Vanderbauwhede ◽

Mohamed Ould-Khaoua

Keyword(s):

Performance Model ◽

Analytical Performance

An analytical performance model for parallel production systems

Proceedings Fourth International Conference on Tools with Artificial Intelligence TAI '92 ◽

10.1109/tai.1992.246430 ◽

2003 ◽

Author(s):

J.-H. Wang ◽

J. Srivastava

Keyword(s):

Production Systems ◽

Performance Model ◽

Analytical Performance

Analytical performance model for mobile network operator cloud

The Journal of Supercomputing ◽

10.1007/s11227-015-1551-4 ◽

2015 ◽

Vol 71 (12) ◽

pp. 4555-4577 ◽

Author(s):

Hassan Raei ◽

Nasser Yazdani

Keyword(s):

Mobile Network ◽

Performance Model ◽

Analytical Performance ◽

Network Operator ◽

Mobile Network Operator

Tiled QR factorization algorithms

Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis on - SC '11 ◽

10.1145/2063384.2063393 ◽

2011 ◽

Author(s):

Henricus Bouwmeester ◽

Mathias Jacquelin ◽

Julien Langou ◽

Yves Robert

Keyword(s):

QR Factorization ◽

Factorization Algorithms

A Comprehensive Analytical Performance Model of DRAM Caches

Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering - ICPE '15 ◽

10.1145/2668930.2688044 ◽

2015 ◽

Author(s):

Nagendra Gulur ◽

Mahesh Mehendale ◽

Ramaswamy Govindarajan

Keyword(s):

Performance Model ◽

Analytical Performance

StratusPM: An Analytical Performance Model for Cloud Applications

2016 IEEE 10th International Symposium on the Maintenance and Evolution of Service-Oriented and Cloud-Based Environments (MESOCA) ◽

10.1109/mesoca.2016.13 ◽

2016 ◽

Author(s):

Mohammad Hamdaqa ◽

Ladan Tahvildari

Keyword(s):

Performance Model ◽

Analytical Performance ◽

Cloud Applications
