Performance Measurement and Analysis of Large-Scale Parallel Applications on Leadership Computing Systems

Developers of applications with large-scale computing requirements are currently presented with a variety of high-performance systems optimised for message-passing; however, effectively exploiting the available computing resources remains a major challenge. In addition to fundamental application scalability characteristics, application and system peculiarities often only manifest at extreme scales, requiring highly scalable performance measurement and analysis tools that are convenient to incorporate in application development and tuning activities. We present our experiences with a multigrid solver benchmark and state-of-the-art real-world applications for numerical weather prediction and computational fluid dynamics, on three quite different multi-thousand-processor supercomputer systems – Cray XT3/4, MareNostrum & Blue Gene/L – using the newly-developed SCALASCA toolset to quantify and isolate a range of significant performance issues.


Performance Measurement and Analysis of High-Performance Parallel Applications over Lambda Grid

The 9th International Conference on Advanced Communication Technology ◽

10.1109/icact.2007.358469 ◽

2007 ◽

Author(s):

Dongwook Kim ◽

Hyun-Wook Jin ◽

Karpjoo Jeong ◽

Jonghyun Lee ◽

Minki Noh

Keyword(s):

Performance Measurement ◽

High Performance ◽

Parallel Applications ◽

Measurement And Analysis


Statistical and machine learning models for optimizing energy in parallel applications

The International Journal of High Performance Computing Applications ◽

10.1177/1094342019842915 ◽

2019 ◽

Vol 33 (6) ◽

pp. 1079-1097 ◽

Cited By ~ 2

Author(s):

Mark Endrei ◽

Chao Jin ◽

Minh Ngoc Dinh ◽

David Abramson ◽

Heidi Poxon ◽

...

Keyword(s):

Machine Learning ◽

Energy Efficiency ◽

High Performance ◽

Large Scale ◽

Energy Use ◽

Parallel Applications ◽

Learning Models ◽

Trade Off ◽

Time Required ◽

Machine Learning Models

Rising power costs and constraints are driving a growing focus on the energy efficiency of high performance computing systems. The unique characteristics of a particular system and workload and their effect on performance and energy efficiency are typically difficult for application users to assess and to control. Settings for optimum performance and energy efficiency can also diverge, so we need to identify trade-off options that guide a suitable balance between energy use and performance. We present statistical and machine learning models that only require a small number of runs to make accurate Pareto-optimal trade-off predictions using parameters that users can control. We study model training and validation using several parallel kernels and more complex workloads, including Algebraic Multigrid (AMG), Large-scale Atomic/Molecular Massively Parallel Simulator, and Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics. We demonstrate that we can train the models using as few as 12 runs, with prediction error of less than 10%. Our AMG results identify trade-off options that provide up to 45% improvement in energy efficiency for around 10% performance loss. We reduce the sample measurement time required for AMG by 90%, from 13 h to 74 min.
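
To make the notion of a Pareto-optimal trade-off between runtime and energy concrete, here is a minimal Python sketch. It is not the authors' model: the function name `pareto_front` and the sample measurements are hypothetical, and it only filters measured (runtime, energy) pairs rather than predicting them.

```python
from typing import List, Tuple

# (runtime_seconds, energy_joules); both quantities are to be minimised.
Run = Tuple[float, float]


def pareto_front(runs: List[Run]) -> List[Run]:
    """Return the runs that no other run dominates, sorted by runtime."""
    front: List[Run] = []
    best_energy = float("inf")
    # After sorting by (runtime, energy), a run belongs to the front only if
    # it uses strictly less energy than every run that is at least as fast.
    for runtime, energy in sorted(runs):
        if energy < best_energy:
            front.append((runtime, energy))
            best_energy = energy
    return front


# Made-up measurements, for illustration only.
samples = [(100.0, 50.0), (110.0, 30.0), (105.0, 55.0), (130.0, 28.0), (120.0, 35.0)]
front = pareto_front(samples)
```

Each point on `front` is a candidate setting: moving along it trades extra runtime for lower energy, which is the kind of option the abstract describes.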


A lightweight approach to performance portability with targetDP

The International Journal of High Performance Computing Applications ◽

10.1177/1094342016682071 ◽

2016 ◽

Vol 32 (2) ◽

pp. 288-301

Author(s):

Alan Gray ◽

Kevin Stratford

Keyword(s):

Particle Physics ◽

Message Passing ◽

Graphics Processing Units ◽

High Performance ◽

Large Scale ◽

Message Passing Interface ◽

Graphics Processing Unit ◽

Processing Unit ◽

Performance Portability ◽

Graphics Processing

Leading high performance computing systems achieve their status through use of highly parallel devices such as NVIDIA graphics processing units or Intel Xeon Phi many-core CPUs. The concept of performance portability across such architectures, as well as traditional CPUs, is vital for the application programmer. In this paper we describe targetDP, a lightweight abstraction layer which allows grid-based applications to target data parallel hardware in a platform agnostic manner. We demonstrate the effectiveness of our pragmatic approach by presenting performance results for a complex fluid application (with which the model was co-designed), plus a separate lattice quantum chromodynamics particle physics code. For each application, a single source code base is seen to achieve portable performance, as assessed within the context of the Roofline model. TargetDP can be combined with Message Passing Interface (MPI) to allow use on systems containing multiple nodes: we demonstrate this through provision of scaling results on traditional and graphics processing unit-accelerated large scale supercomputers.
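
The Roofline model mentioned above bounds attainable performance by the lesser of the peak compute rate and memory bandwidth times arithmetic intensity. A minimal Python sketch of that bound follows; the function names and the device figures are hypothetical and are not taken from the paper.

```python
def roofline_gflops(peak_gflops: float, bandwidth_gb_s: float,
                    flops_per_byte: float) -> float:
    """Attainable GFLOP/s under the Roofline model."""
    # Memory-bound kernels are capped by bandwidth * arithmetic intensity,
    # compute-bound kernels by the peak floating-point rate.
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


def fraction_of_roofline(measured_gflops: float, peak_gflops: float,
                         bandwidth_gb_s: float, flops_per_byte: float) -> float:
    """Measured performance as a fraction of the Roofline bound."""
    bound = roofline_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte)
    return measured_gflops / bound


# Hypothetical device: 1000 GFLOP/s peak, 200 GB/s memory bandwidth.
low_intensity = roofline_gflops(1000.0, 200.0, 0.5)    # memory-bound kernel
high_intensity = roofline_gflops(1000.0, 200.0, 10.0)  # compute-bound kernel
achieved = fraction_of_roofline(80.0, 1000.0, 200.0, 0.5)
```

Comparing `fraction_of_roofline` across devices is one way to judge whether a single code base performs comparably well relative to what each architecture allows.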


Beyond spatial scalability limitations with a massively parallel method for linear oscillatory problems

The International Journal of High Performance Computing Applications ◽

10.1177/1094342016687625 ◽

2017 ◽

Vol 32 (6) ◽

pp. 913-933 ◽

Cited By ~ 6

Author(s):

Martin Schreiber ◽

Pedro S Peixoto ◽

Terry Haut ◽

Beth Wingate

Keyword(s):

High Performance ◽

Large Scale ◽

Weather Prediction ◽

Finite Difference Methods ◽

Scaling Limit ◽

Performance Model ◽

Massively Parallel ◽

Large Set ◽

Problem Size ◽

Single Node

This paper presents, discusses and analyses a massively parallel-in-time solver for linear oscillatory partial differential equations, which is a key numerical component for evolving weather, ocean, climate and seismic models. The time parallelization in this solver allows us to significantly exceed the computing resources used by parallelization-in-space methods and results in a correspondingly reduced wall-clock time. One of the major difficulties of achieving Exascale performance for weather prediction is that strong scaling – the parallel performance for a fixed problem size with an increasing number of processors – saturates. A main avenue to circumvent this problem is to introduce new numerical techniques that take advantage of time parallelism. In this paper, we use a time-parallel approximation that retains the frequency information of oscillatory problems. This approximation is based on (a) reformulating the original problem into a large set of independent terms and (b) solving each of these terms independently of the others, which can now be accomplished on a large number of high-performance computing resources. Our experiments are conducted on up to 3586 cores for problem sizes with the parallelization-in-space scalability limited already on a single node. We gain significant reductions in the time-to-solution of 118.3× for spectral methods and 1503.0× for finite-difference methods with the parallelization-in-time approach. A developed and calibrated performance model gives the scalability limitations a priori for this new approach and allows us to extrapolate the performance of the method towards large-scale systems. This work has the potential to contribute as a basic building block of parallelization-in-time approaches, with possible major implications in applied areas modelling oscillatory dominated problems.
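
The saturation of strong scaling can be illustrated with Amdahl's law, a standard textbook model and not the calibrated performance model from the paper. In the Python sketch below, the function names and the 5% serial fraction are assumptions chosen for illustration.

```python
def amdahl_speedup(serial_fraction: float, n_procs: int) -> float:
    """Speedup on n_procs for a fixed problem size under Amdahl's law."""
    # Only the parallel part (1 - serial_fraction) shrinks with more processors.
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_procs)


def strong_scaling_efficiency(serial_fraction: float, n_procs: int) -> float:
    """Fraction of ideal (linear) speedup that is actually achieved."""
    return amdahl_speedup(serial_fraction, n_procs) / n_procs


# Assumed 5% non-parallelisable work: speedup can never exceed 1 / 0.05 = 20.
speedups = {p: amdahl_speedup(0.05, p) for p in (1, 16, 256, 4096)}
efficiency_at_4096 = strong_scaling_efficiency(0.05, 4096)
```

Under this model, adding processors beyond a certain point yields almost no further speedup, which is why exposing an additional dimension of parallelism, such as time, is attractive.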


Periodic hierarchical load balancing for large supercomputers

The International Journal of High Performance Computing Applications ◽

10.1177/1094342010394383 ◽

2011 ◽

Vol 25 (4) ◽

pp. 371-385 ◽

Cited By ~ 34

Author(s):

Gengbin Zheng ◽

Abhinav Bhatelé ◽

Esteban Meneses ◽

Laxmikant V. Kalé

Keyword(s):

Load Balancing ◽

Large Scale ◽

Parallel Machines ◽

National Laboratory ◽

Argonne National Laboratory ◽

Parallel Applications ◽

Scientific Application ◽

Computing Center ◽

Blue Gene ◽

Advanced Computing

Large parallel machines with hundreds of thousands of processors are becoming more prevalent. Ensuring good load balance is critical for scaling certain classes of parallel applications on even thousands of processors. Centralized load balancing algorithms suffer from scalability problems, especially on machines with a relatively small amount of memory. Fully distributed load balancing algorithms, on the other hand, tend to take longer to arrive at good solutions. In this paper, we present an automatic dynamic hierarchical load balancing method that overcomes the scalability challenges of centralized schemes and longer running times of traditional distributed schemes. Our solution overcomes these issues by creating multiple levels of load balancing domains which form a tree. This hierarchical method is demonstrated within a measurement-based load balancing framework in Charm++. We discuss techniques to deal with scalability challenges of load balancing at very large scale. We present performance data of the hierarchical load balancing method on up to 16,384 cores of Ranger (at the Texas Advanced Computing Center) and 65,536 cores of Intrepid (the Blue Gene/P at Argonne National Laboratory) for a synthetic benchmark. We also demonstrate the successful deployment of the method in a scientific application, NAMD, with results on Intrepid.
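
The idea of load balancing domains arranged in a tree can be sketched with a two-level example in Python. This is an illustrative simplification, not the Charm++ implementation: the function names, the donor rule at the root, and the greedy largest-first strategy inside each group are all assumptions.

```python
import heapq
from typing import List


def greedy_assign(loads: List[float], n_procs: int) -> List[List[float]]:
    """Largest-first greedy: give each object to the least-loaded processor."""
    heap = [(0.0, p) for p in range(n_procs)]  # (current load, processor index)
    bins: List[List[float]] = [[] for _ in range(n_procs)]
    for load in sorted(loads, reverse=True):
        total, p = heapq.heappop(heap)
        bins[p].append(load)
        heapq.heappush(heap, (total + load, p))
    return bins


def hierarchical_balance(groups: List[List[float]],
                         procs_per_group: int) -> List[List[List[float]]]:
    """Two-level balance: the root evens out group totals, then each group
    balances its own processors independently."""
    groups = [sorted(g) for g in groups]  # work on copies, lightest object first
    target = sum(map(sum, groups)) / len(groups)
    # Root level: an overloaded group hands objects to the currently lightest
    # group while it would still stay above the average afterwards.
    for donor in groups:
        while donor and sum(donor) - donor[0] > target:
            receiver = min(groups, key=sum)
            receiver.append(donor.pop(0))
    # Leaf level: each group only sees its own objects and processors.
    return [greedy_assign(g, procs_per_group) for g in groups]


# Three groups of two processors each; object loads are made up.
placement = hierarchical_balance([[4.0, 4.0, 4.0, 4.0], [1.0, 1.0], [2.0]], 2)
group_totals = [sum(map(sum, group)) for group in placement]
max_proc_load = max(sum(proc) for group in placement for proc in group)
```

In this sketch the root moves whole objects between groups and each group then balances only its own objects, so no single step has to place every object on every processor, which is the scalability argument the abstract makes.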


Pruners: Providing reproducibility for uncovering non-deterministic errors in runs on supercomputers

The International Journal of High Performance Computing Applications ◽

10.1177/1094342019834621 ◽

2019 ◽

Vol 33 (5) ◽

pp. 777-783 ◽

Cited By ~ 1

Author(s):

Kento Sato ◽

Ignacio Laguna ◽

Gregory L Lee ◽

Martin Schulz ◽

Christopher M Chambreau ◽

...

Keyword(s):

Real World ◽

High Performance ◽

Parallel Applications ◽

Program Execution ◽

World Production ◽

Application Development ◽

Full System ◽

Scientific Simulations ◽

Large Application

Large scientific simulations must be able to achieve the full-system potential of supercomputers. When they tap into high-performance features, however, a phenomenon known as non-determinism may be introduced in their program execution, which significantly hampers application development. Pruners is a new toolset to detect and remedy non-deterministic bugs and errors in large parallel applications. To show the capabilities of Pruners for large application development, we also demonstrate its early usage on real-world production applications.


A Message-Passing Hardware/Software Cosimulation Environment for Reconfigurable Computing Systems

International Journal of Reconfigurable Computing ◽

10.1155/2009/376232 ◽

2009 ◽

Vol 2009 ◽

pp. 1-9

Author(s):

Manuel Saldaña ◽

Emanuel Ramalho ◽

Paul Chow

Keyword(s):

Reconfigurable Computing ◽

Message Passing ◽

High Performance ◽

System Level ◽

Application Development ◽

Reconfigurable Computers ◽

Development Tool ◽

Verification Tools ◽

Linpack Benchmark ◽

Xilinx Fpga

High-performance reconfigurable computers (HPRCs) provide a mix of standard processors and FPGAs to collectively accelerate applications. This introduces new design challenges, such as the need for portable programming models across HPRCs and system-level verification tools. To address the need for cosimulating a complete heterogeneous application using both software and hardware in an HPRC, we have created a tool called the Message-passing Simulation Framework (MSF). We have used it to simulate and develop an interface enabling an MPI-based approach to exchange data between X86 processors and hardware engines inside FPGAs. The MSF can also be used as an application development tool that enables multiple FPGAs in simulation to exchange messages amongst themselves and with X86 processors. As an example, we simulate a LINPACK benchmark hardware core using an Intel-FSB-Xilinx-FPGA platform to quickly prototype the hardware, to test the communications, and to verify the benchmark results.


Demonstration of cluster computing for three-dimensional CFD simulations

The Aeronautical Journal ◽

10.1017/s0001924000028037 ◽

1999 ◽

Vol 103 (1027) ◽

pp. 443-447 ◽

Cited By ~ 5

Author(s):

W. McMillan ◽

M. Woodgate ◽

B. E. Richards ◽

B. J. Gribben ◽

K. J. Badcock ◽

...

Keyword(s):

Message Passing ◽

Large Scale ◽

Cluster Computing ◽

Low Cost ◽

Three Dimensional ◽

Cost Effective ◽

Parallel Applications ◽

Cfd Simulations ◽

Single Node ◽

Computing Unit

Motivated by a lack of sufficient local and national computing facilities for computational fluid dynamics simulations, the Affordable Systems Computing Unit (ASCU) was established to investigate low cost alternatives. The options considered have all involved cluster computing, a term which refers to the grouping of a number of components into a managed system capable of running both serial and parallel applications. The present work aims to demonstrate the utility of commodity processors for dedicated batch processing. The performance of the cluster has proved to be extremely cost effective, enabling large three dimensional flow simulations on a computer costing less than £25k sterling at current market prices. The experience gained on this system in terms of single node performance, message passing and parallel performance will be discussed. In particular, comparisons with the performance of other systems will be made. Several medium-large scale CFD simulations performed using the new cluster will be presented to demonstrate the potential of commodity processor based parallel computers for aerodynamic simulation.


SDNOFS: Software Defined Networking with Openflow Switches & BCN-ECN with ALTQ for Congestion Avoidance

International Journal of Recent Technology and Engineering - 2 ◽

10.35940/ijrte.d8873.018520 ◽

2020 ◽

Vol 8 (5) ◽

pp. 3710-3719

Keyword(s):

High Performance Computing ◽

Message Passing ◽

High Performance ◽

Congestion Management ◽

Congestion Avoidance ◽

Software Defined Networks ◽

Application Development ◽

Open Flow ◽

Business Requirements ◽

Performance Computing

This work concerns a high-performance computing cluster in a cloud environment. High-performance computing (HPC) helps scientists and researchers to solve complex problems involving multiple computational capabilities. The main reason for using a message-passing model is to promote application development, porting, and execution on the variety of parallel computers that can support the paradigm. Since congestion avoidance is critical for the efficient use of different applications, an efficient method for congestion management in software-defined networks based on the OpenFlow protocol is presented. The paper proposes two methods. First, to avoid the congestion problem, it uses Software Defined Networks (SDN) with OpenFlow switches; OpenFlow was originally defined as a communication protocol in SDN environments that allows the SDN controller to interact directly with the forwarding plane of network devices such as switches and routers, both physical and virtual (hypervisor-based), so that the network can better adapt to changing business requirements. Second, to enhance the quality of service and avoid the congestion problem, it uses BCN-ECN with ALTQ. Compared with the existing method, the SDN OpenFlow switches and BCN-ECN with ALTQ provide 98% accuracy. Use of the proposed methods improves parameters such as delay time, level of congestion, quality time and execution time.
