Detailed Load Balance Analysis of Large Scale Parallel Applications

Rising power costs and constraints are driving a growing focus on the energy efficiency of high performance computing systems. The unique characteristics of a particular system and workload, and their effect on performance and energy efficiency, are typically difficult for application users to assess and control. Settings for optimum performance and for optimum energy efficiency can also diverge, so we need to identify trade-off options that guide a suitable balance between energy use and performance. We present statistical and machine learning models that require only a small number of runs to make accurate Pareto-optimal trade-off predictions using parameters that users can control. We study model training and validation using several parallel kernels and more complex workloads, including Algebraic Multigrid (AMG), the Large-scale Atomic/Molecular Massively Parallel Simulator (LAMMPS), and Livermore Unstructured Lagrangian Explicit Shock Hydrodynamics (LULESH). We demonstrate that we can train the models using as few as 12 runs, with prediction error of less than 10%. Our AMG results identify trade-off options that provide up to 45% improvement in energy efficiency for around 10% performance loss. We reduce the sample measurement time required for AMG by 90%, from 13 h to 74 min.
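Once a model predicts runtime and energy for each candidate parameter setting, "Pareto-optimal" means keeping only the settings that no other setting beats on both objectives at once. The sketch below shows that filtering step only; it is not the authors' prediction models, and the configuration names and (runtime, energy) values are hypothetical.

```python
from typing import NamedTuple


class Config(NamedTuple):
    name: str
    runtime_s: float  # lower is better
    energy_j: float   # lower is better


def dominates(a: Config, b: Config) -> bool:
    """True if a is no worse than b on both objectives and strictly better on one."""
    return (a.runtime_s <= b.runtime_s and a.energy_j <= b.energy_j
            and (a.runtime_s < b.runtime_s or a.energy_j < b.energy_j))


def pareto_front(configs: list[Config]) -> list[Config]:
    """Keep only non-dominated configurations, ordered from fastest to slowest."""
    front = [c for c in configs if not any(dominates(o, c) for o in configs)]
    return sorted(front, key=lambda c: c.runtime_s)


# Hypothetical predicted (runtime, energy) values for four parameter settings.
configs = [
    Config("a", 100.0, 500.0),
    Config("b", 110.0, 300.0),
    Config("c", 120.0, 350.0),  # dominated by "b": slower and uses more energy
    Config("d", 95.0, 700.0),
]
front = pareto_front(configs)
print([c.name for c in front])  # ['d', 'a', 'b']
```

Each point on the front is a trade-off a user can choose: in this made-up data, moving from "d" to "b" saves about 57% of the energy for about 16% longer runtime.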

PeerFlow: Secure Load Balancing in Tor

Proceedings on Privacy Enhancing Technologies, 2017, Vol 2017 (2), pp. 74-94. DOI: 10.1515/popets-2017-0017

Cited By ~ 7

Author(s): Aaron Johnson, Rob Jansen, Nicholas Hopper, Aaron Segal, Paul Syverson

Keyword(s): Load Balancing, Load Balance, Large Scale, Scanning System, Large Scale Network, Improve Accuracy, Scale Network, Network Simulations, Improved Design, Speed And Accuracy

We present PeerFlow, a system to securely load balance client traffic in Tor. Security in Tor requires that no adversary handle too much traffic. However, Tor relays are run by volunteers who cannot be trusted to report the relay bandwidths, which Tor clients use for load balancing. We show that existing methods to determine the bandwidths of Tor relays allow an adversary with little bandwidth to attack large amounts of client traffic. These methods include Tor's current bandwidth-scanning system, TorFlow, and the peer-measurement system EigenSpeed. We present an improved design called PeerFlow that uses a peer-measurement process both to limit an adversary's ability to increase his measured bandwidth and to improve accuracy. We show our system to be secure, fast, and efficient. We implement PeerFlow in Tor and demonstrate its speed and accuracy in large-scale network simulations.
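The core difficulty above is that a low-bandwidth adversary should not be able to inflate a relay's measured bandwidth much. A standard way to make peer reports resistant to a minority of liars is a weighted median instead of a weighted mean. The sketch below is a generic illustration of that idea, not PeerFlow's specified aggregation rule; the report values and the per-measurer weights (for instance, each measurer's own bandwidth) are hypothetical.

```python
def weighted_median(reports: list[tuple[float, float]]) -> float:
    """Smallest reported value at which the cumulative weight reaches half the total.

    reports: (measured_bandwidth, measurer_weight) pairs; weights are hypothetical.
    """
    if not reports:
        raise ValueError("no reports")
    ordered = sorted(reports)
    half = sum(w for _, w in ordered) / 2
    cumulative = 0.0
    for value, weight in ordered:
        cumulative += weight
        if cumulative >= half:
            return value
    return ordered[-1][0]  # fallback; only reachable with non-positive weights


def weighted_mean(reports: list[tuple[float, float]]) -> float:
    total = sum(w for _, w in reports)
    return sum(v * w for v, w in reports) / total


# Hypothetical reports (MB/s) about one relay: three honest measurers and one
# low-weight adversarial measurer that wildly over-reports.
reports = [(98.0, 30.0), (100.0, 30.0), (102.0, 30.0), (10000.0, 10.0)]
print(weighted_mean(reports))    # 1090.0: pulled far up by one liar
print(weighted_median(reports))  # 100.0: unaffected by the outlier
```

With the median, an adversary controlling less than half of the total weight cannot push the estimate beyond the range of honest reports, whereas with the mean a single extreme report moves it arbitrarily far.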

Performance Prediction for Large-Scale Parallel Applications Using Representative Replay

IEEE Transactions on Computers, 2016, Vol 65 (7), pp. 2184-2198. DOI: 10.1109/tc.2015.2479630

Cited By ~ 9

Author(s): Jidong Zhai, Wenguang Chen, Weimin Zheng, Keqin Li

Keyword(s): Performance Prediction, Large Scale, Parallel Applications

Periodic hierarchical load balancing for large supercomputers

The International Journal of High Performance Computing Applications, 2011, Vol 25 (4), pp. 371-385. DOI: 10.1177/1094342010394383

Cited By ~ 34

Author(s): Gengbin Zheng, Abhinav Bhatelé, Esteban Meneses, Laxmikant V. Kalé

Keyword(s): Load Balancing, Large Scale, Parallel Machines, National Laboratory, Argonne National Laboratory, Parallel Applications, Scientific Application, Computing Center, Blue Gene, Advanced Computing

Large parallel machines with hundreds of thousands of processors are becoming more prevalent. Ensuring good load balance is critical for scaling certain classes of parallel applications on even thousands of processors. Centralized load balancing algorithms suffer from scalability problems, especially on machines with a relatively small amount of memory. Fully distributed load balancing algorithms, on the other hand, tend to take longer to arrive at good solutions. In this paper, we present an automatic dynamic hierarchical load balancing method that overcomes the scalability challenges of centralized schemes and longer running times of traditional distributed schemes. Our solution overcomes these issues by creating multiple levels of load balancing domains which form a tree. This hierarchical method is demonstrated within a measurement-based load balancing framework in Charm++. We discuss techniques to deal with scalability challenges of load balancing at very large scale. We present performance data of the hierarchical load balancing method on up to 16,384 cores of Ranger (at the Texas Advanced Computing Center) and 65,536 cores of Intrepid (the Blue Gene/P at Argonne National Laboratory) for a synthetic benchmark. We also demonstrate the successful deployment of the method in a scientific application, NAMD, with results on Intrepid.
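The idea of splitting load balancing into a tree of domains can be illustrated with two levels: the root sees only per-group load totals and moves work between groups, and each group then balances its own processors independently. This is a hypothetical sketch using simple greedy heuristics; it is not the Charm++ hierarchical balancer described in the paper, and the function names and the migration rule are assumptions.

```python
import heapq


def greedy_assign(loads: list[float], n_procs: int) -> list[float]:
    """Greedy LPT: place each task, largest first, on the least-loaded processor.

    Returns the resulting total load on each processor.
    """
    heap = [(0.0, p) for p in range(n_procs)]  # (current load, processor id)
    heapq.heapify(heap)
    proc_loads = [0.0] * n_procs
    for load in sorted(loads, reverse=True):
        current, p = heapq.heappop(heap)
        proc_loads[p] = current + load
        heapq.heappush(heap, (proc_loads[p], p))
    return proc_loads


def hierarchical_balance(groups: list[list[float]], procs_per_group: int) -> list[list[float]]:
    """Two-level tree: balance across groups at the root, then inside each group.

    groups: task loads currently owned by each leaf domain (needs at least two groups).
    """
    groups = [sorted(g, reverse=True) for g in groups]  # copies, largest tasks first
    totals = [sum(g) for g in groups]  # the root only needs these per-group sums
    for d in range(len(groups)):
        kept = []
        for load in groups[d]:
            # Root decision: the least-loaded other group is the migration target.
            r = min((i for i in range(len(groups)) if i != d), key=lambda i: totals[i])
            if totals[r] + load < totals[d]:  # migrate only if it narrows the gap
                groups[r].append(load)
                totals[r] += load
                totals[d] -= load
            else:
                kept.append(load)
        groups[d] = kept
    # Leaf level: each group balances its own processors independently.
    return [greedy_assign(g, procs_per_group) for g in groups]


# Hypothetical: 3 groups of 2 processors; group 0 starts heavily overloaded.
result = hierarchical_balance([[5, 4, 3, 3], [1, 1], [2, 2, 1]], procs_per_group=2)
print([sum(p) for p in result])  # [8.0, 7.0, 7.0]
```

The point of the split is that no single node ever has to hold every task's load: the root works with one number per group, and the detailed per-processor decisions stay local, which is what lets the scheme avoid the memory and time bottleneck of a fully centralized balancer.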

Load Balance Optimization Based Multi-robot Cooperative Task Planning for Large-Scale Aerospace Structures

2021, pp. 797-809. DOI: 10.1007/978-3-030-89098-8_75

Author(s): Jiamei Lin, Wei Tian, Pengcheng Li, Shaorui Lu

Keyword(s): Load Balance, Large Scale, Task Planning, Aerospace Structures, Multi Robot

FastMM: an efficient toolbox for personalized constraint-based metabolic modeling

BMC Bioinformatics, 2020, Vol 21 (1). DOI: 10.1186/s12859-020-3410-4

Cited By ~ 3

Author(s): Gong-Hua Li, Shaoxing Dai, Feifei Han, Wenxing Li, Jingfei Huang, ...

Keyword(s): Flux Balance Analysis, Large Scale, Computing Time, Metabolic Modeling, The Cancer Genome Atlas, Flux Balance, MCMC Sampling, Genome Wide, Balance Analysis, User Friendly

Background: Constraint-based metabolic modeling has been applied to understand metabolism-related disease mechanisms, to predict potential new drug targets and anti-metabolites, and to identify biomarkers of complex diseases. Although the state-of-the-art modeling toolbox, COBRA 3.0, is powerful, it requires substantial computing time to conduct flux balance analysis, knockout analysis, and Markov Chain Monte Carlo (MCMC) sampling, which may limit its application in large-scale genome-wide analysis.

Results: Here, we rewrote the underlying code of COBRA 3.0 in C/C++ and developed a toolbox, termed FastMM, to conduct constraint-based metabolic modeling efficiently. FastMM was 2 to 400 times faster than COBRA 3.0 in performing flux balance analysis and knockout analysis, while returning consistent outputs. For MCMC sampling, FastMM was 8 times faster than COBRA 3.0. FastMM is also faster than some efficient metabolic modeling applications, such as Cobrapy and Fast-SL. In addition, we developed a Matlab/Octave interface for fast metabolic modeling. This interface is fully compatible with COBRA 3.0, enabling users to easily perform complex metabolic modeling applications. For example, users without deep knowledge of constraint-based metabolic models can perform personalized metabolic modeling with a single command in Matlab/Octave. Users can also set the advanced and multithreading parameters for complex metabolic modeling. Thus, we provide an efficient and user-friendly solution for large-scale genome-wide metabolic modeling; for example, FastMM can be applied to model the individual cancer metabolic profiles of hundreds to thousands of samples in The Cancer Genome Atlas (TCGA).

Conclusion: FastMM is an efficient and user-friendly toolbox for large-scale personalized constraint-based metabolic modeling. It can serve as a complementary and valuable improvement to the existing functionality of COBRA 3.0. FastMM is released under the GPL and is freely available on GitHub: https://github.com/GonghuaLi/FastMM.
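For readers unfamiliar with the term, flux balance analysis is conventionally posed as a linear program over reaction fluxes. The formulation below is the textbook version, not a description of FastMM's internal solver, which the abstract does not specify. Here $S$ is the $m \times n$ stoichiometric matrix ($m$ metabolites, $n$ reactions), $v$ the flux vector, $c$ the objective weights (often selecting a biomass reaction), and $v_{\min}, v_{\max}$ the flux bounds:

```latex
\begin{aligned}
\max_{v}\quad & c^{\top} v \\
\text{s.t.}\quad & S\,v = 0, \\
& v_{\min} \le v \le v_{\max}.
\end{aligned}
```

A single-reaction knockout analysis re-solves the same program with $v_{\min,j} = v_{\max,j} = 0$ for the knocked-out reaction $j$, which is why knockout scans over thousands of reactions multiply the cost of each solve and benefit from a faster implementation.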

Performance-Aware Scheduling of Parallel Applications on Non-Dedicated Clusters

Electronics, 2019, Vol 8 (9), pp. 982. DOI: 10.3390/electronics8090982

Cited By ~ 1

Author(s): Alberto Cascajo, David E. Singh, Jesus Carretero

Keyword(s): Large Scale, Job Scheduling, Parallel Applications, Data Staging, Performance Improvements, Practical Evaluation, Significant Performance, Scalable Monitoring, And Control, New Strategies

This work presents an HPC framework that provides new strategies for resource management and job scheduling based on executing different applications on shared compute nodes to maximize platform utilization. The framework includes a scalable monitoring tool that is able to analyze the utilization of the platform's compute nodes. We also introduce an extension of CLARISSE, a middleware for data-staging coordination and control on large-scale HPC platforms, that uses the information provided by the monitor in combination with application-level analysis to detect performance degradation in the running applications. This degradation, caused by applications sharing compute nodes and competing for their resources, is avoided by means of dynamic application migration. A description of the architecture, together with a practical evaluation of the proposal, shows significant performance improvements of up to 20% in makespan and 10% in energy consumption compared to a non-optimized execution.
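The detect-then-migrate loop described above can be sketched as: compare each application's current iteration time against its unshared baseline, and move the worst-affected applications to less-utilized nodes. Everything here is hypothetical: the `AppSample` fields, the 1.2 slowdown threshold, and the per-application `node_share` are illustrative assumptions, not CLARISSE's API or the paper's policy.

```python
from dataclasses import dataclass


@dataclass
class AppSample:
    app: str
    node: str
    baseline_iter_s: float  # iteration time when the app ran without co-located jobs
    current_iter_s: float   # latest iteration time while sharing the node
    node_share: float       # fraction of a node's capacity the app uses (hypothetical)


def plan_migrations(samples, node_util, threshold=1.2):
    """Propose (app, from_node, to_node) moves for apps slowed down beyond threshold."""
    util = dict(node_util)  # node -> utilization in [0, 1], e.g. from a monitor
    moves = []
    worst_first = sorted(samples, key=lambda s: s.current_iter_s / s.baseline_iter_s,
                         reverse=True)
    for s in worst_first:
        if s.current_iter_s / s.baseline_iter_s <= threshold:
            continue  # degradation too small to justify a migration
        target = min(util, key=util.get)  # least-utilized node
        if target != s.node and util[target] + s.node_share < util[s.node]:
            moves.append((s.app, s.node, target))
            util[target] += s.node_share  # account for the moved app
            util[s.node] -= s.node_share
    return moves


node_util = {"n1": 0.9, "n2": 0.3, "n3": 0.6}
samples = [
    AppSample("a", "n1", 10.0, 14.0, 0.4),  # 40% slower than baseline
    AppSample("b", "n1", 5.0, 5.5, 0.5),    # 10% slower: below threshold
    AppSample("c", "n3", 8.0, 8.0, 0.6),    # not degraded
]
print(plan_migrations(samples, node_util))  # [('a', 'n1', 'n2')]
```

Handling the most degraded application first and updating utilization after each move keeps one round of decisions from sending every degraded application to the same node.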

Quality of Service on Link Aggregation Network Virtualization for Docker Containers

Applied Mechanics and Materials, 2019, Vol 886, pp. 227-232. DOI: 10.4028/www.scientific.net/amm.886.227

Author(s): Yanapat Chuchuen, Kritwara Rattanaopas, Sarapee Chunkaew

Keyword(s): Traffic Control, Load Balance, High Speed, Large Scale, Network Virtualization, Network Bandwidth, Working Together, Multiple Network, Implementation Tool

Docker Engine is a powerful tool for platform-as-a-service (PaaS) cloud computing and benefits large-scale Internet services. Web services are a basic service for anyone who accesses the Internet, so web infrastructure must scale by using load-balancing web servers called reverse proxies. A large-scale web deployment therefore needs multiple web servers working together over high-speed links. Moreover, when multiple clusters share the same data center, each cluster's service must be assigned a priority and a quality level. We investigate load balancing over link aggregation with network QoS by using the pipework script and a traffic control tool on the front-end reverse proxy server of each cluster. Our research evaluates network QoS ratios of 50/50, 60/40, 70/30, and 80/20, and compares the network bandwidth of the two web reverse proxy clusters. The results show that our design and implementation tool can not only control network QoS on each web reverse proxy cluster in all load-balancing link aggregation modes (round-robin, XOR, and ALB) but also let those clusters access multiple network interfaces. In the experiments, the average network bandwidth in all QoS cases is around 200 MB per second for link aggregation of two gigabit interfaces.
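To make the aggregation modes and QoS ratios concrete, the sketch below contrasts round-robin link selection (every packet goes to the next link) with an XOR hash (all traffic between one pair of hosts stays on one link), and splits aggregate bandwidth by a QoS ratio. The XOR function is a simplified form of the Linux bonding driver's layer-2 hash on the last MAC octet, omitting the packet type field that newer kernels also mix in. ALB is not modeled, and the MAC addresses and function names are hypothetical.

```python
from itertools import cycle, islice


def round_robin_links(n_links: int):
    """balance-rr style: successive packets leave on successive links."""
    return cycle(range(n_links))


def xor_link(src_mac: str, dst_mac: str, n_links: int) -> int:
    """Simplified layer-2 XOR hash: XOR the last octet of each MAC, modulo link count."""
    src = int(src_mac.split(":")[-1], 16)
    dst = int(dst_mac.split(":")[-1], 16)
    return (src ^ dst) % n_links


def qos_shares(total_mbit_s: float, ratio: tuple[int, int]) -> tuple[float, float]:
    """Split aggregate bandwidth between two clusters, e.g. ratio (70, 30)."""
    a, b = ratio
    return total_mbit_s * a / (a + b), total_mbit_s * b / (a + b)


print(list(islice(round_robin_links(2), 4)))                   # [0, 1, 0, 1]
print(xor_link("02:42:ac:11:00:02", "02:42:ac:11:00:05", 2))  # 1
print(qos_shares(2000, (70, 30)))                              # (1400.0, 600.0)
print(2 * 1000 / 8)  # 250.0 MB/s: line rate of two 1 Gbit/s links
```

Assuming the two gigabit interfaces are 1 Gbit/s each, the reported average of around 200 MB per second corresponds to roughly 80% of the 250 MB/s theoretical line rate.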
