Influence of Job Runtime Prediction on Scheduling Quality

2021 ◽ Vol 42 (11) ◽ pp. 2562-2570
Author(s): G. I. Savin, D. S. Lyakhovets, A. V. Baranov
2019 ◽ Vol 35 (18) ◽ pp. 3453-3460
Author(s): Anastasia Tyryshkina, Nate Coraor, Anton Nekrutenko

Abstract
Motivation: One of the many technical challenges that arise when scheduling bioinformatics analyses at scale is determining the appropriate amount of memory and processing resources. Both over- and under-allocation lead to inefficient use of computational infrastructure: over-allocation locks resources that could otherwise be used for other analyses, while under-allocation causes job failures and requires analyses to be repeated with a larger memory or runtime allowance. We address this challenge by using a historical dataset of bioinformatics analyses run on the Galaxy platform to demonstrate the feasibility of an online service for resource requirement estimation.
Results: Here we introduce the Galaxy job run dataset and test popular machine learning models on the task of resource usage prediction. We include three popular forest models, the extra trees regressor, the gradient boosting regressor and the random forest regressor, and find that random forests perform best on the runtime prediction task. We also present two methods of choosing walltimes for previously unseen jobs. Quantile regression forests are more accurate in their predictions and allow performance to be improved by changing the confidence of the estimates; however, the sizes of the confidence intervals are variable and cannot be absolutely constrained. Random forest classifiers address this problem by providing control over the size of the prediction intervals with an accuracy comparable to that of the regressor. We show that estimating the memory requirements of a job is possible using the same methods, which, as far as we know, has not been done before. Such estimation can be highly beneficial for accurate resource allocation.
Availability and implementation: Source code is available at https://github.com/atyryshkina/algorithm-performance-analysis, implemented in Python.
Supplementary information: Supplementary data are available at Bioinformatics online.
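To make the forest-based approach concrete, the sketch below shows how a runtime predictor of this kind might be trained with scikit-learn and how a conservative walltime could be derived from the spread of per-tree predictions (a rough stand-in for a quantile regression forest). This is not the authors' implementation; the CSV file name and feature columns are hypothetical.

```python
# Minimal sketch, not the paper's code: random-forest runtime prediction
# plus a quantile-based walltime suggestion. File and column names are
# assumptions for illustration only.
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical historical job log: one row per finished job.
jobs = pd.read_csv("galaxy_job_history.csv")                   # assumed file
features = ["input_size_mb", "tool_id_encoded", "cpu_cores"]   # assumed columns
X = jobs[features].values
y = jobs["runtime_seconds"].values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Point estimate of runtime for unseen jobs.
runtime_pred = model.predict(X_test)

# Crude upper bound for a walltime: take a high quantile over the
# individual trees' predictions instead of the forest mean.
per_tree = np.stack([tree.predict(X_test) for tree in model.estimators_])
walltime = np.quantile(per_tree, 0.95, axis=0)

print("mean predicted runtime:", runtime_pred.mean())
print("mean suggested walltime:", walltime.mean())
```

Raising or lowering the 0.95 quantile trades fewer walltime violations against tighter resource reservations, mirroring the confidence-level trade-off described in the abstract.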


2011 ◽ Vol 6 (7)
Author(s): Peifeng Li, Qiaoming Zhu, Qin Ji, Xiaoxu Zhu

2021 ◽ Vol 11 (20) ◽ pp. 9448
Author(s): Qiqi Wang, Hongjie Zhang, Cheng Qu, Yu Shen, Xiaohui Liu, ...

The job scheduler plays a vital role in high-performance computing (HPC) platforms. It determines the execution order of jobs and the allocation of resources, which in turn affect the resource utilization of the entire system. As the scale and complexity of HPC systems continue to grow, job scheduling becomes increasingly important and difficult. Existing studies rely on user-specified estimates or regression techniques to produce fixed runtime prediction values and use those values in static heuristic scheduling algorithms. However, these approaches require very accurate runtime predictions to produce good results, and fixed heuristic scheduling strategies cannot adapt to changes in the workload. In this work, we propose RLSchert, a job scheduler based on deep reinforcement learning and remaining-runtime prediction. First, RLSchert estimates the state of the system with a dynamic job remaining-runtime predictor, thereby providing an accurate spatiotemporal view of the cluster status. Second, RLSchert learns the optimal policy for selecting or killing jobs according to that status through imitation learning and the proximal policy optimization (PPO) algorithm. Extensive experiments on real-world job logs from the USTC Supercomputing Center show that RLSchert is superior to static heuristic policies and outperforms the learning-based scheduler DeepRM. In addition, the dynamic predictor gives a more accurate remaining-runtime prediction, which is essential for most learning-based schedulers.
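As a rough illustration of how a learned scheduler can consume a remaining-runtime prediction, the sketch below defines a small policy network that scores queued jobs from feature vectors (one feature being the predicted remaining runtime) and samples the job to schedule next, as a PPO-style agent would. This is not RLSchert itself; the feature layout, queue size, and network dimensions are assumptions.

```python
# Illustrative sketch only, not RLSchert: a policy network that ranks queued
# jobs using features that include a predicted remaining runtime, then samples
# an action from a categorical distribution (the form a PPO scheduler uses).
import torch
import torch.nn as nn

class SchedulerPolicy(nn.Module):
    def __init__(self, job_feature_dim: int, hidden: int = 64):
        super().__init__()
        # Scores one queued job at a time; the same weights are shared
        # across every slot in the queue.
        self.net = nn.Sequential(
            nn.Linear(job_feature_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, queue: torch.Tensor) -> torch.distributions.Categorical:
        # queue: (num_queued_jobs, job_feature_dim); one feature is the
        # predicted remaining runtime of the job.
        logits = self.net(queue).squeeze(-1)      # (num_queued_jobs,)
        return torch.distributions.Categorical(logits=logits)

# Hypothetical queue of 5 jobs, each described by 4 features
# (e.g. requested cores, requested memory, waiting time, predicted runtime).
queue_state = torch.rand(5, 4)
policy = SchedulerPolicy(job_feature_dim=4)
dist = policy(queue_state)
action = dist.sample()                # index of the job to schedule next
log_prob = dist.log_prob(action)      # stored for the later PPO update
print("schedule job:", action.item())
```

Because the remaining-runtime estimate is recomputed as jobs run, the state the policy sees stays current, which is the property the abstract credits for RLSchert's advantage over fixed-prediction heuristics.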

