runtime prediction
Recently Published Documents


Total documents: 41 (five years: 12)
H-index: 5 (five years: 1)

2021 ◽  
Vol 42 (11) ◽  
pp. 2562-2570
Author(s):  
G. I. Savin ◽  
D. S. Lyakhovets ◽  
A. V. Baranov

2021 ◽  
Vol 11 (20) ◽  
pp. 9448
Author(s):  
Qiqi Wang ◽  
Hongjie Zhang ◽  
Cheng Qu ◽  
Yu Shen ◽  
Xiaohui Liu ◽  
...  

The job scheduler plays a vital role in high-performance computing (HPC) platforms. It determines the execution order of jobs and the allocation of resources, which in turn affect the resource utilization of the entire system. As the scale and complexity of HPC systems continue to grow, job scheduling is becoming increasingly important and difficult. Existing studies have relied on user-specified estimates or regression techniques to produce fixed runtime predictions and have used those values in static heuristic scheduling algorithms. However, these approaches require very accurate runtime predictions to produce good results, and fixed heuristic scheduling strategies cannot adapt to changes in the workload. In this work, we propose RLSchert, a job scheduler based on deep reinforcement learning and remaining runtime prediction. First, RLSchert estimates the state of the system by using a dynamic job remaining runtime predictor, thereby providing an accurate spatiotemporal view of the cluster status. Second, RLSchert learns the optimal policy to select or kill jobs according to that status through imitation learning and the proximal policy optimization algorithm. Extensive experiments on real-world job logs from the USTC Supercomputing Center showed that RLSchert is superior to static heuristic policies and outperforms the learning-based scheduler DeepRM. In addition, the dynamic predictor gives a more accurate remaining runtime prediction, which is essential for most learning-based schedulers.
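To make the idea of a "spatiotemporal view of the cluster status" concrete, the sketch below shows one simple way remaining-runtime estimates could be turned into a forecast of free nodes over time. It is an illustration only, not RLSchert's predictor or state encoding: the `RunningJob` fields, the `remaining_runtime` rule (predicted total minus elapsed time, floored at zero), and `nodes_free_at` are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunningJob:
    """Hypothetical record for a job that is currently running."""
    job_id: str
    nodes: int              # nodes the job occupies
    elapsed: float          # seconds the job has run so far
    predicted_total: float  # predicted total runtime in seconds (assumed given)


def remaining_runtime(job: RunningJob) -> float:
    """Estimated seconds left; never negative, even if the job overran its prediction."""
    return max(job.predicted_total - job.elapsed, 0.0)


def nodes_free_at(jobs: List[RunningJob], total_nodes: int, horizon: float) -> int:
    """Nodes expected to be free `horizon` seconds from now.

    A job still counts as busy if its estimated remaining runtime
    extends beyond the horizon.
    """
    busy = sum(j.nodes for j in jobs if remaining_runtime(j) > horizon)
    return total_nodes - busy


# Example input with made-up values.
jobs = [
    RunningJob("a", nodes=4, elapsed=100.0, predicted_total=160.0),
    RunningJob("b", nodes=2, elapsed=50.0, predicted_total=40.0),  # already overran
]
free_in_30s = nodes_free_at(jobs, total_nodes=8, horizon=30.0)
```

Because the estimate is recomputed from the elapsed time, the forecast changes as jobs progress, which is the property a fixed, submit-time prediction lacks.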


2021 ◽  
pp. 156-168
Author(s):  
Xiaofei Yue ◽  
Lan Shi ◽  
Yuhai Zhao ◽  
Hangxu Ji ◽  
Guoren Wang

2020 ◽  
Vol 38 (4) ◽  
pp. 819-839
Author(s):  
Hani Al-Sayeh ◽  
Stefan Hagedorn ◽  
Kai-Uwe Sattler

Apache Spark jobs often process huge data sets and therefore require runtimes in the range of minutes to hours. Being able to predict the runtime of such jobs is useful not only for knowing when a job will finish, but also for scheduling, for estimating the monetary cost of a cloud deployment, and for determining an appropriate cluster configuration, such as the number of nodes. However, predicting Spark job runtimes is much more challenging than predicting runtimes of standard database queries: cluster configuration and parameters have a significant performance impact, and jobs usually contain a lot of user-defined code, which makes it difficult to estimate cardinalities and execution costs. In this paper, we present a gray-box modeling methodology for runtime prediction of Apache Spark jobs. Our approach comprises two steps. In the first step, a white-box model for predicting the cardinalities of the input RDDs of each operator is built from prior knowledge about the job's behavior and application parameters such as the applied filters, number of iterations, etc. In the second step, a black-box model is used for each task; it is constructed by monitoring runtime metrics while varying the allocated resources and the input RDD cardinalities. We further show how to use this gray-box approach not only for predicting the runtime of a given job, but also as part of a decision model for reusing intermediate cached results of Spark jobs. Our methodology is validated by an experimental evaluation showing a highly accurate prediction of the actual job runtime and a performance improvement when intermediate results can be reused.
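The two-step structure described in this abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' model: the `Stage` class, the per-stage `selectivity` values, the single feature "rows per core", and the use of a one-variable least-squares line as the black-box model are all hypothetical choices made here for clarity.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Stage:
    """Hypothetical description of one step of a job."""
    name: str
    selectivity: float  # assumed known: fraction of input rows the stage emits


def predict_cardinalities(input_rows: int, stages: List[Stage]) -> List[float]:
    """White-box step: propagate row counts using prior knowledge (selectivities).

    Returns the number of rows entering each stage.
    """
    cards = []
    rows = float(input_rows)
    for stage in stages:
        cards.append(rows)
        rows *= stage.selectivity
    return cards


def fit_line(xs: List[float], ys: List[float]) -> Tuple[float, float]:
    """Black-box step: ordinary least squares for y = a * x + b.

    xs are observed rows-per-core values, ys the measured runtimes.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var = sum((x - mean_x) ** 2 for x in xs)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = cov / var
    return a, mean_y - a * mean_x


def predict_runtime(input_rows: int, stages: List[Stage],
                    models: Dict[str, Tuple[float, float]], cores: int) -> float:
    """Combine both steps: predicted cardinalities feed each stage's fitted model."""
    total = 0.0
    for stage, rows in zip(stages, predict_cardinalities(input_rows, stages)):
        a, b = models[stage.name]
        total += a * (rows / cores) + b
    return total


# Example with made-up stages and model coefficients.
stages = [Stage("filter", 0.5), Stage("aggregate", 0.1)]
models = {"filter": (2.0, 1.0), "aggregate": (1.0, 0.0)}
estimate = predict_runtime(1000, stages, models, cores=10)
```

In practice the coefficients in `models` would come from calling `fit_line` on monitored runs with different resource allocations and input sizes; a richer regressor could be substituted without changing the overall structure.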

