Runtime Prediction: Latest Research Papers

The job scheduler plays a vital role in high-performance computing (HPC) platforms. It determines the execution order of jobs and the allocation of resources, which in turn affect the resource utilization of the entire system. As the scale and complexity of HPC systems continue to grow, job scheduling is becoming both more important and more difficult. Existing studies rely on user-specified estimates or regression techniques to produce fixed runtime predictions, and feed those values into static heuristic scheduling algorithms. However, these approaches perform well only when the runtime predictions are very accurate, and fixed heuristic scheduling strategies cannot adapt to changes in the workload. In this work, we propose RLSchert, a job scheduler based on deep reinforcement learning and remaining runtime prediction. First, RLSchert estimates the state of the system with a dynamic predictor of each job's remaining runtime, thereby providing an accurate spatiotemporal view of the cluster status. Second, RLSchert learns the optimal policy for selecting or killing jobs according to that status, using imitation learning and the proximal policy optimization (PPO) algorithm. Extensive experiments on real-world job logs from the USTC Supercomputing Center show that RLSchert is superior to static heuristic policies and outperforms the learning-based scheduler DeepRM. In addition, the dynamic predictor gives more accurate remaining runtime predictions, which is essential for most learning-based schedulers.
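
For readers unfamiliar with proximal policy optimization, the following is a minimal sketch of the standard clipped surrogate objective that the algorithm maximizes. It is not RLSchert's implementation: the function name `ppo_clip_objective`, the default `eps=0.2`, and the plain-list inputs are illustrative assumptions, and the scheduler's state, action, and reward design are not shown here.

```python
from typing import Sequence


def ppo_clip_objective(
    ratios: Sequence[float],
    advantages: Sequence[float],
    eps: float = 0.2,  # assumed clipping range, a common default
) -> float:
    """Mean clipped surrogate objective over a batch of samples.

    ratios[i]     = pi_new(a_i | s_i) / pi_old(a_i | s_i)
    advantages[i] = advantage estimate for sample i
    """
    if len(ratios) != len(advantages) or len(ratios) == 0:
        raise ValueError("ratios and advantages must be non-empty and equal in length")

    total = 0.0
    for r, adv in zip(ratios, advantages):
        # Restrict the probability ratio to [1 - eps, 1 + eps].
        clipped = min(max(r, 1.0 - eps), 1.0 + eps)
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(r * adv, clipped * adv)
    return total / len(ratios)


if __name__ == "__main__":
    # One sample whose ratio is too high, one whose ratio is too low.
    print(ppo_clip_objective([1.5, 0.5], [1.0, -1.0]))
```

Taking the minimum of the two terms removes the incentive to move the new policy far from the old one in a single update, which is the property that makes the method stable enough to train a scheduling policy from logged jobs.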


An RED Hybrid Model for SOC Tracking, Runtime Prediction and Transient Response Description

2021 IEEE Applied Power Electronics Conference and Exposition (APEC) ◽ 10.1109/apec42165.2021.9487041 ◽ 2021

Author(s): Zhihong Yan ◽ Ying Huang ◽ Chenxiao Jiang ◽ Ying Mei ◽ Siew-Chong Tan ◽ ...

Keyword(s): Hybrid Model ◽ Transient Response ◽ Runtime Prediction


Simulation Runtime Prediction Approach based on Stacking Ensemble Learning

Proceedings of the 11th International Conference on Simulation and Modeling Methodologies, Technologies and Applications ◽ 10.5220/0010517600420049 ◽ 2021

Author(s): Yuhao Xiao ◽ Yiping Yao ◽ Feng Zhu ◽ Kai Chen

Keyword(s): Ensemble Learning ◽ Runtime Prediction ◽ Prediction Approach


Online Runtime Prediction Method for Distributed Iterative Jobs

10.1007/978-3-030-87571-8_14 ◽ 2021 ◽ pp. 156-168

Author(s): Xiaofei Yue ◽ Lan Shi ◽ Yuhai Zhao ◽ Hangxu Ji ◽ Guoren Wang

Keyword(s): Prediction Method ◽ Runtime Prediction


Simulation Runtime Prediction Approach based on Stacking Ensemble Learning

10.5220/0010517600002995 ◽ 2021

Author(s): Yuhao Xiao ◽ Yiping Yao ◽ Feng Zhu ◽ Kai Chen

Keyword(s): Ensemble Learning ◽ Runtime Prediction ◽ Prediction Approach


Runtime prediction of high-performance computing jobs based on ensemble learning

Proceedings of the 2020 4th International Conference on High Performance Compilation, Computing and Communications ◽ 10.1145/3407947.3407968 ◽ 2020

Author(s): Xiaomeng Chen ◽ Hui Zhang ◽ Hanli Bai ◽ Chunming Yang ◽ Xujian Zhao ◽ ...

Keyword(s): High Performance Computing ◽ Ensemble Learning ◽ High Performance ◽ Runtime Prediction ◽ Performance Computing


A gray-box modeling methodology for runtime prediction of Apache Spark jobs

Distributed and Parallel Databases ◽ 10.1007/s10619-020-07286-y ◽ 2020 ◽ Vol 38 (4) ◽ pp. 819-839

Author(s): Hani Al-Sayeh ◽ Stefan Hagedorn ◽ Kai-Uwe Sattler

Keyword(s): Apache Spark ◽ Box Model ◽ Data Sets ◽ Performance Impact ◽ Runtime Prediction ◽ Execution Costs ◽ Modeling Methodology ◽ Huge Data ◽ Cluster Configuration ◽ Significant Performance

Abstract: Apache Spark jobs often process huge data sets and therefore have runtimes ranging from minutes to hours. Being able to predict the runtime of such jobs is useful not only for knowing when a job will finish, but also for scheduling, for estimating the monetary cost of a cloud deployment, and for choosing an appropriate cluster configuration, such as the number of nodes. However, predicting Spark job runtimes is much more challenging than predicting runtimes of standard database queries: the cluster configuration and parameters have a significant performance impact, and jobs usually contain a lot of user-defined code, which makes it difficult to estimate cardinalities and execution costs. In this paper, we present a gray-box modeling methodology for runtime prediction of Apache Spark jobs. Our approach comprises two steps. In the first step, a white-box model for predicting the cardinalities of the input RDDs of each operator is built from prior knowledge about the job's behavior and application parameters, such as the data filters applied, the number of iterations, etc. In the second step, a black-box model is used for each task; it is constructed by monitoring runtime metrics while varying the allocated resources and the input RDD cardinalities. We further show how to use this gray-box approach not only for predicting the runtime of a given job, but also as part of a decision model for reusing intermediate cached results of Spark jobs. Our methodology is validated by an experimental evaluation that shows a highly accurate prediction of the actual job runtime and a performance improvement when intermediate results can be reused.
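
The two-step idea in this abstract can be illustrated with a small sketch, under heavy assumptions. Here the white-box step multiplies an input row count by known filter selectivities, and the black-box step fits a straight line `runtime = a + b * rows / cores` to monitored samples by ordinary least squares. The function names, the linear model form, and the sample numbers are all hypothetical and are not taken from the paper, which does not specify its models at this level of detail.

```python
from typing import Sequence, Tuple


def estimate_cardinality(input_rows: int, selectivities: Sequence[float]) -> float:
    """White-box step: propagate a row count through filters of known selectivity."""
    rows = float(input_rows)
    for s in selectivities:
        rows *= s
    return rows


def fit_black_box(samples: Sequence[Tuple[float, int, float]]) -> Tuple[float, float]:
    """Black-box step: least-squares fit of runtime = a + b * (rows / cores).

    Each sample is (input_rows, cores, measured_runtime_seconds).
    """
    xs = [rows / cores for rows, cores, _ in samples]
    ys = [runtime for _, _, runtime in samples]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("samples must cover at least two distinct rows/cores values")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def predict_runtime(model: Tuple[float, float], rows: float, cores: int) -> float:
    """Combine both steps: apply the fitted line to an estimated cardinality."""
    a, b = model
    return a + b * rows / cores


if __name__ == "__main__":
    # Made-up monitoring data: (rows, cores, seconds).
    samples = [(1000, 1, 3.0), (2000, 1, 4.0), (4000, 2, 4.0), (8000, 2, 6.0)]
    model = fit_black_box(samples)
    rows = estimate_cardinality(10_000, [0.5, 0.4])  # two filters
    print(predict_runtime(model, rows, cores=4))
```

Splitting the model this way keeps the part that depends on application logic (how many rows survive each operator) separate from the part that depends on the cluster (how long a given number of rows takes on a given number of cores), so the second part can be refitted when resources change.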
