job scheduler
Recently Published Documents

TOTAL DOCUMENTS: 82 (five years: 26)
H-INDEX: 9 (five years: 2)

2021 · Author(s): Sirivan Chaleunxay, Nikhil Shah

Abstract Understanding the earth's subsurface is critical to the exploration and production (E&P) industry for minimizing risk and maximizing recovery. Until recently, the industry's service sector had made few advances in data-driven, automated earth-model building from raw exploration seismic data, but that has now changed. The industry's leading technique for an unprecedented increase in resolution and accuracy when establishing a view of the earth's interior is Full Waveform Inversion (FWI). Advanced formulations of FWI are capable of automating subsurface model building using only raw, unprocessed data. Cloud-based FWI accelerates this journey by pairing the most sophisticated waveform-inversion techniques with the largest compute facilities on the planet, yielding verifiable accuracy, more automation, and more efficiency. In this paper, we describe the transformation that enables cloud-based FWI to natively exploit the public cloud's main strengths: flexibility and on-demand scalability. We start from a lift-and-shift of a legacy MPI-based application designed to be run by a traditional on-prem job scheduler. Our specific goals are to (1) utilize a heterogeneous set of compute hardware throughout the lifecycle of a production FWI run without provisioning it for the entire duration, (2) take advantage of cost-efficient spare-capacity compute instances without uptime guarantees, and (3) maintain a single codebase that runs both on on-prem HPC systems and on the cloud. Achieving these goals meant transitioning the job-scheduling and "embarrassingly parallel" aspects of the communication code away from MPI and onto various cloud-based orchestration systems, as well as finding cloud-based solutions that worked and scaled well for the broadcast/reduction operations.
Placing these systems behind a customized TCP-based stub for MPI library calls allows us to run the code as-is in an on-prem HPC environment, while on the cloud we can asynchronously provision and suspend worker instances (potentially with very different hardware configurations) as needed, without the burden of maintaining a static MPI world communicator. With this dynamic cloud-native architecture, we (1) utilize advanced formulations of FWI capable of automating subsurface model building using only raw, unprocessed data, (2) extract velocity models from the full recorded wavefield (refractions, reflections, and multiples), and (3) introduce explicit sensitivity to reflection moveout, invisible to conventional FWI, for macro-model updates below the diving-wave zone. This makes it viable to revisit older legacy datasets acquired in complex environments and unlock considerable value where, until now, FWI has been impossible to apply successfully from a poor starting model.
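As a rough illustration of the stub idea (not the authors' actual implementation; all names here are hypothetical), the solver can be written against a small collective-operations interface, with MPI behind it on-prem and a TCP/cloud backend behind the same calls in the cloud. The in-process backend below simulates the workers so the sketch is self-contained:

```python
class Communicator:
    """Minimal collective-ops interface. In the real system, one subclass
    would wrap MPI for on-prem runs and another would speak TCP to
    dynamically provisioned cloud workers (both hypothetical here)."""
    def broadcast(self, model):
        raise NotImplementedError
    def allreduce(self, partial_gradients):
        raise NotImplementedError

class InProcessComm(Communicator):
    """Stand-in backend simulating n workers in one process, so the
    sketch runs anywhere without MPI or cloud credentials."""
    def __init__(self, n_workers):
        self.n_workers = n_workers

    def broadcast(self, model):
        # every worker receives an identical copy of the current model
        return [list(model) for _ in range(self.n_workers)]

    def allreduce(self, partial_gradients):
        # element-wise sum of the per-worker partial gradients
        return [sum(parts) for parts in zip(*partial_gradients)]

comm = InProcessComm(n_workers=3)
replicas = comm.broadcast([1.0, 2.0])          # one model copy per worker
grads = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]   # each worker's contribution
total = comm.allreduce(grads)                  # approximately [0.9, 1.2]
```

Because the solver only ever calls `broadcast`/`allreduce`, swapping the backend does not touch the numerical code, which is what lets one codebase serve both environments.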


2021 · Vol 11 (20) · pp. 9448 · Author(s): Qiqi Wang, Hongjie Zhang, Cheng Qu, Yu Shen, Xiaohui Liu, ...

The job scheduler plays a vital role in high-performance computing platforms. It determines the execution order of jobs and the allocation of resources, which in turn affect the resource utilization of the entire system. As the scale and complexity of HPC continue to grow, job scheduling is becoming increasingly important and difficult. Existing studies rely on user-specified values or regression techniques to obtain fixed runtime predictions, and use those values in static heuristic scheduling algorithms. However, such approaches require very accurate runtime predictions to produce good results, and fixed heuristic scheduling strategies cannot adapt to changes in the workload. In this work, we propose RLSchert, a job scheduler based on deep reinforcement learning and remaining-runtime prediction. First, RLSchert estimates the state of the system using a dynamic job remaining-runtime predictor, thereby providing an accurate spatiotemporal view of the cluster status. Second, RLSchert learns the optimal policy for selecting or killing jobs according to that status through imitation learning and the proximal policy optimization algorithm. Extensive experiments on real-world job logs from the USTC Supercomputing Center show that RLSchert is superior to static heuristic policies and outperforms the learning-based scheduler DeepRM. In addition, the dynamic predictor gives a more accurate remaining-runtime prediction, which is essential for most learning-based schedulers.
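RLSchert's actual policy is learned with PPO; as a hand-rolled stand-in, the sketch below only illustrates how a remaining-runtime predictor's output can feed a scheduling decision (the job fields are hypothetical):

```python
def select_job(jobs, free_cpus):
    """Greedy stand-in for a learned policy: among the queued jobs that
    fit in the free CPUs, pick the one with the shortest *predicted
    remaining* runtime (the quantity a dynamic predictor would supply).
    Returns None when nothing fits, i.e. the scheduler should wait."""
    runnable = [j for j in jobs if j["cpus"] <= free_cpus]
    if not runnable:
        return None
    return min(runnable, key=lambda j: j["predicted_remaining"])

queue = [
    {"id": "a", "cpus": 8, "predicted_remaining": 120},
    {"id": "b", "cpus": 2, "predicted_remaining": 30},
    {"id": "c", "cpus": 4, "predicted_remaining": 10},
]
pick = select_job(queue, free_cpus=4)  # "c": it fits and finishes soonest
```

A learned policy replaces the `min(...)` heuristic with a neural network scoring each action from the cluster state, but the input it consumes, per-job predicted remaining runtime, is the same.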


2021 · Vol 3 (1) · Author(s): Hao Wang, Yi-Qin Dai, Jie Yu, Yong Dong

Abstract Improving resource utilization is an important goal of the high-performance computing systems of supercomputing centers. To meet this goal, the job schedulers of high-performance computing systems often use backfilling to fill short jobs into gaps ahead of the jobs at the front of the queue. Backfilling requires the running time of each job. In the past, job running times were usually given by users and often far exceeded the actual running times, which leads to inaccurate backfilling and wasted computing resources. In particular, when the predicted running time is lower than the actual time, the damage to the utilization of the system's computing resources is more serious. The prediction accuracy of the job running time is therefore crucial to system resource utilization. Machine learning methods can produce more accurate predictions of the job running time. Targeting parallel aerodynamics applications, we propose SU, a job running time prediction framework combining supervised and unsupervised learning, and verify it on real historical data from the high-performance computing systems of the China Aerodynamics Research and Development Center (CARDC). The experimental results show that SU achieves high prediction accuracy (80.46%) and a low underestimation rate (24.85%).
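The role predicted runtimes play in backfilling can be sketched with a simplified EASY-style pass (hypothetical job dictionaries; the abstract does not show CARDC's actual scheduler code):

```python
def easy_backfill(queue, running, free_nodes, now):
    """One EASY-style scheduling pass driven by *predicted* runtimes.
    queue:   FIFO list of {"id", "nodes", "pred_runtime"} waiting jobs
    running: list of {"nodes", "end"} with predicted completion times
    Returns the jobs started this pass; mutates queue and running."""
    started = []
    # start head-of-queue jobs while they fit
    while queue and queue[0]["nodes"] <= free_nodes:
        job = queue.pop(0)
        free_nodes -= job["nodes"]
        running.append({"nodes": job["nodes"], "end": now + job["pred_runtime"]})
        started.append(job)
    if not queue:
        return started
    # reserve a start time ("shadow time") for the blocked head job,
    # assuming running jobs free their nodes at their predicted end times
    head, avail, shadow = queue[0], free_nodes, now
    for r in sorted(running, key=lambda r: r["end"]):
        avail += r["nodes"]
        shadow = r["end"]
        if avail >= head["nodes"]:
            break
    # backfill later jobs that fit now AND finish before the shadow time,
    # so the head job's reservation is never delayed (a simplification of
    # full EASY, which also admits jobs using only spare "extra" nodes)
    for job in list(queue[1:]):
        if job["nodes"] <= free_nodes and now + job["pred_runtime"] <= shadow:
            queue.remove(job)
            free_nodes -= job["nodes"]
            running.append({"nodes": job["nodes"],
                            "end": now + job["pred_runtime"]})
            started.append(job)
    return started

queue = [{"id": "A", "nodes": 4, "pred_runtime": 5},
         {"id": "B", "nodes": 2, "pred_runtime": 3},
         {"id": "C", "nodes": 2, "pred_runtime": 20}]
running = [{"nodes": 4, "end": 10}]
started = easy_backfill(queue, running, free_nodes=2, now=0)
# A (4 nodes) is blocked until t=10; B finishes by t=3 < 10, so it
# backfills; C would run past t=10 and delay A, so it must wait
```

The sketch makes the abstract's point concrete: if C's runtime were underestimated as, say, 5, it would be backfilled and then overrun the head job's reservation, which is exactly the costly underestimation case SU aims to minimize.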



Electronics · 2020 · Vol 9 (11) · pp. 1857 · Author(s): Siwoon Son, Yang-Sae Moon

Distributed stream processing engines (DSPEs) deploy multiple tasks on distributed servers to process data streams in real time. Many DSPEs provide locality-aware stream partitioning (LSP) methods to reduce network communication costs. However, the even job scheduler provided by DSPEs deploys tasks far away from each other on the distributed servers, which prevents LSP from being used properly. In this paper, we propose a Locality/Fairness-aware job scheduler (L/F job scheduler) that considers locality alongside fairness, solving the problems of the even job scheduler, which considers fairness only. First, the L/F job scheduler increases the cohesion of contiguous tasks that require message transmissions, for locality. At the same time, it reduces the coupling of parallel tasks that do not require message transmissions, for fairness. Next, we connect contiguous tasks into a stream pipeline and evenly deploy stream pipelines across the distributed servers so that the L/F job scheduler achieves high cohesion and low coupling. Finally, we implement the proposed L/F job scheduler in Apache Storm, a representative DSPE, and evaluate it on both synthetic and real-world workloads. Experimental results show that the L/F job scheduler matches the even job scheduler in throughput while improving latency by up to 139.2% for LSP applications and by up to 140.7% even for non-LSP applications. The L/F job scheduler also improves latency by 19.58% and 12.13%, respectively, in two real-world workloads. These results indicate that our L/F job scheduler provides superior processing performance for DSPE applications.
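The pipeline idea can be sketched as follows (a simplification assuming each task has at most one upstream and one downstream edge, i.e. the topology is a set of disjoint chains; this is not the paper's actual Storm scheduler code):

```python
from collections import defaultdict

def chain_pipelines(edges):
    """Group contiguous tasks into stream pipelines. edges is a list of
    (upstream, downstream) task pairs; with at most one in/out edge per
    task, the topology decomposes into disjoint chains."""
    nxt = dict(edges)
    has_upstream = {d for _, d in edges}
    pipelines = []
    for task in nxt:
        if task in has_upstream:
            continue  # not a chain head
        chain = [task]
        while chain[-1] in nxt:
            chain.append(nxt[chain[-1]])
        pipelines.append(chain)
    return pipelines

def deploy(pipelines, servers):
    """Round-robin whole pipelines across servers: tasks that exchange
    messages stay together (high cohesion) while parallel pipelines are
    spread evenly over servers (low coupling / fairness)."""
    placement = defaultdict(list)
    for i, pipeline in enumerate(pipelines):
        placement[servers[i % len(servers)]].extend(pipeline)
    return dict(placement)

edges = [("spout0", "bolt0"), ("bolt0", "sink0"),
         ("spout1", "bolt1"), ("bolt1", "sink1")]
placement = deploy(chain_pipelines(edges), ["server1", "server2"])
# each pipeline lands intact on one server; the two pipelines split evenly
```

An even scheduler would instead spread `spout0`, `bolt0`, and `sink0` across servers, forcing every tuple over the network; keeping each chain on one server is what lets LSP pay off.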


2020 · Vol 8 (4) · pp. 1030-1039 · Author(s): Delong Cui, Zhiping Peng, Jianbin Xiong, Bo Xu, Weiwei Lin
