Research and Exploit of Resource Sharing Strategy at IHEP

2019 ◽  
Vol 214 ◽  
pp. 03014
Author(s):  
Xiaowei JIANG ◽  
Jingyan Shi ◽  
Jiaheng Zou ◽  
Qingbao Hu ◽  
Ran Du ◽  
...  

At IHEP (Institute of High Energy Physics, Chinese Academy of Sciences), computing resources are contributed by different experiments, including BES, JUNO, DYW and HXMT. The resources were divided into separate partitions to satisfy the dedicated data processing requirements of each experiment. IHEP ran a local Torque/Maui cluster with about 50 queues serving more than 10 experiments. The separated partitions led to an imbalanced resource load: in a typical situation the BES partition was fully occupied with many jobs still idle in its queue, while the JUNO partition sat largely free and its resources were wasted. After migrating from Torque/Maui to HTCondor in 2016, job scheduling efficiency improved considerably. To balance the resource load, we designed a sharing strategy that improves the overall resource utilization. We created a unified pool shared by all experiments. For each experiment, resources are divided into two parts: dedicated resources and shared resources. Slots in the dedicated part run only jobs from the owning experiment, while slots in the shared part accept jobs from all experiments. The default ratio of dedicated to shared resources is 1:4, and to maximize sharing effectiveness the ratio is dynamically adjusted between 0:5 and 4:1 based on the number of jobs submitted by each experiment. We have developed a central control system that decides how many resources are allocated to each experiment group. The system has a server side and a client side. A management database on the server side stores resource, group and experiment information. Whenever the sharing ratio needs to be adjusted, the resource groups are changed and updated in the database, and the resource group information is published to a server buffer in real time. Clients periodically pull the resource group information from the server buffer over HTTPS and update the local resource scheduling configuration accordingly. In this way the sharing ratio can be modified and deployed dynamically. The sharing strategy is implemented with HTCondor; the ClassAd mechanism and accounting groups in HTCondor make it straightforward to apply the strategy to the IHEP computing cluster. With the sharing strategy in place, resource usage has improved dramatically.
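To make the client-side mechanism concrete, the following is a minimal sketch: it pulls the published resource group information over HTTPS and rewrites a configuration fragment using HTCondor's GROUP_NAMES / GROUP_QUOTA knobs. All URLs, file paths and group names are placeholders, and the mapping of group information to accounting-group quotas is an assumption, not the actual IHEP implementation.

```python
import json
import urllib.request

# Placeholder URL of the server-side buffer publishing group information.
GROUP_INFO_URL = "https://scheduler.example.org/resource-groups"
CONFIG_FRAGMENT = "/etc/condor/config.d/99-group-quotas.conf"

def fetch_group_info(url: str) -> dict:
    """Pull the current resource group allocation from the server buffer."""
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)

def write_group_quotas(groups: dict, path: str) -> None:
    """Render HTCondor accounting-group quotas from the pulled information.

    `groups` is assumed to map accounting group names (e.g. "group_bes")
    to the number of slots currently allocated to that group.
    """
    lines = ["GROUP_NAMES = " + ", ".join(groups)]
    for name, slots in groups.items():
        lines.append(f"GROUP_QUOTA_{name} = {slots}")
    with open(path, "w") as fh:
        fh.write("\n".join(lines) + "\n")

if __name__ == "__main__":
    info = fetch_group_info(GROUP_INFO_URL)
    write_group_quotas(info, CONFIG_FRAGMENT)
    # A real client would now trigger `condor_reconfig` so the negotiator
    # picks up the new quotas.
```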

2019 ◽  
Vol 214 ◽  
pp. 08009 ◽  
Author(s):  
Matthias J. Schnepf ◽  
R. Florian von Cube ◽  
Max Fischer ◽  
Manuel Giffels ◽  
Christoph Heidecker ◽  
...  

Demand for computing resources in high energy physics (HEP) shows a highly dynamic behavior, while the resources provided by the Worldwide LHC Computing Grid (WLCG) remain static. It has become evident that opportunistic resources such as High Performance Computing (HPC) centers and commercial clouds are well suited to cover peak loads. However, the utilization of these resources gives rise to new levels of complexity, e.g. resources need to be managed highly dynamically and HEP applications require a very specific software environment usually not provided at opportunistic resources. Furthermore, limitations in network bandwidth can cause I/O-intensive workflows to run inefficiently. The key component for dynamically running HEP applications on opportunistic resources is the utilization of modern container and virtualization technologies. Based on these technologies, the Karlsruhe Institute of Technology (KIT) has developed ROCED, a resource manager to dynamically integrate and manage a variety of opportunistic resources. In combination with ROCED, the HTCondor batch system acts as a powerful single entry point to all available computing resources, leading to a seamless and transparent integration of opportunistic resources into HEP computing. KIT is currently improving resource management and job scheduling by focusing on the I/O requirements of individual workflows, the available network bandwidth, and scalability. For these reasons, we are currently developing a new resource manager, called TARDIS. In this paper, we give an overview of the utilized technologies, the dynamic management and integration of resources, and the status of the I/O-based resource and job scheduling.
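The core of such a dynamic resource manager can be sketched as a simple control loop. This is a generic illustration only, with hypothetical helper functions and thresholds; it is not code from ROCED or TARDIS.

```python
import time

# Hypothetical helpers: in a real deployment these would query the batch
# system (e.g. the HTCondor collector) and the cloud/HPC provisioning API.
def count_idle_jobs() -> int:
    return 0        # placeholder

def count_booted_workers() -> int:
    return 0        # placeholder

def boot_worker() -> None:
    print("booting one opportunistic worker")    # e.g. start a VM or HPC pilot

def drain_worker() -> None:
    print("draining one opportunistic worker")   # finish running jobs, then stop

JOBS_PER_WORKER = 8      # assumed job slots provided by one opportunistic node
POLL_INTERVAL = 60       # seconds between scheduling decisions

def control_loop(iterations: int = 3) -> None:
    """Scale the number of opportunistic workers with the current demand."""
    for _ in range(iterations):
        idle, booted = count_idle_jobs(), count_booted_workers()
        wanted = -(-idle // JOBS_PER_WORKER)     # ceiling division
        for _ in range(max(0, wanted - booted)):
            boot_worker()
        for _ in range(max(0, booted - wanted)):
            drain_worker()
        time.sleep(POLL_INTERVAL)

if __name__ == "__main__":
    control_loop()
```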


2019 ◽  
Vol 214 ◽  
pp. 08004 ◽  
Author(s):  
R. Du ◽  
J. Shi ◽  
J. Zou ◽  
X. Jiang ◽  
Z. Sun ◽  
...  

Two production clusters coexist at the Institute of High Energy Physics (IHEP). One is a High Throughput Computing (HTC) cluster with HTCondor as the workload manager, the other is a High Performance Computing (HPC) cluster with Slurm as the workload manager. The resources of the HTCondor cluster are funded by multiple experiments, and resource utilization has reached more than 90% by adopting a dynamic resource sharing mechanism. Nevertheless, a bottleneck appears when several experiments request more resources at the same time. On the other hand, parallel jobs running on the Slurm cluster exhibit specific characteristics, such as a high degree of parallelism, low job counts and long wall times. These characteristics leave free resource slots that are suitable for jobs from the HTCondor cluster. As a result, a mechanism that transparently schedules jobs from the HTCondor cluster onto the Slurm cluster would improve the resource utilization of the Slurm cluster and reduce job queue time for the HTCondor cluster. In this paper, we present three methods to migrate HTCondor jobs to the Slurm cluster and conclude that HTCondor-C is the preferred one. Furthermore, because the design philosophies and use cases of HTCondor and Slurm differ, some job scheduling issues and possible solutions are presented.
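For illustration, a minimal HTCondor-C style (grid universe) submission through the htcondor Python bindings might look as follows. The remote schedd and collector host names are placeholders, and this is a generic sketch rather than the IHEP production configuration.

```python
import htcondor

# Minimal HTCondor-C submission: the job is handed from the local schedd
# to a remote schedd (placeholder host names below).
submit = htcondor.Submit({
    "universe": "grid",
    # "condor <remote schedd> <remote collector>" selects HTCondor-C routing.
    "grid_resource": "condor remote-schedd.example.org remote-pool.example.org",
    "executable": "run_analysis.sh",
    "output": "job.out",
    "error": "job.err",
    "log": "job.log",
})

schedd = htcondor.Schedd()
result = schedd.submit(submit)          # queue one job in the local schedd
print("submitted cluster", result.cluster())
```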


2021 ◽  
Vol 251 ◽  
pp. 02070
Author(s):  
Matthew Feickert ◽  
Lukas Heinrich ◽  
Giordon Stark ◽  
Ben Galewsky

In High Energy Physics, facilities that provide High Performance Computing environments offer an opportunity to efficiently perform the statistical inference required for the analysis of data from the Large Hadron Collider, but they can pose problems with orchestration and efficient scheduling. The compute architectures at these facilities do not easily support the Python compute model, and the configuration and scheduling of batch jobs for physics often require expertise in multiple job scheduling services. The combination of the pure-Python libraries pyhf and funcX reduces a common problem in HEP analyses, performing statistical inference with binned models, which would traditionally take multiple hours and bespoke scheduling, to an on-demand (fitting) “function as a service” that can scalably execute across workers in just a few minutes, offering reduced time to insight and inference. We demonstrate execution of a scalable workflow using funcX to simultaneously fit 125 signal hypotheses from a published ATLAS search for new physics using pyhf, with a wall time of under 3 minutes. We additionally show performance comparisons for other physics analyses with openly published probability models and argue for a blueprint of fitting-as-a-service systems at HPC centers.
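A minimal sketch of this fitting-as-a-service pattern, assuming the funcX SDK and a toy pyhf model, is shown below. The endpoint UUID, the model and the tested signal strengths are placeholders, not the published ATLAS probability models used in the paper.

```python
from funcx.sdk.client import FuncXClient

def fit_hypothesis(signal_strength):
    # Imports live inside the function so the funcX worker can execute it.
    import pyhf
    model = pyhf.simplemodels.uncorrelated_background(
        signal=[12.0, 11.0], bkg=[50.0, 52.0], bkg_uncertainty=[3.0, 7.0]
    )
    data = [51, 48] + model.config.auxdata
    # Observed CLs for the tested value of the signal strength (POI).
    cls_obs = pyhf.infer.hypotest(signal_strength, data, model, test_stat="qtilde")
    return float(cls_obs)

fxc = FuncXClient()
func_id = fxc.register_function(fit_hypothesis)
endpoint_id = "REPLACE-WITH-YOUR-ENDPOINT-UUID"   # funcX endpoint at the HPC site

# Fan the fits out across the endpoint's workers, one task per hypothesis.
hypotheses = (0.5, 1.0, 1.5, 2.0)
task_ids = [fxc.run(mu, endpoint_id=endpoint_id, function_id=func_id)
            for mu in hypotheses]

# In practice one polls until all tasks have completed before collecting.
results = [fxc.get_result(tid) for tid in task_ids]
print(dict(zip(hypotheses, results)))
```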


2019 ◽  
Vol 214 ◽  
pp. 06007
Author(s):  
Malachi Schram ◽  
Nathan Tallent ◽  
Ryan Friese ◽  
Alok Singh ◽  
Ilkay Altintas

In this research, we investigated two approaches to detect job anomalies and/or contention in large-scale computing efforts: (1) preemptive job scheduling using binomial-classification long short-term memory (LSTM) networks, and (2) forecasting intra-node computing loads from the active jobs and additional job(s). For approach 1, we achieved a 14% improvement in computational resource utilization and an overall classification accuracy of 85% on real tasks executed in a High Energy Physics computing workflow. In this paper, we also present preliminary results for the second approach.
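As a rough illustration of approach 1 (not the authors' actual model; the input shape, hyperparameters and data below are assumptions), a binomial-classification LSTM can be set up in Keras as follows:

```python
import numpy as np
import tensorflow as tf

# Assumed input: sequences of 60 time steps with 8 per-job / per-node metrics.
TIMESTEPS, FEATURES = 60, 8

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(TIMESTEPS, FEATURES)),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation="sigmoid"),   # P(contention / anomaly)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# Dummy data standing in for monitored job metrics and contention labels.
x = np.random.rand(256, TIMESTEPS, FEATURES).astype("float32")
y = np.random.randint(0, 2, size=(256, 1))
model.fit(x, y, epochs=2, batch_size=32, verbose=0)

# A scheduler could preempt or delay a job when the predicted probability
# of contention exceeds a chosen threshold.
print(model.predict(x[:4]))
```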


2021 ◽  
Vol 251 ◽  
pp. 02063
Author(s):  
Michal Simon ◽  
Andrew Hanushevsky

Over the years, as the backbone of numerous data management solutions used within the WLCG collaboration, the XRootD framework and protocol have become one of the most important building blocks for storage solutions in the High Energy Physics (HEP) community. The latest big milestone for the project, release 5, introduced a multitude of architectural improvements and functional enhancements, including the new client-side declarative API, which is the main focus of this study. In this contribution, we give an overview of the new client API, discuss its motivation, and describe its positive impact on overall software quality (coupling, cohesion), readability and composability.


2020 ◽  
Vol 245 ◽  
pp. 07039
Author(s):  
Eileen Kuehn ◽  
Max Fischer ◽  
Sven Lange ◽  
Andreas Petzold ◽  
Andreas Heiss

To overcome the computing challenges in High Energy Physics, available resources must be utilized as efficiently as possible. This concerns algorithmic challenges in the workflows themselves as well as the scheduling of jobs onto compute resources. To enable the best possible scheduling, job schedulers require accurate information about the resource consumption of a job before it is even executed. It is the responsibility of the user to provide accurate resource estimates for their jobs. However, this is quite a challenge for users, as they (i) want to ensure that their jobs run correctly, (ii) must deal with heterogeneous compute resources, and (iii) face opaque library dependencies and frequent updates. Users therefore tend to specify resource requests with an ample buffer. This inaccuracy results in inefficient utilisation, either by blocking unused resources or by exceeding reserved resources. Especially in the context of opportunistic resource provisioning, these inaccuracies have an even broader impact, affecting not only the utilisation of resources but also the composition of the most suitable resources. The contribution of this paper is an analysis of production and end-user workflows in HEP with regard to optimizing the various resource types. We further propose a method to improve user estimates.
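As a generic illustration of the problem, and explicitly not the method proposed in this paper, a resource request could be derived from the recorded peak usage of earlier jobs of the same workflow, e.g. via a percentile plus a safety margin; all names and numbers below are placeholders.

```python
import statistics

def suggest_memory_request(peak_usages_mb, quantile=0.95, safety_margin=1.1):
    """Suggest a memory request from observed peak usages of past jobs.

    A percentile of the history plus a small safety margin replaces the
    user's hand-picked "ample buffer"; the parameters are illustrative only.
    """
    ordered = sorted(peak_usages_mb)
    index = min(len(ordered) - 1, int(quantile * len(ordered)))
    return ordered[index] * safety_margin

# Example: peak memory (MB) observed for earlier jobs of the same workflow.
history = [1800, 1950, 2100, 1750, 2050, 1900, 2200, 1850]
print(f"requested memory: {suggest_memory_request(history):.0f} MB")
print(f"mean observed usage: {statistics.mean(history):.0f} MB")
```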


Author(s):  
Preeti Kumari ◽  
Kavita Lalwani ◽  
Ranjit Dalal ◽  
Ashutosh Bhardwaj ◽  
...  
