Preparing Distributed Computing Operations for the HL-LHC Era With Operational Intelligence

2022 ◽  
Vol 4 ◽  
Author(s):  
Alessandro Di Girolamo ◽  
Federica Legger ◽  
Panos Paparrigopoulos ◽  
Jaroslava Schovancová ◽  
Thomas Beermann ◽  
...  

As a joint effort from various communities involved in the Worldwide LHC Computing Grid, the Operational Intelligence project aims at increasing the level of automation in computing operations and reducing human interventions. The distributed computing systems currently deployed by the LHC experiments have proven to be mature and capable of meeting the experimental goals by allowing timely delivery of scientific results. However, a substantial number of interventions from software developers, shifters, and operational teams are needed to efficiently manage such heterogeneous infrastructures. Under the scope of the Operational Intelligence project, experts from several areas have gathered to propose and work on “smart” solutions. Machine learning, data mining, log analysis, and anomaly detection are only some of the tools we have evaluated for our use cases. In this community study contribution, we report on the development of a suite of operational intelligence services to cover various use cases: workload management, data management, and site operations.

2020 ◽  
Vol 245 ◽  
pp. 03017
Author(s):  
Alessandro Di Girolamo ◽  
Federica Legger ◽  
Panos Paparrigopoulos ◽  
Alexei Klimentov ◽  
Jaroslava Schovancová ◽  
...  

In the near future, large scientific collaborations will face unprecedented computing challenges. Processing and storing exabyte datasets require a federated infrastructure of distributed computing resources. The current systems have proven to be mature and capable of meeting the experiment goals by allowing timely delivery of scientific results. However, a substantial number of interventions from software developers, shifters, and operational teams are needed to efficiently manage such heterogeneous infrastructures. A wealth of operational data can be exploited to increase the level of automation in computing operations by using adequate techniques, such as machine learning (ML), tailored to solve specific problems. The Operational Intelligence project is a joint effort from various WLCG communities aimed at increasing the level of automation in computing operations. We discuss how state-of-the-art technologies can be used to build general solutions to common problems and to reduce the operational cost of the experiment computing infrastructure.


2019 ◽  
Vol 214 ◽  
pp. 03047 ◽  
Author(s):  
Fernando Barreiro ◽  
Doug Benjamin ◽  
Taylor Childers ◽  
Kaushik De ◽  
Johannes Elmsheuser ◽  
...  

Since 2010, the Production and Distributed Analysis system (PanDA) for the ATLAS experiment at the Large Hadron Collider has seen big changes to accommodate new types of distributed computing resources: clouds, HPCs, volunteer computers, and other external resources. While PanDA was originally designed for fairly homogeneous resources available through the Worldwide LHC Computing Grid, the new resources are heterogeneous, at diverse scales and with diverse interfaces. Up to a fifth of the resources available to ATLAS are of such new types and require special techniques for integration into PanDA. In this talk, we present the nature and scale of these resources. We provide an overview of the various challenges faced, spanning infrastructure, software distribution, workload requirements, scaling requirements, workflow management, data management, network provisioning, and associated software and computing facilities. We describe the strategies for integrating these heterogeneous resources into ATLAS, and the new software components being developed in PanDA to use them efficiently. Plans for software and computing evolution to meet the needs of LHC operations and upgrades in the long-term future will be discussed.

