Clustering error messages produced by distributed computing infrastructure during the processing of high energy physics data

2021 ◽  
Vol 36 (10) ◽  
pp. 2150070
Author(s):  
Maria Grigorieva ◽  
Dmitry Grin

Large-scale distributed computing infrastructures ensure the operation and maintenance of scientific experiments at the LHC: more than 160 computing centers all over the world execute tens of millions of computing jobs per day. ATLAS, the largest experiment at the LHC, creates an enormous flow of data which has to be recorded and analyzed by a complex, heterogeneous and distributed computing environment. Statistically, about 10–12% of computing jobs end in failure: network faults, service failures, authorization failures, and other error conditions trigger error messages that provide detailed information about the issue, which can be used for diagnosis and proactive fault handling. However, this analysis is complicated by the sheer scale of the textual log data and is often exacerbated by the lack of a well-defined structure: human experts have to interpret the detected messages and create parsing rules manually, which is time-consuming and does not allow previously unknown error conditions to be identified without further human intervention. This paper describes a pipeline of methods for the unsupervised clustering of multi-source error messages. The pipeline is data-driven, based on machine learning algorithms, and executed fully automatically, allowing error messages to be categorized according to textual patterns and meaning.
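The specific pipeline is detailed in the paper; purely as a generic sketch of the underlying idea (TF-IDF character n-grams plus density-based clustering, with entirely synthetic messages and an untuned eps), a minimal example might look like this:

```python
# Generic sketch of unsupervised clustering of error messages,
# not the pipeline described in the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

# Synthetic stand-ins for multi-source log messages.
messages = [
    "Transfer failed: connection timed out to storage element SE-01",
    "Transfer failed: connection timed out to storage element SE-42",
    "Authorization failure: proxy certificate expired for user alice",
    "Authorization failure: proxy certificate expired for user bob",
    "Service unavailable: server returned HTTP 503",
]

# Character n-grams are reasonably robust to variable tokens (hostnames, IDs).
X = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)).fit_transform(messages)

# Group messages sharing textual patterns; eps is illustrative and needs tuning.
labels = DBSCAN(eps=0.5, min_samples=2, metric="cosine").fit_predict(X)

for label, message in zip(labels, messages):
    print(label, message)
```

On real log data one would typically also mask volatile substrings (job IDs, hostnames, timestamps) before vectorization, so that messages differing only in such tokens fall into the same cluster.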

2005 ◽  
Vol 20 (14) ◽  
pp. 3021-3032
Author(s):  
Ian M. Fisk

In this review, the computing challenges facing the current and next generation of high energy physics experiments will be discussed. High energy physics computing represents an interesting infrastructure challenge as the use of large-scale commodity computing clusters has increased. The causes and ramifications of these infrastructure challenges will be outlined. Increasing requirements, limited physical infrastructure at computing facilities, and limited budgets have driven many experiments to deploy distributed computing solutions to meet the growing computing needs for analysis, reconstruction, and simulation. The current generation of experiments has developed and integrated a number of solutions to facilitate distributed computing. The current work of the running experiments gives an insight into the challenges that will be faced by the next generation of experiments and the infrastructure that will be needed.


2020 ◽  
Vol 245 ◽  
pp. 07036
Author(s):  
Christoph Beyer ◽  
Stefan Bujack ◽  
Stefan Dietrich ◽  
Thomas Finnern ◽  
Martin Flemming ◽  
...  

DESY is one of the largest accelerator laboratories in Europe. It develops and operates state-of-the-art accelerators for fundamental science in the areas of high energy physics, photon science and accelerator development. While for decades high energy physics (HEP) has been the most prominent user of the DESY compute, storage and network infrastructure, other scientific areas such as photon science and accelerator development have caught up and now dominate the demands on the DESY infrastructure resources, with significant consequences for IT resource provisioning. In this contribution, we will present an overview of the computational, storage and network resources covering the various physics communities on site. These range from high-throughput computing (HTC), batch-like offline processing on the Grid and interactive user analysis resources in the National Analysis Factory (NAF) for the HEP community, to the computing needs of accelerator development and of photon science facilities such as PETRA III or the European XFEL. Since DESY is involved in these experiments and their data taking, the requirements include fast, low-latency online processing for data taking and calibration as well as offline processing; these high-performance computing (HPC) workloads run on the dedicated Maxwell HPC cluster. As all communities face significant challenges from changing environments and increasing data rates in the coming years, we will discuss how this is reflected in necessary changes to the computing and storage infrastructures. We will present DESY compute cloud and container orchestration plans as a basis for infrastructure and platform services. We will show examples of Jupyter notebooks for small-scale interactive analysis, as well as their integration into large-scale resources such as batch systems or Spark clusters. To overcome the fragmentation of the various resources for all scientific communities at DESY, we explore how to integrate them into a seamless user experience in an Interdisciplinary Data Analysis Facility.
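As a purely illustrative sketch (not DESY's actual setup), the kind of interactive Spark analysis driven from a Jupyter notebook might look as follows; the master URL, application name and data path are placeholders:

```python
# Hypothetical notebook cell: attach to a Spark cluster and run a small
# aggregation. Master URL, app name and input path are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://spark-master.example.org:7077")  # placeholder cluster URL
    .appName("interactive-notebook-analysis")         # hypothetical app name
    .getOrCreate()
)

# Count events per run in an example Parquet dataset (path is illustrative).
events = spark.read.parquet("/data/example/events.parquet")
events.groupBy("run_number").count().orderBy("run_number").show()

spark.stop()
```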


2004 ◽  
Vol 13 (03) ◽  
pp. 391-502 ◽  
Author(s):  
MASSIMO GIOVANNINI

Cosmology, high-energy physics and astrophysics are today converging on the study of large-scale magnetic fields. While the experimental evidence for the existence of large-scale magnetization in galaxies, clusters and superclusters is rather compelling, the origin of the phenomenon remains puzzling, especially in light of the most recent observations. The purpose of the present review is to describe the physical motivations and the open theoretical problems related to the existence of large-scale magnetic fields.


2019 ◽  
Vol 214 ◽  
pp. 04020 ◽  
Author(s):  
Martin Barisits ◽  
Fernando Barreiro ◽  
Thomas Beermann ◽  
Karan Bhatia ◽  
Kaushik De ◽  
...  

Transparent use of commercial cloud resources for scientific experiments is a hard problem. In this article, we describe the first steps of the Data Ocean R&D collaboration between the high-energy physics experiment ATLAS and Google Cloud Platform to allow seamless use of Google Compute Engine and Google Cloud Storage for physics analysis. We start by describing the three preliminary use cases that were identified at the beginning of the project. The following sections then detail the work done in the data management system Rucio and the workflow management systems PanDA and Harvester to interface Google Cloud Platform with the ATLAS distributed computing environment, and show the results of the integration tests. Afterwards, we describe the setup and results from a full ATLAS user analysis that was executed natively on Google Cloud Platform, and give estimates on projected costs. We close with a summary and an outlook on future work.
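The Rucio/PanDA integration itself is described in the article; purely as an illustration of the Google Cloud Storage client API that such data transfers ultimately rely on (the bucket and object names here are invented), a direct upload looks roughly like this:

```python
# Generic google-cloud-storage usage, not the Rucio/PanDA/Harvester integration.
# Bucket and object names are placeholders.
from google.cloud import storage

client = storage.Client()                        # uses application default credentials
bucket = client.bucket("example-atlas-bucket")   # hypothetical bucket
blob = bucket.blob("user/jdoe/analysis/output.root")

# Upload a local analysis output and verify it is listed in the bucket.
blob.upload_from_filename("output.root")
print([b.name for b in client.list_blobs("example-atlas-bucket", prefix="user/jdoe/")])
```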


2020 ◽  
Vol 245 ◽  
pp. 07001
Author(s):  
Laura Sargsyan ◽  
Filipe Martins

Large experiments in high energy physics require efficient and scalable monitoring solutions to digest data from the detector control system. Plotting multiple graphs in the slow control system and extracting historical data for long time periods are resource-intensive tasks. The proposed solution leverages new virtualization, data analytics and visualization technologies such as the InfluxDB time-series database for faster access to large-scale data, Grafana to visualize time-series data, and the OpenShift container platform to automate the build, deployment and management of the application. The monitoring service runs separately from the control system, thus reducing the workload on the control system's computing resources. As an example, a test version of the new monitoring was applied to the ATLAS Tile Calorimeter using the CERN Cloud Process as a Service platform. Many dashboards have been created in Grafana to monitor and analyse the behaviour of the High Voltage distribution system. They visualize not only values measured by the control system, but also run information and analytics data (difference, deviation, etc.). The new monitoring, with feature-rich visualization, filtering possibilities and analytics tools, extends detector control and monitoring capabilities and can help experts working on large-scale experiments.
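As a hedged sketch of the kind of write that feeds such a time-series backend (using the influxdb-client Python package; the endpoint, bucket, measurement, tag and field names below are invented for the example and are not taken from the paper):

```python
# Illustrative write of a detector-control reading into InfluxDB 2.x,
# which a Grafana dashboard could then query. All names are placeholders.
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086",  # placeholder endpoint
                        token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

# One high-voltage reading, tagged so dashboards can filter by partition/channel.
point = (
    Point("tile_hv")
    .tag("partition", "LBA")
    .tag("channel", "42")
    .field("voltage", 830.5)
)
write_api.write(bucket="dcs-monitoring", record=point)
client.close()
```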


2020 ◽  
Vol 245 ◽  
pp. 03038
Author(s):  
Giuseppe Andronico

The Jiangmen Underground Neutrino Observatory (JUNO) is an underground 20 kton liquid scintillator detector being built in the south of China. Targeting an unprecedented relative energy resolution of 3% at 1 MeV, JUNO will be able to study neutrino oscillation phenomena and determine the neutrino mass ordering with a statistical significance of 3–4 sigma within six years of running time. These physics challenges are addressed by a large Collaboration spread across three continents. In this context, key to the success of JUNO will be the realization of a distributed computing infrastructure to fulfill the foreseen computing needs. Computing infrastructure development is performed jointly by the Institute of High Energy Physics (IHEP), part of the Chinese Academy of Sciences (CAS), and a number of Italian, French and Russian data centers that are already part of the Worldwide LHC Computing Grid (WLCG). Once established, JUNO is expected to deliver no less than 2 PB of data per year, to be stored in data centers throughout China and Europe. Data analysis activities will also be carried out in cooperation. This contribution reports on the China–EU cooperation to jointly design and build the JUNO computing infrastructure and describes its main characteristics and requirements.

