Clustering error messages produced by distributed computing infrastructure during the processing of high energy physics data

2021 ◽  
Vol 36 (10) ◽  
pp. 2150070
Author(s):  
Maria Grigorieva ◽  
Dmitry Grin

Large-scale distributed computing infrastructures ensure the operation and maintenance of scientific experiments at the LHC: more than 160 computing centers all over the world execute tens of millions of computing jobs per day. ATLAS, the largest experiment at the LHC, creates an enormous flow of data which has to be recorded and analyzed by a complex, heterogeneous and distributed computing environment. Statistically, about 10–12% of computing jobs end in failure: network faults, service failures, authorization failures, and other error conditions trigger error messages that provide detailed information about the issue, which can be used for diagnosis and proactive fault handling. However, this analysis is complicated by the sheer scale of the textual log data and is often exacerbated by the lack of a well-defined structure: human experts have to interpret the detected messages and create parsing rules manually, which is time-consuming and does not allow previously unknown error conditions to be identified without further human intervention. This paper describes a pipeline of methods for the unsupervised clustering of multi-source error messages. The pipeline is data-driven, based on machine learning algorithms, and executed fully automatically, allowing error messages to be categorized according to textual patterns and meaning.
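The specific pipeline is detailed in the paper; purely as a generic sketch of the underlying idea (TF-IDF character n-grams plus density-based clustering, with entirely synthetic messages and an untuned eps), a minimal example might look like this:

```python
# Generic sketch of unsupervised clustering of error messages,
# not the pipeline described in the paper.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN

# Synthetic stand-ins for multi-source log messages.
messages = [
    "Transfer failed: connection timed out to storage element SE-01",
    "Transfer failed: connection timed out to storage element SE-42",
    "Authorization failure: proxy certificate expired for user alice",
    "Authorization failure: proxy certificate expired for user bob",
    "Service unavailable: server returned HTTP 503",
]

# Character n-grams are reasonably robust to variable tokens (hostnames, IDs).
X = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5)).fit_transform(messages)

# Group messages sharing textual patterns; eps is illustrative and needs tuning.
labels = DBSCAN(eps=0.5, min_samples=2, metric="cosine").fit_predict(X)

for label, message in zip(labels, messages):
    print(label, message)
```

On real log data one would typically also mask volatile substrings (job IDs, hostnames, timestamps) before vectorization, so that messages differing only in such tokens fall into the same cluster.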

2005 ◽  
Vol 20 (14) ◽  
pp. 3021-3032
Author(s):  
Ian M. Fisk

In this review, the computing challenges facing the current and next generation of high energy physics experiments will be discussed. High energy physics computing represents an interesting infrastructure challenge as the use of large-scale commodity computing clusters has increased. The causes and ramifications of these infrastructure challenges will be outlined. Increasing requirements, limited physical infrastructure at computing facilities, and limited budgets have driven many experiments to deploy distributed computing solutions to meet the growing computing needs for analysis, reconstruction, and simulation. The current generation of experiments has developed and integrated a number of solutions to facilitate distributed computing. The current work of the running experiments gives an insight into the challenges that will be faced by the next generation of experiments and the infrastructure that will be needed.


2020 ◽  
Vol 245 ◽  
pp. 07036
Author(s):  
Christoph Beyer ◽  
Stefan Bujack ◽  
Stefan Dietrich ◽  
Thomas Finnern ◽  
Martin Flemming ◽  
...  

DESY is one of the largest accelerator laboratories in Europe. It develops and operates state-of-the-art accelerators for fundamental science in the areas of high energy physics, photon science and accelerator development. While for decades high energy physics (HEP) has been the most prominent user of the DESY compute, storage and network infrastructure, other scientific areas such as photon science and accelerator development have caught up and now dominate the demands on the DESY infrastructure resources, with significant consequences for IT resource provisioning. In this contribution, we will present an overview of the computational, storage and network resources covering the various physics communities on site. These range from high-throughput computing (HTC), batch-like offline processing on the Grid and interactive user analysis resources in the National Analysis Factory (NAF) for the HEP community, to the computing needs of accelerator development and of photon science facilities such as PETRA III or the European XFEL. Since DESY is involved in these experiments and their data taking, the requirements include fast, low-latency online processing for data taking and calibration as well as offline processing; these high-performance computing (HPC) workloads run on the dedicated Maxwell HPC cluster. As all communities face significant challenges from changing environments and increasing data rates in the coming years, we will discuss how this is reflected in necessary changes to the computing and storage infrastructures. We will present DESY compute cloud and container orchestration plans as a basis for infrastructure and platform services. We will show examples of Jupyter notebooks for small-scale interactive analysis, as well as their integration into large-scale resources such as batch systems or Spark clusters. To overcome the fragmentation of the various resources for all scientific communities at DESY, we explore how to integrate them into a seamless user experience in an Interdisciplinary Data Analysis Facility.
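As a purely illustrative sketch (not DESY's actual setup), the kind of interactive Spark analysis driven from a Jupyter notebook might look as follows; the master URL, application name and data path are placeholders:

```python
# Hypothetical notebook cell: attach to a Spark cluster and run a small
# aggregation. Master URL, app name and input path are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .master("spark://spark-master.example.org:7077")  # placeholder cluster URL
    .appName("interactive-notebook-analysis")         # hypothetical app name
    .getOrCreate()
)

# Count events per run in an example Parquet dataset (path is illustrative).
events = spark.read.parquet("/data/example/events.parquet")
events.groupBy("run_number").count().orderBy("run_number").show()

spark.stop()
```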


2004 ◽  
Vol 13 (03) ◽  
pp. 391-502 ◽  
Author(s):  
MASSIMO GIOVANNINI

Cosmology, high-energy physics and astrophysics are today converging on the study of large-scale magnetic fields. While the experimental evidence for the existence of large-scale magnetization in galaxies, clusters and superclusters is rather compelling, the origin of the phenomenon remains puzzling, especially in light of the most recent observations. The purpose of the present review is to describe the physical motivations and the open theoretical problems related to the existence of large-scale magnetic fields.


2019 ◽  
Vol 214 ◽  
pp. 04020 ◽  
Author(s):  
Martin Barisits ◽  
Fernando Barreiro ◽  
Thomas Beermann ◽  
Karan Bhatia ◽  
Kaushik De ◽  
...  

Transparent use of commercial cloud resources for scientific experiments is a hard problem. In this article, we describe the first steps of the Data Ocean R&D collaboration between the high-energy physics experiment ATLAS and Google Cloud Platform to allow seamless use of Google Compute Engine and Google Cloud Storage for physics analysis. We start by describing the three preliminary use cases that were identified at the beginning of the project. The following sections then detail the work done in the data management system Rucio and the workflow management systems PanDA and Harvester to interface Google Cloud Platform with the ATLAS distributed computing environment, and show the results of the integration tests. Afterwards, we describe the setup and results from a full ATLAS user analysis that was executed natively on Google Cloud Platform, and give estimates on projected costs. We close with a summary and an outlook on future work.
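The Rucio/PanDA integration itself is described in the article; purely as an illustration of the Google Cloud Storage client API that such data transfers ultimately rely on (the bucket and object names here are invented), a direct upload looks roughly like this:

```python
# Generic google-cloud-storage usage, not the Rucio/PanDA/Harvester integration.
# Bucket and object names are placeholders.
from google.cloud import storage

client = storage.Client()                        # uses application default credentials
bucket = client.bucket("example-atlas-bucket")   # hypothetical bucket
blob = bucket.blob("user/jdoe/analysis/output.root")

# Upload a local analysis output and verify it is listed in the bucket.
blob.upload_from_filename("output.root")
print([b.name for b in client.list_blobs("example-atlas-bucket", prefix="user/jdoe/")])
```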


2020 ◽  
Vol 245 ◽  
pp. 07001
Author(s):  
Laura Sargsyan ◽  
Filipe Martins

Large experiments in high energy physics require efficient and scalable monitoring solutions to digest data from the detector control system. Plotting multiple graphs in the slow control system and extracting historical data for long time periods are resource-intensive tasks. The proposed solution leverages new virtualization, data analytics and visualization technologies such as the InfluxDB time-series database for faster access to large-scale data, Grafana to visualize time-series data, and the OpenShift container platform to automate the build, deployment and management of the application. The monitoring service runs separately from the control system, thus reducing the workload on the control system's computing resources. As an example, a test version of the new monitoring was applied to the ATLAS Tile Calorimeter using the CERN Cloud Process as a Service platform. Many dashboards have been created in Grafana to monitor and analyse the behaviour of the High Voltage distribution system. They visualize not only values measured by the control system, but also run information and analytics data (difference, deviation, etc.). The new monitoring, with feature-rich visualization, filtering possibilities and analytics tools, extends detector control and monitoring capabilities and can help experts working on large-scale experiments.
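As a hedged sketch of the kind of write that feeds such a time-series backend (using the influxdb-client Python package; the endpoint, bucket, measurement, tag and field names below are invented for the example and are not taken from the paper):

```python
# Illustrative write of a detector-control reading into InfluxDB 2.x,
# which a Grafana dashboard could then query. All names are placeholders.
from influxdb_client import InfluxDBClient, Point
from influxdb_client.client.write_api import SYNCHRONOUS

client = InfluxDBClient(url="http://localhost:8086",  # placeholder endpoint
                        token="my-token", org="my-org")
write_api = client.write_api(write_options=SYNCHRONOUS)

# One high-voltage reading, tagged so dashboards can filter by partition/channel.
point = (
    Point("tile_hv")
    .tag("partition", "LBA")
    .tag("channel", "42")
    .field("voltage", 830.5)
)
write_api.write(bucket="dcs-monitoring", record=point)
client.close()
```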


2020 ◽  
Vol 245 ◽  
pp. 03038
Author(s):  
Giuseppe Andronico

The Jiangmen Underground Neutrino Observatory (JUNO) is an underground 20 kton liquid scintillator detector being built in the south of China. Targeting an unprecedented relative energy resolution of 3% at 1 MeV, JUNO will be able to study neutrino oscillation phenomena and determine the neutrino mass ordering with a statistical significance of 3–4 sigma within six years of running time. These physics challenges are addressed by a large Collaboration spread across three continents. In this context, key to the success of JUNO will be the realization of a distributed computing infrastructure to fulfill the foreseen computing needs. Computing infrastructure development is performed jointly by the Institute of High Energy Physics (IHEP), part of the Chinese Academy of Sciences (CAS), and a number of Italian, French and Russian data centers that are already part of the Worldwide LHC Computing Grid (WLCG). Once established, JUNO is expected to deliver no less than 2 PB of data per year, to be stored in data centers throughout China and Europe. Data analysis activities will also be carried out in cooperation. This contribution reports on the China–EU cooperation to jointly design and build the JUNO computing infrastructure and describes its main characteristics and requirements.

