Declarative Big Data Analysis for High-Energy Physics: TOTEM Use Case

The Luiza analysis framework for GLORIA is based on the Marlin package, which was originally developed for data analysis in the International Linear Collider (ILC), a new High Energy Physics (HEP) project. HEP experiments have to deal with enormous amounts of data, so distributed data analysis is essential. The Marlin framework concept appears well suited to the needs of GLORIA. The idea (and large parts of the code) taken from Marlin is that every computing task is implemented as a processor (module) that analyzes data stored in an internal data structure and adds its own output to that same collection. The advantage of this modular approach is that it keeps things as simple as possible. Each step of the full analysis chain, e.g. from raw images to light curves, can be processed separately, and the output of each step is still self-consistent and can be fed into the next step without any manipulation.
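The processor idea described above can be illustrated with a minimal sketch. It assumes a plain dictionary as the shared data structure and two made-up processing steps; all names here (`subtract_dark`, `measure_flux`, `run_chain`, the dictionary keys) are hypothetical and are not the Luiza or Marlin API.

```python
from typing import Callable, Dict, List

# Hypothetical types: an "event" is a shared collection that every
# processor reads from and appends its own output to.
Event = Dict[str, object]
Processor = Callable[[Event], None]


def subtract_dark(event: Event) -> None:
    """Add a calibrated image to the event, leaving the raw image in place."""
    raw = event["raw_image"]
    dark = event["dark_frame"]
    event["calibrated_image"] = [r - d for r, d in zip(raw, dark)]


def measure_flux(event: Event) -> None:
    """Add a total flux computed from the calibrated image."""
    event["flux"] = sum(event["calibrated_image"])


def run_chain(processors: List[Processor], events: List[Event]) -> List[Event]:
    """Apply each processor in order to every event."""
    for event in events:
        for process in processors:
            process(event)
    return events


# Toy input data, invented for illustration only.
events = [
    {"raw_image": [12, 15, 11], "dark_frame": [2, 2, 2]},
    {"raw_image": [20, 22, 19], "dark_frame": [2, 2, 2]},
]
processed = run_chain([subtract_dark, measure_flux], events)
light_curve = [event["flux"] for event in processed]
```

Because each processor only adds entries, the output of any step still contains everything the earlier steps produced, so the chain can be stopped or extended at any point.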


Striped Data Analysis Framework

EPJ Web of Conferences ◽ 10.1051/epjconf/202024506042 ◽ 2020 ◽ Vol 245 ◽ pp. 06042

Author(s): Oliver Gutsche ◽ Igor Mandrichenko

Keyword(s): Data Analysis ◽ Data Storage ◽ High Energy Physics ◽ Data Access ◽ High Energy ◽ Data Representation ◽ General Idea ◽ Common Data Model ◽ Local File ◽ Energy Physics

A columnar data representation is known to be an efficient way to store data, specifically in cases when the analysis is often based on only a small fragment of the available data structures. A data representation like Apache Parquet goes a step beyond a purely columnar representation by also splitting data horizontally to allow for easy parallelization of data analysis. Based on the general idea of columnar data storage, working on the [LDRD Project], we have developed a striped data representation, which, we believe, is better suited to the needs of High Energy Physics data analysis. A traditional columnar approach allows for efficient data analysis of complex structures. While keeping all the benefits of columnar data representations, the striped mechanism goes further by enabling easy parallelization of computations without requiring special hardware. We will present an implementation and some performance characteristics of such a data representation mechanism using a distributed NoSQL database or a local file system, unified under the same API and data representation model. The representation is efficient and at the same time simple, so that it allows for a common data model and APIs for a wide range of underlying storage mechanisms such as distributed NoSQL databases and local file systems. Striped storage adopts NumPy arrays as its basic data representation format, which makes it easy and efficient to use in Python applications. The Striped Data Server is a web service that hides the server implementation details from the end user, easily exposes data to WAN users, and makes it possible to use well-known and mature data caching solutions to further increase data access efficiency. We are considering the Striped Data Server as the core of an enterprise-scale data analysis platform for High Energy Physics and similar areas of data processing.
We have been testing this architecture with a 2 TB dataset from a CMS dark matter search and plan to expand it to multiple 100 TB or even PB scale. We will present the striped format, the Striped Data Server architecture, and performance test results.
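The general idea of a striped columnar layout can be sketched as follows. This is only an illustration under stated assumptions: each column is a NumPy array, a stripe is a fixed-size horizontal slice of every column, and stripes are processed independently and then combined. The helper names (`to_stripes`, `count_high_pt`), the column names, and the toy data are hypothetical; this is not the Striped Data Server API.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np


def to_stripes(columns, stripe_size):
    """Split every column into horizontal slices of at most stripe_size rows."""
    n_rows = len(next(iter(columns.values())))
    return [
        {name: col[start:start + stripe_size] for name, col in columns.items()}
        for start in range(0, n_rows, stripe_size)
    ]


def count_high_pt(stripe, threshold=30.0):
    """Per-stripe work: touches only the one column the analysis needs."""
    return int(np.count_nonzero(stripe["muon_pt"] > threshold))


# Toy dataset, generated only for illustration.
rng = np.random.default_rng(seed=0)
columns = {
    "muon_pt": rng.exponential(scale=25.0, size=10_000),
    "muon_eta": rng.uniform(-2.4, 2.4, size=10_000),
}

stripes = to_stripes(columns, stripe_size=1_000)

# Each stripe is independent, so the work can be spread over workers
# and the partial results summed afterwards.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_counts = list(pool.map(count_high_pt, stripes))
total = sum(partial_counts)
```

The per-stripe function reads `muon_pt` and never touches `muon_eta`, which is the columnar benefit; splitting into stripes is what lets the same function run on many slices at once.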


High Performance Numerical Computing for High Energy Physics: A New Challenge for Big Data Science

Advances in High Energy Physics ◽ 10.1155/2014/507690 ◽ 2014 ◽ Vol 2014 ◽ pp. 1-13 ◽ Cited By ~ 3

Author(s): Florin Pop

Keyword(s): Monte Carlo ◽ Big Data ◽ Numerical Methods ◽ High Performance ◽ Data Science ◽ Experimental Validation ◽ High Energy Physics ◽ High Energy ◽ Performance Computing ◽ Energy Physics

Modern physics is based on both theoretical analysis and experimental validation. Complex scenarios such as subatomic dimensions, high energy, and low absolute temperature are frontiers for many theoretical models. Simulation with stable numerical methods is an excellent instrument for high-accuracy analysis, experimental validation, and visualization. High performance computing makes it possible to run simulations at large scale and in parallel, but the volume of data generated by these experiments creates a new challenge for Big Data Science. This paper presents existing computational methods for high energy physics (HEP), analyzed from two perspectives: numerical methods and high performance computing. The computational methods presented are Monte Carlo methods and simulations of HEP processes, Markovian Monte Carlo, unfolding methods in particle physics, kernel estimation in HEP, and Random Matrix Theory used in the analysis of particle spectra. All of these methods lead to data-intensive applications, which introduce new challenges and requirements for ICT systems architecture, programming paradigms, and storage capabilities.
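As a small illustration of the Monte Carlo family of methods mentioned above, the sketch below estimates a one-dimensional integral by uniform sampling and reports the statistical error of the estimate. The integrand, the sample size, and the function name `mc_integrate` are assumptions chosen for the example; they are not taken from the paper.

```python
import math

import numpy as np


def mc_integrate(f, a, b, n_samples, rng):
    """Estimate the integral of f over [a, b] from uniform random samples.

    Returns the estimate and its standard error, which shrinks
    like 1 / sqrt(n_samples).
    """
    x = rng.uniform(a, b, size=n_samples)
    values = f(x)
    width = b - a
    estimate = width * values.mean()
    std_error = width * values.std(ddof=1) / math.sqrt(n_samples)
    return estimate, std_error


rng = np.random.default_rng(seed=42)
estimate, std_error = mc_integrate(lambda x: np.exp(-x), 0.0, 1.0, 100_000, rng)

# Closed-form value of the same integral, used only as a cross-check.
exact = 1.0 - math.exp(-1.0)
print(f"estimate = {estimate:.5f} +/- {std_error:.5f}, exact = {exact:.5f}")
```

Halving the error requires four times as many samples, which is one reason such methods become data-intensive at the precision HEP analyses need.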


Editorial: Innovative Analysis Ecosystems for HEP Data

Frontiers in Big Data ◽ 10.3389/fdata.2021.736105 ◽ 2021 ◽ Vol 4

Author(s): Sezen Sekmen ◽ Gian Michele Innocenti ◽ Bo Jayatilaka

Keyword(s): Artificial Intelligence ◽ Big Data ◽ High Energy Physics ◽ High Energy ◽ Research Topic ◽ Energy Physics

This editorial summarizes the contributions to the Frontiers Research Topic “Innovative Analysis Ecosystems for HEP Data”, established under the Big Data and AI in High Energy Physics section and appearing in the journals Frontiers in Big Data and Frontiers in Artificial Intelligence.
