Comparison of data storage and analysis throughput in the light of high energy physics experiment MACE

A columnar data representation is known to be an efficient way to store data, particularly when an analysis typically touches only a small fraction of the available data structures. Formats such as Apache Parquet go a step beyond a purely columnar representation by also splitting the data horizontally, which allows easy parallelization of data analysis. Building on the general idea of columnar data storage, and working on the [LDRD Project], we have developed a striped data representation which, we believe, is better suited to the needs of High Energy Physics data analysis. A traditional columnar approach allows for efficient data analysis of complex structures. While keeping all the benefits of columnar data representations, the striped mechanism goes further by enabling easy parallelization of computations without requiring special hardware. We will present an implementation and some performance characteristics of such a data representation mechanism using a distributed no-SQL database or a local file system, unified under the same API and data representation model. The representation is both efficient and simple, so it supports a common data model and common APIs across a wide range of underlying storage mechanisms, such as distributed no-SQL databases and local file systems. Striped storage adopts Numpy arrays as its basic data representation format, which makes it easy and efficient to use in Python applications. The Striped Data Server is a web service, which hides the server implementation details from the end user, easily exposes data to WAN users, and makes it possible to use well-known, mature data caching solutions to further increase data access efficiency. We are considering the Striped Data Server as the core of an enterprise-scale data analysis platform for High Energy Physics and similar areas of data processing.
We have been testing this architecture with a 2 TB dataset from a CMS dark matter search and plan to expand it to several hundred TB or even the PB scale. We will present the striped format, the Striped Data Server architecture, and performance test results.
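
The striped idea described above can be illustrated with a minimal sketch. It is not the Striped Data Server API: the function names, the column names `pt` and `eta`, the stripe size, and the use of a thread pool are all assumptions made for illustration. Each column is a NumPy array, the rows are split horizontally into stripes, and a computation that needs only one column is run on the stripes in parallel.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor


def make_stripes(columns, stripe_size):
    """Split every column (a NumPy array) horizontally into row groups ("stripes")."""
    n_rows = len(next(iter(columns.values())))
    stripes = []
    for start in range(0, n_rows, stripe_size):
        stop = min(start + stripe_size, n_rows)
        # Slicing a NumPy array returns a view, so no data is copied here.
        stripes.append({name: arr[start:stop] for name, arr in columns.items()})
    return stripes


def count_high_pt(stripe, threshold):
    """Per-stripe work: only the "pt" column is read, the others are never touched."""
    return int(np.count_nonzero(stripe["pt"] > threshold))


# Hypothetical event data: two columns with one value per event.
rng = np.random.default_rng(seed=0)
n_events = 10_000
columns = {
    "pt": rng.exponential(scale=20.0, size=n_events),
    "eta": rng.uniform(-2.5, 2.5, size=n_events),
}

stripes = make_stripes(columns, stripe_size=1_000)

# Stripes are independent, so they can be processed in parallel and combined.
with ThreadPoolExecutor(max_workers=4) as pool:
    partial_counts = list(pool.map(lambda s: count_high_pt(s, 30.0), stripes))

total = sum(partial_counts)
print(f"{total} of {n_events} events have pt > 30")
```

Because every stripe holds the same columns for a contiguous range of events, the per-stripe results can be combined with a simple sum, and the same pattern would apply whether the stripes come from a local file system or a remote store.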


Data Acquisition for a High Energy Physics Experiment Using a Parallel/Serial CAMAC Driver

IEEE Transactions on Nuclear Science ◽ 10.1109/tns.1981.4331877 ◽ 1981 ◽ Vol 28 (5) ◽ pp. 3913-3915

Author(s): Bert T. Yost

Keyword(s): High Energy Physics ◽ High Energy ◽ Physics Experiment ◽ Data Acquisition ◽ Energy Physics


Using of FPGA Coprocessor for Improving the Execution Speed of the Pattern Recognition Algorithm for ATLAS – High Energy Physics Experiment

Field Programmable Logic and Application - Lecture Notes in Computer Science ◽ 10.1007/978-3-540-30117-2_80 ◽ 2004 ◽ pp. 791-800

Author(s): Christian Hinkelbein ◽ Andrei Khomich ◽ Andreas Kugel ◽ Reinhard Männer ◽ Matthias Müller

Keyword(s): Pattern Recognition ◽ High Energy Physics ◽ High Energy ◽ Recognition Algorithm ◽ Pattern Recognition Algorithm ◽ Execution Speed ◽ Physics Experiment ◽ Energy Physics


General-purpose parallel computing in a high-energy physics experiment at CERN

High-Performance Computing and Networking - Lecture Notes in Computer Science ◽ 10.1007/3-540-61142-8_555 ◽ 1996 ◽ pp. 251-257 ◽ Cited By ~ 1

Author(s): J. Apostolakis ◽ L. M. Bertolotto ◽ C. E. Bruschini ◽ P. Calafiura ◽ F. Gagliardi ◽ ...

Keyword(s): Parallel Computing ◽ High Energy Physics ◽ High Energy ◽ General Purpose ◽ Physics Experiment ◽ Energy Physics


The Event Buffer Management for MT-SNiPER

EPJ Web of Conferences ◽ 10.1051/epjconf/201921405026 ◽ 2019 ◽ Vol 214 ◽ pp. 05026

Author(s): Jiaheng Zou ◽ Tao Lin ◽ Weidong Li ◽ Xingtao Huang ◽ Ziyan Deng ◽ ...

Keyword(s): Correlation Analysis ◽ High Energy Physics ◽ Time Window ◽ Buffer Management ◽ High Energy ◽ General Purpose ◽ Software Framework ◽ Neutrino Experiments ◽ Physics Experiment ◽ Energy Physics

SNiPER is a general-purpose offline software framework for high energy physics experiments. It provides some features that are attractive to neutrino experiments, such as the event buffer. The buffer holds more than one event at a time, selected by a customizable time window, which makes it easy for users to perform event correlation analysis. We have also implemented MT-SNiPER to support multithreaded computing based on Intel TBB. In MT-SNiPER, the event loop is split into pieces, and each piece is dispatched to a task. The global buffer, an extension and enhancement of the event buffer, is implemented for MT-SNiPER. The global buffer is accessible to all threads and keeps all the events being processed in memory. When a task becomes available, a subset of the buffered events is dispatched to it. Because of the time window, the subsets held by different tasks can overlap. However, each event is guaranteed to be processed only once. On the task side, each subset of events is managed locally by a normal event buffer, so the global buffer can be transparent to most user algorithms. With the global buffer, multithreaded computing in MT-SNiPER becomes more practical.
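
The dispatch scheme described in this abstract can be sketched roughly as follows. This is not SNiPER or MT-SNiPER code: the function name, the fixed chunk size, the event times, and the sequential loop standing in for TBB tasks are all assumptions made for illustration. Each task owns a contiguous chunk of events and also receives the neighbouring events that fall inside the time window, so the subsets overlap while every event is still processed by exactly one task.

```python
from bisect import bisect_left, bisect_right


def make_task_subsets(times, chunk_size, window):
    """Return ((lo, hi), (start, stop)) index ranges for each task.

    `times` must be sorted. (start, stop) is the chunk of events the task owns
    and processes; (lo, hi) is the wider range copied into its local buffer so
    that every owned event can see its neighbours within `window`.
    """
    subsets = []
    for start in range(0, len(times), chunk_size):
        stop = min(start + chunk_size, len(times))
        lo = bisect_left(times, times[start] - window)
        hi = bisect_right(times, times[stop - 1] + window)
        subsets.append(((lo, hi), (start, stop)))
    return subsets


# Hypothetical event timestamps, already sorted, playing the role of the global buffer.
times = [0.0, 1.0, 2.5, 3.0, 7.0, 7.2, 9.0, 12.0]
window = 1.0
subsets = make_task_subsets(times, chunk_size=3, window=window)

processed = []     # indices of processed events, in order
correlations = {}  # event index -> indices of other events within the window

# A sequential loop stands in for the tasks a thread pool would run.
for (lo, hi), (start, stop) in subsets:
    local_indices = range(lo, hi)  # the task's local event buffer
    for i in range(start, stop):   # only owned events are processed
        correlations[i] = [
            j for j in local_indices
            if j != i and abs(times[j] - times[i]) <= window
        ]
        processed.append(i)
```

With these timestamps, the first two tasks receive the overlapping index ranges 0 to 3 and 2 to 5, yet the owned chunks are disjoint, which is what keeps each event from being processed twice.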


The Data Ocean Project

EPJ Web of Conferences ◽ 10.1051/epjconf/201921404020 ◽ 2019 ◽ Vol 214 ◽ pp. 04020 ◽ Cited By ~ 2

Author(s): Martin Barisits ◽ Fernando Barreiro ◽ Thomas Beermann ◽ Karan Bhatia ◽ Kaushik De ◽ ...

Keyword(s): High Energy Physics ◽ Workflow Management ◽ High Energy ◽ Data Management System ◽ Cloud Platform ◽ Scientific Experiments ◽ Physics Experiment ◽ Future Work ◽ Work Done ◽ Energy Physics

Transparent use of commercial cloud resources for scientific experiments is a hard problem. In this article, we describe the first steps of the Data Ocean R&D collaboration between the high-energy physics experiment ATLAS and Google Cloud Platform, which aims to allow seamless use of Google Compute Engine and Google Cloud Storage for physics analysis. We start by describing the three preliminary use cases that were identified at the beginning of the project. The following sections then detail the work done in the data management system Rucio and the workflow management systems PanDA and Harvester to interface Google Cloud Platform with the ATLAS distributed computing environment, and show the results of the integration tests. Afterwards, we describe the setup and results of a full ATLAS user analysis that was executed natively on Google Cloud Platform, and give estimates of projected costs. We close with a summary and an outlook on future work.
