Comparison of data storage and analysis throughput in the light of high energy physics experiment MACE

2020 ◽  
Vol 33 ◽  
pp. 100409
Author(s):  
D. Sarkar ◽  
Mahesh P. ◽  
Padmini S. ◽  
N. Chouhan ◽  
C. Borwankar ◽  
...  
2019 ◽  
Author(s):  
Juan Carlos Cabanillas Noris ◽  
Ildefonso León Monzón ◽  
Mario Iván Martínez Hernández ◽  
Solangel Rojas Torres

2020 ◽  
Vol 245 ◽  
pp. 06042
Author(s):  
Oliver Gutsche ◽  
Igor Mandrichenko

A columnar data representation is known to be an efficient way to store data, specifically when an analysis typically uses only a small fragment of the available data structures. A data representation like Apache Parquet is a step forward from a purely columnar representation: it also splits data horizontally to allow for easy parallelization of data analysis. Based on the general idea of columnar data storage, and working on the [LDRD Project], we have developed a striped data representation which, we believe, is better suited to the needs of High Energy Physics data analysis. A traditional columnar approach allows for efficient data analysis of complex structures. While keeping all the benefits of columnar data representations, the striped mechanism goes further by enabling easy parallelization of computations without requiring special hardware. We will present an implementation and some performance characteristics of such a data representation mechanism using a distributed no-SQL database or a local file system, unified under the same API and data representation model. The representation is efficient and at the same time simple, so it allows for a common data model and APIs across a wide range of underlying storage mechanisms, such as distributed no-SQL databases and local file systems. Striped storage adopts NumPy arrays as its basic data representation format, which makes it easy and efficient to use in Python applications. The Striped Data Server is a web service that hides the server implementation details from the end user, easily exposes data to WAN users, and makes it possible to use well-known, mature data caching solutions to further increase data access efficiency. We are considering the Striped Data Server as the core of an enterprise-scale data analysis platform for High Energy Physics and similar areas of data processing.
We have been testing this architecture with a 2 TB dataset from a CMS dark matter search and plan to expand it to the scale of multiple 100 TB or even PB. We will present the striped format, the Striped Data Server architecture, and performance test results.
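The abstract's central idea, columns split horizontally into stripes that can be processed independently, can be illustrated with a minimal sketch. This is not the Striped Data Server API: the function names (`to_stripes`, `stripe_partial_sum`), the toy columns `pt` and `n_jets`, and the stripe size are all hypothetical, chosen only to show why per-stripe NumPy arrays make parallel analysis straightforward.

```python
import numpy as np

def to_stripes(columns, stripe_size):
    """Split every column (a 1-D NumPy array) into row groups ("stripes").

    Hypothetical helper: each stripe is a dict mapping a column name to the
    slice of that column covering the stripe's rows.
    """
    n_rows = len(next(iter(columns.values())))
    stripes = []
    for start in range(0, n_rows, stripe_size):
        stop = min(start + stripe_size, n_rows)
        stripes.append({name: col[start:stop] for name, col in columns.items()})
    return stripes

def stripe_partial_sum(stripe, column, cut_column, threshold):
    """Per-stripe worker: reads only the two columns it needs."""
    mask = stripe[cut_column] > threshold
    return float(stripe[column][mask].sum()), int(mask.sum())

# Toy dataset (assumed for illustration): 10 events, two attributes.
columns = {
    "pt": np.arange(10, dtype=np.float64),
    "n_jets": np.array([1, 3, 2, 4, 0, 5, 2, 3, 1, 4]),
}

stripes = to_stripes(columns, stripe_size=4)

# Each stripe is independent, so this loop could be handed to any worker
# pool; the partial results are combined with a simple reduction.
partials = [stripe_partial_sum(s, "pt", "n_jets", 2) for s in stripes]
total_pt = sum(p[0] for p in partials)
n_selected = sum(p[1] for p in partials)
```

Because a worker touches only the columns it asks for and only the rows of its own stripe, the same reduction pattern applies whether the stripes come from a local file system or a remote store.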


Author(s):  
J. Apostolakis ◽  
L. M. Bertolotto ◽  
C. E. Bruschini ◽  
P. Calafiura ◽  
F. Gagliardi ◽  
...  

2019 ◽  
Vol 214 ◽  
pp. 05026
Author(s):  
Jiaheng Zou ◽  
Tao Lin ◽  
Weidong Li ◽  
Xingtao Huang ◽  
Ziyan Deng ◽  
...  

SNiPER is a general-purpose offline software framework for high energy physics experiments. It provides some features that are attractive to neutrino experiments, such as the event buffer. The buffer holds more than one event at a time, selected by a customizable time window, so that users can easily perform event correlation analyses. We also implemented MT-SNiPER to support multithreaded computing based on Intel TBB. In MT-SNiPER, the event loop is split into pieces, and each piece is dispatched to a task. The global buffer, an extension and enhancement of the event buffer, is implemented for MT-SNiPER. The global buffer is accessible to all threads and keeps all the events being processed in memory. When a task becomes available, a subset of the buffered events is dispatched to that task. The subsets in different tasks can overlap because of the time window; however, each event is guaranteed to be processed only once. On the task side, each subset of events is managed locally by a normal event buffer, so the global buffer can be transparent to most user algorithms. With the global buffer, the multithreaded computing of MT-SNiPER becomes more practicable.
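The dispatch scheme described in this abstract, overlapping subsets with each event processed exactly once, can be sketched in a few lines. This is an illustration under stated assumptions, not SNiPER or MT-SNiPER code: the function `make_task_subsets`, the fixed `chunk_size`, and the symmetric `window` are hypothetical, and no threading library is involved.

```python
from bisect import bisect_left, bisect_right

def make_task_subsets(times, chunk_size, window):
    """Plan per-task subsets of a time-sorted list of event timestamps.

    Returns a list of (owned, context) pairs of half-open index ranges:
      owned   - events this task is responsible for processing
      context - owned events plus neighbours within `window` of them,
                which the task may read for correlation but not process
    """
    subsets = []
    for start in range(0, len(times), chunk_size):
        stop = min(start + chunk_size, len(times))
        lo = bisect_left(times, times[start] - window)
        hi = bisect_right(times, times[stop - 1] + window)
        subsets.append(((start, stop), (lo, hi)))
    return subsets

# Toy timestamps (assumed for illustration), already sorted.
times = [0, 1, 2, 3, 4, 5, 6, 7]
subsets = make_task_subsets(times, chunk_size=3, window=1)

# Context ranges overlap between neighbouring tasks, but the owned
# ranges partition the events, so every event is processed once.
owned_indices = [i for (start, stop), _ in subsets for i in range(start, stop)]
```

Separating what a task owns from what it can see is one simple way to reconcile overlapping time windows with an exactly-once guarantee; the actual framework may enforce this differently.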


2019 ◽  
Vol 214 ◽  
pp. 04020
Author(s):  
Martin Barisits ◽  
Fernando Barreiro ◽  
Thomas Beermann ◽  
Karan Bhatia ◽  
Kaushik De ◽  
...  

Transparent use of commercial cloud resources for scientific experiments is a hard problem. In this article, we describe the first steps of the Data Ocean R&D collaboration between the high-energy physics experiment ATLAS and Google Cloud Platform, which aims to allow seamless use of Google Compute Engine and Google Cloud Storage for physics analysis. We start by describing the three preliminary use cases that were identified at the beginning of the project. The following sections then detail the work done in the data management system Rucio and the workflow management systems PanDA and Harvester to interface Google Cloud Platform with the ATLAS distributed computing environment, and show the results of the integration tests. Afterwards, we describe the setup and results of a full ATLAS user analysis that was executed natively on Google Cloud Platform, and give estimates of projected costs. We close with a summary and an outlook on future work.


1987 ◽  
Vol 34 (4) ◽  
pp. 980-983
Author(s):  
Bernard Wadsworth ◽  
Richard Lanza ◽  
M. J. LeVine ◽  
R. A. Scheetz ◽  
Flemming Videbaek
