On the road to a scientific data lake for the High Luminosity LHC era

2020 ◽  
Vol 35 (33) ◽  
pp. 2030022
Author(s):  
Aleksandr Alekseev ◽  
Simone Campana ◽  
Xavier Espinal ◽  
Stephane Jezequel ◽  
Andrey Kirianov ◽  
...  

The experiments at CERN’s Large Hadron Collider use the Worldwide LHC Computing Grid, the WLCG, as their distributed computing infrastructure. Through the distributed workload and data management systems, they provide thousands of physicists with seamless access to hundreds of grid, HPC and cloud-based computing and storage resources distributed worldwide. The LHC experiments annually process more than an exabyte of data using an average of 500,000 distributed CPU cores, enabling hundreds of new scientific results from the collider. However, the resources available to the experiments have been insufficient to meet their data processing, simulation and analysis needs over the past five years as the volume of data from the LHC has grown. The problem will be even more severe for the next LHC phases: the High Luminosity LHC will be a multi-exabyte challenge, with envisaged storage and compute needs a factor of 10 to 100 above what the expected technology evolution will provide. The particle physics community needs to evolve its current computing and data organization models and change the way it uses and manages the infrastructure, focusing on optimizations that improve performance and efficiency without neglecting the simplification of operations. In this paper we highlight a recent R&D project on a scientific data lake and federated data storage.

Author(s):  
D. Britton ◽  
A.J. Cass ◽  
P.E.L. Clarke ◽  
J. Coles ◽  
D.J. Colling ◽  
...  

The start-up of the Large Hadron Collider (LHC) at CERN, Geneva, presents a huge challenge in processing and analysing the vast amounts of scientific data that will be produced. The architecture of the worldwide grid that will handle 15 PB of particle physics data annually from this machine is based on a hierarchical tiered structure. We describe the development of the UK component (GridPP) of this grid from a prototype system to a full exploitation grid for real data analysis. This includes the physical infrastructure, the deployment of middleware, operational experience and the initial exploitation by the major LHC experiments.


Author(s):  
M. L. R. Lagahit ◽  
Y. H. Tseng

Abstract. The concept of Autonomous Vehicles (AV), or self-driving cars, has become increasingly popular in recent years, and research and development of AVs has escalated around the world. One such research area concerns High-Definition (HD) maps: very detailed maps that provide all the geometric and semantic information on the road, helping the AV position itself within the lanes as well as map objects and markings on the road. This research focuses on the early stages of updating such HD maps. The methodology consists of (1) running YOLOv3, a real-time object detection system, on a photo taken from a stereo camera to detect the object of interest, in this case a traffic cone, (2) applying the theory of stereo-photogrammetry to determine the 3D coordinates of the traffic cone, and (3) executing all of this simultaneously on a Python-based platform. Results show centimeter-level accuracy in the obtained distance and height of the detected traffic cone from the camera setup. In future work, the observed coordinates can be uploaded to a database and connected to an application for real-time data storage/management and interactive visualization.
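
As an editorial illustration of step (2), the sketch below triangulates a camera-centred 3D position from a rectified stereo pair; the focal length, baseline, principal point and pixel coordinates (e.g. the centre of a YOLOv3 bounding box) are assumed values, not the paper's actual camera parameters.

    # Stereo-photogrammetry sketch: depth from disparity on a rectified stereo pair.
    def triangulate(u_left, v_left, u_right, f_px, baseline_m, cx, cy):
        """Return camera-centred (X, Y, Z) in metres from pixel measurements."""
        disparity = u_left - u_right           # horizontal pixel disparity
        Z = f_px * baseline_m / disparity      # depth via similar triangles
        X = (u_left - cx) * Z / f_px           # lateral offset
        Y = (v_left - cy) * Z / f_px           # vertical offset
        return X, Y, Z

    # Hypothetical detection: cone centre at (830, 540) in the left image and
    # (790, 540) in the right image of a 0.3 m baseline rig.
    print(triangulate(830, 540, 790, f_px=1400.0, baseline_m=0.3, cx=960, cy=540))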


2020 ◽  
Vol 14 (2) ◽  
pp. 212-217 ◽  
Author(s):  
Bernhard Axmann ◽  
Harmoko Harmoko

This research aims to establish an assessment tool for evaluating the readiness of small and medium enterprises (SMEs) for Industry 4.0. Assessing the current and future status is crucial for companies to decide on the right strategy and actions on the road to becoming a digital company. First, existing tools such as IMPULS (VDMA), PwC and Uni-Warwick are compared; on that basis, a tool for SMEs is introduced. The tool has 12 categories: data sharing, data storage, data quality, data processing, product design and development, smart material planning, smart production, smart maintenance, smart logistics, IT security, machine readiness and communication between machines. These categories are grouped into three areas: data, software and hardware. Each category has five levels of readiness (from 1 to 5), with specific criteria drawn from literature studies and expert opinion.
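
A minimal sketch of the readiness model described above: the 12 categories and the three groups come from the abstract, but the assignment of categories to groups beyond the obvious "data" cluster, the example scores, and the averaging used to summarize a group are illustrative assumptions, not the authors' scoring rules.

    # Hypothetical grouping of the 12 categories (only the "data" cluster is
    # explicit in the abstract; the rest is an assumed assignment).
    GROUPS = {
        "data": ["data sharing", "data storage", "data quality", "data processing"],
        "software": ["product design and development", "smart material planning",
                     "smart production", "smart maintenance", "smart logistics"],
        "hardware": ["IT security", "machine readiness",
                     "communication between machines"],
    }

    def group_readiness(scores):
        """Average the 1-5 readiness level of each group (assumed aggregation)."""
        return {group: sum(scores[c] for c in cats) / len(cats)
                for group, cats in GROUPS.items()}

    # Example self-assessment of a hypothetical SME (levels 1-5 per category).
    scores = {c: 3 for cats in GROUPS.values() for c in cats}
    scores["IT security"] = 2
    print(group_readiness(scores))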


2020 ◽  
Vol 226 ◽  
pp. 01007
Author(s):  
Alexei Klimentov ◽  
Douglas Benjamin ◽  
Alessandro Di Girolamo ◽  
Kaushik De ◽  
Johannes Elmsheuser ◽  
...  

The ATLAS experiment at CERN’s Large Hadron Collider uses the Worldwide LHC Computing Grid, the WLCG, for its distributed computing infrastructure. Through the workload management system PanDA and the distributed data management system Rucio, ATLAS provides thousands of physicists with seamless access to hundreds of WLCG grid and cloud-based resources distributed worldwide. PanDA annually processes more than an exabyte of data using an average of 350,000 distributed batch slots, enabling hundreds of new scientific results from ATLAS. However, the resources available to the experiment have been insufficient to meet ATLAS simulation needs over the past few years as the volume of data from the LHC has grown. The problem will be even more severe for the next LHC phases: the High Luminosity LHC will be a multi-exabyte challenge, with envisaged storage and compute needs a factor of 10 to 100 above what the expected technology evolution will provide. The High Energy Physics (HEP) community needs to evolve its current computing and data organization models and change the way it uses and manages the infrastructure, focusing on optimizations that improve performance and efficiency without neglecting the simplification of operations. In this paper we highlight recent R&D projects in HEP related to a data lake prototype, federated data storage and the data carousel.


2016 ◽  
Author(s):  
Matthew Dickinson

[ACCESS RESTRICTED TO THE UNIVERSITY OF MISSOURI AT REQUEST OF AUTHOR.] In recent years, most scientific research in both academia and industry has become increasingly data-driven. According to market estimates, spending related to supporting scientific data-intensive research is expected to increase to $5.8 billion by 2018. Particularly in data-intensive scientific fields such as bioscience or particle physics within academic environments, data storage/processing facilities, expert collaborators and specialized computing resources do not always reside within campus boundaries. With the growing trend of large collaborative partnerships involving researchers, expensive scientific instruments and high performance computing centers, experiments and simulations produce petabytes of data, viz. Big Data, that is likely to be shared and analyzed by scientists in multi-disciplinary areas. Federated multi-cloud resource allocation for data-intensive application workflows is generally performed based on performance or quality-of-service considerations (i.e., QSpecs). At the same time, the end-to-end security requirements of these workflows across multiple domains are treated as an afterthought due to a lack of standardized formalization methods. Consequently, heterogeneous domain resource and security policies cause conflicts between an application's security and performance requirements, leading to sub-optimal resource allocations, especially when multiple such applications contend for limited resources. In this thesis, a joint performance- and security-driven federated resource allocation scheme for data-intensive scientific applications is presented. To aid joint resource brokering among multi-cloud domains with heterogeneous security postures, a data-intensive application's security specifications (i.e., SSpecs) are first defined and characterized. Next, an alignment technique inspired by Portunes Algebra is presented to homogenize the various domain resource policies (i.e., RSpecs) along an application's workflow lifecycle stages. Using this formalization and alignment, a near-optimal, cost-aware, joint QSpecs-SSpecs-driven, RSpecs-compliant resource allocation algorithm for multi-cloud computing resource domain/location selection as well as network path selection is proposed. We implement our security formalization, alignment, and allocation scheme as a framework, viz. "OnTimeURB", and validate it in a multi-cloud environment with exemplar data-intensive application workflows involving distributed computing and remote instrumentation use cases with different performance and security requirements.
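
To make the allocation idea concrete, here is a hedged sketch of a joint performance/security-driven selection over candidate cloud domains; the data model, attribute names and simple cost minimization are illustrative assumptions, not the OnTimeURB algorithm itself.

    # Sketch: pick the cheapest domain whose policy (RSpecs, simplified to a
    # security level) satisfies the application's SSpecs and QSpecs.
    from dataclasses import dataclass

    @dataclass
    class Domain:
        name: str
        security_level: int      # simplified stand-in for the domain's RSpecs
        throughput_gbps: float   # simplified stand-in for offered QoS
        cost_per_hour: float

    def allocate(domains, min_security, min_throughput_gbps):
        """Return the cheapest feasible domain, or None if none complies."""
        feasible = [d for d in domains
                    if d.security_level >= min_security
                    and d.throughput_gbps >= min_throughput_gbps]
        return min(feasible, key=lambda d: d.cost_per_hour) if feasible else None

    # Hypothetical candidates for one workflow stage.
    candidates = [Domain("cloud-A", 3, 10.0, 4.0), Domain("cloud-B", 2, 40.0, 2.5)]
    print(allocate(candidates, min_security=3, min_throughput_gbps=5.0))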


2021 ◽  
Vol 251 ◽  
pp. 03061
Author(s):  
Gordon Watts

Array operations are one of the most concise ways of expressing the common filtering and simple aggregation operations that are the hallmark of a particle physics analysis: selection, filtering, basic vector operations, and filling histograms. The High Luminosity run of the Large Hadron Collider (HL-LHC), scheduled to start in 2026, will require physicists to regularly skim datasets that are over a PB in size, and to repeatedly run over datasets of hundreds of TB, too big to fit in memory. Declarative programming techniques are a way of separating the intent of the physicist from the mechanics of finding the data and using distributed computing to process it and make histograms. This paper describes a library that implements a declarative distributed framework based on array programming. This prototype library provides a framework in which different sub-systems cooperate, via plug-ins, to produce plots. The prototype has a ServiceX data-delivery sub-system and an awkward-array sub-system that cooperate to generate requested data or plots: ServiceX runs against ATLAS xAOD data and flat ROOT TTrees, and awkward operates on the columnar data produced by ServiceX.
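
As a small illustration of the array-programming style the paper builds on (this is plain awkward-array usage, not the prototype's ServiceX/plug-in interface; the jagged event layout and cut values are made up):

    # Array-style selection and histogram filling with awkward + numpy.
    import awkward as ak
    import numpy as np

    # Jagged per-event electron kinematics (GeV); illustrative values only.
    events = ak.Array({
        "ele_pt":  [[42.1, 18.3], [55.7], [12.9, 31.0, 8.4]],
        "ele_eta": [[0.3, -1.2],  [2.1],  [-0.5, 1.8, 0.1]],
    })

    # Declarative-style cuts: keep electrons with pT > 25 GeV and |eta| < 2.5.
    mask = (events.ele_pt > 25) & (np.abs(events.ele_eta) < 2.5)
    selected_pt = events.ele_pt[mask]

    # Fill a simple histogram of the surviving pT values.
    counts, edges = np.histogram(ak.to_numpy(ak.flatten(selected_pt)),
                                 bins=10, range=(0, 100))
    print(counts)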


2020 ◽  
Vol 1690 ◽  
pp. 012166
Author(s):  
A Alekseev ◽  
A Kiryanov ◽  
A Klimentov ◽  
T Korchuganova ◽  
V Mitsyn ◽  
...  

2021 ◽  
Vol 5 (1) ◽  
Author(s):  
Yutaro Iiyama ◽  
Benedikt Maier ◽  
Daniel Abercrombie ◽  
Maxim Goncharov ◽  
Christoph Paus

Abstract. Dynamo is a full-stack software solution for scientific data management. Dynamo’s architecture is modular, extensible, and customizable, making the software suitable for managing data in a wide range of installation scales, from a few terabytes stored at a single location to hundreds of petabytes distributed across a worldwide computing grid. This article documents the core system design of Dynamo and describes the applications that implement various data management tasks. A brief report is also given on the operational experiences of the system at the CMS experiment at the CERN Large Hadron Collider and at a small-scale analysis facility.


2022 ◽  
Vol 137 (1) ◽  
Author(s):  
Alain Blondel ◽  
Patrick Janot

Abstract. With its high luminosity, its clean experimental conditions, and a range of energies that cover the four heaviest particles known today, FCC-ee offers a wealth of physics possibilities, with high potential for discoveries. The FCC-ee is an essential and complementary step towards a 100 TeV hadron collider, and as such offers a uniquely powerful combined physics program. This vision is the backbone of the 2020 European Strategy for Particle Physics. One of the main challenges is now to design experimental systems that can, demonstrably, fully exploit these extraordinary opportunities.

