Software Training in HEP

2021 ◽ Vol 5 (1) ◽ Author(s): Sudhir Malik, Samuel Meehan, Kilian Lieret, Meirin Oan Evans, Michel H. Villanueva, ...

The long-term sustainability of the high-energy physics (HEP) research software ecosystem is essential to the field. With new facilities and upgrades coming online throughout the 2020s, this will only become increasingly important. Meeting the sustainability challenge requires a workforce with a combination of HEP domain knowledge and advanced software skills. The required software skills fall into three broad groups. The first is fundamental and generic software engineering (e.g., Unix, version control, C++, and continuous integration). The second is knowledge of domain-specific HEP packages and practices (e.g., the ROOT data format and analysis framework). The third is more advanced knowledge involving specialized techniques, including parallel programming, machine learning and data science tools, and techniques to maintain software projects at all scales. This paper discusses the collective software training program in HEP led by the HEP Software Foundation (HSF) and the Institute for Research and Innovation in Software in HEP (IRIS-HEP). The program equips participants with an array of software skills that serve as ingredients for the solution of HEP computing challenges. Beyond serving the community by ensuring that members are able to pursue research goals, the program serves individuals by providing intellectual capital and transferable skills important to careers in the realm of software and computing, inside or outside HEP.
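
As an illustration of the domain-specific skills such training covers, the sketch below reads a ROOT tree into arrays with the community uproot package; the file, tree, and branch names are hypothetical placeholders, not material from the training program itself.

```python
# Minimal sketch of a domain-specific HEP task of the kind covered by such training:
# reading a ROOT tree into NumPy arrays with the community uproot package.
# The file, tree, and branch names below are hypothetical placeholders.
import uproot

with uproot.open("example_events.root") as events_file:
    tree = events_file["Events"]
    arrays = tree.arrays(["met_pt", "n_jets"], library="np")

# Simple per-event selection: keep events with missing ET above 30 GeV
mask = arrays["met_pt"] > 30.0
print("selected events:", int(mask.sum()))
```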

2014 ◽ Vol 2014 ◽ pp. 1-13 ◽ Author(s): Florin Pop

Modern physics is based on both theoretical analysis and experimental validation. Complex scenarios such as subatomic dimensions, high energies, and very low absolute temperatures are frontiers for many theoretical models. Simulation with stable numerical methods represents an excellent instrument for high-accuracy analysis, experimental validation, and visualization. High-performance computing makes it possible to run such simulations at large scale and in parallel, but the volume of data generated by these experiments creates a new challenge for Big Data Science. This paper presents existing computational methods for high energy physics (HEP), analyzed from two perspectives: numerical methods and high performance computing. The computational methods presented are Monte Carlo methods and simulations of HEP processes, Markovian Monte Carlo, unfolding methods in particle physics, kernel estimation in HEP, and Random Matrix Theory as used in the analysis of particle spectra. All of these methods give rise to data-intensive applications, which introduce new challenges and requirements for ICT systems architecture, programming paradigms, and storage capabilities.
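
As a toy illustration of the class of methods surveyed here (not an example from the paper itself), the snippet below estimates an integral by uniform Monte Carlo sampling, the same principle that underlies Monte Carlo simulation of HEP processes.

```python
# Toy Monte Carlo integration: estimate an integral by uniform random sampling.
# Illustrative only; the paper's methods target full HEP process simulation.
import random

def mc_integrate(f, a, b, n=100_000):
    """Estimate the integral of f over [a, b] from n uniform samples."""
    total = sum(f(random.uniform(a, b)) for _ in range(n))
    return (b - a) * total / n

# Example: integral of x^2 on [0, 1]; exact value is 1/3.
print(mc_integrate(lambda x: x * x, 0.0, 1.0))
```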


2018 ◽ Vol 12 (2) ◽ pp. 266-273 ◽ Author(s): Jez Cope, James Baker

Much time and energy is now being devoted to developing the skills of researchers in the related areas of data analysis and data management. However, less attention is currently paid to developing the data skills of librarians themselves: these skills are often brought in by recruitment in niche areas rather than considered as a wider development need for the library workforce, and are not widely recognised as important to the professional career development of librarians. We believe that building computational and data science capacity within academic libraries will have direct benefits for both librarians and the users we serve. Library Carpentry is a global effort to provide training to librarians in technical areas that have traditionally been seen as the preserve of researchers, IT support and systems librarians. Established non-profit volunteer organisations, such as Software Carpentry and Data Carpentry, offer introductory research software skills training with a focus on the needs and requirements of research scientists. Library Carpentry is a comparable introductory software skills training programme with a focus on the needs and requirements of library and information professionals. This paper describes how the material was developed and delivered, and reports on challenges faced, lessons learned and future plans.


2019 ◽ Vol 214 ◽ pp. 05005 ◽ Author(s): Chris Burr, Marco Clemencic, Ben Couturier

Software is an essential and rapidly evolving component of modern high energy physics research. The ability to be agile and take advantage of new and updated packages from the wider data science community allows physicists to make efficient use of the data available to them. However, these packages often introduce complex dependency chains and evolve rapidly, introducing specific, and sometimes conflicting, version requirements that can make managing environments challenging. Additionally, there is a need to replicate old environments when generating simulated data and to utilise pre-existing datasets. Nix is a “purely functional package manager” that allows software to be built and distributed with fully specified dependencies, making packages independent of those available on the host. Builds are reproducible, and multiple versions and configurations of each package can coexist, with the build configuration of each preserved exactly. Here we give an overview of Nix, describe the work that has been done to use Nix in LHCb, and discuss the advantages and challenges that this brings.
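
The central idea, that a package's identity is a pure function of its complete build inputs, can be sketched conceptually in a few lines of Python; this is an illustration of the principle only, not Nix's actual implementation.

```python
# Conceptual sketch of the "purely functional" idea behind Nix (not its implementation):
# a package's store path is derived from a hash of all of its build inputs, so
# different versions and configurations coexist without interfering.
import hashlib
import json

def store_path(name, version, dependencies, build_flags):
    spec = json.dumps(
        {"name": name, "version": version, "deps": dependencies, "flags": build_flags},
        sort_keys=True,
    )
    digest = hashlib.sha256(spec.encode()).hexdigest()[:32]
    return f"/nix/store/{digest}-{name}-{version}"

# Two configurations of the same package get distinct, reproducible paths.
print(store_path("root", "6.24.06", {"python": "3.9"}, ["-O2"]))
print(store_path("root", "6.24.06", {"python": "3.8"}, ["-O2"]))
```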


2021 ◽ Vol 4 ◽ Author(s): Simone Mosciatti, Clemens Lange, Jakob Blomer

Recent years have seen a revolution in the way scientific workloads are executed, thanks to the wide adoption of software containers. These containers run largely isolated from the host system, ensuring that the development and execution environments are the same everywhere. This enables full reproducibility of the workloads and therefore also of the associated scientific analyses. However, as the research software used becomes increasingly complex, software images easily grow to sizes of multiple gigabytes. Downloading the full image onto every single compute node on which the containers are executed becomes impractical. In this paper, we describe a novel way of distributing software images on the Kubernetes platform, with which the container can start before the entire image contents become available locally (so-called “lazy pulling”). Each file required for the execution is fetched individually and cached on demand using the CernVM File System (CVMFS), enabling the execution of very large software images on potentially thousands of Kubernetes nodes with very little overhead. We present several performance benchmarks making use of typical high-energy physics analysis workloads.
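
The lazy-pulling idea, fetching and caching each file only when the running workload first touches it, can be sketched as follows; this is a conceptual illustration rather than the CVMFS implementation, and the repository URL and file path are placeholders.

```python
# Conceptual sketch of lazy pulling: fetch a file from the image repository only on
# first access and cache it locally, instead of downloading the whole image up front.
# Not the CVMFS implementation; the base URL and file path are placeholders.
import os
import urllib.request

CACHE_DIR = "/tmp/image-cache"
REPO_URL = "https://example.org/image-repo"

def lazy_open(path):
    """Return a local file handle, fetching the file on demand the first time."""
    cached = os.path.join(CACHE_DIR, path.lstrip("/"))
    if not os.path.exists(cached):
        os.makedirs(os.path.dirname(cached), exist_ok=True)
        urllib.request.urlretrieve(f"{REPO_URL}/{path.lstrip('/')}", cached)
    return open(cached, "rb")

# Only the files the workload actually touches are ever downloaded.
with lazy_open("/usr/lib/analysis/libphysics.so") as lib:
    header = lib.read(16)
```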


2021 ◽ Vol 7 ◽ pp. e441 ◽ Author(s): Jeffrey C. Oliver, Torbet McNeil

The interdisciplinary field of data science, which applies techniques from computer science and statistics to address questions across domains, has enjoyed considerable recent growth and interest. This emergence also extends to undergraduate education, with a growing number of institutions now offering degree programs in data science. However, there is considerable variation in what the field actually entails and, by extension, differences in how undergraduate programs prepare students for data-intensive careers. We used two seminal frameworks for data science education to evaluate undergraduate data science programs at a subset of 4-year institutions in the United States; developing and applying a rubric, we assessed how well each program met the guidelines of each framework. Most programs scored highly in statistics and computer science and low in domain-specific education, ethics, and communication. Moreover, the academic unit administering the degree program significantly influenced the course-load distribution of computer science and statistics/mathematics courses. We conclude that current data science undergraduate programs provide solid grounding in computational and statistical approaches, yet may not deliver sufficient context in terms of the domain knowledge and ethical considerations necessary for appropriate data science applications. Additional refinement of the expectations for undergraduate data science education is warranted.
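
A rubric-based assessment of this kind can be sketched in a few lines; the categories, scale, and scores below are illustrative placeholders, not the rubric actually used in the study.

```python
# Hypothetical sketch of scoring a degree program against framework guidelines.
# Categories, the 0-3 scale, and the example ratings are illustrative placeholders.
RUBRIC = ["statistics", "computer science", "domain knowledge", "ethics", "communication"]

def score_program(ratings):
    """Average a program's 0-3 ratings over every rubric category (missing = 0)."""
    return sum(ratings.get(category, 0) for category in RUBRIC) / len(RUBRIC)

example_program = {"statistics": 3, "computer science": 3, "domain knowledge": 1, "ethics": 0}
print(score_program(example_program))  # low domain/ethics scores pull the average down
```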


2021 ◽ Vol 251 ◽ pp. 03068 ◽ Author(s): Mason Proffitt, Gordon Watts

The traditional approach in HEP analysis software is to loop over every event and every object via the ROOT framework. This method follows an imperative paradigm, in which the code is tied to the storage format and the steps of execution. A more desirable strategy would be to implement a declarative language, such that the storage medium and execution are not included in the abstraction model. This will become increasingly important for managing the large datasets collected by the LHC and the HL-LHC. A new analysis description language (ADL) inspired by functional programming, FuncADL, was developed using Python as a host language. The expressiveness of this language was tested by implementing example analysis tasks designed to benchmark the functionality of ADLs. Many simple selections are expressible in a declarative way with FuncADL, which can be used as an interface to retrieve filtered data. Some limitations were identified, but the design of the language allows for future extensions to add missing features. FuncADL is part of a suite of analysis software tools being developed by the Institute for Research and Innovation in Software for High Energy Physics (IRIS-HEP). These tools will be available to develop highly scalable physics analyses for the LHC.
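
The imperative-versus-declarative contrast can be illustrated with a tiny chainable query class over plain Python lists; this mimics the functional style of FuncADL-like Select/Where operators but is not the FuncADL API itself.

```python
# Illustration of imperative vs. declarative selection. The Query class mimics the
# functional style of FuncADL-like operators (Select/Where) on plain Python lists;
# it is not the FuncADL API itself.
class Query:
    def __init__(self, data):
        self.data = list(data)

    def Where(self, predicate):
        return Query(x for x in self.data if predicate(x))

    def Select(self, func):
        return Query(func(x) for x in self.data)

    def value(self):
        return self.data

jet_pts = [12.0, 45.3, 28.1, 97.6]

# Imperative: an explicit loop tied to the data layout and execution order.
selected = []
for pt in jet_pts:
    if pt > 30.0:
        selected.append(pt)

# Declarative: state *what* is wanted; how and where it runs is left to the backend.
selected_decl = Query(jet_pts).Where(lambda pt: pt > 30.0).value()
assert selected == selected_decl
```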


2021 ◽ Vol 251 ◽ pp. 02041 ◽ Author(s): Ishank Arora, Samuel Alfageme Sainz, Pedro Ferreira, Hugo Gonzalez Labrador, Jakub Moscicki

In recent years, cloud sync & share storage services provided by academic and research institutions have become a daily workplace environment for many local user groups in the High Energy Physics (HEP) community. These services, however, are largely disconnected and deployed in isolation from one another, even though new technologies have been developed and integrated to further increase the value of data. The EU-funded CS3MESH4EOSC project is connecting locally and individually provided sync and share services, and scaling them up to the European level and beyond. It aims to deliver the ScienceMesh service, an interoperable platform that makes it easy to sync and share data across institutions and to extend functionality by connecting to other research services using streamlined sets of interoperable protocols, APIs and deployment methodologies. This supports multiple distributed application workflows: data science environments, collaborative editing and data transfer services. In this paper, we present the architecture of ScienceMesh and the technical design of its reference implementation, a platform that allows organizations to join the federated service infrastructure easily and to access application services out of the box. We discuss the challenges faced during the process, which include the diversity of sync & share platforms (Nextcloud, ownCloud, Seafile and others), the absence of global user identities and user discovery, the lack of interoperable protocols and APIs, and access control and protection of data endpoints. We present the rationale for the design decisions adopted to tackle these challenges and describe our deployment architecture based on Kubernetes, which enabled us to utilize monitoring and tracing functionalities. We conclude by reporting on early user experience with ScienceMesh.
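
As a hypothetical sketch of what requesting a cross-institution share through an interoperable federation API might look like, the snippet below posts a share request with the standard requests library; the endpoint and payload fields are illustrative placeholders, not the actual ScienceMesh or Open Cloud Mesh specification.

```python
# Hypothetical sketch of creating a cross-institution share via a federation API.
# The endpoint path and payload fields are placeholders, not the real ScienceMesh
# or Open Cloud Mesh specification.
import requests

def create_federated_share(base_url, token, path, recipient):
    payload = {
        "resource": path,          # file or folder to share
        "recipient": recipient,    # e.g. "alice@remote-institution.example"
        "permissions": ["read"],
    }
    response = requests.post(
        f"{base_url}/shares",      # placeholder endpoint
        json=payload,
        headers={"Authorization": f"Bearer {token}"},
        timeout=10,
    )
    response.raise_for_status()
    return response.json()
```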


2020 ◽ Vol 27 (2) ◽ pp. e100122 ◽ Author(s): Neil J Sebire, Caroline Cake, Andrew D Morris

Computable biomedical knowledge (CBK) represents an evolving area of health informatics, with potential for rapid translational patient benefit. Health Data Research UK (HDR UK) is the national Institute for Health Data Science, whose aim is to unite the UK’s health data to enable discoveries that improve people’s lives. Its three main components are the UK HDR Alliance of data custodians, committed to making health data available for research and innovation for public benefit while ensuring safe use of data and building public trust; the HDR Hubs, centres of expertise for curating data and providing expert domain-specific services; and the HDR Innovation Gateway (‘Gateway’), which provides discovery, accessibility, security and interoperability services. To support CBK developments, HDR UK is encouraging the use of open data standards for research, with guidance in areas where standards are emerging; it aims to work closely with the international CBK community to support initiatives and to aid evaluation and collaboration; and it has established a phenomics workstream to create a national platform for disseminating machine-readable, computable phenotyping algorithms, reducing duplication of effort and improving reproducibility in clinical studies.
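
As an illustration of what a machine-readable, computable phenotyping algorithm can look like (not an HDR UK artefact), the sketch below flags a patient record as a case when it contains any code from a published code list; the code list here is a placeholder.

```python
# Illustrative computable phenotype: flag a record as a case when it contains any
# code from a published code list. The code list below is a placeholder, not an
# HDR UK phenotype definition.
DIABETES_CODES = {"E10", "E11", "E13"}   # hypothetical ICD-10-style code list

def has_phenotype(patient_codes, code_list=DIABETES_CODES):
    """Return True if the patient's coded record matches the phenotype definition."""
    return bool(set(patient_codes) & code_list)

print(has_phenotype({"E11", "I10"}))   # True: a matching code is present
print(has_phenotype({"J45"}))          # False: no matching code
```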


2021 ◽ Vol 251 ◽ pp. 02073 ◽ Author(s): Renato Cardoso, Dejan Golubovic, Ignacio Peluaga Lozada, Ricardo Rocha, João Fernandes, ...

With the increasing number of Machine and Deep Learning applications in High Energy Physics, easy access to dedicated infrastructure represents a requirement for fast and efficient R&D. This work explores different types of cloud services to train a Generative Adversarial Network (GAN) in a parallel environment, using TensorFlow's data-parallel strategies. More specifically, we parallelize the training process on multiple GPUs and Google Tensor Processing Units (TPUs) and compare two algorithms: the TensorFlow built-in logic and a custom loop, optimised to give finer control over the elements assigned to each GPU worker or TPU core. The quality of the generated data is compared to Monte Carlo simulation. Linear speed-up of the training process is obtained, while retaining most of the performance in terms of physics results. Additionally, we benchmark the aforementioned approaches at scale over multiple GPU nodes, deploying the training process on different public cloud providers and seeking overall efficiency and cost-effectiveness. The combination of data science, cloud deployment options and associated economics makes it possible to burst workloads out heterogeneously, exploring the full potential of cloud-based services.
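
A hedged sketch of the two data-parallel approaches compared here, using TensorFlow's MirroredStrategy on a toy Keras model rather than the paper's GAN, is shown below; the model, dataset, and batch size are placeholders.

```python
# Sketch of the two approaches: Keras built-in training vs. a custom distributed loop.
# Uses tf.distribute.MirroredStrategy on a toy regression model, not the paper's GAN.
import tensorflow as tf

GLOBAL_BATCH = 64
strategy = tf.distribute.MirroredStrategy()   # one replica per visible GPU

with strategy.scope():                         # variables are mirrored across replicas
    model = tf.keras.Sequential([tf.keras.layers.Dense(1)])
    optimizer = tf.keras.optimizers.Adam()
    loss_fn = tf.keras.losses.MeanSquaredError(
        reduction=tf.keras.losses.Reduction.NONE)
    model.compile(optimizer="adam", loss="mse")

dataset = tf.data.Dataset.from_tensor_slices(
    (tf.random.normal([1024, 8]), tf.random.normal([1024, 1]))).batch(GLOBAL_BATCH)

# (1) Built-in logic: Keras handles replica placement and gradient aggregation.
model.fit(dataset, epochs=1)

# (2) Custom loop: explicit control over what each replica executes.
dist_dataset = strategy.experimental_distribute_dataset(dataset)

@tf.function
def train_step(batch):
    features, targets = batch
    with tf.GradientTape() as tape:
        per_example_loss = loss_fn(targets, model(features, training=True))
        loss = tf.nn.compute_average_loss(per_example_loss,
                                          global_batch_size=GLOBAL_BATCH)
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

for batch in dist_dataset:
    per_replica_loss = strategy.run(train_step, args=(batch,))
    strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_loss, axis=None)
```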

