Exploiting network restricted compute resources with HTCondor: a CMS experiment experience

2020 ◽  
Vol 245 ◽  
pp. 09007
Author(s):  
Carles Acosta-Silva ◽  
Antonio Delgado Peris ◽  
José Flix Molina ◽  
Jaime Frey ◽  
José M. Hernández ◽  
...  

In view of the increasing computing needs for the HL-LHC era, the LHC experiments are exploring new ways to access, integrate and use non-Grid compute resources. Accessing and making efficient use of Cloud and High Performance Computing (HPC) resources present a diversity of challenges for the CMS experiment. In particular, network limitations at the compute nodes in HPC centers prevent CMS pilot jobs from connecting to the central HTCondor pool in order to receive payload jobs to execute. To cope with this limitation, new features have been developed in both HTCondor and the CMS resource acquisition and workload management infrastructure. In this novel approach, a bridge node is set up outside the HPC center and the communications between HTCondor daemons are relayed through a shared file system. This forms the basis of the CMS strategy to enable the exploitation of the Barcelona Supercomputing Center (BSC) resources, the main Spanish HPC site. CMS payloads are claimed by HTCondor condor_startd daemons running at the nearby PIC Tier-1 center and routed to BSC compute nodes through the bridge. This fully connects the CMS HTCondor-based central infrastructure to BSC resources via the PIC HTCondor pool. Other challenges include building custom Singularity images with CMS software releases, bringing conditions data to payload jobs, and custom data handling between BSC and PIC. This report describes the initial technical prototype, its deployment and tests, and future steps. A key aspect of the technique described in this contribution is that it could be universally employed in similar network-restrictive HPC environments elsewhere.
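The core of this approach is that HTCondor daemons on the network-isolated compute nodes exchange messages with a bridge node through files on a shared file system instead of direct TCP connections. The sketch below is a minimal, hypothetical Python illustration of that relay pattern; the directory layout, polling loop and message format are assumptions made for clarity and do not reproduce the actual HTCondor protocol.

```python
# Illustrative sketch (not the actual HTCondor implementation): relaying
# messages between a bridge node with outbound connectivity and a compute
# node that can only see a shared file system.  Paths and file names are
# hypothetical.
import json
import time
from pathlib import Path

SHARED_DIR = Path("/gpfs/shared/htcondor-bridge")   # assumed shared FS mount
INBOX = SHARED_DIR / "to_compute_node"
OUTBOX = SHARED_DIR / "from_compute_node"


def post_to_compute_node(message: dict) -> None:
    """Runs on the bridge node: drop a message where the compute node,
    which has no outside network access, can pick it up."""
    INBOX.mkdir(parents=True, exist_ok=True)
    name = f"{int(time.time() * 1e6)}.json"
    (INBOX / name).write_text(json.dumps(message))


def relay_outbound(poll_seconds: float = 5.0) -> None:
    """Runs on the bridge node: forward replies written by the compute node
    to the outside HTCondor pool (the forwarding itself is stubbed out)."""
    OUTBOX.mkdir(parents=True, exist_ok=True)
    while True:
        for msg_file in sorted(OUTBOX.glob("*.json")):
            message = json.loads(msg_file.read_text())
            print("would forward to central pool:", message.get("type"))
            msg_file.unlink()          # consume the message
        time.sleep(poll_seconds)
```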

Author(s):  
Sang Boem Lim ◽  
Joon Woo ◽  
Guohua Li

Recently, cloud service providers have been gradually changing from virtual machine-based cloud infrastructures to container-based cloud-native infrastructures that consider performance and workload-management issues. Several data network performance issues for virtual instances have arisen, and various networking solutions have been newly developed or utilized. In this paper, we propose a solution suitable for a high-performance computing (HPC) cloud through a performance comparison analysis of container-based networking solutions. We constructed a supercomputer-based test-bed cluster to evaluate the serviceability by executing HPC jobs.
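Comparisons of container networking solutions typically measure point-to-point throughput and latency between containers under each backend. The following hedged Python sketch shows one way such a measurement could be driven with iperf3; the backend names and peer addresses are placeholders, not the configurations evaluated in the paper.

```python
# Hypothetical benchmark driver: measure TCP throughput between two containers
# for several candidate networking backends using iperf3 (assumed installed
# and already running in server mode on the peer container).
import json
import subprocess

# Placeholder (backend name, peer address) pairs -- not from the paper.
BACKENDS = [
    ("host-network", "10.0.0.2"),
    ("overlay", "10.244.1.2"),
    ("macvlan", "192.168.10.2"),
]

def measure_throughput(server_ip: str, seconds: int = 10) -> float:
    """Return throughput in Gbit/s reported by an iperf3 client run."""
    result = subprocess.run(
        ["iperf3", "-c", server_ip, "-t", str(seconds), "--json"],
        capture_output=True, text=True, check=True,
    )
    report = json.loads(result.stdout)
    bits_per_second = report["end"]["sum_received"]["bits_per_second"]
    return bits_per_second / 1e9

if __name__ == "__main__":
    for backend, ip in BACKENDS:
        print(f"{backend}: {measure_throughput(ip):.2f} Gbit/s")
```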


2020 ◽  
Vol 245 ◽  
pp. 02020
Author(s):  
Kevin Pedro

The HL-LHC and the corresponding detector upgrades for the CMS experiment will present extreme challenges for the full simulation. In particular, increased precision in models of physics processes may be required for accurate reproduction of particle shower measurements from the upcoming High Granularity Calorimeter. The CPU performance impacts of several proposed physics models will be discussed. There are several ongoing research and development efforts to make efficient use of new computing architectures and high performance computing systems for simulation. The integration of these new R&D products in the CMS software framework and corresponding CPU performance improvements will be presented.


2019 ◽  
Vol 16 (2) ◽  
pp. 709-714
Author(s):  
D. Sasikumar ◽  
S. Saravanakumar

High performance computing (HPC) is an area of computing that focuses on combining the power of many computing devices. HPC developed in response to the growing demand for processing speed. Until recently, HPC applications have required large numbers of computers interconnected in a network as a cluster. Clusters are difficult to set up and maintain, both technically and financially. It is far simpler to deploy HPC packages in the cloud without worrying about the associated costs, while also delivering guarantees on quality of service (QoS), and it allows everyday queries to become faster and more intelligent. The goal of using cloud-based HPC is to achieve faster processing of huge amounts of data and higher throughput for all varieties of facts and information. Using the cloud for HPC reduces the cost of infrastructure, software and more, and it allows the data to be accessed independently and adequately. The issues that arise in the cloud concern legal matters, confidentiality, authenticity, authorization and security.


2019 ◽  
Vol 214 ◽  
pp. 03024
Author(s):  
Vladimir Brik ◽  
David Schultz ◽  
Gonzalo Merino

Here we report IceCube’s first experiences of running GPU simulations on the Titan supercomputer. This undertaking was non-trivial because Titan is designed for High Performance Computing (HPC) workloads, whereas IceCube’s workloads fall under the High Throughput Computing (HTC) category. In particular: (i) Titan’s design, policies, and tools are geared heavily toward large MPI applications, while IceCube’s workloads consist of large numbers of relatively small independent jobs, (ii) Titan compute nodes run Cray Linux, which is not directly compatible with IceCube software, and (iii) Titan compute nodes cannot access outside networks, making it impossible to access IceCube’s CVMFS repositories and workload management systems. This report examines our experience of packaging our application in Singularity containers and using HTCondor as the second-level scheduler on the Titan supercomputer.
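Since the Cray Linux compute nodes are not directly compatible with IceCube software and cannot reach CVMFS, payloads are run inside Singularity containers with the needed software pre-staged. The snippet below is a hypothetical Python wrapper illustrating this idea; the image path, bind mounts and payload command are placeholders, not IceCube's actual setup.

```python
# Hypothetical wrapper: run a payload command inside a Singularity container,
# bind-mounting a pre-staged software area because the compute nodes cannot
# reach CVMFS or other outside services.  Paths are placeholders.
import subprocess

def run_in_container(image: str, workdir: str, payload: list) -> int:
    cmd = [
        "singularity", "exec",
        "--bind", f"{workdir}:/data",          # job scratch area
        "--bind", "/lustre/software:/cvmfs",   # pre-staged software snapshot
        image,
        *payload,
    ]
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    rc = run_in_container(
        image="/lustre/images/icecube-sim.sif",
        workdir="/lustre/scratch/job_001",
        payload=["python", "run_simulation.py", "--gpu"],
    )
    raise SystemExit(rc)
```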


2020 ◽  
Author(s):  
Ambarish Kumar ◽  
Ali Haider Bangash

Genomics has emerged as one of the major sources of big data. The data-driven challenges it brings to bioinformatics can be met using technologies of parallel and distributed computing. The GATK4 tools for genomic variant detection are enabled for high-performance computing platforms through the Spark MapReduce framework. GATK4+WDL+CROMWELL+SPARK+DOCKER is proposed as the way forward in achieving automation, reproducibility, reusability, customization, portability and scalability. Spark-based tools perform as well in genomic variant detection as the standard command-line implementation of the GATK4 tools. Implementing the workflows over cloud-based high-performance computing platforms will enhance usability and will be a way forward in community research and infrastructure development for genomic variant discovery.
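As an illustration of the comparison between Spark-enabled and standard GATK4 tools, the hedged sketch below drives HaplotypeCallerSpark and HaplotypeCaller on the same inputs from Python; the file paths are placeholders, and the exact command-line flags should be verified against the GATK4 documentation.

```python
# Hypothetical driver: run a Spark-enabled GATK4 variant caller and its
# standard counterpart on the same inputs for comparison.  File paths are
# placeholders; check the exact flags against the GATK4 documentation.
import subprocess

REFERENCE = "ref/GRCh38.fasta"
BAM = "aln/sample.sorted.bam"

def call_variants_spark(output_vcf: str) -> None:
    subprocess.run(
        [
            "gatk", "HaplotypeCallerSpark",
            "-R", REFERENCE,
            "-I", BAM,
            "-O", output_vcf,
            "--",                          # Spark-specific arguments follow
            "--spark-master", "local[*]",  # run Spark locally on all cores
        ],
        check=True,
    )

def call_variants_standard(output_vcf: str) -> None:
    subprocess.run(
        ["gatk", "HaplotypeCaller", "-R", REFERENCE, "-I", BAM, "-O", output_vcf],
        check=True,
    )

if __name__ == "__main__":
    call_variants_spark("out/sample.spark.vcf.gz")
    call_variants_standard("out/sample.standard.vcf.gz")
```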


2021 ◽  
pp. 425-432
Author(s):  
Debabrata Samanta ◽  
Soumi Dutta ◽  
Mohammad Gouse Galety ◽  
Sabyasachi Pramanik

2020 ◽  
Vol 245 ◽  
pp. 01024
Author(s):  
Chiara Rovelli

The CMS experiment at the LHC features an electromagnetic calorimeter (ECAL) made of lead tungstate scintillating crystals. The ECAL energy response is fundamental for both triggering purposes and offline analysis. Due to the challenging LHC radiation environment, the response of both crystals and photodetectors to particles evolves with time. Therefore, continuous monitoring and correction of the ageing effects are crucial. Fast, reliable and efficient workflows are set up to have a first set of corrections computed within 48 hours of data-taking, making use of dedicated data streams and processing. Such corrections, stored in relational databases, are then accessed during the prompt offline reconstruction of the CMS data. Twice a week, the calibrations used in the trigger are also updated in the database and accessed during data-taking. In this note, the design of the CMS ECAL data handling and processing is reviewed.
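The corrections are stored with a validity range and looked up by run number during prompt reconstruction. A minimal, hypothetical sketch of this interval-of-validity pattern, using SQLite rather than the actual CMS conditions database, is shown below; the schema and values are purely illustrative.

```python
# Minimal, hypothetical sketch of the interval-of-validity pattern used for
# conditions data: corrections are stored with the run range they apply to
# and looked up by run number during reconstruction.  The schema is
# illustrative, not the actual CMS conditions database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE ecal_corrections (
           first_run  INTEGER,
           last_run   INTEGER,
           channel_id INTEGER,
           scale      REAL
       )"""
)

def store_correction(first_run, last_run, channel_id, scale):
    conn.execute(
        "INSERT INTO ecal_corrections VALUES (?, ?, ?, ?)",
        (first_run, last_run, channel_id, scale),
    )

def lookup_correction(run, channel_id):
    row = conn.execute(
        """SELECT scale FROM ecal_corrections
           WHERE ? BETWEEN first_run AND last_run AND channel_id = ?
           ORDER BY first_run DESC LIMIT 1""",
        (run, channel_id),
    ).fetchone()
    return row[0] if row else 1.0   # fall back to unit scale

store_correction(360000, 360500, 42, 0.987)
print(lookup_correction(360123, 42))
```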


2020 ◽  
Vol 245 ◽  
pp. 07060
Author(s):  
Ran Du ◽  
Jingyan Shi ◽  
Xiaowei Jiang ◽  
Jiaheng Zou

HTCondor was adopted to manage the High Throughput Computing (HTC) cluster at IHEP in 2016. In 2017 a Slurm cluster was set up to run High Performance Computing (HPC) jobs. To provide accounting services for these two clusters, we implemented a unified accounting system named Cosmos. Multiple workloads bring different accounting requirements. Briefly speaking, there are four types of jobs to account. First of all, 30 million single-core jobs run in the HTCondor cluster every year. Secondly, Virtual Machine (VM) jobs run in the legacy HTCondor VM cluster. Thirdly, parallel jobs run in the Slurm cluster, and some of these jobs run on GPU worker nodes to accelerate computing. Lastly, some selected HTC jobs are migrated from the HTCondor cluster to the Slurm cluster for research purposes. To satisfy all the mentioned requirements, Cosmos is implemented with four layers: acquisition, integration, statistics and presentation. Details about the issues and solutions of each layer will be presented in the paper. Cosmos has run in production for two years, and its status shows that it is a well-functioning system that meets the requirements of both the HTCondor and Slurm clusters.
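Cosmos's integration layer maps heterogeneous job records from HTCondor and Slurm onto a common schema before statistics are computed. The sketch below is a hypothetical Python illustration of that mapping; the field names and unified schema are assumptions, not the actual Cosmos implementation.

```python
# Hypothetical sketch of the integration layer of a unified accounting
# system: job records harvested from HTCondor and Slurm (acquisition layer,
# stubbed here) are mapped onto one common schema before statistics are
# computed.  Field names are assumptions, not the actual Cosmos schema.
from dataclasses import dataclass

@dataclass
class UnifiedJobRecord:
    cluster: str        # "htcondor" or "slurm"
    user: str
    cpus: int
    wall_seconds: float
    is_gpu: bool = False

def from_htcondor(ad: dict) -> UnifiedJobRecord:
    """Map a (simplified) HTCondor job ClassAd to the unified schema."""
    return UnifiedJobRecord(
        cluster="htcondor",
        user=ad["Owner"],
        cpus=ad.get("RequestCpus", 1),
        wall_seconds=ad["CompletionDate"] - ad["JobStartDate"],
    )

def from_slurm(rec: dict) -> UnifiedJobRecord:
    """Map a (simplified) Slurm sacct record to the unified schema."""
    return UnifiedJobRecord(
        cluster="slurm",
        user=rec["User"],
        cpus=int(rec["AllocCPUS"]),
        wall_seconds=float(rec["ElapsedRaw"]),
        is_gpu="gpu" in rec.get("Partition", ""),
    )

def cpu_hours(records) -> float:
    """Statistics layer: total CPU hours across both clusters."""
    return sum(r.cpus * r.wall_seconds for r in records) / 3600.0
```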

