Exploiting network restricted compute resources with HTCondor: a CMS experiment experience

2020 ◽  
Vol 245 ◽  
pp. 09007
Author(s):  
Carles Acosta-Silva ◽  
Antonio Delgado Peris ◽  
José Flix Molina ◽  
Jaime Frey ◽  
José M. Hernández ◽  
...  

In view of the increasing computing needs for the HL-LHC era, the LHC experiments are exploring new ways to access, integrate and use non-Grid compute resources. Accessing and making efficient use of Cloud and High Performance Computing (HPC) resources present a diversity of challenges for the CMS experiment. In particular, network limitations at the compute nodes in HPC centers prevent CMS pilot jobs from connecting to the central HTCondor pool in order to receive payload jobs to execute. To cope with this limitation, new features have been developed in both HTCondor and the CMS resource acquisition and workload management infrastructure. In this novel approach, a bridge node is set up outside the HPC center and the communications between HTCondor daemons are relayed through a shared file system. This forms the basis of the CMS strategy to enable the exploitation of the Barcelona Supercomputing Center (BSC) resources, the main Spanish HPC site. CMS payloads are claimed by HTCondor condor_startd daemons running at the nearby PIC Tier-1 center and routed to BSC compute nodes through the bridge. This fully connects the CMS HTCondor-based central infrastructure to BSC resources via the PIC HTCondor pool. Other challenges include building custom Singularity images with CMS software releases, bringing conditions data to payload jobs, and custom data handling between BSC and PIC. This report describes the initial technical prototype, its deployment and tests, and future steps. A key aspect of the technique described in this contribution is that it could be universally employed in similar network-restrictive HPC environments elsewhere.
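The core of this approach is that HTCondor daemons on the network-isolated compute nodes exchange messages with a bridge node through files on a shared file system instead of direct TCP connections. The sketch below is a minimal, hypothetical Python illustration of that relay pattern; the directory layout, polling loop and message format are assumptions made for clarity and do not reproduce the actual HTCondor protocol.

```python
# Illustrative sketch (not the actual HTCondor implementation): relaying
# messages between a bridge node with outbound connectivity and a compute
# node that can only see a shared file system.  Paths and file names are
# hypothetical.
import json
import time
from pathlib import Path

SHARED_DIR = Path("/gpfs/shared/htcondor-bridge")   # assumed shared FS mount
INBOX = SHARED_DIR / "to_compute_node"
OUTBOX = SHARED_DIR / "from_compute_node"


def post_to_compute_node(message: dict) -> None:
    """Runs on the bridge node: drop a message where the compute node,
    which has no outside network access, can pick it up."""
    INBOX.mkdir(parents=True, exist_ok=True)
    name = f"{int(time.time() * 1e6)}.json"
    (INBOX / name).write_text(json.dumps(message))


def relay_outbound(poll_seconds: float = 5.0) -> None:
    """Runs on the bridge node: forward replies written by the compute node
    to the outside HTCondor pool (the forwarding itself is stubbed out)."""
    OUTBOX.mkdir(parents=True, exist_ok=True)
    while True:
        for msg_file in sorted(OUTBOX.glob("*.json")):
            message = json.loads(msg_file.read_text())
            print("would forward to central pool:", message.get("type"))
            msg_file.unlink()          # consume the message
        time.sleep(poll_seconds)
```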

Author(s):  
Sang Boem Lim ◽  
Joon Woo ◽  
Guohua Li

Recently, cloud service providers have been gradually changing from virtual machine-based cloud infrastructures to container-based cloud-native infrastructures that consider performance and workload-management issues. Several data network performance issues for virtual instances have arisen, and various networking solutions have been newly developed or utilized. In this paper, we propose a solution suitable for a high-performance computing (HPC) cloud through a performance comparison analysis of container-based networking solutions. We constructed a supercomputer-based test-bed cluster to evaluate the serviceability by executing HPC jobs.
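Comparisons of container networking solutions typically measure point-to-point throughput and latency between containers under each backend. The following hedged Python sketch shows one way such a measurement could be driven with iperf3; the backend names and peer addresses are placeholders, not the configurations evaluated in the paper.

```python
# Hypothetical benchmark driver: measure TCP throughput between two containers
# for several candidate networking backends using iperf3 (assumed installed
# and already running in server mode on the peer container).
import json
import subprocess

# Placeholder (backend name, peer address) pairs -- not from the paper.
BACKENDS = [
    ("host-network", "10.0.0.2"),
    ("overlay", "10.244.1.2"),
    ("macvlan", "192.168.10.2"),
]

def measure_throughput(server_ip: str, seconds: int = 10) -> float:
    """Return throughput in Gbit/s reported by an iperf3 client run."""
    result = subprocess.run(
        ["iperf3", "-c", server_ip, "-t", str(seconds), "--json"],
        capture_output=True, text=True, check=True,
    )
    report = json.loads(result.stdout)
    bits_per_second = report["end"]["sum_received"]["bits_per_second"]
    return bits_per_second / 1e9

if __name__ == "__main__":
    for backend, ip in BACKENDS:
        print(f"{backend}: {measure_throughput(ip):.2f} Gbit/s")
```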


2020 ◽  
Vol 245 ◽  
pp. 02020
Author(s):  
Kevin Pedro

The HL-LHC and the corresponding detector upgrades for the CMS experiment will present extreme challenges for the full simulation. In particular, increased precision in models of physics processes may be required for accurate reproduction of particle shower measurements from the upcoming High Granularity Calorimeter. The CPU performance impacts of several proposed physics models will be discussed. There are several ongoing research and development efforts to make efficient use of new computing architectures and high performance computing systems for simulation. The integration of these new R&D products in the CMS software framework and corresponding CPU performance improvements will be presented.


2019 ◽  
Vol 16 (2) ◽  
pp. 709-714
Author(s):  
D. Sasikumar ◽  
S. Saravanakumar

High performance computing (HPC) is an area of computing that focuses on combining the power of many computing devices. HPC developed in response to the growing demand for processing speed. Until recently, HPC applications have required large numbers of computers interconnected in a network as a cluster. Clusters are difficult to set up and maintain, both technically and financially. It is far simpler to deploy HPC packages in the cloud without worrying about the associated costs, while also delivering guarantees on quality of service (QoS), and it allows everyday queries to become faster and more intelligent. The goal of using cloud-based HPC is to achieve faster processing of huge amounts of data and higher throughput for all varieties of facts and information. Using the cloud for HPC reduces the cost of infrastructure, software and more, and it allows the data to be accessed independently and adequately. The issues that arise in the cloud concern legal matters, confidentiality, authenticity, authorization and security.


2019 ◽  
Vol 214 ◽  
pp. 03024
Author(s):  
Vladimir Brik ◽  
David Schultz ◽  
Gonzalo Merino

Here we report IceCube’s first experiences of running GPU simulations on the Titan supercomputer. This undertaking was non-trivial because Titan is designed for High Performance Computing (HPC) workloads, whereas IceCube’s workloads fall under the High Throughput Computing (HTC) category. In particular: (i) Titan’s design, policies, and tools are geared heavily toward large MPI applications, while IceCube’s workloads consist of large numbers of relatively small independent jobs, (ii) Titan compute nodes run Cray Linux, which is not directly compatible with IceCube software, and (iii) Titan compute nodes cannot access outside networks, making it impossible to access IceCube’s CVMFS repositories and workload management systems. This report examines our experience of packaging our application in Singularity containers and using HTCondor as the second-level scheduler on the Titan supercomputer.
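Since the Cray Linux compute nodes are not directly compatible with IceCube software and cannot reach CVMFS, payloads are run inside Singularity containers with the needed software pre-staged. The snippet below is a hypothetical Python wrapper illustrating this idea; the image path, bind mounts and payload command are placeholders, not IceCube's actual setup.

```python
# Hypothetical wrapper: run a payload command inside a Singularity container,
# bind-mounting a pre-staged software area because the compute nodes cannot
# reach CVMFS or other outside services.  Paths are placeholders.
import subprocess

def run_in_container(image: str, workdir: str, payload: list) -> int:
    cmd = [
        "singularity", "exec",
        "--bind", f"{workdir}:/data",          # job scratch area
        "--bind", "/lustre/software:/cvmfs",   # pre-staged software snapshot
        image,
        *payload,
    ]
    return subprocess.run(cmd).returncode

if __name__ == "__main__":
    rc = run_in_container(
        image="/lustre/images/icecube-sim.sif",
        workdir="/lustre/scratch/job_001",
        payload=["python", "run_simulation.py", "--gpu"],
    )
    raise SystemExit(rc)
```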


2020 ◽  
Author(s):  
Ambarish Kumar ◽  
Ali Haider Bangash

Genomics has emerged as one of the major sources of big data. The data-driven challenges it brings to bioinformatics can be met using technologies of parallel and distributed computing. The GATK4 tools for genomic variant detection are enabled for high-performance computing platforms through the Spark MapReduce framework. GATK4+WDL+CROMWELL+SPARK+DOCKER is proposed as the way forward in achieving automation, reproducibility, reusability, customization, portability and scalability. Spark-based tools perform as well in genomic variant detection as the standard command-line implementation of the GATK4 tools. Implementing the workflows over cloud-based high-performance computing platforms will enhance usability and will be a way forward in community research and infrastructure development for genomic variant discovery.
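As an illustration of the comparison between Spark-enabled and standard GATK4 tools, the hedged sketch below drives HaplotypeCallerSpark and HaplotypeCaller on the same inputs from Python; the file paths are placeholders, and the exact command-line flags should be verified against the GATK4 documentation.

```python
# Hypothetical driver: run a Spark-enabled GATK4 variant caller and its
# standard counterpart on the same inputs for comparison.  File paths are
# placeholders; check the exact flags against the GATK4 documentation.
import subprocess

REFERENCE = "ref/GRCh38.fasta"
BAM = "aln/sample.sorted.bam"

def call_variants_spark(output_vcf: str) -> None:
    subprocess.run(
        [
            "gatk", "HaplotypeCallerSpark",
            "-R", REFERENCE,
            "-I", BAM,
            "-O", output_vcf,
            "--",                          # Spark-specific arguments follow
            "--spark-master", "local[*]",  # run Spark locally on all cores
        ],
        check=True,
    )

def call_variants_standard(output_vcf: str) -> None:
    subprocess.run(
        ["gatk", "HaplotypeCaller", "-R", REFERENCE, "-I", BAM, "-O", output_vcf],
        check=True,
    )

if __name__ == "__main__":
    call_variants_spark("out/sample.spark.vcf.gz")
    call_variants_standard("out/sample.standard.vcf.gz")
```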


2021 ◽  
pp. 425-432
Author(s):  
Debabrata Samanta ◽  
Soumi Dutta ◽  
Mohammad Gouse Galety ◽  
Sabyasachi Pramanik

2020 ◽  
Vol 245 ◽  
pp. 01024
Author(s):  
Chiara Rovelli

The CMS experiment at the LHC features an electromagnetic calorimeter (ECAL) made of lead tungstate scintillating crystals. The ECAL energy response is fundamental for both triggering purposes and offline analysis. Due to the challenging LHC radiation environment, the response of both crystals and photodetectors to particles evolves with time. Therefore, continuous monitoring and correction of the ageing effects are crucial. Fast, reliable and efficient workflows are set up to have a first set of corrections computed within 48 hours of data-taking, making use of dedicated data streams and processing. Such corrections, stored in relational databases, are then accessed during the prompt offline reconstruction of the CMS data. Twice a week, the calibrations used in the trigger are also updated in the database and accessed during data-taking. In this note, the design of the CMS ECAL data handling and processing is reviewed.
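The corrections are stored with a validity range and looked up by run number during prompt reconstruction. A minimal, hypothetical sketch of this interval-of-validity pattern, using SQLite rather than the actual CMS conditions database, is shown below; the schema and values are purely illustrative.

```python
# Minimal, hypothetical sketch of the interval-of-validity pattern used for
# conditions data: corrections are stored with the run range they apply to
# and looked up by run number during reconstruction.  The schema is
# illustrative, not the actual CMS conditions database.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE ecal_corrections (
           first_run  INTEGER,
           last_run   INTEGER,
           channel_id INTEGER,
           scale      REAL
       )"""
)

def store_correction(first_run, last_run, channel_id, scale):
    conn.execute(
        "INSERT INTO ecal_corrections VALUES (?, ?, ?, ?)",
        (first_run, last_run, channel_id, scale),
    )

def lookup_correction(run, channel_id):
    row = conn.execute(
        """SELECT scale FROM ecal_corrections
           WHERE ? BETWEEN first_run AND last_run AND channel_id = ?
           ORDER BY first_run DESC LIMIT 1""",
        (run, channel_id),
    ).fetchone()
    return row[0] if row else 1.0   # fall back to unit scale

store_correction(360000, 360500, 42, 0.987)
print(lookup_correction(360123, 42))
```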


2020 ◽  
Vol 245 ◽  
pp. 07060
Author(s):  
Ran Du ◽  
Jingyan Shi ◽  
Xiaowei Jiang ◽  
Jiaheng Zou

HTCondor was adopted to manage the High Throughput Computing (HTC) cluster at IHEP in 2016. In 2017 a Slurm cluster was set up to run High Performance Computing (HPC) jobs. To provide accounting services for these two clusters, we implemented a unified accounting system named Cosmos. Multiple workloads bring different accounting requirements. Briefly speaking, there are four types of jobs to account. First of all, 30 million single-core jobs run in the HTCondor cluster every year. Secondly, Virtual Machine (VM) jobs run in the legacy HTCondor VM cluster. Thirdly, parallel jobs run in the Slurm cluster, and some of these jobs run on GPU worker nodes to accelerate computing. Lastly, some selected HTC jobs are migrated from the HTCondor cluster to the Slurm cluster for research purposes. To satisfy all the mentioned requirements, Cosmos is implemented with four layers: acquisition, integration, statistics and presentation. Details about the issues and solutions of each layer will be presented in the paper. Cosmos has run in production for two years, and its status shows that it is a well-functioning system that meets the requirements of both the HTCondor and Slurm clusters.
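Cosmos's integration layer maps heterogeneous job records from HTCondor and Slurm onto a common schema before statistics are computed. The sketch below is a hypothetical Python illustration of that mapping; the field names and unified schema are assumptions, not the actual Cosmos implementation.

```python
# Hypothetical sketch of the integration layer of a unified accounting
# system: job records harvested from HTCondor and Slurm (acquisition layer,
# stubbed here) are mapped onto one common schema before statistics are
# computed.  Field names are assumptions, not the actual Cosmos schema.
from dataclasses import dataclass

@dataclass
class UnifiedJobRecord:
    cluster: str        # "htcondor" or "slurm"
    user: str
    cpus: int
    wall_seconds: float
    is_gpu: bool = False

def from_htcondor(ad: dict) -> UnifiedJobRecord:
    """Map a (simplified) HTCondor job ClassAd to the unified schema."""
    return UnifiedJobRecord(
        cluster="htcondor",
        user=ad["Owner"],
        cpus=ad.get("RequestCpus", 1),
        wall_seconds=ad["CompletionDate"] - ad["JobStartDate"],
    )

def from_slurm(rec: dict) -> UnifiedJobRecord:
    """Map a (simplified) Slurm sacct record to the unified schema."""
    return UnifiedJobRecord(
        cluster="slurm",
        user=rec["User"],
        cpus=int(rec["AllocCPUS"]),
        wall_seconds=float(rec["ElapsedRaw"]),
        is_gpu="gpu" in rec.get("Partition", ""),
    )

def cpu_hours(records) -> float:
    """Statistics layer: total CPU hours across both clusters."""
    return sum(r.cpus * r.wall_seconds for r in records) / 3600.0
```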

