Scientific Computing, High-Performance Computing and Data Science in Higher Education

The higher energy and luminosity from the LHC in Run 2 have put increased pressure on CMS computing resources. Extrapolating to even higher luminosities (and thus higher event complexities and trigger rates) beyond Run 3, it becomes clear that simply scaling up the current model of CMS computing alone will become economically infeasible. High Performance Computing (HPC) facilities, widely used in scientific computing outside of HEP, have the potential to help fill the gap. Here we describe the U.S. CMS efforts to integrate US HPC resources into CMS Computing via the HEPCloud project at Fermilab. We present advancements in our ability to use NERSC resources at scale and efforts to integrate other HPC sites as well. We present experience in the elastic use of HPC resources, quickly scaling up use when so required by CMS workflows. We also present performance studies of the CMS multi-threaded framework on both Haswell and KNL HPC resources.


Understanding the landscape of scientific software used on high-performance computing platforms

The International Journal of High Performance Computing Applications, doi: 10.1177/1094342019899451, 2020, Vol 34 (4), pp. 465-477

Author(s): A Grannan, K Sood, B Norris, A Dubey

Keyword(s): High Performance Computing, High Performance, Scientific Computing, Scientific Discovery, Simulation Software, Scientific Software, Survey Paper, Software Productivity, Computing Platforms, Performance Computing

Scientific discovery increasingly relies on computation through simulations, analytics, and machine and deep learning. Of these, simulations on high-performance computing (HPC) platforms have been the cornerstone of scientific computing for more than two decades. However, the development of simulation software has, in general, occurred through accretion, with a few exceptions. With an increase in scientific understanding, models have become more complex, rendering an accretion mode untenable to the point where software productivity and sustainability have become active concerns in scientific computing. In this survey paper, we examine a modest set of HPC scientific simulation applications that are already using cutting-edge HPC platforms. Several have been in existence for a decade or more. Our objective in this survey is twofold: first, to understand the landscape of scientific computing on HPC platforms in order to distill the currently scattered knowledge about software practices that have helped both developer and software productivity, and second, to understand the kind of tools and methodologies that need attention for continued productivity.


High-performance computing for SARS-CoV-2 RNAs clustering: a data science‒based genomics approach

Genomics & Informatics, doi: 10.5808/gi.21056, 2021, Vol 19 (4), pp. e49

Author(s): Anas Oujja, Mohamed Riduan Abid, Jaouad Boumhidi, Safae Bourhnane, Asmaa Mourhir, ...

Keyword(s): High Performance Computing, High Performance, Data Science, Longest Common Subsequence, RNA Sequences, Hadoop MapReduce, Common Subsequence, ICT Tools, Clustering Approach, Performance Computing

Genomic data constitutes one of the fastest growing datasets in the world. By 2025, it is expected to become the fourth largest source of Big Data, thus mandating adequate high-performance computing (HPC) platforms for processing. With the latest unprecedented and unpredictable mutations in severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the research community is in critical need of ICT tools to process SARS-CoV-2 RNA data, e.g., by clustering it, and thus to assist in tracking virus mutations and predicting future ones. In this paper, we present an HPC-based SARS-CoV-2 RNA clustering tool. We adopt a data science approach, from data collection, through analysis, to visualization. In the analysis step, we present how our clustering approach leverages HPC and the longest common subsequence (LCS) algorithm. The approach uses the Hadoop MapReduce programming paradigm and adapts the LCS algorithm in order to efficiently compute the length of the LCS for each pair of SARS-CoV-2 RNA sequences. The sequences are extracted from the U.S. National Center for Biotechnology Information (NCBI) Virus repository. The computed LCS lengths are used to measure the dissimilarities between RNA sequences in order to identify existing clusters. In addition, we present a comparative study of the LCS algorithm's performance based on variable workloads and different numbers of Hadoop worker nodes.
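
The pairwise LCS step described in this abstract can be illustrated with a minimal single-machine sketch in Python. It is not the authors' Hadoop MapReduce implementation: the function names (`lcs_length`, `dissimilarity`, `pairwise_dissimilarities`) are hypothetical, and the normalization used to turn an LCS length into a dissimilarity (one minus the LCS length divided by the longer sequence length) is an assumption, since the abstract does not state the formula.

```python
from itertools import combinations


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence, using a two-row DP table."""
    if len(a) < len(b):
        a, b = b, a  # iterate the shorter sequence in the inner loop
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def dissimilarity(a: str, b: str) -> float:
    """Assumed measure (not from the paper): 1 - LCS / longer length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 0.0
    return 1.0 - lcs_length(a, b) / longest


def pairwise_dissimilarities(sequences: dict) -> dict:
    """Dissimilarity for every unordered pair of sequence identifiers."""
    return {
        (i, j): dissimilarity(sequences[i], sequences[j])
        for i, j in combinations(sorted(sequences), 2)
    }


if __name__ == "__main__":
    # Toy sequences for illustration only, not real SARS-CoV-2 data.
    toy = {"s1": "ACGU", "s2": "ACGU", "s3": "UGCA"}
    for pair, value in pairwise_dissimilarities(toy).items():
        print(pair, value)
```

Each pair is independent of every other pair, which is presumably what makes the computation a natural fit for distribution across worker nodes; the resulting pairwise values could then be passed to any clustering routine that accepts a precomputed dissimilarity matrix.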
