Compass SPMD: a SPMD vectorized tracking algorithm

Compass is an SPMD (Single Program Multiple Data) tracking algorithm for the upcoming LHCb upgrade in 2021. A data rate of 40 Tb/s must be processed in real time to select events. Alternative frameworks, algorithms and architectures are being tested to cope with this deluge of data. Allen is a research and development project aiming to run the full HLT1 (High Level Trigger) on GPUs (Graphics Processing Units). Allen’s architecture focuses on data-oriented layout and algorithms to better exploit parallel architectures. GPUs have already been shown to exploit the framework efficiently, with the algorithms developed for Allen implemented and optimized for GPU architectures. We explore opportunities for the SIMD (Single Instruction Multiple Data) paradigm in CPUs through the Compass algorithm. We use the Intel SPMD Program Compiler (ISPC) to achieve good readability, maintainability and performance while writing “GPU-like” source code, preserving the main design of the algorithm.
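
As a rough illustration of the SPMD style and data-oriented layout mentioned in the abstract, the sketch below stores hits as a structure of arrays and applies one scalar-looking program to every element of a fixed-width gang. It is not taken from Compass or Allen: the gang width `LANES`, the helpers `to_soa`, `slope_program` and `run_spmd`, and the slope calculation are all hypothetical, and plain Python executes the lanes one after another where a compiler such as ISPC would map them onto SIMD lanes.

```python
LANES = 8  # hypothetical gang width; real hardware widths differ


def to_soa(hits):
    """Convert a list of (x, y, z) hit tuples into a structure of arrays."""
    xs, ys, zs = [], [], []
    for x, y, z in hits:
        xs.append(x)
        ys.append(y)
        zs.append(z)
    return {"x": xs, "y": ys, "z": zs}


def slope_program(soa, i, j):
    """Per-lane program: slope dx/dz of the segment from hit i to hit j."""
    return (soa["x"][j] - soa["x"][i]) / (soa["z"][j] - soa["z"][i])


def run_spmd(soa, pairs):
    """Apply the same program to every pair, one gang of LANES pairs at a time."""
    results = []
    for start in range(0, len(pairs), LANES):
        gang = pairs[start:start + LANES]
        # Every lane in the gang runs the same program on its own data.
        results.extend(slope_program(soa, i, j) for i, j in gang)
    return results


hits = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0), (3.0, 0.0, 4.0)]
slopes = run_spmd(to_soa(hits), [(0, 1), (1, 2)])
```

Keeping each coordinate in its own contiguous array is what lets neighbouring lanes load neighbouring values, which is the usual motivation for this layout on both GPUs and SIMD CPUs.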


Bringing heterogeneity to the CMS software framework

EPJ Web of Conferences ◽

10.1051/epjconf/202024505009 ◽

2020 ◽

Vol 245 ◽

pp. 05009

Author(s):

Andrea Bocci ◽

David Dagenhart ◽

Vincenzo Innocente ◽

Christopher Jones ◽

Matti Kortelainen ◽

...

Keyword(s):

Data Processing ◽

Graphics Processing Units ◽

Building Blocks ◽

Current Data ◽

Level Trigger ◽

Field Programmable ◽

Programmable Gate Arrays ◽

High Level ◽

Graphics Processing ◽

Leadership Class

The advent of computing resources with co-processors, for example Graphics Processing Units (GPU) or Field-Programmable Gate Arrays (FPGA), for use cases like the CMS High-Level Trigger (HLT) or data processing at leadership-class supercomputers imposes challenges for the current data processing frameworks. These challenges include developing a model for algorithms to offload their computations to the co-processors while keeping the traditional CPU busy doing other work. The CMS data processing framework, CMSSW, implements multithreading using the Intel Threading Building Blocks (TBB) library, which uses tasks as concurrent units of work. In this paper we discuss a generic mechanism, implemented in CMSSW, for interacting effectively with non-CPU resources. In addition, configuring such a heterogeneous system is challenging. In CMSSW an application is configured with a configuration file written in the Python language, and the algorithm types are part of the configuration. The challenge therefore is to unify the CPU and co-processor settings while allowing their implementations to be separate. We explain how we solved these challenges while minimizing the necessary changes to the CMSSW framework. We also discuss, using a concrete example, how algorithms would offload work to NVIDIA GPUs using the CUDA API directly.
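
The offloading pattern described in this abstract, launching work on a co-processor while the CPU continues with other work, can be sketched in a few lines. This is only an illustration under stated assumptions, not the CMSSW mechanism: the abstract does not give its interface, so `fake_device_kernel`, `cpu_side_work` and `process_event` are hypothetical names, and a one-thread pool with a short sleep stands in for the device.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def fake_device_kernel(values):
    """Stand-in for a computation offloaded to a co-processor (hypothetical)."""
    time.sleep(0.05)  # pretend the device is busy for a while
    return [v * v for v in values]


def cpu_side_work(n):
    """Unrelated work the CPU can do while the device is running."""
    return sum(range(n))


def process_event(values, device_pool):
    # Phase 1: launch the offloaded computation and return immediately.
    pending = device_pool.submit(fake_device_kernel, values)
    # The CPU thread is free to run other work in the meantime.
    cpu_result = cpu_side_work(1000)
    # Phase 2: collect the device result once it is ready.
    device_result = pending.result()
    return cpu_result, device_result


with ThreadPoolExecutor(max_workers=1) as device_pool:
    cpu_result, device_result = process_event([1, 2, 3], device_pool)
```

Splitting the algorithm into a launch phase and a collection phase is one common way to avoid blocking a CPU thread on the device; a task-based framework could schedule other tasks between the two phases instead of calling `cpu_side_work` inline.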


Accelerated FDPS: Algorithms to use accelerators with FDPS

Publications of the Astronomical Society of Japan ◽

10.1093/pasj/psz133 ◽

2020 ◽

Vol 72 (1) ◽

Cited By ~ 2

Author(s):

Masaki Iwasawa ◽

Daisuke Namekata ◽

Keigo Nitadori ◽

Kentaro Nomura ◽

Long Wang ◽

...

Keyword(s):

Graphics Processing Units ◽

High Performance ◽

General Purpose ◽

Performance Model ◽

Performance Tuning ◽

Data Types ◽

Interaction Function ◽

Current Implementation ◽

And Performance ◽

Graphics Processing

We describe algorithms implemented in FDPS (Framework for Developing Particle Simulators) to make efficient use of accelerator hardware such as GPGPUs (general-purpose computing on graphics processing units). We have developed FDPS to make it possible for researchers to develop their own high-performance parallel particle-based simulation programs without spending large amounts of time on parallelization and performance tuning. FDPS provides a high-performance implementation of parallel algorithms for particle-based simulations in a “generic” form, so that researchers can define their own particle data structure and interparticle interaction functions. FDPS compiled with user-supplied data types and interaction functions provides all the necessary functions for parallelization, and researchers can thus write their programs as though they are writing simple non-parallel code. It has previously been possible to use accelerators with FDPS by writing an interaction function that uses the accelerator. However, the efficiency was limited by the latency and bandwidth of communication between the CPU and the accelerator, and also by the mismatch between the degree of parallelism available in the interaction function and that of the hardware. We have modified the interface of the user-provided interaction functions so that accelerators are used more efficiently. We have also implemented new techniques that reduce the amount of work on the CPU side and the amount of communication between the CPU and accelerators. We have measured the performance of N-body simulations on a system with an NVIDIA Volta GPGPU using FDPS, and the achieved performance is around 27% of the theoretical peak. We have constructed a detailed performance model and found that the current implementation can achieve good performance on systems with much smaller memory and communication bandwidth. Thus, our implementation will be applicable to future generations of accelerator systems.
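
To make the idea of a user-provided interaction function more concrete, here is a minimal sketch assuming a softened gravitational interaction with G = 1. It is not the FDPS interface: `gravity_kernel` and `batched_interaction` are hypothetical names, and the batching of several particle groups into one call is only one plausible way to expose more parallelism per accelerator launch, which the abstract identifies as a limiting factor.

```python
def gravity_kernel(targets, sources, eps2):
    """Accelerations on each target from all sources (G = 1, softening eps2).

    targets: list of (x, y, z); sources: list of (x, y, z, mass).
    """
    acc = []
    for xi, yi, zi in targets:
        ax = ay = az = 0.0
        for xj, yj, zj, mj in sources:
            dx, dy, dz = xj - xi, yj - yi, zj - zi
            r2 = dx * dx + dy * dy + dz * dz + eps2
            m_over_r3 = mj / (r2 * r2 ** 0.5)
            ax += m_over_r3 * dx
            ay += m_over_r3 * dy
            az += m_over_r3 * dz
        acc.append((ax, ay, az))
    return acc


def batched_interaction(groups, eps2=1e-6):
    """Handle many (targets, sources) groups in one call.

    On an accelerator, one call like this could map to a single launch
    instead of one launch per group.
    """
    return [gravity_kernel(targets, sources, eps2) for targets, sources in groups]


groups = [([(0.0, 0.0, 0.0)], [(2.0, 0.0, 0.0, 4.0)])]
result = batched_interaction(groups, eps2=0.0)
```

With a source of mass 4 at distance 2 and no softening, the target feels an acceleration of magnitude 4 / 2² = 1 along the x axis, which is a quick sanity check on the kernel.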


Implementation and performance of the high level trigger electron and photon selection for the ATLAS experiment at the LHC

IEEE Transactions on Nuclear Science ◽

10.1109/tns.2006.871902 ◽

2006 ◽

Vol 53 (3) ◽

pp. 1424-1429

Author(s):

C. Schiavi

Keyword(s):

Atlas Experiment ◽

Level Trigger ◽

Selection For ◽

And Performance ◽

High Level ◽

Electron And Photon


Design and Performance of the Belle II High Level Trigger

10.22323/1.390.0769 ◽

2021 ◽

Author(s):

Markus Tobias Prim ◽

N. Braun ◽

Y. Guan ◽

O. Hartbrich ◽

R. Itoh ◽

...

Keyword(s):

Level Trigger ◽

Belle Ii ◽

And Performance ◽

High Level


Kernel specialization for improved adaptability and performance on graphics processing units (GPUs)

10.17760/d20003226 ◽

2012 ◽

Author(s):

Nicholas John Moore

Keyword(s):

Graphics Processing Units ◽

And Performance ◽

Graphics Processing


Convolving Pre-Trained Convolutional Neural Networks at Various Magnifications to Extract Diagnostic Features for Digital Pathology

10.1101/333773 ◽

2018 ◽

Author(s):

John-William Sidhom ◽

Alexander S. Baras

Keyword(s):

Deep Learning ◽

Graphics Processing Units ◽

Visual Recognition ◽

Digital Pathology ◽

Diagnostic Features ◽

High Level ◽

Graphics Processing ◽

Computational Analyses ◽

Radiology And Pathology

Deep learning is an area of artificial intelligence that has received much attention in the past few years, due both to an increase in computational power, with the increased use of graphics processing units (GPUs) for computational analyses, and to the performance of this class of algorithms on visual recognition tasks. They have found utility in applications ranging from image search to facial recognition for security and social media purposes. Their continued success has propelled their use across many new domains including the medical field, in areas of radiology and pathology in particular, as these fields are thought to be driven by visual recognition tasks. In this paper, we present an application of deep learning, termed ‘transfer learning’, using ResNet50, a pre-trained convolutional neural network (CNN), to act as a ‘feature-detector’ at various magnifications to identify low and high level features in digital pathology images of various breast lesions for the purpose of classifying them correctly into the labels of normal, benign, in-situ, or invasive carcinoma as provided in the ICIAR 2018 Breast Cancer Histology Challenge (BACH).


Developing Extensible Lattice-Boltzmann Simulators for General-Purpose Graphics-Processing Units

Communications in Computational Physics ◽

10.4208/cicp.351011.260112s ◽

2013 ◽

Vol 13 (3) ◽

pp. 867-879 ◽

Cited By ~ 6

Author(s):

Stuart D. C. Walsh ◽

Martin O. Saar

Keyword(s):

Code Generation ◽

Lattice Boltzmann ◽

Graphics Processing Units ◽

Parallel Implementation ◽

General Purpose ◽

Lattice Boltzmann Simulation ◽

Lattice Boltzmann Simulations ◽

Gpu Architectures ◽

Automatic Code ◽

Graphics Processing

Lattice-Boltzmann methods are versatile numerical modeling techniques capable of reproducing a wide variety of fluid-mechanical behavior. These methods are well suited to parallel implementation, particularly on the single-instruction multiple data (SIMD) parallel processing environments found in computer graphics processing units (GPUs). Although recent programming tools dramatically improve the ease with which GPU-based applications can be written, the programming environment still lacks the flexibility available to more traditional CPU programs. In particular, it may be difficult to develop modular and extensible programs that require variable on-device functionality with current GPU architectures. This paper describes a process of automatic code generation that overcomes these difficulties for lattice-Boltzmann simulations. It details the development of GPU-based modules for an extensible lattice-Boltzmann simulation package – LBHydra. The performance of the automatically generated code is compared to equivalent purpose-written codes for single-phase, multiphase, and multicomponent flows. The flexibility of the new method is demonstrated by simulating a rising, dissolving droplet moving through a porous medium with user generated lattice-Boltzmann models and subroutines.
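
The idea of generating specialized simulation code from a model description can be illustrated with a small sketch. It is not LBHydra's generator and emits Python where the paper targets GPU code: `generate_collision_source` is a hypothetical helper that writes an unrolled BGK collision step for a given set of lattice velocities and weights, shown here for the standard one-dimensional, three-velocity (D1Q3) lattice.

```python
def generate_collision_source(velocities, weights, name="collide"):
    """Emit source for an unrolled BGK collision step on a 1D lattice."""
    n = len(velocities)
    lines = [f"def {name}(f, tau):"]
    # Density and velocity moments, written out term by term.
    lines.append("    rho = " + " + ".join(f"f[{i}]" for i in range(n)))
    momentum = " + ".join(f"({c}) * f[{i}]" for i, c in enumerate(velocities))
    lines.append(f"    u = ({momentum}) / rho")
    lines.append("    out = []")
    # One relaxation statement per lattice direction, constants baked in.
    for i, (c, w) in enumerate(zip(velocities, weights)):
        lines.append(f"    cu = ({c}) * u")
        lines.append(f"    feq = {w!r} * rho * (1 + 3*cu + 4.5*cu*cu - 1.5*u*u)")
        lines.append(f"    out.append(f[{i}] - (f[{i}] - feq) / tau)")
    lines.append("    return out")
    return "\n".join(lines)


# D1Q3 lattice: rest particle plus one step right and one step left.
source = generate_collision_source([0, 1, -1], [2 / 3, 1 / 6, 1 / 6])
namespace = {}
exec(source, namespace)  # compile the generated kernel
collide = namespace["collide"]

f_in = [0.5, 0.3, 0.2]
f_out = collide(f_in, 1.0)
```

Because the velocities and weights are baked into the generated source, adding a new lattice or collision model means changing the description passed to the generator, not editing the kernel by hand. The collision step should conserve mass and momentum, which gives a simple check on the generated code.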


Use of GPU Computing for Uncertainty Quantification in Computational Mechanics: A Case Study

Scientific Programming ◽

10.1155/2011/730213 ◽

2011 ◽

Vol 19 (4) ◽

pp. 199-212 ◽

Cited By ~ 3

Author(s):

Gaurav ◽

Steven F. Wojtkiewicz

Keyword(s):

Parallel Computing ◽

Uncertainty Quantification ◽

Graphics Processing Units ◽

Computational Mechanics ◽

Gpu Computing ◽

Single Instruction Multiple Data ◽

Performance Constraints ◽

Multiple Data ◽

Graphics Processing

Graphics processing units (GPUs) are rapidly emerging as a more economical and highly competitive alternative to CPU-based parallel computing. As the degree of software control of GPUs has increased, many researchers have explored their use in non-gaming applications. Recent studies have shown that GPUs consistently outperform their best corresponding CPU-based parallel computing alternatives in single-instruction multiple-data (SIMD) strategies. This study explores the use of GPUs for uncertainty quantification in computational mechanics. Five types of analysis procedures that are frequently utilized for uncertainty quantification of mechanical and dynamical systems have been considered and their GPU implementations have been developed. The numerical examples presented in this study show that considerable gains in computational efficiency can be obtained for these procedures. It is expected that the GPU implementations presented in this study will serve as initial bases for further developments in the use of GPUs in the field of uncertainty quantification and will (i) aid the understanding of the performance constraints on the relevant GPU kernels and (ii) provide some guidance regarding the computational and the data structures to be utilized in these novel GPU implementations.


Implementation and Performance of the High-Level Trigger electron and photon selection for the ATLAS experiment at the LHC

Astroparticle, Particle and Space Physics, Detectors and Medical Physics Applications ◽

10.1142/9789812819093_0083 ◽

2008 ◽

Author(s):

F. Monticelli ◽

X. Anduaga ◽

J. Baines ◽

K. Benslama ◽

T. Berry ◽

...

Keyword(s):

Atlas Experiment ◽

Level Trigger ◽

Selection For ◽

And Performance ◽

High Level ◽

Electron And Photon


High-level programming for heterogeneous and hierarchical parallel systems

The International Journal of High Performance Computing Applications ◽

10.1177/1094342018807840 ◽

2018 ◽

Vol 32 (6) ◽

pp. 804-806

Author(s):

Javier García-Blas ◽

Christopher Brown

Keyword(s):

Graphics Processing Units ◽

Heterogeneous Computing ◽

Timing Analysis ◽

Parallel Systems ◽

General Purpose ◽

Ongoing Work ◽

High Level ◽

Tools And Techniques ◽

Graphics Processing ◽

Refactoring Tools

High-Level Heterogeneous and Hierarchical Parallel Systems (HLPGPU) aims to bring together researchers and practitioners to present new results and ongoing work on those aspects of high-level programming that are relevant or specific to general-purpose computing on graphics processing units (GPGPUs) and new architectures. The 2016 HLPGPU symposium was co-located with the HiPEAC conference in Prague, Czech Republic. HLPGPU is targeted at high-level parallel techniques, including programming models, libraries and languages, algorithmic skeletons, refactoring tools and techniques for parallel patterns, tools and systems to aid parallel programming, heterogeneous computing, timing analysis and statistical performance models.
