Bringing heterogeneity to the CMS software framework

2020 ◽  
Vol 245 ◽  
pp. 05009
Author(s):  
Andrea Bocci ◽  
David Dagenhart ◽  
Vincenzo Innocente ◽  
Christopher Jones ◽  
Matti Kortelainen ◽  
...  

The advent of computing resources with co-processors, for example Graphics Processing Units (GPUs) or Field-Programmable Gate Arrays (FPGAs), for use cases like the CMS High-Level Trigger (HLT) or data processing at leadership-class supercomputers poses challenges for current data processing frameworks. These challenges include developing a model for algorithms to offload their computations to the co-processors as well as keeping the traditional CPU busy with other work. The CMS data processing framework, CMSSW, implements multithreading using the Intel Threading Building Blocks (TBB) library, which uses tasks as its concurrent units of work. In this paper we discuss a generic mechanism, implemented in CMSSW, for interacting effectively with non-CPU resources. In addition, configuring such a heterogeneous system is challenging. In CMSSW an application is configured with a configuration file written in Python, and the algorithm types are part of the configuration. The challenge therefore is to unify the CPU and co-processor settings while allowing their implementations to remain separate. We explain how we solved these challenges while minimizing the necessary changes to the CMSSW framework. We also discuss, with a concrete example, how algorithms can offload work to NVIDIA GPUs using the CUDA API directly.

2014 ◽  
Vol 03 (01) ◽  
pp. 1450002 ◽  
Author(s):  
J. KOCZ ◽  
L. J. GREENHILL ◽  
B. R. BARSDELL ◽  
G. BERNARDI ◽  
A. JAMESON ◽  
...  

Radio astronomical imaging arrays comprising large numbers of antennas, O(10²–10³), have posed a signal processing challenge because of the required O(N²) cross-correlation of signals from each antenna and the requisite signal routing. This motivated the implementation of a Packetized Correlator architecture that applies Field Programmable Gate Arrays (FPGAs) to the O(N) "F-stage", transforming time-domain data to frequency-domain data, and Graphics Processing Units (GPUs) to the O(N²) "X-stage", performing an outer product among the spectra of each antenna. The design is readily scalable to at least O(10³) antennas. Fringes, visibility amplitudes and sky image results obtained during field testing are presented.


2010 ◽  
Vol 18 (1) ◽  
pp. 1-33 ◽  
Author(s):  
Andre R. Brodtkorb ◽  
Christopher Dyken ◽  
Trond R. Hagen ◽  
Jon M. Hjelmervik ◽  
Olaf O. Storaasli

Node-level heterogeneous architectures have become attractive during the last decade for several reasons: compared to traditional symmetric CPUs, they offer high peak performance and are energy- and/or cost-efficient. With the increase of fine-grained parallelism in high-performance computing, as well as the introduction of parallelism in workstations, there is an acute need for a good overview and understanding of these architectures. We give an overview of the state of the art in heterogeneous computing, focusing on three commonly found architectures: the Cell Broadband Engine Architecture, graphics processing units (GPUs), and field programmable gate arrays (FPGAs). We present a review of hardware, available software tools, and an overview of state-of-the-art techniques and algorithms. Furthermore, we present a qualitative and quantitative comparison of the architectures, and give our view on the future of heterogeneous computing.


Author(s):  
S. M. Ord ◽  
B. Crosse ◽  
D. Emrich ◽  
D. Pallot ◽  
R. B. Wayth ◽  
...  

The Murchison Widefield Array (MWA) is a Square Kilometre Array precursor. The telescope is located at the Murchison Radio-astronomy Observatory (MRO) in Western Australia. The MWA consists of 4096 dipoles arranged into 128 dual-polarisation aperture arrays, forming a connected-element interferometer that cross-correlates signals from all 256 inputs. A hybrid approach to the correlation task is employed, with some processing stages performed by bespoke hardware based on Field Programmable Gate Arrays, and others by Graphics Processing Units housed in general-purpose rack-mounted servers. The correlation capability required is approximately 8 tera floating-point operations per second. The MWA has commenced operations, and the correlator is generating 8.3 TB day⁻¹ of correlation products, which are subsequently transferred 700 km from the MRO to Perth (WA) in real time for storage and offline processing. In this paper, we outline the correlator design, signal path, and processing elements, and present the data format for the internal and external interfaces.


2020 ◽  
Vol 245 ◽  
pp. 01006
Author(s):  
Placido Fernandez Declara ◽  
J. Daniel Garcia

Compass is an SPMD (Single Program Multiple Data) tracking algorithm for the upcoming LHCb upgrade in 2021, where 40 Tb/s must be processed in real time to select events. Alternative frameworks, algorithms and architectures are being tested to cope with this deluge of data. Allen is a research and development project aiming to run the full HLT1 (High Level Trigger) on GPUs (Graphics Processing Units). Allen's architecture focuses on data-oriented layouts and algorithms to better exploit parallel architectures, and the algorithms developed for it, implemented and optimized for GPU architectures, have already proven to exploit GPUs efficiently. We explore opportunities for the SIMD (Single Instruction Multiple Data) paradigm on CPUs through the Compass algorithm. We use the Intel SPMD Program Compiler (ISPC) to achieve good readability, maintainability and performance while writing "GPU-like" source code, preserving the main design of the algorithm.


2019 ◽  
Vol 214 ◽  
pp. 05010 ◽  
Author(s):  
Giulio Eulisse ◽  
Piotr Konopka ◽  
Mikolaj Krzewicki ◽  
Matthias Richter ◽  
David Rohr ◽  
...  

ALICE is one of the four major LHC experiments at CERN. When the accelerator enters the Run 3 data-taking period, starting in 2021, ALICE expects almost 100 times more central Pb-Pb collisions than today, resulting in a large increase in data throughput. In order to cope with this new challenge, the collaboration had to extensively rethink the whole data processing chain, with a tighter integration between the online and offline computing worlds. This system, code-named ALICE O2, is being developed in collaboration with the FAIR experiments at GSI. It is based on the ALFA framework, which provides a generalized implementation of the ALICE High Level Trigger approach, designed around distributed software entities coordinating and communicating via message passing. We highlight our efforts to integrate ALFA within the ALICE O2 environment. We analyze the challenges arising from the different running environments for production and development, and derive requirements for a flexible and modular software framework. In particular, we present the ALICE O2 Data Processing Layer, which deals with ALICE-specific requirements in terms of the data model. The main goal is to reduce the complexity of developing algorithms and managing a distributed system, thereby leading to a significant simplification for the large majority of ALICE users.


Electronics ◽  
2021 ◽  
Vol 10 (8) ◽  
pp. 884
Author(s):  
Stefano Rossi ◽  
Enrico Boni

Methods of increasing complexity are currently being proposed for ultrasound (US) echographic signal processing. Graphics Processing Unit (GPU) resources, which allow massive exploitation of parallel computing, are ideal candidates for these tasks. Many high-performance US instruments, including open scanners like ULA-OP 256, have an architecture based only on Field-Programmable Gate Arrays (FPGAs) and/or Digital Signal Processors (DSPs). This paper proposes the integration of the embedded NVIDIA Jetson Xavier AGX module on board ULA-OP 256. The system architecture was revised to allow the introduction of a new Peripheral Component Interconnect Express (PCIe) communication channel, while maintaining backward compatibility with all other embedded computing resources already on board. Moreover, the Input/Output (I/O) peripherals of the module make the ultrasound system independent, freeing the user from the need for an external controlling PC.


Author(s):  
José Capmany ◽  
Daniel Pérez

The field programmable photonic gate array (FPPGA) is an integrated photonic device/subsystem that operates similarly to a field programmable gate array in electronics. It is a set of programmable photonic analogue blocks (PPABs) and reconfigurable photonic interconnects (RPIs) implemented on a photonic chip. The PPABs provide the building blocks for implementing basic optical analogue operations (reconfigurable/independent power splitting and phase shifting). Broadly, they enable reconfigurable processing just as configurable logic elements (CLEs) or programmable logic blocks (PLBs) carry out digital operations in electronic FPGAs, or configurable analogue blocks (CABs) carry out analogue operations in electronic field programmable analogue arrays (FPAAs). Reconfigurable interconnections between PPABs are provided by the RPIs. This chapter presents the basic principles of integrated FPPGAs. It describes their main building blocks and discusses alternatives for their high-level layouts, design flow, technology mapping and physical implementation. Finally, it shows that waveguide meshes lead naturally to a compact solution.


2016 ◽  
Vol 850 ◽  
pp. 129-135
Author(s):  
Buğra Şimşek ◽  
Nursel Akçam

This study presents the parallelization of the Hamming Distance algorithm, which is used for iris comparison in iris recognition systems, on heterogeneous systems that can include Central Processing Units (CPUs), Graphics Processing Units (GPUs), Digital Signal Processing (DSP) boards, Field Programmable Gate Arrays (FPGAs) and other mobile platforms, using OpenCL. OpenCL allows the same code to run on CPUs, GPUs, FPGAs and DSP boards. Heterogeneous computing refers to systems that include different kinds of devices (CPUs, GPUs, FPGAs and other accelerators); it gains performance or reduces power for suitable algorithms on these OpenCL-supported devices. In this study, the Hamming Distance algorithm has been coded in C++ as a sequential program and parallelized with OpenCL using a method we designed. Our OpenCL code has been executed on an Nvidia GT430 GPU and an Intel Xeon 5650 processor. The OpenCL implementation demonstrates speedups of up to 87× with parallelization. Our study also differs from other studies that accelerate iris matching in that it ensures heterogeneous computing through OpenCL.


2011 ◽  
Author(s):  
Zach Olson

Optical coherence tomography (OCT) techniques have opened up a number of new medical imaging applications in research and clinical settings. Key application areas include cancer research, vascular applications such as imaging arterial plaque, and ophthalmology applications such as pre- and post-operative cataract surgery imaging. Emerging technologies in galvo control, light sources, detectors, and parallel hardware-based processing are increasing the quality and performance of images, as well as reducing the cost and footprint of OCT systems. The parallel computing capabilities of field programmable gate arrays (FPGAs), multi-core processors, and graphics processing units (GPUs) have enabled real-time OCT image processing, which provides real-time image data to support surgical procedures.

