Bringing heterogeneity to the CMS software framework

2020 ◽  
Vol 245 ◽  
pp. 05009
Author(s):  
Andrea Bocci ◽  
David Dagenhart ◽  
Vincenzo Innocente ◽  
Christopher Jones ◽  
Matti Kortelainen ◽  
...  

The advent of computing resources with co-processors, for example Graphics Processing Units (GPUs) or Field-Programmable Gate Arrays (FPGAs), for use cases like the CMS High-Level Trigger (HLT) or data processing at leadership-class supercomputers poses challenges for current data processing frameworks. These challenges include developing a model for algorithms to offload their computations to the co-processors as well as keeping the traditional CPU busy with other work. The CMS data processing framework, CMSSW, implements multithreading using the Intel Threading Building Blocks (TBB) library, which uses tasks as its concurrent units of work. In this paper we discuss a generic mechanism, implemented in CMSSW, for interacting effectively with non-CPU resources. In addition, configuring such a heterogeneous system is challenging. In CMSSW an application is configured with a configuration file written in Python, and the algorithm types are part of the configuration. The challenge therefore is to unify the CPU and co-processor settings while allowing their implementations to remain separate. We explain how we solved these challenges while minimizing the necessary changes to the CMSSW framework. We also discuss, with a concrete example, how algorithms can offload work to NVIDIA GPUs using the CUDA API directly.

2014 ◽  
Vol 03 (01) ◽  
pp. 1450002 ◽  
Author(s):  
J. KOCZ ◽  
L. J. GREENHILL ◽  
B. R. BARSDELL ◽  
G. BERNARDI ◽  
A. JAMESON ◽  
...  

Radio astronomical imaging arrays comprising large numbers of antennas, O(10²–10³), have posed a signal processing challenge because of the required O(N²) cross-correlation of signals from each antenna and the requisite signal routing. This motivated the implementation of a Packetized Correlator architecture that applies Field Programmable Gate Arrays (FPGAs) to the O(N) "F-stage", transforming time-domain data to frequency-domain data, and Graphics Processing Units (GPUs) to the O(N²) "X-stage", performing an outer product among the spectra of each antenna. The design is readily scalable to at least O(10³) antennas. Fringes, visibility amplitudes and sky image results obtained during field testing are presented.


2010 ◽  
Vol 18 (1) ◽  
pp. 1-33 ◽  
Author(s):  
Andre R. Brodtkorb ◽  
Christopher Dyken ◽  
Trond R. Hagen ◽  
Jon M. Hjelmervik ◽  
Olaf O. Storaasli

Node-level heterogeneous architectures have become attractive during the last decade for several reasons: compared to traditional symmetric CPUs, they offer high peak performance and are energy- and/or cost-efficient. With the increase of fine-grained parallelism in high-performance computing, as well as the introduction of parallelism in workstations, there is an acute need for a good overview and understanding of these architectures. We give an overview of the state of the art in heterogeneous computing, focusing on three commonly found architectures: the Cell Broadband Engine Architecture, graphics processing units (GPUs), and field programmable gate arrays (FPGAs). We present a review of hardware, available software tools, and an overview of state-of-the-art techniques and algorithms. Furthermore, we present a qualitative and quantitative comparison of the architectures, and give our view on the future of heterogeneous computing.


Author(s):  
S. M. Ord ◽  
B. Crosse ◽  
D. Emrich ◽  
D. Pallot ◽  
R. B. Wayth ◽  
...  

The Murchison Widefield Array (MWA) is a Square Kilometre Array precursor. The telescope is located at the Murchison Radio-astronomy Observatory (MRO) in Western Australia. The MWA consists of 4096 dipoles arranged into 128 dual-polarisation aperture arrays, forming a connected-element interferometer that cross-correlates signals from all 256 inputs. A hybrid approach to the correlation task is employed, with some processing stages performed by bespoke hardware based on Field Programmable Gate Arrays, and others by Graphics Processing Units housed in general-purpose rack-mounted servers. The correlation capability required is approximately 8 tera floating-point operations per second. The MWA has commenced operations, and the correlator is generating 8.3 TB day⁻¹ of correlation products, which are subsequently transferred 700 km from the MRO to Perth (WA) in real time for storage and offline processing. In this paper, we outline the correlator design, signal path, and processing elements, and present the data format for the internal and external interfaces.


2020 ◽  
Vol 245 ◽  
pp. 01006
Author(s):  
Placido Fernandez Declara ◽  
J. Daniel Garcia

Compass is an SPMD (Single Program Multiple Data) tracking algorithm for the upcoming LHCb upgrade in 2021, where 40 Tb/s must be processed in real time to select events. Alternative frameworks, algorithms and architectures are being tested to cope with this deluge of data. Allen is a research and development project aiming to run the full HLT1 (High Level Trigger) on GPUs (Graphics Processing Units). Allen's architecture focuses on data-oriented layouts and algorithms to better exploit parallel architectures, and the algorithms developed for it, implemented and optimized for GPU architectures, have already proven to exploit GPUs efficiently. We explore opportunities for the SIMD (Single Instruction Multiple Data) paradigm on CPUs through the Compass algorithm. We use the Intel SPMD Program Compiler (ISPC) to achieve good readability, maintainability and performance while writing "GPU-like" source code, preserving the main design of the algorithm.


2019 ◽  
Vol 214 ◽  
pp. 05010 ◽  
Author(s):  
Giulio Eulisse ◽  
Piotr Konopka ◽  
Mikolaj Krzewicki ◽  
Matthias Richter ◽  
David Rohr ◽  
...  

ALICE is one of the four major LHC experiments at CERN. When the accelerator enters the Run 3 data-taking period, starting in 2021, ALICE expects almost 100 times more central Pb-Pb collisions than today, resulting in a large increase in data throughput. In order to cope with this new challenge, the collaboration had to extensively rethink the whole data processing chain, with a tighter integration between the online and offline computing worlds. This system, code-named ALICE O2, is being developed in collaboration with the FAIR experiments at GSI. It is based on the ALFA framework, which provides a generalized implementation of the ALICE High Level Trigger approach, designed around distributed software entities coordinating and communicating via message passing. We highlight our efforts to integrate ALFA within the ALICE O2 environment. We analyze the challenges arising from the different running environments for production and development, and derive requirements for a flexible and modular software framework. In particular, we present the ALICE O2 Data Processing Layer, which deals with ALICE-specific requirements in terms of the data model. The main goal is to reduce the complexity of developing algorithms and managing a distributed system, thereby leading to a significant simplification for the large majority of ALICE users.


Electronics ◽  
2021 ◽  
Vol 10 (8) ◽  
pp. 884
Author(s):  
Stefano Rossi ◽  
Enrico Boni

Methods of increasing complexity are currently being proposed for ultrasound (US) echographic signal processing. Graphics Processing Unit (GPU) resources, which allow massive exploitation of parallel computing, are ideal candidates for these tasks. Many high-performance US instruments, including open scanners like ULA-OP 256, have an architecture based only on Field-Programmable Gate Arrays (FPGAs) and/or Digital Signal Processors (DSPs). This paper proposes the integration of the embedded NVIDIA Jetson Xavier AGX module on board ULA-OP 256. The system architecture was revised to allow the introduction of a new Peripheral Component Interconnect Express (PCIe) communication channel, while maintaining backward compatibility with all other embedded computing resources already on board. Moreover, the Input/Output (I/O) peripherals of the module make the ultrasound system independent, freeing the user from the need for an external controlling PC.


Author(s):  
José Capmany ◽  
Daniel Pérez

The field programmable photonic gate array (FPPGA) is an integrated photonic device/subsystem that operates similarly to a field programmable gate array in electronics. It is a set of programmable photonic analogue blocks (PPABs) and reconfigurable photonic interconnects (RPIs) implemented on a photonic chip. The PPABs provide the building blocks for implementing basic optical analogue operations (reconfigurable/independent power splitting and phase shifting). Broadly, they enable reconfigurable processing just as configurable logic elements (CLEs) or programmable logic blocks (PLBs) carry out digital operations in electronic FPGAs, or configurable analogue blocks (CABs) carry out analogue operations in electronic field programmable analogue arrays (FPAAs). Reconfigurable interconnections between PPABs are provided by the RPIs. This chapter presents the basic principles of integrated FPPGAs. It describes their main building blocks and discusses alternatives for their high-level layouts, design flow, technology mapping and physical implementation. Finally, it shows that waveguide meshes lead naturally to a compact solution.


2016 ◽  
Vol 850 ◽  
pp. 129-135
Author(s):  
Buğra Şimşek ◽  
Nursel Akçam

This study presents the parallelization of the Hamming Distance algorithm, which is used for iris comparison in iris recognition systems, on heterogeneous systems that can include Central Processing Units (CPUs), Graphics Processing Units (GPUs), Digital Signal Processing (DSP) boards, Field Programmable Gate Arrays (FPGAs) and other mobile platforms, using OpenCL. OpenCL allows the same code to run on CPUs, GPUs, FPGAs and DSP boards. Heterogeneous computing refers to systems that include different kinds of devices (CPUs, GPUs, FPGAs and other accelerators); it gains performance or reduces power for suitable algorithms on these OpenCL-supported devices. In this study, the Hamming Distance algorithm has been coded in C++ as a sequential program and parallelized with OpenCL using a method we designed. Our OpenCL code has been executed on an Nvidia GT430 GPU and an Intel Xeon 5650 processor. The OpenCL implementation demonstrates speedups of up to 87× with parallelization. Our study also differs from other studies that accelerate iris matching in that it ensures heterogeneous computing through OpenCL.


2011 ◽  
Author(s):  
Zach Olson

Optical coherence tomography (OCT) techniques have opened up a number of new medical imaging applications in research and clinical settings. Key application areas include cancer research, vascular applications such as imaging arterial plaque, and ophthalmology applications such as pre- and post-operative cataract surgery imaging. Emerging technologies in galvo control, light sources, detectors, and parallel hardware-based processing are increasing the quality and performance of images, as well as reducing the cost and footprint of OCT systems. The parallel computing capabilities of field programmable gate arrays (FPGAs), multi-core processors, and graphics processing units (GPUs) have enabled real-time OCT image processing, which provides real-time image data to support surgical procedures.

