uBlasCL: Architecture Agnostic Massively Parallel Linear Algebra System

Author(s):  
Athanasios Iliopoulos ◽  
John G. Michopoulos

The need for more efficient, more abstract, and easier-to-use parallel programming interfaces has recently intensified with the introduction and remarkable evolution of technologies such as general-purpose graphics processing units (GPGPUs) and multi-core central processing units (CPUs). In this paper we introduce the uBlasCL system, a domain-specific embedded language within C++ that implements a basic linear algebra interface for OpenCL. The system is architecture agnostic, in the sense that it can be programmed independently of the target architecture, is massively parallel, and achieves efficiency that tracks hardware performance advances well. Our effort builds on template metaprogramming and domain-specific language fundamentals to develop a system that has the syntactic flexibility of a symbolic term-processing system for expressing mathematics, together with the semantic and executional power to exploit the parallelism offered by the hardware in a manner that is automated, transparent to the user, and efficiently mapped onto the hardware. We also describe its relation to C++, template programming, domain-specific languages, and OpenCL. In the course of developing uBlasCL we also developed a middleware library named CL++, a convenient C++ interface to OpenCL. After describing the architecture and implementation of the system, we present performance testing results demonstrating its potential power.
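The core mechanism behind such a domain-specific embedded language is the expression template: operators build a lightweight expression tree instead of evaluating eagerly, and the backend walks that tree to generate fused (here, OpenCL) kernels. The following is a minimal CPU-only sketch of that idea; all names are hypothetical and none of this is the actual uBlasCL or CL++ API.

```cpp
// Minimal expression-template sketch (hypothetical names, not the uBlasCL API).
// The sum x + y is not evaluated eagerly; it is captured as an expression node
// that a backend (e.g. an OpenCL kernel generator) could traverse and fuse.
#include <cstddef>
#include <vector>

template <class L, class R>
struct AddExpr {
    const L& lhs;
    const R& rhs;
    double operator[](std::size_t i) const { return lhs[i] + rhs[i]; }
    std::size_t size() const { return lhs.size(); }
};

struct Vector {
    std::vector<double> data;
    explicit Vector(std::size_t n, double v = 0.0) : data(n, v) {}
    double operator[](std::size_t i) const { return data[i]; }
    double& operator[](std::size_t i) { return data[i]; }
    std::size_t size() const { return data.size(); }

    // Assigning an expression triggers a single fused evaluation loop;
    // a GPU backend would instead emit one OpenCL kernel for the whole tree.
    template <class E>
    Vector& operator=(const E& e) {
        for (std::size_t i = 0; i < size(); ++i) data[i] = e[i];
        return *this;
    }
};

template <class L, class R>
AddExpr<L, R> operator+(const L& l, const R& r) { return {l, r}; }

int main() {
    Vector x(1000, 1.0), y(1000, 2.0), z(1000);
    z = x + y;  // builds AddExpr<Vector, Vector>, evaluated lazily on assignment
}
```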

2018 ◽  
Vol 11 (11) ◽  
pp. 4621-4635 ◽  
Author(s):  
Istvan Z. Reguly ◽  
Daniel Giles ◽  
Devaraj Gopinathan ◽  
Laure Quivy ◽  
Joakim H. Beck ◽  
...  

Abstract. In this paper, we present the VOLNA-OP2 tsunami model and implementation: a finite-volume nonlinear shallow-water equations (NSWE) solver built on the OP2 domain-specific language (DSL) for unstructured mesh computations. VOLNA-OP2 is unique among tsunami solvers in its support for several high-performance computing platforms: central processing units (CPUs), the Intel Xeon Phi, and graphics processing units (GPUs). This is achieved by keeping the scientific code separate from the various parallel implementations, which eases maintenance. It has already been used in production for several years; here we discuss how it can be integrated into various workflows, such as a statistical emulator. The scalability of the code is demonstrated on three supercomputers built with classical Xeon CPUs, the Intel Xeon Phi, and NVIDIA P100 GPUs. VOLNA-OP2 shows an ability to deliver productivity as well as performance and portability to its users across a number of platforms.
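The maintainability claim rests on the kernel/loop separation that OP2-style DSLs provide: the numerical kernel is written once, and the looping construct decides how to execute it on each platform. The sketch below illustrates that separation in plain C++; the names (par_loop, update_cell) are invented for illustration and are not the actual OP2 API.

```cpp
// Illustrative sketch of the kernel/loop separation behind OP2-style DSLs
// (hypothetical names; not the actual OP2 API). The "scientific" per-cell
// kernel is written once; the parallel-loop abstraction decides how it is
// executed (serial, OpenMP, CUDA, ...), leaving the physics code unchanged.
#include <cstddef>
#include <utility>
#include <vector>

// User-level kernel: update a cell's state from a flux value (pure, local).
inline void update_cell(double& h, const double& flux, double dt) {
    h += dt * flux;
}

// DSL-like looping construct: here a plain serial loop; an OP2-style backend
// would generate an equivalent OpenMP, CUDA, or MPI+X implementation.
template <class Kernel, class... Args>
void par_loop(std::size_t n_cells, Kernel k, Args&&... args) {
    for (std::size_t i = 0; i < n_cells; ++i) k(i, std::forward<Args>(args)...);
}

int main() {
    const std::size_t n = 1 << 20;
    std::vector<double> h(n, 1.0), flux(n, 0.1);
    const double dt = 1e-3;

    // The same call site works regardless of which backend par_loop maps to.
    par_loop(n, [&](std::size_t i, double dt_) { update_cell(h[i], flux[i], dt_); }, dt);
}
```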


Author(s):  
Nikitas Papangelopoulos ◽  
Dimitrios Vlachakis ◽  
Arianna Filntisi ◽  
Paraskevas Fakourelis ◽  
Louis Papageorgiou ◽  
...  

The exponential growth of available biological data in recent years, coupled with their increasing complexity, has made their analysis a computationally challenging process. Traditional central processing units (CPUs) are reaching their limit in processing power and are not designed primarily for multithreaded applications. Graphics processing units (GPUs), on the other hand, are affordable, scalable computing powerhouses that, thanks to the ever-increasing demand for higher-quality graphics, have yet to reach their limit. Typically, high-end CPUs have 8-16 cores, whereas GPUs can have more than 2,500 cores. GPUs are also, by design, highly parallel, multicore, and multithreaded, capable of handling thousands of threads performing the same calculation on different subsets of a large data set. This ability is what makes them perfectly suited for biological analysis tasks. Lately this potential has been realized by many bioinformatics researchers, and a wide variety of tools and algorithms has been ported to GPUs or designed from the ground up to maximize the usage of available cores. Here, we present a comprehensive review of available bioinformatics tools, ranging from sequence and image analysis to protein structure prediction and systems biology, that use the NVIDIA Compute Unified Device Architecture (CUDA) framework for general-purpose computing on graphics processing units (GPGPU).
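The data-parallel pattern described above, the same operation applied independently to different portions of a large data set, is exactly what CUDA kernels express at the scale of thousands of hardware threads. As a rough CPU-side analogue (a minimal sketch, not taken from any of the reviewed tools), standard C++ parallel algorithms capture the same structure:

```cpp
// CPU-side analogue of the GPU data-parallel pattern: the same scoring
// function applied independently to every element of a large data set.
// A CUDA or OpenCL kernel expresses exactly this, but with thousands of
// lightweight hardware threads instead of a handful of CPU cores.
#include <algorithm>
#include <cstdio>
#include <execution>
#include <vector>

int main() {
    std::vector<double> reads(1'000'000, 0.5);   // stand-in for per-read data
    std::vector<double> scores(reads.size());

    std::transform(std::execution::par_unseq,
                   reads.begin(), reads.end(), scores.begin(),
                   [](double x) { return x * x + 1.0; });  // per-element work

    std::printf("first score: %f\n", scores.front());
}
```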


2021 ◽  
Vol 47 (2) ◽  
pp. 1-28
Author(s):  
Goran Flegar ◽  
Hartwig Anzt ◽  
Terry Cojean ◽  
Enrique S. Quintana-Ortí

The use of mixed precision in numerical algorithms is a promising strategy for accelerating scientific applications. In particular, the adoption of specialized hardware and data formats for low-precision arithmetic in high-end GPUs (graphics processing units) has motivated numerous efforts aimed at carefully reducing the working precision in order to speed up computations. For algorithms whose performance is bound by memory bandwidth, the idea of compressing their data before (and after) memory accesses has received considerable attention. One idea is to store an approximate operator, such as a preconditioner, in lower than working precision, hopefully without impacting the algorithm output. We realize the first high-performance implementation of an adaptive-precision block-Jacobi preconditioner, which selects the precision format used to store the preconditioner data on the fly, taking into account the numerical properties of the individual preconditioner blocks. We implement the adaptive block-Jacobi preconditioner as production-ready functionality in the Ginkgo linear algebra library, considering not only the precision formats that are part of the IEEE standard but also customized formats that tailor the length of the exponent and significand to the characteristics of the preconditioner blocks. Experiments run on a state-of-the-art GPU accelerator show that our implementation offers attractive runtime savings.
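A generic sketch of the adaptive-precision storage idea is shown below; it is not the Ginkgo API, and the condition-number threshold is purely illustrative. Each diagonal block is stored in the cheapest format its numerical properties are assumed to tolerate, and values are promoted back to working precision when the preconditioner is applied, so only storage and memory traffic are affected.

```cpp
// Generic sketch of adaptive-precision block storage (not the Ginkgo API):
// well-conditioned blocks are demoted to float, difficult ones stay in double.
#include <variant>
#include <vector>

struct BlockFP64 { std::vector<double> values; };
struct BlockFP32 { std::vector<float>  values; };
using StoredBlock = std::variant<BlockFP64, BlockFP32>;

// Hypothetical precision selection: keep double if the block looks
// ill-conditioned, otherwise store it in single precision.
StoredBlock store_block(const std::vector<double>& block, double cond_estimate) {
    const double fp32_cond_limit = 1e4;  // illustrative threshold only
    if (cond_estimate > fp32_cond_limit) {
        return BlockFP64{block};
    }
    BlockFP32 b;
    b.values.assign(block.begin(), block.end());  // narrowing to float on store
    return b;
}

int main() {
    std::vector<double> blk = {4.0, 1.0, 1.0, 3.0};          // a 2x2 block, row-major
    StoredBlock s = store_block(blk, /*cond_estimate=*/2.0);  // demoted to float
    (void)s;  // applying the preconditioner would promote back to double
}
```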


2011 ◽  
Vol 28 (1) ◽  
pp. 1-14 ◽  
Author(s):  
W. van Straten ◽  
M. Bailes

Abstract. dspsr is a high-performance, open-source, object-oriented, digital signal processing software library and application suite for use in radio pulsar astronomy. Written primarily in C++, the library implements an extensive range of modular algorithms that can optionally exploit both multiple-core processors and general-purpose graphics processing units. After over a decade of research and development, dspsr is now stable and in widespread use in the community. This paper presents a detailed description of its functionality, justification of major design decisions, analysis of phase-coherent dispersion removal algorithms, and demonstration of performance on some contemporary microprocessor architectures.
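Phase-coherent dispersion removal has a simple structure: transform the raw voltage time series to the frequency domain, multiply by the conjugate of the interstellar dispersion transfer function, and transform back. The sketch below shows only that structure; it is not dspsr code, it uses a naive DFT instead of optimized overlap-save FFTs, and the exact chirp phase (sign convention and dispersion constant) should be taken from the paper.

```cpp
// Structural sketch of phase-coherent dedispersion (illustrative only, not
// dspsr code): FFT, multiply by the conjugate dispersion chirp, inverse FFT.
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<double>;

// Naive O(n^2) DFT, sign = -1 forward, +1 inverse (unnormalised).
std::vector<cplx> dft(const std::vector<cplx>& x, int sign) {
    const double pi = std::acos(-1.0);
    const std::size_t n = x.size();
    std::vector<cplx> y(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j)
            y[k] += x[j] * std::polar(1.0, sign * 2.0 * pi * double(k * j) / double(n));
    return y;
}

// chirp_phase(f) is the dispersion phase at offset f (Hz) from the band centre;
// the standard coherent-dedispersion form is roughly 2*pi*D*f^2 / (f0^2*(f0+f))
// with D proportional to the dispersion measure -- check the sign convention
// and constants against the paper before using this on real data.
std::vector<cplx> coherent_dedisperse(const std::vector<cplx>& voltages,
                                      double (*chirp_phase)(double),
                                      double bandwidth_hz) {
    std::vector<cplx> spectrum = dft(voltages, -1);
    const std::size_t n = spectrum.size();
    for (std::size_t k = 0; k < n; ++k) {
        // standard DFT ordering: bins above n/2 are negative frequencies
        double f = (k <= n / 2 ? double(k) : double(k) - double(n)) * bandwidth_hz / double(n);
        spectrum[k] *= std::polar(1.0, -chirp_phase(f));  // apply conjugate chirp
    }
    std::vector<cplx> out = dft(spectrum, +1);
    for (auto& v : out) v /= double(n);  // inverse-DFT normalisation
    return out;
}

int main() {
    std::vector<cplx> v(64, cplx(1.0, 0.0));  // toy voltage block
    auto dedispersed = coherent_dedisperse(
        v, [](double f) { return 1e-9 * f * f; }, 1e6);  // toy chirp, 1 MHz band
    (void)dedispersed;
}
```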


Nanophotonics ◽  
2020 ◽  
Vol 9 (13) ◽  
pp. 4097-4108 ◽  
Author(s):  
Moustafa Ahmed ◽  
Yas Al-Hadeethi ◽  
Ahmed Bakry ◽  
Hamed Dalir ◽  
Volker J. Sorger

Abstract. The technologically relevant task of feature extraction from data in deep-learning systems is routinely accomplished as repeated fast Fourier transforms (FFTs) performed electronically in prevalent domain-specific architectures such as graphics processing units (GPUs). However, electronic systems are limited with respect to power dissipation and delay due to wire-charging challenges related to interconnect capacitance. Here we present a silicon-photonics-based architecture for convolutional neural networks that harnesses the phase property of light to perform FFTs efficiently by executing the convolution as a multiplication in the Fourier domain. The algorithmic execution time is determined by the time of flight of the signal through this photonic reconfigurable passive FFT 'filter' circuit and is on the order of tens of picoseconds. A sensitivity analysis shows that this optical processor must be thermally phase-stabilized to within a few degrees. Furthermore, we find that for a small sample number, the obtainable number of convolutions per unit time, power, and chip area outperforms GPUs by about two orders of magnitude. Lastly, we show that, conceptually, the optical FFT and convolution-processing performance is directly linked to optoelectronic device-level performance, and that improvements in plasmonics, metamaterials, or nanophotonics are fueling next-generation, densely interconnected, intelligent photonic circuits with relevance for edge-computing 5G networks by processing tensor operations optically.
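The Fourier-domain shortcut that the photonic circuit exploits is the standard convolution theorem: circular convolution in the signal domain equals element-wise multiplication of the spectra. A minimal numerical check of that identity (plain C++, naive DFT, no optics involved) is:

```cpp
// Convolution theorem check: circular convolution equals the inverse DFT of
// the element-wise product of the spectra. Naive DFT keeps it self-contained.
#include <cmath>
#include <complex>
#include <cstdio>
#include <vector>

using cplx = std::complex<double>;

std::vector<cplx> dft(const std::vector<cplx>& x, int sign) {
    const double pi = std::acos(-1.0);
    const std::size_t n = x.size();
    std::vector<cplx> y(n);
    for (std::size_t k = 0; k < n; ++k)
        for (std::size_t j = 0; j < n; ++j)
            y[k] += x[j] * std::polar(1.0, sign * 2.0 * pi * double(k * j) / double(n));
    return y;
}

int main() {
    std::vector<cplx> a = {1, 2, 3, 4}, b = {0.5, -1, 0.25, 2};

    // Direct circular convolution, O(n^2).
    std::vector<cplx> direct(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < a.size(); ++j)
            direct[i] += a[j] * b[(i + a.size() - j) % a.size()];

    // Fourier route: multiply spectra, transform back (O(n log n) with a real FFT).
    std::vector<cplx> A = dft(a, -1), B = dft(b, -1), C(a.size());
    for (std::size_t k = 0; k < a.size(); ++k) C[k] = A[k] * B[k];
    std::vector<cplx> via_fft = dft(C, +1);
    for (auto& v : via_fft) v /= double(a.size());  // inverse-DFT normalisation

    for (std::size_t i = 0; i < a.size(); ++i)
        std::printf("%zu: direct=%.3f fft=%.3f\n", i, direct[i].real(), via_fft[i].real());
}
```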

