High Performance Graph Data Imputation on Multiple GPUs

In real applications, massive data with graph structures are often incomplete due to various restrictions. Therefore, graph data imputation algorithms have been widely used in the fields of social networks, sensor networks, and MRI to solve the graph data completion problem. To keep the data relevant, a data structure is represented by a graph-tensor, in which each matrix is the vertex value of a weighted graph. The convolutional imputation algorithm has been proposed to solve the low-rank graph-tensor completion problem that some data matrices are entirely unobserved. However, this data imputation algorithm has limited application scope because it is compute-intensive and low-performance on CPU. In this paper, we propose a scheme to perform the convolutional imputation algorithm with higher time performance on GPUs (Graphics Processing Units) by exploiting multi-core GPUs of CUDA architecture. We propose optimization strategies to achieve coalesced memory access for graph Fourier transform (GFT) computation and improve the utilization of GPU SM resources for singular value decomposition (SVD) computation. Furthermore, we design a scheme to extend the GPU-optimized implementation to multiple GPUs for large-scale computing. Experimental results show that the GPU implementation is both fast and accurate. On synthetic data of varying sizes, the GPU-optimized implementation running on a single Quadro RTX6000 GPU achieves up to 60.50× speedups over the GPU-baseline implementation. The multi-GPU implementation achieves up to 1.81× speedups on two GPUs versus the GPU-optimized implementation on a single GPU. On the ego-Facebook dataset, the GPU-optimized implementation achieves up to 77.88× speedups over the GPU-baseline implementation. Meanwhile, the GPU implementation and the CPU implementation achieve similar, low recovery errors.

Download Full-text

A lightweight approach to performance portability with targetDP

The International Journal of High Performance Computing Applications ◽

10.1177/1094342016682071 ◽

2016 ◽

Vol 32 (2) ◽

pp. 288-301

Author(s):

Alan Gray ◽

Kevin Stratford

Keyword(s):

Particle Physics ◽

Message Passing ◽

Graphics Processing Units ◽

High Performance ◽

Large Scale ◽

Message Passing Interface ◽

Graphics Processing Unit ◽

Processing Unit ◽

Performance Portability ◽

Graphics Processing

Leading high performance computing systems achieve their status through use of highly parallel devices such as NVIDIA graphics processing units or Intel Xeon Phi many-core CPUs. The concept of performance portability across such architectures, as well as traditional CPUs, is vital for the application programmer. In this paper we describe targetDP, a lightweight abstraction layer which allows grid-based applications to target data parallel hardware in a platform agnostic manner. We demonstrate the effectiveness of our pragmatic approach by presenting performance results for a complex fluid application (with which the model was co-designed), plus separate lattice quantum chromodynamics particle physics code. For each application, a single source code base is seen to achieve portable performance, as assessed within the context of the Roofline model. TargetDP can be combined with Message Passing Interface (MPI) to allow use on systems containing multiple nodes: we demonstrate this through provision of scaling results on traditional and graphics processing unit-accelerated large scale supercomputers.

Download Full-text

High-performance computing for computational modelling in built environment-related studies – a scientometric review

Journal of Engineering Design and Technology ◽

10.1108/jedt-07-2020-0294 ◽

2020 ◽

Vol ahead-of-print (ahead-of-print) ◽

Author(s):

Emmanuel Imuetinyan Aghimien ◽

Lerato Millicent Aghimien ◽

Olutomilayo Olayemi Petinrin ◽

Douglas Omoregie Aghimien

Keyword(s):

High Performance Computing ◽

Graphics Processing Units ◽

High Performance ◽

Large Scale ◽

Computational Models ◽

Computational Modelling ◽

Scientometric Analysis ◽

Content Type ◽

Performance Computing ◽

Aec Industry

Purpose This paper aims to present the result of a scientometric analysis conducted using studies on high-performance computing in computational modelling. This was done with a view to showcasing the need for high-performance computers (HPC) within the architecture, engineering and construction (AEC) industry in developing countries, particularly in Africa, where the use of HPC in developing computational models (CMs) for effective problem solving is still low. Design/methodology/approach An interpretivism philosophical stance was adopted for the study which informed a scientometric review of existing studies gathered from the Scopus database. Keywords such as high-performance computing, and computational modelling were used to extract papers from the database. Visualisation of Similarities viewer (VOSviewer) was used to prepare co-occurrence maps based on the bibliographic data gathered. Findings Findings revealed the scarcity of research emanating from Africa in this area of study. Furthermore, past studies had placed focus on high-performance computing in the development of computational modelling and theory, parallel computing and improved visualisation, large-scale application software, computer simulations and computational mathematical modelling. Future studies can also explore areas such as cloud computing, optimisation, high-level programming language, natural science computing, computer graphics equipment and Graphics Processing Units as they relate to the AEC industry. Research limitations/implications The study assessed a single database for the search of related studies. Originality/value The findings of this study serve as an excellent theoretical background for AEC researchers seeking to explore the use of HPC for CMs development in the quest for solving complex problems in the industry.

Download Full-text

ADEPT: a domain independent sequence alignment strategy for gpu architectures

BMC Bioinformatics ◽

10.1186/s12859-020-03720-1 ◽

2020 ◽

Vol 21 (1) ◽

Author(s):

Muaaz G. Awan ◽

Jack Deslippe ◽

Aydin Buluc ◽

Oguz Selvitopi ◽

Steven Hofmeyr ◽

...

Keyword(s):

Sequence Alignment ◽

High Performance ◽

Large Scale ◽

Peak Performance ◽

Local Alignment ◽

Independent Sequence ◽

Multiple Gpus ◽

Alignment Strategy ◽

Gpu Architectures ◽

Domain Independent

Abstract Background Bioinformatic workflows frequently make use of automated genome assembly and protein clustering tools. At the core of most of these tools, a significant portion of execution time is spent in determining optimal local alignment between two sequences. This task is performed with the Smith-Waterman algorithm, which is a dynamic programming based method. With the advent of modern sequencing technologies and increasing size of both genome and protein databases, a need for faster Smith-Waterman implementations has emerged. Multiple SIMD strategies for the Smith-Waterman algorithm are available for CPUs. However, with the move of HPC facilities towards accelerator based architectures, a need for an efficient GPU accelerated strategy has emerged. Existing GPU based strategies have either been optimized for a specific type of characters (Nucleotides or Amino Acids) or for only a handful of application use-cases. Results In this paper, we present ADEPT, a new sequence alignment strategy for GPU architectures that is domain independent, supporting alignment of sequences from both genomes and proteins. Our proposed strategy uses GPU specific optimizations that do not rely on the nature of sequence. We demonstrate the feasibility of this strategy by implementing the Smith-Waterman algorithm and comparing it to similar CPU strategies as well as the fastest known GPU methods for each domain. ADEPT’s driver enables it to scale across multiple GPUs and allows easy integration into software pipelines which utilize large scale computational systems. We have shown that the ADEPT based Smith-Waterman algorithm demonstrates a peak performance of 360 GCUPS and 497 GCUPs for protein based and DNA based datasets respectively on a single GPU node (8 GPUs) of the Cori Supercomputer. Overall ADEPT shows 10x faster performance in a node-to-node comparison against a corresponding SIMD CPU implementation. Conclusions ADEPT demonstrates a performance that is either comparable or better than existing GPU strategies. We demonstrated the efficacy of ADEPT in supporting existing bionformatics software pipelines by integrating ADEPT in MetaHipMer a high-performance denovo metagenome assembler and PASTIS a high-performance protein similarity graph construction pipeline. Our results show 10% and 30% boost of performance in MetaHipMer and PASTIS respectively.

Download Full-text

On the Efficient Evaluation of the Exchange Correlation Potential on Graphics Processing Unit Clusters

Frontiers in Chemistry ◽

10.3389/fchem.2020.581058 ◽

2020 ◽

Vol 8 ◽

Author(s):

David B. Williams-Young ◽

Wibe A. de Jong ◽

Hubertus J. J. van Dam ◽

Chao Yang

Keyword(s):

Graphics Processing Units ◽

High Performance ◽

Density Functional ◽

Materials Science ◽

Basis Set ◽

Theoretical Treatment ◽

Processing Unit ◽

Multiple Gpus ◽

Exchange Correlation ◽

Graphics Processing

The predominance of Kohn–Sham density functional theory (KS-DFT) for the theoretical treatment of large experimentally relevant systems in molecular chemistry and materials science relies primarily on the existence of efficient software implementations which are capable of leveraging the latest advances in modern high-performance computing (HPC). With recent trends in HPC leading toward increasing reliance on heterogeneous accelerator-based architectures such as graphics processing units (GPU), existing code bases must embrace these architectural advances to maintain the high levels of performance that have come to be expected for these methods. In this work, we purpose a three-level parallelism scheme for the distributed numerical integration of the exchange-correlation (XC) potential in the Gaussian basis set discretization of the Kohn–Sham equations on large computing clusters consisting of multiple GPUs per compute node. In addition, we purpose and demonstrate the efficacy of the use of batched kernels, including batched level-3 BLAS operations, in achieving high levels of performance on the GPU. We demonstrate the performance and scalability of the implementation of the purposed method in the NWChemEx software package by comparing to the existing scalable CPU XC integration in NWChem.

Download Full-text

An FPGA Implementation of Deep Spiking Neural Networks for Low-Power and Fast Classification

Neural Computation ◽

10.1162/neco_a_01245 ◽

2020 ◽

Vol 32 (1) ◽

pp. 182-204 ◽

Cited By ~ 3

Author(s):

Xiping Ju ◽

Biao Fang ◽

Rui Yan ◽

Xiaoliang Xu ◽

Huajin Tang

Keyword(s):

Neural Networks ◽

High Performance ◽

Large Scale ◽

Hardware Architecture ◽

Clock Frequency ◽

Data Set ◽

Speed Up ◽

Fast Classification ◽

Spike Signals ◽

Gpu Implementation

A spiking neural network (SNN) is a type of biological plausibility model that performs information processing based on spikes. Training a deep SNN effectively is challenging due to the nondifferention of spike signals. Recent advances have shown that high-performance SNNs can be obtained by converting convolutional neural networks (CNNs). However, the large-scale SNNs are poorly served by conventional architectures due to the dynamic nature of spiking neurons. In this letter, we propose a hardware architecture to enable efficient implementation of SNNs. All layers in the network are mapped on one chip so that the computation of different time steps can be done in parallel to reduce latency. We propose new spiking max-pooling method to reduce computation complexity. In addition, we apply approaches based on shift register and coarsely grained parallels to accelerate convolution operation. We also investigate the effect of different encoding methods on SNN accuracy. Finally, we validate the hardware architecture on the Xilinx Zynq ZCU102. The experimental results on the MNIST data set show that it can achieve an accuracy of 98.94% with eight-bit quantized weights. Furthermore, it achieves 164 frames per second (FPS) under 150 MHz clock frequency and obtains 41[Formula: see text] speed-up compared to CPU implementation and 22 times lower power than GPU implementation.

Download Full-text

High-performance computing in water resources hydrodynamics

Journal of Hydroinformatics ◽

10.2166/hydro.2020.163 ◽

2020 ◽

Vol 22 (5) ◽

pp. 1217-1235 ◽

Cited By ~ 3

Author(s):

M. Morales-Hernández ◽

M. B. Sharif ◽

S. Gangrade ◽

T. T. Dullo ◽

S.-C. Kao ◽

...

Keyword(s):

Water Resources ◽

High Performance Computing ◽

Graphics Processing Units ◽

High Performance ◽

Large Scale ◽

Test Case ◽

Processing Unit ◽

Central Processing ◽

Graphics Processing ◽

Performance Computing

Abstract This work presents a vision of future water resources hydrodynamics codes that can fully utilize the strengths of modern high-performance computing (HPC). The advances to computing power, formerly driven by the improvement of central processing unit processors, now focus on parallel computing and, in particular, the use of graphics processing units (GPUs). However, this shift to a parallel framework requires refactoring the code to make efficient use of the data as well as changing even the nature of the algorithm that solves the system of equations. These concepts along with other features such as the precision for the computations, dry regions management, and input/output data are analyzed in this paper. A 2D multi-GPU flood code applied to a large-scale test case is used to corroborate our statements and ascertain the new challenges for the next-generation parallel water resources codes.

Download Full-text

Hierarchical algorithms on hierarchical architectures

Philosophical Transactions of The Royal Society A Mathematical Physical and Engineering Sciences ◽

10.1098/rsta.2019.0055 ◽

2020 ◽

Vol 378 (2166) ◽

pp. 20190055 ◽

Cited By ~ 4

Author(s):

D. E. Keyes ◽

H. Ltaief ◽

G. Turkiyyah

Keyword(s):

High Performance ◽

Large Scale ◽

Fast Multipole Method ◽

Numerical Algorithms ◽

Building Blocks ◽

Linear Operators ◽

Low Rank ◽

Processing Power ◽

Storage Complexity ◽

Time And Energy

A traditional goal of algorithmic optimality, squeezing out flops, has been superseded by evolution in architecture. Flops no longer serve as a reasonable proxy for all aspects of complexity. Instead, algorithms must now squeeze memory, data transfers, and synchronizations, while extra flops on locally cached data represent only small costs in time and energy. Hierarchically low-rank matrices realize a rarely achieved combination of optimal storage complexity and high-computational intensity for a wide class of formally dense linear operators that arise in applications for which exascale computers are being constructed. They may be regarded as algebraic generalizations of the fast multipole method. Methods based on these hierarchical data structures and their simpler cousins, tile low-rank matrices, are well proportioned for early exascale computer architectures, which are provisioned for high processing power relative to memory capacity and memory bandwidth. They are ushering in a renaissance of computational linear algebra. A challenge is that emerging hardware architecture possesses hierarchies of its own that do not generally align with those of the algorithm. We describe modules of a software toolkit, hierarchical computations on manycore architectures, that illustrate these features and are intended as building blocks of applications, such as matrix-free higher-order methods in optimization and large-scale spatial statistics. Some modules of this open-source project have been adopted in the software libraries of major vendors. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.

Download Full-text

A high-performance integrated hydrodynamic modelling system for urban flood simulations

Journal of Hydroinformatics ◽

10.2166/hydro.2015.029 ◽

2015 ◽

Vol 17 (4) ◽

pp. 518-533 ◽

Cited By ~ 28

Author(s):

Qiuhua Liang ◽

Luke S. Smith

Keyword(s):

High Resolution ◽

Graphics Processing Units ◽

High Performance ◽

Large Scale ◽

Surface Flow ◽

Source Surface ◽

Hydrodynamic Modelling ◽

Flood Inundation ◽

Urban Flood ◽

Modelling System

A new High-Performance Integrated hydrodynamic Modelling System (Hi-PIMS) is tested for urban flood simulation. The software solves the two-dimensional shallow water equations using a first-order accurate Godunov-type shock-capturing scheme incorporated with the Harten, Lax and van Leer approximate Riemann solver with the contact wave restored (HLLC) for flux evaluation. The benefits of modern graphics processing units are explored to accelerate large-scale high-resolution simulations. In order to test its performance, the tool is applied to predict flood inundation due to rainfall and a point source surface flow in Glasgow, Scotland, and a hypothetical inundation event at different spatial resolutions in Thamesmead, England, caused by embankment failure. Numerical experiments demonstrate potential benefits for high-resolution modelling of urban flood inundation, and a much-improved level of performance without compromising result quality.

Download Full-text

The GPU version of LICOM3 under HIP framework and its large-scale application

10.5194/gmd-2020-323 ◽

2020 ◽

Author(s):

Pengfei Wang ◽

Jinrong Jiang ◽

Pengfei Lin ◽

Mengrong Ding ◽

Junlin Wei ◽

...

Keyword(s):

Graphics Processing Units ◽

General Circulation ◽

High Performance ◽

Large Scale ◽

Data Transfer ◽

Circulation Model ◽

Ocean General Circulation Model ◽

Ocean Model ◽

Global Ocean ◽

Almost All

Abstract. A high-resolution (1/20°) global ocean general circulation model with Graphics processing units (GPUs) code implementations is developed based on the LASG/IAP Climate system Ocean Model version 3 (LICOM3) under Heterogeneous-compute Interface for Portability (HIP) framework. The dynamic core and physics package of LICOM3 are both ported to the GPU, and 3-dimensional parallelization is applied. The HIP version of the LICOM3 (LICOM3-HIP) is 42 times faster than what the same number of CPU cores dose, when 384 AMD GPUs and CPU cores are used. The LICOM3-HIP has excellent scalability; it can still obtain speedup of more than four on 9216 GPUs comparing to 384 GPUs. In this phase, we successfully performed a test of 1/20° LICOM3-HIP using 6550 nodes and 26200 GPUs, and at the grand scale, the model’s time to solution can still obtain an increasing, about 2.72 simulated years per day (SYPD). The high performance was due to putting almost all of computation processes inside GPUs, and thus greatly reduces the time cost of data transfer between CPUs and GPUs. At the same time, a 14-year spin-up integration following the phase 2 of Ocean Model Intercomparison Project (OMIP-2) protocol of surface forcing has been conducted, and the preliminary results have been evaluated. We found that the model results have little differences from the CPU version. Further comparison with observations and lower-resolution LICOM3 results suggests that the 1/20° LICOM3-HIP can not only reproduce the observations, but also produce much smaller scale activities, such as submesoscale eddies and frontal scales structures.

Download Full-text

Nonconvex Low-Rank Tensor Completion from Noisy Data

Operations Research ◽

10.1287/opre.2021.2106 ◽

2021 ◽

Author(s):

Changxiao Cai ◽

Gen Li ◽

H. Vincent Poor ◽

Yuxin Chen

Keyword(s):

Large Scale ◽

Linear Time ◽

Practical Interest ◽

Optimal Estimation ◽

Low Rank ◽

Estimation Accuracy ◽

Tensor Completion ◽

Minimal Sample ◽

Completion Problem ◽

Rank Tensor

This paper investigates a problem of broad practical interest, namely, the reconstruction of a large-dimensional low-rank tensor from highly incomplete and randomly corrupted observations of its entries. Although a number of papers have been dedicated to this tensor completion problem, prior algorithms either are computationally too expensive for large-scale applications or come with suboptimal statistical performance. Motivated by this, we propose a fast two-stage nonconvex algorithm—a gradient method following a rough initialization—that achieves the best of both worlds: optimal statistical accuracy and computational efficiency. Specifically, the proposed algorithm provably completes the tensor and retrieves all low-rank factors within nearly linear time, while at the same time enjoying near-optimal statistical guarantees (i.e., minimal sample complexity and optimal estimation accuracy). The insights conveyed through our analysis of nonconvex optimization might have implications for a broader family of tensor reconstruction problems beyond tensor completion.

Download Full-text