PFEAST: A High Performance Sparse Eigenvalue Solver Using Distributed-Memory Linear Solvers

Locality is a characteristic of an algorithm that describes how well it uses fast-access memory. On distributed-memory computers, for example, the focus is on the memory local to each computational node. To achieve a high-performance implementation, one should choose the variant of the algorithm with the best locality. Studying the locality of a parallel algorithm amounts to estimating the number and volume of its data communications. In this work, we formulate and prove statements for distributed-memory computers that allow us to estimate the asymptotic volume of data communication operations. These estimates are useful for comparing alternative versions of parallel algorithms when analyzing data communication costs.

The Old and the New: Can Physics-Informed Deep-Learning Replace Traditional Linear Solvers?

Frontiers in Big Data ◽

10.3389/fdata.2021.669097 ◽

2021 ◽

Vol 4 ◽

Author(s):

Stefano Markidis

Keyword(s):

Neural Networks ◽

Deep Learning ◽

High Performance ◽

Critical Role ◽

Low Frequency ◽

Accurate Solution ◽

Limiting Factor ◽

Data Set ◽

Linear Solvers ◽

Computational Performance

Physics-Informed Neural Networks (PINN) are neural networks encoding the problem governing equations, such as Partial Differential Equations (PDE), as a part of the neural network. PINNs have emerged as a new essential tool to solve various challenging problems, including computing linear systems arising from PDEs, a task for which several traditional methods exist. In this work, we focus first on evaluating the potential of PINNs as linear solvers in the case of the Poisson equation, an omnipresent equation in scientific computing. We characterize PINN linear solvers in terms of accuracy and performance under different network configurations (depth, activation functions, input data set distribution). We highlight the critical role of transfer learning. Our results show that low-frequency components of the solution converge quickly as an effect of the F-principle. In contrast, an accurate solution of the high frequencies requires an exceedingly long time. To address this limitation, we propose integrating PINNs into traditional linear solvers. We show that this integration leads to the development of new solvers that are on par with other high-performance solvers, such as PETSc conjugate gradient linear solvers, in terms of performance and accuracy. Overall, while accuracy and computational performance are still limiting factors for the direct use of PINN linear solvers, hybrid strategies combining traditional linear solver approaches with emerging deep-learning techniques are among the most promising methods for developing a new class of linear solvers.
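The hybrid idea in the abstract above can be illustrated with a toy sketch. This is not the paper's method: it is a plain conjugate gradient (CG) solver for a 1D Poisson system, warm-started from a guess that already contains the low-frequency part of the solution. That guess is a hypothetical stand-in for what a PINN might supply, and the grid size, the mode numbers and every function name are assumptions made for illustration only.

```python
import math

def apply_poisson(u):
    """Apply the 1D finite-difference stencil (-1, 2, -1) with zero boundary values."""
    n = len(u)
    out = []
    for i in range(n):
        left = u[i - 1] if i > 0 else 0.0
        right = u[i + 1] if i < n - 1 else 0.0
        out.append(2.0 * u[i] - left - right)
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def conjugate_gradient(rhs, x0, tol=1e-8, max_iter=1000):
    """Standard CG; stops when ||r|| <= tol * ||rhs||. Returns (solution, iterations)."""
    x = list(x0)
    r = [bi - ai for bi, ai in zip(rhs, apply_poisson(x))]
    p = list(r)
    rs = dot(r, r)
    stop = (tol ** 2) * dot(rhs, rhs)
    iters = 0
    while rs > stop and iters < max_iter:
        Ap = apply_poisson(p)
        alpha = rs / dot(p, Ap)
        x = [xi + alpha * pk for xi, pk in zip(x, p)]
        r = [ri - alpha * ak for ri, ak in zip(r, Ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pk for ri, pk in zip(r, p)]
        rs = rs_new
        iters += 1
    return x, iters

n = 63  # number of interior grid points (assumed)

def mode(k):
    """Discrete sine mode k, an eigenvector of the stencil above."""
    return [math.sin(k * math.pi * (i + 1) / (n + 1)) for i in range(n)]

# Exact solution: two low-frequency modes plus one higher-frequency mode.
low = [a + 0.5 * b for a, b in zip(mode(1), mode(2))]
u_exact = [l + 0.1 * h for l, h in zip(low, mode(8))]
rhs = apply_poisson(u_exact)

# Cold start from zero versus warm start from the low-frequency part only.
x_cold, cold_iters = conjugate_gradient(rhs, [0.0] * n)
x_warm, warm_iters = conjugate_gradient(rhs, low)
print("cold iterations:", cold_iters, "warm iterations:", warm_iters)
```

In exact arithmetic, CG needs at most one iteration per eigen-mode present in the initial residual, so removing the low-frequency modes from the residual leaves less work for the traditional solver. A real PINN guess would only approximate those modes, so the saving would be less clean than in this sketch.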

Optimizing High Performance Distributed Memory Parallel Hash Tables for DNA k-mer Counting

SC18: International Conference for High Performance Computing, Networking, Storage and Analysis ◽

10.1109/sc.2018.00014 ◽

2018 ◽

Cited By ~ 6

Author(s):

Tony C. Pan ◽

Sanchit Misra ◽

Srinivas Aluru

Keyword(s):

High Performance ◽

Distributed Memory ◽

Hash Tables

A Framework for HI Spectral Source Finding Using Distributed-Memory Supercomputing

Publications of the Astronomical Society of Australia ◽

10.1017/pasa.2014.18 ◽

2014 ◽

Vol 31 ◽

Cited By ~ 2

Author(s):

Stefan Westerlund ◽

Christopher Harris

Keyword(s):

High Performance ◽

Distributed Memory ◽

Computing Systems ◽

Sky Surveys ◽

Local Statistics ◽

Wide Range ◽

Gaussian Source ◽

High Bandwidth ◽

Traditional Approaches ◽

Performance Computing

The latest generation of radio astronomy interferometers will conduct all-sky surveys with data products consisting of petabytes of spectral line data. Traditional approaches to identifying and parameterising the astrophysical sources within this data will not scale to datasets of this magnitude, since the performance of workstations will not keep up with the real-time generation of data. For this reason, it is necessary to employ high performance computing systems consisting of a large number of processors connected by a high-bandwidth network. In order to make use of such supercomputers, substantial modifications must be made to serial source finding code. To ease the transition, this work presents the Scalable Source Finder Framework (SSoFF), a framework providing storage access, networking communication and data composition functionality, which can support a wide range of source finding algorithms provided they can be applied to subsets of the entire image. Additionally, the Parallel Gaussian Source Finder (PGSF) was implemented using SSoFF, utilising Gaussian filters, thresholding, and local statistics. PGSF was able to search a 256 GB simulated dataset in under 24 minutes, significantly less than the 8 to 12 hour observation that would generate such a dataset.
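The three ingredients named for PGSF (Gaussian filtering, thresholding, local statistics) can be sketched on a 1D spectrum as follows. This is a minimal illustration, not the PGSF implementation: the kernel width, window size, the 5-sigma threshold, the median/MAD noise estimate and the synthetic data are all assumptions.

```python
import math
import statistics

def gaussian_kernel(sigma, radius):
    """Normalised, truncated Gaussian kernel of length 2 * radius + 1."""
    weights = [math.exp(-0.5 * (k / sigma) ** 2) for k in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]

def smooth(signal, kernel):
    """Convolve the signal with the kernel, clamping indices at the edges."""
    radius = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(i + k - radius, 0), n - 1)
            acc += w * signal[j]
        out.append(acc)
    return out

def detect(signal, sigma=2.0, half_window=20, n_sigma=5.0):
    """Return indices whose smoothed value exceeds a local robust threshold."""
    smoothed = smooth(signal, gaussian_kernel(sigma, radius=int(3 * sigma)))
    hits = []
    for i, value in enumerate(smoothed):
        window = smoothed[max(0, i - half_window): i + half_window + 1]
        med = statistics.median(window)
        mad = statistics.median([abs(v - med) for v in window])
        # 1.4826 * MAD estimates the standard deviation for Gaussian noise.
        if value > med + n_sigma * 1.4826 * mad:
            hits.append(i)
    return hits

# Synthetic spectrum: a deterministic ripple standing in for noise,
# plus one Gaussian-shaped source centred on channel 60.
spectrum = [
    0.3 * math.cos(0.9 * i) + 20.0 * math.exp(-0.5 * ((i - 60) / 2.0) ** 2)
    for i in range(120)
]
hits = detect(spectrum)
print(hits)
```

Because the statistics are computed in a window around each channel, the same routine can be run independently on subsets of the data, which is the property the framework requires of its source finding algorithms.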

ON MESSAGE PACKAGING IN TASK SCHEDULING FOR DISTRIBUTED MEMORY PARALLEL MACHINES

International Journal of Foundations of Computer Science ◽

10.1142/s0129054101000497 ◽

2001 ◽

Vol 12 (03) ◽

pp. 285-306 ◽

Cited By ~ 2

Author(s):

NORIYUKI FUJIMOTO ◽

TOMOKI BABA ◽

TAKASHI HASHIMOTO ◽

KENICHI HAGIHARA

Keyword(s):

Task Scheduling ◽

High Performance ◽

Distributed Memory ◽

Parallel Machines ◽

Scheduling Algorithm ◽

Parallel Programs ◽

Parallel Program ◽

Interprocessor Communication ◽

Task Scheduling Algorithm ◽

Software Overhead

In this paper, we report a performance gap between a schedule with small makespan on the task scheduling model and the corresponding parallel program on distributed memory parallel machines. The main reason for the gap is the software overhead in the interprocessor communication. Therefore, speedup ratios of schedules on the model do not approximate well those of parallel programs on the machines. The purpose of the paper is to obtain a task scheduling algorithm that generates a schedule with a good approximation to the corresponding parallel program and with small makespan. For this purpose, we propose algorithm BCSH, which generates only bulk synchronous schedules. In those schedules, no-communication phases and communication phases appear alternately. All interprocessor communications are done only in the latter phases, and thus the corresponding parallel programs can easily make better use of the message packaging technique. It reduces the many software overheads of messages from a source processor to the same destination processor to almost one software overhead, and improves the performance of a parallel program significantly. Finally, we show some experimental results of performance gaps on BCSH, Kruatrachue's algorithm DSH, and Ahmad et al.'s algorithm ECPFD. The schedules by DSH and ECPFD are famous for their small makespans, but message packaging cannot be effectively applied to the corresponding programs. The results show that a bulk synchronous schedule with small makespan has the advantages that the gap is small and the corresponding program is a high performance parallel one.
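The effect of message packaging described above can be sketched with a simple cost model. This is an illustration only, not the BCSH algorithm: the message list, the fixed per-message software overhead and the per-byte cost are hypothetical values chosen for the example.

```python
from collections import defaultdict

def pack_messages(messages):
    """Merge all messages sharing the same (source, destination) pair into one."""
    packed = defaultdict(int)
    for src, dst, size in messages:
        packed[(src, dst)] += size
    return [(src, dst, size) for (src, dst), size in sorted(packed.items())]

def communication_cost(messages, overhead=50.0, per_byte=0.01):
    """Assumed cost model: a fixed software overhead per message plus a per-byte cost."""
    return sum(overhead + per_byte * size for _, _, size in messages)

# One communication phase: (source processor, destination processor, bytes).
phase = [(0, 1, 100), (0, 1, 200), (0, 2, 50), (0, 1, 300), (1, 2, 80)]
packed = pack_messages(phase)

print(packed)                     # [(0, 1, 600), (0, 2, 50), (1, 2, 80)]
print(communication_cost(phase))  # five overheads are paid
print(communication_cost(packed)) # only three overheads are paid
```

The total number of bytes is unchanged; only the number of per-message overheads drops. Packaging is straightforward here because all messages of a phase are known before any of them is sent, which is what a bulk synchronous schedule guarantees.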

Abstract: Preliminary Report for a High Precision Distributed Memory Parallel Eigenvalue Solver

2012 SC Companion: High Performance Computing, Networking Storage and Analysis ◽

10.1109/sc.companion.2012.255 ◽

2012 ◽

Author(s):

T. Imamura ◽

S. Yamada ◽

M. Machida

Keyword(s):

High Precision ◽

Preliminary Report ◽

Distributed Memory ◽

Eigenvalue Solver

Extending OpenMP for NUMA Machines

Scientific Programming ◽

10.1155/2000/464182 ◽

2000 ◽

Vol 8 (3) ◽

pp. 163-181 ◽

Cited By ~ 16

Author(s):

John Bircsak ◽

Peter Craig ◽

RaeLyn Crowell ◽

Zarka Cvetanovic ◽

Jonathan Harris ◽

...

Keyword(s):

Shared Memory ◽

High Performance ◽

Distributed Memory ◽

Parallel Programs ◽

Compiler Optimizations ◽

High Performance Fortran ◽

Efficient Code ◽

Memory Architectures ◽

Shared Memory Architectures ◽

Fast Access

This paper describes extensions to OpenMP that implement data placement features needed for NUMA architectures. OpenMP is a collection of compiler directives and library routines used to write portable parallel programs for shared-memory architectures. Writing efficient parallel programs for NUMA architectures, which have characteristics of both shared-memory and distributed-memory architectures, requires that a programmer control the placement of data in memory and the placement of computations that operate on that data. Optimal performance is obtained when computations occur on processors that have fast access to the data needed by those computations. OpenMP -- designed for shared-memory architectures -- does not by itself address these issues. The extensions to OpenMP Fortran presented here have been mainly taken from High Performance Fortran. The paper describes some of the techniques that the Compaq Fortran compiler uses to generate efficient code based on these extensions. It also describes some additional compiler optimizations, and concludes with some preliminary results.

Parallel implementation of inverse adding-doubling and Monte Carlo multi-layered programs for high performance computing systems with shared and distributed memory

Computer Physics Communications ◽

10.1016/j.cpc.2015.02.029 ◽

2015 ◽

Vol 194 ◽

pp. 64-75 ◽

Cited By ~ 5

Author(s):

Svyatoslav Chugunov ◽

Changying Li

Keyword(s):

Monte Carlo ◽

High Performance Computing ◽

High Performance ◽

Distributed Memory ◽

Parallel Implementation ◽

Computing Systems ◽

Performance Computing

The use of computational kernels in full and sparse linear solvers, efficient code design on high-performance RISC processors

Vector and Parallel Processing — VECPAR'96 - Lecture Notes in Computer Science ◽

10.1007/3-540-62828-2_116 ◽

1997 ◽

pp. 108-139 ◽

Cited By ~ 4

Author(s):

Michel J. Daydé ◽

Iain S. Duff

Keyword(s):

High Performance ◽

Code Design ◽

Linear Solvers ◽

Efficient Code ◽

Sparse Linear Solvers

PGHPF – An Optimizing High Performance Fortran Compiler for Distributed Memory Machines

Scientific Programming ◽

10.1155/1997/705102 ◽

1997 ◽

Vol 6 (1) ◽

pp. 29-40 ◽

Cited By ~ 9

Author(s):

Zeki Bozkus ◽

Larry Meadows ◽

Steven Nakamoto ◽

Vincent Schuster ◽

Mark Young

Keyword(s):

High Performance ◽

Distributed Memory ◽

Parallel Machines ◽

High Efficiency ◽

Memory Systems ◽

Production Quality ◽

Distributed Memory Machines ◽

High Performance Fortran ◽

Application Developers ◽

Efficient Software

High Performance Fortran (HPF) is the first widely supported, efficient, and portable parallel programming language for shared and distributed memory systems. HPF is realized through a set of directive-based extensions to Fortran 90. It enables application developers and Fortran end-users to write compact, portable, and efficient software that will compile and execute on workstations, shared memory servers, clusters, traditional supercomputers, or massively parallel processors. This article describes a production-quality HPF compiler for a set of parallel machines. Compilation techniques such as data and computation distribution, communication generation, run-time support, and optimization issues are elaborated as the basis for an HPF compiler implementation on distributed memory machines. The performance of this compiler on benchmark programs demonstrates that high efficiency can be achieved executing HPF code on parallel architectures.
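The data distribution step mentioned above can be made concrete with the two standard HPF distribution formats, BLOCK and CYCLIC. The sketch below only computes which processor owns each array element; it is not taken from the PGHPF compiler, and the function names and the array and processor counts are illustrative assumptions.

```python
def block_owner(i, n, p):
    """Owner of element i (0-based) when n elements are BLOCK-distributed over p processors."""
    block_size = -(-n // p)  # ceil(n / p)
    return i // block_size

def cyclic_owner(i, p):
    """Owner of element i (0-based) under a CYCLIC distribution over p processors."""
    return i % p

n, p = 10, 4
block_owners = [block_owner(i, n, p) for i in range(n)]
cyclic_owners = [cyclic_owner(i, p) for i in range(n)]

print(block_owners)   # [0, 0, 0, 1, 1, 1, 2, 2, 2, 3]
print(cyclic_owners)  # [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
```

A compiler for distributed memory machines can use such an ownership map to assign each computation to the processor that holds the data it writes, and to generate communication for operands owned elsewhere.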