Assessing the robustness and scalability of the accelerated pseudo-transient method towards exascale computing

Mapping Intimacies ◽

10.5194/gmd-2021-411 ◽

2022 ◽

Author(s):

Ludovic Räss ◽

Ivan Utkin ◽

Thibault Duretz ◽

Samuel Omlin ◽

Yuri Y. Podladchikov

Keyword(s):

High Performance ◽

Degrees Of Freedom ◽

Variable Viscosity ◽

Shear Banding ◽

Numerical Algorithms ◽

Nonlinear Problems ◽

Iterative Solvers ◽

Arbitrary Distribution ◽

The Road ◽

Flow Configurations

Abstract. The development of highly efficient, robust and scalable numerical algorithms lags behind the rapid increase in massive parallelism of modern hardware. We address this challenge with the accelerated pseudo-transient iterative method and present here a physically motivated derivation. We analytically determine optimal iteration parameters for a variety of basic physical processes and confirm the validity of theoretical predictions with numerical experiments. We provide an efficient numerical implementation of pseudo-transient solvers on graphical processing units (GPUs) using the Julia language. We achieve a parallel efficiency over 96 % on 2197 GPUs in distributed memory parallelisation weak scaling benchmarks. 2197 GPUs allow for unprecedented terascale solutions of 3D variable viscosity Stokes flow on 49953 grid cells involving over 1.2 trillion degrees of freedom. We verify the robustness of the method by handling contrasts up to 9 orders of magnitude in material parameters such as viscosity, and arbitrary distribution of viscous inclusions for different flow configurations. Moreover, we show that this method is well suited to tackle strongly nonlinear problems such as shear-banding in a visco-elasto-plastic medium. A GPU-based implementation can outperform CPU-based direct-iterative solvers in terms of wall-time even at relatively low resolution. We additionally motivate the accessibility of the method by its conciseness, flexibility, physically motivated derivation and ease of implementation. This solution strategy has thus a great potential for future high-performance computing applications, and for paving the road to exascale in the geosciences and beyond.

Download Full-text

Numerical algorithms for high-performance computational science

Philosophical Transactions of The Royal Society A Mathematical Physical and Engineering Sciences ◽

10.1098/rsta.2019.0066 ◽

2020 ◽

Vol 378 (2166) ◽

pp. 20190066 ◽

Cited By ~ 2

Author(s):

Jack Dongarra ◽

Laura Grigori ◽

Nicholas J. Higham

Keyword(s):

High Performance ◽

Numerical Algorithms ◽

Computational Science ◽

Floating Point ◽

Important Criterion ◽

Data Movement ◽

Floating Point Arithmetic ◽

High Performance Computers ◽

Point Arithmetic ◽

Speed And Accuracy

A number of features of today’s high-performance computers make it challenging to exploit these machines fully for computational science. These include increasing core counts but stagnant clock frequencies; the high cost of data movement; use of accelerators (GPUs, FPGAs, coprocessors), making architectures increasingly heterogeneous; and multi- ple precisions of floating-point arithmetic, including half-precision. Moreover, as well as maximizing speed and accuracy, minimizing energy consumption is an important criterion. New generations of algorithms are needed to tackle these challenges. We discuss some approaches that we can take to develop numerical algorithms for high-performance computational science, with a view to exploiting the next generation of supercomputers. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.

Download Full-text

Adaptive Precision Block-Jacobi for High Performance Preconditioning in the Ginkgo Linear Algebra Software

ACM Transactions on Mathematical Software ◽

10.1145/3441850 ◽

2021 ◽

Vol 47 (2) ◽

pp. 1-28

Author(s):

Goran Flegar ◽

Hartwig Anzt ◽

Terry Cojean ◽

Enrique S. Quintana-Ortí

Keyword(s):

Linear Algebra ◽

Graphics Processing Units ◽

High Performance ◽

Numerical Algorithms ◽

Mixed Precision ◽

Before And After ◽

Memory Accesses ◽

Specialized Hardware ◽

The Individual ◽

Graphics Processing

The use of mixed precision in numerical algorithms is a promising strategy for accelerating scientific applications. In particular, the adoption of specialized hardware and data formats for low-precision arithmetic in high-end GPUs (graphics processing units) has motivated numerous efforts aiming at carefully reducing the working precision in order to speed up the computations. For algorithms whose performance is bound by the memory bandwidth, the idea of compressing its data before (and after) memory accesses has received considerable attention. One idea is to store an approximate operator–like a preconditioner–in lower than working precision hopefully without impacting the algorithm output. We realize the first high-performance implementation of an adaptive precision block-Jacobi preconditioner which selects the precision format used to store the preconditioner data on-the-fly, taking into account the numerical properties of the individual preconditioner blocks. We implement the adaptive block-Jacobi preconditioner as production-ready functionality in the Ginkgo linear algebra library, considering not only the precision formats that are part of the IEEE standard, but also customized formats which optimize the length of the exponent and significand to the characteristics of the preconditioner blocks. Experiments run on a state-of-the-art GPU accelerator show that our implementation offers attractive runtime savings.

Download Full-text

High-performance magneto-rheological clutches for direct-drive actuation: Design and development

Journal of Intelligent Material Systems and Structures ◽

10.1177/1045389x211006902 ◽

2021 ◽

pp. 1045389X2110069

Author(s):

Sergey Pisetskiy ◽

Mehrdad Kermani

Keyword(s):

High Performance ◽

Degrees Of Freedom ◽

Dynamic Range ◽

Complete Analysis ◽

Mass Ratio ◽

Direct Drive ◽

Element Analysis ◽

Prototype Development ◽

Magneto Rheological ◽

Hall Sensors

This paper presents an improved design, complete analysis, and prototype development of high torque-to-mass ratio Magneto-Rheological (MR) clutches. The proposed MR clutches are intended as the main actuation mechanism of a robotic manipulator with five degrees of freedom. Multiple steps to increase the toque-to-mass ratio of the clutch are evaluated and implemented in one design. First, we focus on the Hall sensors’ configuration. Our proposed MR clutches feature embedded Hall sensors for the indirect torque measurement. A new arrangement of the sensors with no effect on the magnetic reluctance of the clutch is presented. Second, we improve the magnetization of the MR clutch. We utilize a new hybrid design that features a combination of an electromagnetic coil and a permanent magnet for improved torque-to-mass ratio. Third, the gap size reduction in the hybrid MR clutch is introduced and the effect of such reduction on maximum torque and the dynamic range of MR clutch is investigated. Finally, the design for a pair of MR clutches with a shared magnetic core for antagonistic actuation of the robot joint is presented and experimentally validated. The details of each approach are discussed and the results of the finite element analysis are used to highlight the required engineering steps and to demonstrate the improvements achieved. Using the proposed design, several prototypes of the MR clutch with various torque capacities ranging from 15 to 200 N·m are developed, assembled, and tested. The experimental results demonstrate the performance of the proposed design and validate the accuracy of the analysis used for the development.

Download Full-text

Providing large-scale disk storage at CERN

EPJ Web of Conferences ◽

10.1051/epjconf/201921404033 ◽

2019 ◽

Vol 214 ◽

pp. 04033

Author(s):

Hervé Rousseau ◽

Belinda Chan Kwok Cheong ◽

Cristian Contescu ◽

Xavier Espinal Curull ◽

Jan Iven ◽

...

Keyword(s):

High Performance ◽

Large Scale ◽

Distributed Storage ◽

Storage System ◽

Primary Data ◽

Disk Storage ◽

The Road ◽

Ongoing Work ◽

Software Distribution ◽

Microsoft Office

The CERN IT Storage group operates multiple distributed storage systems and is responsible for the support of the infrastructure to accommodate all CERN storage requirements, from the physics data generated by LHC and non-LHC experiments to the personnel users' files. EOS is now the key component of the CERN Storage strategy. It allows to operate at high incoming throughput for experiment data-taking while running concurrent complex production work-loads. This high-performance distributed storage provides now more than 250PB of raw disks and it is the key component behind the success of CERNBox, the CERN cloud synchronisation service which allows syncing and sharing files on all major mobile and desktop platforms to provide offline availability to any data stored in the EOS infrastructure. CERNBox recorded an exponential growth in the last couple of year in terms of files and data stored thanks to its increasing popularity inside CERN users community and thanks to its integration with a multitude of other CERN services (Batch, SWAN, Microsoft Office). In parallel CASTOR is being simplified and transitioning from an HSM into an archival system, focusing mainly in the long-term data recording of the primary data from the detectors, preparing the road to the next-generation tape archival system, CTA. The storage services at CERN cover as well the needs of the rest of our community: Ceph as data back-end for the CERN OpenStack infrastructure, NFS services and S3 functionality; AFS for legacy home directory filesystem services and its ongoing phase-out and CVMFS for software distribution. In this paper we will summarise our experience in supporting all our distributed storage system and the ongoing work in evolving our infrastructure, testing very-dense storage building block (nodes with more than 1PB of raw space) for the challenges waiting ahead.

Download Full-text

Lane Detection Algorithm Based on Genetic Algorithm and its Parallel Computing Realization

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.479-481.65 ◽

2012 ◽

Vol 479-481 ◽

pp. 65-70

Author(s):

Xiao Hui Zhang ◽

Liu Qing ◽

Mu Li

Keyword(s):

Genetic Algorithm ◽

Data Transmission ◽

High Speed ◽

High Performance ◽

Large Data ◽

Detection Algorithm ◽

Lane Detection ◽

The Road ◽

Time Problem ◽

High Speed Data

Based on the target detection of alignment template, the paper designs a lane alignment template by using correlation matching method, and combines with genetic algorithm for template stochastic matching and optimization to realize the lane detection. In order to solve the real-time problem of lane detection algorithm based on genetic algorithm, this paper uses the high performance multi-core DSP chip TMS320C6474 as the core, combines with high-speed data transmission technology of Rapid10, realizes the hardware parallel processing of the lane detection algorithm. By Rapid10 bus, the data transmission speed between the DSP and the DSP can reach 3.125Gbps, it basically realizes transmission without delay, and thereby solves the high speed transmission of the large data quantity between processor. The experimental results show that, no matter the calculated lane line, or the running time is better than the single DSP and PC at the parallel C6474 platform. In addition, the road detection is accurate and reliable, and it has good robustness.

Download Full-text

Flood Inundation Prediction

Annual Review of Fluid Mechanics ◽

10.1146/annurev-fluid-030121-113138 ◽

2021 ◽

Vol 54 (1) ◽

Author(s):

Paul D. Bates

Keyword(s):

Fluid Mechanics ◽

Land Surface ◽

High Performance ◽

Numerical Algorithms ◽

Economic Damage ◽

Annual Review ◽

Publication Date ◽

Flood Inundation ◽

Damage Mapping ◽

Performance Computing

Every year flood events lead to thousands of casualties and significant economic damage. Mapping the areas at risk of flooding is critical to reducing these losses, yet until the last few years such information was available for only a handful of well-studied locations. This review surveys recent progress to address this fundamental issue through a novel combination of appropriate physics, efficient numerical algorithms, high-performance computing, new sources of big data, and model automation frameworks. The review describes the fluid mechanics of inundation and the models used to predict it, before going on to consider the developments that have led in the last five years to the creation of the first true fluid mechanics models of flooding over the entire terrestrial land surface. Expected final online publication date for the Annual Review of Fluid Mechanics, Volume 54 is January 2022. Please see http://www.annualreviews.org/page/journal/pubdates for revised estimates.

Download Full-text

Guest Editorial: High-Level Parallel Programming and the Road to High Performance

International Journal of Parallel Programming ◽

10.1007/s10766-018-0606-6 ◽

2018 ◽

Vol 47 (2) ◽

pp. 161-163

Author(s):

J. Daniel García ◽

Arturo Gonzalez-Escribano

Keyword(s):

Parallel Programming ◽

High Performance ◽

Guest Editorial ◽

The Road ◽

High Level

Download Full-text

A SURVEY OF CHECKPOINT/RESTART TECHNIQUES ON DISTRIBUTED MEMORY SYSTEMS

Parallel Processing Letters ◽

10.1142/s0129626413400112 ◽

2013 ◽

Vol 23 (04) ◽

pp. 1340011 ◽

Cited By ~ 7

Author(s):

FAISAL SHAHZAD ◽

MARKUS WITTMANN ◽

MORITZ KREUTZER ◽

THOMAS ZEISER ◽

GEORG HAGER ◽

...

Keyword(s):

High Performance ◽

Building Blocks ◽

Memory Systems ◽

Time To Failure ◽

Flow Solver ◽

The Road ◽

System A ◽

Node Level ◽

Mean Time ◽

Performance Computing

The road to exascale computing poses many challenges for the High Performance Computing (HPC) community. Each step on the exascale path is mainly the result of a higher level of parallelism of the basic building blocks (i.e., CPUs, memory units, networking components, etc.). The reliability of each of these basic components does not increase at the same rate as the rate of hardware parallelism. This results in a reduction of the mean time to failure (MTTF) of the whole system. A fault tolerance environment is thus indispensable to run large applications on such clusters. Checkpoint/Restart (C/R) is the classic and most popular method to minimize failure damage. Its ease of implementation makes it useful, but typically it introduces significant overhead to the application. Several efforts have been made to reduce the C/R overhead. In this paper we compare various C/R techniques for their overheads by implementing them on two different categories of applications. These approaches are based on parallel-file-system (PFS)-level checkpoints (synchronous/asynchronous) and node-level checkpoints. We utilize the Scalable Checkpoint/Restart (SCR) library for the comparison of node-level checkpoints. For asynchronous PFS-level checkpoints, we use the Damaris library, the SCR asynchronous feature, and application-based checkpointing via dedicated threads. Our baseline for overhead comparison is the naïve application-based synchronous PFS-level checkpointing method. A 3D lattice-Boltzmann (LBM) flow solver and a Lanczos eigenvalue solver are used as prototypical applications in which all the techniques considered here may be applied.

Download Full-text

On the Road to DiPOSH: Adventures in High-Performance OpenSHMEM

Parallel Processing and Applied Mathematics - Lecture Notes in Computer Science ◽

10.1007/978-3-030-43229-4_22 ◽

2020 ◽

pp. 250-260

Author(s):

Camille Coti ◽

Allen D. Malony

Keyword(s):

High Performance ◽

The Road ◽

On The Road

Download Full-text

VIRTUAL LABORATORIES AS STRATEGY FOR TEACHING IMPROVEMENT IN MATH SCIENCES AND ENGINEERING IN BOLIVIA

International Journal of Engineering Education ◽

10.14710/ijee.2.1.52-62 ◽

2020 ◽

Vol 2 (1) ◽

pp. 52-62

Author(s):

Francisco Vargas

Keyword(s):

Differential Equations ◽

Matrix Algebra ◽

Degrees Of Freedom ◽

Inverted Pendulum ◽

Numerical Algorithms ◽

Three Dimensions ◽

Technological Advancement ◽

Teaching Improvement ◽

Two Degrees Of Freedom ◽

Virtual Laboratories

The vertiginous technological advancement has made necessary the use of computersoftware that contributes to the improvement of teaching in math sciences and engineering.It is in this context that the last five years the strategy presented in this article has been disseminatedin the main universities of Bolivia, a country where the schools have not yet been ableto offer basic disciplines such as calculus, matrix algebra, physics and/or differential equationsto solve problems considering applicative aspects. To establish this connection, it is necessaryto deduce differential equations associated with practical problems, solve these equationswith different numerical algorithms, and establish the concept of simulation to later introducelanguages like Python/VPython free of license to elaborate Virtual Laboratories that allow obtainingthe solutions in two and three dimensions. The classical problems addressed for thispurpose are the satellite of two degrees of freedom and the inverted pendulum.

Download Full-text