A massively parallel semi-Lagrangian solver for the six-dimensional Vlasov–Poisson equation

Author(s): Katharina Kormann, Klaus Reuter, Markus Rampp

This article presents an optimized and scalable semi-Lagrangian solver for the Vlasov–Poisson system in six-dimensional phase space. Grid-based solvers of the Vlasov equation are known to give accurate results, but they are challenged by the curse of dimensionality, which leads to very high memory requirements and demands highly efficient parallelization schemes. In this article, we consider the 6-D Vlasov–Poisson problem discretized by a split-step semi-Lagrangian scheme, using successive 1-D interpolations on 1-D stripes of the 6-D domain. Two parallelization paradigms are compared: a remapping scheme and a domain decomposition approach applied to the full 6-D problem. Numerical experiments show the latter approach to be superior in the massively parallel case in various respects. We address the challenge of artificial time step restrictions due to the decomposition of the domain by introducing a blocked one-sided communication scheme for the purely electrostatic case and a rotating mesh for the case with a constant magnetic field. In addition, we propose a pipelining scheme that hides the cost of halo communication between neighboring processes behind useful computation. Parallel scalability on up to 65,536 processes is demonstrated for benchmark problems on a supercomputer.
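To make the split-step idea concrete, the following is a minimal Python sketch (not the authors' code) of one 1-D semi-Lagrangian substep of the kind applied along each coordinate stripe; the constant advection speed and the linear interpolation are simplifying assumptions, as the actual solver uses higher-order interpolation.

```python
# Minimal sketch of one 1-D semi-Lagrangian substep on a periodic grid.
# Assumes a constant advection speed and linear interpolation for brevity.
import numpy as np

def advect_1d_periodic(f, speed, dt, dx):
    """Trace characteristics back by speed*dt and interpolate f there,
    treating the 1-D grid as periodic."""
    n = f.size
    x = np.arange(n) * dx
    x_dep = (x - speed * dt) % (n * dx)   # feet of the characteristics
    xp = np.append(x, n * dx)             # close the periodic interval
    fp = np.append(f, f[0])               # wrap the first value around
    return np.interp(x_dep, xp, fp)

# One substep on a toy Gaussian profile
f = np.exp(-0.5 * ((np.arange(64) * 0.1 - 3.2) / 0.4) ** 2)
f_new = advect_1d_periodic(f, speed=1.0, dt=0.05, dx=0.1)
```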

2005, Vol 22 (7), pp. 886-895
Author(s): F. Ardhuin, T. H. C. Herbers

Abstract A new semi-Lagrangian advection scheme called multistep ray advection is proposed for solving the spectral energy balance equation of ocean surface gravity waves. Existing so-called piecewise ray methods advect wave energy over a single time step using “pieces” of ray trajectories, after which the spectrum is updated with source terms representing various physical processes. The generalized scheme presented here allows for an arbitrary number N of advection time steps along the same rays, thus reducing numerical diffusion while still including source-term variations at every time step. Tests are performed for alongshore-uniform bottom topography, and two types of discretization of the wave spectrum are investigated: a finite-bandwidth representation and a single frequency and direction per spectral band. In the limit of large N, both the accuracy and the computational cost of the method increase, approaching a nondiffusive, fully Lagrangian scheme. Even for N = 1, test results for the semi-Lagrangian scheme show less numerical diffusion than predictions of the commonly used first-order upwind finite-difference scheme. Application to the refraction and shoaling of narrow swell spectra across a continental shelf illustrates the importance of controlling numerical diffusion. Numerical errors in a single-step (Δt = 600 s) scheme implemented on the North Carolina continental shelf (typical swell propagation time across the shelf is about 3 h) are shown to be comparable to the angular diffusion predicted by wave–bottom Bragg scattering theory, in particular for narrow directional spectra, suggesting that the true directional spread of swell may not always be resolved in existing wave prediction models because of excessive numerical diffusion. This diffusion is effectively suppressed in the cases presented here by a four-step semi-Lagrangian scheme using the same value of Δt.
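An illustrative sketch of the N-substep idea, assuming a precomputed ray and a generic source term; the names and the toy dissipation source are stand-ins, not the authors' implementation.

```python
# Energy is advected along one precomputed ray while the source term is
# applied at every substep, as in the generalized N-step scheme.
def multistep_ray_advection(E0, ray_points, source_term, dt, n_steps):
    """Advance spectral energy along `ray_points`, one position per substep,
    with an explicit source-term update at each of the n_steps substeps."""
    E = E0
    for k in range(n_steps):
        E = E + dt * source_term(E, ray_points[k])
    return E

# Toy example: weak linear dissipation over a 600 s step split into N = 4
E = multistep_ray_advection(
    E0=1.0,
    ray_points=[k * 0.1 for k in range(4)],
    source_term=lambda E, x: -1e-3 * E,
    dt=600.0 / 4,
    n_steps=4,
)
```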


Water, 2018, Vol 10 (11), pp. 1652
Author(s): Dong-Sin Shih, Gour-Tsyh Yeh

One-dimensional (1D) Saint-Venant equations, which originate from the Navier–Stokes equations, are usually applied to describe transient stream flow. The governing equations are based on mass continuity and momentum balance. The momentum equation, comprising inertia, pressure, gravity, and friction-induced momentum loss terms, can be reduced to kinematic wave (KIW), diffusion wave (DIW), and fully dynamic wave (DYW) approximations. In this study, the method of characteristics (MOC) is used to solve the diagonalized Saint-Venant equations. A computer model, CAMP1DF, including KIW, DIW, and DYW approximations, is developed. Benchmark problems from MacDonald et al. (1997) are examined to study the accuracy of the CAMP1DF model. The simulations reveal that CAMP1DF produces nearly identical results that are valid for various fluvial conditions. The proposed scheme not only allows a large time step size but also requires solving only half of the simultaneous algebraic equations, improving both accuracy and efficiency. Based on physical relevance, the simulations clearly show that the DYW approximation performs best, whereas the KIW approximation yields the largest errors. Moreover, the field non-prismatic case of the Zhuoshui River in central Taiwan is studied. The simulations indicate that the DYW approach does not necessarily achieve a better result than the other two approximations; the investigated cross-sectional geometries play an important role in stream routing. Because it includes the acceleration terms, the simulated hydrograph of the DYW reveals more physical characteristics, particularly in the rising and recession limbs. Note that the KIW does not require a downstream boundary condition, making it more convenient for field application.
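The kinematic-wave limit lends itself to a compact method-of-characteristics sketch: discharge is advected along dx/dt = c_k and only an upstream boundary value is required, consistent with the note above that KIW needs no downstream condition. The Python sketch below uses assumed names and linear interpolation; it is not the CAMP1DF code.

```python
# One MOC update for the kinematic-wave (KIW) limit: trace each node's
# characteristic back by c_k*dt, interpolate the old discharge there, and
# impose the upstream inflow.
import numpy as np

def kiw_moc_step(Q, c_k, dt, dx, Q_upstream):
    x = np.arange(Q.size) * dx
    x_dep = np.clip(x - c_k * dt, 0.0, x[-1])   # characteristic feet
    Q_new = np.interp(x_dep, x, Q)
    Q_new[0] = Q_upstream                        # upstream BC only
    return Q_new

# One routing step on a uniform initial discharge
Q = kiw_moc_step(np.ones(100), c_k=2.0, dt=30.0, dx=100.0, Q_upstream=1.5)
```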


Author(s): Pravin Jagtap, Rupesh Nasre, V. S. Sanapala, B. S. V. Patnaik

Smoothed Particle Hydrodynamics (SPH) is fast emerging as a practically useful computational simulation tool for a wide variety of engineering problems. SPH is also gaining popularity as the backbone for fast and realistic animations in graphics and video games. The Lagrangian and mesh-free nature of the method facilitates fast and accurate simulation of material deformation, interface capture, etc. Particle-based methods require efficient particle search-and-locate algorithms, as the continuous creation of neighbor particle lists is a computationally expensive step. Hence, it is advantageous to implement SPH on modern multi-core platforms with the help of High-Performance Computing (HPC) tools. In this work, the computational performance of an SPH algorithm is assessed on a multi-core Central Processing Unit (CPU) as well as on massively parallel General Purpose Graphics Processing Units (GP-GPUs). Parallelizing SPH faces several challenges, such as scalability of the neighbor search process, force calculations, minimizing thread divergence, achieving coalesced memory access patterns, balancing workload, and ensuring optimum use of computational resources. While addressing some of these challenges, performance metrics such as speedup, global load efficiency, global store efficiency, warp execution efficiency, and occupancy are analyzed in detail. The OpenMP and Compute Unified Device Architecture (CUDA) parallel programming models have been used for parallel computing on an Intel Xeon E5 multi-core CPU and on NVIDIA Quadro M-series and NVIDIA Tesla P-series massively parallel GPU architectures. Standard benchmark problems from the Computational Fluid Dynamics (CFD) literature are chosen for validation. The key concern of identifying a suitable architecture for mesh-less methods, which essentially require a heavy workload of neighbor search and evaluation of local force fields from neighbor interactions, is addressed.
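A serial Python illustration of the cell-list data structure that keeps the neighbor search O(N); all names are illustrative, and the paper's OpenMP/CUDA kernels are not reproduced here.

```python
# Bin particle indices into cubic cells of side h (the smoothing length),
# so neighbor candidates come only from the 27 adjacent cells.
from collections import defaultdict
import numpy as np

def build_cell_list(positions, h):
    cells = defaultdict(list)
    for i, p in enumerate(positions):
        cells[tuple(np.floor(p / h).astype(int))].append(i)
    return cells

def neighbors(i, positions, cells, h):
    """Return indices of particles within distance h of particle i."""
    cx, cy, cz = np.floor(positions[i] / h).astype(int)
    found = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for dz in (-1, 0, 1):
                for j in cells.get((cx + dx, cy + dy, cz + dz), ()):
                    if j != i and np.linalg.norm(positions[j] - positions[i]) < h:
                        found.append(j)
    return found
```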


2016, Vol 19 (1), pp. 205-225
Author(s): Jean-Noel G. Leboeuf, Viktor K. Decyk, David E. Newman, Raul Sanchez

Abstract The massively parallel, nonlinear, three-dimensional (3D), toroidal, electrostatic, gyrokinetic, particle-in-cell (PIC), Cartesian geometry UCAN code, with particle ions and adiabatic electrons, has been successfully exercised to identify non-diffusive transport characteristics in present-day tokamak discharges. The limitation in applying UCAN to larger scale discharges is the 1D domain decomposition in the toroidal (or z-) direction for the massively parallel implementation using MPI, which has restricted the calculations to a few hundred ion Larmor radii (gyroradii) per plasma minor radius. To exceed these sizes, we have implemented 2D domain decomposition in UCAN by adding the y-direction to the processor mix. This has been facilitated by the use of relevant components in the P2LIB library of field and particle management routines developed for UCLA's UPIC Framework of conventional PIC codes. The gyro-averaging specific to gyrokinetic codes is simplified by the use of replicated arrays for efficient charge accumulation and force deposition. The 2D domain-decomposed UCAN2 code reproduces the original 1D domain nonlinear results to within round-off. Benchmarks of UCAN2 on the Cray XC30 Edison at NERSC demonstrate ideal scaling when the problem size is increased along with the processor number up to the largest power of 2 available, namely 131,072 processors. These weak-scaling particle benchmarks also indicate that the barriers of 1 nanosecond per particle per time step and 1 TFlops are easily broken by UCAN2 with 1 billion or more particles on 2000 or more processors.
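The 2-D decomposition can be illustrated with a small bookkeeping sketch, assuming a simple row-major rank layout; the helper names are hypothetical, and the actual P2LIB particle-management routines are not shown.

```python
# Hypothetical rank-to-tile bookkeeping for a 2-D y-z domain decomposition
# of the kind added in UCAN2.
def rank_to_tile(rank, n_y):
    """Map a linear MPI rank to its (iy, iz) subdomain coordinates."""
    return rank % n_y, rank // n_y

def owner_rank(y, z, Ly, Lz, n_y, n_z):
    """Rank owning the particle at (y, z) in a periodic Ly-by-Lz box;
    particles crossing a tile boundary are shipped to this rank."""
    iy = int((y % Ly) / Ly * n_y)
    iz = int((z % Lz) / Lz * n_z)
    return iz * n_y + iy

# Example: 8 x 16 tiles; locate rank 42's tile and the owner of a point
iy, iz = rank_to_tile(42, n_y=8)
dest = owner_rank(3.7, 9.2, Ly=12.8, Lz=25.6, n_y=8, n_z=16)
```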


Author(s): Stephen R. Codyer, Mehdi Raessi, Gaurav Khanna

We present a GPU-accelerated numerical solver for incompressible, immiscible, two-phase fluid flows. The acceleration yields a significant simulation speed-up and thus the capability to use finer grids and/or more stringent convergence criteria. We solve the Navier–Stokes equations, including the surface tension force, using a two-step projection method that requires the iterative solution of a pressure Poisson problem at each time step. However, running a serial linear algebra solver on a CPU to solve the pressure Poisson problem can take 50–99.9% of the total simulation time. To remove this bottleneck, we exploit the massive parallelism of GPUs by developing a double-precision parallel linear algebra solver, SCGPU, using NVIDIA’s CUDA v4.0 libraries. The performance of SCGPU in serial simulations is presented, in addition to an evaluation of two pre-packaged GPU linear algebra solvers, CUSP and CULA-sparse. We also present preliminary results of a GPU-accelerated MPI CPU flow solver.
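A generic matrix-free conjugate-gradient sketch shows the structure of the pressure Poisson solve that such a GPU solver parallelizes; the abstract does not name SCGPU's iteration, so CG here is an assumption, and `apply_A` is a stand-in for the discrete Laplacian.

```python
# Textbook conjugate gradient for A x = b with symmetric positive-definite
# A supplied as a function (matrix-free).
import numpy as np

def conjugate_gradient(apply_A, b, tol=1e-8, max_iter=1000):
    x = np.zeros_like(b)
    r = b - apply_A(x)
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Ap = apply_A(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# 1-D Dirichlet Laplacian as a stand-in operator
def apply_A(u):
    Au = 2.0 * u
    Au[:-1] -= u[1:]
    Au[1:] -= u[:-1]
    return Au

x = conjugate_gradient(apply_A, np.ones(64))
```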


2016, Vol 2016, pp. 1-15
Author(s): Shan Zhong, Quan Liu, QiMing Fu

To improve the convergence rate and the sample efficiency, two efficient learning methods, AC-HMLP and RAC-HMLP (AC-HMLP with l2-regularization), are proposed by combining an actor-critic algorithm with hierarchical model learning and planning. The hierarchical models, consisting of a local and a global model, are learned simultaneously with the value function and the policy and are approximated by local linear regression (LLR) and linear function approximation (LFA), respectively. Both models are used to generate samples for planning: the local model is used at each time step only if the state-prediction error does not exceed a threshold, while the global model is used at the end of each episode. The purpose of using both models is to improve sample efficiency and accelerate the convergence rate of the whole algorithm by fully exploiting local and global information. Experimentally, AC-HMLP and RAC-HMLP are compared with three representative algorithms on two Reinforcement Learning (RL) benchmark problems. The results demonstrate that they perform best in terms of convergence rate and sample efficiency.
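A schematic sketch of the sample-generation rule described above, with stand-in model interfaces (all function names assumed): local-model samples are kept only while the one-step prediction error stays below the threshold, and the global model contributes a batch at episode end.

```python
# Planning-sample generation with a local and a global model.
def generate_planning_samples(local_predict, global_sample, transitions,
                              error_threshold, end_of_episode):
    samples = []
    for s, a, s_next in transitions:
        pred = local_predict(s, a)
        err = sum((p - q) ** 2 for p, q in zip(pred, s_next)) ** 0.5
        if err <= error_threshold:      # trust the local (LLR) model
            samples.append((s, a, pred))
    if end_of_episode:                  # global (LFA) model once per episode
        samples.extend(global_sample())
    return samples

# Toy usage with stand-in models
samples = generate_planning_samples(
    local_predict=lambda s, a: [si + a for si in s],
    global_sample=lambda: [((0.0,), 0.0, (0.0,))],
    transitions=[((0.0,), 1.0, (1.1,))],
    error_threshold=0.2,
    end_of_episode=True,
)
```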


1997, Vol 08 (05), pp. 1131-1140
Author(s): J. Stadler, R. Mikulla, H.-R. Trebin

We report on the implementation and performance of the program IMD, designed for short-range molecular dynamics simulations on massively parallel computers. After a short explanation of the cell-based algorithm, its extension to parallel computers and two variants of the communication scheme are discussed. We provide performance numbers for simulations of different sizes and compare them with values found in the literature. Finally, we describe two applications: a very large scale simulation with more than 1.23×10⁹ atoms, to our knowledge the largest MD simulation published to date, and a simulation of a crack propagating in a two-dimensional quasicrystal.
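A serial 2-D Python sketch of the cell-based short-range idea in reduced Lennard-Jones units (illustrative only; it assumes at least three cells per box side, and IMD's parallel communication schemes are not reproduced).

```python
# With a cutoff R_CUT and cells of side >= R_CUT, each particle interacts
# only with its own and adjacent cells, giving O(N) force evaluation.
import numpy as np

R_CUT = 2.5  # interaction cutoff (reduced units)

def lj_force(rij):
    """Truncated Lennard-Jones force on particle i from particle j."""
    r2 = rij @ rij
    if r2 == 0.0 or r2 > R_CUT ** 2:
        return np.zeros(2)
    inv6 = r2 ** -3
    return 24.0 * (2.0 * inv6 ** 2 - inv6) / r2 * rij

def cell_forces(pos, box):
    n_c = int(box / R_CUT)           # assumes n_c >= 3
    cells = {}
    for i, p in enumerate(pos):
        key = (int(p[0] / box * n_c) % n_c, int(p[1] / box * n_c) % n_c)
        cells.setdefault(key, []).append(i)
    forces = np.zeros_like(pos)
    for (cx, cy), members in cells.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in cells.get(((cx + dx) % n_c, (cy + dy) % n_c), ()):
                    for i in members:
                        if i == j:
                            continue
                        rij = pos[i] - pos[j]
                        rij -= box * np.round(rij / box)  # minimum image
                        forces[i] += lj_force(rij)
    return forces
```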


Author(s): Daniel S Abdi, Francis X Giraldo, Emil M Constantinescu, Lester E Carr, Lucas C Wilcox, ...

We present the acceleration of an IMplicit–EXplicit (IMEX) nonhydrostatic atmospheric model on manycore processors such as graphics processing units (GPUs) and Intel’s Many Integrated Core (MIC) architecture. IMEX time integration methods sidestep the constraint imposed on explicit methods by the Courant–Friedrichs–Lewy condition through corrective implicit solves within each time step. In this work, we implement and evaluate the performance of IMEX on manycore processors relative to explicit methods. Using 3D-IMEX at Courant number C = 15, we obtained a speedup of about 4× relative to an explicit time-stepping method run at the maximum allowable C = 1. Moreover, the unconditional stability of IMEX with respect to the fast waves means the speedup can increase significantly with the Courant number as long as the accuracy of the resulting solution is acceptable. We show a speedup of 100× at C = 150 using 1D-IMEX to demonstrate this point. Several improvements to the IMEX procedure were necessary in order to outperform our results with explicit methods: (a) reducing the number of degrees of freedom of the IMEX formulation by forming the Schur complement; (b) formulating a horizontally explicit, vertically implicit 1D-IMEX scheme that has a lower workload and better scalability than 3D-IMEX; (c) using high-order polynomial preconditioners to reduce the condition number of the resulting system; and (d) using a direct solver for the 1D-IMEX method by performing and storing LU factorizations once, to obtain a constant cost for any Courant number. Without all of these improvements, explicit time integration methods turned out to be difficult to beat. We discuss in detail the IMEX infrastructure required for formulating and implementing efficient methods on manycore processors. Several parametric studies are conducted to demonstrate the gain from each of the abovementioned improvements. Finally, we validate our results with standard benchmark problems in numerical weather prediction and evaluate the performance and scalability of the IMEX method using up to 4192 GPUs and 16 Knights Landing processors.
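Improvement (d) can be sketched in a few lines: because the 1-D implicit operator is fixed for a given Courant number, its LU factorization is computed once and reused every step. The tridiagonal stand-in operator below is illustrative, not the model's actual 1D-IMEX system.

```python
# Factor once, then each implicit solve is a cheap forward/back substitution.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

n, dt, dz = 32, 0.5, 1.0
L = (np.diag(-2.0 * np.ones(n)) + np.diag(np.ones(n - 1), 1)
     + np.diag(np.ones(n - 1), -1)) / dz ** 2    # 1-D Laplacian stand-in
A = np.eye(n) - dt * L                           # implicit operator I - dt*L

lu, piv = lu_factor(A)        # LU factorization, outside the time loop
state = np.random.rand(n)
for step in range(100):       # constant cost per step, any Courant number
    state = lu_solve((lu, piv), state)
```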


Author(s): Stefan Hante, Denise Tumiotto, Martin Arnold

Abstract In this paper, we consider a geometrically exact Cosserat beam model that takes industrial challenges into account. The beam is represented by a framed curve, which we parametrize in the configuration space $\mathbb{S}^{3}\ltimes \mathbb{R}^{3}$ with its semi-direct product Lie group structure, where $\mathbb{S}^{3}$ is the set of unit quaternions. Velocities and angular velocities with respect to the body-fixed frame are given as the velocity vector of the configuration. We introduce internal constraints, requiring the rigid cross sections to remain perpendicular to the center line, to reduce the full Cosserat beam model to a Kirchhoff beam model. We derive the equations of motion by Hamilton’s principle with an augmented Lagrangian. In order to fully discretize the beam model in space and time, we consider only piecewise interpolated configurations in the variational principle. After approximating the action integral to second order, this leads to the discrete equations of motion. Notably, we allow the Lagrange multipliers to be discontinuous in time in order to respect the derivatives of the constraint equations, also known as hidden constraints. In the last part, we test our numerical scheme on two benchmark problems, which show that no shear locking is observable in the discretized beam model and that the errors decrease with second order in both the spatial step size and the time step size.
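A minimal sketch of the configuration-space kinematics on $\mathbb{S}^{3}\ltimes \mathbb{R}^{3}$: advancing the orientation with the quaternion exponential map keeps it exactly on the unit sphere. This illustrates the parametrization only, not the authors' variational integrator.

```python
# Kinematic update on S^3 x R^3 (semi-direct product) with unit quaternions.
import numpy as np

def quat_exp(v):
    """Exponential map from R^3 (half the rotation vector) to unit quaternions."""
    theta = np.linalg.norm(v)
    if theta < 1e-12:
        return np.array([1.0, 0.0, 0.0, 0.0])
    return np.concatenate([[np.cos(theta)], np.sin(theta) * v / theta])

def quat_mul(a, b):
    """Hamilton product of quaternions (w, x, y, z)."""
    w1, x1, y1, z1 = a
    w2, x2, y2, z2 = b
    return np.array([w1*w2 - x1*x2 - y1*y2 - z1*z2,
                     w1*x2 + x1*w2 + y1*z2 - z1*y2,
                     w1*y2 - x1*z2 + y1*w2 + z1*x2,
                     w1*z2 + x1*y2 - y1*x2 + z1*w2])

def rotate(q, v):
    """Rotate vector v from the body-fixed to the inertial frame."""
    qv = np.concatenate([[0.0], v])
    q_conj = q * np.array([1.0, -1.0, -1.0, -1.0])
    return quat_mul(quat_mul(q, qv), q_conj)[1:]

def step_configuration(q, x, omega_body, v_body, dt):
    """Advance (q, x) given body-frame angular velocity and velocity."""
    q_new = quat_mul(q, quat_exp(0.5 * dt * np.asarray(omega_body)))
    x_new = x + dt * rotate(q, np.asarray(v_body))
    return q_new / np.linalg.norm(q_new), x_new

q1, x1 = step_configuration(np.array([1.0, 0.0, 0.0, 0.0]), np.zeros(3),
                            omega_body=[0.0, 0.0, 1.0],
                            v_body=[1.0, 0.0, 0.0], dt=0.01)
```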

