A Massively Parallel Restriction-Smoothed Basis Multiscale Solver on Multi-Core and GPU Architectures

2021
Author(s):  
Abdulrahman Manea

Abstract: Owing to its simplicity, adaptability, and applicability to various grid types, the restriction-smoothed basis multiscale method (MsRSB) (Møyner and Lie 2016) has received wide attention and has been extended to various flow problems in porous media. Unlike standard multiscale methods, MsRSB relies on iterative smoothing to construct the multiscale basis functions adaptively, giving it the ability to adjust naturally to the complex grid orientations often encountered in real-life industrial applications. In this work, we investigate the scalability of MsRSB on various state-of-the-art parallel architectures, including multi-core systems and GPUs. While MsRSB is, like most other multiscale methods, directly amenable to parallelization, its dependence on a smoother to construct the basis functions creates unique control- and data-flow patterns that require careful design and implementation in parallel environments to achieve good scalability. We extend the work on parallel multiscale methods in Manea et al. (2016) and Manea and Almani (2019) to map the special kernels of MsRSB to shared-memory multi-core and GPU architectures. The scalability of our optimized parallel MsRSB implementation is demonstrated using highly heterogeneous 3D problems derived from the SPE10 benchmark (Christie and Blunt 2001), ranging in size from millions to tens of millions of cells. The multi-core implementation is benchmarked on a shared-memory architecture consisting of two packages of Intel's Cascade Lake Xeon® Gold 6246 CPU, while the GPU implementation is benchmarked on a massively parallel architecture consisting of NVIDIA Volta V100 GPUs. We compare the multi-core and GPU implementations for both the setup and solution stages. To the best of our knowledge, this is the first parallel implementation and demonstration of the versatile MsRSB method on the GPU architecture.
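The heart of the MsRSB setup stage is this iterative smoothing: a prolongation operator is initialized as the indicator functions of the coarse partition and relaxed with damped-Jacobi sweeps, with each update restricted to a prescribed support region and the rows renormalized to preserve the partition of unity. The following NumPy sketch is a minimal serial illustration under assumed inputs; the `A`, `partition`, and `support` arguments are illustrative, not the paper's data structures:

```python
# A minimal sketch of MsRSB-style basis construction via damped-Jacobi
# smoothing. Inputs are assumptions for illustration, not the paper's code.
import numpy as np

def msrsb_basis(A, partition, support, omega=2.0/3.0, tol=1e-3, max_iter=200):
    """A: (n, n) fine-scale system matrix (e.g., a pressure matrix).
    partition[i]: index of the coarse block owning fine cell i.
    support: (n, m) boolean mask limiting each basis function's footprint."""
    n = A.shape[0]
    m = partition.max() + 1
    # Start from the characteristic functions of the coarse blocks.
    P = np.zeros((n, m))
    P[np.arange(n), partition] = 1.0
    Dinv = 1.0 / A.diagonal()
    for _ in range(max_iter):
        # Damped-Jacobi smoothing of all basis functions at once.
        update = omega * (Dinv[:, None] * (A @ P))
        update[~support] = 0.0              # keep each basis locally supported
        P -= update
        P /= P.sum(axis=1, keepdims=True)   # restore the partition of unity
        if np.abs(update).max() < tol:
            break
    return P
```

The restriction of updates to irregular support regions is precisely the kind of control- and data-flow pattern that must be mapped carefully to multi-core and GPU hardware.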

SPE Journal
2021
pp. 1-20
Author(s):  
A. M. Manea
T. Almani

Summary: In this work, the scalability of two key multiscale solvers for the pressure equation arising from incompressible flow in heterogeneous porous media, namely the multiscale finite volume (MSFV) solver and the restriction-smoothed basis multiscale (MsRSB) solver, is investigated on the massively parallel graphics processing unit (GPU) architecture. The robustness and scalability of both solvers are compared against their carefully optimized counterparts on the shared-memory multicore architecture in a structured problem setting. Although several components of the MSFV and MsRSB algorithms are directly parallelizable, their scalability on the GPU architecture depends heavily on the underlying algorithmic details and data-structure design of every step, where one needs to ensure favorable control and data flow on the GPU while extracting enough parallel work for a massively parallel environment. In addition, the type of algorithm chosen for each step greatly influences the overall robustness of the solver. We therefore extend the work on parallel multiscale methods of Manea et al. (2016) to map the special kernels of MSFV and MsRSB to the massively parallel GPU architecture. The scalability of our optimized parallel MSFV and MsRSB GPU implementations is demonstrated using highly heterogeneous structured 3D problems derived from the SPE10 benchmark (Christie and Blunt 2001), ranging in size from millions to tens of millions of cells. For both solvers, the multicore implementations are benchmarked on a shared-memory architecture consisting of two packages of the Intel® Cascade Lake Xeon Gold 6246 central processing unit (CPU), whereas the GPU implementations are benchmarked on a massively parallel architecture consisting of NVIDIA Volta V100 GPUs. We compare the multicore implementations to the GPU implementations for both the setup and solution stages. Finally, we compare the parallel scalability of MsRSB to that of MSFV on the multicore (Manea et al. 2016) and GPU architectures. To the best of our knowledge, this is the first parallel implementation and demonstration of these versatile multiscale solvers on the GPU architecture. NOTE: This paper is published as part of the 2021 SPE Reservoir Simulation Conference Special Issue.
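For context, once the setup stage has produced a prolongation operator, the solution stage applies a coarse-grid correction inside an iterative loop. The sketch below is a generic two-level scheme, assumed purely for illustration: it uses a Galerkin restriction R = Pᵀ, whereas MSFV classically uses a finite-volume restriction, and a production code would factorize the coarse operator once during setup rather than solving it anew each iteration:

```python
# A generic two-level multiscale solve, assumed for illustration only
# (not the authors' implementation). P comes from an MSFV or MsRSB setup.
import numpy as np

def multiscale_solve(A, b, P, omega=0.67, tol=1e-8, max_iter=100):
    R = P.T                        # Galerkin restriction (an assumption)
    Ac = R @ A @ P                 # coarse-scale operator
    Dinv = 1.0 / A.diagonal()
    x = np.zeros(len(b))
    for _ in range(max_iter):
        r = b - A @ x
        if np.linalg.norm(r) < tol * np.linalg.norm(b):
            break
        x += P @ np.linalg.solve(Ac, R @ r)   # coarse-grid correction
        x += omega * Dinv * (b - A @ x)       # damped-Jacobi smoothing
    return x
```

Separating the two stages in this way is why the paper benchmarks setup and solution independently: they stress the hardware very differently.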


Author(s):  
Enrico Calore
Alessandro Gabbana
Sebastiano Fabio Schifano
Raffaele Tripiccione

GPUs deliver higher performance and remarkable energy efficiency compared with traditional processors, and are quickly becoming very popular for HPC applications. Still, writing efficient and scalable programs for GPUs is not an easy task, as codes must adapt to increasingly parallel architectural features. In this chapter, the authors describe in full detail design and implementation strategies for lattice Boltzmann (LB) codes able to meet these goals. Most of the discussion uses a state-of-the-art thermal lattice Boltzmann method in 2D, but the lessons learned in this particular case extend immediately to most LB codes and other scientific applications. The authors describe the structure of the code, discussing in detail several key design choices guided by theoretical performance models and experimental benchmarks, with both single-GPU codes and massively parallel implementations on commodity clusters of GPUs in mind. The authors then present and analyze performance on several recent GPU architectures, including data on energy optimization.
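As a concrete point of reference, the computational core of every LB code is a collide-and-stream update over discrete velocity populations. The sketch below shows a plain (athermal) D2Q9 BGK step in NumPy, assumed for illustration since the chapter's method is a thermal 2D model; populations are stored as a structure-of-arrays, one contiguous plane each, the layout that favors coalesced memory access on GPUs:

```python
# An illustrative D2Q9 BGK collide-and-stream step in NumPy (not the
# chapter's thermal code). f is stored structure-of-arrays: f[q] is one
# contiguous (ny, nx) plane, which maps to coalesced GPU loads.
import numpy as np

c = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])   # discrete velocities
w = np.array([4/9] + [1/9]*4 + [1/36]*4)             # quadrature weights

def lb_step(f, tau=0.6):
    """One BGK relaxation + periodic streaming step; f has shape (9, ny, nx)."""
    rho = f.sum(axis=0)                                # macroscopic density
    u = np.einsum('qi,qyx->iyx', c, f) / rho           # macroscopic velocity
    cu = np.einsum('qi,iyx->qyx', c, u)
    usq = (u ** 2).sum(axis=0)
    feq = w[:, None, None] * rho * (1 + 3*cu + 4.5*cu**2 - 1.5*usq)
    f = f + (feq - f) / tau                            # BGK collision
    for q in range(9):                                 # periodic streaming
        f[q] = np.roll(f[q], shift=(c[q, 1], c[q, 0]), axis=(0, 1))
    return f
```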


Author(s):  
Vladimir V. Shashkin ◽  
Mikhail A. Tolstykh

Abstract: Modern atmospheric models for climate simulations require accurate, efficient, locally mass-conservative, and monotonic numerical schemes for treating the transport of atmospheric constituents. One way to design such schemes is the finite-volume semi-Lagrangian (FVSL) approach. FVSL schemes offer a computational-efficiency advantage owing to the possibility of using large time steps and the efficient treatment of multiple transported quantities (tracers). This article presents a massively parallel, multi-tracer-efficient version of the recently developed 3D cascade FVSL transport scheme. Hybrid distributed-shared-memory parallelism, with 1D MPI domain decomposition in latitude and OpenMP threading of the longitude loops, allows the efficient use of up to 1,600 computational cores; we expect this number to grow as the number of shared-memory cores per computational node increases. Multi-tracer optimisations of the scheme (chiefly, a multi-tracer-efficient monotonic filter) reduce the cost of transporting each additional tracer to 18–23% of the cost of running the scheme with a single tracer.
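To make the decomposition concrete, here is a schematic mpi4py sketch, with invented grid dimensions (not the authors' code), of the hybrid strategy described above: MPI ranks own contiguous latitude bands, while the longitude loops within each band are left to shared-memory threads (OpenMP in the original scheme; the threading itself is elided here):

```python
# A schematic 1D latitude decomposition with mpi4py. Grid sizes are
# invented for illustration; this is not the published implementation.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

nlat, nlon, ntracers = 720, 1440, 32        # illustrative grid sizes

# Spread nlat latitude rows over the ranks as evenly as possible.
counts = [nlat // size + (r < nlat % size) for r in range(size)]
lo = sum(counts[:rank])
hi = lo + counts[rank]

# Each rank transports all tracers on its own latitude band; the inner
# longitude loops are where the OpenMP threads would be applied.
tracers = np.zeros((ntracers, hi - lo, nlon))
print(f"rank {rank}: latitudes [{lo}, {hi})")
```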


2019
Author(s):  
Frédéric Célerse
Louis Lagardere
Étienne Derat
Jean-Philip Piquemal

This paper is dedicated to the massively parallel implementation of Steered Molecular Dynamics (SMD) in the Tinker-HP software. It allows for direct comparisons of polarizable and non-polarizable simulations of realistic systems.
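For readers unfamiliar with constant-velocity steered MD, the method tethers a pulled coordinate to a restraint point that moves at constant speed, and the accumulated spring work estimates a free-energy profile. A minimal sketch of that force, with invented parameter names (this is not Tinker-HP's implementation):

```python
# A minimal sketch of the constant-velocity SMD pulling force (invented
# names; not Tinker-HP's code): a harmonic spring of stiffness k tethers
# the pulled coordinate to a point moving at speed v along direction n.
import numpy as np

def smd_force(x, x0, n, k, v, t):
    """Force on the pulled atom (or group center of mass) at time t.
    x: current position; x0: initial position; n: unit pulling direction."""
    progress = np.dot(x - x0, n)       # distance covered along the axis
    stretch = v * t - progress         # spring extension
    return k * stretch * n
```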


2015
Vol 2015
pp. 1-12
Author(s):  
Sai Kiranmayee Samudrala
Jaroslaw Zola
Srinivas Aluru
Baskar Ganapathysubramanian

Dimensionality reduction refers to a set of mathematical techniques used to reduce the complexity of high-dimensional data while preserving selected properties. Improvements in simulation strategies and experimental data-collection methods are producing a deluge of heterogeneous, high-dimensional data, which often makes dimensionality reduction the only viable way to gain qualitative and quantitative understanding of the data. However, existing dimensionality reduction software often does not scale to the datasets arising in real-life applications, which may consist of thousands of points with millions of dimensions. In this paper, we propose a parallel framework for dimensionality reduction of large-scale data. We identify the key components underlying spectral dimensionality reduction techniques and propose their efficient parallel implementation. We show that the resulting framework can process datasets consisting of millions of points when executed on a 16,000-core cluster, which is beyond the reach of currently available methods. To further demonstrate the applicability of our framework, we perform dimensionality reduction of 75,000 images representing morphology evolution during the manufacturing of organic solar cells, in order to identify how processing parameters affect morphology evolution.
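Spectral dimensionality reduction methods share the stages the paper parallelizes: pairwise distance computation, neighborhood-graph construction, and a large sparse eigendecomposition. The serial sketch below, a Laplacian-eigenmaps-style pipeline with invented names, marks where the quadratic work that demands distribution appears:

```python
# A serial Laplacian-eigenmaps-style pipeline (invented names, for
# illustration only): its three stages are exactly the ones a
# distributed framework must parallelize for millions of points.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import laplacian
from scipy.sparse.linalg import eigsh

def spectral_embedding(X, k=10, dim=2):
    n = X.shape[0]
    # Stage 1: dense pairwise squared distances, O(n^2 d) -- the dominant
    # cost that motivates the parallel framework.
    sq = (X ** 2).sum(axis=1)
    D = sq[:, None] + sq[None, :] - 2 * X @ X.T
    # Stage 2: symmetric k-nearest-neighbor affinity graph.
    idx = np.argsort(D, axis=1)[:, 1:k + 1]
    rows = np.repeat(np.arange(n), k)
    W = csr_matrix((np.ones(n * k), (rows, idx.ravel())), shape=(n, n))
    W = W.maximum(W.T)
    # Stage 3: smallest nontrivial eigenvectors of the graph Laplacian.
    L = laplacian(W, normed=True)
    vals, vecs = eigsh(L, k=dim + 1, which='SM')
    return vecs[:, 1:dim + 1]          # drop the trivial eigenvector
```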


Author(s):  
Семен Евгеньевич Попов
Вадим Петрович Потапов
Роман Юрьевич Замараев

The article describes a software implementation, built on the Apache Spark platform, of a fast algorithm for finding distributed scatterers in the problem of estimating displacement velocities of the Earth's surface. The Persistent Scatterer (PS) method is widely used for estimating such displacement rates: it identifies coherent radar targets (interferogram pixels) that exhibit high phase stability over the entire observation period. The most advanced algorithm for this identification problem is the SqueeSAR algorithm, which searches for and processes Distributed Scatterers (DS), specific reflectors that are integrated into the general PS scheme for calculating displacement velocities. The proposed algorithm is inserted into the scheme after the stage of subpixel-accurate co-registration of the image stack of a Sentinel-1 radar time series; it is non-iterative and well suited to the parallel-computing paradigm.

A careful analysis of the SqueeSAR algorithm identified the stages critical to its performance. The whole algorithm is built on a traversal of the input data, with nontrivial transformations performed at each step. Two stages turned out to be particularly costly: searching for adjacent points within a sliding window, with multiple passes over the entire image, and solving the maximization problem that estimates the true values of the interferometric phases. To speed up image processing, the Apache Spark massively parallel computing platform is used. Its specialized primitive for recurrent in-memory processing, the Resilient Distributed Dataset (RDD), provides repeated access to the radar data held in memory on each cluster node and allows the snapshot stack to be divided logically into subareas, so that calculations proceed independently in massively parallel mode.

Following the SqueeSAR mathematical model, the radar image data and the computed geophysical parameters are assumed to be common to each statistically homogeneous sample of nearby pixels. Under this assumption, the homogeneity of pixels is assessed within a given window. The search for distributed scatterers proceeds independently over a sequence of window shifts across the entire image; the window moves along the width and height of the image with a step equal to the window's width and height. Pairs of samples within the window are formed from the vectors of complex pixel values across each of the N images, and every pair is checked against the two-sample Kolmogorov-Smirnov criterion. To estimate the phase values of the homogeneous pixels, a maximization problem is solved by maximum likelihood estimation (MLE); the correct MLE form is constructed by analyzing the statistical properties of the coherence matrix of all images using the complex Wishart distribution.

The Apache Spark platform made it possible to process radar data stacks (60 or more images) in memory, distributed over a large number of physical nodes in a network environment. The average search time for distributed scatterers proved to be up to ten times lower than that of the single-processor implementation; comparative test results on a demonstration cluster are presented. The algorithm is implemented in the Python programming language, with a detailed description of its objects and methods. The proposed algorithm and its parallel implementation allow the developed approaches to be applied to other problems and other types of satellite Earth remote-sensing data.
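A condensed sketch, with invented helper names (not the published code), of the windowed search described above: Spark distributes the non-overlapping windows of the co-registered stack across the cluster, and within each window every pixel pair is tested for statistical homogeneity with a two-sample Kolmogorov-Smirnov test on its amplitude time series:

```python
# A condensed sketch of the windowed distributed-scatterer search
# (invented helpers; not the published implementation). Each window is
# independent, so Spark can spread them across the cluster.
import numpy as np
from scipy.stats import ks_2samp
from pyspark import SparkContext

def homogeneous_pairs(window, alpha=0.05):
    """window: (N, h, w) complex image patch over N acquisition dates."""
    amp = np.abs(window).reshape(window.shape[0], -1)   # (N, pixels)
    npix = amp.shape[1]
    pairs = []
    for i in range(npix):
        for j in range(i + 1, npix):
            # Pixel pairs whose amplitude time series pass the two-sample
            # Kolmogorov-Smirnov test are accepted as homogeneous.
            if ks_2samp(amp[:, i], amp[:, j]).pvalue > alpha:
                pairs.append((i, j))
    return pairs

sc = SparkContext(appName="ds-search")
# `windows` stands for the non-overlapping (N, h, w) patches cut from
# the co-registered Sentinel-1 stack; left empty in this sketch.
windows = []
results = sc.parallelize(windows).map(homogeneous_pairs).collect()
sc.stop()
```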

