Communication Optimization for Multiphase Flow Solver in the Library of OpenFOAM

Multiphase flow solvers are widely used applications in OpenFOAM, but their scalability suffers from costly communication overhead. We therefore develop communication-optimized multiphase flow solvers in OpenFOAM. In this paper, we first run a scalability bottleneck test on the typical multiphase flow case damBreak and show that Message Passing Interface (MPI) communication in the Multidimensional Universal Limiter for Explicit Solution (MULES) and in the Preconditioned Conjugate Gradient (PCG) algorithm is the bottleneck of multiphase flow solvers. We then analyze the communication behavior and find that redundant communication in MULES and global synchronization in PCG are the performance-limiting factors. Based on this analysis, we propose communication optimization algorithms. For MULES, we remove the redundant communication and obtain optMULES. For PCG, we introduce several intermediate variables and rearrange PCG to reduce global communication. We also overlap the matrix-vector multiply and vector update computations with non-blocking communication. The resulting algorithms are referred to as OFPiPePCG and OFRePiPePCG, respectively. Extensive experiments show that the proposed methods substantially improve the parallel scalability and solving speed of multiphase flow solvers in OpenFOAM with almost no loss of accuracy.

Parallel multibody separation simulation using MPI and OpenMP with communication optimization

Journal of Algorithms & Computational Technology ◽

10.1177/1748301818797062 ◽

2018 ◽

Vol 13 ◽

pp. 174830181879706 ◽

Cited By ~ 1

Author(s):

Wenpeng Ma ◽

Xiaodong Hu ◽

Xiazhen Liu

Keyword(s):

Data Structures ◽

Optimization Algorithm ◽

Load Balance ◽

Message Passing ◽

Message Passing Interface ◽

Communication Optimization ◽

Local Data ◽

Flow Solver ◽

Elapsed Time ◽

Block Based

In this paper we investigate parallel implementations of multibody separation simulation using a hybrid of message passing interface and OpenMP. We propose a mesh block-based overset communication optimization algorithm. After presenting details of local data structures, we present our strategy for parallelizing both the overset mesh assembler and the flow solver by employing message passing interface and OpenMP. Experimental results show that the mesh block-based overset communication optimization algorithm has an advantage in real elapsed time when compared to a process-based implementation. The hybrid version shows that it is suitable for improving the load balance if a large number of CPU cores are used. We report results for a standard multibody separation case.

Parallel Preconditioned Conjugate Gradient Square Method Based on Normalized Approximate Inverses

Scientific Programming ◽

10.1155/2005/508607 ◽

2005 ◽

Vol 13 (2) ◽

pp. 79-91 ◽

Cited By ~ 1

Author(s):

George A. Gravvanis ◽

Konstantinos M. Giannoutakis

Keyword(s):

Linear Systems ◽

Conjugate Gradient ◽

Message Passing ◽

Message Passing Interface ◽

Distributed Memory ◽

Inverse Matrix ◽

Preconditioned Conjugate Gradient ◽

Sparse Linear Systems ◽

Approximate Inverse ◽

Approximate Inverse Matrix Techniques

A new class of normalized explicit approximate inverse matrix techniques, based on normalized approximate factorization procedures, is introduced for solving sparse linear systems resulting from the finite difference discretization of partial differential equations in three space variables. A new parallel normalized explicit preconditioned conjugate gradient square method, used in conjunction with normalized approximate inverse matrix techniques to solve sparse linear systems efficiently on distributed memory systems with the Message Passing Interface (MPI) communication library, is also presented, along with theoretical estimates of speedup and efficiency. The implementation and performance on a distributed memory MIMD machine using MPI are also investigated. Applications to characteristic initial/boundary value problems in three dimensions are discussed and numerical results are given.

A GPU-Based Gibbs Sampler for a Unidimensional IRT Model

International Scholarly Research Notices ◽

10.1155/2014/368149 ◽

2014 ◽

Vol 2014 ◽

pp. 1-11

Author(s):

Yanyan Sheng ◽

William S. Welling ◽

Michelle M. Zhu

Keyword(s):

Message Passing ◽

Large Scale ◽

Message Passing Interface ◽

Cost Effective ◽

Communication Overhead ◽

Graphic Processing Units ◽

Data Dependencies ◽

High Data ◽

Irt Models ◽

Fully Bayesian

Item response theory (IRT) is a popular approach for addressing large-scale statistical problems in psychometrics as well as in other fields. The fully Bayesian approach for estimating IRT models is usually memory-intensive and computationally expensive because of the large number of iterations, which limits the use of the procedure in many applications. To overcome this limitation, previous studies used the message passing interface (MPI) on a distributed-memory Linux cluster to achieve some speedup. However, given the high data dependencies within a single Markov chain for IRT models, the communication overhead grows rapidly as the number of cluster nodes increases, which makes it difficult to improve performance further under such a parallel framework. This study tackles the problem using many-core graphics processing units (GPUs), which are practical, cost-effective, and convenient in actual applications. Performance comparisons among serial CPU, MPI, and compute unified device architecture (CUDA) programs demonstrate that the CUDA GPU approach has many advantages over the CPU-based approach and is therefore preferred.

Adjoint of the Global Eulerian–Lagrangian Coupled Atmospheric transport model (A-GELCA v1.0): development and validation

Geoscientific Model Development Discussions ◽

10.5194/gmdd-8-5983-2015 ◽

2015 ◽

Vol 8 (7) ◽

pp. 5983-6019

Author(s):

D. A. Belikov ◽

S. Maksyutov ◽

A. Yaremchuk ◽

A. Ganshin ◽

T. Kaminski ◽

...

Keyword(s):

Message Passing ◽

Message Passing Interface ◽

Automatic Differentiation ◽

Transport Model ◽

Dispersion Model ◽

Three Dimensional ◽

Coupled Model ◽

Particle Dispersion ◽

Limiting Factors ◽

Eulerian Model

Abstract. We present the development of the Adjoint of the Global Eulerian–Lagrangian Coupled Atmospheric (A-GELCA) model, which consists of the National Institute for Environmental Studies (NIES) model as an Eulerian three-dimensional transport model (TM) and FLEXPART (FLEXible PARTicle dispersion model) as the Lagrangian plume diffusion model (LPDM). The tangent and adjoint components of the Eulerian model were constructed directly from the original NIES TM code using an automatic differentiation tool known as TAF (Transformation of Algorithms in Fortran; http://www.FastOpt.com), with additional manual pre- and post-processing aimed at improving computational performance, including the use of MPI (Message Passing Interface). As a result, the adjoint of the Eulerian model is discrete. Construction of the adjoint of the Lagrangian component did not require any code modification, as LPDMs are able to track a significant number of particles back in time and thereby calculate the sensitivity of observations to the neighboring emission areas. The Eulerian and Lagrangian adjoint components were coupled at the time boundary in the global domain. The results are verified using a series of test experiments. The forward simulation shows that the coupled model is effective in reproducing the seasonal cycle and short-term variability of CO2 even in the case of multiple limiting factors, such as high uncertainty of fluxes and the low resolution of the Eulerian model. The adjoint model demonstrates high accuracy compared to direct forward sensitivity calculations, as well as fast performance. The developed adjoint of the coupled model combines the flux conservation and stability of an Eulerian discrete adjoint formulation with the flexibility, accuracy, and high resolution of a Lagrangian backward trajectory formulation.

Multi-level Parallelization of Genotype Imputation on Supercomputers

Current Bioinformatics ◽

10.2174/1574893615999200420071307 ◽

2020 ◽

Vol 15 ◽

Author(s):

Weiwen Zhang ◽

Long Wang ◽

Theint Theint Aye ◽

Juniarto Samsudin ◽

Yongqing Zhu

Keyword(s):

Association Study ◽

Message Passing ◽

High Performance ◽

Message Passing Interface ◽

Genome Wide Association Study ◽

Job Scheduling ◽

Genotype Imputation ◽

Job Level ◽

Multi Level ◽

High Performance Requirement

Background: Genotype imputation as a service is developed to enable researchers to estimate genotypes on haplotyped data without performing whole genome sequencing. However, genotype imputation is computation-intensive, so it remains a challenge to satisfy the high performance requirement of genome wide association study (GWAS). Objective: In this paper, we propose a high performance computing solution for genotype imputation on supercomputers to enhance its execution performance. Method: We design and implement a multi-level parallelization that includes job-level, process-level and thread-level parallelization, enabled by job scheduling management, message passing interface (MPI) and OpenMP, respectively. It involves job distribution, chunk partition and execution, parallelized iteration for imputation, and data concatenation. With this multi-level design, we can exploit the multi-machine/multi-core architecture to improve the performance of genotype imputation. Results: Experimental results show that the proposed method can outperform the Hadoop-based implementation of genotype imputation. Moreover, we conduct experiments on supercomputers to evaluate the performance of the proposed method. The evaluation shows that it can significantly shorten the execution time, thus improving the performance of genotype imputation. Conclusion: The proposed multi-level parallelization, when deployed as imputation as a service, will help bioinformatics researchers in Singapore conduct genotype imputation and enhance association studies.
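
The chunk partition, parallel execution, and data concatenation steps named in the Method can be sketched as follows. This is a single-machine illustration under stated assumptions: a thread pool stands in for the MPI and OpenMP levels, and `impute_chunk` is a toy placeholder (it fills missing calls with the most common value in the chunk), not the imputation method used in the paper. All function names are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def partition(items, chunk_size):
    """Chunk partition: split the input into consecutive fixed-size pieces."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def impute_chunk(chunk):
    """Toy placeholder: replace missing calls (None) with the chunk's most common value."""
    observed = [g for g in chunk if g is not None]
    fill = Counter(observed).most_common(1)[0][0] if observed else 0
    return [fill if g is None else g for g in chunk]


def impute_all(genotypes, chunk_size=4, workers=2):
    """Process chunks in parallel, then concatenate the results in input order."""
    chunks = partition(genotypes, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(impute_chunk, chunks))  # map preserves chunk order
    return [g for chunk in results for g in chunk]      # data concatenation


if __name__ == "__main__":
    print(impute_all([0, None, 0, 1, 2, 2, None, 1]))
```

Because the chunks are independent, the same structure could in principle be spread across jobs, processes, and threads; only the concatenation step needs the results back in their original order.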

Distributed Singular Value Decomposition Method for Fast Data Processing in Recommendation Systems

Energies ◽

10.3390/en14082284 ◽

2021 ◽

Vol 14 (8) ◽

pp. 2284

Author(s):

Krzysztof Przystupa ◽

Mykola Beshley ◽

Olena Hordiichuk-Bublivska ◽

Marian Kyryk ◽

Halyna Beshley ◽

...

Keyword(s):

Distributed Systems ◽

Singular Value Decomposition ◽

Data Processing ◽

Message Passing ◽

Message Passing Interface ◽

Recommendation Systems ◽

Singular Value ◽

Singular Value Decomposition Method ◽

Value Decomposition ◽

Svd Method

Analyzing large amounts of user data to determine preferences and, based on these data, recommend new products is an important problem. Depending on the correctness and timeliness of the recommendations, significant profits or losses can result. Companies carry out this analysis of service users' data in dedicated recommendation systems. However, with a large number of users, the volume of data to process becomes very large, which complicates the operation of recommendation systems. For efficient data analysis in commercial systems, the Singular Value Decomposition (SVD) method can be used to perform intelligent analysis of information. For large amounts of processed information, we propose using distributed systems. This approach reduces the time needed for data processing and for delivering recommendations to users. For the experimental study, we implemented the distributed SVD method using Message Passing Interface, Hadoop and Spark technologies and found that distributed systems reduce data processing time compared to non-distributed ones.
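
One reason SVD-style computations distribute well is that products with A^T A split into independent per-partition sums when the matrix A is divided by rows. The sketch below illustrates this with power iteration for the largest singular value only. It is an assumption-laden illustration, not the distributed SVD implementation from the paper: the partitions are plain Python lists processed in a loop, where a real system would place each block on a different node and combine the partial results with a reduce step.

```python
import math


def partial_gram_apply(block, v):
    """One partition's contribution: block^T (block v) for its rows of the matrix."""
    bv = [sum(a * b for a, b in zip(row, v)) for row in block]
    out = [0.0] * len(v)
    for coeff, row in zip(bv, block):
        for j, a in enumerate(row):
            out[j] += coeff * a
    return out


def top_singular_value(blocks, n_cols, iters=200):
    """Power iteration on A^T A, with A split row-wise into `blocks`."""
    v = [1.0 / math.sqrt(n_cols)] * n_cols      # unit-length starting vector
    sigma = 0.0
    for _ in range(iters):
        # Each block is independent; summing the parts plays the role of a reduce.
        w = [0.0] * n_cols
        for block in blocks:
            part = partial_gram_apply(block, v)
            w = [wi + pi for wi, pi in zip(w, part)]
        norm = math.sqrt(sum(wi * wi for wi in w))
        if norm == 0.0:
            return 0.0, v
        v = [wi / norm for wi in w]
        sigma = math.sqrt(norm)                 # |A^T A v| tends to sigma^2 for unit v
    return sigma, v


if __name__ == "__main__":
    # Hypothetical user-by-item ratings, split into two row partitions.
    partitions = [[[3.0, 0.0]], [[0.0, 4.0]]]
    print(top_singular_value(partitions, n_cols=2))
```

Only the length-`n_cols` vectors cross partition boundaries in each iteration, so the communication volume depends on the number of items rather than on the number of users.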

A high-performance, portable implementation of the MPI message passing interface standard

Parallel Computing ◽

10.1016/0167-8191(96)00024-5 ◽

1996 ◽

Vol 22 (6) ◽

pp. 789-828 ◽

Cited By ~ 1155

Author(s):

William Gropp ◽

Ewing Lusk ◽

Nathan Doss ◽

Anthony Skjellum

Keyword(s):

Message Passing ◽

High Performance ◽

Message Passing Interface

A STRATEGY FOR SCHEDULING PARTIALLY ORDERED PROGRAM GRAPHS ONTO MULTICOMPUTERS

Parallel Processing Letters ◽

10.1142/s0129626495000515 ◽

1995 ◽

Vol 05 (04) ◽

pp. 575-586

Author(s):

BEN LEE ◽

ALI R. HURSON

Keyword(s):

Parallel Processing ◽

Message Passing ◽

Massively Parallel ◽

Communication Overhead ◽

Simulation Studies ◽

Global Approach ◽

Partially Ordered ◽

Massively Parallel Processing ◽

Time Scheduling ◽

Scheduling Heuristic

The issue of scalability is key to the success of massively parallel processing. Due to their distributed nature, message-passing multicomputers are appropriate for achieving scalable performance. However, the message-passing model is hard to program, because programmers must partition and schedule the computation over the processors and establish efficient inter-processor communication in the user code. Therefore, this paper presents a compile-time scheduling heuristic, called BLS, that maps programs onto the processors of a message-passing multicomputer. In contrast to other proposed methods, BLS takes a more global approach in an attempt to balance the tradeoff between exploiting parallelism and reducing communication overhead. To evaluate the effectiveness of BLS, simulation studies of scheduling SISAL programs are presented.
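
The tradeoff between parallelism and communication overhead can be seen in a very small example. The sketch below is a generic greedy list scheduler, not BLS (the abstract does not describe BLS in enough detail to reproduce it). The task graph, the uniform `comm` delay, and all names are illustrative assumptions. Each task is placed on whichever processor gives it the earliest finish time, and a delay is added whenever a predecessor ran on a different processor.

```python
def list_schedule(tasks, deps, cost, comm, n_procs):
    """Greedy earliest-finish placement.

    tasks: task names in topological order
    deps:  task -> list of predecessor tasks
    cost:  task -> execution time
    comm:  delay added when a predecessor ran on a different processor
    Returns task -> (processor, finish_time).
    """
    proc_free = [0.0] * n_procs
    placed = {}
    for t in tasks:
        best = None
        for p in range(n_procs):
            ready = 0.0
            for d in deps.get(t, []):
                d_proc, d_finish = placed[d]
                ready = max(ready, d_finish + (0.0 if d_proc == p else comm))
            finish = max(ready, proc_free[p]) + cost[t]
            if best is None or finish < best[1]:
                best = (p, finish)
        placed[t] = best
        proc_free[best[0]] = best[1]
    return placed


if __name__ == "__main__":
    tasks = ["a", "b", "c"]
    deps = {"c": ["a", "b"]}
    cost = {"a": 1.0, "b": 1.0, "c": 1.0}
    for comm in (0.0, 5.0):
        plan = list_schedule(tasks, deps, cost, comm, n_procs=2)
        print(comm, plan, max(f for _, f in plan.values()))
```

With a large `comm`, the locally best choice of running `a` and `b` on separate processors forces `c` to wait for a message, and running all three tasks on one processor would finish sooner. Purely local decisions of this kind are the sort of behavior a more global heuristic tries to avoid.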

Parallel implementation for HSLO(3)-FDTD with message passing interface on Distributed Memory Architecture

2006 International Conference on Computing & Informatics ◽

10.1109/icoci.2006.5276531 ◽

2006 ◽

Author(s):

Mohammad Khatim Hasan ◽

Mohamed Othman ◽

Jalil Md Desa ◽

Zulkifly Abbas ◽

Jumat Sulaiman

Keyword(s):

Message Passing ◽

Message Passing Interface ◽

Distributed Memory ◽

Parallel Implementation ◽

Memory Architecture ◽

Distributed Memory Architecture

Based on Numerical Simulation of High-Performance Parallel Machine Muffler Experimental Calibration

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.718-720.1645 ◽

2013 ◽

Vol 718-720 ◽

pp. 1645-1650

Author(s):

Gen Yin Cheng ◽

Sheng Chen Yu ◽

Zhi Yong Wei ◽

Shao Jie Chen ◽

You Cheng

Keyword(s):

Numerical Simulation ◽

Finite Element ◽

Boundary Element ◽

Message Passing ◽

High Performance ◽

Message Passing Interface ◽

Parallel Machine ◽

Simulation Software ◽

Experimental Calibration ◽

The Cost

The commonly used commercial simulation packages SYSNOISE and ANSYS run on a single machine (they cannot run directly on a parallel machine) when the finite element and boundary element methods are used to simulate the muffler effect. Because of the large amount of numerical computation, working out an exact solution takes more than ten days, and sometimes even twenty. To reduce the cost of numerical simulation, a high-performance parallel machine was built from 32 commercial computers, and the finite element and boundary element simulation software was converted into a program that runs in the MPI (message passing interface) parallel environment. The data from the simulation experiment show that the numerical simulation results are good, and that the computing speed of the high-performance parallel machine is 25 to 30 times that of a microcomputer.
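
The reported speedup can be turned into a parallel efficiency figure with one division. The sketch below assumes that the "microcomputer" baseline is comparable to one of the 32 machines in the cluster, which the abstract does not state; if the baseline differs, the efficiency values do not apply.

```python
def parallel_efficiency(speedup, n_machines):
    """Efficiency = speedup divided by the number of machines used."""
    return speedup / n_machines


if __name__ == "__main__":
    for speedup in (25, 30):  # range reported in the abstract
        eff = parallel_efficiency(speedup, 32)
        print(f"speedup {speedup}x on 32 machines -> efficiency {eff:.1%}")
```

Under that assumption, the reported range corresponds to an efficiency of roughly 78% to 94%.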
