Simulation of the performance and scalability of message passing interface (MPI) communications of atmospheric models running on exascale supercomputers

2018 ◽  
Vol 11 (8) ◽  
pp. 3409-3426 ◽  
Author(s):  
Yongjun Zheng ◽  
Philippe Marguinaud

Abstract. In this study, we identify the key message passing interface (MPI) operations required in atmospheric modelling; we then use a skeleton program and a simulation framework (based on the SST/macro simulation package) to simulate these MPI operations (transposition, halo exchange, and allreduce), with the perspective of future exascale machines in mind. The experimental results show that the choice of collective algorithm has a great impact on the performance of communications; in particular, we find that the generalized ring-k algorithm for the alltoallv operation and the generalized recursive-k algorithm for the allreduce operation perform best. In addition, we observe that the impact of interconnect topologies and routing algorithms on the performance and scalability of transposition, halo exchange, and allreduce operations is significant; the routing algorithm, however, has a negligible impact on the performance of the allreduce operation because of its small message size. Hardware limitations make it impossible to grow bandwidth and reduce latency indefinitely, so congestion may occur and limit further improvement of communication performance. The experiments show that communication performance can be improved when congestion is mitigated by a proper configuration of the topology and routing algorithm, which distributes the congestion uniformly over the interconnect network and thereby avoids congestion-induced hotspots and bottlenecks. It is generally believed that transpositions seriously limit the scalability of spectral models. The experiments show that the communication time of the transposition is larger than those of the wide halo exchange for the semi-Lagrangian method and the allreduce in the generalized conjugate residual (GCR) iterative solver for the semi-implicit method below 2×10⁵ MPI processes. The communication time of the transposition decreases quickly with an increasing number of MPI processes, demonstrating strong scalability for very large grids and moderate latencies. The communication time of the halo exchange decreases more slowly than that of the transposition, revealing weaker scalability. In contrast, the communication time of the allreduce increases with the number of MPI processes, so it does not scale well. From this point of view, the scalability of spectral models could still be acceptable. It therefore seems premature to conclude that the scalability of grid-point models is better than that of spectral models at the exascale, unless innovative methods are exploited to mitigate the scalability problems present in grid-point models.
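
For concreteness, the operations studied map onto standard MPI calls: the spectral transposition is an alltoallv, the semi-Lagrangian wide halo is a neighbour exchange, and the semi-implicit GCR solver needs an allreduce per iteration. The following illustrative mpi4py fragment sketches the latter two; it is not the authors' skeleton program, and the process grid and field sizes are hypothetical.

# Illustrative sketch (not the authors' skeleton program): a blocking
# halo exchange on a periodic 2D Cartesian process grid, followed by
# an allreduce such as a GCR-type solver performs each iteration.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
dims = MPI.Compute_dims(comm.Get_size(), [0, 0])
cart = comm.Create_cart(dims, periods=[True, True])

n = 64                                     # hypothetical local grid size
field = np.random.random((n + 2, n + 2))   # one-cell halo on each side

# Exchange halos along the first grid dimension; the second dimension
# is handled analogously (with wider halos for the semi-Lagrangian method).
src, dst = cart.Shift(0, 1)
cart.Sendrecv(np.ascontiguousarray(field[-2]), dest=dst,
              recvbuf=field[0], source=src)
cart.Sendrecv(np.ascontiguousarray(field[1]), dest=src,
              recvbuf=field[-1], source=dst)

# Allreduce of a scalar, e.g. a global inner product in GCR.
local_dot = float(np.sum(field[1:-1, 1:-1] ** 2))
global_dot = cart.allreduce(local_dot, op=MPI.SUM)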


2021 ◽  
Vol 47 (2) ◽  
pp. 1-22
Author(s):  
Jens Hahne ◽  
Stephanie Friedhoff ◽  
Matthias Bolten

In this article, we introduce the Python framework PyMGRIT, which implements the multigrid-reduction-in-time (MGRIT) algorithm for solving (non-)linear systems arising from the discretization of time-dependent problems. The MGRIT algorithm is a reduction-based iterative method that allows parallel-in-time simulations, i.e., calculating multiple time steps simultaneously, using a time-grid hierarchy. The PyMGRIT framework includes many different variants of the MGRIT algorithm, ranging from different multigrid cycle types and relaxation schemes to various coarsening strategies, including time-only and space-time coarsening, as well as the ability to utilize different time integrators on different levels of the multigrid hierarchy. Comprehensive documentation with tutorials and many examples, together with fully documented code, makes it easy to get started with the package. The functionality of the code is ensured by automated serial and parallel tests using continuous integration. PyMGRIT supports serial runs suitable for prototyping and testing new approaches, as well as parallel runs using the Message Passing Interface (MPI). In this manuscript, we describe the implementation of the MGRIT algorithm in PyMGRIT and present its usage from both a user and a developer point of view. Three examples illustrate different aspects of the package, in particular running tests with pure time parallelism, as well as space-time parallelism through the coupling of PyMGRIT with PETSc or Firedrake.
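
As a flavour of the user-facing API, the following minimal sketch follows the package's documented Dahlquist quickstart; the module paths and argument names are taken from the PyMGRIT tutorials and may differ between versions.

# Minimal PyMGRIT sketch, following the documented Dahlquist quickstart
# (module paths per the PyMGRIT tutorials; they may vary by version).
from pymgrit.dahlquist.dahlquist import Dahlquist
from pymgrit.core.simple_setup_problem import simple_setup_problem
from pymgrit.core.mgrit import Mgrit

# Scalar Dahlquist test problem on 101 time points in [0, 5].
dahlquist = Dahlquist(t_start=0, t_stop=5, nt=101)

# Two-level time-grid hierarchy with coarsening factor 2.
problem = simple_setup_problem(problem=dahlquist, level=2, coarsening=2)

# Solve with MGRIT; launching under mpiexec distributes the time
# domain across MPI processes.
mgrit = Mgrit(problem=problem, tol=1e-10)
info = mgrit.solve()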


10.29007/xcwc ◽  
2018 ◽  
Author(s):  
Massimiliano Turchetto ◽  
Renato Vacondio ◽  
Alessandro Dal Palù

This paper presents a multi-Graphics Processing Unit (GPU) implementation of a 2D shallow water equations solver that is able to exploit the computational power of modern HPC clusters equipped with several GPUs on different nodes. The domain is discretized by means of a Block Uniform Quadtree (BUQ) grid, which allows variable resolution to be introduced efficiently in a GPU-accelerated finite volume code. In the present work the BUQ grid is decomposed into different partitions, and each partition is assigned to a dedicated GPU. Communications between different partitions are then handled by means of the Message Passing Interface (MPI). Computations and communications are overlapped to reduce the overheads of the multi-GPU implementation. The strong scalability test shows that the efficiency decreases less than linearly with the number of GPUs adopted in the simulation, and the weak scalability test shows that the network overheads caused by border communication can be completely masked by the GPU computations.
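
The overlap of computations and communications mentioned above follows a standard pattern: post non-blocking border exchanges, update the interior while messages are in flight, then finish the border cells. The mpi4py sketch below illustrates that generic pattern; it is not the authors' CUDA code, and the strip decomposition and sizes are hypothetical.

# Generic overlap sketch (not the authors' code): a 1D strip
# decomposition with one ghost row per side; interior work proceeds
# while the border rows travel over the network.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

local, width = 1024, 512                 # hypothetical partition size
u = np.zeros((local + 2, width))         # rows 0 and -1 are ghosts
left = rank - 1 if rank > 0 else MPI.PROC_NULL
right = rank + 1 if rank < size - 1 else MPI.PROC_NULL

recv_l, recv_r = np.empty(width), np.empty(width)
reqs = [comm.Irecv(recv_l, source=left),
        comm.Irecv(recv_r, source=right),
        comm.Isend(np.ascontiguousarray(u[1]), dest=left),
        comm.Isend(np.ascontiguousarray(u[-2]), dest=right)]

# ... update interior rows u[2:-2] here, overlapping communication ...

MPI.Request.Waitall(reqs)
if left != MPI.PROC_NULL:
    u[0] = recv_l                        # physical boundaries would
if right != MPI.PROC_NULL:               # need proper treatment
    u[-1] = recv_r
# ... now update the border rows u[1] and u[-2] ...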


2020 ◽  
Vol 15 ◽  
Author(s):  
Weiwen Zhang ◽  
Long Wang ◽  
Theint Theint Aye ◽  
Juniarto Samsudin ◽  
Yongqing Zhu

Background: Genotype imputation as a service is developed to enable researchers to estimate genotypes on haplotyped data without performing whole genome sequencing. However, genotype imputation is computation-intensive, and it thus remains a challenge to satisfy the high-performance requirements of genome-wide association studies (GWAS). Objective: In this paper, we propose a high performance computing solution for genotype imputation on supercomputers to enhance its execution performance. Method: We design and implement a multi-level parallelization that includes job-level, process-level, and thread-level parallelization, enabled by job scheduling management, the message passing interface (MPI), and OpenMP, respectively. It involves job distribution, chunk partition and execution, parallelized iteration for imputation, and data concatenation. Owing to this multi-level design, we can exploit multi-machine/multi-core architectures to improve the performance of genotype imputation. Results: Experimental results show that our proposed method outperforms a Hadoop-based implementation of genotype imputation. Moreover, we conduct experiments on supercomputers to evaluate the performance of the proposed method. The evaluation shows that it can significantly shorten the execution time, thus improving the performance of genotype imputation. Conclusion: The proposed multi-level parallelization, when deployed as imputation as a service, will facilitate bioinformatics researchers in Singapore in conducting genotype imputation and enhance association studies.
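
Schematically, the process and thread levels compose as sketched below; mpi4py stands in for MPI at the process level and a thread pool stands in for OpenMP, with hypothetical chunk counts and a placeholder imputation routine (the paper's implementation uses job scheduling, MPI, and OpenMP natively).

# Schematic sketch of the multi-level layout (hypothetical names).
from concurrent.futures import ThreadPoolExecutor
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

chunks = list(range(100))            # hypothetical genome chunks
my_chunks = chunks[rank::size]       # process level: chunk partition

def impute(chunk):
    # placeholder for the per-chunk imputation computation
    return (chunk, "imputed by rank %d" % rank)

with ThreadPoolExecutor(max_workers=4) as pool:   # thread level
    results = list(pool.map(impute, my_chunks))

gathered = comm.gather(results, root=0)           # data concatenation
if rank == 0:
    flat = sorted(r for part in gathered for r in part)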


Author(s):  
Amit Kumar Bhanja ◽  
P.C Tripathy

Innovation is the key to opportunities and growth in today's competitive and dynamic business environment. It not only nurtures but also provides companies with unique dimensions for constant reinvention of their existing ways of working, which enables and facilitates them to reach out to their prospective customers more effectively. Morgan Stanley has estimated that India will have 480 million shoppers buying products online by the year 2026, a drastic increase from 60 million online shoppers in the year 2016. E-commerce companies are aggressively implementing innovative methods of marketing their product offerings, using tools such as digital marketing, the Internet of Things (IoT) and artificial intelligence, to name a few. This paper focuses on outlining the innovative ways of marketing that the e-commerce sector implements in order to increase its customer base, and aims at determining the future scope of this area. A conceptual comparative study of Amazon and Flipkart helps to determine which marketing strategies are more appealing and beneficial from the point of view of both customers and companies.


Energies ◽  
2021 ◽  
Vol 14 (8) ◽  
pp. 2284
Author(s):  
Krzysztof Przystupa ◽  
Mykola Beshley ◽  
Olena Hordiichuk-Bublivska ◽  
Marian Kyryk ◽  
Halyna Beshley ◽  
...  

The problem of analyzing a large amount of user data to determine users' preferences and, based on these data, to provide recommendations on new products is important. Depending on the correctness and timeliness of the recommendations, significant profits or losses can result. The task of analyzing data on the users of a company's services is carried out in special recommendation systems. However, with a large number of users, the data to be processed become very big, which complicates the work of recommendation systems. For efficient data analysis in commercial systems, the Singular Value Decomposition (SVD) method can be used for intelligent analysis of the information. For large amounts of processed information, we propose to use distributed systems. This approach reduces the time needed to process the data and deliver recommendations to users. For the experimental study, we implemented the distributed SVD method using Message Passing Interface, Hadoop and Spark technologies, and we obtained results showing a reduction in data processing time when using distributed systems compared with non-distributed ones.
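
One simple way to distribute an SVD over MPI ranks, sketched below for a tall user-item matrix whose rows are split across processes, is to allreduce the local Gram contributions and decompose the result. This Gram-matrix approach illustrates the idea only; it is not necessarily the decomposition used in the paper.

# Gram-matrix sketch of a distributed SVD (illustrative only): each
# rank holds a horizontal slice A_i of the m x n ratings matrix.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rng = np.random.default_rng(comm.Get_rank())
A_local = rng.random((1000, 50))        # hypothetical local slice

# G = sum_i A_i^T A_i = A^T A via a global allreduce.
G = np.empty((50, 50))
comm.Allreduce(A_local.T @ A_local, G, op=MPI.SUM)

# The eigendecomposition of G yields the right singular vectors V and
# the squared singular values of A.
eigvals, V = np.linalg.eigh(G)          # ascending order
sigma = np.sqrt(np.maximum(eigvals[::-1], 0.0))
V = V[:, ::-1]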


2013 ◽  
Vol 13 (02) ◽  
pp. 1340004
Author(s):  
APARNA NARENDRA BHALE ◽  
MANISH RATNAKAR JOSHI

Breast cancer is one of the major causes of death among women. If a cancer can be detected early, the options for treatment and the chances of total recovery increase. From a woman's point of view, the procedure practiced to obtain a digital mammogram (DM), compression of the breasts to record an image, is exactly the same as that used to obtain a screen film mammogram (SFM). The quality of DM is undoubtedly better than that of SFM. However, obtaining DM is costlier, and very few institutions can afford DM machines. According to the National Cancer Institute, 92% of breast imaging centers in India do not have digital mammography machines and depend on conventional SFM. In this context, one should therefore ask: "Can SFM be enhanced up to the level of DM?" In this paper, we discuss our experimental analysis in this regard. We applied elementary image enhancement techniques to obtain enhanced SFM. We performed a quality analysis of DM and enhanced SFM using standard metrics such as PSNR and RMSE on more than 350 mammograms. We also used mean opinion score (MOS) analysis to evaluate the enhanced SFMs. The results showed that the clarity of processed SFM is as good as that of DM. Furthermore, we analyzed the extent of radiation exposure during SFM and DM. We present our findings from the literature together with our clinical observations.
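
For reference, the two quality metrics named above have standard definitions, sketched here (this is not the authors' code): RMSE is the root mean squared error between the two images, and PSNR = 20*log10(MAX/RMSE), where MAX is the peak intensity (255 for 8-bit images).

# Standard-definition sketch of the RMSE and PSNR metrics used to
# compare digital mammograms with enhanced screen film mammograms.
import numpy as np

def rmse(reference, test):
    diff = reference.astype(float) - test.astype(float)
    return float(np.sqrt(np.mean(diff ** 2)))

def psnr(reference, test, peak=255.0):
    # undefined for identical images (rmse == 0)
    return float(20.0 * np.log10(peak / rmse(reference, test)))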


1996 ◽  
Vol 22 (6) ◽  
pp. 789-828 ◽  
Author(s):  
William Gropp ◽  
Ewing Lusk ◽  
Nathan Doss ◽  
Anthony Skjellum


2013 ◽  
Vol 718-720 ◽  
pp. 1645-1650
Author(s):  
Gen Yin Cheng ◽  
Sheng Chen Yu ◽  
Zhi Yong Wei ◽  
Shao Jie Chen ◽  
You Cheng

The commonly used commercial simulation software packages SYSNOISE and ANSYS run on a single machine (they cannot run directly on a parallel machine) when the finite element and boundary element methods are used to simulate muffler performance, and working out an exact solution can take more than ten days, sometimes even twenty, because of the large amount of numerical simulation involved. To reduce the cost of the numerical simulation, we built a high-performance parallel machine from 32 commodity computers and transformed the finite element and boundary element simulation software into a program that can run under the MPI (message passing interface) parallel environment. The data obtained from the simulation experiments demonstrate that the numerical simulation results are good and that the computing speed of the high-performance parallel machine is 25 to 30 times that of a single microcomputer.
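
Simulations of this kind are commonly parallelized by distributing independent solves, for example one acoustic solve per excitation frequency, across MPI ranks. The sketch below shows that generic pattern with a hypothetical solve_case() standing in for the FEM/BEM solver; it is not the authors' ported program.

# Generic sketch of distributing independent simulation cases over MPI
# ranks; solve_case() is a hypothetical stand-in for one FEM/BEM solve.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

frequencies = np.linspace(20.0, 2000.0, 200)   # hypothetical sweep

def solve_case(freq):
    # placeholder: run one solve and return, e.g., the transmission
    # loss of the muffler at this frequency
    return 0.0

# Round-robin assignment of cases to ranks, gathered at rank 0.
local = [(f, solve_case(f)) for f in frequencies[rank::size]]
gathered = comm.gather(local, root=0)
if rank == 0:
    results = sorted(pair for part in gathered for pair in part)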

