Performance of Parallel Distributed Bat Algorithm using MPI on a PC Cluster

A Parallel Genetic Algorithm (PGA) is used for a simulation-based optimization of waterway project schedules. This PGA is designed to distribute a Genetic Algorithm application over multiple processors in order to speed up the solution search procedure for a very large combinational problem. The proposed PGA is based on a global parallel model, which is also called a master-slave model. A Message-Passing Interface (MPI) is used in developing the parallel computing program. A case study is presented, whose results show how the adaption of a simulation-based optimization algorithm to parallel computing can greatly reduce computation time. Additional techniques which are found to further improve the PGA performance include: (1) choosing an appropriate task distribution method, (2) distributing simulation replications instead of different solutions, (3) avoiding the simulation of duplicate solutions, (4) avoiding running multiple simulations simultaneously in shared-memory processors, and (5) avoiding using multiple processors which belong to different clusters (physical sub-networks).

Download Full-text

A Parallel Block Predictor-Corrector Method by Python-Based Distributed Computing

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.263-266.1315 ◽

2012 ◽

Vol 263-266 ◽

pp. 1315-1318

Author(s):

Kun Ming Yu ◽

Ming Gong Lee

Keyword(s):

Differential Equations ◽

Numerical Solution ◽

Numerical Method ◽

Initial Value Problem ◽

Message Passing ◽

Message Passing Interface ◽

Parallel Structure ◽

Initial Value ◽

Speed Up ◽

Predictor Corrector

This paper is to discuss how Python can be used in designing a cluster parallel computation environment in numerical solution of some block predictor-corrector method for ordinary differential equations. In the parallel process, MPI-2(message passing interface) is used as a standard of MPICH2 to communicate between CPUs. The operation of data receiving and sending are operated and controlled by mpi4py which is based on Python. Implementation of a block predictor-corrector numerical method with one and two CPUs respectively is used to test the performance of some initial value problem. Minor speed up is obtained due to small size problems and few CPUs used in the scheme, though the establishment of this scheme by Python is valuable due to very few research has been carried in this kind of parallel structure under Python.

Download Full-text

Parallel multibody separation simulation using MPI and OpenMP with communication optimization

Journal of Algorithms & Computational Technology ◽

10.1177/1748301818797062 ◽

2018 ◽

Vol 13 ◽

pp. 174830181879706 ◽

Cited By ~ 1

Author(s):

Wenpeng Ma ◽

Xiaodong Hu ◽

Xiazhen Liu

Keyword(s):

Data Structures ◽

Optimization Algorithm ◽

Load Balance ◽

Message Passing ◽

Message Passing Interface ◽

Communication Optimization ◽

Local Data ◽

Flow Solver ◽

Elapsed Time ◽

Block Based

In this paper we investigate parallel implementations of multibody separation simulation using a hybrid of message passing interface and OpenMP. We propose a mesh block-based overset communication optimization algorithm. After presenting details of local data structures, we present our strategy for parallelizing both the overset mesh assembler and the flow solver by employing message passing interface and OpenMP. Experimental results show that the mesh block-based overset communication optimization algorithm has an advantage in real elapsed time when compared to a process-based implementation. The hybrid version shows that it is suitable for improving the load balance if a large number of CPU cores are used. We report results for a standard multibody separation case.

Download Full-text

Study on the Numerical Simulation of Explosion and Impact Processes Using PC Cluster System

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.433-440.2892 ◽

2012 ◽

Vol 433-440 ◽

pp. 2892-2898

Author(s):

Guang Lei Fei ◽

Jian Guo Ning ◽

Tian Bao Ma

Keyword(s):

Numerical Simulation ◽

Operating System ◽

Parallel Computing ◽

Message Passing ◽

Message Passing Interface ◽

Parallel Program ◽

Pc Cluster ◽

Computing Platform ◽

Impact Processes ◽

Platform System

Parallel computing has been applied in many fields, and the parallel computing platform system, PC cluster based on MPI (Message Passing Interface) library under Linux operating system is a cost-effectiveness approach to parallel compute. In this paper, the key algorithm of parallel program of explosion and impact is presented. The techniques of solving data dependence and realizing communication between subdomain are proposed. From the test of program, the portability of MMIC-3D parallel program is satisfied, and compared with the single computer, PC cluster can improve the calculation speed and enlarge the scale greatly.

Download Full-text

QUARTIC: QUick pArallel algoRithms for high-Throughput sequencIng data proCessing

F1000Research ◽

10.12688/f1000research.22954.3 ◽

2020 ◽

Vol 9 ◽

pp. 240

Author(s):

Frédéric Jarlier ◽

Nicolas Joly ◽

Nicolas Fedy ◽

Thomas Magalhaes ◽

Leonor Sirotti ◽

...

Keyword(s):

High Throughput ◽

Message Passing ◽

High Performance ◽

Message Passing Interface ◽

High Throughput Sequencing ◽

Genome Structure ◽

Sequencing Data ◽

High Throughput Sequencing Data ◽

Speed Up ◽

Time To Delivery

Life science has entered the so-called 'big data era' where biologists, clinicians and bioinformaticians are overwhelmed with high-throughput sequencing data. While they offer new insights to decipher the genome structure they also raise major challenges to use them for daily clinical practice care and diagnosis purposes as they are bigger and bigger. Therefore, we implemented a software to reduce the time to delivery for the alignment and the sorting of high-throughput sequencing data. Our solution is implemented using Message Passing Interface and is intended for high-performance computing architecture. The software scales linearly with respect to the size of the data and ensures a total reproducibility with the traditional tools. For example, a 300X whole genome can be aligned and sorted within less than 9 hours with 128 cores. The software offers significant speed-up using multi-cores and multi-nodes parallelization.

Download Full-text

An efficient MPI/OpenMP parallelization of the Hartree–Fock–Roothaan method for the first generation of Intel® Xeon Phi™ processor architecture

The International Journal of High Performance Computing Applications ◽

10.1177/1094342017732628 ◽

2017 ◽

Vol 33 (1) ◽

pp. 212-224 ◽

Cited By ~ 5

Author(s):

Vladimir Mironov ◽

Alexander Moskovsky ◽

Michael D’Mello ◽

Yuri Alexeev

Keyword(s):

Message Passing ◽

Message Passing Interface ◽

First Generation ◽

Hot Spot ◽

Direct Consequence ◽

Xeon Phi ◽

Hartree Fock ◽

Self Consistent Field ◽

Speed Up ◽

Electron Repulsion Integrals

The Hartree–Fock method in the General Atomic and Molecular Structure System (GAMESS) quantum chemistry package represents one of the most irregular algorithms in computation today. Major steps in the calculation are the irregular computation of electron repulsion integrals and the building of the Fock matrix. These are the central components of the main self consistent field (SCF) loop, the key hot spot in electronic structure codes. By threading the Message Passing Interface (MPI) ranks in the official release of the GAMESS code, we not only speed up the main SCF loop (4× to 6× for large systems) but also achieve a significant ([Formula: see text]×) reduction in the overall memory footprint. These improvements are a direct consequence of memory access optimizations within the MPI ranks. We benchmark our implementation against the official release of the GAMESS code on the Intel® Xeon Phi™ supercomputer. Scaling numbers are reported on up to 7680 cores on Intel Xeon Phi coprocessors.

Download Full-text

Numerical Solution of 2-D Water Entry Problems Based on a CIP Method and a Parallel Computing Algorithm

Volume 7: Ocean Engineering ◽

10.1115/omae2015-41309 ◽

2015 ◽

Author(s):

Peng Wen ◽

Wei Qiu

Keyword(s):

Parallel Computing ◽

Message Passing ◽

Message Passing Interface ◽

Water Entry ◽

Navier Stokes ◽

Cip Method ◽

Constrained Interpolation ◽

Decomposition Scheme ◽

Computing Algorithm ◽

Speed Up

A constrained interpolation profile (CIP) method has been developed to solve 2-D water entry problems. This paper presents the further development of the numerical method using staggered grids and a parallel computing algorithm. In this work, the multi-phase slamming problems, governed by the Navier-Stokes (N-S) equations, are solved by a CIP-based finite difference method. The interfaces between different phases (solid, water and air) are captured using density functions. A parallel computing algorithm based on the Message Passing Interface (MPI) method and the domain decomposition scheme was implemented to speed up the computations. The effect of decomposition scheme on the solution and the speed-up were studied. Validation studies were carried out for the water entry of various 2-D wedges and a ship section. The predicted slamming force, pressure distribution and free surface elevation are compared with experimental results and other numerical results.

Download Full-text

Parallel computing efficiency of SWAN 40.91

Geoscientific Model Development ◽

10.5194/gmd-14-4241-2021 ◽

2021 ◽

Vol 14 (7) ◽

pp. 4241-4247

Author(s):

Christo Rautenbach ◽

Julia C. Mullarney ◽

Karin R. Bryan

Keyword(s):

South African ◽

Message Passing ◽

Message Passing Interface ◽

Grid Cell ◽

Computational Domain ◽

Time Saving ◽

Model Configuration ◽

Speed Up ◽

Set Up ◽

Wave Forecasting

Abstract. Effective and accurate ocean and coastal wave predictions are necessary for engineering, safety and recreational purposes. Refining predictive capabilities is increasingly critical to reduce the uncertainties faced with a changing global wave climatology. Simulating WAves in the Nearshore (SWAN) is a widely used spectral wave modelling tool employed by coastal engineers and scientists, including for operational wave forecasting purposes. Fore- and hindcasts can span hours to decades, and a detailed understanding of the computational efficiencies is required to design optimized operational protocols and hindcast scenarios. To date, there exists limited knowledge on the relationship between the size of a SWAN computational domain and the optimal amount of parallel computational threads/cores required to execute a simulation effectively. To test the scalability, a hindcast cluster of 28 computational threads/cores (1 node) was used to determine the computation efficiencies of a SWAN model configuration for southern Africa. The model extent and resolution emulate the current operational wave forecasting configuration developed by the South African Weather Service (SAWS). We implemented and compared both OpenMP and the Message Passing Interface (MPI) distributing memory architectures. Three sequential simulations (corresponding to typical grid cell numbers) were compared to various permutations of parallel computations using the speed-up ratio, time-saving ratio and efficiency tests. Generally, a computational node configuration of six threads/cores produced the most effective computational set-up based on wave hindcasts of 1-week duration. The use of more than 20 threads/cores resulted in a decrease in speed-up ratio for the smallest computation domain, owing to the increased sub-domain communication times for limited domain sizes.

Download Full-text

Simulation-Based Scheduling of Waterway Projects Using a Parallel Genetic Algorithm

Civil and Environmental Engineering ◽

10.4018/978-1-4666-9619-8.ch046 ◽

2016 ◽

pp. 1071-1084

Author(s):

Ning Yang ◽

Shiaaulir Wang ◽

Paul Schonfeld

Keyword(s):

Genetic Algorithm ◽

Parallel Computing ◽

Message Passing ◽

Message Passing Interface ◽

Computation Time ◽

Parallel Genetic Algorithm ◽

Simulation Based ◽

Multiple Processors ◽

Simulation Based Optimization ◽

Speed Up

A Parallel Genetic Algorithm (PGA) is used for a simulation-based optimization of waterway project schedules. This PGA is designed to distribute a Genetic Algorithm application over multiple processors in order to speed up the solution search procedure for a very large combinational problem. The proposed PGA is based on a global parallel model, which is also called a master-slave model. A Message-Passing Interface (MPI) is used in developing the parallel computing program. A case study is presented, whose results show how the adaption of a simulation-based optimization algorithm to parallel computing can greatly reduce computation time. Additional techniques which are found to further improve the PGA performance include: (1) choosing an appropriate task distribution method, (2) distributing simulation replications instead of different solutions, (3) avoiding the simulation of duplicate solutions, (4) avoiding running multiple simulations simultaneously in shared-memory processors, and (5) avoiding using multiple processors which belong to different clusters (physical sub-networks).

Download Full-text

Parallelization of an Implicit Algorithm for Multi-Dimensional Particle-in-Cell Simulations

Communications in Computational Physics ◽

10.4208/cicp.070813.280214a ◽

2014 ◽

Vol 16 (3) ◽

pp. 599-611 ◽

Cited By ~ 6

Author(s):

George M. Petrov ◽

Jack Davis

Keyword(s):

Domain Decomposition ◽

Message Passing ◽

Message Passing Interface ◽

Computation Time ◽

Three Dimensions ◽

Maximum Speed ◽

Particle In Cell ◽

Speed Up ◽

Ultrashort Pulse Lasers ◽

Mpi Implementation

AbstractThe implicit 2D3V particle-in-cell (PIC) code developed to study the interaction of ultrashort pulse lasers with matter [G. M. Petrov and J. Davis, Computer Phys. Comm. 179, 868 (2008); Phys. Plasmas 18, 073102 (2011)] has been parallelized using MPI (Message Passing Interface). The parallelization strategy is optimized for a small number of computer cores, up to about 64. Details on the algorithm implementation are given with emphasis on code optimization by overlapping computations with communications. Performance evaluation for 1D domain decomposition has been made on a small Linux cluster with 64 computer cores for two typical regimes of PIC operation: “particle dominated”, for which the bulk of the computation time is spent on pushing particles, and “field dominated”, for which computing the fields is prevalent. For a small number of computer cores, less than 32, the MPI implementation offers a significant numerical speed-up. In the “particle dominated” regime it is close to the maximum theoretical one, while in the “field dominated” regime it is about 75-80 % of the maximum speed-up. For a number of cores exceeding 32, performance degradation takes place as a result of the adopted 1D domain decomposition. The code parallelization will allow future implementation of atomic physics and extension to three dimensions.

Download Full-text