SMP-SIM: An SMP-based discrete-event execution-driven performance simulator

Yufei Lin; Xinhai Xu; Yuhua Tang; Xin Zhang; Xiaowei Guo

doi:10.2298/csis120118046l

SMP-SIM: An SMP-based discrete-event execution-driven performance simulator

Computer Science and Information Systems ◽

10.2298/csis120118046l ◽

2012 ◽

Vol 9 (4) ◽

pp. 1361-1383

Author(s):

Yufei Lin ◽

Xinhai Xu ◽

Yuhua Tang ◽

Xin Zhang ◽

Xiaowei Guo

Keyword(s):

Message Passing ◽

Large Scale ◽

Message Passing Interface ◽

Discrete Event ◽

Simulation Method ◽

Parallel System ◽

Memory Space ◽

Central Processing ◽

Performance Requirements ◽

Point To Point

Designing and implementing a large-scale parallel system can be time-consuming and costly. It is therefore desirable to enable system developers to predict the performance of a parallel system at its design phase so that they can evaluate design alternatives to better meet performance requirements. Before the target machine is completely built, the developers can always build an symmetric multi-processor (SMP) for evaluation purposes. In this paper, we introduce an SMP-based discrete-event execution-driven performance simulation method for message passing interface (MPI) programs and describe the design and implementation of a simulator called SMP-SIM. As the processes share the same memory space in an SMP, SMP-SIM manages the events globally at the granularity of central processing units (CPUs). Furthermore, by re-implementing core MPI point-to-point communication primitives, SMP-SIM handles the communication virtually and sequential computation actually. Our experimental results show that SMP-SIM is highly accurate and scalable, resulting in errors of less than 7.60% for both SMP and SMP-Cluster target machines.

Download Full-text

A lightweight approach to performance portability with targetDP

The International Journal of High Performance Computing Applications ◽

10.1177/1094342016682071 ◽

2016 ◽

Vol 32 (2) ◽

pp. 288-301

Author(s):

Alan Gray ◽

Kevin Stratford

Keyword(s):

Particle Physics ◽

Message Passing ◽

Graphics Processing Units ◽

High Performance ◽

Large Scale ◽

Message Passing Interface ◽

Graphics Processing Unit ◽

Processing Unit ◽

Performance Portability ◽

Graphics Processing

Leading high performance computing systems achieve their status through use of highly parallel devices such as NVIDIA graphics processing units or Intel Xeon Phi many-core CPUs. The concept of performance portability across such architectures, as well as traditional CPUs, is vital for the application programmer. In this paper we describe targetDP, a lightweight abstraction layer which allows grid-based applications to target data parallel hardware in a platform agnostic manner. We demonstrate the effectiveness of our pragmatic approach by presenting performance results for a complex fluid application (with which the model was co-designed), plus separate lattice quantum chromodynamics particle physics code. For each application, a single source code base is seen to achieve portable performance, as assessed within the context of the Roofline model. TargetDP can be combined with Message Passing Interface (MPI) to allow use on systems containing multiple nodes: we demonstrate this through provision of scaling results on traditional and graphics processing unit-accelerated large scale supercomputers.

Download Full-text

Improving data transfer for model coupling

Geoscientific Model Development Discussions ◽

10.5194/gmdd-8-8981-2015 ◽

2015 ◽

Vol 8 (10) ◽

pp. 8981-9020 ◽

Cited By ~ 2

Author(s):

C. Zhang ◽

L. Liu ◽

G. Yang ◽

R. Li ◽

B. Wang

Keyword(s):

Performance Improvement ◽

Message Passing ◽

Message Passing Interface ◽

Data Transfer ◽

The Public ◽

Message Size ◽

Abstract Data ◽

Point To Point ◽

Component Models ◽

Size Variable

Abstract. Data transfer, which means transferring data fields between two component models or rearranging data fields among processes of the same component model, is a fundamental operation of a coupler. Most of state-of-the-art coupler versions currently use an implementation based on the point-to-point (P2P) communication of the Message Passing Interface (MPI) (call such an implementation "P2P implementation" for short). In this paper, we reveal the drawbacks of the P2P implementation, including low communication bandwidth due to small message size, variable and big number of MPI messages, and jams during communication. To overcome these drawbacks, we propose a butterfly implementation for data transfer. Although the butterfly implementation can outperform the P2P implementation in many cases, it degrades the performance in some cases because the total message size transferred by the butterfly implementation is larger than that by the P2P implementation. To make the data transfer completely improved, we design and implement an adaptive data transfer library that combines the advantages of both butterfly implementation and P2P implementation. Performance evaluation shows that the adaptive data transfer library significantly improves the performance of data transfer in most cases and does not decrease the performance in any cases. Now the adaptive data transfer library is open to the public and has been imported into a coupler version C-Coupler1 for performance improvement of data transfer. We believe that it can also improve other coupler versions.

Download Full-text

Parallelization of the Lattice Boltzmann Method in Simulating Buoyancy-Driven Convection Heat Transfer

Heat Transfer, Volume 2 ◽

10.1115/imece2004-61871 ◽

2004 ◽

Author(s):

Anoosheh Niavarani-Kheirier ◽

Masoud Darbandi ◽

Gerry E. Schneider

Keyword(s):

Lattice Boltzmann Method ◽

Lattice Boltzmann ◽

Message Passing ◽

Large Scale ◽

Message Passing Interface ◽

Parallel Machines ◽

Convection Heat Transfer ◽

Wide Range ◽

Buoyancy Driven Convection ◽

Boltzmann Method

The main objective of the current work is to utilize Lattice Boltzmann Method (LBM) for simulating buoyancy-driven flow considering the hybrid thermal lattice Boltzmann equation (HTLBE). After deriving the required formulations, they are validated against a wide range of Rayleigh numbers in buoyancy-driven square cavity problem. The performance of the method is investigated on parallel machines using Message Passing Interface (MPI) library and implementing domain decomposition technique to solve problems with large order of computations. The achieved results show that the code is highly efficient to solve large scale problems with excellent speedup.

Download Full-text

Interpretive MPI for Parallel Computing

Volume 3: 28th Computers and Information in Engineering Conference, Parts A and B ◽

10.1115/detc2008-49996 ◽

2008 ◽

Author(s):

Yu-Cheng Chou ◽

Harry H. Cheng

Keyword(s):

Parallel Computing ◽

Programming Languages ◽

Message Passing ◽

Large Scale ◽

Message Passing Interface ◽

Rapid Development ◽

Web Based ◽

Heterogeneous Platforms ◽

C Programs ◽

Computation Speedup

Message Passing Interface (MPI) is a standardized library specification designed for message-passing parallel programming on large-scale distributed systems. A number of MPI libraries have been implemented to allow users to develop portable programs using the scientific programming languages, Fortran, C and C++. Ch is an embeddable C/C++ interpreter that provides an interpretive environment for C/C++ based scripts and programs. Combining Ch with any MPI C/C++ library provides the functionality for rapid development of MPI C/C++ programs without compilation. In this article, the method of interfacing Ch scripts with MPI C implementations is introduced by using the MPICH2 C library as an example. The MPICH2-based Ch MPI package provides users with the ability to interpretively run MPI C program based on the MPICH2 C library. Running MPI programs through the MPICH2-based Ch MPI package across heterogeneous platforms consisting of Linux and Windows machines is illustrated. Comparisons for the bandwidth, latency, and parallel computation speedup between C MPI, Ch MPI, and MPI for Python in an Ethernet-based environment comprising identical Linux machines are presented. A Web-based example is given to demonstrate the use of Ch and MPICH2 in C based CGI scripting to facilitate the development of Web-based applications for parallel computing.

Download Full-text

Comparing Message Passing Interface and MapReduce for large-scale parallel ranking and selection

2015 Winter Simulation Conference (WSC) ◽

10.1109/wsc.2015.7408542 ◽

2015 ◽

Cited By ~ 2

Author(s):

Eric C. Ni ◽

Dragos F. Ciocan ◽

Shane G. Henderson ◽

Susan R. Hunter

Keyword(s):

Message Passing ◽

Large Scale ◽

Message Passing Interface ◽

Ranking And Selection

Download Full-text

Analysis of the Calculation of a Plasma Sheath Using the Parallel SO-DGTD Method

International Journal of Antennas and Propagation ◽

10.1155/2019/7160913 ◽

2019 ◽

Vol 2019 ◽

pp. 1-9

Author(s):

Qian Yang ◽

Bing Wei ◽

Linqian Li ◽

Debiao Ge

Keyword(s):

Discontinuous Galerkin ◽

Cross Section ◽

Time Domain ◽

Message Passing ◽

High Speed ◽

Large Scale ◽

Message Passing Interface ◽

Shift Operator ◽

Plasma Sheath ◽

Blunt Cone

The plasma sheath is known as a popular topic of computational electromagnetics, and the plasma case is more resource-intensive than the non-plasma case. In this paper, a parallel shift-operator discontinuous Galerkin time-domain method using the MPI (Message Passing Interface) library is proposed to solve the large-scale plasma problems. To demonstrate our algorithm, a plasma sheath model of the high-speed blunt cone was established based on the results of the multiphysics software, and our algorithm was used to extract the radar cross-section (RCS) versus different incident angles of the model.

Download Full-text

Model Order Reduction of Large-Scale Finite Element Systems in an MPI Parallelized Environment for Usage in Multibody Simulation

Archive of Mechanical Engineering ◽

10.1515/meceng-2016-0027 ◽

2016 ◽

Vol 63 (4) ◽

pp. 475-494 ◽

Cited By ~ 1

Author(s):

Thomas Volzer ◽

Peter Eberhard

Keyword(s):

Finite Element ◽

Model Reduction ◽

Message Passing ◽

Large Scale ◽

Message Passing Interface ◽

Block Size ◽

Reduction Process ◽

Element Model ◽

Multibody Simulation ◽

Elastic Bodies

Abstract The use of elastic bodies within a multibody simulation became more and more important within the last years. To include the elastic bodies, described as a finite element model in multibody simulations, the dimension of the system of ordinary differential equations must be reduced by projection. For this purpose, in this work, the modal reduction method, a component mode synthesis based method and a moment-matching method are used. Due to the always increasing size of the non-reduced systems, the calculation of the projection matrix leads to a large demand of computational resources and cannot be done on usual serial computers with available memory. In this paper, the model reduction software Morembs++ is presented using a parallelization concept based on the message passing interface to satisfy the need of memory and reduce the runtime of the model reduction process. Additionally, the behaviour of the Block-Krylov-Schur eigensolver, implemented in the Anasazi package of the Trilinos project, is analysed with regard to the choice of the size of the Krylov base, the block size and the number of blocks. Besides, an iterative solver is considered within the CMS-based method.

Download Full-text

CIP and Parallel Computing Based Numerical Solutions of 3-D Slamming Problems

Volume 11: Prof. Robert F. Beck Honoring Symposium on Marine Hydrodynamics ◽

10.1115/omae2015-41292 ◽

2015 ◽

Author(s):

Peng Wen ◽

Wei Qiu

Keyword(s):

Parallel Computing ◽

Message Passing ◽

Message Passing Interface ◽

Numerical Solutions ◽

Three Dimensional ◽

Simulation Method ◽

Water Entry ◽

Computational Domain ◽

Cip Method ◽

Constrained Interpolation

This paper presents the further development of numerical simulation method to solve 3-D highly non-linear slamming problems using parallel computing algorithms. The water entry problems are treated as multi-phase problems (solid, water and air) and governed by the Navier-Stokes (N-S) equations. They are solved by the three-dimensional constrained interpolation profile (CIP) method. The interfaces between different phases are captured using density functions. In the computation, the 3-D CIP method is employed for the advection phase of the N-S equations and a pressure-based algorithm is applied for the non-advection phase. The bi-conjugate gradient stabilized method (BiCGSTAB) is utilized to solve the linear equation systems. A Message Passing Interface (MPI) parallel computing scheme was implemented in the computations. For the parallel computations, the three-dimensional Cartesian decomposition of the computational domain was used. The speed-up performance of various decomposition schemes were studied. Validation studies were carried out for the water entry of a 3-D wedge and a 3-D ship section with prescribed velocities. The computed slamming force, pressure distribution and free-surface elevations are compared with experimental results and numerical results by other methods.

Download Full-text

Parallel Implementation of Non-slicing Floorplans with MPI and OpenMP

10.32920/ryerson.14647368 ◽

2021 ◽

Author(s):

Oluvaseun Owojaiye

Keyword(s):

Message Passing ◽

Large Scale ◽

Message Passing Interface ◽

Parallel Implementation ◽

Computation Time ◽

Sequential Algorithm ◽

Design Stage ◽

Single Chip ◽

Solution Quality ◽

Early Design Stage

Advancement in technology has brought considerable improvement to processor design and now manufacturers design multiple processors on a single chip. Supercomputers today consists of cluster of interconnected nodes that collaborate together to solve complex and advanced computation problems. Message Passing Interface and Open Multiprocessing are the popularly used programming models to optimize sequential codes by parallelizing them on the different multiprocessor architecture that exist today. In this thesis, we parallelize the non-slicing floorplan algorithm based on Multilevel Floorplanning/placement of large scale modules using B*tree (MB*tree) with MPI and OpenMP on distributed and shared memory architectures respectively. In VLSI (Very Large Scale Integration) design automation, floorplanning is an initial and vital task performed in the early design stage. Experimental results using MCNC benchmark circuits show that our parallel algorithm produced better results than the corresponding sequential algorithm; we were able to speed up the algorithm up to 4 times, hence reducing computation time and maintaining floorplan solution quality. On the other hand, we compared both parallel versions; and the OpenMP results gave slightly better than the corresponding MPI results.

Download Full-text

A Simulator for Large-Scale Parallel Computer Architectures

International Journal of Distributed Systems and Technologies ◽

10.4018/jdst.2010040104 ◽

2010 ◽

Vol 1 (2) ◽

pp. 57-73 ◽

Cited By ~ 52

Author(s):

Curtis L. Janssen ◽

Helgi Adalsteinsson ◽

Scott Cranford ◽

Joseph P. Kenny ◽

Ali Pinar ◽

...

Keyword(s):

Message Passing ◽

Large Scale ◽

Message Passing Interface ◽

Parallel Execution ◽

Parallel Computer ◽

Coarse Grained ◽

Efficient Design ◽

Standard Format ◽

Trace File ◽

Macro Scale

Efficient design of hardware and software for large-scale parallel execution requires detailed understanding of the interactions between the application, computer, and network. The authors have developed a macro-scale simulator (SST/macro) that permits the coarse-grained study of distributed-memory applications. In the presented work, applications using the Message Passing Interface (MPI) are simulated; however, the simulator is designed to allow inclusion of other programming models. The simulator is driven from either a trace file or a skeleton application. Trace files can be either a standard format (Open Trace Format) or a more detailed custom format (DUMPI). The simulator architecture is modular, allowing it to easily be extended with additional network models, trace file formats, and more detailed processor models. This paper describes the design of the simulator, provides performance results, and presents studies showing how application performance is affected by machine characteristics.

Download Full-text