Shared Memory Transport for ALFA

The high data rates expected for the next generation of particle physics experiments (e.g. new experiments at FAIR/GSI and the upgrade of the CERN experiments) call for dedicated attention to the design of the required computing infrastructure. The common ALICE-FAIR framework ALFA is a modern software layer that serves as a platform for simulation, reconstruction and analysis of particle physics experiments. Besides the standard services needed for simulation and reconstruction, ALFA also provides tools for data transport, configuration and deployment. The FairMQ module in ALFA offers building blocks for creating distributed software components (processes) that communicate with each other via message passing. The abstract message passing interface in FairMQ currently has three implementations: ZeroMQ, nanomsg and shared memory. This contribution presents the newly developed shared memory transport, which provides significant performance benefits for transferring large data chunks between components on the same node. The implementation in FairMQ allows users to switch between the different transports via a trivial configuration change. The design decisions, implementation details and performance numbers of the shared memory transport in FairMQ/ALFA are highlighted.
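The core idea behind a shared memory transport can be illustrated with a minimal sketch. It uses Python's standard `multiprocessing.shared_memory` module, not FairMQ, and the helper names `send` and `receive` are hypothetical. The payload is written once into a shared segment, and only a small handle (segment name and size) travels through the message channel, which is an assumption about how such a transport might avoid copying large buffers.

```python
import queue
from multiprocessing import shared_memory


def send(payload: bytes, channel: queue.Queue) -> shared_memory.SharedMemory:
    """Place the payload in a shared segment and queue only a small handle."""
    segment = shared_memory.SharedMemory(create=True, size=len(payload))
    segment.buf[:len(payload)] = payload
    # Only the segment name and payload size go through the channel.
    channel.put((segment.name, len(payload)))
    return segment  # the sender keeps ownership and must unlink it later


def receive(channel: queue.Queue) -> bytes:
    """Attach to the segment named in the handle and read the payload."""
    name, size = channel.get()
    segment = shared_memory.SharedMemory(name=name)
    try:
        # The explicit size matters: the OS may round the segment size up.
        return bytes(segment.buf[:size])
    finally:
        segment.close()


if __name__ == "__main__":
    channel = queue.Queue()
    owner = send(b"detector-data" * 1000, channel)
    data = receive(channel)
    owner.close()
    owner.unlink()  # release the segment once the receiver is done
    print(len(data))
```

In this sketch both ends live in one process for simplicity; between real processes the handle would travel over a pipe or socket while the payload stays in the shared segment.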


RDMA-accelerated data transport in ALFA

EPJ Web of Conferences ◽ 10.1051/epjconf/201921405022 ◽ 2019 ◽ Vol 214 ◽ pp. 05022

Author(s): Dennis Klein ◽ Alexey Rybalchenko ◽ Mohammad Al-Turany ◽ Thorsten Kollegger

Keyword(s): Shared Memory ◽ Particle Physics ◽ Distributed Processing ◽ Building Blocks ◽ Data Transport ◽ High Data ◽ Data Throughput ◽ Data Rates ◽ High Bandwidth ◽ Physics Experiments

ALFA is a modern software platform for simulation, reconstruction and analysis of particle physics experiments. The FairMQ library in ALFA provides building blocks for distributed processing pipelines in anticipation of high data rates in next-generation, trigger-less FAIR and LHC Run 3 ALICE experiments. Modern data transport technologies are integrated through FairMQ by implementing an abstract, message-queuing-based transport interface. Current implementations are based on ZeroMQ, nanomsg and shared memory and can be selected at run time. In order to achieve the highest inter-node data throughput on high-bandwidth network fabrics (e.g. InfiniBand), we propose a new FairMQ transport implementation based on the libfabric technology.
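Selecting a transport at run time behind an abstract interface can be sketched as follows. This is an illustration of the general pattern only: the class names, the `TRANSPORTS` registry and the `"transport"` configuration key are hypothetical and do not reflect FairMQ's actual API. The two toy implementations differ only in whether they store a private copy of the payload or a reference to it.

```python
from abc import ABC, abstractmethod
from collections import deque


class Transport(ABC):
    """Hypothetical abstract message-passing interface."""

    @abstractmethod
    def send(self, payload: bytes) -> None: ...

    @abstractmethod
    def receive(self) -> bytes: ...


class CopyingTransport(Transport):
    """Toy transport that stores a private copy of every message."""

    def __init__(self) -> None:
        self._queue = deque()

    def send(self, payload: bytes) -> None:
        self._queue.append(bytearray(payload))  # bytearray() copies the data

    def receive(self) -> bytes:
        return bytes(self._queue.popleft())


class ReferenceTransport(Transport):
    """Toy transport that queues only a reference to the caller's buffer."""

    def __init__(self) -> None:
        self._queue = deque()

    def send(self, payload: bytes) -> None:
        self._queue.append(payload)  # no copy, just a reference

    def receive(self) -> bytes:
        return self._queue.popleft()


# Registry mapping a configuration string to an implementation.
TRANSPORTS = {"copy": CopyingTransport, "reference": ReferenceTransport}


def make_transport(name: str) -> Transport:
    """Instantiate the transport named in the configuration."""
    try:
        return TRANSPORTS[name]()
    except KeyError:
        raise ValueError(f"unknown transport: {name!r}") from None


def run_pipeline(config: dict) -> bytes:
    """User code depends only on the abstract interface."""
    transport = make_transport(config["transport"])
    transport.send(b"event-data")
    return transport.receive()
```

Because `run_pipeline` touches only the abstract interface, switching implementations means changing a single configuration value.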


The MPI and OpenMP Implementation of Parallel Algorithm for Generating Mandelbrot Set

Applied Mechanics and Materials ◽ 10.4028/www.scientific.net/amm.571-572.26 ◽ 2014 ◽ Vol 571-572 ◽ pp. 26-29

Author(s): Xiang Wei Duan ◽ Wei Chang Shen ◽ Jun Guo

Keyword(s): Parallel Algorithm ◽ Shared Memory ◽ Message Passing ◽ Message Passing Interface ◽ Algorithm Design ◽ Performance Testing ◽ Mandelbrot Set ◽ The Difference ◽ And Performance

This paper introduces the Mandelbrot set, the Message Passing Interface (MPI) and the shared-memory model OpenMP. It analyses the characteristics of algorithm design in the MPI and OpenMP environments, describes the implementation of a parallel algorithm for generating the Mandelbrot set in each environment, reports a series of evaluations and performance tests conducted at run time, and compares the two implementations.
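The Mandelbrot set is a common parallelization exercise because every pixel can be computed independently. The sketch below shows a row-wise decomposition using Python's `concurrent.futures` thread pool as a stand-in; it is neither the paper's MPI nor its OpenMP implementation. The function names, the viewport bounds and the iteration limit are assumptions chosen for illustration. In CPython, threads illustrate the decomposition but will not speed up this CPU-bound loop.

```python
from concurrent.futures import ThreadPoolExecutor


def escape_time(c: complex, max_iter: int = 50) -> int:
    """Iterations until |z| exceeds 2, or max_iter if the orbit stays bounded."""
    z = 0j
    for n in range(max_iter):
        if abs(z) > 2.0:
            return n
        z = z * z + c
    return max_iter


def render_row(y: int, width: int, height: int) -> list:
    """Compute one image row over the region [-2, 1] x [-1.5, 1.5]."""
    imag = -1.5 + 3.0 * y / (height - 1)
    return [
        escape_time(complex(-2.0 + 3.0 * x / (width - 1), imag))
        for x in range(width)
    ]


def render_serial(width: int, height: int) -> list:
    """Reference implementation: rows computed one after another."""
    return [render_row(y, width, height) for y in range(height)]


def render_parallel(width: int, height: int, workers: int = 4) -> list:
    """Rows are independent, so they can be handed out to a worker pool."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda y: render_row(y, width, height), range(height)))
```

In a message passing version each process would own a block of rows and send its results to a root process, whereas a shared-memory version writes rows directly into a common image array.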


A lightweight approach to performance portability with targetDP

The International Journal of High Performance Computing Applications ◽ 10.1177/1094342016682071 ◽ 2016 ◽ Vol 32 (2) ◽ pp. 288-301

Author(s): Alan Gray ◽ Kevin Stratford

Keyword(s): Particle Physics ◽ Message Passing ◽ Graphics Processing Units ◽ High Performance ◽ Large Scale ◽ Message Passing Interface ◽ Graphics Processing Unit ◽ Processing Unit ◽ Performance Portability ◽ Graphics Processing

Leading high performance computing systems achieve their status through use of highly parallel devices such as NVIDIA graphics processing units or Intel Xeon Phi many-core CPUs. The concept of performance portability across such architectures, as well as traditional CPUs, is vital for the application programmer. In this paper we describe targetDP, a lightweight abstraction layer which allows grid-based applications to target data parallel hardware in a platform agnostic manner. We demonstrate the effectiveness of our pragmatic approach by presenting performance results for a complex fluid application (with which the model was co-designed), plus a separate lattice quantum chromodynamics particle physics code. For each application, a single source code base is seen to achieve portable performance, as assessed within the context of the Roofline model. TargetDP can be combined with Message Passing Interface (MPI) to allow use on systems containing multiple nodes: we demonstrate this through provision of scaling results on traditional and graphics processing unit-accelerated large scale supercomputers.
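The Roofline model mentioned above bounds attainable performance by the smaller of the peak compute rate and the product of memory bandwidth and arithmetic intensity. A minimal sketch of that bound follows; the function names and the numbers in the usage example are illustrative assumptions, not values from the paper.

```python
def roofline(peak_flops: float, mem_bandwidth: float, intensity: float) -> float:
    """Attainable FLOP/s for a kernel with the given arithmetic intensity.

    peak_flops    : peak compute rate in FLOP/s
    mem_bandwidth : peak memory bandwidth in bytes/s
    intensity     : arithmetic intensity in FLOP per byte moved
    """
    return min(peak_flops, mem_bandwidth * intensity)


def ridge_point(peak_flops: float, mem_bandwidth: float) -> float:
    """Intensity at which a kernel stops being memory bound."""
    return peak_flops / mem_bandwidth


if __name__ == "__main__":
    # Hypothetical device: 1000 GFLOP/s peak, 100 GB/s memory bandwidth.
    print(roofline(1000e9, 100e9, 2.0))   # memory-bound kernel
    print(roofline(1000e9, 100e9, 20.0))  # compute-bound kernel
    print(ridge_point(1000e9, 100e9))
```

Comparing a kernel's measured rate against this bound on each architecture is one way to judge whether a single source code base achieves portable performance.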


A parallelization scheme to simulate reactive transport in the subsurface environment with OGS#IPhreeqc

Geoscientific Model Development Discussions ◽ 10.5194/gmdd-8-2369-2015 ◽ 2015 ◽ Vol 8 (3) ◽ pp. 2369-2402

Author(s): W. He ◽ C. Beyer ◽ J. H. Fleckenstein ◽ E. Jang ◽ O. Kolditz ◽ ...

Keyword(s): Message Passing ◽ Reactive Transport ◽ Message Passing Interface ◽ Transport Processes ◽ Coupled Processes ◽ Scientific Software ◽ Geochemical Reactions ◽ Optimized Allocation ◽ And Performance ◽ The One

This technical paper presents an efficient and performance-oriented method to model reactive mass transport processes in environmental and geotechnical subsurface systems. The open source scientific software packages OpenGeoSys and IPhreeqc have been coupled to combine their individual strengths and features, in order to simulate thermo-hydro-mechanical-chemical coupled processes in porous and fractured media with simultaneous consideration of aqueous geochemical reactions. Furthermore, a flexible parallelization scheme using MPI (Message Passing Interface) grouping techniques has been implemented, which allows an optimized allocation of computer resources for the node-wise calculation of chemical reactions on the one hand, and the underlying processes such as groundwater flow or solute transport on the other hand. The coupling interface and parallelization scheme have been tested and verified in terms of precision and performance.
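The abstract does not give the details of the grouping scheme, so the following is only a sketch of one plausible arrangement, written in plain Python without an MPI library. It assumes ranks are divided into a flow/transport group and a chemistry group, and that mesh nodes are distributed in contiguous blocks across the chemistry ranks. The helper names `assign_groups` and `split_nodes` are hypothetical.

```python
def assign_groups(n_ranks: int, n_chem_ranks: int) -> list:
    """Return one colour per rank: 0 = flow/transport, 1 = chemistry.

    In an MPI program such a colour would typically be passed to
    MPI_Comm_split to build one communicator per group.
    """
    if not 0 < n_chem_ranks < n_ranks:
        raise ValueError("each group needs at least one rank")
    n_flow_ranks = n_ranks - n_chem_ranks
    return [0 if rank < n_flow_ranks else 1 for rank in range(n_ranks)]


def split_nodes(n_nodes: int, n_workers: int) -> list:
    """Split mesh-node indices into contiguous blocks of near-equal size."""
    base, extra = divmod(n_nodes, n_workers)
    blocks, start = [], 0
    for worker in range(n_workers):
        # The first `extra` workers take one additional node each.
        size = base + (1 if worker < extra else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks


if __name__ == "__main__":
    colours = assign_groups(n_ranks=6, n_chem_ranks=4)
    blocks = split_nodes(n_nodes=10, n_workers=colours.count(1))
    print(colours, [list(b) for b in blocks])
```

Decoupling the two group sizes lets the number of ranks devoted to node-wise chemistry be tuned independently of the number used for flow and transport.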


MPI to Coarray Fortran: Experiences with a CFD Solver for Unstructured Meshes

Scientific Programming ◽ 10.1155/2017/3409647 ◽ 2017 ◽ Vol 2017 ◽ pp. 1-12 ◽ Cited By ~ 1

Author(s): Anuj Sharma ◽ Irene Moulitsas

Keyword(s): High Resolution ◽ Message Passing ◽ Message Passing Interface ◽ Parallel Implementation ◽ Unstructured Meshes ◽ Navier Stokes ◽ Performance Measurements ◽ Partitioned Global Address Space ◽ Computational Fluid Dynamics Cfd ◽ And Performance

High-resolution numerical methods and unstructured meshes are required in many applications of Computational Fluid Dynamics (CFD). These methods are quite computationally expensive and hence benefit from being parallelized. Message Passing Interface (MPI) has been utilized traditionally as a parallelization strategy. However, the inherent complexity of MPI contributes further to the existing complexity of the CFD scientific codes. The Partitioned Global Address Space (PGAS) parallelization paradigm was introduced in an attempt to improve the clarity of the parallel implementation. We present our experiences of converting an unstructured high-resolution compressible Navier-Stokes CFD solver from MPI to PGAS Coarray Fortran. We present the challenges, methodology, and performance measurements of our approach using Coarray Fortran. With the Cray compiler, we observe Coarray Fortran as a viable alternative to MPI. We are hopeful that Intel and open-source implementations could be utilized in the future.


Design and Performance Evaluation of a 10GHz 32nm-CNTFET IR-UWB Transmitter for Inter-Chip Wireless Communication

Advanced Materials Research ◽ 10.4028/www.scientific.net/amr.646.228 ◽ 2013 ◽ Vol 646 ◽ pp. 228-234

Author(s): Fahim Rahman ◽ Prodyut Das ◽ Md. Forhad Hossain ◽ Sazzaduzzaman Khan ◽ Rajib Chowdhury

Keyword(s): Performance Evaluation ◽ High Speed ◽ Pulse Generator ◽ Field Effect Transistors ◽ Logic Gates ◽ Pulse Amplitude ◽ Building Blocks ◽ Data Rate ◽ High Data ◽ And Performance

In this paper, we present the design and performance evaluation of a 10GHz 32nm-CNTFET IR-UWB transmitter for inter-chip wireless transmission. We designed the transmitter using a VCO-based high-speed clock generator and a positive and a negative monocycle Gaussian pulse generator. RF-compatible Carbon Nano-Tube Field Effect Transistors (CNTFETs) are used as the building blocks of the oscillator and the logic gates. The final design is a 7-channel-SWNT CNTFET-based transmitter for an optimum 10GHz data rate, with a promising 650mV pulse amplitude, only 1.069mW power consumption and a -32.27dB output. This transmitter can also operate satisfactorily up to 15GHz. The results show promising superiority over existing transmitters regarding high data rate, low power loss and high pulse amplitude.


VISUAL PROGRAMMING FOR MESSAGE-PASSING SYSTEMS

International Journal of Software Engineering and Knowledge Engineering ◽ 10.1142/s0218194099000231 ◽ 1999 ◽ Vol 09 (04) ◽ pp. 397-423 ◽ Cited By ~ 8

Author(s): NENAD STANKOVIC ◽ KANG ZHANG

Keyword(s): Message Passing ◽ Message Passing Interface ◽ Direct Interaction ◽ Visual Programming ◽ Levels Of Abstraction ◽ Concrete Objects ◽ Real Objects ◽ And Performance ◽ Flow Graphs ◽ Parallel Debugging

The attractiveness of visual programming stems in large part from the direct interaction with program elements as if they were real objects, since people deal better with concrete objects than with abstract ones. This paper describes Visper, a new graph-based software visualization tool for parallel message-passing programming that combines the levels of abstraction at which message-passing parallel programs are expressed and makes use of compositional programming. Central to the tool is the Process Communication Graph, which correlates the control and data flow graphs in a single graph formalism, without the need for complex textual annotation. The graph can express static and runtime communication and replication structures, as found in Message Passing Interface (MPI) and Parallel Virtual Machine (PVM). It also forms the basis for visualizing parallel debugging and performance.


Implementation and Performance of DSMPI

Scientific Programming ◽ 10.1155/1997/452521 ◽ 1997 ◽ Vol 6 (2) ◽ pp. 201-214 ◽ Cited By ~ 2

Author(s): Luis M. Silva ◽ João Gabriel Silva ◽ Simon Chapple

Keyword(s): Shared Memory ◽ Message Passing ◽ Distributed Memory ◽ Programming Model ◽ Distributed Shared Memory ◽ Memory Systems ◽ Distributed Memory Machines ◽ Coherence Protocols ◽ And Performance ◽ Performance Results

Distributed shared memory (DSM) has been recognized as an alternative programming model to exploit the parallelism in distributed memory systems because it provides a higher level of abstraction than simple message passing. DSM combines the simple programming model of shared memory with the scalability of distributed memory machines. This article presents DSMPI, a parallel library that runs atop MPI and provides a DSM abstraction. It provides an easy-to-use programming interface, is fully portable, and supports heterogeneity. For the sake of flexibility, it supports different coherence protocols and models of consistency. We present performance results obtained on a network of workstations and on a Cray T3D, which show that DSMPI can be competitive with MPI for some applications.


A Hybrid MPI–OpenMP Parallel Algorithm and Performance Analysis for an Ensemble Square Root Filter Designed for Multiscale Observations

Journal of Atmospheric and Oceanic Technology ◽ 10.1175/jtech-d-12-00165.1 ◽ 2013 ◽ Vol 30 (7) ◽ pp. 1382-1397 ◽ Cited By ~ 14

Author(s): Yunheng Wang ◽ Youngsun Jung ◽ Timothy A. Supinie ◽ Ming Xue

Keyword(s): Data Assimilation ◽ Domain Decomposition ◽ Parallel Algorithm ◽ Shared Memory ◽ Message Passing ◽ Message Passing Interface ◽ High Volume ◽ Square Root ◽ Fixed Amount ◽ Square Root Filter

A hybrid parallel scheme for the ensemble square root filter (EnSRF) suitable for parallel assimilation of multiscale observations, including those from dense observational networks such as those of radar, is developed based on the domain decomposition strategy. The scheme handles internode communication through a message passing interface (MPI) and the communication within shared-memory nodes via Open Multiprocessing (OpenMP) threads. It also supports pure MPI and pure OpenMP modes. The parallel framework can accommodate high-volume remote-sensed radar (or satellite) observations as well as conventional observations that usually have larger covariance localization radii. The performance of the parallel algorithm has been tested with simulated and real radar data. The parallel program shows good scalability in pure MPI and hybrid MPI–OpenMP modes, while pure OpenMP runs exhibit limited scalability on a symmetric shared-memory system. It is found that in MPI mode, better parallel performance is achieved with domain decomposition configurations in which the leading dimension of the state variable arrays is larger, because this configuration allows for more efficient memory access. Given a fixed amount of computing resources, the hybrid parallel mode is preferred to pure MPI mode on supercomputers with nodes containing shared-memory cores. The overall performance is also affected by factors such as the cache size, memory bandwidth, and the networking topology. Tests with a real data case with a large number of radars confirm that the parallel data assimilation can be done on a multicore supercomputer with a significant speedup compared to the serial data assimilation algorithm.


A GPU-Based Gibbs Sampler for a Unidimensional IRT Model

International Scholarly Research Notices ◽ 10.1155/2014/368149 ◽ 2014 ◽ Vol 2014 ◽ pp. 1-11

Author(s): Yanyan Sheng ◽ William S. Welling ◽ Michelle M. Zhu

Keyword(s): Message Passing ◽ Large Scale ◽ Message Passing Interface ◽ Cost Effective ◽ Communication Overhead ◽ Graphic Processing Units ◽ Data Dependencies ◽ High Data ◽ Irt Models ◽ Fully Bayesian

Item response theory (IRT) is a popular approach for addressing large-scale statistical problems in psychometrics as well as in other fields. The fully Bayesian approach to estimating IRT models is usually expensive in both memory and computation because of the large number of iterations, which limits the use of the procedure in many applications. To overcome this limitation, previous studies used the message passing interface (MPI) on a distributed-memory Linux cluster to achieve certain speedups. However, given the high data dependencies in a single Markov chain for IRT models, the communication overhead grows rapidly as the number of cluster nodes increases. This makes it difficult to further improve performance under such a parallel framework. This study aims to tackle the problem using massively multi-core graphics processing units (GPUs), which are practical, cost-effective, and convenient in actual applications. Performance comparisons among serial CPU, MPI, and compute unified device architecture (CUDA) programs demonstrate that the CUDA GPU approach has many advantages over the CPU-based approach and is therefore preferred.
