A GPU-Based Gibbs Sampler for a Unidimensional IRT Model

Item response theory (IRT) is a popular approach for addressing large-scale statistical problems in psychometrics and other fields. The fully Bayesian approach to estimating IRT models is usually memory-intensive and computationally expensive because of the large number of iterations required, which limits its use in many applications. To overcome this limitation, previous studies used the message passing interface (MPI) on a distributed-memory Linux cluster to achieve certain speedups. However, given the high data dependencies in a single Markov chain for IRT models, the communication overhead grows rapidly as the number of cluster nodes increases, making it difficult to improve performance further under such a parallel framework. This study tackles the problem using many-core graphics processing units (GPUs), which are practical, cost-effective, and convenient in actual applications. Performance comparisons among serial CPU, MPI, and compute unified device architecture (CUDA) programs demonstrate that the CUDA GPU approach has many advantages over the CPU-based approach and is therefore preferred.

High Performance Gibbs Sampling for IRT Models Using Row-Wise Decomposition

ISRN Computational Mathematics ◽

10.5402/2012/264040 ◽

2012 ◽

Vol 2012 ◽

pp. 1-9 ◽

Cited By ~ 2

Author(s):

Yanyan Sheng ◽

Mona Rahimi

Keyword(s):

High Performance ◽

Communication Overhead ◽

Data Dependencies ◽

Decomposition Scheme ◽

High Data ◽

IRT Models ◽

Multiple Processors ◽

The Cost ◽

Fully Bayesian ◽

Performance Computing

Item response theory (IRT) is a popular approach for addressing statistical problems in psychometrics and other fields. The fully Bayesian approach to estimating IRT models is computationally expensive, which limits its use in real applications. A previous study showed that high performance computing can reduce the execution time, achieving a considerable speedup through the use of multiple processors. Given the high data dependencies in a single Markov chain for IRT models, communication overhead among processors cannot be avoided. This study aims to reduce that overhead by using a row-wise decomposition scheme. The results suggest that the proposed approach increased the speedup and the efficiency of each implementation while minimizing the cost and the total overhead. This further sheds light on developing high performance Gibbs samplers for more complicated IRT models.
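The four quantities named in this abstract (speedup, efficiency, cost, and total overhead) have standard textbook definitions in parallel computing. The sketch below computes them from timings; it is an illustration under that assumption, not the authors' code, and the function name `parallel_metrics`, the class `ParallelMetrics`, and the example timings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParallelMetrics:
    speedup: float         # T_serial / T_parallel
    efficiency: float      # speedup / p
    cost: float            # p * T_parallel (processor-time product)
    total_overhead: float  # cost - T_serial


def parallel_metrics(t_serial: float, t_parallel: float, processors: int) -> ParallelMetrics:
    """Compute standard parallel performance metrics from wall-clock times.

    t_serial   : run time of the serial program
    t_parallel : run time of the parallel program on `processors` processors
    """
    if t_serial <= 0 or t_parallel <= 0 or processors < 1:
        raise ValueError("times must be positive and processors must be >= 1")
    speedup = t_serial / t_parallel
    efficiency = speedup / processors
    cost = processors * t_parallel
    total_overhead = cost - t_serial
    return ParallelMetrics(speedup, efficiency, cost, total_overhead)


if __name__ == "__main__":
    # Hypothetical timings (seconds), chosen only to exercise the formulas.
    print(parallel_metrics(t_serial=120.0, t_parallel=40.0, processors=4))
```

Under these definitions, a decomposition that reduces communication lowers the parallel run time, which raises speedup and efficiency and lowers cost and total overhead together.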

A lightweight approach to performance portability with targetDP

The International Journal of High Performance Computing Applications ◽

10.1177/1094342016682071 ◽

2016 ◽

Vol 32 (2) ◽

pp. 288-301

Author(s):

Alan Gray ◽

Kevin Stratford

Keyword(s):

Particle Physics ◽

Message Passing ◽

Graphics Processing Units ◽

High Performance ◽

Large Scale ◽

Message Passing Interface ◽

Graphics Processing Unit ◽

Processing Unit ◽

Performance Portability ◽

Graphics Processing

Leading high performance computing systems achieve their status through the use of highly parallel devices such as NVIDIA graphics processing units or Intel Xeon Phi many-core CPUs. The concept of performance portability across such architectures, as well as traditional CPUs, is vital for the application programmer. In this paper we describe targetDP, a lightweight abstraction layer which allows grid-based applications to target data parallel hardware in a platform-agnostic manner. We demonstrate the effectiveness of our pragmatic approach by presenting performance results for a complex fluid application (with which the model was co-designed), plus a separate lattice quantum chromodynamics particle physics code. For each application, a single source code base is seen to achieve portable performance, as assessed within the context of the Roofline model. TargetDP can be combined with Message Passing Interface (MPI) to allow use on systems containing multiple nodes: we demonstrate this through provision of scaling results on traditional and graphics processing unit-accelerated large-scale supercomputers.
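For readers unfamiliar with the Roofline model mentioned in this abstract, its basic form bounds attainable performance by the smaller of peak compute throughput and memory bandwidth multiplied by arithmetic intensity. The sketch below shows only that general formula; the function names and the hardware figures are hypothetical and are not taken from the paper.

```python
def roofline_bound(peak_gflops: float, bandwidth_gbs: float, arithmetic_intensity: float) -> float:
    """Attainable performance (GFLOP/s) under the basic Roofline model.

    peak_gflops          : peak floating-point throughput of the device
    bandwidth_gbs        : peak memory bandwidth in GB/s
    arithmetic_intensity : FLOPs performed per byte moved to or from memory
    """
    return min(peak_gflops, bandwidth_gbs * arithmetic_intensity)


def fraction_of_roofline(achieved_gflops: float, bound_gflops: float) -> float:
    """Share of the Roofline bound that a measured kernel actually reaches."""
    return achieved_gflops / bound_gflops


if __name__ == "__main__":
    # Hypothetical device: 1000 GFLOP/s peak, 100 GB/s memory bandwidth.
    for intensity in (0.5, 2.0, 20.0):
        bound = roofline_bound(1000.0, 100.0, intensity)
        print(f"intensity={intensity} FLOP/byte -> bound={bound} GFLOP/s")
```

Comparing the same kernel's fraction of the bound on each architecture is one way to judge whether a single code base performs portably.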

Parallelization of the Lattice Boltzmann Method in Simulating Buoyancy-Driven Convection Heat Transfer

Heat Transfer, Volume 2 ◽

10.1115/imece2004-61871 ◽

2004 ◽

Author(s):

Anoosheh Niavarani-Kheirier ◽

Masoud Darbandi ◽

Gerry E. Schneider

Keyword(s):

Lattice Boltzmann Method ◽

Lattice Boltzmann ◽

Message Passing ◽

Large Scale ◽

Message Passing Interface ◽

Parallel Machines ◽

Convection Heat Transfer ◽

Wide Range ◽

Buoyancy Driven Convection ◽

Boltzmann Method

The main objective of the current work is to use the lattice Boltzmann method (LBM) to simulate buoyancy-driven flow with the hybrid thermal lattice Boltzmann equation (HTLBE). The required formulations are derived and then validated over a wide range of Rayleigh numbers in the buoyancy-driven square cavity problem. The performance of the method is investigated on parallel machines using the Message Passing Interface (MPI) library and a domain decomposition technique to solve computationally demanding problems. The results show that the code solves large-scale problems efficiently, with excellent speedup.

Providing Quantitative Scalability Improvement of Consistency Control for Large-Scale, Replication-Based Grid Systems

Quantitative Quality of Service for Grid Computing ◽

10.4018/978-1-60566-370-8.ch005 ◽

2011 ◽

pp. 91-111

Author(s):

Yijun Lu ◽

Hong Jiang ◽

Ying Lu

Keyword(s):

Response Time ◽

Large Scale ◽

Cost Effective ◽

Communication Overhead ◽

Control Mechanisms ◽

Test Bed ◽

Consistency Maintenance ◽

Grid Systems ◽

Consistency Control ◽

Lab Test

Consistency control is important in replication-based Grid systems because it provides QoS guarantees. However, conventional consistency control mechanisms incur high communication overhead and are ill suited for large-scale dynamic Grid systems. In this chapter, the authors propose CVRetrieval (Consistency View Retrieval) to provide quantitative scalability improvement of consistency control for large-scale, replication-based Grid systems. Based on the observation that not all participants are equally active or engaged in distributed online collaboration, CVRetrieval differentiates the notions of consistency maintenance and consistency retrieval. Here, consistency maintenance implies a protocol that periodically communicates with all participants to maintain a certain consistency level, and consistency retrieval means that passive participants explicitly request consistent views from the system when the need arises instead of joining the expensive consistency maintenance protocol all the time. The rationale is that it is much more cost-effective to satisfy a passive participant’s need on demand. The evaluation of CVRetrieval is done in two parts. First, a scalability analysis shows that CVRetrieval can greatly reduce communication cost and hence make consistency control more scalable. Second, a prototype of CVRetrieval is deployed on the Planet-Lab test-bed, and the results show that the active participants experience a short response time at the expense of the passive participants, who may encounter a longer response time.

Interpretive MPI for Parallel Computing

Volume 3: 28th Computers and Information in Engineering Conference, Parts A and B ◽

10.1115/detc2008-49996 ◽

2008 ◽

Author(s):

Yu-Cheng Chou ◽

Harry H. Cheng

Keyword(s):

Parallel Computing ◽

Programming Languages ◽

Message Passing ◽

Large Scale ◽

Message Passing Interface ◽

Rapid Development ◽

Web Based ◽

Heterogeneous Platforms ◽

C Programs ◽

Computation Speedup

Message Passing Interface (MPI) is a standardized library specification designed for message-passing parallel programming on large-scale distributed systems. A number of MPI libraries have been implemented to allow users to develop portable programs in the scientific programming languages Fortran, C, and C++. Ch is an embeddable C/C++ interpreter that provides an interpretive environment for C/C++ based scripts and programs. Combining Ch with any MPI C/C++ library allows rapid development of MPI C/C++ programs without compilation. In this article, the method of interfacing Ch scripts with MPI C implementations is introduced using the MPICH2 C library as an example. The MPICH2-based Ch MPI package lets users interpretively run MPI C programs based on the MPICH2 C library. Running MPI programs through the MPICH2-based Ch MPI package across heterogeneous platforms consisting of Linux and Windows machines is illustrated. Comparisons of the bandwidth, latency, and parallel computation speedup of C MPI, Ch MPI, and MPI for Python in an Ethernet-based environment comprising identical Linux machines are presented. A Web-based example is given to demonstrate the use of Ch and MPICH2 in C-based CGI scripting to facilitate the development of Web-based applications for parallel computing.

Angara interconnect makes GPU-based Desmos supercomputer an efficient tool for molecular dynamics calculations

The International Journal of High Performance Computing Applications ◽

10.1177/1094342019826667 ◽

2019 ◽

Vol 33 (3) ◽

pp. 507-521 ◽

Cited By ~ 10

Author(s):

Vladimir Stegailov ◽

Ekaterina Dlinnova ◽

Timur Ismagilov ◽

Mikhail Khalilov ◽

Nikolay Kondratyuk ◽

...

Keyword(s):

Molecular Dynamics ◽

Message Passing ◽

High Performance ◽

Message Passing Interface ◽

Job Scheduling ◽

Cost Effective ◽

Test Bed ◽

Molecular Dynamics Calculations ◽

High Bandwidth ◽

Network Topologies

In this article, we describe the Desmos supercomputer, which consists of 32 hybrid nodes connected by a low-latency, high-bandwidth Angara interconnect with torus topology. This supercomputer is aimed at cost-effective classical molecular dynamics calculations. Desmos serves as a test bed for the Angara interconnect, which supports 3-D and 4-D torus network topologies, and verifies its ability to unite massively parallel programming systems, effectively speeding up message-passing interface (MPI)-based applications. We describe the Angara interconnect and present typical MPI benchmarks. Desmos benchmark results for GROMACS, LAMMPS, VASP and CP2K are compared with the data for other high-performance computing (HPC) systems. We also consider the job scheduling statistics for several months of Desmos deployment.

Comparing Message Passing Interface and MapReduce for large-scale parallel ranking and selection

2015 Winter Simulation Conference (WSC) ◽

10.1109/wsc.2015.7408542 ◽

2015 ◽

Cited By ~ 2

Author(s):

Eric C. Ni ◽

Dragos F. Ciocan ◽

Shane G. Henderson ◽

Susan R. Hunter

Keyword(s):

Message Passing ◽

Large Scale ◽

Message Passing Interface ◽

Ranking And Selection

Analysis of the Calculation of a Plasma Sheath Using the Parallel SO-DGTD Method

International Journal of Antennas and Propagation ◽

10.1155/2019/7160913 ◽

2019 ◽

Vol 2019 ◽

pp. 1-9

Author(s):

Qian Yang ◽

Bing Wei ◽

Linqian Li ◽

Debiao Ge

Keyword(s):

Discontinuous Galerkin ◽

Cross Section ◽

Time Domain ◽

Message Passing ◽

High Speed ◽

Large Scale ◽

Message Passing Interface ◽

Shift Operator ◽

Plasma Sheath ◽

Blunt Cone

The plasma sheath is a popular topic in computational electromagnetics, and the plasma case is more resource-intensive than the non-plasma case. In this paper, a parallel shift-operator discontinuous Galerkin time-domain method using the MPI (Message Passing Interface) library is proposed to solve large-scale plasma problems. To demonstrate the algorithm, a plasma sheath model of a high-speed blunt cone was established based on the results of multiphysics software, and the algorithm was used to extract the radar cross-section (RCS) of the model at different incident angles.

Model Order Reduction of Large-Scale Finite Element Systems in an MPI Parallelized Environment for Usage in Multibody Simulation

Archive of Mechanical Engineering ◽

10.1515/meceng-2016-0027 ◽

2016 ◽

Vol 63 (4) ◽

pp. 475-494 ◽

Cited By ~ 1

Author(s):

Thomas Volzer ◽

Peter Eberhard

Keyword(s):

Finite Element ◽

Model Reduction ◽

Message Passing ◽

Large Scale ◽

Message Passing Interface ◽

Block Size ◽

Reduction Process ◽

Element Model ◽

Multibody Simulation ◽

Elastic Bodies

The use of elastic bodies within multibody simulation has become increasingly important in recent years. To include elastic bodies, described as finite element models, in multibody simulations, the dimension of the system of ordinary differential equations must be reduced by projection. For this purpose, this work uses the modal reduction method, a component mode synthesis (CMS) based method, and a moment-matching method. Because the size of the non-reduced systems keeps increasing, calculating the projection matrix demands large computational resources and cannot be done on usual serial computers with available memory. In this paper, the model reduction software Morembs++ is presented, which uses a parallelization concept based on the message passing interface to meet the memory requirements and reduce the runtime of the model reduction process. Additionally, the behaviour of the Block-Krylov-Schur eigensolver, implemented in the Anasazi package of the Trilinos project, is analysed with regard to the choice of the size of the Krylov base, the block size and the number of blocks. An iterative solver is also considered within the CMS-based method.

Demonstration of cluster computing for three-dimensional CFD simulations

The Aeronautical Journal ◽

10.1017/s0001924000028037 ◽

1999 ◽

Vol 103 (1027) ◽

pp. 443-447 ◽

Cited By ~ 5

Author(s):

W. McMillan ◽

M. Woodgate ◽

B. E. Richards ◽

B. J. Gribben ◽

K. J. Badcock ◽

...

Keyword(s):

Message Passing ◽

Large Scale ◽

Cluster Computing ◽

Low Cost ◽

Three Dimensional ◽

Cost Effective ◽

Parallel Applications ◽

CFD Simulations ◽

Single Node ◽

Computing Unit

Motivated by a lack of sufficient local and national computing facilities for computational fluid dynamics simulations, the Affordable Systems Computing Unit (ASCU) was established to investigate low-cost alternatives. The options considered have all involved cluster computing, a term which refers to the grouping of a number of components into a managed system capable of running both serial and parallel applications. The present work aims to demonstrate the utility of commodity processors for dedicated batch processing. The cluster has proved to be extremely cost-effective, enabling large three-dimensional flow simulations on a computer costing less than £25k sterling at current market prices. The experience gained on this system in terms of single-node performance, message passing and parallel performance will be discussed. In particular, comparisons with the performance of other systems will be made. Several medium- to large-scale CFD simulations performed using the new cluster will be presented to demonstrate the potential of commodity processor based parallel computers for aerodynamic simulation.
