w-Stacking w-projection hybrid algorithm for wide-field interferometric imaging: implementation details and improvements

Author(s):  
Luke Pratley ◽  
Melanie Johnston-Hollitt ◽  
Jason D. McEwen

Abstract: We present a detailed discussion of the implementation strategies for a recently developed w-stacking w-projection hybrid algorithm used to reconstruct wide-field interferometric images. In particular, we discuss the methodology used to deploy the algorithm efficiently on a supercomputer via a Message Passing Interface (MPI) k-means clustering technique, achieving efficient construction and application of the non-coplanar corrections. Additionally, we show that the use of conjugate symmetry can increase the w-stacking efficiency and decrease the time required to construct and apply w-projection kernels for large data sets. We then demonstrate this implementation by imaging an interferometric observation of Fornax A from the Murchison Widefield Array (MWA). We perform an exact non-coplanar wide-field correction for 126.6 million visibilities using 50 nodes of a computing cluster. The w-projection kernel construction takes only 15 min prior to reconstruction, demonstrating that the implementation is both fast and efficient.
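
To illustrate the clustering step, the following minimal C/MPI sketch (our illustration, not the authors' code) groups visibility w-values into w-stacks with a distributed one-dimensional k-means: each rank assigns its local visibilities to the nearest stack centre, and the centres are updated globally with MPI_Allreduce.

/* Minimal sketch (not the authors' code): a distributed 1-D k-means
 * over visibility w-values, assigning each visibility to a w-stack
 * whose centre defines a shared w-projection kernel. Assumes each
 * rank already holds its own slice of w-values. */
#include <mpi.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define K 8        /* number of w-stacks (clusters)     */
#define ITERS 20   /* fixed iteration count for brevity */

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Synthetic local w-values standing in for a visibility slice. */
    int n_local = 100000;
    double *w = malloc(n_local * sizeof(double));
    srand(rank + 1);
    for (int i = 0; i < n_local; i++)
        w[i] = 200.0 * rand() / RAND_MAX - 100.0;

    double centres[K];
    for (int k = 0; k < K; k++)   /* crude evenly spaced initial centres */
        centres[k] = -100.0 + 200.0 * (k + 0.5) / K;

    for (int it = 0; it < ITERS; it++) {
        double sum[K] = {0.0}, cnt[K] = {0.0};
        for (int i = 0; i < n_local; i++) {
            int best = 0;   /* nearest centre in |w - centre| */
            for (int k = 1; k < K; k++)
                if (fabs(w[i] - centres[k]) < fabs(w[i] - centres[best]))
                    best = k;
            sum[best] += w[i];
            cnt[best] += 1.0;
        }
        /* Combine partial sums from all ranks, then update centres. */
        MPI_Allreduce(MPI_IN_PLACE, sum, K, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        MPI_Allreduce(MPI_IN_PLACE, cnt, K, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);
        for (int k = 0; k < K; k++)
            if (cnt[k] > 0.0) centres[k] = sum[k] / cnt[k];
    }

    if (rank == 0)
        for (int k = 0; k < K; k++)
            printf("w-stack %d centre: %.2f\n", k, centres[k]);

    free(w);
    MPI_Finalize();
    return 0;
}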

2019 ◽  
Vol 214 ◽  
pp. 05029 ◽  
Author(s):  
Alexey Rybalchenko ◽  
Dennis Klein ◽  
Mohammad Al-Turany ◽  
Thorsten Kollegger

The high data rates expected for the next generation of particle physics experiments (e.g., new experiments at FAIR/GSI and the upgrade of the CERN experiments) call for dedicated attention to the design of the required computing infrastructure. The common ALICE-FAIR framework ALFA is a modern software layer that serves as a platform for simulation, reconstruction and analysis of particle physics experiments. Besides the standard services needed for simulation and reconstruction, ALFA also provides tools for data transport, configuration and deployment. The FairMQ module in ALFA offers building blocks for creating distributed software components (processes) that communicate with each other via message passing. The abstract message passing interface in FairMQ currently has three implementations: ZeroMQ, nanomsg and shared memory. The newly developed shared memory transport is presented, which provides significant performance benefits for transferring large data chunks between components on the same node. The implementation in FairMQ allows users to switch between the different transports via a trivial configuration change. The design decisions, implementation details and performance numbers of the shared memory transport in FairMQ/ALFA are highlighted.
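
The zero-copy idea behind a same-node shared-memory transport can be sketched in C with POSIX shared memory (a conceptual illustration only; FairMQ's actual API differs): the producer writes a large chunk into a named segment, so only a small descriptor needs to be exchanged between processes. On Linux this compiles with -lrt.

/* Conceptual sketch of a same-node shared-memory transport (not the
 * FairMQ API): the producer fills a large data chunk in a POSIX
 * shared-memory segment; only a small handle would be exchanged
 * between processes, avoiding a copy of the payload itself. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
    const size_t chunk = 64 * 1024 * 1024;   /* 64 MiB payload */

    /* Create and size the shared segment (the "message" body). */
    int fd = shm_open("/demo_chunk", O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, chunk) != 0) { perror("shm"); return 1; }

    char *buf = mmap(NULL, chunk, PROT_READ | PROT_WRITE,
                     MAP_SHARED, fd, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    memset(buf, 0xAB, chunk);   /* producer fills the chunk in place */

    /* In a real transport, only a small descriptor (segment name,
     * offset, length) is sent to the consumer over a control channel;
     * the consumer maps the same segment and reads with zero copies. */
    printf("chunk of %zu bytes published in /demo_chunk\n", chunk);

    munmap(buf, chunk);
    close(fd);
    shm_unlink("/demo_chunk");
    return 0;
}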


2018 ◽  
Vol 12 (3) ◽  
pp. 74
Author(s):  
Faten Hamad ◽  
Abdelsalam Alawamrah

Evaluating the performance of an algorithm, and of the method used to implement it, plays a major role in assessing many applications: it helps researchers decide which algorithm to use and how to implement it, and it also indicates the performance of the hardware on which the algorithm is tested. In this paper we evaluate the performance of a linear-equation solver on a supercomputer, implemented using the Message Passing Interface (MPI) library. Sequential and multithreaded versions of the algorithm were also tested and their results recorded, and the speedup and efficiency of the algorithm were calculated. The results show that the parallel algorithm outperforms the other methods for a large 8192 × 8192 matrix on 64 processors. For large input sizes, the results also show a noticeable decrease in running time as the number of processors increases. In the multithreaded case, however, the running time increases rapidly with matrix size even as the number of threads grows. This indicates that, for large matrix inputs, the parallel MPI implementation performs best and outperforms the other methods.
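
A common MPI pattern for such solvers, shown below as a minimal sketch (assuming a diagonally dominant system and a matrix dimension divisible by the rank count; not necessarily the paper's algorithm), is a row-distributed Jacobi iteration: each rank updates its own block of unknowns, and MPI_Allgather shares the full solution vector. Given a measured serial time T1, speedup is S = T1/Tp and efficiency is E = S/p, the quantities the paper reports.

/* Minimal sketch: row-distributed Jacobi iteration for A x = b.
 * Assumptions: N divisible by the number of ranks; synthetic
 * diagonally dominant system so the iteration converges. */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

#define N 1024
#define ITERS 100

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = N / size;                 /* rows owned by this rank */
    int r0   = rank * rows;              /* first global row index  */
    double *A  = malloc(rows * N * sizeof(double));
    double *b  = malloc(rows * sizeof(double));
    double *x  = calloc(N, sizeof(double));
    double *xl = malloc(rows * sizeof(double));

    for (int i = 0; i < rows; i++) {     /* build the local block */
        for (int j = 0; j < N; j++) A[i * N + j] = 1.0;
        A[i * N + (r0 + i)] = 2.0 * N;   /* dominant diagonal */
        b[i] = 1.0;
    }

    double t0 = MPI_Wtime();
    for (int it = 0; it < ITERS; it++) {
        for (int i = 0; i < rows; i++) {
            double s = 0.0;
            for (int j = 0; j < N; j++)
                if (j != r0 + i) s += A[i * N + j] * x[j];
            xl[i] = (b[i] - s) / A[i * N + (r0 + i)];
        }
        /* Every rank needs the full updated x for the next sweep. */
        MPI_Allgather(xl, rows, MPI_DOUBLE, x, rows, MPI_DOUBLE,
                      MPI_COMM_WORLD);
    }
    double tp = MPI_Wtime() - t0;

    if (rank == 0)   /* compare Tp with the serial time T1 for speedup */
        printf("p=%d  Tp=%.3fs\n", size, tp);

    free(A); free(b); free(x); free(xl);
    MPI_Finalize();
    return 0;
}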


2016 ◽  
Vol 2 ◽  
pp. 115
Author(s):  
Astika Ayuningtyas

Parallel processing performs two or more tasks simultaneously by making optimal use of computer system resources; one such model is a network of desktop systems, which allows parallel processing across computers with different specifications. A network-of-workstations model is implemented here using MPI (Message Passing Interface). In this study it is applied to low-pass filtering (LPF), an image-processing operation that retains the low-frequency content of an image. The low-pass filtering program, based on the cosine transform, is implemented with MPI by modifying the algorithm run on each node (computer). The test results show that the processing speed of the parallel system is influenced by the number of nodes/processes and by the number of frequency components processed. For a single process, larger workloads take progressively longer, and the cutoff value affects only the amount of high-frequency data filtered out; in parallel processing, as more computers are involved in the low-pass filter calculation, communication time is added to the computation time.
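
The row-wise distribution described above can be sketched in C/MPI as follows (an illustration under our own assumptions, not the study's code): image rows are scattered across ranks, each rank applies a naive cosine transform per row, zeroes the coefficients above a cutoff, inverts the transform, and the filtered rows are gathered back on the root.

/* Illustrative sketch: MPI low-pass filtering of image rows via a
 * naive DCT-II and its inverse. Assumes H divisible by rank count. */
#include <mpi.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define W 256    /* image width                      */
#define H 256    /* image height (divisible by ranks) */
#define CUT 32   /* keep the CUT lowest frequencies   */

static void dct_row(const double *in, double *out)
{
    for (int k = 0; k < W; k++) {
        double s = 0.0;
        for (int n = 0; n < W; n++)
            s += in[n] * cos(M_PI * (n + 0.5) * k / W);
        out[k] = s;
    }
}

static void idct_row(const double *in, double *out)
{
    for (int n = 0; n < W; n++) {
        double s = in[0] / 2.0;
        for (int k = 1; k < W; k++)
            s += in[k] * cos(M_PI * (n + 0.5) * k / W);
        out[n] = 2.0 * s / W;
    }
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    int rows = H / size;
    double *img = NULL;
    if (rank == 0) {                      /* synthetic test image */
        img = malloc(W * H * sizeof(double));
        for (int i = 0; i < W * H; i++) img[i] = rand() % 256;
    }
    double *loc = malloc(W * rows * sizeof(double));
    double *tmp = malloc(W * sizeof(double));

    MPI_Scatter(img, W * rows, MPI_DOUBLE,
                loc, W * rows, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    for (int r = 0; r < rows; r++) {
        dct_row(loc + r * W, tmp);
        for (int k = CUT; k < W; k++) tmp[k] = 0.0;  /* low-pass cut */
        idct_row(tmp, loc + r * W);
    }

    MPI_Gather(loc, W * rows, MPI_DOUBLE,
               img, W * rows, MPI_DOUBLE, 0, MPI_COMM_WORLD);

    if (rank == 0) printf("filtered %dx%d image\n", W, H);
    free(loc); free(tmp); free(img);
    MPI_Finalize();
    return 0;
}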


2019 ◽  
Vol 874 (2) ◽  
pp. 174 ◽  
Author(s):  
Luke Pratley ◽  
Melanie Johnston-Hollitt ◽  
Jason D. McEwen

2020 ◽  
Vol 15 ◽  
Author(s):  
Weiwen Zhang ◽  
Long Wang ◽  
Theint Theint Aye ◽  
Juniarto Samsudin ◽  
Yongqing Zhu

Background: Genotype imputation as a service is developed to enable researchers to estimate genotypes on haplotyped data without performing whole genome sequencing. However, genotype imputation is computation intensive, and it therefore remains a challenge to satisfy the high performance requirements of genome-wide association studies (GWAS). Objective: In this paper, we propose a high performance computing solution for genotype imputation on supercomputers to enhance its execution performance. Method: We design and implement a multi-level parallelization that includes job-level, process-level and thread-level parallelization, enabled by job scheduling management, the Message Passing Interface (MPI) and OpenMP, respectively. It involves job distribution, chunk partition and execution, parallelized iteration for imputation, and data concatenation. Owing to this multi-level design, we can exploit multi-machine/multi-core architectures to improve the performance of genotype imputation. Results: Experimental results show that the proposed method outperforms a Hadoop-based implementation of genotype imputation. Moreover, experiments on supercomputers show that it significantly shortens execution time, thus improving the performance of genotype imputation. Conclusion: The proposed multi-level parallelization, when deployed as an imputation-as-a-service, will help bioinformatics researchers in Singapore conduct genotype imputation and enhance association studies.
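
The process-level and thread-level layers can be combined as in the following hybrid MPI/OpenMP sketch (our illustration; impute_position() is a hypothetical stand-in for the per-site work): MPI ranks take chunks round-robin, OpenMP threads parallelize the loop within each chunk, and a final reduction stands in for the concatenation step.

/* Sketch of the multi-level idea under stated assumptions (chunked
 * input, independent per-site work). Compile with MPI and -fopenmp. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define CHUNKS 64             /* genome split into this many chunks */
#define SITES_PER_CHUNK 10000

/* Hypothetical stand-in for the per-site imputation computation. */
static double impute_position(int chunk, int site)
{
    return (double)chunk * site;   /* placeholder work */
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local = 0.0;
    /* Process level: round-robin chunk assignment across MPI ranks. */
    for (int c = rank; c < CHUNKS; c += size) {
        /* Thread level: OpenMP over sites inside the chunk. */
        #pragma omp parallel for reduction(+:local)
        for (int s = 0; s < SITES_PER_CHUNK; s++)
            local += impute_position(c, s);
    }

    /* Aggregation stands in for the data-concatenation step. */
    double total = 0.0;
    MPI_Reduce(&local, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("aggregate checksum: %.0f\n", total);

    MPI_Finalize();
    return 0;
}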


Energies ◽  
2021 ◽  
Vol 14 (8) ◽  
pp. 2284
Author(s):  
Krzysztof Przystupa ◽  
Mykola Beshley ◽  
Olena Hordiichuk-Bublivska ◽  
Marian Kyryk ◽  
Halyna Beshley ◽  
...  

Analyzing large amounts of user data to determine user preferences and, on that basis, to recommend new products is an important problem. Depending on the correctness and timeliness of the recommendations, significant profits or losses can result. Analysis of data about the users of a company's services is carried out in dedicated recommendation systems. However, with a large number of users the data to be processed become very large, which complicates the work of recommendation systems. For efficient data analysis in commercial systems, the Singular Value Decomposition (SVD) method can be used for intelligent analysis of the information. For large amounts of processed information, we propose using distributed systems, which reduces the time needed to process the data and deliver recommendations to users. For the experimental study, we implemented the distributed SVD method using Message Passing Interface, Hadoop and Spark technologies, and found that distributed systems reduce data-processing time compared to non-distributed ones.
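
A basic building block of distributed SVD, sketched below in C/MPI as our own illustration (not the paper's implementation), is power iteration for the leading singular vector of a row-distributed ratings matrix: each rank computes its partial contribution to A^T(Av), and MPI_Allreduce combines the contributions across ranks.

/* Minimal sketch: distributed power iteration for the leading
 * singular triplet of a row-distributed matrix A. Each rank holds
 * LOCAL_ROWS rows; A^T A = sum over ranks of A_r^T A_r. */
#include <mpi.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define NCOLS 512
#define LOCAL_ROWS 1024
#define ITERS 50

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* Synthetic local row block of the user-item matrix. */
    double *A = malloc(LOCAL_ROWS * NCOLS * sizeof(double));
    srand(rank + 7);
    for (int i = 0; i < LOCAL_ROWS * NCOLS; i++)
        A[i] = (double)rand() / RAND_MAX;

    double v[NCOLS], z[NCOLS], y[LOCAL_ROWS];
    for (int j = 0; j < NCOLS; j++) v[j] = 1.0 / NCOLS;

    double sigma = 0.0;
    for (int it = 0; it < ITERS; it++) {
        for (int i = 0; i < LOCAL_ROWS; i++) {   /* y = A_local v */
            y[i] = 0.0;
            for (int j = 0; j < NCOLS; j++)
                y[i] += A[i * NCOLS + j] * v[j];
        }
        for (int j = 0; j < NCOLS; j++) {        /* z = A_local^T y */
            z[j] = 0.0;
            for (int i = 0; i < LOCAL_ROWS; i++)
                z[j] += A[i * NCOLS + j] * y[i];
        }
        MPI_Allreduce(MPI_IN_PLACE, z, NCOLS, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);           /* sum over ranks */
        double norm = 0.0;
        for (int j = 0; j < NCOLS; j++) norm += z[j] * z[j];
        norm = sqrt(norm);
        sigma = sqrt(norm);   /* ||A^T A v|| converges to sigma_1^2 */
        for (int j = 0; j < NCOLS; j++) v[j] = z[j] / norm;
    }

    if (rank == 0) printf("leading singular value ~ %.4f\n", sigma);
    free(A);
    MPI_Finalize();
    return 0;
}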


1996 ◽  
Vol 22 (6) ◽  
pp. 789-828 ◽  
Author(s):  
William Gropp ◽  
Ewing Lusk ◽  
Nathan Doss ◽  
Anthony Skjellum

2013 ◽  
Vol 718-720 ◽  
pp. 1645-1650
Author(s):  
Gen Yin Cheng ◽  
Sheng Chen Yu ◽  
Zhi Yong Wei ◽  
Shao Jie Chen ◽  
You Cheng

The commonly used commercial simulation packages SYSNOISE and ANSYS run on a single machine (they cannot run directly on a parallel machine) when the finite element and boundary element methods are used to simulate muffler performance, and because of the large amount of numerical simulation it can take more than ten days, sometimes even twenty, to work out an exact solution. We use a high performance parallel machine built from 32 commodity computers and transform the finite element and boundary element simulation software into a program that can run under the MPI (message passing interface) parallel environment, in order to reduce the cost of numerical simulation. The data obtained from the simulation experiments demonstrate that the numerical results are good, and that the computing speed of the high performance parallel machine is 25 to 30 times that of a single microcomputer.
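
At its core, the parallelization strategy amounts to splitting the element loop across ranks and reducing the partial results, as in this conceptual C/MPI sketch (assemble_element() is a hypothetical stand-in; this is not SYSNOISE or ANSYS code):

/* Conceptual sketch: the element loop of a finite/boundary element
 * computation split across MPI ranks, partial results reduced. */
#include <mpi.h>
#include <stdio.h>

#define NELEM 100000

/* Hypothetical per-element computation (e.g. local contribution). */
static double assemble_element(int e) { return 1.0 / (e + 1.0); }

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Each rank handles a contiguous block of elements. */
    int per = (NELEM + size - 1) / size;
    int lo = rank * per;
    int hi = (lo + per < NELEM) ? lo + per : NELEM;

    double part = 0.0;
    for (int e = lo; e < hi; e++)
        part += assemble_element(e);

    double total = 0.0;
    MPI_Reduce(&part, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0) printf("assembled contribution: %.6f\n", total);

    MPI_Finalize();
    return 0;
}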


Author(s):  
Alan Gray ◽  
Kevin Stratford

Leading high performance computing systems achieve their status through the use of highly parallel devices such as NVIDIA graphics processing units or Intel Xeon Phi many-core CPUs. The concept of performance portability across such architectures, as well as traditional CPUs, is vital for the application programmer. In this paper we describe targetDP, a lightweight abstraction layer which allows grid-based applications to target data-parallel hardware in a platform-agnostic manner. We demonstrate the effectiveness of our pragmatic approach by presenting performance results for a complex fluid application (with which the model was co-designed), plus a separate lattice quantum chromodynamics particle physics code. For each application, a single source code base is seen to achieve portable performance, as assessed within the context of the Roofline model. TargetDP can be combined with the Message Passing Interface (MPI) to allow use on systems containing multiple nodes: we demonstrate this through scaling results on traditional and graphics-processing-unit-accelerated large-scale supercomputers.
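
The core idea of such an abstraction layer can be sketched in C as follows (the macro names here are our own invention, not targetDP's API): the kernel body is written once against a loop macro, and the build target decides whether the macro expands to a plain loop or an OpenMP parallel loop; a CUDA build would map the same macro onto a device kernel launch instead.

/* Conceptual sketch of a targetDP-style abstraction (macro names are
 * hypothetical): one kernel body, retargeted at compile time.
 * Build with -DUSE_OPENMP -fopenmp for the threaded variant. */
#include <stdio.h>
#include <stdlib.h>

#ifdef USE_OPENMP
  #include <omp.h>
  #define GRID_LOOP(i, n) _Pragma("omp parallel for") \
                          for (int i = 0; i < (n); i++)
#else
  #define GRID_LOOP(i, n) for (int i = 0; i < (n); i++)
#endif

/* Kernel body written once against the abstraction. */
static void scale_field(double *field, int n, double a)
{
    GRID_LOOP(i, n) {
        field[i] *= a;   /* same data-parallel body on any target */
    }
}

int main(void)
{
    int n = 1 << 20;
    double *field = malloc(n * sizeof(double));
    for (int i = 0; i < n; i++) field[i] = 1.0;

    scale_field(field, n, 2.5);

    printf("field[0] = %.2f\n", field[0]);
    free(field);
    return 0;
}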

