Coarrays in the Context of XcalableMP

XcalableMP PGAS Programming Language ◽

10.1007/978-981-15-7683-6_3 ◽

2020 ◽

pp. 97-122

Author(s):

Hidetoshi Iwashita ◽

Masahiro Nakao

Keyword(s):

High Performance ◽

Memory Allocation ◽

Communication Algorithms ◽

Zero Copy ◽

Ping Pong ◽

Allocation Methods

Coarray features have been implemented in the Omni XcalableMP compiler with a source-to-source translator and layered runtime libraries. Three memory allocation methods for coarrays were implemented for the GASNet and MPI-3 communication libraries and the native interface of Fujitsu. For coarray PUT/GET communication, algorithms using DMA (zero-copy) and buffering were introduced. Important techniques for achieving high performance were the non-blocking PUT communication implemented in the runtime library and the optimization of GET communication in the translator. The fundamental performance was evaluated and analyzed using the ping-pong benchmark and a modified version of it. The MPI version of the Himeno benchmark was ported to a coarray version and modified to make full use of the non-blocking PUT. In the evaluation, the non-blocking coarray version clearly outperformed the original and non-blocking MPI versions.

A comprehensive zero-copy architecture for high performance distributed data acquisition over advanced network technologies for the CMS experiment

2012 18th IEEE-NPSS Real Time Conference ◽

10.1109/rtc.2012.6418171 ◽

2012 ◽

Author(s):

Gerry Bauer ◽

Ulf Behrens ◽

James Branson ◽

Sebastian Bukowiec ◽

Olivier Chaze ◽

...

Keyword(s):

Data Acquisition ◽

High Performance ◽

Distributed Data ◽

Cms Experiment ◽

Zero Copy ◽

Network Technologies

Memory allocation anomalies in high‐performance computing applications: A study with numerical simulations

Concurrency and Computation Practice and Experience ◽

10.1002/cpe.6094 ◽

2020 ◽

Author(s):

Antônio Tadeu A. Gomes ◽

Enzo Molion ◽

Roberto P. Souto ◽

Jean‐François Méhaut

Keyword(s):

Numerical Simulations ◽

High Performance Computing ◽

High Performance ◽

Memory Allocation ◽

Performance Computing

BROADCASTING IN BUS INTERCONNECTION NETWORKS

Journal of Interconnection Networks ◽

10.1142/s0219265900000068 ◽

2000 ◽

Vol 01 (02) ◽

pp. 73-94

Author(s):

A. FERREIRA ◽

A. GOLDMAN ◽

S. W. SONG

Keyword(s):

Interconnection Networks ◽

High Performance ◽

Interconnection Network ◽

Parallel Architectures ◽

Interprocessor Communication ◽

Efficient Manner ◽

Graph Theoretic ◽

Communication Algorithms ◽

Communication Links ◽

Point To Point

In most distributed memory MIMD multiprocessors, processors are connected by a point-to-point interconnection network, usually modeled by a graph where processors are nodes and communication links are edges. Since interprocessor communication is frequently a serious bottleneck, several architectures have been proposed that enhance point-to-point topologies with multiple bus systems to improve communication efficiency. In this paper we study parallel architectures in which buses are the only means of communication. These architectures can exploit the power of bus technologies, providing a way to interconnect many more processors in a simple and efficient manner. We present the hyperpath, hypergrid, hyperring, and hypertorus architectures, which are the bus-based versions of the widely used point-to-point interconnection networks. Using (hyper)graph theoretic concepts to model interprocessor communication in such networks, we give optimal algorithms for broadcasting a message from one processor to all the others. To derive high performance communication patterns we developed a new tool called simplification. The idea is to construct a graph, called the representative graph, from the original hyper-topology in such a way that communication schemes are easy to describe and perform on the former and also fit the latter. The simplification concept also allows us to partially reuse some already known communication algorithms for ordinary networks.
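The bus model in this abstract can be illustrated with a small simulation. The sketch below is not the paper's algorithm. It assumes a hypergrid in which every line of processors along one axis shares a single bus, and that in one round a bus carries one message from one writer to all attached processors. The function name `broadcast_hypergrid` and this round model are assumptions made for illustration only.

```python
from itertools import product


def broadcast_hypergrid(shape, source):
    """Dimension-by-dimension broadcast on an assumed bus-based hypergrid.

    shape  -- number of processors along each axis, e.g. (3, 4)
    source -- coordinates of the processor that holds the message
    Returns (set of informed processors, number of rounds used).
    """
    informed = {tuple(source)}
    rounds = 0
    for axis in range(len(shape)):
        newly_informed = set()
        for node in informed:
            # Each informed node writes on its bus along `axis`.
            # Informed nodes differ only in earlier axes, so every bus
            # used in this round has exactly one writer.
            for k in range(shape[axis]):
                newly_informed.add(node[:axis] + (k,) + node[axis + 1:])
        informed |= newly_informed
        rounds += 1
    return informed, rounds


if __name__ == "__main__":
    nodes, rounds = broadcast_hypergrid((3, 4), (1, 2))
    print(len(nodes), "processors informed in", rounds, "rounds")
```

Under these assumptions a d-dimensional hypergrid is covered in d rounds, whatever the number of processors per bus. Whether that matches the optimal bounds derived in the paper is not stated in the abstract.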

A GPU Accelerated Red-Black SOR Algorithm for Computational Fluid Dynamics Problems

Advanced Materials Research ◽

10.4028/www.scientific.net/amr.320.335 ◽

2011 ◽

Vol 320 ◽

pp. 335-340 ◽

Cited By ~ 14

Author(s):

Ji Tang Liu ◽

Zhao Song Ma ◽

Shi Hai Li ◽

Ying Zhao

Keyword(s):

High Performance ◽

Memory Allocation ◽

Compute Unified Device Architecture ◽

Problem Size ◽

Benchmark Data ◽

Device Architecture ◽

Computational Performance ◽

Speed Up ◽

Sequential Code ◽

Dynamics Problems

GPUs are high-performance co-processors to the CPU for scientific computing, including CFD. We present an optimistic shared memory allocation strategy to solve 2D CFD problems using the Red-Black SOR method on a GPU with CUDA (Compute Unified Device Architecture). Lid-driven results are compared with the benchmark data. The speed-up ratio for the same problem size between an NVIDIA GTX480 and an Intel Core-Dual 3.0 GHz processor is discussed: the GPU is 120 times faster than the sequential code on the CPU at a problem size of 756 × 756. Based on this work, we conclude that using the memory hierarchy properly plays a key role in improving the computational performance of the GPU.
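The red-black ordering named in this abstract can be sketched in a few lines. The code below is a sequential Python illustration, not the authors' CUDA implementation or their shared memory strategy. It solves Laplace's equation with fixed boundary values rather than a lid-driven flow, and the function names and the relaxation factor of 1.5 are assumptions. The point it shows is that every update of one colour reads only cells of the other colour, which is why each half-sweep can be executed in parallel on a GPU.

```python
def red_black_sor(u, omega=1.5, sweeps=100):
    """In-place red-black SOR for Laplace's equation on a 2D grid.

    u is a list of lists; its border entries hold fixed Dirichlet values.
    """
    rows, cols = len(u), len(u[0])
    for _ in range(sweeps):
        for colour in (0, 1):  # 0 = "red" cells, 1 = "black" cells
            for i in range(1, rows - 1):
                for j in range(1, cols - 1):
                    if (i + j) % 2 != colour:
                        continue
                    # All four neighbours have the opposite colour, so
                    # updates within one colour are independent.
                    gauss_seidel = 0.25 * (
                        u[i - 1][j] + u[i + 1][j] + u[i][j - 1] + u[i][j + 1]
                    )
                    u[i][j] += omega * (gauss_seidel - u[i][j])
    return u


def max_error_on_linear_problem(n=9):
    """Boundary values i + j; the exact discrete solution is also i + j."""
    u = [
        [float(i + j) if i in (0, n - 1) or j in (0, n - 1) else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    red_black_sor(u)
    return max(abs(u[i][j] - (i + j)) for i in range(n) for j in range(n))


if __name__ == "__main__":
    print("max error:", max_error_on_linear_problem())
```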

A Comprehensive Zero-Copy Architecture for High Performance Distributed Data Acquisition Over Advanced Network Technologies for the CMS Experiment

IEEE Transactions on Nuclear Science ◽

10.1109/tns.2013.2282340 ◽

2013 ◽

Vol 60 (6) ◽

pp. 4595-4602 ◽

Cited By ~ 4

Author(s):

Gerry Bauer ◽

Ulf Behrens ◽

James Branson ◽

Sebastian Bukowiec ◽

Olivier Chaze ◽

...

Keyword(s):

Data Acquisition ◽

High Performance ◽

Distributed Data ◽

Cms Experiment ◽

Zero Copy ◽

Network Technologies

Parallelization efficiency of an FDTD code on a high performance cluster under different arrangements of memory allocation for material distribution

2007 IEEE Antennas and Propagation Society International Symposium ◽

10.1109/aps.2007.4396643 ◽

2007 ◽

Author(s):

Yongquan Lu ◽

Chu Qiu ◽

Rui Lu ◽

Zhiwu Su ◽

Xiaoling Yang ◽

...

Keyword(s):

High Performance ◽

Memory Allocation ◽

Material Distribution ◽

Parallelization Efficiency

Research and Implementation of Packet Capture Based on Multi-Core Binding Technology in Linux Environment

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.48-49.902 ◽

2011 ◽

Vol 48-49 ◽

pp. 902-905

Author(s):

Jiang Sun ◽

Ju Long Lan ◽

Yu Feng Li

Keyword(s):

Memory Management ◽

High Performance ◽

Data Packet ◽

Management Mode ◽

Space Program ◽

Packet Capture ◽

Zero Copy ◽

User Space ◽

And Control

Based on the zero-copy idea and multi-core binding, a high-performance packet capture platform (MCPCP) is realized. By modifying the memory management mode of sk_buff in the kernel, user-space programs can directly access data packets, which gives a zero-copy scheme of general applicability. Then, through the multi-core binding technique, which schedules and controls each CPU core, multi-threaded user programs can minimize cache jitter and improve the efficiency of packet capture. Experiments show that in a low-end configuration, the throughput of MCPCP for 64-byte and 1500-byte messages is 620,000 pps (about 320 Mbps) and 78,000 pps (about 941 Mbps), respectively. In a high-end configuration, it reaches 1.46 million pps (748 Mbps) and 81,000 pps (979 Mbps). MCPCP outperforms traditional packet capture approaches.
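The packet rates and bit rates quoted in this abstract can be cross-checked with simple arithmetic. The sketch below multiplies packets per second by packet size and by 8 bits per byte. The helper name `payload_mbps` is hypothetical, and the calculation assumes that 1 Mbps means 10^6 bits per second and that the quoted packet sizes are the full amount counted per packet.

```python
def payload_mbps(packets_per_second, packet_bytes):
    """Bit rate in Mbps implied by a packet rate and a packet size."""
    return packets_per_second * packet_bytes * 8 / 1e6


# (packets per second, packet size in bytes, Mbps quoted in the abstract)
REPORTED = [
    (620_000, 64, 320),
    (78_000, 1500, 941),
    (1_460_000, 64, 748),
    (81_000, 1500, 979),
]

if __name__ == "__main__":
    for pps, size, quoted in REPORTED:
        computed = payload_mbps(pps, size)
        print(f"{pps} pps x {size} B = {computed:.2f} Mbps (quoted: {quoted})")
```

Each computed value lands within about 1% of the quoted figure, which is consistent with the packet rates in the abstract having been rounded.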

The design and implementation of zero copy MPI using commodity hardware with a high performance network

Proceedings of the 12th international conference on Supercomputing - ICS '98 ◽

10.1145/277830.277883 ◽

1998 ◽

Cited By ~ 42

Author(s):

Francis O'Carroll ◽

Hiroshi Tezuka ◽

Atsushi Hori ◽

Yutaka Ishikawa

Keyword(s):

High Performance ◽

Commodity Hardware ◽

Design And Implementation ◽

Zero Copy

An Adaptive Zero-Copy Strategy for Ubiquitous High Performance Computing

Proceedings of the 21st European MPI Users' Group Meeting on - EuroMPI/ASIA '14 ◽

10.1145/2642769.2642796 ◽

2014 ◽

Cited By ~ 1

Author(s):

Ting-Hsuan Chien ◽

Chia-Jung Chen ◽

Rong-Guey Chang

Keyword(s):

High Performance Computing ◽

High Performance ◽

Zero Copy ◽

Copy Strategy ◽

Performance Computing

Characterization of an Arginine:Pyruvate Transaminase in Arginine Catabolism of Pseudomonas aeruginosa PAO1

Journal of Bacteriology ◽

10.1128/jb.00262-07 ◽

2007 ◽

Vol 189 (11) ◽

pp. 3954-3959 ◽

Cited By ~ 15

Author(s):

Zhe Yang ◽

Chung-Dar Lu

Keyword(s):

Pseudomonas Aeruginosa ◽

High Performance ◽

Biochemical Characterization ◽

Physiological Function ◽

Catalytic Efficiency ◽

Kinetic Mechanism ◽

Arginine Catabolism ◽

Ping Pong ◽

Coupled Reaction

The arginine transaminase (ATA) pathway represents one of the multiple pathways for l-arginine catabolism in Pseudomonas aeruginosa. The AruH protein was proposed to catalyze the first step in the ATA pathway, converting the substrates l-arginine and pyruvate into 2-ketoarginine and l-alanine. Here we report the initial biochemical characterization of this enzyme. The aruH gene was overexpressed in Escherichia coli, and its product was purified to homogeneity. High-performance liquid chromatography and mass spectrometry (MS) analyses were employed to detect the presence of the transamination products 2-ketoarginine and l-alanine, thus demonstrating the proposed biochemical reaction catalyzed by AruH. The enzymatic properties and kinetic parameters of dimeric recombinant AruH were determined by a coupled reaction with NAD+ and l-alanine dehydrogenase. The optimal activity of AruH was found at pH 9.0, and it has a novel substrate specificity with an order of preference of Arg > Lys > Met > Leu > Orn > Gln. With l-arginine and pyruvate as the substrates, Lineweaver-Burk plots of the data revealed a series of parallel lines characteristic of a ping-pong kinetic mechanism with calculated Vmax and kcat values of 54.6 ± 2.5 μmol/min/mg and 38.6 ± 1.8 s−1. The apparent Km and catalytic efficiency (kcat/Km) were 1.6 ± 0.1 mM and 24.1 mM−1 s−1 for pyruvate and 13.9 ± 0.8 mM and 2.8 mM−1 s−1 for l-arginine. When l-lysine was used as the substrate, MS analysis suggested Δ1-piperideine-2-carboxylate as its transamination product. These results implied that AruH may have a broader physiological function in amino acid catabolism.
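The link between parallel Lineweaver-Burk lines and a ping-pong mechanism, and the catalytic efficiencies quoted in this abstract, can be checked numerically. The sketch below uses the textbook two-substrate ping-pong rate law v = Vmax[A][B] / (Kb[A] + Ka[B] + [A][B]). Plugging the reported apparent Km values in as Ka (l-arginine) and Kb (pyruvate) is an assumption made for illustration, as are the function names and the substrate concentrations used.

```python
VMAX = 54.6     # umol/min/mg, from the abstract
KCAT = 38.6     # 1/s, from the abstract
KM_PYR = 1.6    # mM, apparent Km for pyruvate
KM_ARG = 13.9   # mM, apparent Km for l-arginine


def ping_pong_rate(a, b, vmax=VMAX, ka=KM_ARG, kb=KM_PYR):
    """Textbook ping-pong bi-bi rate; a = [arginine], b = [pyruvate] in mM."""
    return vmax * a * b / (kb * a + ka * b + a * b)


def lineweaver_burk_slope(b, a1=5.0, a2=20.0):
    """Slope of 1/v against 1/[A] at a fixed co-substrate concentration b."""
    x1, x2 = 1.0 / a1, 1.0 / a2
    y1, y2 = 1.0 / ping_pong_rate(a1, b), 1.0 / ping_pong_rate(a2, b)
    return (y1 - y2) / (x1 - x2)


if __name__ == "__main__":
    # Catalytic efficiency kcat/Km in 1/(mM s) for each substrate.
    print("pyruvate:", round(KCAT / KM_PYR, 1))
    print("l-arginine:", round(KCAT / KM_ARG, 1))
    # The slope equals Ka/Vmax at every pyruvate level: parallel lines.
    for b in (0.5, 2.0, 8.0):
        print(f"[pyruvate] = {b} mM, slope = {lineweaver_burk_slope(b):.5f}")
```

In this rate law, 1/v = 1/Vmax + Ka/(Vmax[A]) + Kb/(Vmax[B]), so changing [B] shifts only the intercept and leaves the slope fixed, which is the parallel-line pattern the abstract describes.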
