Optimizing UPC Programs for Multi-Core Systems

The Partitioned Global Address Space (PGAS) model of Unified Parallel C (UPC) can help users express and manage application data locality on non-uniform memory access (NUMA) multi-core shared-memory systems and thereby obtain good performance. First, we describe several UPC program optimization techniques that are important for achieving good performance on NUMA multi-core computers, illustrated with examples and quantitative performance results. Second, we use two numerical computing kernels, parallel matrix–matrix multiplication and parallel 3-D FFT, to demonstrate end-to-end development and optimization of UPC applications. Our results show that the optimized UPC programs achieve good, scalable performance on current multi-core systems and can even outperform vendor-optimized libraries in some cases.
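
To make the data-locality idea in the abstract above concrete, here is a minimal Python sketch (not UPC code, and not taken from the paper) of how a UPC shared array declared with a block size is distributed: element i of `shared [B] T a[N]` has affinity to thread (i / B) mod THREADS. The function names `owner` and `local_indices` are hypothetical and exist only for this illustration.

```python
def owner(i, block, threads):
    """Thread with affinity to element i under UPC's block-cyclic layout."""
    return (i // block) % threads


def local_indices(n, block, threads, me):
    """Indices of an n-element shared array that are local to thread `me`."""
    return [i for i in range(n) if owner(i, block, threads) == me]


if __name__ == "__main__":
    # Show which elements each of 4 threads owns for block size 2.
    for t in range(4):
        print(t, local_indices(16, 2, 4, t))
```

A loop in which each thread touches only the indices returned by `local_indices` mimics the owner-computes pattern that UPC expresses with the affinity expression of `upc_forall`; keeping accesses local in this way is one plausible reading of the locality management the abstract refers to.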


Enabling Large-Scale Simulations of Quantum Transport with Manycore Computing

Electronics ◽

10.3390/electronics10030253 ◽

2021 ◽

Vol 10 (3) ◽

pp. 253

Author(s):

Yosang Jeong ◽

Hoon Ryu

Keyword(s):

Quantum Transport ◽

Large Scale ◽

Performance Enhancement ◽

Silicon Nanowire ◽

Matrix Multiplication ◽

Tight Binding ◽

Optimization Techniques ◽

Wide Energy Range ◽

Processing Unit ◽

Binding Model

The non-equilibrium Green’s function (NEGF) method is used in nanoscience to predict the transport behavior of electronic devices. This work explores how much performance improvement can be achieved for quantum transport simulations with the aid of manycore computing, where the core numerical operation involves a recursive process of matrix multiplication. The major techniques adopted for performance enhancement are data restructuring, matrix tiling, thread scheduling, and offload computing, and we present technical details on how they are applied to optimize simulation performance on computing hardware that includes Intel Xeon Phi Knights Landing (KNL) systems and NVIDIA general-purpose graphics processing unit (GPU) devices. With a target structure of a silicon nanowire that consists of 100,000 atoms and is described with an atomistic tight-binding model, the effects of the optimization techniques on simulation performance are rigorously tested on a KNL node equipped with two Quadro GV100 GPU devices, and we observe that computation is accelerated by a factor of up to ∼20 relative to the unoptimized case. The feasibility of handling large-scale workloads in a very large computing environment is also examined with nanowire simulations over a wide energy range, where good scalability is obtained up to 2048 KNL nodes.
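
The matrix tiling named in the abstract above can be illustrated with a minimal sketch. The code below is not the authors' implementation. It is a plain-Python blocked matrix product, with the hypothetical names `matmul_tiled` and `matmul_naive` and an arbitrarily chosen default tile size, that shows only the loop structure: the product is computed one tile at a time so that the working set of the inner loops stays small.

```python
def matmul_naive(a, b):
    """Reference product of list-of-lists matrices, used for comparison."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matmul_tiled(a, b, tile=2):
    """Multiply list-of-lists matrices tile by tile (illustrative only)."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    for i0 in range(0, n, tile):
        for p0 in range(0, k, tile):
            for j0 in range(0, m, tile):
                # Update one tile of C; min() handles edge tiles when a
                # dimension is not a multiple of `tile`.
                for i in range(i0, min(i0 + tile, n)):
                    row_c = c[i]
                    for p in range(p0, min(p0 + tile, k)):
                        a_ip = a[i][p]
                        row_b = b[p]
                        for j in range(j0, min(j0 + tile, m)):
                            row_c[j] += a_ip * row_b[j]
    return c


if __name__ == "__main__":
    a = [[1, 2, 3], [4, 5, 6]]
    b = [[7, 8], [9, 10], [11, 12]]
    print(matmul_tiled(a, b) == matmul_naive(a, b))
```

In an interpreted language the tiling brings no speedup; the sketch is only meant to show the access pattern. In compiled code the tile size would normally be tuned to the cache or on-chip memory of the target device, which the abstract does not specify.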


Fast and Scalable Parallel Matrix Multiplication and Its Applications on Distributed Memory Systems

Handbook of Parallel Computing - Chapman & Hall/CRC Computer & Information Science Series ◽

10.1201/9781420011296.ch47 ◽

2007 ◽

pp. 47-1-47-25

Author(s):

Keqin Li

Keyword(s):

Distributed Memory ◽

Matrix Multiplication ◽

Memory Systems


Program optimization for shared virtual memory systems

High-Performance Computing and Networking - Lecture Notes in Computer Science ◽

10.1007/3-540-61142-8_681 ◽

1996 ◽

pp. 1001-1002

Author(s):

M. Gerndt ◽

A. Krumme

Keyword(s):

Virtual Memory ◽

Memory Systems ◽

Program Optimization ◽

Shared Virtual Memory


Performance portable parallel programming of heterogeneous stencils across shared-memory platforms with modern Intel processors

The International Journal of High Performance Computing Applications ◽

10.1177/1094342019828153 ◽

2019 ◽

Vol 33 (3) ◽

pp. 534-553 ◽

Cited By ~ 4

Author(s):

Lukasz Szustak ◽

Pawel Bratek

Keyword(s):

Shared Memory ◽

Memory Systems ◽

Optimization Techniques ◽

Utilization Rate ◽

Problem Size ◽

Geophysical Model ◽

Step Procedure ◽

Wide Range ◽

Sustained Performance ◽

Portable Parallel Programming

In this work, we take up the challenge of performance-portable programming of heterogeneous stencil computations across a wide range of modern shared-memory systems. An important example of such computations is the Multidimensional Positive Definite Advection Transport Algorithm (MPDATA), the second major part of the dynamic core of the EULAG geophysical model. To this end, we develop a set of parametric optimization techniques and a four-step procedure for customizing the MPDATA code. These techniques include the islands-of-cores strategy, (3+1)D decomposition, exploitation of data parallelism and simultaneous multithreading, data flow synchronization, and vectorization. The proposed adaptation methodology helps us automate the transformation of the MPDATA code to achieve high, sustained, scalable performance on all tested ccNUMA platforms with recent generations of Intel processors. This means that, for a given platform, the sustained performance of the new code stays at a similar level regardless of the problem size. The highest performance utilization rate, about 41–46% of the theoretical peak as measured for all benchmarks, is obtained on each of the two-socket servers based on the Skylake-SP (SKL-SP), Broadwell, and Haswell CPU architectures. At the same time, the four-socket server with SKL-SP processors achieves the highest sustained performance, around 1.0–1.1 Tflop/s, which corresponds to about 33% of the peak.


Mitigating State-Drift in Memristor Crossbar Arrays for Vector Matrix Multiplication

10.5772/intechopen.100246 ◽

2021 ◽

Author(s):

Amirali Amirsoleimani ◽

Tony Liu ◽

Fabien Alibart ◽

Serge Ecoffey ◽

Yao-Feng Chang ◽

...

Keyword(s):

Matrix Multiplication ◽

Optimization Techniques ◽

Performance Improvements ◽

Network Applications ◽

Network Layers ◽

Adaptive Inference ◽

Computing Platforms ◽

And Performance ◽

Memristor Crossbar ◽

Vector Matrix

In this chapter, we review recent progress on resistance drift mitigation techniques for resistive switching memory devices (specifically memristors) and its impact on accuracy in deep neural network applications. In the first section, we investigate the importance of soft errors and their detrimental impact on the performance of memristor-based vector–matrix multiplication (VMM) platforms, especially the memristance state drift induced by long-term recurring inference operations with sub-threshold stress voltage. We also briefly review some currently developed state-drift mitigation methods. In the next section, we discuss an adaptive inference technique with low hardware overhead that mitigates memristance drift in memristive VMM platforms by using optimization techniques to adjust the inference voltage characteristic associated with different network layers. We also present simulation results and the performance improvements achieved by applying the proposed inference technique, taking non-idealities into account, for various deep network applications on memristor crossbar arrays. This chapter suggests that a simple, low-overhead inference technique can revive the functionality and enhance the performance of memristor-based VMM arrays and significantly increase their lifetime, which can be a very important factor in making this technology a mainstream player in future in-memory computing platforms.


Optimization and Profiling of the Cache Performance of Parallel Lattice Boltzmann Codes

Parallel Processing Letters ◽

10.1142/s0129626403001501 ◽

2003 ◽

Vol 13 (04) ◽

pp. 549-560 ◽

Cited By ~ 68

Author(s):

Thomas Pohl ◽

Markus Kowarschik ◽

Jens Wilke ◽

Klaus Iglberger ◽

Ulrich Rüde

Keyword(s):

Fluid Dynamics ◽

Computational Fluid Dynamics ◽

Lattice Boltzmann ◽

Parallel Computers ◽

Optimization Techniques ◽

Main Memory ◽

Lattice Boltzmann Methods ◽

Hierarchical Memory ◽

Performance Results ◽

2D And 3D

When designing and implementing highly efficient scientific applications for parallel computers such as clusters of workstations, it is essential to consider and optimize the single-CPU performance of the codes. For this purpose, it is particularly important that the codes respect the hierarchical memory designs that computer architects employ to hide the effects of the growing gap between CPU performance and main memory speed. In this article, we present techniques to enhance the single-CPU efficiency of lattice Boltzmann methods, which are commonly used in computational fluid dynamics. We show various performance results for both 2D and 3D codes to demonstrate the effectiveness of our optimization techniques.


Supporting Mixed-domain Mixed-precision Matrix Multiplication within the BLIS Framework

ACM Transactions on Mathematical Software ◽

10.1145/3402225 ◽

2021 ◽

Vol 47 (2) ◽

pp. 1-26

Author(s):

Field G. Van Zee ◽

Devangi N. Parikh ◽

Robert A. Van De Geijn

Keyword(s):

High Performance ◽

Matrix Multiplication ◽

Software Framework ◽

Matrix Product ◽

Double Precision ◽

Precision Matrix ◽

Implementation Approach ◽

Mixed Precision ◽

The Matrix ◽

Performance Results

We approach the problem of implementing mixed-datatype support within the general matrix multiplication (gemm) operation of the BLAS-like Library Instantiation Software (BLIS) framework, whereby each matrix operand A, B, and C may be stored as single- or double-precision real or complex values. Another factor of complexity, whereby the matrix product and accumulation are allowed to take place in a precision different from the storage precisions of either A or B, is also discussed. We first break the problem into orthogonal dimensions, considering the mixing of domains separately from the mixing of precisions. Support for all combinations of matrix operands stored in either the real or complex domain is mapped out by enumerating the cases and describing an implementation approach for each. Supporting all combinations of storage and computation precisions is handled by typecasting the matrices at key stages of the computation (during packing and/or accumulation, as needed). Several optional optimizations are also documented. Performance results gathered on a 56-core Marvell ThunderX2 and a 52-core Intel Xeon Platinum demonstrate that high performance is mostly preserved, with modest slowdowns incurred from unavoidable typecast instructions. The mixed-datatype implementation confirms that combinatorial intractability is avoided, with the framework relying on only two assembly microkernels to implement 128 datatype combinations.
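
The idea of typecasting during packing and accumulation, as described in the abstract above, can be sketched for the real domain as follows. This is an illustration under stated assumptions, not the BLIS API or its implementation: the names `pack` and `gemm_mixed` are hypothetical, precisions are modeled with the standard `array` type codes `'f'` (single) and `'d'` (double), and complex operands are left out.

```python
from array import array


def pack(matrix, typecode):
    """Copy a matrix row by row into arrays of the requested precision.

    'f' rounds every value to single precision; 'd' keeps double.
    """
    return [array(typecode, list(row)) for row in matrix]


def gemm_mixed(a, b, c, comp="d", store_c="d"):
    """Return A*B + C, casting operands to `comp` while packing.

    The result is cast to `store_c`, the storage precision of C.
    """
    a_p, b_p = pack(a, comp), pack(b, comp)
    out = []
    for i, row_c in enumerate(c):
        # Accumulator held in the computation precision.
        acc = array(comp, list(row_c))
        for p, a_ip in enumerate(a_p[i]):
            for j, b_pj in enumerate(b_p[p]):
                # The update is evaluated in double and rounded to `comp`
                # when stored, which only approximates native arithmetic
                # in that precision.
                acc[j] += a_ip * b_pj
        out.append(array(store_c, list(acc)))
    return out


if __name__ == "__main__":
    a = [[1.5, 2.0]]
    b = [[0.5], [4.0]]
    c = [[1.0]]
    print(list(gemm_mixed(a, b, c, comp="f", store_c="d")[0]))
```

Because the casts happen in `pack` and in the final store, the multiply-accumulate loop itself is written once and does not depend on how the operands were stored. This loosely mirrors the abstract's point that a small number of kernels can serve many datatype combinations.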
