Optimizing UPC Programs for Multi-Core Systems

2010 ◽  
Vol 18 (3-4) ◽  
pp. 183-191 ◽  
Author(s):  
Yili Zheng

The Partitioned Global Address Space (PGAS) model of Unified Parallel C (UPC) can help users express and manage application data locality on non-uniform memory access (NUMA) multi-core shared-memory systems to get good performance. First, we describe, with examples and quantitative performance results, several UPC program optimization techniques that are important to achieving good performance on NUMA multi-core computers. Second, we use two numerical computing kernels, parallel matrix–matrix multiplication and parallel 3-D FFT, to demonstrate the end-to-end development and optimization of UPC applications. Our results show that the optimized UPC programs achieve very good and scalable performance on current multi-core systems and can even outperform vendor-optimized libraries in some cases.
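
The data-locality idea in this abstract can be illustrated with the standard UPC block-cyclic layout rule: for a declaration such as `shared [B] double a[N]`, element i has affinity to thread (i / B) mod THREADS. The Python sketch below only models that mapping. The function names are hypothetical, and it is not code from the paper.

```python
from __future__ import annotations


def owner_thread(index: int, block_size: int, threads: int) -> int:
    """Thread with affinity to element `index` of a block-cyclic shared array."""
    return (index // block_size) % threads


def local_indices(thread: int, n: int, block_size: int, threads: int) -> list[int]:
    """Indices of an n-element shared array that are local to `thread`."""
    return [i for i in range(n) if owner_thread(i, block_size, threads) == thread]


if __name__ == "__main__":
    # 16 elements, blocks of 4, 2 threads: thread 0 owns blocks 0 and 2.
    print(local_indices(0, 16, 4, 2))
```

A loop that visits only the indices returned by `local_indices` touches thread-local memory exclusively, which is the kind of locality a NUMA-aware UPC program tries to preserve.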

Electronics ◽  
2021 ◽  
Vol 10 (3) ◽  
pp. 253
Author(s):  
Yosang Jeong ◽  
Hoon Ryu

The non-equilibrium Green’s function (NEGF) is being utilized in the field of nanoscience to predict transport behaviors of electronic devices. This work explores how much performance improvement can be achieved for quantum transport simulations with the aid of manycore computing, where the core numerical operation involves a recursive process of matrix multiplication. Major techniques adopted for performance enhancement are data restructuring, matrix tiling, thread scheduling, and offload computing, and we present technical details on how they are applied to optimize the performance of simulations on computing hardware, including Intel Xeon Phi Knights Landing (KNL) systems and NVIDIA general-purpose graphics processing unit (GPU) devices. With a target structure of a silicon nanowire that consists of 100,000 atoms and is described with an atomistic tight-binding model, the effects of optimization techniques on the performance of simulations are rigorously tested on a KNL node equipped with two Quadro GV100 GPU devices, and we observe that computation is accelerated by a factor of up to ∼20 against the unoptimized case. The feasibility of handling large-scale workloads in a huge computing environment is also examined with nanowire simulations in a wide energy range, where good scalability is obtained up to 2048 KNL nodes.
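
Of the techniques listed in this abstract, matrix tiling is the easiest to show in isolation. The sketch below is a generic blocked matrix multiplication in pure Python, assuming dense row-major lists and a hypothetical `tile` parameter. It illustrates the loop structure only and is not the authors' KNL or GPU implementation.

```python
from __future__ import annotations


def matmul_tiled(a: list[list[float]], b: list[list[float]], tile: int = 2) -> list[list[float]]:
    """Compute a @ b by iterating over tile x tile sub-blocks."""
    n, k, m = len(a), len(b), len(b[0])
    c = [[0.0] * m for _ in range(n)]
    # The outer three loops walk over tiles; the inner three stay inside one tile,
    # so each sub-block of a and b is reused while it is still in cache.
    for i0 in range(0, n, tile):
        for k0 in range(0, k, tile):
            for j0 in range(0, m, tile):
                for i in range(i0, min(i0 + tile, n)):
                    for kk in range(k0, min(k0 + tile, k)):
                        aik = a[i][kk]
                        for j in range(j0, min(j0 + tile, m)):
                            c[i][j] += aik * b[kk][j]
    return c


if __name__ == "__main__":
    print(matmul_tiled([[1, 2, 3], [4, 5, 6]], [[7, 8], [9, 10], [11, 12]]))
```

The result does not depend on `tile`; only the order of memory accesses changes, which is why tiling is a performance technique rather than a numerical one.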


Author(s):  
Lukasz Szustak ◽  
Pawel Bratek

In this work, we take up the challenge of performance portable programming of heterogeneous stencil computations across a wide range of modern shared-memory systems. An important example of such computations is the Multidimensional Positive Definite Advection Transport Algorithm (MPDATA), the second major part of the dynamic core of the EULAG geophysical model. For this aim, we develop a set of parametric optimization techniques and a four-step procedure for customization of the MPDATA code. Among these techniques are: islands-of-cores strategy, (3+1)D decomposition, exploiting data parallelism and simultaneous multithreading, data flow synchronization, and vectorization. The proposed adaptation methodology helps us to develop the automatic transformation of the MPDATA code to achieve high sustained scalable performance for all tested ccNUMA platforms with Intel processors of recent generations. This means that for a given platform, the sustained performance of the new code is kept at a similar level, independently of the problem size. The highest performance utilization rate of about 41–46% of the theoretical peak, measured for all benchmarks, is provided for any of the two-socket servers based on Skylake-SP (SKL-SP), Broadwell, and Haswell CPU architectures. At the same time, the four-socket server with SKL-SP processors achieves the highest sustained performance of around 1.0–1.1 Tflop/s, which corresponds to about 33% of the peak.


2021 ◽  
Author(s):  
Amirali Amirsoleimani ◽  
Tony Liu ◽  
Fabien Alibart ◽  
Serge Eccofey ◽  
Yao-Feng Chang ◽  
...  

In this chapter, we review recent progress on resistance drift mitigation techniques for resistive switching memory devices (specifically memristors) and its impact on accuracy in deep neural network applications. In the first section of the chapter, we investigate the importance of soft errors and their detrimental impact on the performance of memristor-based vector–matrix multiplication (VMM) platforms, especially the memristance state drift induced by long-term recurring inference operations with sub-threshold stress voltage. We also briefly review some currently developed state-drift mitigation methods. In the next section of the chapter, we discuss an adaptive inference technique with low hardware overhead that mitigates the memristance drift in memristive VMM platforms by using optimization techniques to adjust the inference voltage characteristic associated with different network layers. We also present simulation results and the performance improvements achieved by applying the proposed inference technique, considering non-idealities, for various deep network applications on memristor crossbar arrays. This chapter suggests that a simple, low-overhead inference technique can revive the functionality and enhance the performance of memristor-based VMM arrays and significantly increase their lifetime, which can be a very important factor toward making this technology a mainstream player in future in-memory computing platforms.


2003 ◽  
Vol 13 (04) ◽  
pp. 549-560 ◽  
Author(s):  
THOMAS POHL ◽  
MARKUS KOWARSCHIK ◽  
JENS WILKE ◽  
KLAUS IGLBERGER ◽  
ULRICH RÜDE

When designing and implementing highly efficient scientific applications for parallel computers such as clusters of workstations, it is essential to consider and optimize the single-CPU performance of the codes. For this purpose, it is particularly important that the codes respect the hierarchical memory designs that computer architects employ in order to hide the effects of the growing gap between CPU performance and main memory speed. In this article, we present techniques to enhance the single-CPU efficiency of lattice Boltzmann methods, which are commonly used in computational fluid dynamics. We show various performance results for both 2D and 3D codes in order to emphasize the effectiveness of our optimization techniques.


2021 ◽  
Vol 47 (2) ◽  
pp. 1-26
Author(s):  
Field G. Van Zee ◽  
Devangi N. Parikh ◽  
Robert A. Van De Geijn

We approach the problem of implementing mixed-datatype support within the general matrix multiplication (gemm) operation of the BLAS-like Library Instantiation Software framework, whereby each matrix operand A, B, and C may be stored as single- or double-precision real or complex values. Another factor of complexity, whereby the matrix product and accumulation are allowed to take place in a precision different from the storage precisions of either A or B, is also discussed. We first break the problem into orthogonal dimensions, considering the mixing of domains separately from mixing precisions. Support for all combinations of matrix operands stored in either the real or complex domain is mapped out by enumerating the cases and describing an implementation approach for each. Supporting all combinations of storage and computation precisions is handled by typecasting the matrices at key stages of the computation (during packing and/or accumulation, as needed). Several optional optimizations are also documented. Performance results gathered on a 56-core Marvell ThunderX2 and a 52-core Intel Xeon Platinum demonstrate that high performance is mostly preserved, with modest slowdowns incurred from unavoidable typecast instructions. The mixed-datatype implementation confirms that combinatorial intractability is avoided, with the framework relying on only two assembly microkernels to implement 128 datatype combinations.
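
The precision-mixing half of this abstract (typecast during packing, then again when accumulating into C) can be sketched in a few lines. The example below is a simplified model restricted to the real domain. The names `to_single`, `pack`, and `gemm_mixed` and the `"s"`/`"d"` precision tags are hypothetical, and nothing here is the framework's actual API. Single precision is emulated by a round trip through the standard `struct` module.

```python
from __future__ import annotations

import struct


def to_single(x: float) -> float:
    """Round a Python float (a double) to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", x))[0]


def cast(x: float, precision: str) -> float:
    """Typecast x to single ("s") or double ("d") precision."""
    return to_single(x) if precision == "s" else float(x)


def pack(matrix: list[list[float]], precision: str) -> list[list[float]]:
    """Copy a matrix, typecasting every element to the computation precision."""
    return [[cast(x, precision) for x in row] for row in matrix]


def gemm_mixed(a, b, c, comp: str = "d", c_storage: str = "s"):
    """Return C + A @ B, computed in `comp` precision and stored in `c_storage`."""
    ap, bp = pack(a, comp), pack(b, comp)  # typecast once, during packing
    out = []
    for i, row in enumerate(ap):
        out_row = []
        for j in range(len(bp[0])):
            acc = cast(c[i][j], comp)
            for p, aip in enumerate(row):
                acc = cast(acc + aip * bp[p][j], comp)
            out_row.append(cast(acc, c_storage))  # typecast on accumulation into C
        out.append(out_row)
    return out


if __name__ == "__main__":
    print(gemm_mixed([[1.0, 2.0]], [[3.0], [4.0]], [[0.5]]))
```

Because the casts sit at the packing and accumulation boundaries, the inner multiply-add loop is the same for every storage combination, which mirrors the abstract's point that a small number of kernels can serve many datatype combinations.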


2011 ◽  
Vol 44 (3/4) ◽  
pp. 107-108
Author(s):  
Charles Éric Drevet ◽  
Md. Nazrul Islam ◽  
Éric Schost

Author(s):  
Harald Servat ◽  
Antonio J. Pena ◽  
German Llort ◽  
Estanislao Mercadal ◽  
Hans-Christian Hoppe ◽  
...  
