Global arrays: A nonuniform memory access programming model for high-performance computers

1996 ◽  
Vol 10 (2) ◽  
Author(s):  
Jaroslaw Nieplocha ◽  
Robert J. Harrison ◽  
Richard J. Littlefield

Author(s):  
Venkat N Gudivada ◽  
Jagadeesh Nandigam ◽  
Jordan Paris

Availability of multiprocessor and multi-core chips and GPU accelerators at commodity prices is making personal supercomputers a reality. High-performance programming models help apply this computational power to analyze and visualize massive datasets. Problems that until recently required multi-million-dollar supercomputers can now be solved on personal supercomputers; however, specialized programming techniques are needed to harness their power. This chapter provides an overview of approaches to programming high-performance computers (HPC). The programming paradigms illustrated include OpenMP, OpenACC, CUDA, OpenCL, the shared-memory-based concurrent programming model of Haskell, MPI, MapReduce, and the message-based distributed computing model of Erlang. The goal is to provide enough detail on these paradigms to help the reader understand the fundamental differences and similarities among them. Example programs are chosen to illustrate the salient concepts that define each paradigm. The chapter concludes with research directions and future trends in programming high-performance computers.
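As a minimal illustration of the shared-memory paradigm surveyed above, the following C sketch parallelizes a dot product with OpenMP; the array size and variable names are illustrative choices, not taken from the chapter.

    /* Minimal OpenMP sketch (shared-memory paradigm): parallel dot product.
       Compile with: gcc -fopenmp dot.c */
    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void) {
        static double a[N], b[N];
        double sum = 0.0;
        for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

        /* Each thread accumulates a private partial sum; OpenMP combines
           the partial sums via the reduction clause. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i] * b[i];

        printf("dot = %f (max threads: %d)\n", sum, omp_get_max_threads());
        return 0;
    }

The same computation looks quite different under MPI (explicit message passing) or CUDA (explicit device kernels), which is precisely the contrast the chapter draws between paradigms.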



Author(s):  
Jack Dongarra ◽  
Laura Grigori ◽  
Nicholas J. Higham

A number of features of today’s high-performance computers make it challenging to exploit these machines fully for computational science. These include increasing core counts but stagnant clock frequencies; the high cost of data movement; the use of accelerators (GPUs, FPGAs, coprocessors), making architectures increasingly heterogeneous; and multiple precisions of floating-point arithmetic, including half precision. Moreover, in addition to maximizing speed and accuracy, minimizing energy consumption is an important criterion. New generations of algorithms are needed to tackle these challenges. We discuss some approaches that we can take to develop numerical algorithms for high-performance computational science, with a view to exploiting the next generation of supercomputers. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.
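One classical way to exploit multiple precisions, in the spirit of the approaches discussed above, is mixed-precision iterative refinement: factor and solve cheaply in low precision, then compute residuals and corrections in high precision. The C sketch below is illustrative only; the low-precision solver here is a few Gauss-Seidel sweeps standing in for a real low-precision LU solve, and it assumes a diagonally dominant matrix.

    /* Hedged sketch of mixed-precision iterative refinement: obtain a
       cheap correction in float, accumulate residuals and the solution
       in double. Row-major dense layout; names are illustrative. */
    #include <stddef.h>

    /* crude float "solver": Gauss-Seidel sweeps on a diagonally dominant A,
       standing in for a genuine low-precision factorization/solve */
    static void lo_solve(size_t n, const float *A, const float *r, float *d) {
        for (size_t i = 0; i < n; i++) d[i] = 0.0f;
        for (int sweep = 0; sweep < 10; sweep++)
            for (size_t i = 0; i < n; i++) {
                float s = r[i];
                for (size_t j = 0; j < n; j++)
                    if (j != i) s -= A[i*n + j] * d[j];
                d[i] = s / A[i*n + i];
            }
    }

    /* A: double-precision matrix; Af: the same matrix rounded to float */
    void refine(size_t n, const double *A, const float *Af,
                const double *b, double *x, int sweeps) {
        float r32[n], d32[n];                /* C99 VLAs, fine for a sketch */
        for (int s = 0; s < sweeps; s++) {
            for (size_t i = 0; i < n; i++) { /* residual r = b - A*x in double */
                double ri = b[i];
                for (size_t j = 0; j < n; j++)
                    ri -= A[i*n + j] * x[j];
                r32[i] = (float)ri;
            }
            lo_solve(n, Af, r32, d32);       /* cheap correction in float */
            for (size_t i = 0; i < n; i++)
                x[i] += (double)d32[i];      /* update in double */
        }
    }

The appeal on modern hardware is that the expensive inner work runs at the faster, more energy-efficient low precision, while high-precision residuals recover the accuracy.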



Author(s):  
Breno A. de Melo Menezes ◽  
Nina Herrmann ◽  
Herbert Kuchen ◽  
Fernando Buarque de Lima Neto

Parallel implementations of swarm intelligence algorithms such as ant colony optimization (ACO) have been widely used to shorten the execution time when solving complex optimization problems. When aiming for a GPU environment, developing efficient parallel versions of such algorithms using CUDA can be a difficult and error-prone task even for experienced programmers. To overcome this issue, the parallel programming model of algorithmic skeletons simplifies parallel programs by abstracting from low-level features. This is realized by defining common programming patterns (e.g., map, fold, and zip) that are later converted to efficient parallel code. In this paper, we show how algorithmic skeletons formulated in the domain-specific language Musket can cope with the development of a parallel implementation of ACO and how it compares to a low-level implementation. Our experimental results show that Musket suits the development of ACO. Besides making it easier for the programmer to deal with the parallelization aspects, Musket generates high-performance code with execution times similar to those of low-level implementations.
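The skeleton idea can be sketched even in plain C: the user supplies only an elementwise function, and the skeleton hides the iteration (and, in a real framework such as Musket, the generated CUDA/MPI parallelization behind the same interface). This toy is not Musket's API, only an illustration of the pattern.

    /* Toy "map" and "fold" skeletons via function pointers; a skeleton
       framework would compile the same patterns to parallel code. */
    #include <stdio.h>

    typedef double (*unary_fn)(double);
    typedef double (*binary_fn)(double, double);

    void map(unary_fn f, const double *in, double *out, int n) {
        for (int i = 0; i < n; i++) out[i] = f(in[i]);   /* elementwise apply */
    }

    double fold(binary_fn f, double init, const double *in, int n) {
        double acc = init;                                /* sequential reduce */
        for (int i = 0; i < n; i++) acc = f(acc, in[i]);
        return acc;
    }

    static double square(double x) { return x * x; }
    static double add(double a, double b) { return a + b; }

    int main(void) {
        double v[4] = {1, 2, 3, 4}, w[4];
        map(square, v, w, 4);
        printf("%g\n", fold(add, 0.0, w, 4));  /* prints 30 */
        return 0;
    }

Because map and fold fix the data-access pattern, a skeleton compiler is free to parallelize them safely; that is what lets Musket match hand-written CUDA while sparing the programmer the low-level details.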



Author(s):  
Huiquan Wang ◽  
Xinhai Xu ◽  
Jing Zhou ◽  
Hao Zhou ◽  
Guibin Wang


1992 ◽  
Vol 10 (6) ◽  
pp. 632-632
Author(s):  
Stuart M. Dambrot


PAMM ◽  
2015 ◽  
Vol 15 (1) ◽  
pp. 495-496 ◽  
Author(s):  
Lennart Schneiders ◽  
Jerry H. Grimmen ◽  
Matthias Meinke ◽  
Wolfgang Schröder


2012 ◽  
Vol 629 ◽  
pp. 704-710
Author(s):  
Xi Ying Liu ◽  
Tong Gui Bai ◽  
Tao Zhang

Analyzing problems represented by partial differential equations numerically with modern high-performance computers has become an important approach in earth science research. In this work, a sea ice numerical model under JASMIN (J parallel Adaptive Structured Mesh applications INfrastructure), SIMJ for brevity, including thermodynamic and dynamic processes, is implemented, and a numerical experiment of 20-year integration with SIMJ has been performed. It is found that the model reproduces the seasonal variation of Arctic sea ice well and that the implementation of parallel computing is flexible and easy. For a one-year integration on a mobile workstation (ThinkPad W510) with Red Hat Enterprise Linux 5.4 and the Portland Group’s pgf90 9.0-1, the ratio of time consumption is 1:1.16:1.48:2.45 with 8, 4, 2, and 1 core(s), respectively.
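For reference, the quoted time ratios convert directly into speedups and parallel efficiencies (speedup S(p) = T(1)/T(p), efficiency E(p) = S(p)/p); the short C sketch below just performs that arithmetic on the numbers from the abstract.

    /* Derive speedup and efficiency from the reported time ratios
       (1 : 1.16 : 1.48 : 2.45 for 8, 4, 2, 1 cores). */
    #include <stdio.h>

    int main(void) {
        int    cores[] = {1, 2, 4, 8};
        double ratio[] = {2.45, 1.48, 1.16, 1.00};  /* relative run times */
        for (int i = 0; i < 4; i++) {
            double speedup = ratio[0] / ratio[i];   /* T(1)/T(p) */
            printf("%d core(s): speedup %.2f, efficiency %.2f\n",
                   cores[i], speedup, speedup / cores[i]);
        }
        return 0;
    }

On 8 cores this gives a speedup of about 2.45, i.e. an efficiency near 31%, consistent with a memory-bound stencil-style code on a laptop-class machine.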



Author(s):  
Aad J. van der Steen ◽  
Jack Dongarra


Electronics ◽  
2021 ◽  
Vol 10 (16) ◽  
pp. 1984
Author(s):  
Wei Zhang ◽  
Zihao Jiang ◽  
Zhiguang Chen ◽  
Nong Xiao ◽  
Yang Ou

Double-precision general matrix multiplication (DGEMM) is an essential kernel for measuring the potential performance of an HPC platform. ARMv8-based system-on-chips (SoCs) have become candidates for the next-generation HPC systems with their highly competitive performance and energy efficiency. Therefore, it is meaningful to design high-performance DGEMM for ARMv8-based SoCs. However, as ARMv8-based SoCs integrate more and more cores, modern CPUs adopt non-uniform memory access (NUMA). NUMA restricts the performance and scalability of DGEMM when many threads access remote NUMA domains, which makes it challenging to develop high-performance DGEMM on multi-NUMA architectures. We present a NUMA-aware method to reduce the number of cross-die and cross-chip memory access events. The critical enabler for NUMA-aware DGEMM is to leverage two levels of parallelism, between and within nodes, in a purely threaded implementation, which allows the task independence and data localization of NUMA nodes. We have implemented NUMA-aware DGEMM in OpenBLAS and evaluated it on a dual-socket server with 48-core processors based on the Kunpeng920 architecture. The results show that NUMA-aware DGEMM effectively reduces the number of cross-die and cross-chip memory accesses, significantly enhancing the scalability of DGEMM and increasing its performance by 17.1% on average, with the most remarkable improvement being 21.9%.
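A minimal sketch of the data-localization idea, assuming Linux and libnuma: each NUMA node gets its own slice of the matrix allocated in node-local memory, and work bound to that node touches only its slice, so the inner GEMM kernel reads local memory. This illustrates the principle only; the paper's actual implementation lives inside OpenBLAS's threading layer.

    /* NUMA-aware data placement sketch (compile with -lnuma). */
    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }

        int nodes = numa_num_configured_nodes();
        size_t rows = 1024, cols = 1024;
        size_t rows_per_node = rows / nodes;
        double *slice[64];                       /* assume <= 64 NUMA nodes */

        for (int nd = 0; nd < nodes; nd++) {
            /* allocate this node's rows in that node's local memory */
            slice[nd] = numa_alloc_onnode(rows_per_node * cols * sizeof(double), nd);
            numa_run_on_node(nd);                /* bind execution to node nd */
            /* initialize locally; a worker pinned to this node would later
               run the GEMM kernel over these rows only */
            for (size_t i = 0; i < rows_per_node * cols; i++)
                slice[nd][i] = 0.0;
        }

        for (int nd = 0; nd < nodes; nd++)
            numa_free(slice[nd], rows_per_node * cols * sizeof(double));
        return 0;
    }

Partitioning tasks per node in this way is what removes the cross-die and cross-chip accesses that otherwise limit DGEMM scalability.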



Author(s):  
Oscar D. Marcenaro-Gutierrez ◽  
Sandra Gonzalez-Gallardo ◽  
Mariano Luque

In this article, we carry out a combined econometric and multiobjective analysis using data from a representative sample of Andalusian schools. In particular, four econometric models are estimated in which measures of students’ academic performance (scores in math and reading, and the percentage of students reaching a certain threshold in each subject, respectively) are regressed against the satisfaction of students with different aspects of the teaching-learning process. From these estimates, four objective functions are defined and simultaneously maximized, subject to a set of constraints obtained by analyzing dependencies between explanatory variables. This multiobjective programming model is intended to optimize students’ academic performance as a function of their satisfaction. To solve the problem we use a decomposition-based evolutionary multiobjective algorithm, Global WASF-GA, with different scalarizing functions, which generates an approximation of the Pareto optimal front. In general, the results show the importance of promoting respect and closer interaction between students and teachers as a way to increase both the average performance of the students and the proportion of high-performing students.
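Algorithms in the WASF-GA family rank candidate solutions with an achievement scalarizing function (ASF) measured against a reference point. The C sketch below shows a Wierzbicki-style augmented ASF for a maximization problem; the weights, reference point, augmentation constant, and sample objective values are hypothetical, and the exact form used in the article may differ.

    /* Wierzbicki-type achievement scalarizing function (maximization):
       lower ASF values indicate solutions closer to the reference point q. */
    #include <stdio.h>

    double asf(const double *f, const double *q, const double *w,
               int m, double rho) {
        double maxterm = -1e300, sum = 0.0;
        for (int i = 0; i < m; i++) {
            double t = w[i] * (q[i] - f[i]);   /* weighted shortfall */
            if (t > maxterm) maxterm = t;
            sum += t;
        }
        return maxterm + rho * sum;            /* augmented Chebyshev form */
    }

    int main(void) {
        double f[2] = {520.0, 510.0};  /* hypothetical math/reading objectives */
        double q[2] = {550.0, 550.0};  /* aspiration (reference) point */
        double w[2] = {0.5, 0.5};      /* one weight vector of many */
        printf("ASF = %.3f\n", asf(f, q, w, 2, 1e-3));
        return 0;
    }

Sweeping over many weight vectors, as Global WASF-GA does, yields solutions spread across the Pareto optimal front rather than a single compromise point.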


