Global arrays: A nonuniform memory access programming model for high-performance computers

1996 ◽  
Vol 10 (2) ◽  
Author(s):  
Jaroslaw Nieplocha ◽  
Robert J. Harrison ◽  
Richard J. Littlefield

Author(s):  
Venkat N Gudivada ◽  
Jagadeesh Nandigam ◽  
Jordan Paris

Availability of multiprocessor and multi-core chips and GPU accelerators at commodity prices is making personal supercomputers a reality. High-performance programming models help apply this computational power to analyze and visualize massive datasets. Problems that until recently required multi-million-dollar supercomputers can now be solved on personal supercomputers; however, specialized programming techniques are needed to harness their power. This chapter provides an overview of approaches to programming high-performance computers (HPC). The programming paradigms illustrated include OpenMP, OpenACC, CUDA, OpenCL, the shared-memory-based concurrent programming model of Haskell, MPI, MapReduce, and the message-based distributed computing model of Erlang. The goal is to provide enough detail on these paradigms to help the reader understand the fundamental differences and similarities among them. Example programs are chosen to illustrate the salient concepts that define each paradigm. The chapter concludes with research directions and future trends in programming high-performance computers.
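As a minimal illustration of the shared-memory paradigm surveyed above, the following C sketch parallelizes a dot product with OpenMP; the array size and variable names are illustrative choices, not taken from the chapter.

    /* Minimal OpenMP sketch (shared-memory paradigm): parallel dot product.
       Compile with: gcc -fopenmp dot.c */
    #include <stdio.h>
    #include <omp.h>

    #define N 1000000

    int main(void) {
        static double a[N], b[N];
        double sum = 0.0;
        for (int i = 0; i < N; i++) { a[i] = 1.0; b[i] = 2.0; }

        /* Each thread accumulates a private partial sum; OpenMP combines
           the partial sums via the reduction clause. */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += a[i] * b[i];

        printf("dot = %f (max threads: %d)\n", sum, omp_get_max_threads());
        return 0;
    }

The same computation looks quite different under MPI (explicit message passing) or CUDA (explicit device kernels), which is precisely the contrast the chapter draws between paradigms.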



Author(s):  
Jack Dongarra ◽  
Laura Grigori ◽  
Nicholas J. Higham

A number of features of today’s high-performance computers make it challenging to exploit these machines fully for computational science. These include increasing core counts but stagnant clock frequencies; the high cost of data movement; the use of accelerators (GPUs, FPGAs, coprocessors), making architectures increasingly heterogeneous; and multiple precisions of floating-point arithmetic, including half precision. Moreover, in addition to maximizing speed and accuracy, minimizing energy consumption is an important criterion. New generations of algorithms are needed to tackle these challenges. We discuss some approaches that we can take to develop numerical algorithms for high-performance computational science, with a view to exploiting the next generation of supercomputers. This article is part of a discussion meeting issue ‘Numerical algorithms for high-performance computational science’.
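One classical way to exploit multiple precisions, in the spirit of the approaches discussed above, is mixed-precision iterative refinement: factor and solve cheaply in low precision, then compute residuals and corrections in high precision. The C sketch below is illustrative only; the low-precision solver here is a few Gauss-Seidel sweeps standing in for a real low-precision LU solve, and it assumes a diagonally dominant matrix.

    /* Hedged sketch of mixed-precision iterative refinement: obtain a
       cheap correction in float, accumulate residuals and the solution
       in double. Row-major dense layout; names are illustrative. */
    #include <stddef.h>

    /* crude float "solver": Gauss-Seidel sweeps on a diagonally dominant A,
       standing in for a genuine low-precision factorization/solve */
    static void lo_solve(size_t n, const float *A, const float *r, float *d) {
        for (size_t i = 0; i < n; i++) d[i] = 0.0f;
        for (int sweep = 0; sweep < 10; sweep++)
            for (size_t i = 0; i < n; i++) {
                float s = r[i];
                for (size_t j = 0; j < n; j++)
                    if (j != i) s -= A[i*n + j] * d[j];
                d[i] = s / A[i*n + i];
            }
    }

    /* A: double-precision matrix; Af: the same matrix rounded to float */
    void refine(size_t n, const double *A, const float *Af,
                const double *b, double *x, int sweeps) {
        float r32[n], d32[n];                /* C99 VLAs, fine for a sketch */
        for (int s = 0; s < sweeps; s++) {
            for (size_t i = 0; i < n; i++) { /* residual r = b - A*x in double */
                double ri = b[i];
                for (size_t j = 0; j < n; j++)
                    ri -= A[i*n + j] * x[j];
                r32[i] = (float)ri;
            }
            lo_solve(n, Af, r32, d32);       /* cheap correction in float */
            for (size_t i = 0; i < n; i++)
                x[i] += (double)d32[i];      /* update in double */
        }
    }

The appeal on modern hardware is that the expensive inner work runs at the faster, more energy-efficient low precision, while high-precision residuals recover the accuracy.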



Author(s):  
Breno A. de Melo Menezes ◽  
Nina Herrmann ◽  
Herbert Kuchen ◽  
Fernando Buarque de Lima Neto

Parallel implementations of swarm intelligence algorithms such as ant colony optimization (ACO) have been widely used to shorten the execution time when solving complex optimization problems. When aiming for a GPU environment, developing efficient parallel versions of such algorithms using CUDA can be a difficult and error-prone task even for experienced programmers. To overcome this issue, the parallel programming model of algorithmic skeletons simplifies parallel programs by abstracting from low-level features. This is realized by defining common programming patterns (e.g., map, fold, and zip) that are later converted to efficient parallel code. In this paper, we show how algorithmic skeletons formulated in the domain-specific language Musket can cope with the development of a parallel implementation of ACO and how it compares to a low-level implementation. Our experimental results show that Musket suits the development of ACO. Besides making it easier for the programmer to deal with the parallelization aspects, Musket generates high-performance code with execution times similar to those of low-level implementations.
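The skeleton idea can be sketched even in plain C: the user supplies only an elementwise function, and the skeleton hides the iteration (and, in a real framework such as Musket, the generated CUDA/MPI parallelization behind the same interface). This toy is not Musket's API, only an illustration of the pattern.

    /* Toy "map" and "fold" skeletons via function pointers; a skeleton
       framework would compile the same patterns to parallel code. */
    #include <stdio.h>

    typedef double (*unary_fn)(double);
    typedef double (*binary_fn)(double, double);

    void map(unary_fn f, const double *in, double *out, int n) {
        for (int i = 0; i < n; i++) out[i] = f(in[i]);   /* elementwise apply */
    }

    double fold(binary_fn f, double init, const double *in, int n) {
        double acc = init;                                /* sequential reduce */
        for (int i = 0; i < n; i++) acc = f(acc, in[i]);
        return acc;
    }

    static double square(double x) { return x * x; }
    static double add(double a, double b) { return a + b; }

    int main(void) {
        double v[4] = {1, 2, 3, 4}, w[4];
        map(square, v, w, 4);
        printf("%g\n", fold(add, 0.0, w, 4));  /* prints 30 */
        return 0;
    }

Because map and fold fix the data-access pattern, a skeleton compiler is free to parallelize them safely; that is what lets Musket match hand-written CUDA while sparing the programmer the low-level details.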



Author(s):  
Huiquan Wang ◽  
Xinhai Xu ◽  
Jing Zhou ◽  
Hao Zhou ◽  
Guibin Wang


1992 ◽  
Vol 10 (6) ◽  
pp. 632-632
Author(s):  
Stuart M. Dambrot


PAMM ◽  
2015 ◽  
Vol 15 (1) ◽  
pp. 495-496 ◽  
Author(s):  
Lennart Schneiders ◽  
Jerry H. Grimmen ◽  
Matthias Meinke ◽  
Wolfgang Schröder


2012 ◽  
Vol 629 ◽  
pp. 704-710
Author(s):  
Xi Ying Liu ◽  
Tong Gui Bai ◽  
Tao Zhang

Analyzing problems represented by partial differential equations numerically with modern high-performance computers has become an important approach in earth science research. In this work, a sea ice numerical model under JASMIN (J parallel Adaptive Structured Mesh applications INfrastructure), SIMJ for brevity, including thermodynamic and dynamic processes, is implemented, and a numerical experiment of 20-year integration with SIMJ has been performed. It is found that the model reproduces the seasonal variation of Arctic sea ice well and that the implementation of parallel computing is flexible and easy. For a one-year integration on a mobile workstation (ThinkPad W510) with Red Hat Enterprise Linux 5.4 and the Portland Group’s pgf90 9.0-1, the ratio of time consumption is 1:1.16:1.48:2.45 with 8, 4, 2, and 1 core(s), respectively.
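For reference, the quoted time ratios convert directly into speedups and parallel efficiencies (speedup S(p) = T(1)/T(p), efficiency E(p) = S(p)/p); the short C sketch below just performs that arithmetic on the numbers from the abstract.

    /* Derive speedup and efficiency from the reported time ratios
       (1 : 1.16 : 1.48 : 2.45 for 8, 4, 2, 1 cores). */
    #include <stdio.h>

    int main(void) {
        int    cores[] = {1, 2, 4, 8};
        double ratio[] = {2.45, 1.48, 1.16, 1.00};  /* relative run times */
        for (int i = 0; i < 4; i++) {
            double speedup = ratio[0] / ratio[i];   /* T(1)/T(p) */
            printf("%d core(s): speedup %.2f, efficiency %.2f\n",
                   cores[i], speedup, speedup / cores[i]);
        }
        return 0;
    }

On 8 cores this gives a speedup of about 2.45, i.e. an efficiency near 31%, consistent with a memory-bound stencil-style code on a laptop-class machine.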



Author(s):  
Aad J. van der Steen ◽  
Jack Dongarra


Electronics ◽  
2021 ◽  
Vol 10 (16) ◽  
pp. 1984
Author(s):  
Wei Zhang ◽  
Zihao Jiang ◽  
Zhiguang Chen ◽  
Nong Xiao ◽  
Yang Ou

Double-precision general matrix multiplication (DGEMM) is an essential kernel for measuring the potential performance of an HPC platform. ARMv8-based system-on-chips (SoCs) have become candidates for the next-generation HPC systems with their highly competitive performance and energy efficiency. Therefore, it is meaningful to design high-performance DGEMM for ARMv8-based SoCs. However, as ARMv8-based SoCs integrate more and more cores, modern CPUs adopt non-uniform memory access (NUMA). NUMA restricts the performance and scalability of DGEMM when many threads access remote NUMA domains, which makes it challenging to develop high-performance DGEMM on multi-NUMA architectures. We present a NUMA-aware method to reduce the number of cross-die and cross-chip memory access events. The critical enabler for NUMA-aware DGEMM is to leverage two levels of parallelism, between and within nodes, in a purely threaded implementation, which allows the task independence and data localization of NUMA nodes. We have implemented NUMA-aware DGEMM in OpenBLAS and evaluated it on a dual-socket server with 48-core processors based on the Kunpeng920 architecture. The results show that NUMA-aware DGEMM effectively reduces the number of cross-die and cross-chip memory accesses, significantly enhancing the scalability of DGEMM and increasing its performance by 17.1% on average, with the most remarkable improvement being 21.9%.
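A minimal sketch of the data-localization idea, assuming Linux and libnuma: each NUMA node gets its own slice of the matrix allocated in node-local memory, and work bound to that node touches only its slice, so the inner GEMM kernel reads local memory. This illustrates the principle only; the paper's actual implementation lives inside OpenBLAS's threading layer.

    /* NUMA-aware data placement sketch (compile with -lnuma). */
    #include <numa.h>
    #include <stdio.h>

    int main(void) {
        if (numa_available() < 0) { fprintf(stderr, "no NUMA support\n"); return 1; }

        int nodes = numa_num_configured_nodes();
        size_t rows = 1024, cols = 1024;
        size_t rows_per_node = rows / nodes;
        double *slice[64];                       /* assume <= 64 NUMA nodes */

        for (int nd = 0; nd < nodes; nd++) {
            /* allocate this node's rows in that node's local memory */
            slice[nd] = numa_alloc_onnode(rows_per_node * cols * sizeof(double), nd);
            numa_run_on_node(nd);                /* bind execution to node nd */
            /* initialize locally; a worker pinned to this node would later
               run the GEMM kernel over these rows only */
            for (size_t i = 0; i < rows_per_node * cols; i++)
                slice[nd][i] = 0.0;
        }

        for (int nd = 0; nd < nodes; nd++)
            numa_free(slice[nd], rows_per_node * cols * sizeof(double));
        return 0;
    }

Partitioning tasks per node in this way is what removes the cross-die and cross-chip accesses that otherwise limit DGEMM scalability.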



Author(s):  
Oscar D. Marcenaro-Gutierrez ◽  
Sandra Gonzalez-Gallardo ◽  
Mariano Luque

In this article, we carry out a combined econometric and multiobjective analysis using data from a representative sample of Andalusian schools. In particular, four econometric models are estimated in which measures of students’ academic performance (scores in math and reading, and the percentage of students reaching a certain threshold in each subject, respectively) are regressed against the satisfaction of students with different aspects of the teaching-learning process. From these estimates, four objective functions are defined and simultaneously maximized, subject to a set of constraints obtained by analyzing dependencies between explanatory variables. This multiobjective programming model is intended to optimize students’ academic performance as a function of their satisfaction. To solve the problem we use a decomposition-based evolutionary multiobjective algorithm, Global WASF-GA, with different scalarizing functions, which generates an approximation of the Pareto optimal front. In general, the results show the importance of promoting respect and closer interaction between students and teachers as a way to increase both the average performance of the students and the proportion of high-performing students.
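Algorithms in the WASF-GA family rank candidate solutions with an achievement scalarizing function (ASF) measured against a reference point. The C sketch below shows a Wierzbicki-style augmented ASF for a maximization problem; the weights, reference point, augmentation constant, and sample objective values are hypothetical, and the exact form used in the article may differ.

    /* Wierzbicki-type achievement scalarizing function (maximization):
       lower ASF values indicate solutions closer to the reference point q. */
    #include <stdio.h>

    double asf(const double *f, const double *q, const double *w,
               int m, double rho) {
        double maxterm = -1e300, sum = 0.0;
        for (int i = 0; i < m; i++) {
            double t = w[i] * (q[i] - f[i]);   /* weighted shortfall */
            if (t > maxterm) maxterm = t;
            sum += t;
        }
        return maxterm + rho * sum;            /* augmented Chebyshev form */
    }

    int main(void) {
        double f[2] = {520.0, 510.0};  /* hypothetical math/reading objectives */
        double q[2] = {550.0, 550.0};  /* aspiration (reference) point */
        double w[2] = {0.5, 0.5};      /* one weight vector of many */
        printf("ASF = %.3f\n", asf(f, q, w, 2, 1e-3));
        return 0;
    }

Sweeping over many weight vectors, as Global WASF-GA does, yields solutions spread across the Pareto optimal front rather than a single compromise point.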


