A Transparent Runtime Data Distribution Engine for OpenMP

2000 ◽  
Vol 8 (3) ◽  
pp. 143-162 ◽  
Author(s):  
Dimitrios S. Nikolopoulos ◽  
Theodore S. Papatheodorou ◽  
Constantine D. Polychronopoulos ◽  
Jesús Labarta ◽  
Eduard Ayguadé

This paper makes two important contributions. First, it investigates the performance implications of data placement in OpenMP programs running on modern NUMA multiprocessors. Data locality and minimization of the rate of remote memory accesses are critical for sustaining high performance on these systems. We show that, due to the low remote-to-local memory access latency ratio of contemporary NUMA architectures, reasonably balanced page placement schemes, such as round-robin or random distribution, incur only modest performance losses. Second, the paper presents a transparent, user-level page migration engine able to recover the performance lost to suboptimal page placement in iterative OpenMP programs. The main body of the paper describes how our OpenMP runtime environment uses page migration to implement implicit data distribution and redistribution schemes without programmer intervention. Our experimental results verify the effectiveness of the proposed framework and provide a proof of concept that data distribution directives need not be introduced into OpenMP, thus preserving the simplicity and portability of the programming model.
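
The placement schemes discussed above arise from how pages are first touched. As a point of reference, the following minimal C/OpenMP sketch (not from the paper) shows the classic first-touch technique, where initializing data in parallel with the same schedule as the compute loop co-locates pages with the threads that later use them:

```c
/* Minimal sketch: first-touch NUMA page placement in OpenMP (C).
   On first-touch systems, pages of `a` are physically allocated on the
   node of the thread that first writes them, so a parallel initializer
   with the same static schedule as the compute loop keeps accesses local. */
#include <stdlib.h>

#define N (1 << 24)

int main(void) {
    double *a = malloc(N * sizeof *a);

    /* First touch: each thread faults in the pages it will later use. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = 0.0;

    /* Compute loop with the same static schedule hits local pages. */
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < N; i++)
        a[i] = a[i] * 2.0 + 1.0;

    free(a);
    return 0;
}
```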

2002 ◽  
Vol 10 (1) ◽  
pp. 45-53 ◽  
Author(s):  
Jie Tao ◽  
Wolfgang Karl ◽  
Martin Schulz

Shared memory applications running transparently on top of NUMA architectures often face severe performance problems due to poor data locality and excessive remote memory accesses. Optimizations with respect to data locality are therefore necessary, but require a fundamental understanding of an application's memory access behavior. This information cannot be obtained with simple code instrumentation, due to the implicit nature of the communication handled by the NUMA hardware, the large amount of traffic produced at runtime, and the fine access granularity of shared memory codes. This paper presents an approach that overcomes these problems and thereby enables an easy and efficient optimization process. Based on a low-level hardware monitoring facility coupled with a comprehensive visualization tool, it enables the generation of memory access histograms capable of showing all memory accesses across the complete address space of an application's working set. This information can be used to identify access hot spots, to understand the dynamic behavior of shared memory applications, and to optimize applications with an application-specific data layout, resulting in significant performance improvements.
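
To make the idea of a memory access histogram concrete, here is a minimal C sketch (the trace source and all names are illustrative assumptions, not the paper's hardware monitor interface) that buckets sampled addresses by page to expose hot spots:

```c
/* Minimal sketch: page-granularity access histogram over an address range.
   A real setup would feed this from a hardware monitor; here the trace
   is simulated so the sketch stays self-contained. */
#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT 12     /* 4 KiB pages */
#define NPAGES     1024   /* pages covered by the histogram */

static uint64_t histogram[NPAGES];

static void record_access(uintptr_t addr, uintptr_t base) {
    uintptr_t page = (addr - base) >> PAGE_SHIFT;
    if (page < NPAGES)
        histogram[page]++;
}

static void dump_hot_pages(uint64_t threshold) {
    for (int p = 0; p < NPAGES; p++)
        if (histogram[p] >= threshold)
            printf("page %4d: %llu accesses\n", p,
                   (unsigned long long)histogram[p]);
}

int main(void) {
    uintptr_t base = 0x10000000;
    /* Simulated trace: a hot spot on page 5, one stray access elsewhere. */
    for (int i = 0; i < 1000; i++)
        record_access(base + 5 * 4096 + (i % 64), base);
    record_access(base + 100 * 4096, base);
    dump_hot_pages(100);
    return 0;
}
```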


1999 ◽  
Vol 7 (1) ◽  
pp. 67-81 ◽  
Author(s):  
Siegfried Benkner

High Performance Fortran (HPF) offers an attractive high-level language interface for programming scalable parallel architectures, providing the user with directives for the specification of data distribution and delegating to the compiler the task of generating an explicitly parallel program. Available HPF compilers can handle regular codes quite efficiently, but dramatic performance losses may be encountered for applications based on highly irregular, dynamically changing data structures and access patterns. In this paper we introduce the Vienna Fortran Compiler (VFC), a new source-to-source parallelization system for HPF+, an optimized version of HPF which addresses the requirements of irregular applications. In addition to extended data distribution and work distribution mechanisms, HPF+ provides the user with language features for specifying information that decisively influences a program's performance. This comprises data locality assertions, non-local access specifications, and the possibility of reusing runtime-generated communication schedules of irregular loops. Performance measurements of kernels from advanced applications demonstrate that with a high-level data parallel language such as HPF+, performance close to that of hand-written message-passing programs can be achieved even for highly irregular codes.
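
The reuse of runtime-generated communication schedules mentioned above follows the classic inspector/executor pattern. Below is a self-contained C sketch of the idea (serial and with illustrative names; the real HPF+/VFC runtime exchanges the gathered values between processors): the inspector scans the indirection array once, and the resulting schedule is reused while the array is unchanged, amortizing the inspection cost.

```c
/* Minimal serial sketch of the inspector/executor pattern behind
   reusable communication schedules for irregular loops. */
#include <stdio.h>
#include <stdlib.h>

typedef struct { int n; int *idx; } schedule_t;

/* Inspector: record which accesses fall outside the local block [lo,hi). */
static schedule_t inspect(const int *map, int n, int lo, int hi) {
    schedule_t s = { 0, malloc(n * sizeof(int)) };
    for (int i = 0; i < n; i++)
        if (map[i] < lo || map[i] >= hi)
            s.idx[s.n++] = map[i];
    return s;
}

int main(void) {
    enum { N = 8, LO = 0, HI = 4 };
    int map[N] = { 0, 5, 2, 7, 1, 6, 3, 4 };   /* indirection array */
    double x[N] = { 0, 1, 2, 3, 4, 5, 6, 7 }, y[N];

    /* Built once, reused every iteration while `map` is unchanged. */
    schedule_t s = inspect(map, N, LO, HI);
    printf("%d remote accesses per iteration\n", s.n);

    for (int it = 0; it < 10; it++)   /* executor phase of each iteration */
        for (int i = 0; i < N; i++)
            y[i] = x[map[i]];         /* remote values gathered per schedule */

    printf("y[1] = %g\n", y[1]);
    free(s.idx);
    return 0;
}
```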


2021 ◽  
Vol 20 (5) ◽  
pp. 1-25
Author(s):  
Björn Forsberg ◽  
Marco Solieri ◽  
Marko Bertogna ◽  
Luca Benini ◽  
Andrea Marongiu

Adoption of multi- and many-core processors in real-time systems has so far been slowed down, if not totally barred, due to the difficulty in providing analytical real-time guarantees on worst-case execution times. The Predictable Execution Model (PREM) has been proposed to solve this problem, but its practical support requires significant code refactoring, a task better suited for a compilation tool chain than for human programmers. Implementing a PREM compiler presents significant challenges in conforming to PREM requirements, such as guaranteed upper bounds on memory footprint and the generation of efficient schedulable non-preemptive regions. This article presents a comprehensive description of how a PREM compiler can be implemented, based on several years of experience from the community. We provide accumulated insights on how to best balance conformance to real-time requirements against performance, and present novel techniques that extend the applicability from simple benchmark suites to real-world applications. We show that code transformed by the PREM compiler enables timing-predictable execution on modern commercial off-the-shelf hardware, providing novel insights on how PREM can protect 99.4% of memory accesses on random-replacement-policy caches at only 16% performance loss on benchmarks from the PolyBench benchmark suite. Finally, we show that the requirements imposed on the programming model are well aligned with current coding guidelines for timing-critical software, promoting easy adoption.
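
A PREM-style predictable interval splits code into a memory phase that loads the working set and a compute phase free of main-memory accesses. The following minimal C sketch shows this structure (illustrative, not the article's compiler output; `__builtin_prefetch` is a GCC/Clang builtin):

```c
/* Minimal sketch of a PREM-style predictable interval.  The memory phase
   prefetches the working set; the compute phase then runs without main-memory
   accesses and can be scheduled as a non-preemptive region. */
#include <stdio.h>
#include <stddef.h>

#define CACHE_LINE 64

static void memory_phase(const char *buf, size_t len) {
    /* Touch every cache line so the compute phase hits in cache. */
    for (size_t i = 0; i < len; i += CACHE_LINE)
        __builtin_prefetch(buf + i, 0 /* read */, 3 /* high locality */);
}

static long compute_phase(const long *data, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)   /* no main-memory accesses expected */
        sum += data[i];
    return sum;
}

static long predictable_interval(const long *data, size_t n) {
    memory_phase((const char *)data, n * sizeof *data);
    return compute_phase(data, n);
}

int main(void) {
    static long data[1 << 16];       /* zero-initialized working set */
    printf("sum = %ld\n", predictable_interval(data, 1 << 16));
    return 0;
}
```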


1996 ◽  
Vol 5 (4) ◽  
pp. 319-327
Author(s):  
Karen H. Warren

PDDP, the Parallel Data Distribution Preprocessor, is a data parallel programming model for distributed memory parallel computers. PDDP implements High Performance Fortran-compatible data distribution directives, with parallelism expressed through Fortran 90 array syntax, the FORALL statement, and the WHERE construct. Distributed data objects belong to a global name space; other data objects are treated as local and replicated on each processor. PDDP allows the user to program in a shared memory style and generates code that is portable to a variety of parallel machines. For interprocessor communication, PDDP uses the fastest communication primitives available on each platform.
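
The arithmetic behind HPF-style block distribution, on which a preprocessor like PDDP relies, is simple. A minimal C sketch (illustrative, not PDDP's implementation) mapping global indices to owning processors and local offsets:

```c
/* Minimal sketch of block data-distribution arithmetic: a global array
   of G elements over P processors is split into contiguous blocks, and
   each processor computes only the iterations it owns. */
#include <stdio.h>

static int block_size(int G, int P)      { return (G + P - 1) / P; }
static int owner(int g, int G, int P)    { return g / block_size(G, P); }
static int to_local(int g, int G, int P) { return g % block_size(G, P); }

int main(void) {
    const int G = 10, P = 4;
    for (int g = 0; g < G; g++)
        printf("global %d -> proc %d, local %d\n",
               g, owner(g, G, P), to_local(g, G, P));
    return 0;
}
```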


2021 ◽  
Vol 47 (2) ◽  
pp. 1-28
Author(s):  
Goran Flegar ◽  
Hartwig Anzt ◽  
Terry Cojean ◽  
Enrique S. Quintana-Ortí

The use of mixed precision in numerical algorithms is a promising strategy for accelerating scientific applications. In particular, the adoption of specialized hardware and data formats for low-precision arithmetic in high-end GPUs (graphics processing units) has motivated numerous efforts to carefully reduce the working precision in order to speed up computations. For algorithms whose performance is bound by memory bandwidth, the idea of compressing data before (and after) memory accesses has received considerable attention. One idea is to store an approximate operator, such as a preconditioner, in lower than working precision, ideally without impacting the algorithm's output. We realize the first high-performance implementation of an adaptive precision block-Jacobi preconditioner, which selects the precision format used to store the preconditioner data on the fly, taking into account the numerical properties of the individual preconditioner blocks. We implement the adaptive block-Jacobi preconditioner as production-ready functionality in the Ginkgo linear algebra library, considering not only the precision formats that are part of the IEEE standard, but also customized formats which tailor the length of the exponent and significand to the characteristics of the preconditioner blocks. Experiments run on a state-of-the-art GPU accelerator show that our implementation offers attractive runtime savings.
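
A hedged C sketch of the core idea, with illustrative names and a made-up condition-estimate threshold (the production Ginkgo code runs on GPUs and also supports custom exponent/significand formats): each block is stored in the cheapest precision its numerical properties tolerate, and unpacked to working precision during the apply.

```c
/* Minimal sketch of adaptive-precision preconditioner block storage.
   A block is kept in float when a cheap condition estimate says the
   rounding error is tolerable, and in double otherwise; the apply
   always computes in the working (double) precision. */
#include <stdlib.h>

typedef enum { PREC_FLOAT, PREC_DOUBLE } prec_t;

typedef struct {
    int     n;     /* block dimension */
    prec_t  prec;  /* storage precision chosen on the fly */
    float  *f;     /* inverse block, if stored in float */
    double *d;     /* inverse block, if stored in double */
} block_t;

static void store_block(block_t *b, const double *inv, int n, double cond_est) {
    b->n = n;
    /* Illustrative threshold: float has roughly 2^-24 unit roundoff. */
    if (cond_est < 1.0e6) {
        b->prec = PREC_FLOAT;
        b->f = malloc((size_t)n * n * sizeof(float));
        for (int i = 0; i < n * n; i++) b->f[i] = (float)inv[i];
    } else {
        b->prec = PREC_DOUBLE;
        b->d = malloc((size_t)n * n * sizeof(double));
        for (int i = 0; i < n * n; i++) b->d[i] = inv[i];
    }
}

/* y = B * x, unpacking to double on the fly (working precision). */
static void apply_block(const block_t *b, const double *x, double *y) {
    for (int i = 0; i < b->n; i++) {
        double acc = 0.0;
        for (int j = 0; j < b->n; j++) {
            double bij = (b->prec == PREC_FLOAT)
                       ? (double)b->f[i * b->n + j]
                       : b->d[i * b->n + j];
            acc += bij * x[j];
        }
        y[i] = acc;
    }
}

int main(void) {
    double inv[4] = { 2.0, 0.0, 0.0, 0.5 };  /* 2x2 inverse diagonal block */
    double x[2] = { 1.0, 4.0 }, y[2];
    block_t b;
    store_block(&b, inv, 2, 4.0 /* condition estimate */);
    apply_block(&b, x, y);                   /* y = {2.0, 2.0} */
    return 0;
}
```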


Author(s):  
Breno A. de Melo Menezes ◽  
Nina Herrmann ◽  
Herbert Kuchen ◽  
Fernando Buarque de Lima Neto

Parallel implementations of swarm intelligence algorithms such as ant colony optimization (ACO) have been widely used to shorten the execution time when solving complex optimization problems. When aiming for a GPU environment, developing efficient parallel versions of such algorithms using CUDA can be a difficult and error-prone task even for experienced programmers. To overcome this issue, the parallel programming model of Algorithmic Skeletons simplifies parallel programs by abstracting from low-level features. This is realized by defining common programming patterns (e.g. map, fold, and zip) that are later converted to efficient parallel code. In this paper, we show how algorithmic skeletons formulated in the domain specific language Musket can cope with the development of a parallel implementation of ACO and how that compares to a low-level implementation. Our experimental results show that Musket suits the development of ACO. Besides making it easier for the programmer to deal with the parallelization aspects, Musket generates high performance code with similar execution times when compared to low-level implementations.
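
A minimal serial C sketch of the map and fold skeletons (illustrative; Musket programs are written in its own DSL, from which efficient parallel code is generated): the user supplies plain functions, and the skeleton owns the iteration, which is exactly where a generator would insert parallel or GPU code.

```c
/* Minimal sketch of the map and fold skeletons via function pointers. */
#include <stdio.h>

typedef double (*unary_fn)(double);
typedef double (*binary_fn)(double, double);

static void map(double *out, const double *in, int n, unary_fn f) {
    for (int i = 0; i < n; i++)   /* parallelizable: no cross-iteration deps */
        out[i] = f(in[i]);
}

static double fold(const double *in, int n, double init, binary_fn f) {
    double acc = init;            /* parallelizable as a tree reduction */
    for (int i = 0; i < n; i++)
        acc = f(acc, in[i]);
    return acc;
}

static double square(double x)        { return x * x; }
static double add(double a, double b) { return a + b; }

int main(void) {
    double in[] = { 1, 2, 3, 4 }, out[4];
    map(out, in, 4, square);
    printf("sum of squares = %g\n", fold(out, 4, 0.0, add));  /* 30 */
    return 0;
}
```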


2021 ◽  
Vol 49 (4) ◽  
pp. 12-17
Author(s):  
Feilong Liu ◽  
Claude Barthels ◽  
Spyros Blanas ◽  
Hideaki Kimura ◽  
Garret Swart

Networks with Remote Direct Memory Access (RDMA) support are becoming increasingly common. RDMA, however, offers a limited programming interface to remote memory that consists of read, write, and atomic operations. With RDMA alone, completing the most basic operations on remote data structures often requires multiple round-trips over the network. Data-intensive systems strongly desire higher-level communication abstractions that support more complex interaction patterns. A natural candidate to consider is MPI, the de facto standard for developing high-performance applications in the HPC community. This paper critically evaluates the communication primitives of MPI and shows that using MPI in the context of a data processing system comes with its own set of insurmountable challenges. Based on this analysis, we propose a new communication abstraction named RDMO, or Remote Direct Memory Operation, that dispatches a short sequence of reads, writes, and atomic operations to remote memory and executes them in a single round-trip.
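
A hedged C sketch of what an RDMO descriptor might look like (hypothetical types and fields; the paper proposes the abstraction, and this is not its actual API): a short program of reads, writes, and atomics is shipped to the remote side and executed there in a single round-trip, instead of one round-trip per operation as with raw RDMA verbs.

```c
/* Hypothetical RDMO descriptor: a bounded sequence of remote-memory
   operations dispatched and executed in one network round-trip. */
#include <stdint.h>
#include <stdio.h>

typedef enum { OP_READ, OP_WRITE, OP_CAS } rdmo_opcode;

typedef struct {
    rdmo_opcode op;
    uint64_t    remote_addr;  /* target offset in the remote region */
    uint64_t    operand;      /* value to write / expected value for CAS */
    uint64_t    swap;         /* replacement value for CAS */
} rdmo_op;

typedef struct {
    rdmo_op ops[8];           /* deliberately short sequence */
    int     count;
} rdmo_program;

static int rdmo_append(rdmo_program *p, rdmo_op op) {
    if (p->count >= 8) return -1;
    p->ops[p->count++] = op;
    return 0;
}

int main(void) {
    /* A "read head, CAS head" pair that would cost two round-trips
       with plain RDMA reads and atomics. */
    rdmo_program prog = { .count = 0 };
    rdmo_append(&prog, (rdmo_op){ OP_READ, 0x0, 0, 0 });
    rdmo_append(&prog, (rdmo_op){ OP_CAS,  0x0, /*expected*/ 0, /*new*/ 42 });
    printf("program with %d ops, dispatched in one round-trip\n", prog.count);
    return 0;
}
```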


2020 ◽  
Vol 6 (5) ◽  
pp. 919-927
Author(s):  
A. A. Akulshin ◽  
N. V. Bredikhina ◽  
An. A. Akulshin ◽  
I. Y. Aksenteva ◽  
N. P. Ermakova

The development of modern water well filtering equipment with enhanced performance characteristics is a vital task. The purpose of this work was to create filters for taking water from underground sources that offer high performance and long service life, and that can be quickly and economically replaced or repaired in case of performance loss. The filter device must be selected taking into account all the geological features of the aquifers, the performance characteristics of the filter devices, and the size of the future structure. Filter equipment designs for water intake wells have been developed in this study. These filters have low hydraulic resistance and high performance, and are easy to repair. This article presents the flow dependencies inside the receiving part of the well and the dependence of filter resistance on the cross-sectional shape of the filter wire, from which the optimal section is selected. The paper proposes a method for selecting the optimal cross-section of the filter wire used in the manufacture of a water well filter. The proposed easy-to-remove well filter designs with increased productivity make it easy to replace a clogged well filter with a new one, reduce capital and operating costs, and increase the intervals between repairs. Based on the presented method, examples are given of selecting the parameters of the filter wire cross-section. The calculations showed that using the hydraulic resistance criterion at the design stage of underground water intakes can significantly reduce the cost of well construction. The studies found that the minimum hydraulic resistance, and thus maximum filter performance, is achieved with filter wire of teardrop or elliptical cross-section.


Author(s):  
Oscar D. Marcenaro-Gutierrez ◽  
Sandra Gonzalez-Gallardo ◽  
Mariano Luque

In this article, we carry out a combined econometric and multiobjective analysis using data from a representative sample of Andalusian schools. In particular, four econometric models are estimated in which measures of the students' academic performance (scores in math and reading, and the percentage of students reaching a certain threshold in both subjects, respectively) are regressed on the satisfaction of students with different aspects of the teaching-learning process. From these estimates, four objective functions are defined and simultaneously maximized, subject to a set of constraints obtained by analyzing dependencies between the explanatory variables. This multiobjective programming model is intended to optimize the students' academic performance as a function of the students' satisfaction. To solve this problem we use a decomposition-based evolutionary multiobjective algorithm called Global WASF-GA with different scalarizing functions, which allows generating an approximation of the Pareto optimal front. In general, the results show the importance of promoting respect and closer interaction between students and teachers as a way to increase both the average performance of the students and the proportion of high-performing students.
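
For reference, WASF-GA builds on Wierzbicki's achievement scalarizing function. Below is a sketch of its standard augmented form for objectives to be maximized toward a reference point; the symbols (reference point q, weight vector mu, small augmentation coefficient rho) are standard in the literature and are assumptions here, not taken from the article.

```latex
% Standard augmented achievement scalarizing function (Wierzbicki),
% written for objectives f_i to be maximized toward reference point q.
\[
  s\bigl(\mathbf{f}(x);\,\mathbf{q},\,\boldsymbol{\mu}\bigr)
  = \max_{i=1,\dots,k} \mu_i \bigl(q_i - f_i(x)\bigr)
  + \rho \sum_{i=1}^{k} \mu_i \bigl(q_i - f_i(x)\bigr)
\]
% Minimizing s over the feasible set projects q onto the Pareto front;
% varying the weight vector mu, as WASF-GA does across its population,
% yields different Pareto optimal solutions.
```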


Author(s):  
Javier Conejero ◽  
Sandra Corella ◽  
Rosa M Badia ◽  
Jesus Labarta

Task-based programming has proven to be a suitable model for high-performance computing (HPC) applications. Different implementations have been good demonstrators of this fact and have promoted the acceptance of task-based programming in the OpenMP standard. Furthermore, in recent years, Apache Spark has gained wide popularity in business and research environments as a programming model for addressing emerging big data problems. COMP Superscalar (COMPSs) is a task-based environment that tackles distributed computing (including Clouds) and is a good alternative task-based programming model for big data applications. This article describes why we consider task-based programming models a good approach for big data applications. It includes a comparison of Spark and COMPSs in terms of architecture, programming model, and performance, focusing on the structural differences between the two frameworks, on their programmability interfaces, and on their efficiency by means of three widely known benchmarking kernels: Wordcount, Kmeans, and Terasort. These kernels enable the evaluation of the most important functionalities of both programming models and the analysis of different workflows and conditions. The main results of this comparison are that (1) COMPSs is able to extract the inherent parallelism from the user code with minimal coding effort, as opposed to Spark, which requires existing algorithms to be adapted and rewritten by explicitly using its predefined functions, (2) COMPSs offers better performance than Spark, and (3) COMPSs has shown to scale better than Spark in most cases. Finally, we discuss the advantages and disadvantages of both frameworks, highlighting the differences that make each unique, thereby helping to choose the right framework for each particular objective.
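
To illustrate the task-based model the abstract refers to, here is a minimal C/OpenMP tasking sketch (OpenMP is the standard mentioned above; COMPSs expresses the same dataflow idea via annotations on sequential code): the runtime derives the task graph from the declared dependences and extracts the parallelism.

```c
/* Minimal sketch of task-based programming with data dependencies in
   OpenMP.  The runtime builds a task graph from the depend clauses:
   the two producer tasks may run in parallel, and the consumer task
   starts only once both have finished. */
#include <stdio.h>

int main(void) {
    int a = 0, b = 0, c = 0;

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: a)
        a = 1;                         /* producer task */

        #pragma omp task depend(out: b)
        b = 2;                         /* independent: may run in parallel */

        #pragma omp task depend(in: a, b) depend(out: c)
        c = a + b;                     /* waits for both producers */

        #pragma omp taskwait
        printf("c = %d\n", c);         /* prints c = 3 */
    }
    return 0;
}
```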

