Parallel Execution of Devs in Shared-memory Multicore Architectures

Parallel Gaussian elimination of symmetric positive definite band matrices for shared-memory multicore architectures

RAIRO - Operations Research ◽

10.1051/ro/2020013 ◽

2020 ◽

Author(s):

Sirine Marrakchi ◽

Mohamed Jemni

Keyword(s):

Shared Memory ◽

Gaussian Elimination ◽

Positive Definite ◽

Parallel Execution ◽

Optimal Time ◽

Multicore Architectures ◽

Start Time ◽

Band Matrices ◽

Symmetric Positive Definite ◽

High Degree

This study presents a new parallel Gaussian elimination approach for symmetric positive definite band systems. For each task, the appropriate start time and adequate processor are determined. Unnecessary dependencies between tasks are eliminated. Simultaneously, all processors perform their associated tasks with precedence constraints under consideration. Our main goal is to obtain a high degree of parallelism by balancing the load of processors and reducing the total idle and parallel execution times. The theoretical lower bounds for parallel execution time and number of processors required to execute the precedence graph at an optimal time are also computed. The validity of our investigation is confirmed by carrying out several experiments on a shared-memory multicore architecture using OpenMP. Practical results prove the efficiency of the proposed method.

Download Full-text

Parallel Nonnegative Matrix Factorization via Newton Iteration

Parallel Processing Letters ◽

10.1142/s0129626416500146 ◽

2016 ◽

Vol 26 (03) ◽

pp. 1650014 ◽

Cited By ~ 3

Author(s):

Markus Flatz ◽

Marián Vajteršic

Keyword(s):

Shared Memory ◽

Matrix Factorization ◽

Message Passing ◽

Nonnegative Matrix Factorization ◽

Nonnegative Matrix ◽

Newton Iteration ◽

Parallel Execution ◽

Kkt Conditions ◽

Nonnegative Matrices ◽

First Order

The goal of Nonnegative Matrix Factorization (NMF) is to represent a large nonnegative matrix in an approximate way as a product of two significantly smaller nonnegative matrices. This paper shows in detail how an NMF algorithm based on Newton iteration can be derived using the general Karush-Kuhn-Tucker (KKT) conditions for first-order optimality. This algorithm is suited for parallel execution on systems with shared memory and also with message passing. Both versions were implemented and tested, delivering satisfactory speedup results.

Download Full-text

Energy Efficiency Evaluation of Parallel Execution of DEVS Models in Multicore Architectures

2020 Winter Simulation Conference (WSC) ◽

10.1109/wsc48552.2020.9384117 ◽

2020 ◽

Author(s):

Guillermo G. Trabes ◽

Veronica Gil Costa ◽

Gabriel A. Wainer

Keyword(s):

Energy Efficiency ◽

Parallel Execution ◽

Efficiency Evaluation ◽

Multicore Architectures ◽

Energy Efficiency Evaluation

Download Full-text

New safe reliable design methodologies examined by fault injection testing and Monte Carlo simulation: tolerating shared-memory interferences in multicore architectures

International Journal of Embedded Systems ◽

10.1504/ijes.2021.117956 ◽

2021 ◽

Vol 14 (4) ◽

pp. 409

Author(s):

Abdullah El Bayoumi

Keyword(s):

Monte Carlo Simulation ◽

Monte Carlo ◽

Shared Memory ◽

Fault Injection ◽

Multicore Architectures ◽

Design Methodologies ◽

Reliable Design

Download Full-text

Parallelization of a Commercial Streamline Simulator and Performance on Practical Models

SPE Reservoir Evaluation & Engineering ◽

10.2118/118684-pa ◽

2010 ◽

Vol 13 (03) ◽

pp. 383-390 ◽

Cited By ~ 5

Author(s):

R.P.. P. Batycky ◽

M.. Förster ◽

M.R.. R. Thiele ◽

K.. Stüben

Keyword(s):

Large Scale ◽

Programming Model ◽

Scaling Law ◽

Independent Solution ◽

Parallel Execution ◽

Water Model ◽

Test Machine ◽

Multicore Architectures ◽

Streamline Simulation ◽

Run Time

Summary We present the parallelization of a commercial streamline simulator to multicore architectures based on the OpenMP programming model and its performance on various field examples. This work is a continuation of recent work by Gerritsen et al. (2009) in which a research streamline simulator was extended to parallel execution. We identified that the streamline-transport step represents approximately 40-80% of the total run time. It is exactly this step that is straightforward to parallelize owing to the independent solution of each streamline that is at the heart of streamline simulation. Because we are working with an existing large serial code, we used specialty software to quickly and easily identify variables that required particular handling for implementing the parallel extension. Minimal rewrite to existing code was required to extend the streamline-transport step to OpenMP. As part of this work, we also parallelized additional run-time code, including the gravity-line solver and some simple routines required for constructing the pressure matrix. Overall, the run-time fraction of code parallelized ranged from 0.50 to 0.83, depending on the transport physics being considered. We tested our parallel simulator on a variety of large models including SPE 10, Forties-a UK oil/water model, Judy Creek-a Canadian waterflood/water-alternating-gas (WAG) model, and a South American black-oil model. We noted overall speedup factors from 1.8 to 3.3x for eight threads. In terms of real time, this implies that large-scale streamline simulation models as tested here can be simulated in less than 4 hours. We found speedup results to be reasonable when compared with Amdahl's ideal scaling law. Beyond eight threads, we observed minimal speedups because of memory bandwidth limits on our test machine.

Download Full-text

Shared memory multiprocessor support for functional array processing in SAC

Journal of Functional Programming ◽

10.1017/s0956796805005538 ◽

2005 ◽

Vol 15 (3) ◽

pp. 353-401 ◽

Cited By ~ 29

Author(s):

CLEMENS GRELCK

Keyword(s):

Shared Memory ◽

Array Processing ◽

Numerical Data ◽

Parallel Execution ◽

Real Performance ◽

Execution Model ◽

Series Of Experiments ◽

High Level ◽

Performance Gains ◽

The Impact

Classical application domains of parallel computing are dominated by processing large arrays of numerical data. Whereas most functional languages focus on lists and trees rather than on arrays, SAC is tailor-made in design and in implementation for efficient high-level array processing. Advanced compiler optimizations yield performance levels that are often competitive with low-level imperative implementations. Based on SAC, we develop compilation techniques and runtime system support for the compiler-directed parallel execution of high-level functional array processing code on shared memory architectures. Competitive sequential performance gives us the opportunity to exploit the conceptual advantages of the functional paradigm for achieving real performance gains with respect to existing imperative implementations, not only in comparison with uniprocessor runtimes. While the design of SAC facilitates parallelization, the particular challenge of high sequential performance is that realization of satisfying speedups through parallelization becomes substantially more difficult. We present an initial compilation scheme and multi-threaded execution model, which we step-wise refine to reduce organizational overhead and to improve parallel performance. We close with a detailed analysis of the impact of certain design decisions on runtime performance, based on a series of experiments.

Download Full-text

Comparative Evaluation and Case Studies of Shared-Memory and Data-Parallel Execution Patterns

Scientific Programming ◽

10.1155/1999/468372 ◽

1999 ◽

Vol 7 (1) ◽

pp. 1-19

Author(s):

Xiaodong Zhang ◽

Lin Sun

Keyword(s):

Linear System ◽

Shared Memory ◽

Interconnection Networks ◽

Parallel Execution ◽

Parallel Model ◽

Scientific Applications ◽

Data Parallel ◽

High Level ◽

Access Patterns ◽

Structured Program

Shared‐memory and data‐parallel programming models are two important paradigms for scientific applications. Both models provide high‐level program abstractions, and simple and uniform views of network structures. The common features of the two models significantly simplify program coding and debugging for scientific applications. However, the underlining execution and overhead patterns are significantly different between the two models due to their programming constraints, and due to different and complex structures of interconnection networks and systems which support the two models. We performed this experimental study to present implications and comparisons of execution patterns on two commercial architectures. We implemented a standard electromagnetic simulation program (EM) and a linear system solver using the shared‐memory model on the KSR‐1 and the data‐parallel model on the CM‐5. Our objectives are to examine the execution pattern changes required for an implementation transformation between the two models; to study memory access patterns; to address scalability issues; and to investigate relative costs and advantages/disadvantages of using the two models for scientific computations. Our results indicate that the EM program tends to become computation‐intensive in the KSR‐1 shared‐memory system, and memory‐demanding in the CM‐5 data‐parallel system when the systems and the problems are scaled. The EM program, a highly data‐parallel program performed extremely well, and the linear system solver, a highly control‐structured program suffered significantly in the data‐parallel model on the CM‐5. Our study provides further evidence that matching execution patterns of algorithms to parallel architectures would achieve better performance.

Download Full-text

The Parallel Tiled WZ Factorization Algorithm for Multicore Architectures

International Journal of Applied Mathematics and Computer Science ◽

10.2478/amcs-2019-0030 ◽

2019 ◽

Vol 29 (2) ◽

pp. 407-419

Author(s):

Beata Bylina ◽

Jarosław Bylina

Keyword(s):

Shared Memory ◽

Linear Algebra ◽

Multicore Architectures ◽

Numerical Accuracy ◽

Factorization Algorithm ◽

Computational Performance ◽

Parallel Implementations ◽

Diagonally Dominant Matrices ◽

Diagonally Dominant ◽

Level Parallelism

Abstract The aim of this paper is to investigate dense linear algebra algorithms on shared memory multicore architectures. The design and implementation of a parallel tiled WZ factorization algorithm which can fully exploit such architectures are presented. Three parallel implementations of the algorithm are studied. The first one relies only on exploiting multithreaded BLAS (basic linear algebra subprograms) operations. The second implementation, except for BLAS operations, employs the OpenMP standard to use the loop-level parallelism. The third implementation, except for BLAS operations, employs the OpenMP task directive with the depend clause. We report the computational performance and the speedup of the parallel tiled WZ factorization algorithm on shared memory multicore architectures for dense square diagonally dominant matrices. Then we compare our parallel implementations with the respective LU factorization from a vendor implemented LAPACK library. We also analyze the numerical accuracy. Two of our implementations can be achieved with near maximal theoretical speedup implied by Amdahl’s law.

Download Full-text

SAC — FROM HIGH-LEVEL PROGRAMMING WITH ARRAYS TO EFFICIENT PARALLEL EXECUTION

Parallel Processing Letters ◽

10.1142/s0129626403001379 ◽

2003 ◽

Vol 13 (03) ◽

pp. 401-412 ◽

Cited By ~ 15

Author(s):

CLEMENS GRELCK ◽

SVEN-BODO SCHOLZ

Keyword(s):

Shared Memory ◽

Parallel Execution ◽

3 Dimensional ◽

Shape Invariant ◽

Fixed Set ◽

High Level ◽

Successive Over Relaxation ◽

Compilation Techniques ◽

Processing Language

SAC is a purely functional array processing language designed with numerical applications in mind. It supports generic, high-level program specifications in the style of APL. However, rather than providing a fixed set of built-in array operations, SAC provides means to specify such operations in the language itself in a way that still allows their application to arrays of any rank and size. This paper illustrates the major steps in compiling generic, rank- and shape-invariant SAC specifications into efficiently executable multithreaded code for parallel execution on shared memory multiprocessors. The effectiveness of the compilation techniques is demonstrated by means of a small case study on the PDE1 benchmark, which implements 3-dimensional red/black successive over-relaxation. Comparisons with HPF and ZPL show that despite the genericity of code, SAC achieves highly competitive runtime performance characteristics.

Download Full-text

A Formal Model of Parallel Execution on Multicore Architectures with Multilevel Caches

Formal Aspects of Component Software - Lecture Notes in Computer Science ◽

10.1007/978-3-319-68034-7_4 ◽

2017 ◽

pp. 58-77 ◽

Cited By ~ 2

Author(s):

Shiji Bijo ◽

Einar Broch Johnsen ◽

Ka I Pun ◽

Silvia Lizeth Tapia Tarifa

Keyword(s):

Formal Model ◽

Parallel Execution ◽

Multicore Architectures

Download Full-text