A Formal Model of Parallel Execution on Multicore Architectures with Multilevel Caches

Author(s):  
Shiji Bijo ◽  
Einar Broch Johnsen ◽  
Ka I Pun ◽  
Silvia Lizeth Tapia Tarifa
2010 ◽  
Vol 13 (03) ◽  
pp. 383-390 ◽  
Author(s):  
R.P.. P. Batycky ◽  
M.. Förster ◽  
M.R.. R. Thiele ◽  
K.. Stüben

Summary We present the parallelization of a commercial streamline simulator to multicore architectures based on the OpenMP programming model and its performance on various field examples. This work is a continuation of recent work by Gerritsen et al. (2009) in which a research streamline simulator was extended to parallel execution. We identified that the streamline-transport step represents approximately 40-80% of the total run time. It is exactly this step that is straightforward to parallelize owing to the independent solution of each streamline that is at the heart of streamline simulation. Because we are working with an existing large serial code, we used specialty software to quickly and easily identify variables that required particular handling for implementing the parallel extension. Minimal rewrite to existing code was required to extend the streamline-transport step to OpenMP. As part of this work, we also parallelized additional run-time code, including the gravity-line solver and some simple routines required for constructing the pressure matrix. Overall, the run-time fraction of code parallelized ranged from 0.50 to 0.83, depending on the transport physics being considered. We tested our parallel simulator on a variety of large models including SPE 10, Forties-a UK oil/water model, Judy Creek-a Canadian waterflood/water-alternating-gas (WAG) model, and a South American black-oil model. We noted overall speedup factors from 1.8 to 3.3x for eight threads. In terms of real time, this implies that large-scale streamline simulation models as tested here can be simulated in less than 4 hours. We found speedup results to be reasonable when compared with Amdahl's ideal scaling law. Beyond eight threads, we observed minimal speedups because of memory bandwidth limits on our test machine.


2019 ◽  
Vol 179 ◽  
pp. 24-53 ◽  
Author(s):  
Shiji Bijo ◽  
Einar Broch Johnsen ◽  
Ka I Pun ◽  
S. Lizeth Tapia Tarifa

Author(s):  
Sirine Marrakchi ◽  
Mohamed Jemni

This study presents a new parallel Gaussian elimination approach for symmetric positive definite band systems. For each task, the appropriate start time and adequate processor are determined. Unnecessary dependencies between tasks are eliminated. Simultaneously, all processors perform their associated tasks with precedence constraints under consideration. Our main goal is to obtain a high degree of parallelism by balancing the load of processors and reducing the total idle and parallel execution times. The theoretical lower bounds for parallel execution time and number of processors required to execute the precedence graph at an optimal time are also computed. The validity of our investigation is confirmed by carrying out several experiments on a shared-memory multicore architecture using OpenMP. Practical results prove the efficiency of the proposed method.


2021 ◽  
Vol 179 (1) ◽  
pp. 35-58
Author(s):  
Sirine Marrakchi ◽  
Mohamed Jemni

A new approach for solving triangular band linear systems is established in this study to balance the load and obtain a high degree of parallelism. Our investigation consists to attribute both adequate start time and processor to each task and eliminate the useless dependencies which are not used in the parallel solve stage. Thereby, processors execute in parallel their related tasks taking account of the considered precedence constraints. The theoretical lower bounds for parallel execution time and the number of processors required to carry out the task graph in the shortest time are determined. Experimentations are realized on a shared-memory multicore processor. The experimental results are fitted to the values derived from the determined mathematical formulas. The comparison of results obtained by our contribution with those from triangular systems resolution routine belonging to the library PLASMA, Parallel Linear Algebra Software for Multicore Architectures, confirms the efficiency of the proposed approach.


Sign in / Sign up

Export Citation Format

Share Document