Static Scheduling with Load Balancing for Solving Triangular Band Linear Systems on Multicore Processors

2021 ◽  
Vol 179 (1) ◽  
pp. 35-58
Author(s):  
Sirine Marrakchi ◽  
Mohamed Jemni

A new approach for solving triangular band linear systems is established in this study to balance the load and obtain a high degree of parallelism. Our method assigns both an adequate start time and a processor to each task and eliminates the useless dependencies that play no role in the parallel solve stage. The processors then execute their assigned tasks in parallel while respecting the remaining precedence constraints. Theoretical lower bounds on the parallel execution time and on the number of processors required to carry out the task graph in the shortest time are determined. Experiments are carried out on a shared-memory multicore processor, and the measured results match the values derived from the mathematical formulas. A comparison of our results with those of the triangular-system solve routine of the PLASMA library (Parallel Linear Algebra Software for Multicore Architectures) confirms the efficiency of the proposed approach.
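
For orientation, the serial kernel being scheduled can be sketched as plain forward substitution on a lower triangular band matrix. The sketch below (dense row-major storage, semi-bandwidth b, all names ours) only shows the data dependencies that the proposed static schedule has to respect; it is not the schedule itself.

```c
#include <stddef.h>

/* Forward substitution for a lower triangular band system L*x = rhs.
 * L is n-by-n, dense row-major storage, with semi-bandwidth b:
 * L[i][j] == 0 whenever j < i - b or j > i.  Illustrative sketch only. */
void band_forward_subst(size_t n, size_t b, const double *L, double *x,
                        const double *rhs)
{
    for (size_t i = 0; i < n; ++i) {
        double s = rhs[i];
        size_t j0 = (i > b) ? i - b : 0;      /* first possibly nonzero column */
        for (size_t j = j0; j < i; ++j)
            s -= L[i * n + j] * x[j];         /* subtract known contributions  */
        x[i] = s / L[i * n + i];              /* diagonal assumed nonzero      */
    }
}
```

Each unknown x[i] depends on at most b previously computed unknowns, which is what allows the solve to be expressed as a task graph with limited precedence constraints.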

Author(s):  
Sirine Marrakchi ◽  
Mohamed Jemni

This study presents a new parallel Gaussian elimination approach for symmetric positive definite band systems. For each task, an appropriate start time and an adequate processor are determined, and unnecessary dependencies between tasks are eliminated. All processors then perform their associated tasks simultaneously while respecting the precedence constraints. Our main goal is to obtain a high degree of parallelism by balancing the processor load and reducing both the total idle time and the parallel execution time. The theoretical lower bounds on the parallel execution time and on the number of processors required to execute the precedence graph in optimal time are also computed. The validity of our investigation is confirmed by several experiments on a shared-memory multicore architecture using OpenMP, and the practical results prove the efficiency of the proposed method.
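
As a rough illustration of where the parallelism comes from, the sketch below applies banded Gaussian elimination without pivoting (safe for symmetric positive definite matrices) and distributes the independent row updates of each elimination step across OpenMP threads. This is a naive fork-join version under our own storage assumptions, not the load-balanced static schedule described in the abstract.

```c
#include <stddef.h>
#include <omp.h>

/* Gaussian elimination (no pivoting) on a symmetric positive definite band
 * matrix A (n-by-n, dense row-major, semi-bandwidth b).  Positive
 * definiteness guarantees nonzero pivots.  The row updates within one
 * elimination step are independent and are shared among threads. */
void band_gauss_eliminate(size_t n, size_t b, double *A)
{
    for (size_t k = 0; k < n; ++k) {
        size_t last = (k + b < n) ? k + b : n - 1;   /* band limits the fill   */
        #pragma omp parallel for schedule(static)
        for (size_t i = k + 1; i <= last; ++i) {
            double m = A[i * n + k] / A[k * n + k];  /* multiplier for row i   */
            for (size_t j = k; j <= last; ++j)
                A[i * n + j] -= m * A[k * n + j];    /* update within the band */
        }
    }
}
```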


2012 ◽  
Vol 198-199 ◽  
pp. 523-527
Author(s):  
Fang Yuan Chen ◽  
Dong Song Zhang ◽  
Zhi Ying Wang

Worst-Case Execution Time (WCET) analysis is crucial in real-time systems and very challenging on multicore processors because of the runtime inter-thread interference that shared resources can cause. This paper proposes a novel approach to analyzing runtime inter-core interference for consecutive or non-consecutive concurrent programs. The approach gives a reasonable estimate of runtime inter-core interference in the shared cache by introducing lifetime analysis and instruction-fetch timing relations into the address-mapping method. Compared with the method based on lifetime alone, the proposed approach effectively tightens the WCET estimate.
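
The flavour of such an analysis can be sketched with a toy model: map each accessed block to its shared-cache set and count conflicting pairs only when the tasks' lifetimes overlap. The cache parameters, data structures, and the simple interval-overlap test standing in for the paper's lifetime and instruction-fetch timing analysis are all illustrative assumptions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 64u     /* assumed cache line size */
#define NUM_SETS   256u    /* assumed number of sets  */

/* A memory block accessed by a task, with the interval during which the
 * task can be live (a crude stand-in for lifetime analysis). */
struct access {
    uint64_t addr;
    uint64_t live_start, live_end;   /* abstract time units */
};

static unsigned set_index(uint64_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % NUM_SETS);
}

static bool lifetimes_overlap(const struct access *a, const struct access *b)
{
    return a->live_start <= b->live_end && b->live_start <= a->live_end;
}

/* Count pairs of accesses from two co-running tasks that map to the same
 * cache set while their lifetimes overlap: only such pairs can evict each
 * other, so only they contribute to the interference bound. */
static unsigned count_conflicts(const struct access *t1, size_t n1,
                                const struct access *t2, size_t n2)
{
    unsigned conflicts = 0;
    for (size_t i = 0; i < n1; ++i)
        for (size_t j = 0; j < n2; ++j)
            if (set_index(t1[i].addr) == set_index(t2[j].addr) &&
                lifetimes_overlap(&t1[i], &t2[j]))
                ++conflicts;
    return conflicts;
}

int main(void)
{
    struct access a[] = { { 0x1000, 0, 10 }, { 0x5000, 20, 30 } };
    struct access b[] = { { 0x9000, 5,  8 } };   /* same set, overlaps a[0] only */
    printf("conflicting pairs: %u\n", count_conflicts(a, 2, b, 1));
    return 0;
}
```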


2010 ◽  
Vol 13 (03) ◽  
pp. 383-390 ◽  
Author(s):  
R.P. Batycky ◽
M. Förster ◽
M.R. Thiele ◽
K. Stüben

Summary: We present the parallelization of a commercial streamline simulator for multicore architectures based on the OpenMP programming model and report its performance on various field examples. This work continues recent work by Gerritsen et al. (2009), in which a research streamline simulator was extended to parallel execution. We identified that the streamline-transport step represents approximately 40-80% of the total run time. It is exactly this step that is straightforward to parallelize, owing to the independent solution of each streamline that is at the heart of streamline simulation. Because we are working with an existing large serial code, we used specialty software to quickly and easily identify variables that required particular handling when implementing the parallel extension. Minimal rewriting of the existing code was required to move the streamline-transport step to OpenMP. As part of this work, we also parallelized additional run-time code, including the gravity-line solver and some simple routines required for constructing the pressure matrix. Overall, the run-time fraction of code parallelized ranged from 0.50 to 0.83, depending on the transport physics being considered. We tested our parallel simulator on a variety of large models, including SPE 10; Forties, a UK oil/water model; Judy Creek, a Canadian waterflood/water-alternating-gas (WAG) model; and a South American black-oil model. We observed overall speedup factors of 1.8 to 3.3x for eight threads. In terms of real time, this implies that large-scale streamline-simulation models such as those tested here can be simulated in less than 4 hours. We found the speedup results to be reasonable when compared with Amdahl's ideal scaling law. Beyond eight threads, we observed minimal additional speedup because of memory-bandwidth limits on our test machine.
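
The reported figures can be sanity-checked against Amdahl's law using the parallel fractions quoted above; the short sketch below (function name ours) evaluates the ideal bound for eight threads.

```c
#include <stdio.h>

/* Amdahl's law: ideal speedup with parallel fraction p on n threads. */
static double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    /* Parallel fractions quoted in the abstract, evaluated for 8 threads. */
    printf("p = 0.50 -> %.2fx\n", amdahl_speedup(0.50, 8));  /* ~1.78x */
    printf("p = 0.83 -> %.2fx\n", amdahl_speedup(0.83, 8));  /* ~3.65x */
    return 0;
}
```

The resulting ideal bounds of about 1.78x (p = 0.50) and 3.65x (p = 0.83) are consistent with the measured 1.8 to 3.3x speedups being called reasonable relative to Amdahl's law.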


Author(s):  
Ram Prasad Mohanty ◽  
Ashok Kumar Turuk ◽  
Bibhudatta Sahoo

The growing number of cores increases the demand for a powerful memory subsystem, which leads to larger caches in multicore processors. Caches give the processing elements a faster, higher-bandwidth local memory to work with. In this chapter, an attempt has been made to analyze the impact of cache size on the performance of multicore processors by varying the L1 and L2 cache sizes on a multicore processor with an internal network (MPIN), referenced from the Niagara architecture. As the number of cores increases, traditional on-chip interconnects such as the bus and the crossbar prove inefficient and suffer from poor scalability. To overcome the scalability and efficiency issues of these conventional interconnects, a ring-based design has been proposed. The effect of the interconnect on the performance of multicore processors has been analyzed, and a novel scalable on-chip interconnection mechanism (INoC) for multicore processors has been proposed. Benchmark results obtained with a full-system simulator show that, with the proposed INoC, execution time is significantly reduced compared with the MPIN.
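
One standard way to see why a ring scales better than a shared bus, independent of the specific INoC proposed in the chapter, is the average hop distance between cores, which grows only as roughly n/4 on a bidirectional ring while bus contention grows with every added core. A toy calculation under our own assumptions:

```c
#include <stdio.h>

/* Average shortest-path hop distance between two distinct nodes on a
 * bidirectional ring of n nodes.  A toy model, not the chapter's INoC. */
static double ring_avg_hops(int n)
{
    long total = 0;
    int pairs = 0;
    for (int s = 0; s < n; ++s)
        for (int d = 0; d < n; ++d) {
            if (s == d) continue;
            int fwd = (d - s + n) % n;          /* clockwise distance   */
            int bwd = n - fwd;                  /* counter-clockwise    */
            total += (fwd < bwd) ? fwd : bwd;   /* shortest of the two  */
            ++pairs;
        }
    return (double)total / pairs;
}

int main(void)
{
    for (int n = 4; n <= 64; n *= 2)
        printf("%2d cores: average %.2f hops on a ring\n", n, ring_avg_hops(n));
    return 0;
}
```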


2017 ◽  
Vol 59 (5) ◽  
Author(s):  
Sebastian Tobuschat ◽  
Adam Kostrzewa ◽  
Falco K. Bapp ◽  
Christoph Dropmann

Abstract: Using multicore processors in safety-critical systems is a challenge as well as an opportunity. The true parallelism, which may affect synchronization and determinism, poses a safety challenge because new kinds of interference can arise. At the same time, multicore systems make redundant software execution possible. In complex multicore architectures one of the most important challenges is to know the system behavior, and the recognition of any deviation from normal system behavior has to be guaranteed. For such cases it is necessary to monitor several states of the system, its configuration, its timing, and so on. Monitoring such a complex system requires evaluating a great deal of information from inside the system without affecting the rest of the MPSoC.
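
A minimal form of the monitoring the authors call for is a timing watchdog that flags observations outside an expected envelope. The sketch below is purely illustrative: the thresholds, structure names, and the way samples are obtained are our own assumptions, not the article's mechanism.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Expected timing envelope for one monitored activity on the MPSoC. */
struct timing_monitor {
    uint64_t min_cycles;   /* fastest plausible execution */
    uint64_t max_cycles;   /* slowest tolerated execution */
};

/* Flag any observation that leaves the expected envelope.  A real monitor
 * would also watch configuration registers, bus traffic, etc., and would
 * read observations from hardware counters rather than a parameter. */
static bool check_observation(const struct timing_monitor *m, uint64_t observed)
{
    return observed >= m->min_cycles && observed <= m->max_cycles;
}

int main(void)
{
    struct timing_monitor task_a = { 1000, 2500 };
    uint64_t samples[] = { 1200, 2400, 3100 };   /* last one deviates */
    for (int i = 0; i < 3; ++i)
        if (!check_observation(&task_a, samples[i]))
            printf("sample %d (%llu cycles) outside expected behavior\n",
                   i, (unsigned long long)samples[i]);
    return 0;
}
```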


Author(s):  
Medhat Awadalla ◽  
Hanan Konsowa

Multicore processors integrate several cores on a single chip. The fixed architecture of multicore platforms often fails to accommodate the inherently diverse requirements of different applications, and the permanent need to enhance the performance of multicore architectures motivates the development of a dynamic architecture. To address this issue, this paper presents new algorithms for thread selection in the fetch stage. In particular, it presents three new fetch-stage policies, EACH_LOOP_FETCH, INC-FETCH, and WZ-FETCH, based on the Ordinary Least Squares (OLS) regression method. These fetch policies differ in the thread-selection time, which is represented by the instruction count and the window size. Furthermore, the multicore simulation tool multi2sim is adapted to cope with a dynamic multicore processor design by adding a dynamic feature to the thread-selection policy in the fetch stage. SPLASH2, a suite of parallel scientific workloads, has been used to validate the proposed adaptation of multi2sim. Intensive simulation experiments show that remarkable performance enhancements are achieved in terms of execution time and instructions per second, with fewer broadcast operations than the typical algorithm.
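
To illustrate how an OLS fit could drive fetch-stage thread selection in general, the sketch below fits a least-squares line to each thread's recent IPC samples and fetches from the thread with the best extrapolated IPC. It is not the EACH_LOOP_FETCH, INC-FETCH, or WZ-FETCH policy; all names and window sizes are our own assumptions.

```c
#include <stddef.h>
#include <stdio.h>

/* Ordinary least squares fit y = a + b*x over x = 0..n-1. */
static void ols_fit(const double *y, size_t n, double *a, double *b)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (size_t i = 0; i < n; ++i) {
        double x = (double)i;
        sx += x; sy += y[i]; sxx += x * x; sxy += x * y[i];
    }
    *b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    *a = (sy - *b * sx) / n;
}

/* Pick the thread whose OLS-extrapolated IPC for the next window is highest.
 * ipc[t] holds the last n IPC samples of thread t.  Illustrative only. */
static size_t select_thread(const double ipc[][4], size_t threads, size_t n)
{
    size_t best = 0;
    double best_pred = -1.0;
    for (size_t t = 0; t < threads; ++t) {
        double a, b;
        ols_fit(ipc[t], n, &a, &b);
        double pred = a + b * (double)n;       /* extrapolate one step ahead */
        if (pred > best_pred) { best_pred = pred; best = t; }
    }
    return best;
}

int main(void)
{
    const double ipc[2][4] = { { 1.1, 1.0, 0.9, 0.8 },    /* declining thread */
                               { 0.7, 0.8, 0.9, 1.0 } };  /* improving thread */
    printf("fetch from thread %zu next window\n", select_thread(ipc, 2, 4));
    return 0;
}
```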


1968 ◽  
Vol 61 (4) ◽  
pp. 399-402
Author(s):  
Jack M. Elkin

The binomial coefficients are an almost endless source of formulas for the summation of series. A reference to the “Problems for Solution” pages of the American Mathematical Monthly or to an advanced collection of mathematical formulas will convince anyone who has not yet discovered this for himself. Some of these series summations can be derived with relative ease with the help of the binomial theorem or Pascal's triangle; many require a high degree of virtuosity in algebraic manipulation and, often, advanced methods of analysis. A number of them can be obtained simply by reasoning logically about the meaning of certain combinatorial expressions, with recourse to only a minimum of algebra or to none at all. These, naturally, have a special appeal of their own, and it is the purpose of this article to illustrate several such derivations.
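
One identity of exactly this kind, provable by counting alone (chosen here for illustration; it is not claimed to be among the article's own examples), is the hockey-stick identity:

```latex
% Hockey-stick identity: provable by a counting argument, no algebra needed.
\[
  \sum_{i=r}^{n} \binom{i}{r} \;=\; \binom{n+1}{r+1}
\]
```

To choose r + 1 numbers from {1, 2, ..., n + 1}, classify each selection by its largest element: if the largest element is i + 1 with r ≤ i ≤ n, the remaining r numbers must come from {1, ..., i}, which can be done in C(i, r) ways, and summing over i counts every selection exactly once.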

