Static Scheduling with Load Balancing for Solving Triangular Band Linear Systems on Multicore Processors

2021 ◽  
Vol 179 (1) ◽  
pp. 35-58
Author(s):  
Sirine Marrakchi ◽  
Mohamed Jemni

A new approach for solving triangular band linear systems is established in this study to balance the load and obtain a high degree of parallelism. Our method assigns both an adequate start time and a processor to each task and eliminates the useless dependencies that play no role in the parallel solve stage. The processors then execute their assigned tasks in parallel while respecting the remaining precedence constraints. Theoretical lower bounds on the parallel execution time and on the number of processors required to carry out the task graph in the shortest time are determined. Experiments are carried out on a shared-memory multicore processor, and the measured results match the values derived from the mathematical formulas. A comparison of our results with those of the triangular-system solve routine of the PLASMA library (Parallel Linear Algebra Software for Multicore Architectures) confirms the efficiency of the proposed approach.
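
For orientation, the serial kernel being scheduled can be sketched as plain forward substitution on a lower triangular band matrix. The sketch below (dense row-major storage, semi-bandwidth b, all names ours) only shows the data dependencies that the proposed static schedule has to respect; it is not the schedule itself.

```c
#include <stddef.h>

/* Forward substitution for a lower triangular band system L*x = rhs.
 * L is n-by-n, dense row-major storage, with semi-bandwidth b:
 * L[i][j] == 0 whenever j < i - b or j > i.  Illustrative sketch only. */
void band_forward_subst(size_t n, size_t b, const double *L, double *x,
                        const double *rhs)
{
    for (size_t i = 0; i < n; ++i) {
        double s = rhs[i];
        size_t j0 = (i > b) ? i - b : 0;      /* first possibly nonzero column */
        for (size_t j = j0; j < i; ++j)
            s -= L[i * n + j] * x[j];         /* subtract known contributions  */
        x[i] = s / L[i * n + i];              /* diagonal assumed nonzero      */
    }
}
```

Each unknown x[i] depends on at most b previously computed unknowns, which is what allows the solve to be expressed as a task graph with limited precedence constraints.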

Author(s):  
Sirine Marrakchi ◽  
Mohamed Jemni

This study presents a new parallel Gaussian elimination approach for symmetric positive definite band systems. For each task, an appropriate start time and an adequate processor are determined, and unnecessary dependencies between tasks are eliminated. All processors then perform their associated tasks simultaneously while respecting the precedence constraints. Our main goal is to obtain a high degree of parallelism by balancing the processor load and reducing both the total idle time and the parallel execution time. The theoretical lower bounds on the parallel execution time and on the number of processors required to execute the precedence graph in optimal time are also computed. The validity of our investigation is confirmed by several experiments on a shared-memory multicore architecture using OpenMP, and the practical results prove the efficiency of the proposed method.
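
As a rough illustration of where the parallelism comes from, the sketch below applies banded Gaussian elimination without pivoting (safe for symmetric positive definite matrices) and distributes the independent row updates of each elimination step across OpenMP threads. This is a naive fork-join version under our own storage assumptions, not the load-balanced static schedule described in the abstract.

```c
#include <stddef.h>
#include <omp.h>

/* Gaussian elimination (no pivoting) on a symmetric positive definite band
 * matrix A (n-by-n, dense row-major, semi-bandwidth b).  Positive
 * definiteness guarantees nonzero pivots.  The row updates within one
 * elimination step are independent and are shared among threads. */
void band_gauss_eliminate(size_t n, size_t b, double *A)
{
    for (size_t k = 0; k < n; ++k) {
        size_t last = (k + b < n) ? k + b : n - 1;   /* band limits the fill   */
        #pragma omp parallel for schedule(static)
        for (size_t i = k + 1; i <= last; ++i) {
            double m = A[i * n + k] / A[k * n + k];  /* multiplier for row i   */
            for (size_t j = k; j <= last; ++j)
                A[i * n + j] -= m * A[k * n + j];    /* update within the band */
        }
    }
}
```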


2012 ◽  
Vol 198-199 ◽  
pp. 523-527
Author(s):  
Fang Yuan Chen ◽  
Dong Song Zhang ◽  
Zhi Ying Wang

Worst-Case Execution Time (WCET) analysis is crucial in real-time systems and very challenging on multicore processors because of the runtime inter-thread interference that shared resources can cause. This paper proposes a novel approach to analyzing runtime inter-core interference for consecutive or non-consecutive concurrent programs. The approach gives a reasonable estimate of runtime inter-core interference in the shared cache by introducing lifetime analysis and instruction-fetch timing relations into the address-mapping method. Compared with the method based on lifetime alone, the proposed approach effectively tightens the WCET estimate.
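
The flavour of such an analysis can be sketched with a toy model: map each accessed block to its shared-cache set and count conflicting pairs only when the tasks' lifetimes overlap. The cache parameters, data structures, and the simple interval-overlap test standing in for the paper's lifetime and instruction-fetch timing analysis are all illustrative assumptions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

#define LINE_BYTES 64u     /* assumed cache line size */
#define NUM_SETS   256u    /* assumed number of sets  */

/* A memory block accessed by a task, with the interval during which the
 * task can be live (a crude stand-in for lifetime analysis). */
struct access {
    uint64_t addr;
    uint64_t live_start, live_end;   /* abstract time units */
};

static unsigned set_index(uint64_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % NUM_SETS);
}

static bool lifetimes_overlap(const struct access *a, const struct access *b)
{
    return a->live_start <= b->live_end && b->live_start <= a->live_end;
}

/* Count pairs of accesses from two co-running tasks that map to the same
 * cache set while their lifetimes overlap: only such pairs can evict each
 * other, so only they contribute to the interference bound. */
static unsigned count_conflicts(const struct access *t1, size_t n1,
                                const struct access *t2, size_t n2)
{
    unsigned conflicts = 0;
    for (size_t i = 0; i < n1; ++i)
        for (size_t j = 0; j < n2; ++j)
            if (set_index(t1[i].addr) == set_index(t2[j].addr) &&
                lifetimes_overlap(&t1[i], &t2[j]))
                ++conflicts;
    return conflicts;
}

int main(void)
{
    struct access a[] = { { 0x1000, 0, 10 }, { 0x5000, 20, 30 } };
    struct access b[] = { { 0x9000, 5,  8 } };   /* same set, overlaps a[0] only */
    printf("conflicting pairs: %u\n", count_conflicts(a, 2, b, 1));
    return 0;
}
```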


2010 ◽  
Vol 13 (03) ◽  
pp. 383-390 ◽  
Author(s):  
R.P. Batycky ◽
M. Förster ◽
M.R. Thiele ◽
K. Stüben

Summary: We present the parallelization of a commercial streamline simulator for multicore architectures based on the OpenMP programming model and report its performance on various field examples. This work continues recent work by Gerritsen et al. (2009), in which a research streamline simulator was extended to parallel execution. We identified that the streamline-transport step represents approximately 40-80% of the total run time. It is exactly this step that is straightforward to parallelize, owing to the independent solution of each streamline that is at the heart of streamline simulation. Because we are working with an existing large serial code, we used specialty software to quickly and easily identify variables that required particular handling when implementing the parallel extension. Minimal rewriting of the existing code was required to move the streamline-transport step to OpenMP. As part of this work, we also parallelized additional run-time code, including the gravity-line solver and some simple routines required for constructing the pressure matrix. Overall, the run-time fraction of code parallelized ranged from 0.50 to 0.83, depending on the transport physics being considered. We tested our parallel simulator on a variety of large models, including SPE 10; Forties, a UK oil/water model; Judy Creek, a Canadian waterflood/water-alternating-gas (WAG) model; and a South American black-oil model. We observed overall speedup factors of 1.8 to 3.3x for eight threads. In terms of real time, this implies that large-scale streamline-simulation models such as those tested here can be simulated in less than 4 hours. We found the speedup results to be reasonable when compared with Amdahl's ideal scaling law. Beyond eight threads, we observed minimal additional speedup because of memory-bandwidth limits on our test machine.
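
The reported figures can be sanity-checked against Amdahl's law using the parallel fractions quoted above; the short sketch below (function name ours) evaluates the ideal bound for eight threads.

```c
#include <stdio.h>

/* Amdahl's law: ideal speedup with parallel fraction p on n threads. */
static double amdahl_speedup(double p, int n)
{
    return 1.0 / ((1.0 - p) + p / n);
}

int main(void)
{
    /* Parallel fractions quoted in the abstract, evaluated for 8 threads. */
    printf("p = 0.50 -> %.2fx\n", amdahl_speedup(0.50, 8));  /* ~1.78x */
    printf("p = 0.83 -> %.2fx\n", amdahl_speedup(0.83, 8));  /* ~3.65x */
    return 0;
}
```

The resulting ideal bounds of about 1.78x (p = 0.50) and 3.65x (p = 0.83) are consistent with the measured 1.8 to 3.3x speedups being called reasonable relative to Amdahl's law.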


Author(s):  
Ram Prasad Mohanty ◽  
Ashok Kumar Turuk ◽  
Bibhudatta Sahoo

The growing number of cores increases the demand for a powerful memory subsystem, which leads to larger caches in multicore processors. Caches give the processing elements a faster, higher-bandwidth local memory to work with. In this chapter, an attempt has been made to analyze the impact of cache size on the performance of multicore processors by varying the L1 and L2 cache sizes on a multicore processor with an internal network (MPIN), referenced from the Niagara architecture. As the number of cores increases, traditional on-chip interconnects such as the bus and the crossbar prove inefficient and suffer from poor scalability. To overcome the scalability and efficiency issues of these conventional interconnects, a ring-based design has been proposed. The effect of the interconnect on the performance of multicore processors has been analyzed, and a novel scalable on-chip interconnection mechanism (INoC) for multicore processors has been proposed. Benchmark results obtained with a full-system simulator show that, with the proposed INoC, execution time is significantly reduced compared with the MPIN.
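
One standard way to see why a ring scales better than a shared bus, independent of the specific INoC proposed in the chapter, is the average hop distance between cores, which grows only as roughly n/4 on a bidirectional ring while bus contention grows with every added core. A toy calculation under our own assumptions:

```c
#include <stdio.h>

/* Average shortest-path hop distance between two distinct nodes on a
 * bidirectional ring of n nodes.  A toy model, not the chapter's INoC. */
static double ring_avg_hops(int n)
{
    long total = 0;
    int pairs = 0;
    for (int s = 0; s < n; ++s)
        for (int d = 0; d < n; ++d) {
            if (s == d) continue;
            int fwd = (d - s + n) % n;          /* clockwise distance   */
            int bwd = n - fwd;                  /* counter-clockwise    */
            total += (fwd < bwd) ? fwd : bwd;   /* shortest of the two  */
            ++pairs;
        }
    return (double)total / pairs;
}

int main(void)
{
    for (int n = 4; n <= 64; n *= 2)
        printf("%2d cores: average %.2f hops on a ring\n", n, ring_avg_hops(n));
    return 0;
}
```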


2017 ◽  
Vol 59 (5) ◽  
Author(s):  
Sebastian Tobuschat ◽  
Adam Kostrzewa ◽  
Falco K. Bapp ◽  
Christoph Dropmann

Abstract: Using multicore processors in safety-critical systems is a challenge as well as an opportunity. The true parallelism, which may affect synchronization and determinism, poses a safety challenge because new kinds of interference can arise. At the same time, multicore systems make redundant software execution possible. In complex multicore architectures one of the most important challenges is to know the system behavior, and the recognition of any deviation from normal system behavior has to be guaranteed. For such cases it is necessary to monitor several states of the system, its configuration, its timing, and so on. Monitoring such a complex system requires evaluating a great deal of information from inside the system without affecting the rest of the MPSoC.
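
A minimal form of the monitoring the authors call for is a timing watchdog that flags observations outside an expected envelope. The sketch below is purely illustrative: the thresholds, structure names, and the way samples are obtained are our own assumptions, not the article's mechanism.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Expected timing envelope for one monitored activity on the MPSoC. */
struct timing_monitor {
    uint64_t min_cycles;   /* fastest plausible execution */
    uint64_t max_cycles;   /* slowest tolerated execution */
};

/* Flag any observation that leaves the expected envelope.  A real monitor
 * would also watch configuration registers, bus traffic, etc., and would
 * read observations from hardware counters rather than a parameter. */
static bool check_observation(const struct timing_monitor *m, uint64_t observed)
{
    return observed >= m->min_cycles && observed <= m->max_cycles;
}

int main(void)
{
    struct timing_monitor task_a = { 1000, 2500 };
    uint64_t samples[] = { 1200, 2400, 3100 };   /* last one deviates */
    for (int i = 0; i < 3; ++i)
        if (!check_observation(&task_a, samples[i]))
            printf("sample %d (%llu cycles) outside expected behavior\n",
                   i, (unsigned long long)samples[i]);
    return 0;
}
```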


Author(s):  
Medhat Awadalla ◽  
Hanan Konsowa

Multicore processors integrate several cores on a single chip. The fixed architecture of multicore platforms often fails to accommodate the inherently diverse requirements of different applications, and the permanent need to enhance the performance of multicore architectures motivates the development of a dynamic architecture. To address this issue, this paper presents new algorithms for thread selection in the fetch stage. In particular, it presents three new fetch-stage policies, EACH_LOOP_FETCH, INC-FETCH, and WZ-FETCH, based on the Ordinary Least Squares (OLS) regression method. These fetch policies differ in the thread-selection time, which is represented by the instruction count and the window size. Furthermore, the multicore simulation tool multi2sim is adapted to cope with a dynamic multicore processor design by adding a dynamic feature to the thread-selection policy in the fetch stage. SPLASH2, a suite of parallel scientific workloads, has been used to validate the proposed adaptation of multi2sim. Intensive simulation experiments show that remarkable performance enhancements are achieved in terms of execution time and instructions per second, with fewer broadcast operations than the typical algorithm.
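
To illustrate how an OLS fit could drive fetch-stage thread selection in general, the sketch below fits a least-squares line to each thread's recent IPC samples and fetches from the thread with the best extrapolated IPC. It is not the EACH_LOOP_FETCH, INC-FETCH, or WZ-FETCH policy; all names and window sizes are our own assumptions.

```c
#include <stddef.h>
#include <stdio.h>

/* Ordinary least squares fit y = a + b*x over x = 0..n-1. */
static void ols_fit(const double *y, size_t n, double *a, double *b)
{
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (size_t i = 0; i < n; ++i) {
        double x = (double)i;
        sx += x; sy += y[i]; sxx += x * x; sxy += x * y[i];
    }
    *b = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    *a = (sy - *b * sx) / n;
}

/* Pick the thread whose OLS-extrapolated IPC for the next window is highest.
 * ipc[t] holds the last n IPC samples of thread t.  Illustrative only. */
static size_t select_thread(const double ipc[][4], size_t threads, size_t n)
{
    size_t best = 0;
    double best_pred = -1.0;
    for (size_t t = 0; t < threads; ++t) {
        double a, b;
        ols_fit(ipc[t], n, &a, &b);
        double pred = a + b * (double)n;       /* extrapolate one step ahead */
        if (pred > best_pred) { best_pred = pred; best = t; }
    }
    return best;
}

int main(void)
{
    const double ipc[2][4] = { { 1.1, 1.0, 0.9, 0.8 },    /* declining thread */
                               { 0.7, 0.8, 0.9, 1.0 } };  /* improving thread */
    printf("fetch from thread %zu next window\n", select_thread(ipc, 2, 4));
    return 0;
}
```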


1968 ◽  
Vol 61 (4) ◽  
pp. 399-402
Author(s):  
Jack M. Elkin

The binomial coefficients are an almost endless source of formulas for the summation of series. A reference to the “Problems for Solution” pages of the American Mathematical Monthly or to an advanced collection of mathematical formulas will convince anyone who has not yet discovered this for himself. Some of these series summations can be derived with relative ease with the help of the binomial theorem or Pascal's triangle; many require a high degree of virtuosity in algebraic manipulation and, often, advanced methods of analysis. A number of them can be obtained simply by reasoning logically about the meaning of certain combinatorial expressions, with recourse to only a minimum of algebra or to none at all. These, naturally, have a special appeal of their own, and it is the purpose of this article to illustrate several such derivations.
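
One identity of exactly this kind, provable by counting alone (chosen here for illustration; it is not claimed to be among the article's own examples), is the hockey-stick identity:

```latex
% Hockey-stick identity: provable by a counting argument, no algebra needed.
\[
  \sum_{i=r}^{n} \binom{i}{r} \;=\; \binom{n+1}{r+1}
\]
```

To choose r + 1 numbers from {1, 2, ..., n + 1}, classify each selection by its largest element: if the largest element is i + 1 with r ≤ i ≤ n, the remaining r numbers must come from {1, ..., i}, which can be done in C(i, r) ways, and summing over i counts every selection exactly once.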

