Load balancing and parallel implementation of iterative algorithms for row-continuous Markov chains

Author(s):  
M. Colajanni ◽  
M. Angelaccio
2009 ◽  
Vol 7 ◽  
pp. 95-100 ◽  
Author(s):  
C. C. Sun ◽  
J. Götze

Abstract. Modern VLSI manufacturing technology has shrunk rapidly to the nanoscale level. Integration with advanced nanotechnology now makes it possible to realize advanced parallel iterative algorithms directly, which was almost impossible 10 years ago. In this paper, we discuss the influence of evolving VLSI technologies on iterative algorithms and present design strategies from an algorithmic and architectural point of view. When an iterative algorithm is implemented on a multiprocessor array, there is a trade-off between the performance/complexity of the processors and the load/throughput of the interconnects. This trade-off stems from the behavior of iterative algorithms: the parallel implementation (i.e., the processor elements of the multiprocessor array) can be simplified in any way as long as convergence is guaranteed. However, modifying the algorithm (processors) usually increases the number of required iterations, which in turn increases the switching activity of the interconnects. As an example, we show that a 25×25 full Jacobi EVD array can be realized in a single FPGA device with the simplified μ-rotation CORDIC architecture.


2021 ◽  
Vol 14 (2) ◽  
pp. 843-857
Author(s):  
Pavel Perezhogin ◽  
Ilya Chernov ◽  
Nikolay Iakovlev

Abstract. In this paper, we present a parallel version of the finite-element model of the Arctic Ocean (FEMAO) configured for the White Sea and based on MPI technology. This model consists of two main parts: an ocean dynamics model and a surface ice dynamics model. These parts differ greatly in computational cost because the complexity of the ocean part depends on the bottom depth, while that of the sea-ice component does not. As a first step, we decided to locate both submodels on the same CPU cores with a common horizontal partition of the computational domain. The model domain is divided into small blocks, which are distributed over the CPU cores using Hilbert-curve balancing. Partitioning of the model domain is static (i.e., computed during the initialization stage). There are three baseline options: a single block per core, balancing of 2D computations, and balancing of 3D computations. After showing parallel acceleration for particular ocean and ice procedures, we construct the common partition, which minimizes joint imbalance in both submodels. Our novelty is using arrays shared by all blocks that belong to a CPU core instead of allocating separate arrays for each block, as is usually done. Computations on a CPU core are restricted by the masks of non-land grid nodes and block–core correspondence. This approach allows us to implement parallel computations in the model that are as simple as when the usual decomposition into squares is used, but with improved load balancing. We demonstrate parallel acceleration on up to 996 cores for the model with a resolution of 500×500×39 in the ocean component and 43 sea-ice scalars, and we carry out a detailed analysis of the effect of different partitions on the model runtime.
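
The block distribution described in this abstract can be illustrated with a minimal sketch. It is not the FEMAO code: the function names, the square power-of-two block grid, and the `depth` weights are all assumptions made for illustration. Blocks are ordered along a Hilbert curve, land-only blocks (weight 0) are skipped, and the curve is cut into contiguous segments of roughly equal weight. Setting every wet block's weight to 1 would correspond loosely to balancing 2D work, and using the number of wet layers to balancing 3D work.

```python
def hilbert_index(n, x, y):
    """Position of cell (x, y) along a Hilbert curve on an n x n grid.

    n must be a power of two. This is the standard bit-twiddling mapping.
    """
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if x & s else 0
        ry = 1 if y & s else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve has the right orientation.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def hilbert_partition(weights, n_cores):
    """Assign blocks to cores in contiguous Hilbert-curve segments.

    weights[y][x] is a hypothetical cost of block (x, y); 0 marks a
    land-only block, which is skipped. Returns {(x, y): core}.
    """
    n = len(weights)
    blocks = [(x, y) for y in range(n) for x in range(n) if weights[y][x] > 0]
    blocks.sort(key=lambda b: hilbert_index(n, b[0], b[1]))
    total = sum(weights[y][x] for x, y in blocks)
    owner, acc = {}, 0.0
    for x, y in blocks:
        w = weights[y][x]
        # The core is chosen by where the block's midpoint falls in the
        # cumulative weight, so owners never decrease along the curve.
        owner[(x, y)] = min(n_cores - 1, int((acc + w / 2) * n_cores / total))
        acc += w
    return owner


def core_loads(weights, owner, n_cores):
    """Total weight assigned to each core."""
    loads = [0] * n_cores
    for (x, y), core in owner.items():
        loads[core] += weights[y][x]
    return loads


# Hypothetical 4 x 4 block grid: entries are wet layers per block, 0 = land.
depth = [
    [0, 0, 3, 5],
    [0, 2, 4, 6],
    [1, 2, 4, 5],
    [1, 1, 2, 0],
]
owner = hilbert_partition(depth, 3)
print(core_loads(depth, owner, 3))
```

Because the partition depends only on the weights, it can be computed once at initialization, which matches the static partitioning the abstract describes.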


2020 ◽  
Vol 21 (1) ◽  
Author(s):  
Marouen Ben Guebila

Abstract. Background: Genome-scale metabolic models are increasingly employed to predict the phenotype of various biological systems pertaining to healthcare and bioengineering. To characterize the full metabolic spectrum of such systems, Fast Flux Variability Analysis (FFVA) is commonly used in parallel with static load balancing. This approach assigns an equal number of biochemical reactions to each core without considering their solution complexity. Results: Here, we present Very Fast Flux Variability Analysis (VFFVA), a parallel implementation that dynamically balances the computational load between the cores at runtime, which guarantees equal convergence time between them. VFFVA achieved a threefold speedup with coupled models and a speedup of up to 100 with ill-conditioned models, along with a 14-fold decrease in memory usage. Conclusions: VFFVA exploits the parallel capabilities of modern machines to enable biological insights through optimizing systems biology modeling. VFFVA is available in C, MATLAB, and Python at https://github.com/marouenbg/VFFVA.
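
The difference between static and dynamic load balancing described in this abstract can be shown with a small simulation. This is a sketch only and does not reproduce VFFVA: the function names and the task costs are hypothetical, and each "task" merely stands in for one reaction whose solve time is unknown in advance. Static balancing gives every worker an equal number of tasks; dynamic balancing lets whichever worker becomes idle first take the next task.

```python
import heapq


def static_makespan(costs, n_workers):
    """Finish time when tasks are split into equal-count contiguous chunks."""
    chunk = -(-len(costs) // n_workers)  # ceiling division
    loads = [sum(costs[i:i + chunk]) for i in range(0, len(costs), chunk)]
    return max(loads)


def dynamic_makespan(costs, n_workers):
    """Finish time when each idle worker pulls the next task from a queue."""
    finish = [0.0] * n_workers  # min-heap of per-worker finish times
    heapq.heapify(finish)
    for cost in costs:
        earliest = heapq.heappop(finish)  # the worker that frees up first
        heapq.heappush(finish, earliest + cost)
    return max(finish)


# Hypothetical solve times: four expensive tasks followed by four cheap ones.
costs = [10.0, 10.0, 10.0, 10.0, 1.0, 1.0, 1.0, 1.0]
print(static_makespan(costs, 2))   # 40.0: one worker gets all expensive tasks
print(dynamic_makespan(costs, 2))  # 22.0: both workers finish together
```

With equal task counts, one worker ends up with all the expensive tasks while the other sits idle; the pull-based schedule evens out the finish times without knowing the costs beforehand.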

