Renumbering strategies for unstructured-grid solvers operating on shared-memory, cache-based parallel machines

AbstractOver time, several competing approaches to parallel Haskell programming have emerged. Different approaches support parallelism at various different scales, ranging from small multicores to massively parallel high-performance computing systems. They also provide varying degrees of control, ranging from completely implicit approaches to ones providing full programmer control. Most current designs assume a shared memory model at the programmer, implementation and hardware levels. This is, however, becoming increasingly divorced from the reality at the hardware level. It also imposes significant unwanted runtime overheads in the form of garbage collection synchronisation etc. What is needed is an easy way to abstract over the implementation and hardware levels, while presenting a simple parallelism model to the programmer. The PArallEl shAred Nothing runtime system design aims to provide a portable and high-level shared-nothing implementation platform for parallel Haskell dialects. It abstracts over major issues such as work distribution and data serialisation, consolidating existing, successful designs into a single framework. It also provides an optional virtual shared-memory programming abstraction for (possibly) shared-nothing parallel machines, such as modern multicore/manycore architectures or cluster/cloud computing systems. It builds on, unifies and extends, existing well-developed support for shared-memory parallelism that is provided by the widely used GHC Haskell compiler. This paper summarises the state-of-the-art in shared-nothing parallel Haskell implementations, introduces the PArallEl shAred Nothing abstractions, shows how they can be used to implement three distinct parallel Haskell dialects, and demonstrates that good scalability can be obtained on recent parallel machines.

Download Full-text

An efficient Parallel Substrate for Typed Feature Structures on shared memory parallel machines

10.3115/980691.980728 ◽

1998 ◽

Author(s):

Ninomiya Takashi ◽

Torisawa Kentaro ◽

Tsujii Jun'ichi

Keyword(s):

Shared Memory ◽

Parallel Machines ◽

Feature Structures ◽

Typed Feature Structures

Download Full-text

An implementation of ray tracing algorithm for the multiprocessor machines

Yugoslav journal of operations research ◽

10.2298/yjor0601125s ◽

2006 ◽

Vol 16 (1) ◽

pp. 125-135 ◽

Cited By ~ 1

Author(s):

Aleksandar Samardzic ◽

Dusan Starcevic ◽

Milan Tuba

Keyword(s):

Parallel Algorithm ◽

Ray Tracing ◽

Shared Memory ◽

Parallel Machines ◽

Lighting Condition ◽

Posix Threads ◽

Scene Description ◽

Tracing Algorithm

Ray Tracing is an algorithm for generating photo-realistic pictures of the 3D scenes, given scene description, lighting condition and viewing parameters as inputs. The algorithm is inherently convenient for parallelization and the simplest parallelization scheme is for the shared-memory parallel machines (multiprocessors). This paper presents two implementations of the algorithm developed by the authors for alike machines, one using the POSIX threads API and another one using the OpenMP API. The paper also presents results of rendering some test scenes using these implementations and discusses our parallel algorithm version efficiency.

Download Full-text

Programming Heterogeneous Parallel Machines Using Refactoring and Monte–Carlo Tree Search

International Journal of Parallel Programming ◽

10.1007/s10766-020-00665-z ◽

2020 ◽

Vol 48 (4) ◽

pp. 583-602

Author(s):

Christopher Brown ◽

Vladimir Janjic ◽

M. Goli ◽

J. McCall

Keyword(s):

Monte Carlo ◽

Shared Memory ◽

Parallel Machines ◽

Memory Systems ◽

Tool Support ◽

Tree Search ◽

Algorithmic Skeletons ◽

Monte Carlo Tree Search ◽

Sequential Code ◽

A New Technique

Abstract This paper presents a new technique for introducing and tuning parallelism for heterogeneous shared-memory systems (comprising a mixture of CPUs and GPUs), using a combination of algorithmic skeletons (such as farms and pipelines), Monte–Carlo tree search for deriving mappings of tasks to available hardware resources, and refactoring tool support for applying the patterns and mappings in an easy and effective way. Using our approach, we demonstrate easily obtainable, significant and scalable speedups on a number of case studies showing speedups of up to 41 over the sequential code on a 24-core machine with one GPU. We also demonstrate that the speedups obtained by mappings derived by the MCTS algorithm are within 5–15% of the best-obtained manual parallelisation.

Download Full-text

Running Models on Parallel Machines with Shared Memory

Computer Treatment of Large Air Pollution Models - Environmental Science and Technology Library ◽

10.1007/978-94-011-0311-4_8 ◽

1995 ◽

pp. 225-249

Author(s):

Zahari Zlatev

Keyword(s):

Shared Memory ◽

Parallel Machines

Download Full-text

Concurrent Computation of Attribute Filters on Shared Memory Parallel Machines

IEEE Transactions on Pattern Analysis and Machine Intelligence ◽

10.1109/tpami.2007.70836 ◽

2008 ◽

Vol 30 (10) ◽

pp. 1800-1813 ◽

Cited By ~ 60

Author(s):

M.H.F. Wilkinson ◽

Hui Gao ◽

W.H. Hesselink ◽

J.-E. Jonker ◽

A. Meijster

Keyword(s):

Shared Memory ◽

Parallel Machines ◽

Concurrent Computation

Download Full-text

Multi-core and many-core shared-memory parallel raycasting volume rendering optimization and tuning

The International Journal of High Performance Computing Applications ◽

10.1177/1094342012440466 ◽

2012 ◽

Vol 26 (4) ◽

pp. 399-412 ◽

Cited By ~ 13

Author(s):

E Wes Bethel ◽

Mark Howison

Keyword(s):

Shared Memory ◽

Volume Rendering ◽

Performance Optimization ◽

Optimal Algorithm ◽

Performance Model ◽

Crossover Point ◽

Memory Cache ◽

And Performance ◽

Many Core ◽

Optimal Configurations

Given the computing industry trend of increasing processing capacity by adding more cores to a chip, the focus of this work is tuning the performance of a staple visualization algorithm, raycasting volume rendering, for shared-memory parallelism on multi-core CPUs and many-core GPUs. Our approach is to vary tunable algorithmic settings, along with known algorithmic optimizations and two different memory layouts, and measure performance in terms of absolute runtime and L2 memory cache misses. Our results indicate there is a wide variation in runtime performance on all platforms, as much as 254% for the tunable parameters we test on multi-core CPUs and 265% on many-core GPUs, and the optimal configurations vary across platforms, often in a non-obvious way. For example, our results indicate the optimal configurations on the GPU occur at a crossover point between those that maintain good cache utilization and those that saturate computational throughput. This result is likely to be extremely difficult to predict with an empirical performance model for this particular algorithm because it has an unstructured memory access pattern that varies locally for individual rays and globally for the selected viewpoint. Our results also show that optimal parameters on modern architectures are markedly different from those in previous studies run on older architectures. In addition, given the dramatic performance variation across platforms for both optimal algorithm settings and performance results, there is a clear benefit for production visualization and analysis codes to adopt a strategy for performance optimization through auto-tuning. These benefits will likely become more pronounced in the future as the number of cores per chip and the cost of moving data through the memory hierarchy both increase.

Download Full-text