MULTITHREADED PARALLELISM WITH OPENMP

2005 ◽  
Vol 15 (04) ◽  
pp. 367-378 ◽  
Author(s):  
RAIMI RUFAI ◽  
MUSLIM BOZYIGIT ◽  
JARALLA ALGHAMDI ◽  
MOATAZ AHMED

While multithreaded programming is an effective way to exploit concurrency, multithreaded programs are notoriously hard to write, debug and tune for performance. In this paper, we present OpenMP shared memory programming as a viable alternative and a much simpler way to write multithreaded programs. Through empirical results obtained by running a simple matrix multiplication program written in OpenMP C on a single-processor machine, we show that the drop in performance compared with the single-threaded version may be negligible even on a uniprocessor. This cost is well compensated for by the increased programmer productivity resulting from the ease of programming, debugging and tuning, and from the relative ease of acquiring OpenMP skills.

2003 ◽  
Vol 13 (03) ◽  
pp. 353-364 ◽  
Author(s):  
XIE YONG ◽  
HSU WEN-JING

This paper considers the problem of scheduling dynamic parallel computations to achieve linear speedup without using significantly more space per processor than that required for a single-processor execution. Earlier research in the Cilk project proposed the "strict" computational model, in which every dependency goes from a thread x only to one of x's ancestor threads, and guaranteed both linear speedup and linear expansion of space. However, Cilk threads are stateless, and the task graphs that the Cilk language can express are series-parallel graphs, a proper subset of arbitrary task graphs. Moreover, Cilk does not support applications with pipelining. We propose the "aligned" multithreaded computational model, which extends the "strict" model of Cilk. In the aligned model, dependencies can go from an arbitrary thread x not only to x's ancestor threads, but also to x's younger brother threads, i.e., threads spawned by x's parent thread after x. We use the same measures of time and space as Cilk: T1 is the time required to execute the computation on 1 processor, T∞ is the time required by an infinite number of processors, and S1 is the space required to execute the computation on 1 processor. We show that for any aligned computation, there exists an execution schedule that achieves both efficient time and efficient space. Specifically, we show that for an execution of any aligned multithreaded computation on P processors, the time required is bounded by O(T1/P + T∞), and the space required can be loosely bounded by O(λ·S1·P), where λ is the maximum number of younger brother threads that have the same parent thread and can be blocked during execution. If we assume that λ is a constant, and that the space requirements of elder and younger brother threads are the same, then the space required is bounded by O(S1·P).
We further show that the aligned multithreaded computational model supports pipelined applications, and we propose a multithreaded programming language based on the model and show that it can express arbitrary task graphs.


1991 ◽  
Vol 15 (3) ◽  
pp. 235-256 ◽  
Author(s):  
X. Cyril ◽  
J. Angeles ◽  
A. Misra

In this paper the formulation and simulation of the dynamical equations of multibody mechanical systems comprising both rigid and flexible links are accomplished in two steps: in the first step, each link is considered as an unconstrained body and its Euler-Lagrange (EL) equations are derived disregarding the kinematic couplings; in the second step, the individual-link equations, along with the associated constraint forces, are assembled to obtain the constrained dynamical equations of the multibody system. The constraint forces are then efficiently eliminated by multiplying the said equations by the transpose of the natural orthogonal complement of the kinematic velocity constraints, which yields the independent dynamical equations. The equations of motion are solved for the generalized accelerations using the Cholesky decomposition method and integrated using Gear's method for stiff differential equations. Finally, the dynamical behaviour of the Shuttle Remote Manipulator performing a typical manoeuvre is determined using the above approach.


Author(s):  
Dimitri J. Mavriplis

The implementation and performance of a hybrid OpenMP/MPI parallel communication strategy for an unstructured mesh computational fluid dynamics code is described. The solver is cache efficient and fully vectorizable, and is parallelized using a two-level hybrid MPI-OpenMP implementation suitable for shared and/or distributed memory architectures, as well as clusters of shared memory machines. Parallelism is obtained through domain decomposition for both communication models. Single processor computational rates as well as scalability curves are given on various architectures. For the architectures studied in this work, the OpenMP or hybrid OpenMP/MPI communication strategies achieved no appreciable performance benefit over an exclusive MPI communication strategy.


Author(s):  
P. Raghu ◽  
K. Sriram

Grid computing is a special type of parallel computing that unites pools of servers, storage systems, and networks into a single large virtual supercomputer. Grid computing has the advantages of solving complex problems in a shorter time and making better use of existing hardware: it can exploit underutilized resources to meet business requirements while minimizing additional costs. Many grid setup tools are available; in this paper, the Globus Toolkit, an open-source tool for grid-enabled applications, is considered. Initially, a grid is established between two systems running Linux, using the Globus Toolkit. A simple matrix multiplication program, capable of running both on the grid and on stand-alone systems, is developed. The application is executed on a single system while varying the order of the matrices; the same application is then split into two sub-jobs and run on the two grid machines with different matrix orders. Finally, the results of the executions are compared and presented in graphs. The work can be extended to determine the type of parallelization suitable for the application developed. Similarly, the FP-tree algorithm is taken, and its data sets are fed to the different grid machines and to a stand-alone system. A suitable load-balancing mechanism for grid applications is discussed. The sections of the paper are arranged as follows: introduction to grids, grid setup using the Globus Toolkit, splitting of the matrix application, the FP-tree algorithm, performance results, future work, conclusion and references.

