scholarly journals Trace Generation and Deterministic Execution for Concurrent Programs

2016 ◽  
Author(s):  
Paulo Souza ◽  
Raphael Batista ◽  
Simone Souza ◽  
Rafael Prado ◽  
George Dourado ◽  
...  

This paper proposes new algorithms for generation of trace files and deterministic execution of concurrent programs under test. The proposed algorithms are essential to automate the coverage testing of concurrent programs and allow to execute new synchronizations automatically, increasing the source code coverage with focus on non-determinism, and edges of communication and synchronization. Our algorithms consider programs with multiple paradigms of communication and synchronization (collective, blocking and non-blocking point-to-point message passing, and shared memory). We validate our algorithms by means of experiments based on nine representative benchmarks, which exercise non-trivial aspects of synchronization found in real applications. Our algorithms have a robust behaviour and meet their objectives. We also highlight the overhead generated with the algorithms.

2013 ◽  
Vol 2013 ◽  
pp. 1-40 ◽  
Author(s):  
Krzysztof Jozwik ◽  
Shinya Honda ◽  
Masato Edahiro ◽  
Hiroyuki Tomiyama ◽  
Hiroaki Takada

Dynamic Partial Reconfiguration technology coupled with an Operating System for Reconfigurable Systems (OS4RS) allows for implementation of a hardware task concept, that is, an active computing object which can contend for reconfigurable computing resources and request OS services in a way software task does in a conventional OS. In this work, we show a complete model and implementation of a lightweight OS4RS supporting preemptable and clock-scalable hardware tasks. We also propose a novel, lightweight scheduling mechanism allowing for timely and priority-based reservation of reconfigurable resources, which aims at usage of preemption only at the time it brings benefits to the performance of a system. The architecture of the scheduler and the way it schedules allocations of the hardware tasks result in shorter latency of system calls, thereby reducing the overall OS overhead. Finally, we present a novel model and implementation of a channel-based intertask communication and synchronization suitable for software-hardware multitasking with preemptable and clock-scalable hardware tasks. It allows for optimizations of the communication on per task basis and utilizes point-to-point message passing rather than shared-memory communication, whenever it is possible. Extensive overhead tests of the OS4RS services as well as application speedup tests show efficiency of our approach.


Author(s):  
Qichang Chen ◽  
Liqiang Wang ◽  
Ping Guo ◽  
He Huang

Today, multi-core/multi-processor hardware has become ubiquitous, leading to a fundamental turning point on software development. However, developing concurrent programs is difficult. Concurrency introduces the possibility of errors that do not exist in sequential programs. This chapter introduces the major concurrent programming models including multithreaded programming on shared memory and message passing programming on distributed memory. Then, the state-of-the-art research achievements on detecting concurrency errors such as deadlock, race condition, and atomicity violation are reviewed. Finally, the chapter surveys the widely used tools for testing and debugging concurrent programs.


2013 ◽  
Vol 18 ◽  
pp. 149-158 ◽  
Author(s):  
Paulo S.L. Souza ◽  
Simone S. Souza ◽  
Murilo G. Rocha ◽  
Rafael R. Prado ◽  
Raphael N. Batista

2021 ◽  
Vol 178 (3) ◽  
pp. 229-266
Author(s):  
Ivan Lanese ◽  
Adrián Palacios ◽  
Germán Vidal

Causal-consistent reversible debugging is an innovative technique for debugging concurrent systems. It allows one to go back in the execution focusing on the actions that most likely caused a visible misbehavior. When such an action is selected, the debugger undoes it, including all and only its consequences. This operation is called a causal-consistent rollback. In this way, the user can avoid being distracted by the actions of other, unrelated processes. In this work, we introduce its dual notion: causal-consistent replay. We allow the user to record an execution of a running program and, in contrast to traditional replay debuggers, to reproduce a visible misbehavior inside the debugger including all and only its causes. Furthermore, we present a unified framework that combines both causal-consistent replay and causal-consistent rollback. Although most of the ideas that we present are rather general, we focus on a popular functional and concurrent programming language based on message passing: Erlang.


2016 ◽  
Vol 26 (03) ◽  
pp. 1650014 ◽  
Author(s):  
Markus Flatz ◽  
Marián Vajteršic

The goal of Nonnegative Matrix Factorization (NMF) is to represent a large nonnegative matrix in an approximate way as a product of two significantly smaller nonnegative matrices. This paper shows in detail how an NMF algorithm based on Newton iteration can be derived using the general Karush-Kuhn-Tucker (KKT) conditions for first-order optimality. This algorithm is suited for parallel execution on systems with shared memory and also with message passing. Both versions were implemented and tested, delivering satisfactory speedup results.


1992 ◽  
Vol 6 (1) ◽  
pp. 98-111 ◽  
Author(s):  
S. K. Kim ◽  
A. T. Chrortopoulos

Main memory accesses for shared-memory systems or global communications (synchronizations) in message passing systems decrease the computation speed. In this paper, the standard Arnoldi algorithm for approximating a small number of eigenvalues, with largest (or smallest) real parts for nonsymmetric large sparse matrices, is restructured so that only one synchronization point is required; that is, one global communication in a message passing distributed-memory machine or one global memory sweep in a shared-memory machine per each iteration is required. We also introduce an s-step Arnoldi method for finding a few eigenvalues of nonsymmetric large sparse matrices. This method generates reduction matrices that are similar to those generated by the standard method. One iteration of the s-step Arnoldi algorithm corresponds to s iterations of the standard Arnoldi algorithm. The s-step method has improved data locality, minimized global communication, and superior parallel properties. These algorithms are implemented on a 64-node NCUBE/7 Hypercube and a CRAY-2, and performance results are presented.


2005 ◽  
Vol 18 (2) ◽  
pp. 219-224
Author(s):  
Emina Milovanovic ◽  
Natalija Stojanovic

Because many universities lack the funds to purchase expensive parallel computers, cost effective alternatives are needed to teach students about parallel processing. Free software is available to support the three major paradigms of parallel computing. Parallaxis is a sophisticated SIMD simulator which runs on a variety of platforms.jBACI shared memory simulator supports the MIMD model of computing with a common shared memory. PVM and MPI allow students to treat a network of workstations as a message passing MIMD multicomputer with distributed memory. Each of this software tools can be used in a variety of courses to give students experience with parallel algorithms.


2015 ◽  
Vol 8 (10) ◽  
pp. 8981-9020 ◽  
Author(s):  
C. Zhang ◽  
L. Liu ◽  
G. Yang ◽  
R. Li ◽  
B. Wang

Abstract. Data transfer, which means transferring data fields between two component models or rearranging data fields among processes of the same component model, is a fundamental operation of a coupler. Most of state-of-the-art coupler versions currently use an implementation based on the point-to-point (P2P) communication of the Message Passing Interface (MPI) (call such an implementation "P2P implementation" for short). In this paper, we reveal the drawbacks of the P2P implementation, including low communication bandwidth due to small message size, variable and big number of MPI messages, and jams during communication. To overcome these drawbacks, we propose a butterfly implementation for data transfer. Although the butterfly implementation can outperform the P2P implementation in many cases, it degrades the performance in some cases because the total message size transferred by the butterfly implementation is larger than that by the P2P implementation. To make the data transfer completely improved, we design and implement an adaptive data transfer library that combines the advantages of both butterfly implementation and P2P implementation. Performance evaluation shows that the adaptive data transfer library significantly improves the performance of data transfer in most cases and does not decrease the performance in any cases. Now the adaptive data transfer library is open to the public and has been imported into a coupler version C-Coupler1 for performance improvement of data transfer. We believe that it can also improve other coupler versions.


2009 ◽  
Vol 6 (2) ◽  
pp. 23
Author(s):  
Siti Arpah Ahmad ◽  
Mohamed Faidz Mohamed Said ◽  
Norazan Mohamed Ramli ◽  
Mohd Nasir Taib

This paper focuses on the performance of basic communication primitives, namely the overlap of message transfer with computation in the point-to-point communication within a small cluster of four nodes. The mpptest has been implemented to measure the basic performance of MPI message passing routines with a variety of message sizes. The mpptest is capable of measuring performance with many participating processes thus exposing contention and scalability problems. This enables programmers to select message sizes in order to isolate and evaluate sudden changes in performance. Investigating these matters is interesting in that non-blocking calls have the advantage of allowing the system to schedule communications even when many processes are running simultaneously. On the other hand, understanding the characteristics of computation and communication overlap is significant, because high- performance kernels often strive to achieve this, since it is both advantageous with respect to data transfer and latency hiding. The results indicate that certain overlap sizes utilize greater node processing power either in blocking send and receive operations or non-blocking send and receive operations. The results have elucidated a detailed MPI characterization of the performance regarding the overlap of message transfer with computation in a small cluster system. 


Sign in / Sign up

Export Citation Format

Share Document