ACM Transactions on Parallel Computing
Latest Publications


TOTAL DOCUMENTS: 180 (five years: 86)
H-INDEX: 12 (five years: 4)

Published by the Association for Computing Machinery
ISSN: 2329-4949

2021 ◽ Vol 8 (4) ◽ pp. 1-25
Author(s): Laurent Feuilloley, Pierre Fraigniaud

We continue investigating the line of research questioning the power of randomization for the design of distributed algorithms. In their seminal paper, Naor and Stockmeyer [STOC 1993] established that, in the context of network computing in which all nodes execute the same algorithm in parallel, any construction task that can be solved locally by a randomized Monte-Carlo algorithm can also be solved locally by a deterministic algorithm. This result, however, holds only for distributed tasks whose solutions can be locally checked for correctness by a deterministic algorithm. In this article, we extend the result of Naor and Stockmeyer to a wider class of tasks. Specifically, we prove that the same derandomization result holds for every task whose solutions can be locally checked for correctness using a two-sided-error randomized Monte-Carlo algorithm.


2021 ◽ Vol 8 (4) ◽ pp. 1-25
Author(s): Saleh Khalaj Monfared, Omid Hajihassani, Vahid Mohsseni, Dara Rahmati, Saeid Gorgin

In this work, we present a novel bitsliced high-performance Viterbi algorithm suitable for high-throughput and data-intensive communication. A new column-major data representation scheme coupled with the bitsliced architecture is employed in our proposed Viterbi decoder, enabling maximum utilization of the parallel processing units in modern parallel accelerators. With this altered data scheme, each processing unit operates on 32-bit chunks of data instead of performing the conventional bit-by-bit operations. This means that a single bitsliced parallel Viterbi decoder is capable of decoding 32 different chunks of data simultaneously. Here, the Viterbi Add-Compare-Select procedure is implemented with our proposed bitslicing technique, and we show that the bitsliced operations for the Viterbi internal functionality are efficient in terms of both performance and complexity. We achieve this level of parallelism while maintaining an acceptable bit error rate. Our hard- and soft-decision Viterbi decoder implementations on GPU platforms outperform the fastest previously proposed works by 4.3× and 2.3×, achieving 21.41 and 8.24 Gbps on a Tesla V100, respectively.
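To make the data layout concrete, the following C++ sketch (our illustration, not the authors' implementation) shows the Select step of Add-Compare-Select under the column-major, bitsliced scheme described above: bit lane j of every 32-bit word belongs to frame j, so a single bitwise blend updates the survivor bits of 32 frames at once. The type and function names are assumptions made for the example.

#include <cstdint>

// Bitsliced, column-major layout: word t holds bit position t of 32 independent
// frames, one frame per bit lane (lane j = frame j).
using Slice = std::uint32_t;

// Select step of Add-Compare-Select in bitsliced form: for each of the 32
// frames, keep the survivor bit from pathA when that frame's decision bit in
// `mask` is 1, otherwise keep the bit from pathB. One call updates 32 frames.
inline Slice bitslicedSelect(Slice pathA, Slice pathB, Slice mask) {
    return (pathA & mask) | (pathB & ~mask);
}

// Hard-decision branch-metric bit for all 32 frames at once: a 1 in lane j
// means frame j's received bit differs from the expected branch bit.
inline Slice branchMismatch(Slice received, Slice expected) {
    return received ^ expected;
}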


2021 ◽ Vol 8 (4) ◽ pp. 1-19
Author(s): Xuejiao Kang, David F. Gleich, Ahmed Sameh, Ananth Grama

As parallel and distributed systems scale, fault tolerance is an increasingly important problem, particularly on systems with limited I/O capacity and bandwidth. Erasure-coded computations address this problem by augmenting a given problem instance with redundant data and then solving the augmented problem in a fault-oblivious manner in a faulty parallel environment. In the event of faults, a computationally inexpensive procedure is used to compute the true solution from a potentially fault-prone solution. These techniques are significantly more efficient than conventional solutions to the fault-tolerance problem. In this article, we show how to minimize, to optimality, the overhead associated with our problem-augmentation techniques for linear system solvers. Specifically, we present a technique that adaptively augments the problem only when faults are detected. At any point in the execution, we only solve a system whose size is identical to that of the original input system. This has several advantages in terms of maintaining the size and conditioning of the system, as well as adding only the minimal amount of computation needed to tolerate the observed faults. We present, in detail, the augmentation process, the parallel formulation, and an evaluation of the performance of our technique. Specifically, we show that the proposed adaptive fault-tolerance mechanism has minimal overhead in terms of FLOP counts with respect to the original solver executing in a non-faulty environment, has good convergence properties, and yields excellent parallel performance. We also demonstrate that our approach significantly outperforms an optimized application-level checkpointing scheme that checkpoints only the needed data structures.


2021 ◽ Vol 8 (4) ◽ pp. 1-23
Author(s): Shao-Chung Wang, Lin-Ya Yu, Li-An Her, Yuan-Shin Hwang, Jenq-Kuen Lee

A modern GPU is designed with many large thread groups to achieve high throughput and performance. Within these groups, threads are organized into fixed-size SIMD batches in which the same instruction is applied to vectors of data in lockstep. This GPU architecture is well suited to applications with a high degree of data parallelism, but its performance degrades severely when divergence occurs. Many optimizations for divergence have been proposed, and they vary in the divergence information they require about variables and branches. A previous analysis scheme treated pointers and return values from functions as divergent directly, and focused only on OpenCL 1.x. In this article, we present a novel scheme that reports divergence information for pointer-intensive OpenCL programs. The approach is based on extended static single assignment (SSA) and adds special functions and annotations from memory SSA and gated SSA. The proposed scheme first constructs the extended SSA form, which is then used to build a divergence relation graph that includes all possible points-to relationships of the pointers and their initial divergence states. The divergence state of each pointer can then be determined by propagating divergence states through the divergence relation graph. The scheme is further extended to interprocedural cases by considering function-related statements. The proposed scheme was implemented in an LLVM compiler and can be applied to OpenCL programs. We analyzed 10 programs with 24 kernels, with a total analyzed program size of 1,306 instructions in the LLVM intermediate representation, comprising 885 variables, 108 branches, and 313 pointer-related statements. The total number of divergent pointers detected was 146 for the proposed scheme, 200 for a scheme in which pointers are always treated as divergent, and 155 for the current LLVM default scheme; the total numbers of divergent variables detected were 458, 519, and 482, respectively, with 31, 34, and 32 divergent branches. These experimental results indicate that the proposed scheme is more precise than both a scheme in which pointers are always divergent and the current LLVM default scheme.
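As a plain C++ illustration of the property the analysis tracks (this is not the paper's OpenCL code or its SSA-based analysis), consider a kernel-like function in which one pointer is derived from the work-item id and another only from uniform arguments; a divergence analysis would mark the former divergent, the latter uniform, and the branch on data loaded through the divergent pointer as a divergent branch.

#include <cstddef>

// A C++ analogue of a data-parallel kernel body, shown for one work-item.
// `tid` plays the role of the work-item id; it differs across lanes, so any
// value derived from it is divergent.
void kernelBody(const float* in, float* out, const float* table, std::size_t tid) {
    const float* p = in + tid;      // divergent pointer: depends on tid
    const float* q = table;         // uniform pointer: same for every lane
    float x = *p;                   // divergent load
    if (x > 0.0f) {                 // divergent branch: condition depends on tid
        x += q[0];                  // uniform data, but executed under divergence
    }
    out[tid] = x;
}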


2021 ◽ Vol 8 (4) ◽ pp. 1-26
Author(s): Prasad Jayanti, Siddhartha Jayanti

The abortable mutual exclusion problem, proposed by Scott and Scherer in response to the needs of real-time systems and databases, is a variant of mutual exclusion that allows processes to abort their attempt to acquire the lock. Worst-case constant remote-memory-reference algorithms for mutual exclusion using hardware instructions such as Fetch&Add or Fetch&Store have long existed for both cache-coherent (CC) and distributed shared memory (DSM) multiprocessors, but no such algorithms are known for abortable mutual exclusion. Even when the worst-case requirement is relaxed to amortized, algorithms are known only for the CC model. In this article, we improve this state of the art by designing a deterministic algorithm that uses Fetch&Store to achieve amortized O(1) remote memory references in both the CC and DSM models. Our algorithm supports Fast Abort (a process aborts within six steps of receiving the abort signal) and has the following additional desirable properties: it supports an arbitrary number of processes with arbitrary names, requires only O(1) space per process, and satisfies a novel fairness condition that we call Airline FCFS. Our algorithm is short, with fewer than a dozen lines of code.
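For readers unfamiliar with Fetch&Store-based queue locks, the sketch below shows a classic CLH-style lock written with std::atomic::exchange. It is not the article's abortable algorithm (it supports no aborts and achieves constant remote references only in the CC model); it merely illustrates how a single Fetch&Store on a shared tail pointer enqueues a process behind its predecessor.

#include <atomic>

// CLH-style queue lock built from Fetch&Store (std::atomic::exchange).
// Illustrative only: unlike the article's algorithm, it is not abortable.
struct Node {
    std::atomic<bool> locked;
    explicit Node(bool l) : locked(l) {}
};

class CLHLock {
    std::atomic<Node*> tail;
public:
    CLHLock() : tail(new Node(false)) {}          // sentinel: lock initially free
    ~CLHLock() { delete tail.load(); }
    Node* acquire() {
        Node* me = new Node(true);                // my record, initially "held"
        Node* pred = tail.exchange(me);           // Fetch&Store: join the wait queue
        while (pred->locked.load(std::memory_order_acquire)) { /* spin */ }
        delete pred;                              // predecessor is done with its record
        return me;                                // caller passes this to release()
    }
    void release(Node* me) {
        me->locked.store(false, std::memory_order_release);   // hand off to successor
    }
};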


2021 ◽ Vol 8 (3) ◽ pp. 1-17
Author(s): Faith Ellen, Barun Gorain, Avery Miller, Andrzej Pelc

Broadcast is one of the fundamental network communication primitives. One node of a network, called the source, has a message that has to be learned by all other nodes. We consider broadcast in radio networks, modeled as simple undirected connected graphs with a distinguished source. Nodes communicate in synchronous rounds. In each round, a node can either transmit a message to all its neighbours or stay silent and listen. At the receiving end, a node v hears a message from a neighbour w in a given round if v listens in this round and w is its only neighbour that transmits in this round. If more than one neighbour of a node v transmits in a given round, we say that a collision occurs at v. We do not assume collision detection: in case of a collision, node v does not hear anything (except the background noise that it also hears when no neighbour transmits). We are interested in the feasibility of deterministic broadcast in radio networks. If the nodes of the network do not have any labels, deterministic broadcast is impossible even in the four-cycle. On the other hand, if all nodes have distinct labels, then broadcast can be carried out, e.g., in a round-robin fashion, and hence O(log n)-bit labels are sufficient for this task in n-node networks. In fact, O(log Δ)-bit labels, where Δ is the maximum degree, are enough to broadcast successfully. Hence, it is natural to ask whether very short labels are sufficient for broadcast. Our main result is a positive answer to this question. We show that every radio network can be labeled using 2 bits in such a way that broadcast can be accomplished by a universal deterministic algorithm that knows neither the network topology nor any bound on its size. Moreover, at the expense of an extra bit in the labels, we can obtain the following additional strong property of our algorithm: there exists a common round in which all nodes know that broadcast has been completed. Finally, we show that 3-bit labels are also sufficient to solve both versions of broadcast in the case where it is not known a priori which node is the source.
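The communication model can be made precise with a small simulation routine (our illustration, not part of the paper): in one synchronous round, a listening node receives a message exactly when a single neighbour transmits, and a collision is indistinguishable from silence.

#include <cstddef>
#include <optional>
#include <vector>

// One synchronous round of the radio model on an undirected graph.
// adj[v] lists the neighbours of v; transmits[v] holds the message node v sends
// this round (or nullopt if v listens). Returns what each node hears: a message
// if exactly one neighbour transmitted, nothing otherwise (silence and collision
// are indistinguishable without collision detection).
std::vector<std::optional<int>> radioRound(
        const std::vector<std::vector<int>>& adj,
        const std::vector<std::optional<int>>& transmits) {
    std::vector<std::optional<int>> heard(adj.size());
    for (std::size_t v = 0; v < adj.size(); ++v) {
        if (transmits[v]) continue;              // a transmitting node does not listen
        int transmittingNeighbours = 0;
        std::optional<int> msg;
        for (int w : adj[v]) {
            if (transmits[w]) {
                ++transmittingNeighbours;
                msg = transmits[w];
            }
        }
        if (transmittingNeighbours == 1) heard[v] = msg;  // unique transmitter: received
        // 0 or >= 2 transmitters: v hears only background noise
    }
    return heard;
}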


2021 ◽ Vol 8 (3) ◽ pp. 1-30
Author(s): Matthias Maier, Martin Kronbichler

We discuss the efficient implementation of a high-performance second-order collocation-type finite-element scheme for solving the compressible Euler equations of gas dynamics on unstructured meshes. The solver is based on the convex-limiting technique introduced by Guermond et al. (SIAM J. Sci. Comput. 40, A3211–A3239, 2018). As such, it is invariant-domain preserving; i.e., the solver maintains important physical invariants and is guaranteed to be stable without the use of ad hoc tuning parameters. This stability comes at the expense of a significantly more involved algorithmic structure that renders conventional high-performance discretizations challenging. We develop an algorithmic design that allows SIMD vectorization of the compute kernel, identify the main ingredients of good node-level performance, and report excellent weak and strong scaling of a hybrid thread/MPI parallelization.
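The flavour of the kernel design can be conveyed by a small C++ sketch (an illustration under our own assumptions, not the paper's implementation): the convex-limiting update blends a low-order, invariant-domain-preserving state with a high-order candidate, u = u_low + l (u_high - u_low) with 0 ≤ l ≤ 1, and writing it lane-wise over a fixed-width batch of cells keeps the loop branch-free and SIMD-friendly.

#include <array>
#include <cstddef>

// Illustrative sketch, not the paper's code. W and Batch are assumptions: a
// batch of W cells is processed in lockstep so each arithmetic statement acts
// on W values that the compiler can map onto one SIMD register.
constexpr std::size_t W = 4;            // illustrative SIMD width
using Batch = std::array<double, W>;    // one value per SIMD lane (per cell)

// Lane-wise convex blend u = u_low + l * (u_high - u_low); branch-free and
// auto-vectorizable.
inline Batch convexBlend(const Batch& uLow, const Batch& uHigh, const Batch& limiter) {
    Batch out;
    for (std::size_t lane = 0; lane < W; ++lane)
        out[lane] = uLow[lane] + limiter[lane] * (uHigh[lane] - uLow[lane]);
    return out;
}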


2021 ◽ Vol 8 (3) ◽ pp. 1-18
Author(s): James Edwards, Uzi Vishkin

Boolean satisfiability (SAT) is an important, performance-hungry problem with applications in many domains. However, most work on parallelizing SAT solvers has focused on coarse-grained, mostly embarrassing parallelism. Here, we study fine-grained parallelism that can speed up existing sequential SAT solvers, which all happen to be of the so-called Conflict-Directed Clause Learning variety. We show the potential for speedups of up to 382× across a variety of problem instances. We hope that these results will stimulate future research, particularly with respect to an open problem in computer architecture that we present.


2021 ◽ Vol 8 (3) ◽ pp. 1-20
Author(s): Michael A. Bender, Alex Conway, Martín Farach-Colton, William Jannen, Yizheng Jiao, ...

Storage devices have complex performance profiles, including costs to initiate IOs (e.g., seek times in hard drives), parallelism and bank conflicts (in SSDs), costs to transfer data, and firmware-internal operations. The Disk-Access Machine (DAM) model simplifies reality by assuming that storage devices transfer data in blocks of size B and that all transfers have unit cost. Despite its simplifications, the DAM model is reasonably accurate. In fact, if B is set to the half-bandwidth point, where the latency and bandwidth of the hardware are equal, then the DAM approximates the IO cost on any hardware to within a factor of 2. Furthermore, the DAM model explains the popularity of B-trees in the 1970s and the current popularity of Bɛ-trees and log-structured merge trees. But it fails to explain why some B-trees use small nodes, whereas all Bɛ-trees use large nodes. In a DAM, all IOs, and hence all nodes, are the same size. In this article, we show that the affine and PDAM models, which are small refinements of the DAM model, yield a surprisingly large improvement in predictability without sacrificing ease of use. We present benchmarks on a large collection of storage devices showing that the affine and PDAM models give good approximations of the performance characteristics of hard drives and SSDs, respectively. We show that the affine model explains node-size choices in B-trees and Bɛ-trees. Furthermore, the models predict that B-trees are highly sensitive to variations in node size, whereas Bɛ-trees are much less sensitive. These predictions are borne out empirically. Finally, we show that in both the affine and PDAM models, it pays to organize data structures to exploit varying IO sizes. In the affine model, Bɛ-trees can be optimized so that all operations are simultaneously optimal, even up to lower-order terms. In the PDAM model, Bɛ-trees (or B-trees) can be organized so that both sequential and concurrent workloads are handled efficiently. We conclude that the DAM model is useful as a first cut when designing or analyzing an algorithm or data structure, but the affine and PDAM models enable the algorithm designer to optimize parameter choices and fill in design details.
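A toy cost calculator makes the affine refinement concrete (parameter names and numbers are our own illustrations, not the paper's notation): an IO of k bytes is charged a fixed setup cost plus a per-byte transfer cost, and the half-bandwidth point is the transfer size at which the two terms are equal, which is the block size for which the DAM's unit-cost view stays within a factor of 2 of the affine cost.

#include <cstdio>

// Toy cost model contrasting the DAM and affine views of an IO.
struct AffineDevice {
    double setup;    // fixed cost to initiate an IO (e.g., a seek)
    double perByte;  // transfer cost per byte
    // Affine model: an IO of k bytes costs setup + perByte * k.
    double ioCost(double k) const { return setup + perByte * k; }
    // Half-bandwidth point: the size at which the latency term and the
    // bandwidth term are equal; using it as the DAM block size B keeps the
    // DAM's unit-cost estimate within a factor of 2 of the affine cost.
    double halfBandwidthPoint() const { return setup / perByte; }
};

int main() {
    AffineDevice hdd{5e-3, 1e-8};  // illustrative: 5 ms seek, 100 MB/s transfer
    double B = hdd.halfBandwidthPoint();
    std::printf("half-bandwidth block size: %.0f bytes\n", B);
    std::printf("cost of one block IO: %.4f s (DAM would call this 1 unit)\n",
                hdd.ioCost(B));
}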


2021 ◽ Vol 8 (3) ◽ pp. 1-25
Author(s): Soheil Behnezhad, Laxman Dhulipala, Hossein Esfandiari, Jakub Łącki, Vahab Mirrokni, ...

We introduce the Adaptive Massively Parallel Computation (AMPC) model, which is an extension of the Massively Parallel Computation (MPC) model. At a high level, the AMPC model strengthens the MPC model by storing all messages sent within a round in a distributed data store. In the following round, all machines are provided with random read access to the data store, subject to the same constraints on the total amount of communication as in the MPC model. Our model is inspired by previous empirical studies of distributed graph algorithms [8, 30] using MapReduce and a distributed hash table service [17]. This extension allows us to give new graph algorithms with much lower round complexities compared to the best-known solutions in the MPC model. In particular, in the AMPC model we show how to solve maximal independent set in O(1) rounds and connectivity/minimum spanning tree in O(log log_{m/n} n) rounds, both using O(n^δ) space per machine for a constant δ < 1. In the same memory regime for MPC, the best-known algorithms for these problems require poly(log n) rounds. Our results imply that the 2-Cycle conjecture, which is widely believed to hold in the MPC model, does not hold in the AMPC model.
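The distinguishing feature of the AMPC model can be sketched as an interface (a hypothetical illustration, not the authors' system): machines still emit key/value messages each round as in MPC, but in the next round every machine may additionally issue point reads against the store of the previous round's messages, still counted against its per-machine communication budget.

#include <map>
#include <optional>
#include <string>

// Illustrative stand-in for the distributed data store holding one round's
// messages. Writes happen during a round; reads are the random-access queries
// available to every machine in the following round.
class RoundStore {
    std::map<std::string, std::string> kv;   // stands in for the distributed store
public:
    void write(const std::string& key, const std::string& value) { kv[key] = value; }
    std::optional<std::string> read(const std::string& key) const {
        auto it = kv.find(key);
        if (it == kv.end()) return std::nullopt;
        return it->second;
    }
};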

