Specifying and testing GPU workgroup progress models

2021 ◽  
Vol 5 (OOPSLA) ◽  
pp. 1-30
Author(s):  
Tyler Sorensen ◽  
Lucas F. Salvador ◽  
Harmit Raval ◽  
Hugues Evrard ◽  
John Wickerson ◽  
...  

As GPU availability has increased and programming support has matured, a wider variety of applications are being ported to these platforms. Many parallel applications contain fine-grained synchronization idioms; as such, their correct execution depends on a degree of relative forward progress between threads (or thread groups). Unfortunately, many GPU programming specifications (e.g. Vulkan and Metal) say almost nothing about relative forward progress guarantees between workgroups. Although prior work has proposed a spectrum of plausible progress models for GPUs, cross-vendor specifications have yet to commit to any model. This work is a collection of tools and experimental data to aid specification designers when considering forward progress guarantees in programming frameworks. As a foundation, we formalize a small parallel programming language that captures the essence of fine-grained synchronization. We then provide a means of formally specifying a progress model, and develop a termination oracle that decides whether a given program is guaranteed to eventually terminate with respect to a given progress model. Next, we formalize a set of constraints that describe concurrent programs that require forward progress to terminate. This allows us to synthesize a large set of 483 progress litmus tests. Combined with the termination oracle, we can determine the expected status of each litmus test -- i.e. whether it is guaranteed to eventually terminate -- under various progress models. We present a large experimental campaign running the litmus tests across 8 GPUs from 5 different vendors. Our results highlight that GPUs have significantly different termination behaviors under our test suite. Most notably, we find that Apple and ARM GPUs do not support the linear occupancy-bound model, as was hypothesized by prior work.
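To make the shape of such a progress litmus test concrete, the following is a minimal sketch, written as a CPU analogue in C with pthreads and C11 atomics rather than in the paper's GPU test framework. One thread busy-waits on a flag that only the other thread sets, so termination depends on the scheduler granting relative forward progress to both threads; on a GPU, with the two threads as separate workgroups, a scheduler that never launches the setter (e.g. because the waiter occupies all compute resources) would make the test spin forever. All names here are illustrative.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int flag = 0;

/* "workgroup 0": cannot terminate until the setter makes progress */
static void *waiter(void *arg) {
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;  /* busy-wait */
    return NULL;
}

/* "workgroup 1": unblocks the waiter */
static void *setter(void *arg) {
    atomic_store_explicit(&flag, 1, memory_order_release);
    return NULL;
}

int main(void) {
    pthread_t a, b;
    pthread_create(&a, NULL, waiter, NULL);
    pthread_create(&b, NULL, setter, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    puts("terminated: both threads were granted forward progress");
    return 0;
}
```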


2015 ◽  
Vol 50 (6) ◽  
pp. 77-87 ◽  
Author(s):  
Ilya Sergey ◽  
Aleksandar Nanevski ◽  
Anindya Banerjee


2021 ◽  
Vol 5 (OOPSLA) ◽  
pp. 1-30
Author(s):  
Son Tuan Vu ◽  
Albert Cohen ◽  
Arnaud De Grandmaison ◽  
Christophe Guillon ◽  
Karine Heydemann

Software protections against side-channel and physical attacks are essential to the development of secure applications. Such protections are meaningful at the machine-code or micro-architectural level, but they typically do not carry observable semantics at the source level. This renders them susceptible to miscompilation, so security engineers embed input/output side-effects to prevent optimizing compilers from altering them. Yet these side-effects are error-prone and compiler-dependent, and the current practice involves analyzing the generated machine code to make sure security or privacy properties are still enforced. These side-effects may also be too expensive in fine-grained protections such as control-flow integrity. We introduce observations of the program state that are intrinsic to the correct execution of security protections, along with means to specify and preserve observations across the compilation flow. Such observations complement the input/output semantics-preservation contract of compilers. We introduce an opacification mechanism to preserve and enforce a partial ordering of observations. This approach is compatible with a production compiler and does not incur any modification to its optimization passes. We validate the effectiveness and performance of our approach on a range of benchmarks, expressing the secure compilation of these applications in terms of observations to be made at specific program points.
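As a rough illustration of what such an observation buys, here is a common GCC/Clang inline-assembly idiom, not the paper's actual mechanism, that marks a value as observed at a program point so the optimizer cannot delete the protection that establishes it. The OBSERVE macro and the erasure scenario are illustrative assumptions.

```c
#include <string.h>

/* Tell the optimizer the value is consumed here: an empty asm that
   "reads" x and clobbers memory, pinning prior stores as observable. */
#define OBSERVE(x) __asm__ volatile("" : : "g"(x) : "memory")

void handle_secret(void) {
    char key[32];
    /* ... derive and use the key ... */
    memset(key, 0, sizeof key);  /* a dead store: without an observation,
                                    the compiler may legally remove it   */
    OBSERVE(key);                /* the zeroing must now actually happen */
}
```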



2021 ◽  
Vol 43 (4) ◽  
pp. 1-134
Author(s):  
Emanuele D’Osualdo ◽  
Julian Sutherland ◽  
Azadeh Farzan ◽  
Philippa Gardner

We present TaDA Live, a concurrent separation logic for reasoning compositionally about the termination of blocking fine-grained concurrent programs. The crucial challenge is how to deal with abstract atomic blocking: that is, abstract atomic operations that have blocking behaviour arising from busy-waiting patterns, as found in, for example, fine-grained spin locks. Our fundamental innovation is the design of abstract specifications that capture this blocking behaviour as liveness assumptions on the environment. We design a logic that can reason about the termination of clients that use such operations without breaking their abstraction boundaries, and about the correctness of the implementations of the operations with respect to their abstract specifications. We introduce a novel semantic model using layered subjective obligations to express liveness invariants, and a proof system that is sound with respect to the model. The subtlety of our specifications and reasoning is illustrated using several case studies.
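For readers unfamiliar with the idiom, the following is a minimal C11 spin lock of the kind TaDA Live targets (a sketch, not code from the paper): acquire is abstractly atomic, yet its termination is conditional on a liveness assumption about the environment, namely that whichever thread holds the lock eventually releases it.

```c
#include <stdatomic.h>
#include <stdbool.h>

typedef struct { atomic_bool held; } spinlock_t;

/* Blocking busy-wait: terminates only if the environment eventually
   releases the lock, which is exactly the liveness assumption that
   TaDA Live specifications make explicit. */
static void acquire(spinlock_t *l) {
    while (atomic_exchange_explicit(&l->held, true, memory_order_acquire))
        ;  /* spin until the exchange returns false (lock was free) */
}

static void release(spinlock_t *l) {
    atomic_store_explicit(&l->held, false, memory_order_release);
}
```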



2016 ◽  
Vol 72 (3) ◽  
pp. 324-337 ◽  
Author(s):  
A. Janner

This article considers the coarse-grained modeling of icosahedral viruses in terms of a three-dimensional lattice (the digital modeling lattice) selected among the projected points in space of a six-dimensional icosahedral lattice. Backbone atomic positions (the Cα atoms of the capsid residues and the phosphorus atoms P of the genome nucleotides) are then indexed by their nearest lattice point. This leads to a fine-grained lattice-point characterization of the full viral chains in the backbone approximation (denoted digital modeling). Coarse-grained models then follow by a proper selection of the indexed backbone positions, where for each chain one can choose the desired coarseness. This approach is applied to three viruses, the Satellite tobacco mosaic virus, the bacteriophage MS2 and the Pariacoto virus, on the basis of structural data from the Brookhaven Protein Data Bank. In each case the various stages of the procedure are illustrated for a given coarse-grained model and the corresponding indexed positions are listed. Alternative coarse-grained models have been derived and compared. Comments on related results and approaches, found among the very large set of publications in this field, conclude the article.
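The core indexing step is simple to state: each backbone atom gets the integer coordinates of its nearest lattice point. The toy C sketch below uses a simple cubic lattice of spacing a in place of the paper's projected six-dimensional icosahedral lattice, purely to make the rounding step visible; the spacing and the sample coordinates are invented for illustration.

```c
#include <math.h>
#include <stdio.h>

typedef struct { int i, j, k; } lattice_index;

/* Index a position (x, y, z) by its nearest point on a cubic lattice
   of spacing a. The paper's digital modeling lattice is icosahedral
   and obtained by projection from six dimensions; a cubic lattice
   stands in here only to show the nearest-point assignment. */
static lattice_index nearest_lattice_point(double x, double y, double z,
                                           double a) {
    lattice_index idx = { (int)lround(x / a),
                          (int)lround(y / a),
                          (int)lround(z / a) };
    return idx;
}

int main(void) {
    /* e.g. a C-alpha position in angstroms, with an arbitrary spacing */
    lattice_index idx = nearest_lattice_point(12.7, -3.1, 8.4, 2.5);
    printf("indexed as (%d, %d, %d)\n", idx.i, idx.j, idx.k);
    return 0;
}
```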



1999 ◽  
Vol 09 (02) ◽  
pp. 243-252 ◽  
Author(s):  
O. LARSSON ◽  
M. FEIG ◽  
L. JOHNSSON

We demonstrate good metacomputing efficiency and portability for three typical large-scale parallel applications: one molecular dynamics code and two electromagnetics codes. The codes were developed for distributed-memory parallel platforms using Fortran77 or Fortran90 with MPI. The performance measurements were made on a testbed of two IBM SPs connected through the vBNS. No changes to the application codes were required for their correct execution on the testbed using the Globus Toolkit for the required metacomputing services. However, we observe that for good performance, it may be necessary for MPI codes to make use of overlapped computation and communication. For such MPI codes, a communications library designed for hierarchical or clustered communication can yield very good metacomputing efficiencies when high-performance networks, such as the vBNS or Abilene, are used for platform connectivity. We demonstrate this by inserting a thin layer between the MPI application and the MPI libraries, providing some clustering of communications between platforms.
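The overlap pattern the authors describe can be sketched in C with MPI's non-blocking primitives (the application codes themselves are Fortran; this sketch, with invented buffer names and sizes, only shows the structure): post the sends and receives, do local work that does not depend on the incoming data while the wide-area transfer is in flight, then wait before touching the received data.

```c
#include <mpi.h>

void exchange_and_compute(double *send_buf, double *recv_buf, int n,
                          int peer, MPI_Comm comm) {
    MPI_Request reqs[2];
    MPI_Irecv(recv_buf, n, MPI_DOUBLE, peer, 0, comm, &reqs[0]);
    MPI_Isend(send_buf, n, MPI_DOUBLE, peer, 0, comm, &reqs[1]);

    /* ... interior computation that needs neither buffer goes here,
       hiding the wide-area latency behind useful local work ... */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

    /* ... boundary computation that consumes recv_buf ... */
}

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    enum { N = 1024 };
    double send_buf[N], recv_buf[N];
    for (int i = 0; i < N; i++) send_buf[i] = rank;
    if (size == 2)  /* run with: mpirun -np 2 ./a.out */
        exchange_and_compute(send_buf, recv_buf, N, 1 - rank,
                             MPI_COMM_WORLD);
    MPI_Finalize();
    return 0;
}
```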



2021 ◽  
Vol 0 (0) ◽  
Author(s):  
Victoria Nyst ◽  
Marta Morgado ◽  
Timothy Mac Hadjah ◽  
Marco Nyarko ◽  
Mariana Martins ◽  
...  

This article looks at cross-linguistic variation in lexical iconicity, asking to what extent and how this variation is patterned. Iconicity is far more frequent in the lexicons of sign languages than in those of spoken languages. It is also highly complex, in that multiple motivated components often jointly shape an iconic lexeme. Recent typological research on spoken languages finds tentative iconic patterning in a large number of basic lexical items, underlining once again the significance of iconicity for human language. The uncontested and widespread use of iconicity in the lexicons of sign languages enables us to take typological research into lexical iconicity to the next level. Indeed, previous studies have shown cross-linguistic variation in: a) the use of embodying and handling handshapes in sign languages (mostly of European origin) and b) the frequency of space-based size depiction in African and European sign languages. The two types of variation may be interrelated, as handling handshapes may use space-based size depiction. In this study, we first replicate earlier studies on the distribution of embodying and handling handshapes, this time in a data set consisting of a relatively large set of sign languages (n = 11), most of which are used in Africa. The results confirm significant variation across these sign languages. These findings are then compared to the use of space-based size depiction, revealing that space-based size depiction patterns independently of the distribution of embodying/handling handshapes. We argue that the results call for expanding typological studies of representational strategies in iconic signs beyond the now relatively well-studied instrument/manipulation alternation. Fine-grained analyses of a multitude of iconic features in signs are likely to reveal cross-linguistic variation in the iconic tendencies of sign language lexicons.



2021 ◽  
Vol 5 (OOPSLA) ◽  
pp. 1-28
Author(s):  
Dan Iorga ◽  
Alastair F. Donaldson ◽  
Tyler Sorensen ◽  
John Wickerson

Heterogeneous CPU/FPGA devices, in which a CPU and an FPGA can execute together while sharing memory, are becoming popular in several computing sectors. In this paper, we study the shared-memory semantics of these devices, with a view to providing a firm foundation for reasoning about the programs that run on them. Our focus is on Intel platforms that combine an Intel FPGA with a multicore Xeon CPU. We describe the weak-memory behaviours that are allowed (and observable) on these devices when CPU threads and an FPGA thread access common memory locations in a fine-grained manner through multiple channels. Some of these behaviours are familiar from well-studied CPU and GPU concurrency; others are weaker still. We encode these behaviours in two formal memory models: one operational, one axiomatic. We develop executable implementations of both models, using the CBMC bounded model-checking tool for our operational model and the Alloy modelling language for our axiomatic model. Using these, we cross-check our models against each other via a translator that converts Alloy-generated executions into queries for the CBMC model. We also validate our models against actual hardware by translating 583 Alloy-generated executions into litmus tests that we run on CPU/FPGA devices; when doing this, we avoid the prohibitive cost of synthesising a hardware design per litmus test by creating our own 'litmus-test processor' in hardware. We expect that our models will be useful for low-level programmers, compiler writers, and designers of analysis tools. Indeed, as a demonstration of the utility of our work, we use our operational model to reason about a producer/consumer buffer implemented across the CPU and the FPGA. When the buffer uses insufficient synchronisation -- a situation that our model is able to detect -- we observe that its performance improves at the cost of occasional data corruption.
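The flavour of these litmus tests can be conveyed with the classic message-passing pattern, written below as a C11-atomics CPU analogue rather than one of the 583 generated CPU/FPGA tests. With the release/acquire pair shown, a consumer that sees flag == 1 must also see data == 42; weaken the orderings to relaxed and a weak-memory device may expose stale data, which is exactly the kind of occasional corruption the under-synchronised producer/consumer buffer exhibits.

```c
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int data = 0;          /* the payload (plain, non-atomic) */
static atomic_int flag = 0;   /* the "message sent" signal       */

static void *producer(void *arg) {
    data = 42;                                              /* write payload */
    atomic_store_explicit(&flag, 1, memory_order_release);  /* then publish  */
    return NULL;
}

static void *consumer(void *arg) {
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;  /* wait for the release store to become visible */
    printf("observed data = %d\n", data);  /* acquire/release forbid 0 here */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&c, NULL, consumer, NULL);
    pthread_create(&p, NULL, producer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}
```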


