Evaluating optimizations that reduce global memory accesses of stencil computations in GPGPUs

Author(s):  
Thiago Carrijo Nasciutti ◽  
Jairo Panetta ◽  
Pedro Pais Lopes
Author(s):  
Parosh Aziz Abdulla ◽  
Mohamed Faouzi Atig ◽  
Adwait Godbole ◽  
S. Krishna ◽  
Viktor Vafeiadis

AbstractWe consider the reachability problem for finite-state multi-threaded programs under thepromising semantics() of Lee et al., which captures most common program transformations. Since reachability is already known to be undecidable in the fragment of with only release-acquire accesses (-), we consider the fragment with only relaxed accesses and promises (). We show that reachability under is undecidable in general and that it becomes decidable, albeit non-primitive recursive, if we bound the number of promises.Given these results, we consider a bounded version of the reachability problem. To this end, we bound both the number of promises and of “view-switches”, i.e., the number of times the processes may switch their local views of the global memory. We provide a code-to-code translation from an input program under (with relaxed and release-acquire memory accesses along with promises) to a program under SC, thereby reducing the bounded reachability problem under to the bounded context-switching problem under SC. We have implemented a tool and tested it on a set of benchmarks, demonstrating that typical bugs in programs can be found with a small bound.


2018 ◽  
Vol 3 (7) ◽  
pp. 12
Author(s):  
DaeHwan Kim

Nowadays, GPU processors are widely used for general-purpose parallel computation applications. In the GPU programming, thread and block configuration is one of the most important decisions to be made, which increases parallelism and hides instruction latency. However, in many cases, it is often difficult to have sufficient parallelism to hide all the latencies, where the high latencies are often caused by the global memory accesses. In order to reduce the number of  those accesses, the shared memory is instead used which is  much faster than the global memory being located on a chip. The performance of the proposed thread configuration is evaluated on the GPU 960 processor. The experimental result shows that the best configuration improves the performance by 7.3 times compared to the worst configuration in the experiment. The experiences are also discussed for the shared memory performance when compared to that of the global memory.


Author(s):  
Antonio Fuentes-Alventosa ◽  
Juan Gómez-Luna ◽  
José Maria González-Linares ◽  
Nicolás Guil ◽  
R. Medina-Carnicer

AbstractCAVLC (Context-Adaptive Variable Length Coding) is a high-performance entropy method for video and image compression. It is the most commonly used entropy method in the video standard H.264. In recent years, several hardware accelerators for CAVLC have been designed. In contrast, high-performance software implementations of CAVLC (e.g., GPU-based) are scarce. A high-performance GPU-based implementation of CAVLC is desirable in several scenarios. On the one hand, it can be exploited as the entropy component in GPU-based H.264 encoders, which are a very suitable solution when GPU built-in H.264 hardware encoders lack certain necessary functionality, such as data encryption and information hiding. On the other hand, a GPU-based implementation of CAVLC can be reused in a wide variety of GPU-based compression systems for encoding images and videos in formats other than H.264, such as medical images. This is not possible with hardware implementations of CAVLC, as they are non-separable components of hardware H.264 encoders. In this paper, we present CAVLCU, an efficient implementation of CAVLC on GPU, which is based on four key ideas. First, we use only one kernel to avoid the long latency global memory accesses required to transmit intermediate results among different kernels, and the costly launches and terminations of additional kernels. Second, we apply an efficient synchronization mechanism for thread-blocks (In this paper, to prevent confusion, a block of pixels of a frame will be referred to as simply block and a GPU thread block as thread-block.) that process adjacent frame regions (in horizontal and vertical dimensions) to share results in global memory space. Third, we exploit fully the available global memory bandwidth by using vectorized loads to move directly the quantized transform coefficients to registers. Fourth, we use register tiling to implement the zigzag sorting, thus obtaining high instruction-level parallelism. An exhaustive experimental evaluation showed that our approach is between 2.5$$\times$$ × and 5.4$$\times$$ × faster than the only state-of-the-art GPU-based implementation of CAVLC.


2021 ◽  
Vol 47 (2) ◽  
pp. 1-28
Author(s):  
Goran Flegar ◽  
Hartwig Anzt ◽  
Terry Cojean ◽  
Enrique S. Quintana-Ortí

The use of mixed precision in numerical algorithms is a promising strategy for accelerating scientific applications. In particular, the adoption of specialized hardware and data formats for low-precision arithmetic in high-end GPUs (graphics processing units) has motivated numerous efforts aiming at carefully reducing the working precision in order to speed up the computations. For algorithms whose performance is bound by the memory bandwidth, the idea of compressing its data before (and after) memory accesses has received considerable attention. One idea is to store an approximate operator–like a preconditioner–in lower than working precision hopefully without impacting the algorithm output. We realize the first high-performance implementation of an adaptive precision block-Jacobi preconditioner which selects the precision format used to store the preconditioner data on-the-fly, taking into account the numerical properties of the individual preconditioner blocks. We implement the adaptive block-Jacobi preconditioner as production-ready functionality in the Ginkgo linear algebra library, considering not only the precision formats that are part of the IEEE standard, but also customized formats which optimize the length of the exponent and significand to the characteristics of the preconditioner blocks. Experiments run on a state-of-the-art GPU accelerator show that our implementation offers attractive runtime savings.


2021 ◽  
Vol 31 ◽  
Author(s):  
THOMAS VAN STRYDONCK ◽  
FRANK PIESSENS ◽  
DOMINIQUE DEVRIESE

Abstract Separation logic is a powerful program logic for the static modular verification of imperative programs. However, dynamic checking of separation logic contracts on the boundaries between verified and untrusted modules is hard because it requires one to enforce (among other things) that outcalls from a verified to an untrusted module do not access memory resources currently owned by the verified module. This paper proposes an approach to dynamic contract checking by relying on support for capabilities, a well-studied form of unforgeable memory pointers that enables fine-grained, efficient memory access control. More specifically, we rely on a form of capabilities called linear capabilities for which the hardware enforces that they cannot be copied. We formalize our approach as a fully abstract compiler from a statically verified source language to an unverified target language with support for linear capabilities. The key insight behind our compiler is that memory resources described by spatial separation logic predicates can be represented at run time by linear capabilities. The compiler is separation-logic-proof-directed: it uses the separation logic proof of the source program to determine how memory accesses in the source program should be compiled to linear capability accesses in the target program. The full abstraction property of the compiler essentially guarantees that compiled verified modules can interact with untrusted target language modules as if they were compiled from verified code as well. This article is an extended version of one that was presented at ICFP 2019 (Van Strydonck et al., 2019).


2021 ◽  
Vol 25 (4) ◽  
pp. 1031-1045
Author(s):  
Helang Lai ◽  
Keke Wu ◽  
Lingli Li

Emotion recognition in conversations is crucial as there is an urgent need to improve the overall experience of human-computer interactions. A promising improvement in this field is to develop a model that can effectively extract adequate contexts of a test utterance. We introduce a novel model, termed hierarchical memory networks (HMN), to address the issues of recognizing utterance level emotions. HMN divides the contexts into different aspects and employs different step lengths to represent the weights of these aspects. To model the self dependencies, HMN takes independent local memory networks to model these aspects. Further, to capture the interpersonal dependencies, HMN employs global memory networks to integrate the local outputs into global storages. Such storages can generate contextual summaries and help to find the emotional dependent utterance that is most relevant to the test utterance. With an attention-based multi-hops scheme, these storages are then merged with the test utterance using an addition operation in the iterations. Experiments on the IEMOCAP dataset show our model outperforms the compared methods with accuracy improvement.


2021 ◽  
Vol 0 (0) ◽  
Author(s):  
Hamish Patel ◽  
Reza Zamani

Abstract Long-term memories are thought to be stored in neurones and synapses that undergo physical changes, such as long-term potentiation (LTP), and these changes can be maintained for long periods of time. A candidate enzyme for the maintenance of LTP is protein kinase M zeta (PKMζ), a constitutively active protein kinase C isoform that is elevated during LTP and long-term memory maintenance. This paper reviews the evidence and controversies surrounding the role of PKMζ in the maintenance of long-term memory. PKMζ maintains synaptic potentiation by preventing AMPA receptor endocytosis and promoting stabilisation of dendritic spine growth. Inhibition of PKMζ, with zeta-inhibitory peptide (ZIP), can reverse LTP and impair established long-term memories. However, a deficit of memory retrieval cannot be ruled out. Furthermore, ZIP, and in high enough doses the control peptide scrambled ZIP, was recently shown to be neurotoxic, which may explain some of the effects of ZIP on memory impairment. PKMζ knockout mice show normal learning and memory. However, this is likely due to compensation by protein-kinase C iota/lambda (PKCι/λ), which is normally responsible for induction of LTP. It is not clear how, or if, this compensatory mechanism is activated under normal conditions. Future research should utilise inducible PKMζ knockdown in adult rodents to investigate whether PKMζ maintains memory in specific parts of the brain, or if it represents a global memory maintenance molecule. These insights may inform future therapeutic targets for disorders of memory loss.


2013 ◽  
Vol 48 (8) ◽  
pp. 57-68 ◽  
Author(s):  
Bo Wu ◽  
Zhijia Zhao ◽  
Eddy Zheng Zhang ◽  
Yunlian Jiang ◽  
Xipeng Shen

Sign in / Sign up

Export Citation Format

Share Document