CART: Cache Access Reordering Tree for Efficient Cache and Memory Accesses in GPUs

The use of mixed precision in numerical algorithms is a promising strategy for accelerating scientific applications. In particular, the adoption of specialized hardware and data formats for low-precision arithmetic in high-end GPUs (graphics processing units) has motivated numerous efforts aiming at carefully reducing the working precision in order to speed up the computations. For algorithms whose performance is bound by the memory bandwidth, the idea of compressing their data before (and after) memory accesses has received considerable attention. One idea is to store an approximate operator, such as a preconditioner, in lower than working precision, ideally without impacting the algorithm output. We realize the first high-performance implementation of an adaptive precision block-Jacobi preconditioner which selects the precision format used to store the preconditioner data on the fly, taking into account the numerical properties of the individual preconditioner blocks. We implement the adaptive block-Jacobi preconditioner as production-ready functionality in the Ginkgo linear algebra library, considering not only the precision formats that are part of the IEEE standard, but also customized formats which optimize the length of the exponent and significand to the characteristics of the preconditioner blocks. Experiments run on a state-of-the-art GPU accelerator show that our implementation offers attractive runtime savings.

Linear capabilities for fully abstract compilation of separation-logic-verified code

Journal of Functional Programming ◽

10.1017/s0956796821000022 ◽

2021 ◽

Vol 31 ◽

Author(s):

THOMAS VAN STRYDONCK ◽

FRANK PIESSENS ◽

DOMINIQUE DEVRIESE

Keyword(s):

Spatial Separation ◽

Separation Logic ◽

Target Language ◽

Fine Grained ◽

Dynamic Contract ◽

Modular Verification ◽

Source Program ◽

Memory Accesses ◽

Memory Resources ◽

Efficient Memory

Separation logic is a powerful program logic for the static modular verification of imperative programs. However, dynamic checking of separation logic contracts on the boundaries between verified and untrusted modules is hard because it requires one to enforce (among other things) that outcalls from a verified to an untrusted module do not access memory resources currently owned by the verified module. This paper proposes an approach to dynamic contract checking by relying on support for capabilities, a well-studied form of unforgeable memory pointers that enables fine-grained, efficient memory access control. More specifically, we rely on a form of capabilities called linear capabilities for which the hardware enforces that they cannot be copied. We formalize our approach as a fully abstract compiler from a statically verified source language to an unverified target language with support for linear capabilities. The key insight behind our compiler is that memory resources described by spatial separation logic predicates can be represented at run time by linear capabilities. The compiler is separation-logic-proof-directed: it uses the separation logic proof of the source program to determine how memory accesses in the source program should be compiled to linear capability accesses in the target program. The full abstraction property of the compiler essentially guarantees that compiled verified modules can interact with untrusted target language modules as if they were compiled from verified code as well. This article is an extended version of one that was presented at ICFP 2019 (Van Strydonck et al., 2019).

Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on GPU

ACM SIGPLAN Notices ◽

10.1145/2517327.2442523 ◽

2013 ◽

Vol 48 (8) ◽

pp. 57-68 ◽

Cited By ~ 11

Author(s):

Bo Wu ◽

Zhijia Zhao ◽

Eddy Zheng Zhang ◽

Yunlian Jiang ◽

Xipeng Shen

Keyword(s):

Complexity Analysis ◽

Algorithm Design ◽

Memory Accesses ◽

Coalesced Memory

On the Impact of Partial Sums on Interconnect Bandwidth and Memory Accesses in a DNN Accelerator

2020 IEEE 15th International Conference on Industrial and Information Systems (ICIIS) ◽

10.1109/iciis51140.2020.9342717 ◽

2020 ◽

Author(s):

Mahesh Chandra

Keyword(s):

Partial Sums ◽

Memory Accesses ◽

The Impact

Automated bug detection for pointers and memory accesses in High-Level Synthesis compilers

2016 26th International Conference on Field Programmable Logic and Applications (FPL) ◽

10.1109/fpl.2016.7577369 ◽

2016 ◽

Cited By ~ 1

Author(s):

Pietro Fezzardi ◽

Fabrizio Ferrandi

Keyword(s):

High Level Synthesis ◽

Bug Detection ◽

Memory Accesses ◽

High Level

Compiler analysis of irregular memory accesses

ACM SIGPLAN Notices ◽

10.1145/358438.349322 ◽

2000 ◽

Vol 35 (5) ◽

pp. 157-168 ◽

Cited By ~ 5

Author(s):

Yuan Lin ◽

David Padua

Keyword(s):

Compiler Analysis ◽

Memory Accesses

Online Thread and Data Mapping Using a Sharing-Aware Memory Management Unit

ACM Transactions on Modeling and Performance Evaluation of Computing Systems ◽

10.1145/3433687 ◽

2021 ◽

Vol 5 (4) ◽

pp. 1-28

Author(s):

Eduardo H. M. Cruz ◽

Matthias Diener ◽

Laércio L. Pilla ◽

Philippe O. A. Navaux

Keyword(s):

Energy Efficiency ◽

Memory Management ◽

Substantial Reduction ◽

Management Unit ◽

Memory Access ◽

Parallel Applications ◽

Data Mapping ◽

Wide Range ◽

Memory Accesses ◽

Level Parallelism

Current and future architectures rely on thread-level parallelism to sustain performance growth. These architectures have introduced a complex memory hierarchy, consisting of several cores organized hierarchically with multiple cache levels and NUMA nodes. These memory hierarchies can have an impact on the performance and energy efficiency of parallel applications as the importance of memory access locality is increased. In order to improve locality, the analysis of the memory access behavior of parallel applications is critical for mapping threads and data. Nevertheless, most previous work relies on indirect information about the memory accesses, or does not combine thread and data mapping, resulting in less accurate mappings. In this paper, we propose the Sharing-Aware Memory Management Unit (SAMMU), an extension to the memory management unit that allows it to detect the memory access behavior in hardware. With this information, the operating system can perform online mapping without any previous knowledge about the behavior of the application. In the evaluation with a wide range of parallel applications (NAS Parallel Benchmarks and PARSEC Benchmark Suite), performance was improved by up to 35.7% (10.0% on average) and energy efficiency was improved by up to 11.9% (4.1% on average). These improvements happened due to a substantial reduction of cache misses and interconnection traffic.

Mechanical Verification of Transactional Memories with Non-transactional Memory Accesses

Computer Aided Verification - Lecture Notes in Computer Science ◽

10.1007/978-3-540-70545-1_13 ◽

2008 ◽

pp. 121-134 ◽

Cited By ~ 15

Author(s):

Ariel Cohen ◽

Amir Pnueli ◽

Lenore D. Zuck

Keyword(s):

Transactional Memory ◽

Mechanical Verification ◽

Memory Accesses ◽

Transactional Memories

Improving Memory Accesses for Heterogeneous Parallel Multi-objective Feature Selection on EEG Classification

Euro-Par 2016: Parallel Processing Workshops - Lecture Notes in Computer Science ◽

10.1007/978-3-319-58943-5_30 ◽

2017 ◽

pp. 372-383 ◽

Cited By ~ 4

Author(s):

Juan José Escobar ◽

Julio Ortega ◽

Jesús González ◽

Miguel Damas

Keyword(s):

Feature Selection ◽

EEG Classification ◽

Multi Objective ◽

Memory Accesses

CART: Cache Access Reordering Tree for Efficient Cache and Memory Accesses in GPUs

Pointing in the Right Direction - Securing Memory Accesses in a Faulty World

Adaptive Precision Block-Jacobi for High Performance Preconditioning in the Ginkgo Linear Algebra Software

Linear capabilities for fully abstract compilation of separation-logic-verified code

Complexity analysis and algorithm design for reorganizing data to minimize non-coalesced memory accesses on GPU

On the Impact of Partial Sums on Interconnect Bandwidth and Memory Accesses in a DNN Accelerator

Automated bug detection for pointers and memory accesses in High-Level Synthesis compilers

Compiler analysis of irregular memory accesses

Online Thread and Data Mapping Using a Sharing-Aware Memory Management Unit

Mechanical Verification of Transactional Memories with Non-transactional Memory Accesses

Improving Memory Accesses for Heterogeneous Parallel Multi-objective Feature Selection on EEG Classification
