ACM Transactions on Architecture and Code Optimization

Joint Program and Layout Transformations to Enable Convolutional Operators on Specialized Hardware Based on Constraint Programming

ACM Transactions on Architecture and Code Optimization ◽

10.1145/3487922 ◽

2022 ◽

Vol 19 (1) ◽

pp. 1-26

Author(s):

Dennis Rieber ◽

Axel Acosta ◽

Holger Fröning

Keyword(s):

Constraint Satisfaction Problem ◽

Search Space ◽

Program Transformations ◽

Data Layout ◽

New Approach ◽

Deployment Strategy ◽

Joint Program ◽

Reference Implementation ◽

And Performance ◽

Specialized Hardware

The success of Deep Artificial Neural Networks (DNNs) in many domains created a rich body of research concerned with hardware accelerators for compute-intensive DNN operators. However, implementing such operators efficiently with complex hardware intrinsics such as matrix multiply is a task not yet automated gracefully. Solving this task often requires joint program and data layout transformations. First solutions to this problem have been proposed, such as TVM, UNIT, or ISAMIR, which work on a loop-level representation of operators and specify data layout and possible program transformations before the embedding into the operator is performed. This top-down approach creates a tension between exploration range and search space complexity, especially when also exploring data layout transformations such as im2col, channel packing, or padding. In this work, we propose a new approach to this problem. We created a bottom-up method that allows the joint transformation of both computation and data layout based on the found embedding. By formulating the embedding as a constraint satisfaction problem over the scalar dataflow, every possible embedding solution is contained in the search space. Adding additional constraints and optimization targets to the solver generates the subset of preferable solutions. An evaluation using the VTA hardware accelerator with the Baidu DeepBench inference benchmark shows that our approach can automatically generate code competitive with reference implementations. Further, we show that dynamically determining the data layout based on intrinsic and workload is beneficial for hardware utilization and performance. In cases where the reference implementation has low hardware utilization due to its fixed deployment strategy, we achieve a geomean speedup of up to 2.813×, while individual operators can improve by as much as 170×.
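The im2col layout transformation named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the stride-1 and no-padding assumptions, and the example data are all hypothetical. It only shows how unfolding image patches into rows turns a convolution into the kind of dot-product (matrix multiply) pattern that a hardware intrinsic could consume.

```python
def im2col(image, kh, kw):
    """Unfold every kh x kw patch of a 2D image into one row (stride 1, no padding)."""
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            rows.append([image[i + di][j + dj] for di in range(kh) for dj in range(kw)])
    return rows


def conv2d_direct(image, kernel):
    """Reference convolution (cross-correlation form) written as a plain loop nest."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    return [
        [
            sum(image[i + di][j + dj] * kernel[di][dj] for di in range(kh) for dj in range(kw))
            for j in range(w - kw + 1)
        ]
        for i in range(h - kh + 1)
    ]


def conv2d_im2col(image, kernel):
    """Same result, computed as (patch matrix) x (flattened kernel vector)."""
    kh, kw = len(kernel), len(kernel[0])
    out_w = len(image[0]) - kw + 1
    flat_kernel = [v for row in kernel for v in row]
    patches = im2col(image, kh, kw)
    flat_out = [sum(a * b for a, b in zip(patch, flat_kernel)) for patch in patches]
    # Fold the flat result back into the 2D output shape.
    return [flat_out[r:r + out_w] for r in range(0, len(flat_out), out_w)]


# Made-up example data for illustration only.
image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
kernel = [[1, 0], [0, -1]]
direct = conv2d_direct(image, kernel)
unfolded = conv2d_im2col(image, kernel)
```

Both paths produce the same output; the im2col path trades extra memory (duplicated patch elements) for a regular matrix-style access pattern.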


Iterative Compilation Optimization Based on Metric Learning and Collaborative Filtering

ACM Transactions on Architecture and Code Optimization ◽

10.1145/3480250 ◽

2022 ◽

Vol 19 (1) ◽

pp. 1-25

Author(s):

Hongzhi Liu ◽

Jie Luo ◽

Ying Li ◽

Zhonghai Wu

Keyword(s):

Collaborative Filtering ◽

Training Program ◽

Expert Knowledge ◽

Metric Learning ◽

Principal Component ◽

Search Space ◽

Model Learning ◽

Iterative Compilation ◽

Standard Level ◽

Phase Ordering

Pass selection and phase ordering are two critical compiler auto-tuning problems. Traditional heuristic methods cannot effectively address these NP-hard problems, especially given the increasing number of compiler passes and diverse hardware architectures. Recent research efforts have attempted to address these problems through machine learning. However, the large search space of candidate pass sequences, the large numbers of redundant and irrelevant features, and the lack of training program instances make it difficult to learn models well. Several methods have tried to use expert knowledge to simplify the problems, such as using only the compiler passes or subsequences in the standard levels (e.g., -O1, -O2, and -O3) provided by compiler designers. However, these methods ignore other useful compiler passes that are not contained in the standard levels. Principal component analysis (PCA) and exploratory factor analysis (EFA) have been utilized to reduce the redundancy of feature data. However, these unsupervised methods retain all the information irrelevant to the performance of compilation optimization, which may mislead the subsequent model learning. To solve these problems, we propose a compiler pass selection and phase ordering approach, called Iterative Compilation based on Metric learning and Collaborative filtering (ICMC). First, we propose a data-driven method to construct pass subsequences according to the observed collaborative interactions and dependency among passes on a given program set. Therefore, we can make use of all available compiler passes and prune the search space. Then, a supervised metric learning method is utilized to retain useful feature information for compilation optimization while removing both the irrelevant and the redundant information. Based on the learned similarity metric, a neighborhood-based collaborative filtering method is employed to iteratively recommend a few superior compiler passes for each target program. Last, an iterative data enhancement method is designed to alleviate the problem of lacking training program instances and to enhance the performance of iterative pass recommendations. The experimental results using the LLVM compiler on all 32 cBench programs show the following: (1) ICMC significantly outperforms several state-of-the-art compiler phase ordering methods, (2) it performs as well as or better than the standard level -O3 on all the test programs, and (3) it can reach an average performance speedup of 1.20 (up to 1.46) compared with the standard level -O3.
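The neighborhood-based collaborative filtering step can be sketched roughly as follows. This is a minimal illustration, not ICMC itself: the per-feature `weights` stand in for a learned similarity metric, the program feature vectors and speedup numbers are invented, and `recommend` is a hypothetical helper name. The pass names are used only as labels for candidate subsequences.

```python
import math


def weighted_distance(a, b, weights):
    """Distance under an assumed diagonal metric (one weight per feature)."""
    return math.sqrt(sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b)))


def recommend(target_features, training, weights, k=2, top_n=2):
    """Rank pass subsequences by similarity-weighted speedup over the k nearest programs.

    training: list of (feature_vector, {pass_subsequence: observed_speedup}).
    """
    neighbors = sorted(
        training, key=lambda t: weighted_distance(target_features, t[0], weights)
    )[:k]
    weighted_sum, weight_total = {}, {}
    for features, speedups in neighbors:
        similarity = 1.0 / (1.0 + weighted_distance(target_features, features, weights))
        for name, speedup in speedups.items():
            weighted_sum[name] = weighted_sum.get(name, 0.0) + similarity * speedup
            weight_total[name] = weight_total.get(name, 0.0) + similarity
    ranked = sorted(
        weighted_sum, key=lambda n: weighted_sum[n] / weight_total[n], reverse=True
    )
    return ranked[:top_n]


# Invented training data: two program features and speedups for two subsequences.
training = [
    ([1.0, 0.0], {"licm,gvn": 1.3, "inline,sroa": 1.0}),
    ([0.9, 0.1], {"licm,gvn": 1.2, "inline,sroa": 1.1}),
    ([0.0, 1.0], {"licm,gvn": 0.9, "inline,sroa": 1.4}),
]
recommended = recommend([1.0, 0.05], training, weights=[1.0, 1.0], k=2, top_n=2)
```

In an iterative setting, the recommended subsequences would be compiled and measured on the target program, and the observed speedups fed back as new training data before the next round.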


ReuseTracker: Fast Yet Accurate Multicore Reuse Distance Analyzer

ACM Transactions on Architecture and Code Optimization ◽

10.1145/3484199 ◽

2022 ◽

Vol 19 (1) ◽

pp. 1-25

Author(s):

Muhammad Aditya Sasongko ◽

Milind Chabbi ◽

Mandana Bagheri Marzijarani ◽

Didem Unat

Keyword(s):

Performance Monitoring ◽

State Of The Art ◽

Data Locality ◽

Parallel Applications ◽

Use Case ◽

Memory Location ◽

Reuse Distance ◽

Shared Caches ◽

Code Refactoring ◽

Cache Line

One widely used metric that measures data locality is reuse distance: the number of unique memory locations that are accessed between two consecutive accesses to a particular memory location. State-of-the-art techniques that measure reuse distance in parallel applications rely on simulators or binary instrumentation tools that incur large performance and memory overheads. Moreover, the existing sampling-based tools are limited to measuring reuse distances of a single thread and discard interactions among threads in multi-threaded programs. In this work, we propose ReuseTracker, a fast and accurate reuse distance analyzer that leverages existing hardware features in commodity CPUs. ReuseTracker is designed for multi-threaded programs and takes cache-coherence effects into account. By utilizing hardware features like performance monitoring units and debug registers, ReuseTracker can accurately profile reuse distance in parallel applications with much lower overheads than existing tools. It introduces only 2.9× runtime and 2.8× memory overheads. Our tool achieves 92% accuracy when verified against a newly developed configurable benchmark that can generate a variety of different reuse distance patterns. We demonstrate the tool’s functionality with two use-case scenarios using the PARSEC, Rodinia, and Synchrobench benchmark suites: one where ReuseTracker guides code refactoring in these benchmarks by detecting spatial reuses in shared caches that are also false sharing, and one where it successfully predicts whether some benchmarks in these suites can benefit from adjacent cache line prefetch optimization.
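The definition of reuse distance given in the abstract can be made concrete with a small single-threaded sketch. This is a naive, exact computation over a recorded access trace, written only to illustrate the metric; it is not how ReuseTracker works (the tool samples using hardware counters and debug registers), and the function name and trace are hypothetical.

```python
def reuse_distances(trace):
    """For each access, count the unique locations touched since the previous
    access to the same location; None marks a first-time access."""
    last_seen = {}
    distances = []
    for i, addr in enumerate(trace):
        if addr in last_seen:
            between = trace[last_seen[addr] + 1:i]
            distances.append(len(set(between)))
        else:
            distances.append(None)
        last_seen[addr] = i
    return distances


# Made-up trace of memory locations.
trace = ["a", "b", "c", "a", "b", "b"]
distances = reuse_distances(trace)
```

For this trace, the second access to `a` has distance 2 (locations `b` and `c` intervene), while the final back-to-back access to `b` has distance 0. The slice-and-set approach is quadratic in the worst case, which is acceptable for a definition sketch but not for real traces.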


Locality-Aware CTA Scheduling for Gaming Applications

ACM Transactions on Architecture and Code Optimization ◽

10.1145/3477497 ◽

2022 ◽

Vol 19 (1) ◽

pp. 1-26

Author(s):

Aditya Ukarande ◽

Suryakant Patidar ◽

Ram Rangan

Keyword(s):

Capacity Increase ◽

Two Generations ◽

Working Set ◽

Bandwidth Savings ◽

Nvidia Gpu ◽

Design Simplicity ◽

Operational Aspects ◽

Bandwidth Demand ◽

Cache Capacity

The compute work rasterizer or the GigaThread Engine of a modern NVIDIA GPU focuses on maximizing compute work occupancy across all streaming multiprocessors in a GPU while retaining design simplicity. In this article, we identify the operational aspects of the GigaThread Engine that help it meet those goals but also lead to less-than-ideal cache locality for texture accesses in 2D compute shaders, which are an important optimization target for gaming applications. We develop three software techniques, namely LargeCTAs, Swizzle, and Agents, to show that it is possible to effectively exploit the texture data working set overlap intrinsic to 2D compute shaders. We evaluate these techniques on gaming applications across two generations of NVIDIA GPUs, RTX 2080 and RTX 3080, and find that they are effective on both GPUs. We find that the bandwidth savings from all our software techniques on RTX 2080 are much higher than the bandwidth savings on baseline execution from the inter-generational cache capacity increase going from RTX 2080 to RTX 3080. Our best-performing technique, Agents, records up to a 4.7% average full-frame speedup by reducing bandwidth demand of targeted shaders at the L1-L2 and L2-DRAM interfaces by 23% and 32%, respectively, on the latest generation RTX 3080. These results acutely highlight the sensitivity of cache locality to compute work rasterization order and the importance of locality-aware cooperative thread array scheduling for gaming applications.


GPU Domain Specialization via Composable On-Package Architecture

ACM Transactions on Architecture and Code Optimization ◽

10.1145/3484505 ◽

2022 ◽

Vol 19 (1) ◽

pp. 1-23

Author(s):

Yaosheng Fu ◽

Evgeny Bolotin ◽

Niladrish Chatterjee ◽

David Nellans ◽

Stephen W. Keckler

Keyword(s):

Deep Learning ◽

Memory System ◽

Design Reuse ◽

Application Domain ◽

Precision Matrix ◽

Practical Solution ◽

Optimal Configurations ◽

Gpu Architecture ◽

With Memory ◽

Cache Capacity

As GPUs scale their low-precision matrix math throughput to boost deep learning (DL) performance, they upset the balance between math throughput and memory system capabilities. We demonstrate that a converged GPU design trying to address diverging architectural requirements between FP32 (or larger)-based HPC and FP16 (or smaller)-based DL workloads results in sub-optimal configurations for either of the application domains. We argue that a Composable On-PAckage GPU (COPA-GPU) architecture to provide domain-specialized GPU products is the most practical solution to these diverging requirements. A COPA-GPU leverages multi-chip-module disaggregation to support maximal design reuse, along with memory system specialization per application domain. We show how a COPA-GPU enables DL-specialized products by modular augmentation of the baseline GPU architecture with up to 4× higher off-die bandwidth, 32× larger on-package cache, and 2.3× higher DRAM bandwidth and capacity, while conveniently supporting scaled-down HPC-oriented designs. This work explores the microarchitectural design necessary to enable composable GPUs and evaluates the benefits composability can provide to HPC, DL training, and DL inference. We show that when compared to a converged GPU design, a DL-optimized COPA-GPU featuring a combination of 16× larger cache capacity and 1.6× higher DRAM bandwidth scales per-GPU training and inference performance by 31% and 35%, respectively, and reduces the number of GPU instances by 50% in scale-out training scenarios.


SecNVM: An Efficient and Write-Friendly Metadata Crash Consistency Scheme for Secure NVM

ACM Transactions on Architecture and Code Optimization ◽

10.1145/3488724 ◽

2022 ◽

Vol 19 (1) ◽

pp. 1-26

Author(s):

Mengya Lei ◽

Fan Li ◽

Fang Wang ◽

Dan Feng ◽

Xiaomin Zou ◽

...

Keyword(s):

Data Security ◽

Recovery Time ◽

State Of The Art ◽

The State ◽

Fast Recovery ◽

Non Volatile Memory ◽

User Data ◽

Metadata Cache ◽

Volatile Memory ◽

The Cost

Data security is an indispensable part of non-volatile memory (NVM) systems. However, implementing data security efficiently on NVM is challenging, since we have to guarantee the consistency of user data and the related security metadata. Existing consistency schemes ignore the recoverability of the SGX-style integrity tree (SIT) and the access correlation between metadata blocks, thereby generating unnecessary NVM write traffic. In this article, we propose SecNVM, an efficient and write-friendly metadata crash consistency scheme for secure NVM. SecNVM utilizes the observation that for a lazily updated SIT, the tree nodes lost after a crash can be recovered from the corresponding child nodes in NVM. It reduces the SIT persistency overhead through a restrained write-back metadata cache and exploits the SIT inter-layer dependency for recovery. Next, leveraging the strong access correlation between the counter and DMAC, SecNVM improves the efficiency of security metadata access through a novel collaborative counter-DMAC scheme. In addition, it adopts a lightweight address tracker to reduce the cost of address tracking for fast recovery. Experiments show that, compared to the state-of-the-art schemes, SecNVM substantially improves performance and decreases write traffic, and achieves an acceptable recovery time.


TLB-pilot: Mitigating TLB Contention Attack on GPUs with Microarchitecture-Aware Scheduling

ACM Transactions on Architecture and Code Optimization ◽

10.1145/3491218 ◽

2022 ◽

Vol 19 (1) ◽

pp. 1-23

Author(s):

Bang Di ◽

Daokun Hu ◽

Zhen Xie ◽

Jianhua Sun ◽

Hao Chen ◽

...

Keyword(s):

State Of The Art ◽

Denial Of Service ◽

System Throughput ◽

Resource Requirement ◽

Application Security ◽

Translation Lookaside Buffer ◽

Load Imbalance ◽

High System ◽

The Common ◽

Software And Hardware

Co-running GPU kernels on a single GPU can provide high system throughput and improve hardware utilization, but this raises concerns about application security. We reveal that a translation lookaside buffer (TLB) attack, one of the common attacks on CPUs, can happen on GPUs when multiple GPU kernels co-run. We investigate conditions or principles under which a TLB attack can take effect, including the awareness of GPU TLB microarchitecture, being lightweight, and bypassing existing software and hardware mechanisms. This TLB-based attack can be leveraged to conduct Denial-of-Service (or Degradation-of-Service) attacks. Furthermore, we propose a solution to mitigate TLB attacks. In particular, based on the microarchitecture properties of GPUs, we introduce a software-based system, TLB-pilot, that binds thread blocks of different kernels to different groups of streaming multiprocessors by considering hardware isolation of last-level TLBs and the application’s resource requirement. TLB-pilot employs lightweight online profiling to collect kernel information before kernel launches. By coordinating software- and hardware-based scheduling and employing a kernel splitting scheme to reduce load imbalance, TLB-pilot effectively mitigates TLB attacks. The results show that when under TLB attack, TLB-pilot mitigates the attack and provides on average 56.2% and 60.6% improvement in average normalized turnaround times and overall system throughput, respectively, compared to the traditional Multi-Process Service based co-running solution. When under TLB attack, TLB-pilot also provides up to 47.3% and 64.3% improvement (41% and 42.9% on average) in average normalized turnaround times and overall system throughput, respectively, compared to a state-of-the-art co-running solution for efficient scheduling of thread blocks.


Marvel: A Data-Centric Approach for Mapping Deep Learning Operators on Spatial Accelerators

ACM Transactions on Architecture and Code Optimization ◽

10.1145/3485137 ◽

2022 ◽

Vol 19 (1) ◽

pp. 1-26

Author(s):

Prasanth Chatarasi ◽

Hyoukjun Kwon ◽

Angshuman Parashar ◽

Michael Pellauer ◽

Tushar Krishna ◽

...

Keyword(s):

Deep Learning ◽

Cost Model ◽

Cost Models ◽

Mapping Space ◽

Loop Nest ◽

Loop Nests ◽

Higher Dimensional ◽

On Chip ◽

The Cost ◽

Dimensional Mapping

A spatial accelerator’s efficiency depends heavily on both its mapper and cost models to generate optimized mappings for various operators of DNN models. However, existing cost models lack a formal boundary over their input programs (operators) for accurate and tractable cost analysis of the mappings, and this results in adaptability challenges to the cost models for new operators. We consider the recently introduced Maestro Data-Centric (MDC) notation and its analytical cost model to address this challenge because any mapping expressed in the notation is precisely analyzable using the MDC’s cost model. In this article, we characterize the set of input operators and their mappings expressed in the MDC notation by introducing a set of conformability rules. The outcome of these rules is that any loop nest that is perfectly nested with affine tensor subscripts and without conditionals is conformable to the MDC notation. A majority of the primitive operators in deep learning are such loop nests. In addition, our rules enable us to automatically translate a mapping expressed in the loop nest form to MDC notation and use the MDC’s cost model to guide upstream mappers. Our conformability rules over the input operators result in a structured mapping space of the operators, which enables us to introduce a mapper based on our decoupled off-chip/on-chip approach to accelerate mapping space exploration. Our mapper decomposes the original higher-dimensional mapping space of operators into two lower-dimensional off-chip and on-chip subspaces and then optimizes the off-chip subspace followed by the on-chip subspace. We implemented our overall approach in a tool called Marvel, and a benefit of our approach is that it applies to any operator conformable with the MDC notation. We evaluated Marvel over major DNN operators and compared it with past optimizers.
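As an illustration of the loop-nest shape the abstract describes (perfectly nested, affine tensor subscripts, no conditionals), here is a 1D convolution written as a plain loop nest. The operator choice, function name, and data are assumptions made for this sketch; it does not show the MDC notation or Marvel's translation step.

```python
def conv1d(inputs, weights):
    """1D convolution (cross-correlation form) as a perfectly nested loop nest.

    All statements sit in the innermost loop, every subscript is an affine
    function of the loop indices (x, s, and x + s), and there are no conditionals.
    """
    out_len = len(inputs) - len(weights) + 1
    outputs = [0] * out_len
    for x in range(out_len):              # output position
        for s in range(len(weights)):     # filter tap
            outputs[x] += inputs[x + s] * weights[s]
    return outputs


# Made-up example data.
result = conv1d([1, 2, 3, 4], [1, 1])
```

By contrast, a loop whose body guards the update with a data-dependent `if`, or that indexes with a non-affine expression such as `inputs[x * s]`, would fall outside the class the abstract describes.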


SMT-Based Contention-Free Task Mapping and Scheduling on 2D/3D SMART NoC with Mixed Dimension-Order Routing

ACM Transactions on Architecture and Code Optimization ◽

10.1145/3487018 ◽

2022 ◽

Vol 19 (1) ◽

pp. 1-21

Author(s):

Daeyeal Lee ◽

Bill Lin ◽

Chung-Kuan Cheng

Keyword(s):

System Performance ◽

Search Space ◽

Satisfiability Modulo Theories ◽

Low Latency ◽

Task Mapping ◽

Single Cycle ◽

Space Reduction ◽

Reduction Techniques ◽

2D And 3D ◽

Mixed Dimension

SMART NoCs achieve ultra-low latency by enabling single-cycle multiple-hop transmission via bypass channels. However, contention along bypass channels can seriously degrade the performance of SMART NoCs by breaking the bypass paths. Therefore, contention-free task mapping and scheduling are essential for optimal system performance. In this article, we propose an SMT (Satisfiability Modulo Theories)-based framework to find optimal contention-free task mappings with minimum application schedule lengths on 2D/3D SMART NoCs with mixed dimension-order routing. On top of SMT’s fast reasoning capability for conditional constraints, we develop efficient search-space reduction techniques to achieve practical scalability. Experiments demonstrate that our SMT framework achieves 10× higher scalability than ILP (Integer Linear Programming), with 931.1× (ranging from 2.2× to 1532.1×) and 1237.1× (ranging from 4× to 4373.8×) faster average runtimes for finding optimum solutions on 2D and 3D SMART NoCs, respectively. Our 2D and 3D extensions of the SMT framework with mixed dimension-order routing also maintain the improved scalability with the extended and diversified routing paths, resulting in reduced application schedule lengths across various application benchmarks.


Scenario-Aware Program Specialization for Timing Predictability

ACM Transactions on Architecture and Code Optimization ◽

10.1145/3473333 ◽

2021 ◽

Vol 18 (4) ◽

pp. 1-26

Author(s):

Joscha Benz ◽

Oliver Bringmann

Keyword(s):

Program Analysis ◽

Control Flow ◽

Wcet Analysis ◽

Specific System ◽

Worst Case ◽

Program Specialization ◽

Operating Modes ◽

Source Level ◽

Dependent Flow ◽

Timing Simulation

The successful application of static program analysis strongly depends on flow facts of a program such as loop bounds, control-flow constraints, and operating modes. This problem heavily affects the design of real-time systems, since static program analyses are a prerequisite to determine the timing behavior of a program. For example, this becomes obvious in worst-case execution time (WCET) analysis, which is often infeasible without user-annotated flow facts. Moreover, many timing simulation approaches use statically derived timings of partial program paths to reduce simulation overhead. Annotating flow facts on binary or source level is either error-prone and tedious, or requires specialized compilers that can transform source-level annotations along with the program during optimization. To overcome these obstacles, so-called scenarios can be used. Scenarios are a design-time methodology that describes a set of possible system parameters, such as image resolutions, operating modes, or application-dependent flow facts. The information described by a scenario is unknown in general but known and constant for a specific system. In this article, we present a methodology for scenario-aware program specialization to improve timing predictability. Moreover, we provide an implementation of this methodology for embedded software written in C/C++. We show the effectiveness of our approach by evaluating its impact on WCET analysis using almost all of TACLeBench, achieving an average reduction of WCET of 31%. In addition, we provide a thorough qualitative and evaluation-based comparison to closely related work, as well as two case studies.


Published By Association For Computing Machinery
