on chip
Recently Published Documents


TOTAL DOCUMENTS

26885
(FIVE YEARS 8311)

H-INDEX

154
(FIVE YEARS 46)

2022 ◽  
Vol 15 (1) ◽  
pp. 1-26
Author(s):  
Mathieu Gross ◽  
Konrad Hohentanner ◽  
Stefan Wiehler ◽  
Georg Sigl

Isolated execution is a concept commonly used for increasing the security of a computer system. In the embedded world, ARM TrustZone technology enables this goal and is currently used on mobile devices for applications such as secure payment or biometric authentication. In this work, we investigate the security benefits achievable through the usage of ARM TrustZone on FPGA-SoCs. We first adapt Microsoft’s implementation of a firmware Trusted Platform Module (fTPM) running inside ARM TrustZone for the Zynq UltraScale+ platform. This adaptation consists in integrating hardware accelerators available on the device to fTPM’s implementation and to enhance fTPM with an entropy source derived from on-chip SRAM start-up patterns. With our approach, we transform a software implementation of a TPM into a hybrid hardware/software design that could address some of the security drawbacks of the original implementation while keeping its flexibility. To demonstrate the security gains obtained via the usage of ARM TrustZone and our hybrid-TPM on FPGA-SoCs, we propose a framework that combines them for enabling a secure remote bitstream loading. The approach consists in preventing the insecure usages of a bitstream reconfiguration interface that are made possible by the manufacturer and to integrate the interface inside a Trusted Execution Environment.


2022 ◽  
Vol 26 ◽  
pp. 101282
Author(s):  
León Romano Brandt ◽  
Kazunori Nishio ◽  
Enrico Salvati ◽  
Kevin P. Simon ◽  
Chrysanthi Papadaki ◽  
...  

2022 ◽  
Vol 18 (2) ◽  
pp. 1-20
Author(s):  
Yandong Luo ◽  
Panni Wang ◽  
Shimeng Yu

In this article, we propose a hardware accelerator design using ferroelectric transistor (FeFET)-based hybrid precision synapse (HPS) for deep neural network (DNN) on-chip training. The drain erase scheme for FeFET programming is incorporated for both FeFET HPS design and FeFET buffer design. By using drain erase, high-density FeFET buffers can be integrated onchip to store the intermediate input-output activations and gradients, which reduces the energy consuming off-chip DRAM access. Architectural evaluation results show that the energy efficiency could be improved by 1.2× ∼ 2.1×, 3.9× ∼ 6.0× compared to the other HPS-based designs and emerging non-volatile memory baselines, respectively. The chip area is reduced by 19% ∼ 36% compared with designs using SRAM on-chip buffer even though the capacity of FeFET buffer is increased. Besides, by utilizing drain erase scheme for FeFET programming, the chip area is reduced by 11% ∼ 28.5% compared with the designs using body erase scheme.


2022 ◽  
Vol 147 ◽  
pp. 107634
Author(s):  
Akash Kumar Pradhan ◽  
Mrinal Sen ◽  
Tanmoy Datta
Keyword(s):  

2022 ◽  
Vol 15 (2) ◽  
pp. 1-33
Author(s):  
Mikhail Asiatici ◽  
Paolo Ienne

Applications such as large-scale sparse linear algebra and graph analytics are challenging to accelerate on FPGAs due to the short irregular memory accesses, resulting in low cache hit rates. Nonblocking caches reduce the bandwidth required by misses by requesting each cache line only once, even when there are multiple misses corresponding to it. However, such reuse mechanism is traditionally implemented using an associative lookup. This limits the number of misses that are considered for reuse to a few tens, at most. In this article, we present an efficient pipeline that can process and store thousands of outstanding misses in cuckoo hash tables in on-chip SRAM with minimal stalls. This brings the same bandwidth advantage as a larger cache for a fraction of the area budget, because outstanding misses do not need a data array, which can significantly speed up irregular memory-bound latency-insensitive applications. In addition, we extend nonblocking caches to generate variable-length bursts to memory, which increases the bandwidth delivered by DRAMs and their controllers. The resulting miss-optimized memory system provides up to 25% speedup with 24× area reduction on 15 large sparse matrix-vector multiplication benchmarks evaluated on an embedded and a datacenter FPGA system.


2022 ◽  
Vol 19 (1) ◽  
pp. 1-26
Author(s):  
Prasanth Chatarasi ◽  
Hyoukjun Kwon ◽  
Angshuman Parashar ◽  
Michael Pellauer ◽  
Tushar Krishna ◽  
...  

A spatial accelerator’s efficiency depends heavily on both its mapper and cost models to generate optimized mappings for various operators of DNN models. However, existing cost models lack a formal boundary over their input programs (operators) for accurate and tractable cost analysis of the mappings, and this results in adaptability challenges to the cost models for new operators. We consider the recently introduced Maestro Data-Centric (MDC) notation and its analytical cost model to address this challenge because any mapping expressed in the notation is precisely analyzable using the MDC’s cost model. In this article, we characterize the set of input operators and their mappings expressed in the MDC notation by introducing a set of conformability rules . The outcome of these rules is that any loop nest that is perfectly nested with affine tensor subscripts and without conditionals is conformable to the MDC notation. A majority of the primitive operators in deep learning are such loop nests. In addition, our rules enable us to automatically translate a mapping expressed in the loop nest form to MDC notation and use the MDC’s cost model to guide upstream mappers. Our conformability rules over the input operators result in a structured mapping space of the operators, which enables us to introduce a mapper based on our decoupled off-chip/on-chip approach to accelerate mapping space exploration. Our mapper decomposes the original higher-dimensional mapping space of operators into two lower-dimensional off-chip and on-chip subspaces and then optimizes the off-chip subspace followed by the on-chip subspace. We implemented our overall approach in a tool called Marvel , and a benefit of our approach is that it applies to any operator conformable with the MDC notation. We evaluated Marvel over major DNN operators and compared it with past optimizers.


2022 ◽  
Vol 3 (1) ◽  
pp. 1-31
Author(s):  
Roman Trüb ◽  
Reto Da Forno ◽  
Lukas Daschinger ◽  
Andreas Biri ◽  
Jan Beutel ◽  
...  

Testbeds for wireless IoT devices facilitate testing and validation of distributed target nodes. A testbed usually provides methods to control, observe, and log the execution of the software. However, most of the methods used for tracing the execution require code instrumentation and change essential properties of the observed system. Methods that are non-intrusive are typically not applicable in a distributed fashion due to a lack of time synchronization or necessary hardware/software support. In this article, we present a tracing system for validating time-critical software running on multiple distributed wireless devices that does not require code instrumentation, is non-intrusive and is designed to trace the distributed state of an entire network. For this purpose, we make use of the on-chip debug and trace hardware that is part of most modern microcontrollers. We introduce a testbed architecture as well as models and methods that accurately synchronize the timestamps of observations collected by distributed observers. In a case study, we demonstrate how the tracing system can be applied to observe the distributed state of a flooding-based low-power communication protocol for wireless sensor networks. The presented non-intrusive tracing system is implemented as a service of the publicly accessible open source FlockLab 2 testbed.


2022 ◽  
Vol 18 (2) ◽  
pp. 1-22
Author(s):  
Gokul Krishnan ◽  
Sumit K. Mandal ◽  
Chaitali Chakrabarti ◽  
Jae-Sun Seo ◽  
Umit Y. Ogras ◽  
...  

With the widespread use of Deep Neural Networks (DNNs), machine learning algorithms have evolved in two diverse directions—one with ever-increasing connection density for better accuracy and the other with more compact sizing for energy efficiency. The increase in connection density increases on-chip data movement, which makes efficient on-chip communication a critical function of the DNN accelerator. The contribution of this work is threefold. First, we illustrate that the point-to-point (P2P)-based interconnect is incapable of handling a high volume of on-chip data movement for DNNs. Second, we evaluate P2P and network-on-chip (NoC) interconnect (with a regular topology such as a mesh) for SRAM- and ReRAM-based in-memory computing (IMC) architectures for a range of DNNs. This analysis shows the necessity for the optimal interconnect choice for an IMC DNN accelerator. Finally, we perform an experimental evaluation for different DNNs to empirically obtain the performance of the IMC architecture with both NoC-tree and NoC-mesh. We conclude that, at the tile level, NoC-tree is appropriate for compact DNNs employed at the edge, and NoC-mesh is necessary to accelerate DNNs with high connection density. Furthermore, we propose a technique to determine the optimal choice of interconnect for any given DNN. In this technique, we use analytical models of NoC to evaluate end-to-end communication latency of any given DNN. We demonstrate that the interconnect optimization in the IMC architecture results in up to 6 × improvement in energy-delay-area product for VGG-19 inference compared to the state-of-the-art ReRAM-based IMC architectures.


2022 ◽  
Vol 15 (2) ◽  
pp. 1-31
Author(s):  
Joel Mandebi Mbongue ◽  
Danielle Tchuinkou Kwadjo ◽  
Alex Shuping ◽  
Christophe Bobda

Cloud deployments now increasingly exploit Field-Programmable Gate Array (FPGA) accelerators as part of virtual instances. While cloud FPGAs are still essentially single-tenant, the growing demand for efficient hardware acceleration paves the way to FPGA multi-tenancy. It then becomes necessary to explore architectures, design flows, and resource management features that aim at exposing multi-tenant FPGAs to the cloud users. In this article, we discuss a hardware/software architecture that supports provisioning space-shared FPGAs in Kernel-based Virtual Machine (KVM) clouds. The proposed hardware/software architecture introduces an FPGA organization that improves hardware consolidation and support hardware elasticity with minimal data movement overhead. It also relies on VirtIO to decrease communication latency between hardware and software domains. Prototyping the proposed architecture with a Virtex UltraScale+ FPGA demonstrated near specification maximum frequency for on-chip data movement and high throughput in virtual instance access to hardware accelerators. We demonstrate similar performance compared to single-tenant deployment while increasing FPGA utilization, which is one of the goals of virtualization. Overall, our FPGA design achieved about 2× higher maximum frequency than the state of the art and a bandwidth reaching up to 28 Gbps on 32-bit data width.


Sign in / Sign up

Export Citation Format

Share Document