on chip Latest Research Papers

Developing a joint-on-chip platform: a multi-organ-on-chip model to mimic healthy and diseased conditions of the synovial joints

10.3990/1.9789036553261 ◽

2022 ◽

Author(s):

Carlo Alberto Paggi

Keyword(s):

Synovial Joints ◽

On Chip

Enhancing the Security of FPGA-SoCs via the Usage of ARM TrustZone and a Hybrid-TPM

ACM Transactions on Reconfigurable Technology and Systems ◽

10.1145/3472959 ◽

2022 ◽

Vol 15 (1) ◽

pp. 1-26

Author(s):

Mathieu Gross ◽

Konrad Hohentanner ◽

Stefan Wiehler ◽

Georg Sigl

Keyword(s):

Software Design ◽

Hardware Accelerators ◽

Biometric Authentication ◽

Trusted Platform Module ◽

Start Up ◽

Execution Environment ◽

On Chip ◽

Trusted Platform ◽

Trusted Execution Environment ◽

Entropy Source

Isolated execution is a concept commonly used for increasing the security of a computer system. In the embedded world, ARM TrustZone technology enables this goal and is currently used on mobile devices for applications such as secure payment or biometric authentication. In this work, we investigate the security benefits achievable through the usage of ARM TrustZone on FPGA-SoCs. We first adapt Microsoft’s implementation of a firmware Trusted Platform Module (fTPM) running inside ARM TrustZone for the Zynq UltraScale+ platform. This adaptation consists of integrating hardware accelerators available on the device into the fTPM implementation and enhancing the fTPM with an entropy source derived from on-chip SRAM start-up patterns. With our approach, we transform a software implementation of a TPM into a hybrid hardware/software design that could address some of the security drawbacks of the original implementation while keeping its flexibility. To demonstrate the security gains obtained via the usage of ARM TrustZone and our hybrid-TPM on FPGA-SoCs, we propose a framework that combines them to enable secure remote bitstream loading. The approach consists of preventing the insecure uses of the bitstream reconfiguration interface that the manufacturer makes possible and integrating the interface inside a Trusted Execution Environment.

Improving ultra-fast charging performance and durability of all solid state thin film Li-NMC battery-on-chip systems by in situ TEM lamella analysis

Applied Materials Today ◽

10.1016/j.apmt.2021.101282 ◽

2022 ◽

Vol 26 ◽

pp. 101282

Author(s):

León Romano Brandt ◽

Kazunori Nishio ◽

Enrico Salvati ◽

Kevin P. Simon ◽

Chrysanthi Papadaki ◽

...

Keyword(s):

Thin Film ◽

Solid State ◽

In Situ TEM ◽

Fast Charging ◽

On Chip ◽

TEM Lamella

Accelerating On-Chip Training with Ferroelectric-Based Hybrid Precision Synapse

ACM Journal on Emerging Technologies in Computing Systems ◽

10.1145/3473461 ◽

2022 ◽

Vol 18 (2) ◽

pp. 1-20

Author(s):

Yandong Luo ◽

Panni Wang ◽

Shimeng Yu

Keyword(s):

Deep Neural Network ◽

The Other ◽

Hardware Accelerator ◽

Chip Area ◽

Non Volatile Memory ◽

Energy Consuming ◽

Architectural Evaluation ◽

Buffer Design ◽

On Chip ◽

Accelerator Design

In this article, we propose a hardware accelerator design using a ferroelectric transistor (FeFET)-based hybrid precision synapse (HPS) for deep neural network (DNN) on-chip training. The drain erase scheme for FeFET programming is incorporated into both the FeFET HPS design and the FeFET buffer design. By using drain erase, high-density FeFET buffers can be integrated on-chip to store the intermediate input-output activations and gradients, which reduces energy-consuming off-chip DRAM access. Architectural evaluation results show that the energy efficiency could be improved by 1.2× to 2.1× and 3.9× to 6.0× compared to the other HPS-based designs and emerging non-volatile memory baselines, respectively. The chip area is reduced by 19% to 36% compared with designs using an SRAM on-chip buffer, even though the capacity of the FeFET buffer is increased. In addition, by using the drain erase scheme for FeFET programming, the chip area is reduced by 11% to 28.5% compared with designs using the body erase scheme.

LED pumped Raman laser: Towards the design of an on-chip all-silicon laser

Optics & Laser Technology ◽

10.1016/j.optlastec.2021.107634 ◽

2022 ◽

Vol 147 ◽

pp. 107634

Author(s):

Akash Kumar Pradhan ◽

Mrinal Sen ◽

Tanmoy Datta

Keyword(s):

Raman Laser ◽

Silicon Laser ◽

On Chip

Request, Coalesce, Serve, and Forget: Miss-Optimized Memory Systems for Bandwidth-Bound Cache-Unfriendly Applications on FPGAs

ACM Transactions on Reconfigurable Technology and Systems ◽

10.1145/3466823 ◽

2022 ◽

Vol 15 (2) ◽

pp. 1-33

Author(s):

Mikhail Asiatici ◽

Paolo Ienne

Keyword(s):

Large Scale ◽

Sparse Matrix ◽

Memory Systems ◽

Graph Analytics ◽

Matrix Vector Multiplication ◽

Area Reduction ◽

Cache Line ◽

Speed Up ◽

Memory Accesses ◽

On Chip

Applications such as large-scale sparse linear algebra and graph analytics are challenging to accelerate on FPGAs because their short, irregular memory accesses result in low cache hit rates. Nonblocking caches reduce the bandwidth required by misses by requesting each cache line only once, even when there are multiple misses corresponding to it. However, such a reuse mechanism is traditionally implemented using an associative lookup. This limits the number of misses that are considered for reuse to a few tens, at most. In this article, we present an efficient pipeline that can process and store thousands of outstanding misses in cuckoo hash tables in on-chip SRAM with minimal stalls. This brings the same bandwidth advantage as a larger cache for a fraction of the area budget, because outstanding misses do not need a data array, which can significantly speed up irregular memory-bound latency-insensitive applications. In addition, we extend nonblocking caches to generate variable-length bursts to memory, which increases the bandwidth delivered by DRAMs and their controllers. The resulting miss-optimized memory system provides up to 25% speedup with 24× area reduction on 15 large sparse matrix-vector multiplication benchmarks evaluated on an embedded and a datacenter FPGA system.
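
The miss-coalescing idea in this abstract (request each cache line once, however many misses map to it) can be illustrated with a small software model. This is only a sketch under stated assumptions: the class name `MissCoalescer`, the 64-byte line size, and the use of a Python dictionary in place of the paper's cuckoo hash tables in on-chip SRAM are all hypothetical choices made for illustration, not the authors' design.

```python
from collections import defaultdict

LINE_BYTES = 64  # assumed cache-line size; not taken from the paper


class MissCoalescer:
    """Toy model of outstanding-miss bookkeeping in a nonblocking cache.

    A plain dict stands in for the hardware lookup structure (assumption).
    """

    def __init__(self):
        # line address -> ids of the requests waiting for that line
        self.pending = defaultdict(list)
        # number of line fetches actually sent to memory
        self.memory_requests = 0

    def miss(self, req_id, byte_addr):
        """Record a miss; return True only if a new memory request is issued."""
        line = byte_addr // LINE_BYTES
        is_new = line not in self.pending
        self.pending[line].append(req_id)
        if is_new:
            self.memory_requests += 1
        return is_new

    def fill(self, line):
        """A line came back from memory: serve all waiters, then forget it."""
        return self.pending.pop(line, [])


# Six misses that touch only three distinct lines (0, 1 and 64).
mc = MissCoalescer()
issued = [mc.miss(i, addr) for i, addr in enumerate([0, 8, 64, 72, 4096, 16])]
served = mc.fill(0)  # requests 0, 1 and 5 all waited on line 0
```

In this model the six misses produce three memory requests, which is the bandwidth saving the abstract attributes to tracking many outstanding misses.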

Marvel: A Data-Centric Approach for Mapping Deep Learning Operators on Spatial Accelerators

ACM Transactions on Architecture and Code Optimization ◽

10.1145/3485137 ◽

2022 ◽

Vol 19 (1) ◽

pp. 1-26

Author(s):

Prasanth Chatarasi ◽

Hyoukjun Kwon ◽

Angshuman Parashar ◽

Michael Pellauer ◽

Tushar Krishna ◽

...

Keyword(s):

Deep Learning ◽

Cost Model ◽

Cost Models ◽

Mapping Space ◽

Loop Nest ◽

Loop Nests ◽

Higher Dimensional ◽

On Chip ◽

The Cost ◽

Dimensional Mapping

A spatial accelerator’s efficiency depends heavily on both its mapper and cost models to generate optimized mappings for various operators of DNN models. However, existing cost models lack a formal boundary over their input programs (operators) for accurate and tractable cost analysis of the mappings, and this results in adaptability challenges to the cost models for new operators. We consider the recently introduced Maestro Data-Centric (MDC) notation and its analytical cost model to address this challenge because any mapping expressed in the notation is precisely analyzable using the MDC’s cost model. In this article, we characterize the set of input operators and their mappings expressed in the MDC notation by introducing a set of conformability rules. The outcome of these rules is that any loop nest that is perfectly nested with affine tensor subscripts and without conditionals is conformable to the MDC notation. A majority of the primitive operators in deep learning are such loop nests. In addition, our rules enable us to automatically translate a mapping expressed in the loop nest form to MDC notation and use the MDC’s cost model to guide upstream mappers. Our conformability rules over the input operators result in a structured mapping space of the operators, which enables us to introduce a mapper based on our decoupled off-chip/on-chip approach to accelerate mapping space exploration. Our mapper decomposes the original higher-dimensional mapping space of operators into two lower-dimensional off-chip and on-chip subspaces and then optimizes the off-chip subspace followed by the on-chip subspace. We implemented our overall approach in a tool called Marvel, and a benefit of our approach is that it applies to any operator conformable with the MDC notation. We evaluated Marvel over major DNN operators and compared it with past optimizers.

Non-Intrusive Distributed Tracing of Wireless IoT Devices with the FlockLab 2 Testbed

ACM Transactions on Internet of Things ◽

10.1145/3480248 ◽

2022 ◽

Vol 3 (1) ◽

pp. 1-31

Author(s):

Roman Trüb ◽

Reto Da Forno ◽

Lukas Daschinger ◽

Andreas Biri ◽

Jan Beutel ◽

...

Keyword(s):

Time Synchronization ◽

Wireless Devices ◽

Software Support ◽

Essential Properties ◽

Distributed Target ◽

Code Instrumentation ◽

IoT Devices ◽

On Chip ◽

Time Critical

Testbeds for wireless IoT devices facilitate testing and validation of distributed target nodes. A testbed usually provides methods to control, observe, and log the execution of the software. However, most of the methods used for tracing the execution require code instrumentation and change essential properties of the observed system. Methods that are non-intrusive are typically not applicable in a distributed fashion due to a lack of time synchronization or necessary hardware/software support. In this article, we present a tracing system for validating time-critical software running on multiple distributed wireless devices that does not require code instrumentation, is non-intrusive and is designed to trace the distributed state of an entire network. For this purpose, we make use of the on-chip debug and trace hardware that is part of most modern microcontrollers. We introduce a testbed architecture as well as models and methods that accurately synchronize the timestamps of observations collected by distributed observers. In a case study, we demonstrate how the tracing system can be applied to observe the distributed state of a flooding-based low-power communication protocol for wireless sensor networks. The presented non-intrusive tracing system is implemented as a service of the publicly accessible open source FlockLab 2 testbed.

Impact of On-chip Interconnect on In-memory Acceleration of Deep Neural Networks

ACM Journal on Emerging Technologies in Computing Systems ◽

10.1145/3460233 ◽

2022 ◽

Vol 18 (2) ◽

pp. 1-22

Author(s):

Gokul Krishnan ◽

Sumit K. Mandal ◽

Chaitali Chakrabarti ◽

Jae-Sun Seo ◽

Umit Y. Ogras ◽

...

Keyword(s):

Neural Networks ◽

Deep Neural Networks ◽

Optimal Choice ◽

Machine Learning Algorithms ◽

Analytical Models ◽

Critical Function ◽

Data Movement ◽

Chip Data ◽

On Chip ◽

Connection Density

With the widespread use of Deep Neural Networks (DNNs), machine learning algorithms have evolved in two diverse directions: one with ever-increasing connection density for better accuracy and the other with more compact sizing for energy efficiency. The increase in connection density increases on-chip data movement, which makes efficient on-chip communication a critical function of the DNN accelerator. The contribution of this work is threefold. First, we illustrate that the point-to-point (P2P)-based interconnect is incapable of handling a high volume of on-chip data movement for DNNs. Second, we evaluate P2P and network-on-chip (NoC) interconnect (with a regular topology such as a mesh) for SRAM- and ReRAM-based in-memory computing (IMC) architectures for a range of DNNs. This analysis shows the necessity for the optimal interconnect choice for an IMC DNN accelerator. Finally, we perform an experimental evaluation for different DNNs to empirically obtain the performance of the IMC architecture with both NoC-tree and NoC-mesh. We conclude that, at the tile level, NoC-tree is appropriate for compact DNNs employed at the edge, and NoC-mesh is necessary to accelerate DNNs with high connection density. Furthermore, we propose a technique to determine the optimal choice of interconnect for any given DNN. In this technique, we use analytical models of NoC to evaluate end-to-end communication latency of any given DNN. We demonstrate that the interconnect optimization in the IMC architecture results in up to 6× improvement in energy-delay-area product for VGG-19 inference compared to the state-of-the-art ReRAM-based IMC architectures.
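
For readers unfamiliar with the energy-delay-area product (EDAP) quoted at the end of this abstract, the metric is the product of the three quantities, and an "N× improvement" is the ratio of the baseline's product to the proposed design's. The sketch below shows only that arithmetic. The function names and every number in it are made up for illustration and are not measurements from the paper.

```python
def edap(energy_j, delay_s, area_mm2):
    """Energy-delay-area product: lower is better."""
    return energy_j * delay_s * area_mm2


def edap_improvement(baseline, proposed):
    """Ratio of baseline EDAP to proposed EDAP (values above 1 favor `proposed`)."""
    return edap(*baseline) / edap(*proposed)


# Hypothetical (energy, delay, area) triples, NOT results from the paper.
baseline_design = (2.0, 3.0, 4.0)
proposed_design = (1.0, 2.0, 3.0)
ratio = edap_improvement(baseline_design, proposed_design)  # 24.0 / 6.0
```

Because the three factors multiply, modest gains in each can compound into a large EDAP ratio, which is why the metric is often used to compare accelerator designs as a whole.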

Deploying Multi-tenant FPGAs within Linux-based Cloud Infrastructure

ACM Transactions on Reconfigurable Technology and Systems ◽

10.1145/3474058 ◽

2022 ◽

Vol 15 (2) ◽

pp. 1-31

Author(s):

Joel Mandebi Mbongue ◽

Danielle Tchuinkou Kwadjo ◽

Alex Shuping ◽

Christophe Bobda

Keyword(s):

Software Architecture ◽

Hardware Acceleration ◽

Maximum Frequency ◽

Cloud Infrastructure ◽

FPGA Design ◽

Data Movement ◽

Field Programmable ◽

Minimal Data ◽

On Chip ◽

Cloud Users

Cloud deployments now increasingly exploit Field-Programmable Gate Array (FPGA) accelerators as part of virtual instances. While cloud FPGAs are still essentially single-tenant, the growing demand for efficient hardware acceleration paves the way to FPGA multi-tenancy. It then becomes necessary to explore architectures, design flows, and resource management features that aim at exposing multi-tenant FPGAs to cloud users. In this article, we discuss a hardware/software architecture that supports provisioning space-shared FPGAs in Kernel-based Virtual Machine (KVM) clouds. The proposed hardware/software architecture introduces an FPGA organization that improves hardware consolidation and supports hardware elasticity with minimal data movement overhead. It also relies on VirtIO to decrease communication latency between hardware and software domains. Prototyping the proposed architecture with a Virtex UltraScale+ FPGA demonstrated near-specification maximum frequency for on-chip data movement and high throughput in virtual instance access to hardware accelerators. We demonstrate similar performance compared to single-tenant deployment while increasing FPGA utilization, which is one of the goals of virtualization. Overall, our FPGA design achieved about 2× higher maximum frequency than the state of the art and a bandwidth reaching up to 28 Gbps on 32-bit data width.
