ACM Journal on Emerging Technologies in Computing Systems

Hardware-accelerated Simulation-based Inference of Stochastic Epidemiology Models for COVID-19

ACM Journal on Emerging Technologies in Computing Systems ◽

10.1145/3471188 ◽

2022 ◽

Vol 18 (2) ◽

pp. 1-24

Author(s):

Sourabh Kulkarni ◽

Mario Michael Krell ◽

Seth Nabarro ◽

Csaba Andras Moritz

Keyword(s):

Performance Analysis ◽

Large Scale ◽

Approximate Bayesian Computation ◽

Hardware Acceleration ◽

Processing Unit ◽

Accelerated Simulation ◽

Simulation Based ◽

The Difference ◽

Approximate Bayesian ◽

Extensive Performance

Epidemiology models are central to understanding and controlling large-scale pandemics. Several epidemiology models require simulation-based inference such as Approximate Bayesian Computation (ABC) to fit their parameters to observations. ABC inference is highly amenable to efficient hardware acceleration. In this work, we develop parallel ABC inference of a stochastic epidemiology model for COVID-19. The statistical inference framework is implemented and compared on Intel’s Xeon CPU, NVIDIA’s Tesla V100 GPU, Google’s V2 Tensor Processing Unit (TPU), and Graphcore’s Mk1 Intelligence Processing Unit (IPU), and the results are discussed in the context of their computational architectures. Results show that TPUs are 3×, GPUs are 4×, and IPUs are 30× faster than Xeon CPUs. Extensive performance analysis indicates that the difference between IPU and GPU can be attributed to higher communication bandwidth, closeness of memory to compute, and higher compute power in the IPU. The proposed framework scales across 16 IPUs, with scaling overhead not exceeding 8% for the experiments performed. We present an example of our framework in practice, performing inference on the epidemiology model across three countries and giving a brief overview of the results.
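As an illustration of the ABC idea described in this abstract, the following is a minimal sketch of ABC rejection sampling in Python. It is not the authors' framework: a simple discrete-time stochastic SIR model stands in for the COVID-19 model (which the abstract does not specify), and the function names, uniform priors, Euclidean distance, and population sizes are all assumptions made for the example. Each sampled parameter pair is simulated independently, which is the property that makes this kind of inference amenable to parallel hardware.

```python
import math
import random


def simulate_sir(beta, gamma, rng, population=200, infected0=5, days=20):
    """Hypothetical stand-in model: discrete-time stochastic SIR.

    Returns the number of infected individuals at the end of each day.
    """
    s, i = population - infected0, infected0
    infected = []
    for _ in range(days):
        # Each susceptible is infected with a probability that grows with i.
        p_infect = min(1.0, beta * i / population)
        new_inf = sum(rng.random() < p_infect for _ in range(s))
        # Each infected individual recovers with probability gamma.
        new_rec = sum(rng.random() < gamma for _ in range(i))
        s -= new_inf
        i += new_inf - new_rec
        infected.append(i)
    return infected


def distance(a, b):
    """Euclidean distance between two trajectories (an assumed choice)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def abc_rejection(observed, n_samples, tolerance, seed=0):
    """Keep prior samples whose simulated trajectory lies within tolerance."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_samples):
        beta = rng.uniform(0.0, 1.0)   # assumed uniform prior on beta
        gamma = rng.uniform(0.0, 0.5)  # assumed uniform prior on gamma
        simulated = simulate_sir(beta, gamma, rng, days=len(observed))
        if distance(simulated, observed) <= tolerance:
            accepted.append((beta, gamma))
    return accepted


# Synthetic "observations" generated from known parameters.
observed = simulate_sir(0.4, 0.1, random.Random(1))
posterior_samples = abc_rejection(observed, n_samples=200, tolerance=60.0)
print(len(posterior_samples), "of 200 prior samples accepted")
```

The accepted pairs approximate the posterior over (beta, gamma); a smaller tolerance tightens the approximation at the cost of more rejected simulations.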

Early Design Space Exploration Framework for Memristive Crossbar Arrays

ACM Journal on Emerging Technologies in Computing Systems ◽

10.1145/3461644 ◽

2022 ◽

Vol 18 (2) ◽

pp. 1-26

Author(s):

Md Adnan Zaman ◽

Rajeev Joshi ◽

Srinivas Katkoori

Keyword(s):

Energy Consumption ◽

Cycle Time ◽

Design Space Exploration ◽

Space Exploration ◽

Estimation Errors ◽

Design Alternatives ◽

And Performance ◽

The Individual ◽

High Level

No high-level design validation and early design space exploration tools for memristive crossbar arrays currently exist in the literature. Such tools are essential to quickly verify the design functionality as well as compare design alternatives in terms of power and performance. In this work, we propose a VHDL-based framework that enables us to quickly perform behavioral simulation as well as estimate dynamic energy consumption and speed of any large memristive crossbar array. We propose a high-level (VHDL) model of a memristor based on which crossbar architectures can be modeled. The individual memristor model is embedded with power and delay numbers obtained from a detailed memristor model. We demonstrate the framework for MAGIC-style memristive crossbars. We validate the framework against a detailed Verilog-A-based model on fifteen combinational benchmarks. For the single row model, we obtained a 153× simulation speedup over HSPICE and average estimation errors of 6.64% and 0% for dynamic energy consumption and cycle-time, respectively. For the transpose model, we obtained average estimation errors of 5.51% and 10.90% for dynamic energy consumption and cycle-time, respectively. We also extend our framework to support another prominent logic style and validate through a case study. The proposed framework can be easily extended to other emerging technologies.

Parallel Computing of Graph-based Functions in ReRAM

ACM Journal on Emerging Technologies in Computing Systems ◽

10.1145/3453163 ◽

2022 ◽

Vol 18 (2) ◽

pp. 1-24

Author(s):

Saman Froehlich ◽

Saeideh Shirinzadeh ◽

Rolf Drechsler

Keyword(s):

Parallel Computing ◽

Boolean Functions ◽

State Of The Art ◽

Random Access ◽

Resistive Random Access Memory ◽

Computer Architectures ◽

Non Volatile Memory ◽

Promising Solution ◽

High Scalability ◽

Memory Bottleneck

Resistive Random Access Memory (ReRAM) is an emerging non-volatile memory technology. Besides its low power consumption and its high scalability, its inherent computation capabilities make ReRAM especially interesting for future computer architectures. Merging computations into the memory is a promising solution for overcoming the memory bottleneck. To perform computations in ReRAM, efficient synthesis strategies for Boolean functions have to be developed. In this article, we give a thorough presentation of how to employ parallel computing capabilities of ReRAM for the synthesis of functions given state-of-the-art graph-based representations (AIGs or BDDs). Additionally, we introduce a new graph-based representation called m-And-Inverter Graphs (m-AIGs), which allows us to fully exploit the computing capabilities of ReRAM. In the simulations, we show that our proposed approaches outperform state-of-the-art synthesis strategies, and we show the superiority of m-AIGs over the standard AIG representation for ReRAM-based synthesis.
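For readers unfamiliar with the standard AIG representation mentioned in this abstract, here is a small Python sketch of how an And-Inverter Graph encodes a Boolean function: every node is a two-input AND, and every edge may carry an inversion. The data layout (`nodes` dictionary, `(source, inverted)` edge tuples) and the XOR example are assumptions chosen for illustration; the sketch does not model the m-AIG extension or any ReRAM synthesis step from the article.

```python
def eval_aig(nodes, output, inputs):
    """Evaluate an And-Inverter Graph.

    nodes:  dict mapping node name -> (edge_a, edge_b)
    output: the output edge
    inputs: dict mapping primary input name -> bool
    An edge is a (source_name, inverted) tuple.
    """
    values = dict(inputs)

    def value(edge):
        src, inverted = edge
        if src not in values:
            edge_a, edge_b = nodes[src]
            values[src] = value(edge_a) and value(edge_b)  # every node is an AND
        return values[src] != inverted  # apply the optional edge inversion

    return value(output)


# XOR(a, b) built only from AND nodes and inverted edges:
#   n1 = a AND NOT b,  n2 = NOT a AND b,  n3 = NOT n1 AND NOT n2,  out = NOT n3
XOR_NODES = {
    "n1": (("a", False), ("b", True)),
    "n2": (("a", True), ("b", False)),
    "n3": (("n1", True), ("n2", True)),
}
XOR_OUTPUT = ("n3", True)

for a in (False, True):
    for b in (False, True):
        print(a, b, eval_aig(XOR_NODES, XOR_OUTPUT, {"a": a, "b": b}))
```

Nodes `n1` and `n2` have no dependency on each other, so they could be evaluated at the same time; independent nodes of this kind are where parallel evaluation becomes possible.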

Accelerating On-Chip Training with Ferroelectric-Based Hybrid Precision Synapse

ACM Journal on Emerging Technologies in Computing Systems ◽

10.1145/3473461 ◽

2022 ◽

Vol 18 (2) ◽

pp. 1-20

Author(s):

Yandong Luo ◽

Panni Wang ◽

Shimeng Yu

Keyword(s):

Deep Neural Network ◽

The Other ◽

Hardware Accelerator ◽

Chip Area ◽

Non Volatile Memory ◽

Energy Consuming ◽

Architectural Evaluation ◽

Buffer Design ◽

On Chip ◽

Accelerator Design

In this article, we propose a hardware accelerator design using ferroelectric transistor (FeFET)-based hybrid precision synapse (HPS) for deep neural network (DNN) on-chip training. The drain erase scheme for FeFET programming is incorporated for both FeFET HPS design and FeFET buffer design. By using drain erase, high-density FeFET buffers can be integrated on-chip to store the intermediate input-output activations and gradients, which reduces the energy-consuming off-chip DRAM access. Architectural evaluation results show that the energy efficiency could be improved by 1.2× to 2.1× and 3.9× to 6.0× compared to the other HPS-based designs and emerging non-volatile memory baselines, respectively. The chip area is reduced by 19% to 36% compared with designs using SRAM on-chip buffer even though the capacity of FeFET buffer is increased. Besides, by utilizing the drain erase scheme for FeFET programming, the chip area is reduced by 11% to 28.5% compared with the designs using the body erase scheme.

COSMO: Computing with Stochastic Numbers in Memory

ACM Journal on Emerging Technologies in Computing Systems ◽

10.1145/3484731 ◽

2022 ◽

Vol 18 (2) ◽

pp. 1-25

Author(s):

Saransh Gupta ◽

Mohsen Imani ◽

Joonseop Sim ◽

Andrew Huang ◽

Fan Wu ◽

...

Keyword(s):

Neural Networks ◽

Image Processing ◽

Energy Efficient ◽

Deep Neural Networks ◽

Parallel Architecture ◽

Low Energy ◽

Stochastic Computing ◽

Wide Range ◽

Low Energy Consumption ◽

Sc Addition

Stochastic computing (SC) reduces the complexity of computation by representing numbers with long streams of independent bits. However, increasing performance in SC comes with either an increase in area or a loss in accuracy. Processing in memory (PIM) computes data in-place while having high memory density and supporting bit-parallel operations with low energy consumption. In this article, we propose COSMO, an architecture for computing with stochastic numbers in memory, which enables SC in memory. The proposed architecture is general and can be used for a wide range of applications. It is a highly dense and parallel architecture that supports most SC encodings and operations in memory. It maximizes the performance and energy efficiency of SC by introducing several innovations: (i) in-memory parallel stochastic number generation, (ii) efficient implication-based logic in memory, (iii) novel memory bit line segmenting, (iv) a new memory-compatible SC addition operation, and (v) enabling flexible block allocation. To show the generality and efficiency of our stochastic architecture, we implement image processing, deep neural networks (DNNs), and hyperdimensional (HD) computing on the proposed hardware. Our evaluations show that running DNN inference on COSMO is 141× faster and 80× more energy efficient as compared to GPU.
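The bit-stream representation described in this abstract can be illustrated with a short Python sketch. It assumes the common unipolar encoding, in which a value in [0, 1] is the probability that any bit in the stream is 1, so multiplication reduces to a bitwise AND and a multiplexer yields a scaled sum. The helper names are hypothetical, and the multiplexer-based addition shown here is the conventional SC adder, not the memory-compatible addition that COSMO introduces.

```python
import random


def to_stream(value, length, rng):
    """Encode value in [0, 1] as a bit stream with P(bit == 1) = value."""
    return [1 if rng.random() < value else 0 for _ in range(length)]


def from_stream(bits):
    """Decode a unipolar stream: the fraction of ones."""
    return sum(bits) / len(bits)


def sc_multiply(a_bits, b_bits):
    """Multiply two independent unipolar streams with a bitwise AND."""
    return [a & b for a, b in zip(a_bits, b_bits)]


def sc_scaled_add(a_bits, b_bits, rng):
    """Conventional multiplexer adder: the result encodes (a + b) / 2."""
    return [a if rng.random() < 0.5 else b for a, b in zip(a_bits, b_bits)]


rng = random.Random(42)
x = to_stream(0.5, 20000, rng)
y = to_stream(0.4, 20000, rng)
print("product    ~", from_stream(sc_multiply(x, y)))         # close to 0.20
print("scaled sum ~", from_stream(sc_scaled_add(x, y, rng)))  # close to 0.45
```

Longer streams reduce the estimation error but take more time or area to process, which is the performance, area, and accuracy trade-off the abstract refers to.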

STAP: An Architecture and Design Tool for Automata Processing on Memristor TCAMs

ACM Journal on Emerging Technologies in Computing Systems ◽

10.1145/3450769 ◽

2022 ◽

Vol 18 (2) ◽

pp. 1-22

Author(s):

João Paulo Cardoso de Lima ◽

Marcelo Brandalero ◽

Michael Hübner ◽

Luigi Carro

Keyword(s):

Finite Automata ◽

Design Tool ◽

Design Flow ◽

Von Neumann ◽

Mapping Algorithm ◽

Processing Elements ◽

Finite State ◽

Specific Mapping ◽

Architecture And Design ◽

Communication Demands

Accelerating finite-state automata benefits several emerging application domains that are built on pattern matching. In-memory architectures, such as the Automata Processor (AP), accelerate them efficiently, at least to the point of outperforming traditional von Neumann architectures. In spite of the AP’s massive parallelism, current APs suffer from poor memory density, inefficient routing architectures, and limited capabilities. Although these limitations can be lessened by emerging memory technologies, the AP architecture is still the major source of heavy communication demands and poor scalability. To address these issues, we present STAP, a Scalable TCAM-based architecture for Automata Processing. STAP adopts a reconfigurable array of processing elements, which are based on memristive Ternary CAMs (TCAMs), to efficiently implement non-deterministic finite automata (NFAs) through proper encoding and mapping methods. The CAD tool for STAP integrates the design flow of automata applications, a specific mapping algorithm, and place and route tools for connecting processing elements by RRAM-based programmable interconnects. Results showed 1.47× higher throughput when processing 16-bit input symbols, and improvements of 3.9× and 25× on state and routing densities over the state-of-the-art AP, while preserving 10⁴ programming cycles.
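To make the NFA workload in this abstract concrete, here is a minimal software sketch in Python of NFA execution by tracking the set of active states. For each input symbol, every active state is advanced at once, which is the step an automata processor performs in parallel in hardware. The transition-table layout and the example automaton are assumptions for illustration; the sketch does not model TCAM encoding, STAP's mapping algorithm, or its interconnect.

```python
def run_nfa(transitions, start_states, accept_states, symbols):
    """Simulate an NFA by keeping the set of currently active states.

    transitions maps (state, symbol) -> iterable of next states.
    Returns True if an accepting state is active after the last symbol.
    """
    active = set(start_states)
    for symbol in symbols:
        # All active states advance on the same input symbol.
        active = {
            nxt
            for state in active
            for nxt in transitions.get((state, symbol), ())
        }
    return bool(active & set(accept_states))


# Example NFA over {a, b} that accepts strings ending in "ab".
ENDS_WITH_AB = {
    (0, "a"): {0, 1},  # stay in 0, or guess that the final "ab" starts here
    (0, "b"): {0},
    (1, "b"): {2},
}

for text in ("ab", "aab", "aba", ""):
    print(repr(text), run_nfa(ENDS_WITH_AB, {0}, {2}, text))
```

The cost per symbol in software grows with the number of active states, whereas a hardware array evaluates all of them in one step, at the price of routing signals between states.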

Guest Editorial: ACM JETC Special Issue on Hardware-Aware Learning for Medical Applications

ACM Journal on Emerging Technologies in Computing Systems ◽

10.1145/3503262 ◽

2022 ◽

Vol 18 (2) ◽

pp. 1-3

Author(s):

Yiyu Shi ◽

Yongpan Liu ◽

Jianxu Chen ◽

Steve Jiang

Keyword(s):

Guest Editorial ◽

Medical Applications ◽

Special Issue

Impact of On-chip Interconnect on In-memory Acceleration of Deep Neural Networks

ACM Journal on Emerging Technologies in Computing Systems ◽

10.1145/3460233 ◽

2022 ◽

Vol 18 (2) ◽

pp. 1-22

Author(s):

Gokul Krishnan ◽

Sumit K. Mandal ◽

Chaitali Chakrabarti ◽

Jae-Sun Seo ◽

Umit Y. Ogras ◽

...

Keyword(s):

Neural Networks ◽

Deep Neural Networks ◽

Optimal Choice ◽

Machine Learning Algorithms ◽

Analytical Models ◽

Critical Function ◽

Data Movement ◽

Chip Data ◽

On Chip ◽

Connection Density

With the widespread use of Deep Neural Networks (DNNs), machine learning algorithms have evolved in two diverse directions: one with ever-increasing connection density for better accuracy and the other with more compact sizing for energy efficiency. The increase in connection density increases on-chip data movement, which makes efficient on-chip communication a critical function of the DNN accelerator. The contribution of this work is threefold. First, we illustrate that the point-to-point (P2P)-based interconnect is incapable of handling a high volume of on-chip data movement for DNNs. Second, we evaluate P2P and network-on-chip (NoC) interconnect (with a regular topology such as a mesh) for SRAM- and ReRAM-based in-memory computing (IMC) architectures for a range of DNNs. This analysis shows the necessity for the optimal interconnect choice for an IMC DNN accelerator. Finally, we perform an experimental evaluation for different DNNs to empirically obtain the performance of the IMC architecture with both NoC-tree and NoC-mesh. We conclude that, at the tile level, NoC-tree is appropriate for compact DNNs employed at the edge, and NoC-mesh is necessary to accelerate DNNs with high connection density. Furthermore, we propose a technique to determine the optimal choice of interconnect for any given DNN. In this technique, we use analytical models of NoC to evaluate end-to-end communication latency of any given DNN. We demonstrate that the interconnect optimization in the IMC architecture results in up to 6× improvement in energy-delay-area product for VGG-19 inference compared to the state-of-the-art ReRAM-based IMC architectures.

Image Complexity Guided Network Compression for Biomedical Image Segmentation

ACM Journal on Emerging Technologies in Computing Systems ◽

10.1145/3471190 ◽

2022 ◽

Vol 18 (2) ◽

pp. 1-23

Author(s):

Suraj Mishra ◽

Danny Z. Chen ◽

X. Sharon Hu

Keyword(s):

Image Segmentation ◽

Network Architecture ◽

Resource Constraints ◽

Standard Procedure ◽

Network Size ◽

Compression Technique ◽

Image Complexity ◽

Biomedical Image ◽

Segmentation Accuracy ◽

Network Compression

Compression is a standard procedure for making convolutional neural networks (CNNs) adhere to some specific computing resource constraints. However, searching for a compressed architecture typically involves a series of time-consuming training/validation experiments to determine a good compromise between network size and performance accuracy. To address this, we propose an image complexity-guided network compression technique for biomedical image segmentation. Given any resource constraints, our framework utilizes data complexity and network architecture to quickly estimate a compressed model which does not require network training. Specifically, we map the dataset complexity to the target network accuracy degradation caused by compression. Such mapping enables us to predict the final accuracy for different network sizes, based on the computed dataset complexity. Thus, one may choose a solution that meets both the network size and segmentation accuracy requirements. Finally, the mapping is used to determine the convolutional layer-wise multiplicative factor for generating a compressed network. We conduct experiments using 5 datasets, employing 3 commonly used CNN architectures for biomedical image segmentation as representative networks. Our proposed framework is shown to be effective for generating compressed segmentation networks, retaining up to ≈95% of the full-sized network segmentation accuracy while using, on average, ≈32× fewer trainable weights than the full-sized networks.

Guest Editorial: Computation-In-Memory (CIM): from Device to Applications

ACM Journal on Emerging Technologies in Computing Systems ◽

10.1145/3503263 ◽

2022 ◽

Vol 18 (2) ◽

pp. 1-3

Author(s):

Said Hamdioui ◽

Elena-Ioana Vatajelu ◽

Alberto Bosio

Keyword(s):

Guest Editorial

Published By Association For Computing Machinery
