systolic arrays
Recently Published Documents

TOTAL DOCUMENTS: 516 (FIVE YEARS: 26)
H-INDEX: 31 (FIVE YEARS: 2)

2021 ◽  
Vol 20 (5s) ◽  
pp. 1-20
Author(s):  
Hyungmin Cho

Depthwise convolutions are widely used in convolutional neural networks (CNNs) targeting mobile and embedded systems. Depthwise convolution layers reduce the computation load and the number of parameters compared to conventional convolution layers. Many deep neural network (DNN) accelerators adopt an architecture that exploits the high data-reuse factor of DNN computations, such as a systolic array. However, depthwise convolutions have a low data-reuse factor and under-utilize the processing elements (PEs) in systolic arrays. In this paper, we present a DNN accelerator design called RiSA, which provides a novel mechanism that boosts PE utilization for depthwise convolutions on a systolic array with minimal overhead. In addition, the PEs in systolic arrays can be used efficiently only if the data items (tensors) are arranged in the desired layout. Typical DNN accelerators provide various types of PE interconnects or additional modules to flexibly rearrange data items and manage data movement during DNN computations. RiSA provides a lightweight set of tensor management tasks within the PE array itself, eliminating the need for an additional tensor-reshaping module. Using this embedded tensor reshaping, RiSA supports various DNN models, including convolutional neural networks and natural language processing models, while maintaining high area efficiency. Compared to Eyeriss v2, RiSA improves the area and energy efficiency of MobileNet-V1 inference by 1.91× and 1.31×, respectively.
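For intuition about the utilization gap, here is a back-of-the-envelope Python sketch. The 128×128 weight-stationary array and the im2col-style mapping are our illustrative assumptions, not details of RiSA itself:

```python
def gemm_utilization(rows_used, cols_used, array_rows=128, array_cols=128):
    """Fraction of PEs holding useful weights for one mapped GEMM tile."""
    busy = min(rows_used, array_rows) * min(cols_used, array_cols)
    return busy / (array_rows * array_cols)

# Standard 3x3 convolution, 128 input / 128 output channels:
# im2col lowering yields a (128*3*3) x 128 weight matrix -> fills the array.
std = gemm_utilization(rows_used=128 * 3 * 3, cols_used=128)

# Depthwise 3x3 convolution: each channel is an independent 9 x 1 "matrix",
# so one channel occupies only 9 rows of a single column at a time.
dw = gemm_utilization(rows_used=3 * 3, cols_used=1)

print(f"standard conv PE utilization:  {std:.2%}")   # 100.00%
print(f"depthwise conv PE utilization: {dw:.4%}")    # 0.0549%
```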


2021 ◽  
Vol 18 (4) ◽  
pp. 1-24
Author(s):  
Rui Xu ◽  
Sheng Ma ◽  
Yaohua Wang ◽  
Xinhai Chen ◽  
Yang Guo

The systolic array architecture is one of the most popular choices for convolutional neural network hardware accelerators. Its biggest advantage is its simple and efficient design principle: without complicated control and dataflow, hardware accelerators based on a systolic array can compute traditional convolutions very efficiently. However, this advantage also brings new challenges. When computing special types of convolution, such as small-scale convolution or depthwise convolution, the processing element (PE) utilization rate of the array drops sharply, mainly because the simple architecture limits the flexibility of the systolic array. In this article, we design a configurable multi-directional systolic array (CMSA) to address these issues. First, we add a data path to the systolic array that allows users to split the array through configuration, speeding up small-scale convolutions. Second, we redesign the PE unit so that the array supports multiple data transmission modes and dataflow strategies, allowing users to switch the dataflow of the PE array to speed up depthwise convolution. In addition, unlike other works, we make only a few changes and modifications to the existing systolic array architecture. This avoids additional hardware overhead and eases deployment in application scenarios that require small systolic arrays, such as mobile terminals. Based on our evaluation, CMSA increases the PE utilization rate by up to 1.6 times compared to a typical systolic array when running the last layers of ResNet-18, and by up to 14.8 times when running depthwise convolutions in MobileNet. At the same time, CMSA is similar to traditional systolic arrays in area and energy consumption.
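To illustrate why splitting helps small-scale convolutions, consider this toy utilization model (a sketch of the general idea, not CMSA's actual configuration mechanism; the array sizes are our assumptions):

```python
def utilization(m, n, rows, cols):
    """PE utilization when an m x n weight tile is mapped onto a rows x cols array."""
    return (min(m, rows) * min(n, cols)) / (rows * cols)

m, n = 16, 16                       # a small-scale layer lowered to a 16x16 tile
mono = utilization(m, n, 32, 32)    # monolithic 32x32 array: only a corner is busy
split = utilization(m, n, 16, 16)   # one of four 16x16 sub-arrays: fully busy,
                                    # and the other three sub-arrays are free to
                                    # take three more tiles in parallel
print(mono, split)                  # 0.25 1.0
```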


2021 ◽  
Author(s):  
Neeraj Kumar Cheryala

2021 ◽  
Vol 15 ◽  
pp. 1-7
Author(s):  
Halil Snopce ◽  
Azir Aliu

This paper deals with the latency analysis of a two-dimensional systolic array for matrix multiplication. The latency for all possible connection schemes is discussed. In this way, a lower bound on the latency achievable with such arrays is obtained.
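For intuition, the textbook output-stationary connection scheme gives a concrete latency figure: with inputs skewed so that PE (i, j) receives its first operand pair at cycle i + j, an n×n array finishes an n×n product after 3n − 2 cycles. A small Python check of this standard scheme (one of the connection schemes such an analysis covers, not necessarily the paper's lower bound):

```python
def output_stationary_latency(n):
    # PE (i, j) sees its first operand pair at cycle i + j, because row i of A
    # and column j of B are skewed by i and j cycles respectively; it then
    # needs n multiply-accumulate steps to finish C[i][j].
    first_input = max(i + j for i in range(n) for j in range(n))  # 2n - 2
    return first_input + n                                        # 3n - 2

print([output_stationary_latency(n) for n in (2, 4, 8)])  # [4, 10, 22]
```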


Electronics ◽  
2021 ◽  
Vol 10 (6) ◽  
pp. 652
Author(s):  
Kashif Inayat ◽  
Jaeyong Chung

Systolic arrays are the primary component of modern deep learning accelerators and are widely used in real-life applications such as self-driving cars. This paper presents a novel factored systolic array, where the carry-propagate adder for accumulation and the rounding logic are extracted out of each processing element, which reduces the area, power, and delay of the processing elements substantially. The factoring is performed in a column-wise manner, and the cost of the factored logic, placed at each column output, is amortized over the processing elements in that column. We demonstrate the proposed factoring in an open-source systolic array, Gemmini. The factoring technique does not change the functionality of the base design and is transparent to applications. We show that the proposed technique leads to reductions in area and delay of up to 45.3% and 23.7%, respectively, compared to the Gemmini baseline.
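A minimal Python sketch of the underlying arithmetic idea, as we reconstruct it (not the paper's exact microarchitecture): each PE accumulates in redundant carry-save form with a 3:2 compressor, so no carry ever propagates inside the array, and a single carry-propagate addition at the foot of each column finalizes the result:

```python
MASK = (1 << 32) - 1  # model a 32-bit datapath

def csa(a, b, c):
    """3:2 compressor: reduce three addends to a sum word and a carry word."""
    s = (a ^ b ^ c) & MASK
    carry = (((a & b) | (a & c) | (b & c)) << 1) & MASK
    return s, carry

def column_accumulate(products):
    """Accumulate one column's products carry-save; one CPA at the column foot."""
    s, c = 0, 0
    for p in products:
        s, c = csa(s, c, p & MASK)  # each PE adds without propagating carries
    return (s + c) & MASK           # the single factored carry-propagate adder

assert column_accumulate([3, 5, 7, 11]) == 26
```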


2021 ◽  
Author(s):  
Surya Selvam ◽  
Vinod Ganesan ◽  
Pratyush Kumar

Electronics ◽  
2021 ◽  
Vol 10 (1) ◽  
pp. 70
Author(s):  
Rafael Gadea-Gironés ◽  
Vicente Herrero-Bosch ◽  
Jose Monzó-Ferrer ◽  
Ricardo Colom-Palero

In the world of algorithm acceleration and the implementation of the recall phase of deep neural networks, OpenCL-based solutions have a clear tendency to produce perfectly adapted kernels for graphics processing unit (GPU) architectures. However, they fail to obtain the same results when applied to field-programmable gate array (FPGA) based architectures. This situation, along with enormous advances in new GPU architectures, makes it difficult to defend an FPGA-based acceleration solution, even in terms of energy efficiency. Our goal in this paper is to demonstrate that multikernel structures based on classic systolic arrays can be written in OpenCL, extracting the most advanced features of FPGAs without having to resort to traditional FPGA development in lower-level hardware description languages (HDLs) such as Verilog or VHDL. This OpenCL methodology is based on the intensive use of channels (an Intel FPGA extension of OpenCL) for the communication of both data and control, and on the refinement of the OpenCL libraries with register transfer level (RTL) code to improve the implementation of the base and activation functions of the neurons and, above all, to reflect the importance of adequate communication between the layers when implementing neural networks.
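Intel FPGA OpenCL channels behave like fixed-depth FIFOs connecting concurrently executing kernels. A rough Python analogue of this layer-to-layer communication pattern, using threads and bounded queues (illustrative only; the kernel bodies here are toy stand-ins, not the paper's code):

```python
import queue
import threading

def layer_kernel(cin, cout, weight):
    # Model of one autorun layer kernel: blocking channel read, compute,
    # blocking channel write (like read_channel_intel / write_channel_intel).
    while True:
        x = cin.get()
        if x is None:                       # end-of-stream token
            cout.put(None)
            return
        cout.put(max(0.0, weight * x))      # toy neuron: scale, then ReLU

# Bounded queues play the role of fixed-depth channels between kernels.
c0, c1, c2 = (queue.Queue(maxsize=4) for _ in range(3))
for args in ((c0, c1, 0.5), (c1, c2, 2.0)):
    threading.Thread(target=layer_kernel, args=args).start()

for x in (1.0, -3.0, 2.5, None):            # feed layer 1, then close the stream
    c0.put(x)
print([c2.get() for _ in range(3)])          # [1.0, 0.0, 2.5]
```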


2020 ◽  
Author(s):  
Lucas Silva ◽  
Michael Canesche ◽  
Ricardo Ferreira ◽  
José Augusto Nacif

Recently, the increasing adoption of domain-specific architectures to execute kernels with high computing density, together with the exploration of sparse architectures using systolic arrays, has created the ideal scenario for using coarse-grained reconfigurable architectures (CGRAs) to accelerate applications. Unlike a systolic array, a CGRA can run different kernel sets while keeping a good balance between energy consumption and performance. In this work, we present HPCGRA, an orthogonally designed CGRA generator for high-performance spatial accelerators. Our tool does not require any expertise in Verilog design. In our approach, the CGRA is designed and implemented in an orthogonal fashion by wrapping the main building blocks: functional units, interconnection patterns, routing and elastic-buffer capabilities, configuration words, and memories. It optimizes and simplifies the process of creating CGRA architectures, using a portable description (a JSON file) and generating generic, scalable, and efficient Verilog RTL code with Veriloggen. The tool automatically generates CGRAs with up to 46×66 functional units, reaching 1.2 tera-ops/s.
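The abstract does not reproduce the description schema, so the field names in this Python sketch are purely illustrative of what such a portable JSON description might contain (not HPCGRA's actual format):

```python
import json

# Hypothetical portable CGRA description; every field name below is our
# illustration of the concept, not HPCGRA's real schema.
cgra_spec = {
    "name": "cgra_8x8",
    "shape": [8, 8],                          # rows x columns of functional units
    "functional_units": ["add", "mul", "pass"],
    "interconnect": "mesh",                   # interconnection pattern
    "elastic_buffers": {"depth": 2},          # per-link buffering capability
    "configuration_word_bits": 32,
}

with open("cgra_8x8.json", "w") as f:
    json.dump(cgra_spec, f, indent=2)         # portable input to the generator
```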

