systolic arrays
Recently Published Documents

TOTAL DOCUMENTS: 516 (FIVE YEARS: 26)
H-INDEX: 31 (FIVE YEARS: 2)

2021 ◽  
Vol 20 (5s) ◽  
pp. 1-20
Author(s):  
Hyungmin Cho

Depthwise convolutions are widely used in convolutional neural networks (CNNs) targeting mobile and embedded systems. Depthwise convolution layers reduce the computation load and the number of parameters compared to conventional convolution layers. Many deep neural network (DNN) accelerators adopt an architecture that exploits the high data-reuse factor of DNN computations, such as a systolic array. However, depthwise convolutions have a low data-reuse factor and under-utilize the processing elements (PEs) in systolic arrays. In this paper, we present a DNN accelerator design called RiSA, which provides a novel mechanism that boosts PE utilization for depthwise convolutions on a systolic array with minimal overhead. In addition, the PEs in systolic arrays can be used efficiently only if the data items (tensors) are arranged in the desired layout. Typical DNN accelerators provide various types of PE interconnects or additional modules to flexibly rearrange data items and manage data movement during DNN computations. RiSA provides a lightweight set of tensor management tasks within the PE array itself, eliminating the need for an additional tensor-reshaping module. Using this embedded tensor reshaping, RiSA supports various DNN models, including convolutional neural networks and natural language processing models, while maintaining high area efficiency. Compared to Eyeriss v2, RiSA improves the area and energy efficiency of MobileNet-V1 inference by 1.91× and 1.31×, respectively.
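For intuition about the utilization gap, here is a back-of-the-envelope Python sketch. The 128×128 weight-stationary array and the im2col-style mapping are our illustrative assumptions, not details of RiSA itself:

```python
def gemm_utilization(rows_used, cols_used, array_rows=128, array_cols=128):
    """Fraction of PEs holding useful weights for one mapped GEMM tile."""
    busy = min(rows_used, array_rows) * min(cols_used, array_cols)
    return busy / (array_rows * array_cols)

# Standard 3x3 convolution, 128 input / 128 output channels:
# im2col lowering yields a (128*3*3) x 128 weight matrix -> fills the array.
std = gemm_utilization(rows_used=128 * 3 * 3, cols_used=128)

# Depthwise 3x3 convolution: each channel is an independent 9 x 1 "matrix",
# so one channel occupies only 9 rows of a single column at a time.
dw = gemm_utilization(rows_used=3 * 3, cols_used=1)

print(f"standard conv PE utilization:  {std:.2%}")   # 100.00%
print(f"depthwise conv PE utilization: {dw:.4%}")    # 0.0549%
```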


2021 ◽  
Vol 18 (4) ◽  
pp. 1-24
Author(s):  
Rui Xu ◽  
Sheng Ma ◽  
Yaohua Wang ◽  
Xinhai Chen ◽  
Yang Guo

The systolic array architecture is one of the most popular choices for convolutional neural network hardware accelerators. Its biggest advantage is its simple and efficient design principle: without complicated control and dataflow, hardware accelerators based on a systolic array can compute traditional convolutions very efficiently. However, this advantage also brings new challenges. When computing special types of convolution, such as small-scale convolution or depthwise convolution, the processing element (PE) utilization rate of the array drops sharply, mainly because the simple architecture limits the flexibility of the systolic array. In this article, we design a configurable multi-directional systolic array (CMSA) to address these issues. First, we add a data path to the systolic array that allows users to split the array through configuration, speeding up small-scale convolutions. Second, we redesign the PE unit so that the array supports multiple data transmission modes and dataflow strategies, allowing users to switch the dataflow of the PE array to speed up depthwise convolution. In addition, unlike other works, we make only a few changes and modifications to the existing systolic array architecture. This avoids additional hardware overhead and eases deployment in application scenarios that require small systolic arrays, such as mobile terminals. Based on our evaluation, CMSA increases the PE utilization rate by up to 1.6 times compared to a typical systolic array when running the last layers of ResNet-18, and by up to 14.8 times when running depthwise convolutions in MobileNet. At the same time, CMSA is similar to traditional systolic arrays in area and energy consumption.
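To illustrate why splitting helps small-scale convolutions, consider this toy utilization model (a sketch of the general idea, not CMSA's actual configuration mechanism; the array sizes are our assumptions):

```python
def utilization(m, n, rows, cols):
    """PE utilization when an m x n weight tile is mapped onto a rows x cols array."""
    return (min(m, rows) * min(n, cols)) / (rows * cols)

m, n = 16, 16                       # a small-scale layer lowered to a 16x16 tile
mono = utilization(m, n, 32, 32)    # monolithic 32x32 array: only a corner is busy
split = utilization(m, n, 16, 16)   # one of four 16x16 sub-arrays: fully busy,
                                    # and the other three sub-arrays are free to
                                    # take three more tiles in parallel
print(mono, split)                  # 0.25 1.0
```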


2021 ◽  
Author(s):  
Neeraj Kumar Cheryala

2021 ◽  
Vol 15 ◽  
pp. 1-7
Author(s):  
Halil Snopce ◽  
Azir Aliu

This paper deals with the latency analysis of a two-dimensional systolic array for matrix multiplication. The latency for all possible connection schemes is discussed. In this way, a lower bound on the latency achievable with such arrays is obtained.
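For intuition, the textbook output-stationary connection scheme gives a concrete latency figure: with inputs skewed so that PE (i, j) receives its first operand pair at cycle i + j, an n×n array finishes an n×n product after 3n − 2 cycles. A small Python check of this standard scheme (one of the connection schemes such an analysis covers, not necessarily the paper's lower bound):

```python
def output_stationary_latency(n):
    # PE (i, j) sees its first operand pair at cycle i + j, because row i of A
    # and column j of B are skewed by i and j cycles respectively; it then
    # needs n multiply-accumulate steps to finish C[i][j].
    first_input = max(i + j for i in range(n) for j in range(n))  # 2n - 2
    return first_input + n                                        # 3n - 2

print([output_stationary_latency(n) for n in (2, 4, 8)])  # [4, 10, 22]
```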


Electronics ◽  
2021 ◽  
Vol 10 (6) ◽  
pp. 652
Author(s):  
Kashif Inayat ◽  
Jaeyong Chung

Systolic arrays are the primary component of modern deep learning accelerators and are widely used in real-life applications such as self-driving cars. This paper presents a novel factored systolic array, where the carry-propagate adder for accumulation and the rounding logic are extracted out of each processing element, which reduces the area, power, and delay of the processing elements substantially. The factoring is performed in a column-wise manner, and the cost of the factored logic, placed at each column output, is amortized over the processing elements in that column. We demonstrate the proposed factoring in an open-source systolic array, Gemmini. The factoring technique does not change the functionality of the base design and is transparent to applications. We show that the proposed technique leads to reductions in area and delay of up to 45.3% and 23.7%, respectively, compared to the Gemmini baseline.
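A minimal Python sketch of the underlying arithmetic idea, as we reconstruct it (not the paper's exact microarchitecture): each PE accumulates in redundant carry-save form with a 3:2 compressor, so no carry ever propagates inside the array, and a single carry-propagate addition at the foot of each column finalizes the result:

```python
MASK = (1 << 32) - 1  # model a 32-bit datapath

def csa(a, b, c):
    """3:2 compressor: reduce three addends to a sum word and a carry word."""
    s = (a ^ b ^ c) & MASK
    carry = (((a & b) | (a & c) | (b & c)) << 1) & MASK
    return s, carry

def column_accumulate(products):
    """Accumulate one column's products carry-save; one CPA at the column foot."""
    s, c = 0, 0
    for p in products:
        s, c = csa(s, c, p & MASK)  # each PE adds without propagating carries
    return (s + c) & MASK           # the single factored carry-propagate adder

assert column_accumulate([3, 5, 7, 11]) == 26
```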


2021 ◽  
Author(s):  
Surya Selvam ◽  
Vinod Ganesan ◽  
Pratyush Kumar

Electronics ◽  
2021 ◽  
Vol 10 (1) ◽  
pp. 70
Author(s):  
Rafael Gadea-Gironés ◽  
Vicente Herrero-Bosch ◽  
Jose Monzó-Ferrer ◽  
Ricardo Colom-Palero

In the world of algorithm acceleration and the implementation of the recall phase of deep neural networks, OpenCL-based solutions have a clear tendency to produce perfectly adapted kernels for graphics processing unit (GPU) architectures. However, they fail to obtain the same results when applied to field-programmable gate array (FPGA) based architectures. This situation, along with enormous advances in new GPU architectures, makes it difficult to defend an FPGA-based acceleration solution, even in terms of energy efficiency. Our goal in this paper is to demonstrate that multikernel structures based on classic systolic arrays can be written in OpenCL, extracting the most advanced features of FPGAs without having to resort to traditional FPGA development in lower-level hardware description languages (HDLs) such as Verilog or VHDL. This OpenCL methodology is based on the intensive use of channels (an Intel FPGA extension of OpenCL) for the communication of both data and control, and on the refinement of the OpenCL libraries with register transfer level (RTL) code to improve the implementation of the base and activation functions of the neurons and, above all, to reflect the importance of adequate communication between the layers when implementing neural networks.
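Intel FPGA OpenCL channels behave like fixed-depth FIFOs connecting concurrently executing kernels. A rough Python analogue of this layer-to-layer communication pattern, using threads and bounded queues (illustrative only; the kernel bodies here are toy stand-ins, not the paper's code):

```python
import queue
import threading

def layer_kernel(cin, cout, weight):
    # Model of one autorun layer kernel: blocking channel read, compute,
    # blocking channel write (like read_channel_intel / write_channel_intel).
    while True:
        x = cin.get()
        if x is None:                       # end-of-stream token
            cout.put(None)
            return
        cout.put(max(0.0, weight * x))      # toy neuron: scale, then ReLU

# Bounded queues play the role of fixed-depth channels between kernels.
c0, c1, c2 = (queue.Queue(maxsize=4) for _ in range(3))
for args in ((c0, c1, 0.5), (c1, c2, 2.0)):
    threading.Thread(target=layer_kernel, args=args).start()

for x in (1.0, -3.0, 2.5, None):            # feed layer 1, then close the stream
    c0.put(x)
print([c2.get() for _ in range(3)])          # [1.0, 0.0, 2.5]
```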


2020 ◽  
Author(s):  
Lucas Silva ◽  
Michael Canesche ◽  
Ricardo Ferreira ◽  
José Augusto Nacif

Recently, the increasing adoption of domain-specific architectures to execute kernels with high computing density, together with the exploration of sparse architectures using systolic arrays, has created the ideal scenario for using coarse-grained reconfigurable architectures (CGRAs) to accelerate applications. Unlike a systolic array, a CGRA can run different kernel sets while keeping a good balance between energy consumption and performance. In this work, we present HPCGRA, an orthogonally designed CGRA generator for high-performance spatial accelerators. Our tool does not require any expertise in Verilog design. In our approach, the CGRA is designed and implemented in an orthogonal fashion by wrapping the main building blocks: functional units, interconnection patterns, routing and elastic-buffer capabilities, configuration words, and memories. It optimizes and simplifies the process of creating CGRA architectures, using a portable description (a JSON file) and generating generic, scalable, and efficient Verilog RTL code with Veriloggen. The tool automatically generates CGRAs with up to 46×66 functional units, reaching 1.2 tera-ops/s.
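The abstract does not reproduce the description schema, so the field names in this Python sketch are purely illustrative of what such a portable JSON description might contain (not HPCGRA's actual format):

```python
import json

# Hypothetical portable CGRA description; every field name below is our
# illustration of the concept, not HPCGRA's real schema.
cgra_spec = {
    "name": "cgra_8x8",
    "shape": [8, 8],                          # rows x columns of functional units
    "functional_units": ["add", "mul", "pass"],
    "interconnect": "mesh",                   # interconnection pattern
    "elastic_buffers": {"depth": 2},          # per-link buffering capability
    "configuration_word_bits": 32,
}

with open("cgra_8x8.json", "w") as f:
    json.dump(cgra_spec, f, indent=2)         # portable input to the generator
```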

