Configurable Multi-directional Systolic Array Architecture for Convolutional Neural Networks

The systolic array architecture is one of the most popular choices for convolutional neural network hardware accelerators. Its biggest advantage is a simple and efficient design principle: without complicated control and dataflow, hardware accelerators built on a systolic array can compute traditional convolution very efficiently. However, this simplicity also limits the flexibility of the systolic array. When computing special types of convolution, such as small-scale convolution or depthwise convolution, the processing element (PE) utilization rate of the array decreases sharply. In this article, we design a configurable multi-directional systolic array (CMSA) to address these issues. First, we add a data path to the systolic array, which allows users to split the array through configuration to speed up the calculation of small-scale convolution. Second, we redesign the PE unit so that the array has multiple data transmission modes and dataflow strategies, which allows users to switch the dataflow of the PE array to speed up the calculation of depthwise convolution. In addition, unlike other works, we make only a few modifications to the existing systolic array architecture. This avoids additional hardware overhead and makes CMSA easy to deploy in application scenarios that require small systolic arrays, such as mobile terminals. Based on our evaluation, CMSA can increase the PE utilization rate by up to 1.6 times compared to the typical systolic array when running the last layers of ResNet-18. When running depthwise convolution in MobileNet, CMSA can increase the utilization rate by up to 14.8 times. At the same time, CMSA and traditional systolic arrays are similar in area and energy consumption.
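
The utilization drop described in this abstract can be illustrated with a toy model. The sketch below is not the paper's evaluation method: it assumes a convolution is lowered to a matrix multiply whose weight matrix has k*k*c_in rows and c_out columns, tiled onto an array of array_rows x array_cols PEs, and it assumes a naive depthwise mapping in which each channel becomes its own (k*k) x 1 matrix. All function names and the array size are hypothetical.

```python
import math


def tile_utilization(rows_needed, cols_needed, array_rows, array_cols):
    """Fraction of PE slots doing useful work, averaged over all tiles."""
    row_tiles = math.ceil(rows_needed / array_rows)
    col_tiles = math.ceil(cols_needed / array_cols)
    total_pe_slots = row_tiles * col_tiles * array_rows * array_cols
    return (rows_needed * cols_needed) / total_pe_slots


def standard_conv_utilization(k, c_in, c_out, array_rows, array_cols):
    # Assumed mapping: weight matrix of shape (k*k*c_in) x c_out.
    return tile_utilization(k * k * c_in, c_out, array_rows, array_cols)


def depthwise_conv_utilization(k, array_rows, array_cols):
    # Assumed naive mapping: each channel is a separate (k*k) x 1 matrix,
    # so only one column of the array holds useful weights at a time.
    return tile_utilization(k * k, 1, array_rows, array_cols)


if __name__ == "__main__":
    print(standard_conv_utilization(3, 64, 64, 32, 32))
    print(depthwise_conv_utilization(3, 32, 32))
```

Under these assumptions a 3x3 standard convolution with 64 input and 64 output channels fills a 32x32 array completely, while a 3x3 depthwise channel occupies only 9 of the 1024 PEs.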

HeSA: Heterogeneous Systolic Array Architecture for Compact CNNs Hardware Accelerators

10.23919/date51398.2021.9474145 ◽

2021 ◽

Author(s):

Rui Xu ◽

Sheng Ma ◽

Yaohua Wang ◽

Yang Guo

Keyword(s):

Systolic Array ◽

Hardware Accelerators ◽

Array Architecture

Heterogeneous Systolic Array Architecture for Compact CNNs Hardware Accelerators

IEEE Transactions on Parallel and Distributed Systems ◽

10.1109/tpds.2021.3129647 ◽

2021 ◽

pp. 1-1

Author(s):

Rui Xu ◽

Sheng Ma ◽

Yaohua Wang ◽

Yang Guo ◽

Dongsheng Li ◽

...

Keyword(s):

Systolic Array ◽

Hardware Accelerators ◽

Array Architecture

A new VLSI algorithm and architecture for the hardware implementation of type IV discrete cosine transform using a pseudo-band correlation structure

Open Computer Science ◽

10.2478/s13537-011-0015-z ◽

2011 ◽

Vol 1 (2) ◽

Cited By ~ 3

Author(s):

Doru Chiper

Keyword(s):

Discrete Cosine Transform ◽

Systolic Array ◽

Correlation Structure ◽

Efficient Design ◽

Hardware Complexity ◽

Cosine Transform ◽

Type IV ◽

VLSI Chip ◽

Array Architecture ◽

VLSI Algorithm

A new VLSI algorithm and its associated systolic array architecture for a prime-length type IV discrete cosine transform (DCT) are presented. They form the basis of an efficient design approach for deriving a linear systolic array architecture for the type IV DCT. The proposed algorithm uses a regular computational structure, called a pseudo-band correlation structure, that is appropriate for a VLSI implementation. The algorithm is then mapped onto a linear systolic array with a small number of I/O channels and low I/O bandwidth. The proposed architecture can be unified with that obtained for the type IV DST due to a similar kernel. A highly efficient VLSI chip can thus be obtained, with good performance in architectural topology, computing parallelism, processing speed, hardware complexity, and I/O costs, similar to those obtained for circular correlation and cyclic convolution computational structures.
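
For readers unfamiliar with the transform itself, here is a minimal reference sketch of the type IV DCT computed directly from its textbook definition. It is an O(N^2) baseline, not the pseudo-band correlation algorithm of the paper, and it assumes the unnormalized form (no scaling factor); the function name is hypothetical.

```python
import math


def dct_iv(x):
    """Unnormalized type IV DCT, straight from the definition.

    X[k] = sum over n of x[n] * cos(pi / N * (n + 1/2) * (k + 1/2))
    """
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi / n * (i + 0.5) * (k + 0.5)) for i in range(n))
        for k in range(n)
    ]


if __name__ == "__main__":
    signal = [1.0, 2.0, 3.0, 4.0, 5.0]  # prime length, as in the abstract
    print(dct_iv(signal))
```

A handy sanity check for any faster implementation is that this unnormalized transform applied twice returns the input scaled by N/2.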

RiSA: A Reinforced Systolic Array for Depthwise Convolutions and Embedded Tensor Reshaping

ACM Transactions on Embedded Computing Systems ◽

10.1145/3476984 ◽

2021 ◽

Vol 20 (5s) ◽

pp. 1-20

Author(s):

Hyungmin Cho

Keyword(s):

Neural Networks ◽

Convolutional Neural Networks ◽

Language Processing ◽

Systolic Array ◽

Data Reuse ◽

Systolic Arrays ◽

High Data ◽

Area Efficiency ◽

High Area ◽

Accelerator Design

Depthwise convolutions are widely used in convolutional neural networks (CNNs) targeting mobile and embedded systems. Depthwise convolution layers reduce the computation load and the number of parameters compared to conventional convolution layers. Many deep neural network (DNN) accelerators adopt an architecture that exploits the high data-reuse factor of DNN computations, such as a systolic array. However, depthwise convolutions have a low data-reuse factor and under-utilize the processing elements (PEs) in systolic arrays. In this paper, we present a DNN accelerator design called RiSA, which provides a novel mechanism that boosts the PE utilization for depthwise convolutions on a systolic array with minimal overheads. In addition, the PEs in systolic arrays can be used efficiently only if the data items (tensors) are arranged in the desired layout. Typical DNN accelerators provide various types of PE interconnects or additional modules to flexibly rearrange the data items and manage data movements during DNN computations. RiSA provides a lightweight set of tensor management tasks within the PE array itself, which eliminates the need for an additional module for tensor reshaping tasks. Using this embedded tensor reshaping, RiSA supports various DNN models, including convolutional neural networks and natural language processing models, while maintaining a high area efficiency. Compared to Eyeriss v2, RiSA improves the area and energy efficiency for MobileNet-V1 inference by 1.91× and 1.31×, respectively.
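
The claim that depthwise layers cut both parameters and computation, at the price of lower data reuse, follows from simple counting. The sketch below uses the standard formulas for a k x k convolution without bias; the function names and layer sizes are illustrative assumptions, not figures from the paper.

```python
def standard_conv_cost(k, c_in, c_out, out_h, out_w):
    """Return (parameters, multiply-accumulates) for a standard convolution."""
    params = k * k * c_in * c_out
    macs = params * out_h * out_w
    return params, macs


def depthwise_conv_cost(k, channels, out_h, out_w):
    """Return (parameters, multiply-accumulates) for a depthwise convolution."""
    params = k * k * channels  # one k x k filter per channel
    macs = params * out_h * out_w
    return params, macs


def filters_sharing_each_input(c_out, depthwise):
    # In a standard layer every input value feeds all c_out filters;
    # in a depthwise layer it feeds only the filter of its own channel.
    return 1 if depthwise else c_out


if __name__ == "__main__":
    print(standard_conv_cost(3, 4, 8, 5, 5))
    print(depthwise_conv_cost(3, 4, 5, 5))
```

The last helper makes the reuse argument explicit: with only one filter consuming each input value, an array that relies on broadcasting inputs across many filters has little to share.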

A scalable systolic array architecture for the 2D discrete wavelet transform

38th Midwest Symposium on Circuits and Systems. Proceedings ◽

10.1109/mwscas.1995.510290 ◽

2002 ◽

Author(s):

Jijun Chen ◽

M.A. Bayoumi

Keyword(s):

Wavelet Transform ◽

Discrete Wavelet Transform ◽

Systolic Array ◽

Discrete Wavelet ◽

Array Architecture

An Analysis of the Current Status of Woody Biomass Gasification Power Generation in Japan

Energies ◽

10.3390/en13184903 ◽

2020 ◽

Vol 13 (18) ◽

pp. 4903

Author(s):

Yasutsugu Baba ◽

Andante Hadi Pandyaswargo ◽

Hiroshi Onoda

Keyword(s):

Power Generation ◽

Moisture Content ◽

Tree Species ◽

Woody Biomass ◽

Biomass Gasification ◽

Energy Utilization ◽

Current Status ◽

Utilization Rate ◽

Small Scale ◽

Bio Oil

Forests cover two-thirds of Japan’s land area, and woody biomass is attracting attention as one of the most promising renewable energy sources in the country. The Feed-in Tariff (FIT) Act came into effect in 2012, and since then, woody biomass power generation has spread rapidly. Gasification power generation, which can generate electricity on a relatively small scale, has attracted a lot of attention. However, the technical issues of this technology remain poorly defined. This paper aims to clarify the problems of woody biomass gasification power generation in Japan, specifically the challenges of improving the energy utilization rate, the problem of controlling the moisture content, and the differing performance of power generation facilities that use different tree species. We also describe the technological development of a 2 MW updraft reactor for gasification and bio-oil coproduction to improve the energy utilization rate. The lower heating value of bio-oil, which was obtained in the experiment, was found to be about 70% of A-fuel oil. Among the results, the importance of controlling the moisture content of wood chips is identified from the measurement evaluation of a 0.36 MW-scale downdraft gasifier’s actual operation. We discuss the effects of tree species variation and ash on gasification power generation based on the results of pyrolysis analysis and industry analysis for each tree species. These results indicate the necessity of building a system specifically suited to Japan’s climate and forestry industry to allow woody biomass gasification power generation to become widespread in Japan.

FAMOUS, faster: using parallel computing techniques to accelerate the FAMOUS/HadCM3 climate model with a focus on the radiative transfer algorithm

Geoscientific Model Development ◽

10.5194/gmd-4-835-2011 ◽

2011 ◽

Vol 4 (3) ◽

pp. 835-844 ◽

Cited By ~ 10

Author(s):

P. Hanappe ◽

A. Beurivé ◽

F. Laguzet ◽

L. Steels ◽

N. Bellouin ◽

...

Keyword(s):

Climate Model ◽

Processing Element ◽

Atmospheric Radiation ◽

Multiple Data ◽

Fortran Code ◽

Thread Pool ◽

Speed Up ◽

Single Data ◽

Hardware Platforms ◽

Air Column

We have optimised the atmospheric radiation algorithm of the FAMOUS climate model on several hardware platforms. The optimisation involved translating the Fortran code to C and restructuring the algorithm around the computation of a single air column. Instead of the existing MPI-based domain decomposition, we used a task queue and a thread pool to schedule the computation of individual columns on the available processors. Finally, four air columns are packed together in a single data structure and computed simultaneously using Single Instruction Multiple Data operations. The modified algorithm runs more than 50 times faster on the CELL's Synergistic Processing Element than on its main PowerPC processing element. On Intel-compatible processors, the new radiation code runs 4 times faster. On the tested graphics processor, using OpenCL, we find a speed-up of more than 2.5 times as compared to the original code on the main CPU. Because the radiation code takes more than 60 % of the total CPU time, FAMOUS executes more than twice as fast. Our version of the algorithm returns bit-wise identical results, which demonstrates the robustness of our approach. We estimate that this project required around two and a half man-years of work.
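
The scheduling pattern in this abstract (independent air columns handed to a thread pool, grouped four at a time) can be sketched as follows. This is a structural illustration only: the authors worked in C with SIMD instructions, whereas this Python sketch uses the standard library's ThreadPoolExecutor, and compute_column is a hypothetical placeholder rather than any part of the radiation code. Python threads will not speed up CPU-bound work of this kind, so the sketch shows the task structure, not the performance.

```python
from concurrent.futures import ThreadPoolExecutor


def compute_column(column):
    # Hypothetical stand-in for the per-column radiation calculation.
    return sum(level * level for level in column)


def compute_packed(columns):
    # One task handles a pack of up to four columns, loosely mirroring
    # the four-column data structure used for SIMD in the abstract.
    return [compute_column(column) for column in columns]


def run_parallel(columns, workers=4, pack=4):
    groups = [columns[i:i + pack] for i in range(0, len(columns), pack)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() yields results in submission order, so output order is stable.
        packed_results = list(pool.map(compute_packed, groups))
    return [value for group in packed_results for value in group]


if __name__ == "__main__":
    columns = [[1, 2, 3], [4, 5], [6], [7, 8, 9], [10]]
    print(run_parallel(columns))
```

Comparing the parallel output with a plain serial loop over the same columns is the analogue of the identical-results check the authors report.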

A systolic array architecture for the Applebaum-Howells array

IEEE Transactions on Antennas and Propagation ◽

10.1109/8.56974 ◽

1990 ◽

Vol 38 (8) ◽

pp. 1310-1313 ◽

Cited By ~ 3

Author(s):

M. Ueno ◽

K. Kawabata ◽

T. Morooka

Keyword(s):

Systolic Array ◽

Array Architecture

A Performance Comparison of Three Micro-Sized Blade Rotor Designs for Malaysia Wind Speed Condition

Applied Mechanics and Materials ◽

10.4028/www.scientific.net/amm.785.310 ◽

2015 ◽

Vol 785 ◽

pp. 310-314 ◽

Cited By ~ 1

Author(s):

Norzanah Rosmin ◽

N.A. Rahman ◽

A.H. Mustaamal

Keyword(s):

Electrical Power ◽

Vertical Axis ◽

Performance Comparison ◽

Power Coefficient ◽

Small Scale ◽

Experimental Setup ◽

Wind Speeds ◽

Vertical Axis Wind Turbines ◽

Speed Up ◽

Electrical Generation

Vertical-Axis Wind Turbines (VAWTs) are known as the most suitable wind turbines for small-scale electrical generation. There are many types of VAWTs, and each has different performance and efficiency. In this work, three types of VAWT systems (Savo-B2, Savo-B4 and Giro-B3) were designed, constructed and tested to investigate the amount of electrical power that could be generated under several constant wind speeds. The blade rotors were designed and built using 2 mm thick aluminum plate. The tip speed ratios, power coefficients, blade rotations for each blade rotor and the simplicity of the proposed designs were studied via an experimental setup. The experimental work demonstrates that Savo-B2 provides the highest power coefficient, up to 0.32. Meanwhile, Giro-B3 offers the fastest rotational blade speed among the three designs, up to 20.53 rad/s.
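
The two metrics named in this abstract have standard textbook definitions, sketched below. The abstract does not give rotor dimensions or wind speeds, so the inputs in the usage lines are made-up illustrative values, and the default air density is an assumed sea-level figure.

```python
AIR_DENSITY = 1.225  # kg/m^3, assumed standard sea-level value


def tip_speed_ratio(omega, radius, wind_speed):
    """Blade tip speed (omega in rad/s times radius in m) over wind speed in m/s."""
    return omega * radius / wind_speed


def power_coefficient(power, swept_area, wind_speed, rho=AIR_DENSITY):
    """Extracted power in W over the kinetic power of the wind through the swept area."""
    available = 0.5 * rho * swept_area * wind_speed ** 3
    return power / available


if __name__ == "__main__":
    # Illustrative inputs only, not measurements from the study.
    print(tip_speed_ratio(omega=20.0, radius=0.25, wind_speed=5.0))
    print(power_coefficient(power=2.0, swept_area=2.0, wind_speed=2.0, rho=1.0))
```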

A fully distributed unstructured Navier-Stokes solver for large-scale aeroelasticity computations

The Aeronautical Journal ◽

10.1017/s0001924000012392 ◽

2001 ◽

Vol 105 (1050) ◽

pp. 419-426 ◽

Cited By ~ 10

Author(s):

G. Barakos ◽

M. Vahdati ◽

A.I. Sayma ◽

C. Bréard ◽

M. Imregun

Keyword(s):

Large Scale ◽

Numerical Models ◽

Navier Stokes ◽

Multiple Data ◽

Computational Mesh ◽

Blade Row ◽

Scale Modelling ◽

Modelling Methodology ◽

Speed Up ◽

Development And Validation

This paper presents the development and validation of a parallel unsteady flow and aeroelasticity code for large-scale numerical models used in turbomachinery applications. The work is based on an existing unstructured Navier-Stokes solver developed over the past ten years by the Aeroelasticity Research Group at Imperial College Vibration University Technology Centre. The single-process multiple-data paradigm was adopted for the parallelisation of the solver and several validation cases were considered. The computational mesh was divided into several sub-sections using a domain decomposition technique. The performance and numerical accuracy of the parallel solver were validated across several computer platforms for various problem sizes. In cases where the solution could be obtained on a single CPU, the serial and parallel versions of the code were found to produce identical results. Studies on up to 32 CPUs showed varying levels of parallelisation efficiency, with an almost linear speed-up being obtained in some cases. Finally, an industrial configuration, a 17 blade row turbine with a 47 million point mesh, was discussed to illustrate the potential of the proposed large-scale modelling methodology.
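
Two ideas in this abstract, dividing the mesh among processes and measuring how close the speed-up is to linear, can be sketched in a few lines. The partitioner below is a deliberate simplification: it splits a range of point indices into contiguous, near-equal blocks, whereas decomposing a real unstructured mesh also has to account for connectivity. The function names and timings are hypothetical.

```python
def partition(num_points, num_procs):
    """Split point indices 0..num_points-1 into contiguous, near-equal blocks."""
    base, extra = divmod(num_points, num_procs)
    ranges = []
    start = 0
    for rank in range(num_procs):
        size = base + (1 if rank < extra else 0)  # first ranks absorb the remainder
        ranges.append((start, start + size))
        start += size
    return ranges


def parallel_efficiency(serial_time, parallel_time, num_procs):
    """Return (speed-up, efficiency); efficiency 1.0 means linear speed-up."""
    speedup = serial_time / parallel_time
    return speedup, speedup / num_procs


if __name__ == "__main__":
    print(partition(10, 3))
    print(parallel_efficiency(64.0, 2.0, 32))  # made-up timings
```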
