Embedded Processor Architectures

2021 ◽  
pp. 341-389
Author(s):  
KCS Murti
Author(s):  
Dimitris Gizopoulos ◽  
Antonis Paschalis ◽  
Yervant Zorian

2021 ◽  
Author(s):  
Bashar Romanous ◽  
Skyler Windh ◽  
Ildar Absalyamov ◽  
Prerna Budhkar ◽  
Robert Halstead ◽  
...  

The join and group-by aggregation are two memory-intensive operators that affect the performance of relational databases. Hashing is a common approach used to implement both operators. Recent paradigm shifts in multi-core processor architectures have reinvigorated research into how the join and group-by aggregation operators can leverage these advances. However, the poor spatial locality of the hashing approach has hindered performance on multi-core processor architectures, which rely on large cache hierarchies for latency mitigation. Multithreaded architectures can better cope with poor spatial locality by masking memory latency with many outstanding requests. Nevertheless, the number of parallel threads, even in the most advanced multithreaded processors, such as UltraSPARC, is not enough to fully cover the main memory access latency. In this paper, we explore the hardware re-configurability of FPGAs to enable deeper execution pipelines that maintain hundreds (instead of tens) of outstanding memory requests across four FPGAs, drastically increasing concurrency and throughput. We present two end-to-end in-memory accelerators for the join and group-by aggregation operators using FPGAs. Both accelerators use massive multithreading to mask the long memory delays of traversing linked-list data structures, while concurrently managing hundreds of thread states across four FPGAs locally. We explore how content addressable memories (CAMs) can be intermixed within our multithreaded designs to act as a synchronizing cache, which enforces locks and merges jobs together before they are written to memory. Throughput results for our hash-join operator accelerator show a speedup between 2× and 3.4× over the best multi-core approaches with comparable memory bandwidths on uniform and skewed datasets. The accelerator for the hash-based group-by aggregation operator demonstrates that leveraging CAMs achieves an average speedup of 3.3×, with a best case of 9.4×, in terms of throughput over CPU implementations across five types of data distributions.
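Both accelerators walk chained (linked-list) hash buckets in memory. As a point of reference for the data structure being traversed, the following is a minimal software sketch of a chained hash join (build then probe) in C; the table layout, hash function, and names are illustrative assumptions, and the paper's multithreaded FPGA pipelines and CAM-based synchronizing cache are not modeled.

/* Minimal sketch of a chained (linked-list) hash join in software.
 * Table names, key widths, and the hash function are illustrative
 * assumptions; the paper's FPGA pipelines are not modeled here. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define NUM_BUCKETS 1024u            /* power of two for cheap masking */

typedef struct node {
    uint32_t key;                    /* join key from the build relation */
    uint32_t payload;                /* row id or projected column       */
    struct node *next;               /* chain for colliding keys         */
} node_t;

static node_t *buckets[NUM_BUCKETS];

static uint32_t hash_key(uint32_t k) /* simple multiplicative hash */
{
    return (k * 2654435761u) & (NUM_BUCKETS - 1u);
}

/* Build phase: insert every tuple of the (smaller) build relation. */
static void build_insert(uint32_t key, uint32_t payload)
{
    node_t *n = malloc(sizeof *n);
    uint32_t b = hash_key(key);
    n->key = key;
    n->payload = payload;
    n->next = buckets[b];            /* prepend to the bucket chain */
    buckets[b] = n;
}

/* Probe phase: walk the chain; every pointer chase is a potential
 * long-latency memory access, which is what multithreading hides. */
static void probe(uint32_t key, uint32_t payload)
{
    for (node_t *n = buckets[hash_key(key)]; n; n = n->next)
        if (n->key == key)
            printf("match: key=%u build=%u probe=%u\n",
                   key, n->payload, payload);
}

int main(void)
{
    build_insert(7, 100);
    build_insert(7, 101);
    build_insert(9, 200);
    probe(7, 500);                   /* two matches */
    probe(8, 501);                   /* no match    */
    return 0;
}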


Electronics ◽  
2021 ◽  
Vol 10 (4) ◽  
pp. 516
Author(s):  
Tram Thi Bao Nguyen ◽  
Tuy Nguyen Tan ◽  
Hanho Lee

This paper presents a pipelined layered quasi-cyclic low-density parity-check (QC-LDPC) decoder architecture targeting low complexity, high throughput, and efficient use of hardware resources, compliant with the specifications of the 5G new radio (NR) wireless communication standard. First, a combined min-sum (CMS) decoding algorithm, which is a combination of the offset min-sum and the original min-sum algorithms, is proposed. Then, a low-complexity and high-throughput pipelined layered QC-LDPC decoder architecture for the enhanced mobile broadband specifications in the 5G NR wireless standard, based on the CMS algorithm with pipelined layered scheduling, is presented. Enhanced versions of check node-based processor architectures are proposed to reduce the complexity of the LDPC decoders. An efficient minimum-finder for the check node unit architecture that reduces the hardware required for the computation of the first two minima is introduced. Moreover, a low-complexity a posteriori information update unit architecture, which requires only one adder array for its operations, is presented. The proposed architecture shows significant improvements in terms of area and throughput compared to other QC-LDPC decoder architectures available in the literature.
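For reference, the check-node update at the heart of min-sum decoding needs only the two smallest input magnitudes, which is what a hardware minimum-finder computes. The following C sketch shows that single-pass two-minima update with an offset; the check-node degree, offset value, and floating-point LLRs are illustrative assumptions, and the exact combined min-sum (CMS) rule of the paper is not reproduced.

/* Sketch of a min-sum check-node update that tracks only the two
 * smallest input magnitudes (min1, min2) in a single pass. The
 * offset value and LLR type are illustrative assumptions; the
 * paper's combined min-sum (CMS) rule is not reproduced exactly. */
#include <stdio.h>
#include <math.h>

#define DEG 6                         /* check-node degree (example) */

void check_node_update(const float llr_in[DEG], float llr_out[DEG],
                       float offset)
{
    float min1 = INFINITY, min2 = INFINITY;
    int   idx_min1 = -1;
    int   sign_prod = 1;

    /* One pass: overall sign and the two smallest magnitudes. */
    for (int i = 0; i < DEG; i++) {
        float mag = fabsf(llr_in[i]);
        if (llr_in[i] < 0.0f) sign_prod = -sign_prod;
        if (mag < min1) { min2 = min1; min1 = mag; idx_min1 = i; }
        else if (mag < min2) { min2 = mag; }
    }

    /* Each output uses min2 for the edge that supplied min1,
     * min1 otherwise; apply the offset and the extrinsic sign. */
    for (int i = 0; i < DEG; i++) {
        float mag = (i == idx_min1) ? min2 : min1;
        mag -= offset;
        if (mag < 0.0f) mag = 0.0f;
        int sign = (llr_in[i] < 0.0f) ? -sign_prod : sign_prod;
        llr_out[i] = (float)sign * mag;
    }
}

int main(void)
{
    float in[DEG]  = { 1.5f, -0.4f, 2.0f, -0.9f, 3.1f, 0.7f };
    float out[DEG];
    check_node_update(in, out, 0.15f);
    for (int i = 0; i < DEG; i++)
        printf("msg[%d] = %+.2f\n", i, out[i]);
    return 0;
}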


Electronics ◽  
2021 ◽  
Vol 10 (4) ◽  
pp. 469
Author(s):  
Hyun Woo Oh ◽  
Ji Kwang Kim ◽  
Gwan Beom Hwang ◽  
Seung Eun Lee

Recently, advances in technology have enabled embedded systems to be adopted for a variety of applications. Some of these applications require real-time 2D graphics processing while running under tight design constraints such as low power consumption and a small area. An effective way to satisfy these constraints is to include a dedicated 2D graphics accelerator in the embedded system. The accelerator reduces the workload of the system processor and allows the system to perform 2D graphics processing in real time, so a variety of applications that require 2D graphics processing can be implemented with an embedded processor. In this paper, we present a 2D graphics accelerator for tiny embedded systems. The accelerator includes an optimized line-drawing operation based on Bresenham’s algorithm. The optimized operation enables the accelerator to handle various kinds of 2D graphics processing and to perform line drawing in place of the system processor. Moreover, the accelerator further reduces the load on the processor core by removing the need for the core to access the frame buffer memory. We measure the performance of the accelerator by implementing the processor, including the accelerator, on a field-programmable gate array (FPGA), and we verify the feasibility of realization by synthesizing the design in a 180 nm CMOS process.
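The line-drawing operation is based on Bresenham's algorithm, whose core uses only integer additions and comparisons. The following C sketch shows the standard all-octant form; the plot() function is a hypothetical stand-in for the accelerator's frame-buffer write.

/* Integer-only Bresenham line drawing (all octants).  plot() is a
 * hypothetical stand-in for the accelerator's frame-buffer write. */
#include <stdio.h>
#include <stdlib.h>

static void plot(int x, int y)
{
    printf("(%d,%d)\n", x, y);        /* would set a pixel in hardware */
}

void draw_line(int x0, int y0, int x1, int y1)
{
    int dx = abs(x1 - x0), sx = (x0 < x1) ? 1 : -1;
    int dy = -abs(y1 - y0), sy = (y0 < y1) ? 1 : -1;
    int err = dx + dy;                /* decision/error term */

    for (;;) {
        plot(x0, y0);
        if (x0 == x1 && y0 == y1) break;
        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }   /* step in x */
        if (e2 <= dx) { err += dx; y0 += sy; }   /* step in y */
    }
}

int main(void)
{
    draw_line(0, 0, 7, 3);            /* shallow line in the first octant */
    return 0;
}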


2014 ◽  
Vol 686 ◽  
pp. 126-131
Author(s):  
Xiao Yan Sha

Taking an embedded processor as the core control unit, this paper designs the hardware and software of a fan monitoring system that detects the working condition of the fans and controls them in real time. For the control algorithm, the paper analyzes the theory and composition of fuzzy control systems and then, considering the particular characteristics of tunnel ventilation, introduces a feed-forward model that predicts the increment of pollutants in order to reduce control lag. Combining this prediction with the system feedback value and the set value, two independent fuzzy controllers compute the result, which ultimately determines how many jet fans in the tunnel to start or stop. Simulation analysis shows that introducing the feed-forward signal more effectively improves the system's ability to reject disturbances.
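The control structure combines a feedback decision on the deviation from the set value with a feed-forward term driven by the predicted pollutant increment. The following C sketch is a heavily simplified illustration of that structure; the three-level fuzzification thresholds, the rule table, and the feed-forward condition are illustrative assumptions rather than the paper's tuned fuzzy controllers.

/* Heavily simplified sketch of the control structure described above:
 * a feedback fuzzy-style decision on (error, error change) combined
 * with a feed-forward term from the predicted pollutant increment.
 * All thresholds, the 3x3 rule table, and the feed-forward condition
 * are illustrative assumptions, not the paper's tuned controllers. */
#include <stdio.h>

static int classify(float v, float neg, float pos)  /* crude 3-level fuzzifier */
{
    if (v < neg) return 0;            /* negative       */
    if (v > pos) return 2;            /* positive       */
    return 1;                         /* roughly zero   */
}

/* Rule table: rows = error class, cols = error-change class,
 * value = change in the number of running jet fans. */
static const int rules[3][3] = {
    { -2, -1,  0 },                   /* pollution well below set point */
    { -1,  0, +1 },
    {  0, +1, +2 },                   /* pollution well above set point */
};

int fan_increment(float setpoint, float feedback, float prev_feedback,
                  float predicted_rise)
{
    float err  = feedback - setpoint;
    float derr = feedback - prev_feedback;
    int fb_term = rules[classify(err, -0.1f, 0.1f)]
                       [classify(derr, -0.02f, 0.02f)];
    int ff_term = (predicted_rise > 0.05f) ? 1 : 0;   /* feed-forward boost */
    return fb_term + ff_term;
}

int main(void)
{
    /* Pollutant level above target and rising, with more predicted inflow. */
    printf("delta fans = %+d\n", fan_increment(0.5f, 0.62f, 0.58f, 0.08f));
    return 0;
}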


2014 ◽  
Vol 543-547 ◽  
pp. 2873-2878
Author(s):  
Hui Yong Li ◽  
Hong Xu Jiang ◽  
Ping Zhang ◽  
Han Qing Li ◽  
Qian Cao

Modern embedded portable devices usually have to deal with large amounts of video data. Because of its many floating-point multiplications, color space conversion is inefficient on an embedded processor. Considering the characteristics of RGB-to-YCbCr color space conversion, this paper proposes a truncated LUT multiplier (T-LUT multiplier). On this basis, an approach for converting RGB to YCbCr that employs the T-LUT multiplier and a pipeline-based adder is presented. Experimental results demonstrate that the proposed method achieves a maximum operating frequency of 358 MHz, 3.5 times faster than the direct method. Furthermore, its power consumption is approximately 15% to 27% lower than that of the general method.
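The idea behind a LUT multiplier is to replace each constant-coefficient multiplication with a table lookup, so a pixel conversion reduces to lookups and additions. The following C sketch illustrates that idea for RGB-to-YCbCr with full-precision fixed-point tables (BT.601 coefficients assumed); the paper's truncation of the LUT entries and its pipeline-based adder are not reproduced.

/* Sketch of LUT-based RGB -> YCbCr conversion: each coefficient*value
 * product is precomputed into a 256-entry table, so the per-pixel work
 * is lookups and additions only.  The tables here hold full-precision
 * fixed-point products; the paper's truncated LUT multiplier further
 * trims these entries, which is not reproduced. */
#include <stdio.h>
#include <stdint.h>

#define FRAC 16                       /* Q16 fixed-point fraction bits */

static int32_t lut_yr[256], lut_yg[256], lut_yb[256];
static int32_t lut_cbr[256], lut_cbg[256], lut_cbb[256];
static int32_t lut_crr[256], lut_crg[256], lut_crb[256];

static void init_luts(void)           /* BT.601 full-range coefficients */
{
    for (int v = 0; v < 256; v++) {
        lut_yr[v]  = (int32_t)( 0.299  * v * (1 << FRAC));
        lut_yg[v]  = (int32_t)( 0.587  * v * (1 << FRAC));
        lut_yb[v]  = (int32_t)( 0.114  * v * (1 << FRAC));
        lut_cbr[v] = (int32_t)(-0.1687 * v * (1 << FRAC));
        lut_cbg[v] = (int32_t)(-0.3313 * v * (1 << FRAC));
        lut_cbb[v] = (int32_t)( 0.5    * v * (1 << FRAC));
        lut_crr[v] = (int32_t)( 0.5    * v * (1 << FRAC));
        lut_crg[v] = (int32_t)(-0.4187 * v * (1 << FRAC));
        lut_crb[v] = (int32_t)(-0.0813 * v * (1 << FRAC));
    }
}

static void rgb_to_ycbcr(uint8_t r, uint8_t g, uint8_t b,
                         uint8_t *y, uint8_t *cb, uint8_t *cr)
{
    /* Per-pixel: nine lookups, additions, and a final shift. */
    *y  = (uint8_t)((lut_yr[r]  + lut_yg[g]  + lut_yb[b]) >> FRAC);
    *cb = (uint8_t)((lut_cbr[r] + lut_cbg[g] + lut_cbb[b]
                     + (128 << FRAC)) >> FRAC);
    *cr = (uint8_t)((lut_crr[r] + lut_crg[g] + lut_crb[b]
                     + (128 << FRAC)) >> FRAC);
}

int main(void)
{
    uint8_t y, cb, cr;
    init_luts();
    rgb_to_ycbcr(255, 128, 0, &y, &cb, &cr);   /* orange test pixel */
    printf("Y=%u Cb=%u Cr=%u\n", y, cb, cr);
    return 0;
}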

