Embedded Processor Architectures

2021 ◽  
pp. 341-389
Author(s):  
KCS Murti
Author(s):  
Dimitris Gizopoulos ◽  
Antonis Paschalis ◽  
Yervant Zorian

2021 ◽  
Author(s):  
Bashar Romanous ◽  
Skyler Windh ◽  
Ildar Absalyamov ◽  
Prerna Budhkar ◽  
Robert Halstead ◽  
...  

The join and group-by aggregation are two memory-intensive operators that affect the performance of relational databases. Hashing is a common approach used to implement both operators. Recent paradigm shifts in multi-core processor architectures have reinvigorated research into how the join and group-by aggregation operators can leverage these advances. However, the poor spatial locality of the hashing approach has hindered performance on multi-core processor architectures, which rely on large cache hierarchies for latency mitigation. Multithreaded architectures can better cope with poor spatial locality by masking memory latency with many outstanding requests. Nevertheless, the number of parallel threads, even in the most advanced multithreaded processors, such as UltraSPARC, is not enough to fully cover the main memory access latency. In this paper, we explore the hardware re-configurability of FPGAs to enable deeper execution pipelines that maintain hundreds (instead of tens) of outstanding memory requests across four FPGAs, drastically increasing concurrency and throughput. We present two end-to-end in-memory accelerators for the join and group-by aggregation operators using FPGAs. Both accelerators use massive multithreading to mask the long memory delays of traversing linked-list data structures, while concurrently managing hundreds of thread states across four FPGAs locally. We explore how content addressable memories (CAMs) can be intermixed within our multithreaded designs to act as a synchronizing cache, which enforces locks and merges jobs together before they are written to memory. Throughput results for our hash-join operator accelerator show a speedup between 2× and 3.4× over the best multi-core approaches with comparable memory bandwidths on uniform and skewed datasets. The accelerator for the hash-based group-by aggregation operator demonstrates that leveraging CAMs achieves an average speedup of 3.3×, with a best case of 9.4×, in terms of throughput over CPU implementations across five types of data distributions.
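Both accelerators walk chained (linked-list) hash buckets in memory. As a point of reference for the data structure being traversed, the following is a minimal software sketch of a chained hash join (build then probe) in C; the table layout, hash function, and names are illustrative assumptions, and the paper's multithreaded FPGA pipelines and CAM-based synchronizing cache are not modeled.

/* Minimal sketch of a chained (linked-list) hash join in software.
 * Table names, key widths, and the hash function are illustrative
 * assumptions; the paper's FPGA pipelines are not modeled here. */
#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>

#define NUM_BUCKETS 1024u            /* power of two for cheap masking */

typedef struct node {
    uint32_t key;                    /* join key from the build relation */
    uint32_t payload;                /* row id or projected column       */
    struct node *next;               /* chain for colliding keys         */
} node_t;

static node_t *buckets[NUM_BUCKETS];

static uint32_t hash_key(uint32_t k) /* simple multiplicative hash */
{
    return (k * 2654435761u) & (NUM_BUCKETS - 1u);
}

/* Build phase: insert every tuple of the (smaller) build relation. */
static void build_insert(uint32_t key, uint32_t payload)
{
    node_t *n = malloc(sizeof *n);
    uint32_t b = hash_key(key);
    n->key = key;
    n->payload = payload;
    n->next = buckets[b];            /* prepend to the bucket chain */
    buckets[b] = n;
}

/* Probe phase: walk the chain; every pointer chase is a potential
 * long-latency memory access, which is what multithreading hides. */
static void probe(uint32_t key, uint32_t payload)
{
    for (node_t *n = buckets[hash_key(key)]; n; n = n->next)
        if (n->key == key)
            printf("match: key=%u build=%u probe=%u\n",
                   key, n->payload, payload);
}

int main(void)
{
    build_insert(7, 100);
    build_insert(7, 101);
    build_insert(9, 200);
    probe(7, 500);                   /* two matches */
    probe(8, 501);                   /* no match    */
    return 0;
}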


Electronics ◽  
2021 ◽  
Vol 10 (4) ◽  
pp. 516
Author(s):  
Tram Thi Bao Nguyen ◽  
Tuy Nguyen Tan ◽  
Hanho Lee

This paper presents a pipelined layered quasi-cyclic low-density parity-check (QC-LDPC) decoder architecture targeting low complexity, high throughput, and efficient use of hardware resources, compliant with the specifications of the 5G new radio (NR) wireless communication standard. First, a combined min-sum (CMS) decoding algorithm, which is a combination of the offset min-sum and the original min-sum algorithms, is proposed. Then, a low-complexity and high-throughput pipelined layered QC-LDPC decoder architecture for the enhanced mobile broadband specifications in the 5G NR wireless standard, based on the CMS algorithm with pipelined layered scheduling, is presented. Enhanced versions of check node-based processor architectures are proposed to reduce the complexity of the LDPC decoders. An efficient minimum-finder for the check node unit architecture that reduces the hardware required for the computation of the first two minima is introduced. Moreover, a low-complexity a posteriori information update unit architecture, which requires only one adder array for its operations, is presented. The proposed architecture shows significant improvements in terms of area and throughput compared to other QC-LDPC decoder architectures available in the literature.
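For reference, the check-node update at the heart of min-sum decoding needs only the two smallest input magnitudes, which is what a hardware minimum-finder computes. The following C sketch shows that single-pass two-minima update with an offset; the check-node degree, offset value, and floating-point LLRs are illustrative assumptions, and the exact combined min-sum (CMS) rule of the paper is not reproduced.

/* Sketch of a min-sum check-node update that tracks only the two
 * smallest input magnitudes (min1, min2) in a single pass. The
 * offset value and LLR type are illustrative assumptions; the
 * paper's combined min-sum (CMS) rule is not reproduced exactly. */
#include <stdio.h>
#include <math.h>

#define DEG 6                         /* check-node degree (example) */

void check_node_update(const float llr_in[DEG], float llr_out[DEG],
                       float offset)
{
    float min1 = INFINITY, min2 = INFINITY;
    int   idx_min1 = -1;
    int   sign_prod = 1;

    /* One pass: overall sign and the two smallest magnitudes. */
    for (int i = 0; i < DEG; i++) {
        float mag = fabsf(llr_in[i]);
        if (llr_in[i] < 0.0f) sign_prod = -sign_prod;
        if (mag < min1) { min2 = min1; min1 = mag; idx_min1 = i; }
        else if (mag < min2) { min2 = mag; }
    }

    /* Each output uses min2 for the edge that supplied min1,
     * min1 otherwise; apply the offset and the extrinsic sign. */
    for (int i = 0; i < DEG; i++) {
        float mag = (i == idx_min1) ? min2 : min1;
        mag -= offset;
        if (mag < 0.0f) mag = 0.0f;
        int sign = (llr_in[i] < 0.0f) ? -sign_prod : sign_prod;
        llr_out[i] = (float)sign * mag;
    }
}

int main(void)
{
    float in[DEG]  = { 1.5f, -0.4f, 2.0f, -0.9f, 3.1f, 0.7f };
    float out[DEG];
    check_node_update(in, out, 0.15f);
    for (int i = 0; i < DEG; i++)
        printf("msg[%d] = %+.2f\n", i, out[i]);
    return 0;
}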


Electronics ◽  
2021 ◽  
Vol 10 (4) ◽  
pp. 469
Author(s):  
Hyun Woo Oh ◽  
Ji Kwang Kim ◽  
Gwan Beom Hwang ◽  
Seung Eun Lee

Recently, advances in technology have enabled embedded systems to be adopted for a variety of applications. Some of these applications require real-time 2D graphics processing while running under tight design constraints such as low power consumption and a small area. An effective way to satisfy these constraints is to include a dedicated 2D graphics accelerator in the embedded system. The accelerator reduces the workload of the system processor and allows the system to perform 2D graphics processing in real time, so a variety of applications that require 2D graphics processing can be implemented with an embedded processor. In this paper, we present a 2D graphics accelerator for tiny embedded systems. The accelerator includes an optimized line-drawing operation based on Bresenham’s algorithm. The optimized operation enables the accelerator to handle various kinds of 2D graphics processing and to perform line drawing in place of the system processor. Moreover, the accelerator further reduces the load on the processor core by removing the need for the core to access the frame buffer memory. We measure the performance of the accelerator by implementing the processor, including the accelerator, on a field-programmable gate array (FPGA), and we verify the feasibility of realization by synthesizing the design in a 180 nm CMOS process.
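The line-drawing operation is based on Bresenham's algorithm, whose core uses only integer additions and comparisons. The following C sketch shows the standard all-octant form; the plot() function is a hypothetical stand-in for the accelerator's frame-buffer write.

/* Integer-only Bresenham line drawing (all octants).  plot() is a
 * hypothetical stand-in for the accelerator's frame-buffer write. */
#include <stdio.h>
#include <stdlib.h>

static void plot(int x, int y)
{
    printf("(%d,%d)\n", x, y);        /* would set a pixel in hardware */
}

void draw_line(int x0, int y0, int x1, int y1)
{
    int dx = abs(x1 - x0), sx = (x0 < x1) ? 1 : -1;
    int dy = -abs(y1 - y0), sy = (y0 < y1) ? 1 : -1;
    int err = dx + dy;                /* decision/error term */

    for (;;) {
        plot(x0, y0);
        if (x0 == x1 && y0 == y1) break;
        int e2 = 2 * err;
        if (e2 >= dy) { err += dy; x0 += sx; }   /* step in x */
        if (e2 <= dx) { err += dx; y0 += sy; }   /* step in y */
    }
}

int main(void)
{
    draw_line(0, 0, 7, 3);            /* shallow line in the first octant */
    return 0;
}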


2014 ◽  
Vol 686 ◽  
pp. 126-131
Author(s):  
Xiao Yan Sha

Taking an embedded processor as the core control unit, this paper designs the hardware and software of a fan monitoring system that detects the working condition of the fans and controls them in real time. For the control algorithm, the paper analyzes the theory and composition of fuzzy control systems and then, considering the particular characteristics of tunnel ventilation, introduces a feed-forward model that predicts the increment of pollutants in order to reduce control lag. Combining this prediction with the system feedback value and the set value, two independent fuzzy controllers compute the result, which ultimately determines how many jet fans in the tunnel to start or stop. Simulation analysis shows that introducing the feed-forward signal more effectively improves the system's ability to reject disturbances.
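The control structure combines a feedback decision on the deviation from the set value with a feed-forward term driven by the predicted pollutant increment. The following C sketch is a heavily simplified illustration of that structure; the three-level fuzzification thresholds, the rule table, and the feed-forward condition are illustrative assumptions rather than the paper's tuned fuzzy controllers.

/* Heavily simplified sketch of the control structure described above:
 * a feedback fuzzy-style decision on (error, error change) combined
 * with a feed-forward term from the predicted pollutant increment.
 * All thresholds, the 3x3 rule table, and the feed-forward condition
 * are illustrative assumptions, not the paper's tuned controllers. */
#include <stdio.h>

static int classify(float v, float neg, float pos)  /* crude 3-level fuzzifier */
{
    if (v < neg) return 0;            /* negative       */
    if (v > pos) return 2;            /* positive       */
    return 1;                         /* roughly zero   */
}

/* Rule table: rows = error class, cols = error-change class,
 * value = change in the number of running jet fans. */
static const int rules[3][3] = {
    { -2, -1,  0 },                   /* pollution well below set point */
    { -1,  0, +1 },
    {  0, +1, +2 },                   /* pollution well above set point */
};

int fan_increment(float setpoint, float feedback, float prev_feedback,
                  float predicted_rise)
{
    float err  = feedback - setpoint;
    float derr = feedback - prev_feedback;
    int fb_term = rules[classify(err, -0.1f, 0.1f)]
                       [classify(derr, -0.02f, 0.02f)];
    int ff_term = (predicted_rise > 0.05f) ? 1 : 0;   /* feed-forward boost */
    return fb_term + ff_term;
}

int main(void)
{
    /* Pollutant level above target and rising, with more predicted inflow. */
    printf("delta fans = %+d\n", fan_increment(0.5f, 0.62f, 0.58f, 0.08f));
    return 0;
}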


2014 ◽  
Vol 543-547 ◽  
pp. 2873-2878
Author(s):  
Hui Yong Li ◽  
Hong Xu Jiang ◽  
Ping Zhang ◽  
Han Qing Li ◽  
Qian Cao

Modern embedded portable devices usually have to deal with large amounts of video data. Because of its many floating-point multiplications, color space conversion is inefficient on an embedded processor. Considering the characteristics of RGB-to-YCbCr color space conversion, this paper proposes a truncated LUT multiplier (T-LUT multiplier). On this basis, an approach for converting RGB to YCbCr that employs the T-LUT multiplier and a pipeline-based adder is presented. Experimental results demonstrate that the proposed method achieves a maximum operating frequency of 358 MHz, 3.5 times faster than the direct method. Furthermore, its power consumption is approximately 15% to 27% lower than that of the general method.
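The idea behind a LUT multiplier is to replace each constant-coefficient multiplication with a table lookup, so a pixel conversion reduces to lookups and additions. The following C sketch illustrates that idea for RGB-to-YCbCr with full-precision fixed-point tables (BT.601 coefficients assumed); the paper's truncation of the LUT entries and its pipeline-based adder are not reproduced.

/* Sketch of LUT-based RGB -> YCbCr conversion: each coefficient*value
 * product is precomputed into a 256-entry table, so the per-pixel work
 * is lookups and additions only.  The tables here hold full-precision
 * fixed-point products; the paper's truncated LUT multiplier further
 * trims these entries, which is not reproduced. */
#include <stdio.h>
#include <stdint.h>

#define FRAC 16                       /* Q16 fixed-point fraction bits */

static int32_t lut_yr[256], lut_yg[256], lut_yb[256];
static int32_t lut_cbr[256], lut_cbg[256], lut_cbb[256];
static int32_t lut_crr[256], lut_crg[256], lut_crb[256];

static void init_luts(void)           /* BT.601 full-range coefficients */
{
    for (int v = 0; v < 256; v++) {
        lut_yr[v]  = (int32_t)( 0.299  * v * (1 << FRAC));
        lut_yg[v]  = (int32_t)( 0.587  * v * (1 << FRAC));
        lut_yb[v]  = (int32_t)( 0.114  * v * (1 << FRAC));
        lut_cbr[v] = (int32_t)(-0.1687 * v * (1 << FRAC));
        lut_cbg[v] = (int32_t)(-0.3313 * v * (1 << FRAC));
        lut_cbb[v] = (int32_t)( 0.5    * v * (1 << FRAC));
        lut_crr[v] = (int32_t)( 0.5    * v * (1 << FRAC));
        lut_crg[v] = (int32_t)(-0.4187 * v * (1 << FRAC));
        lut_crb[v] = (int32_t)(-0.0813 * v * (1 << FRAC));
    }
}

static void rgb_to_ycbcr(uint8_t r, uint8_t g, uint8_t b,
                         uint8_t *y, uint8_t *cb, uint8_t *cr)
{
    /* Per-pixel: nine lookups, additions, and a final shift. */
    *y  = (uint8_t)((lut_yr[r]  + lut_yg[g]  + lut_yb[b]) >> FRAC);
    *cb = (uint8_t)((lut_cbr[r] + lut_cbg[g] + lut_cbb[b]
                     + (128 << FRAC)) >> FRAC);
    *cr = (uint8_t)((lut_crr[r] + lut_crg[g] + lut_crb[b]
                     + (128 << FRAC)) >> FRAC);
}

int main(void)
{
    uint8_t y, cb, cr;
    init_luts();
    rgb_to_ycbcr(255, 128, 0, &y, &cb, &cr);   /* orange test pixel */
    printf("Y=%u Cb=%u Cr=%u\n", y, cb, cr);
    return 0;
}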

