Survey of the counterflow pipeline processor architectures

Author(s):  
P. Balaji ◽  
W. Mahmoud ◽  
E. Ososanya ◽  
K. Thangarajan
2021 ◽  
Author(s):  
Bashar Romanous ◽  
Skyler Windh ◽  
Ildar Absalyamov ◽  
Prerna Budhkar ◽  
Robert Halstead ◽  
...  

The join and group-by aggregation are two memory-intensive operators that affect the performance of relational databases. Hashing is a common approach used to implement both operators. Recent paradigm shifts in multi-core processor architectures have reinvigorated research into how the join and group-by aggregation operators can leverage these advances. However, the poor spatial locality of the hashing approach has hindered performance on multi-core processor architectures, which rely on large cache hierarchies for latency mitigation. Multithreaded architectures can better cope with poor spatial locality by masking memory latency with many outstanding requests. Nevertheless, the number of parallel threads, even in the most advanced multithreaded processors such as UltraSPARC, is not enough to fully cover the main memory access latency. In this paper, we explore the hardware re-configurability of FPGAs to enable deeper execution pipelines that maintain hundreds (instead of tens) of outstanding memory requests across four FPGAs, drastically increasing concurrency and throughput. We present two end-to-end in-memory accelerators for the join and group-by aggregation operators using FPGAs. Both accelerators use massive multithreading to mask the long memory delays of traversing linked-list data structures, while concurrently managing hundreds of thread states across four FPGAs locally. We explore how content addressable memories (CAMs) can be intermixed within our multithreaded designs to act as a synchronizing cache, which enforces locks and merges jobs together before they are written to memory. Throughput results for our hash-join operator accelerator show a speedup between 2× and 3.4× over the best multi-core approaches with comparable memory bandwidths on uniform and skewed datasets. The accelerator for the hash-based group-by aggregation operator demonstrates that leveraging CAMs achieves an average speedup of 3.3×, with a best case of 9.4×, in terms of throughput over CPU implementations across five types of data distributions.
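
For reference, the data structure both accelerators traverse is a chained (linked-list) hash table: every hop down a bucket chain is a dependent memory access, which is exactly the latency the FPGA designs hide by keeping hundreds of probes in flight. The single-threaded C++ sketch below only illustrates that structure; all names (Node, HashTable, insert, probe) are hypothetical and not taken from the paper. A typical usage would build the table from the smaller relation and then stream the larger relation through probe.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical node of a chained (linked-list) hash bucket.
struct Node {
    uint64_t key;      // join key
    uint64_t payload;  // row id or aggregate slot
    int32_t  next;     // index of next node in the chain, -1 if none
};

struct HashTable {
    std::vector<int32_t> buckets;  // head index per bucket, -1 if empty
    std::vector<Node>    nodes;    // node pool

    explicit HashTable(std::size_t nBuckets) : buckets(nBuckets, -1) {}

    // Build side: prepend a tuple to its bucket chain.
    void insert(uint64_t key, uint64_t payload) {
        std::size_t b = key % buckets.size();
        nodes.push_back({key, payload, buckets[b]});
        buckets[b] = static_cast<int32_t>(nodes.size() - 1);
    }

    // Probe side: each chain hop is a dependent memory access; the FPGA
    // design masks this latency with hundreds of concurrent probe threads.
    template <typename Emit>
    void probe(uint64_t key, Emit emit) const {
        for (int32_t i = buckets[key % buckets.size()]; i != -1; i = nodes[i].next)
            if (nodes[i].key == key) emit(nodes[i].payload);
    }
};
```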


Electronics ◽  
2021 ◽  
Vol 10 (4) ◽  
pp. 516
Author(s):  
Tram Thi Bao Nguyen ◽  
Tuy Nguyen Tan ◽  
Hanho Lee

This paper presents a pipelined layered quasi-cyclic low-density parity-check (QC-LDPC) decoder architecture targeting low complexity, high throughput, and efficient use of hardware resources, compliant with the specifications of the 5G new radio (NR) wireless communication standard. First, a combined min-sum (CMS) decoding algorithm, which combines the offset min-sum and the original min-sum algorithms, is proposed. Then, a low-complexity, high-throughput pipelined layered QC-LDPC decoder architecture for the enhanced mobile broadband specifications in the 5G NR wireless standard, based on the CMS algorithm with pipelined layered scheduling, is presented. Enhanced versions of check node-based processor architectures are proposed to reduce the complexity of the LDPC decoders. An efficient minimum-finder for the check node unit architecture that reduces the hardware required to compute the first two minima is introduced. Moreover, a low-complexity a posteriori information update unit architecture, which requires only one adder array for its operations, is presented. The proposed architecture shows significant improvements in area and throughput compared to other QC-LDPC decoder architectures available in the literature.
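
For context, the check node unit in min-sum (and offset min-sum) decoding must locate the smallest and second-smallest input magnitudes; the paper's contribution is a reduced-hardware circuit for that search. The hypothetical C++ reference below shows only the computation itself, not the proposed minimum-finder architecture.

```cpp
#include <cstddef>
#include <cstdlib>
#include <limits>
#include <vector>

// Result of the "first two minima" search performed by a check node unit.
struct TwoMin {
    int         min1;  // smallest |LLR| among the inputs
    int         min2;  // second-smallest |LLR|
    std::size_t idx1;  // position of min1 (that edge is sent min2 instead)
};

// Software reference for the values the check node unit must produce.
TwoMin findTwoMinima(const std::vector<int>& llr) {
    TwoMin r{std::numeric_limits<int>::max(),
             std::numeric_limits<int>::max(), 0};
    for (std::size_t i = 0; i < llr.size(); ++i) {
        int mag = std::abs(llr[i]);
        if (mag < r.min1) {         // new smallest: old smallest becomes min2
            r.min2 = r.min1;
            r.min1 = mag;
            r.idx1 = i;
        } else if (mag < r.min2) {  // new second-smallest
            r.min2 = mag;
        }
    }
    return r;
}
```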


Author(s):  
Nikolaos Zombakis ◽  
Yahya H. Yassin ◽  
Michail Noltsis ◽  
Dimitrios Soudris ◽  
Per Gunnar Kjeldsberg ◽  
...  

2019 ◽  
Vol 26 (1) ◽  
pp. 39-62
Author(s):  
Stanislav O. Bezzubtsev ◽  
Vyacheslav V. Vasin ◽  
Dmitry Yu. Volkanov ◽  
Shynar R. Zhailauova ◽  
Vladislav A. Miroshnik ◽  
...  

The paper proposes the architecture and basic requirements for a network processor for OpenFlow switches in software-defined networks. An analysis of two well-known network processor architectures is presented: NP-5 from EZchip (now Mellanox) and Tofino from Barefoot Networks. The advantages and disadvantages of two different versions of network processor architectures are considered: a pipeline-based architecture whose stages are represented by a set of general-purpose processor cores, and a pipeline-based architecture whose stages correspond to cores specialized for specific packet-processing operations. Based on a dedicated set of the most common use-case scenarios, a new architecture of the network processor unit (NPU) with functionally specialized pipeline stages is proposed. The article presents a description of the simulation model of the NPU of the proposed architecture. The simulation model of the network processor is implemented in C++ using SystemC, an open-source C++ library. For the functional testing of the obtained NPU model, the described use-case scenarios were implemented in C. To evaluate the performance of the proposed NPU architecture, a set of software products developed by the KM211 company and the KMX32 family of microcontrollers was used. NPU performance was evaluated on the basis of the simulation model. Estimates of the per-packet processing time and the average throughput of the NPU model are obtained for each scenario.
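
The proposed NPU assigns each pipeline stage to a core specialized for one packet-processing operation (for example, header parsing, table lookup, action execution). As a rough illustration only, the plain C++ sketch below models a packet flowing through such functionally specialized stages; the paper's actual simulation model is written in C++ with SystemC, and all names here (Packet, Stage, the stage bodies) are hypothetical.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <vector>

// Hypothetical, much-simplified packet descriptor.
struct Packet {
    uint32_t dstAddr = 0;
    uint32_t outPort = 0;
    bool     dropped = false;
};

// A functionally specialized stage transforms the packet and passes it on,
// analogous to the specialized cores forming the proposed NPU pipeline.
using Stage = std::function<void(Packet&)>;

int main() {
    std::vector<Stage> pipeline = {
        [](Packet& p) { /* parse stage: extract header fields */ p.dstAddr = 0x0A000001; },
        [](Packet& p) { /* match stage: flow-table lookup (stubbed) */ p.outPort = (p.dstAddr & 0xFF) % 4; },
        [](Packet& p) { /* action stage: apply OpenFlow action */ if (p.outPort == 0) p.dropped = true; },
    };

    Packet pkt;
    for (auto& stage : pipeline) stage(pkt);  // packet flows stage by stage

    std::cout << "out port " << pkt.outPort
              << (pkt.dropped ? " (dropped)" : "") << "\n";
    return 0;
}
```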


Author(s):  
Khushi Gupta ◽  
Tushar Sharma

In the modern world, we use microprocessors based on either the ARM or the x86 architecture, the two most common processor architectures. ARM originally stood for 'Acorn RISC Machines' but over the years changed to 'Advanced RISC Machines'. It started as just an experiment but showed promising results, and it is now omnipresent in modern devices. Unlike x86, which is designed for high performance, ARM focuses on low power consumption with considerable performance. Because of advances in ARM technology, ARM processors are becoming more powerful than their x86 counterparts. In this analysis we briefly compare the two architectures and conclude which is likely to dominate the microprocessor industry. The processor that performs better across different tests will be more suitable for the reader to use in their application. The industry's shift towards ARM processors may change how we write software, which in turn will affect the whole software development environment.

