processor architectures
Recently Published Documents


TOTAL DOCUMENTS

293
(FIVE YEARS 32)

H-INDEX

20
(FIVE YEARS 3)

Electronics ◽  
2022 ◽  
Vol 11 (2) ◽  
pp. 215
Author(s):  
Quentin Berthet ◽  
Joachim Schmidt ◽  
Andres Upegui

Nowadays, one of the main challenges in computer architectures is scalability; indeed, novel processor architectures can include thousands of processing elements on a single chip and using them efficiently remains a big issue. An interesting source of inspiration for handling scalability is the mammalian brain and different works on neuromorphic computation have attempted to address this question. The Self-configurable 3D Cellular Adaptive Platform (SCALP) has been designed with the goal of prototyping such types of systems and has led to the proposal of the Cellular Self-Organizing Maps (CSOM) algorithm. In this paper, we present a hardware architecture for CSOM in the form of interconnected neural units with the specific property of supporting an asynchronous deployment on a multi-FPGA 3D array. The Asynchronous CSOM (ACSOM) algorithm exploits the underlying Network-on-Chip structure to be provided by SCALP in order to overcome the multi-path propagation issue presented by a straightforward CSOM implementation. We explore its behaviour under different map topologies and scalar representations. The results suggest that a larger network size with low precision coding obtains an optimal ratio between algorithm accuracy and FPGA resources.


2021 ◽  
Vol 5 (1) ◽  
Author(s):  
Domenico Giordano ◽  
Manfred Alef ◽  
Luca Atzori ◽  
Jean-Michel Barbet ◽  
Olga Datskova ◽  
...  

The HEPiX Benchmarking Working Group has developed a framework to benchmark the performance of a computational server using the software applications of the High Energy Physics (HEP) community. This framework consists of two main components, named HEP-Workloads and HEPscore. HEP-Workloads is a collection of standalone production applications provided by a number of HEP experiments. HEPscore is designed to run HEP-Workloads and provide an overall measurement that is representative of the computing power of a system. HEPscore is able to measure the performance of systems with different processor architectures and accelerators. The framework is completed by the HEP Benchmark Suite that simplifies the process of executing HEPscore and other benchmarks such as HEP-SPEC06, SPEC CPU 2017, and DB12. This paper describes the motivation, the design choices, and the results achieved by the HEPiX Benchmarking Working Group. A perspective on future plans is also presented.


Sensors ◽  
2021 ◽  
Vol 21 (22) ◽  
pp. 7771
Author(s):  
Jinjae Lee ◽  
Derry Pratama ◽  
Minjae Kim ◽  
Howon Kim ◽  
Donghyun Kwon

Commodity processor architectures are releasing various instruction set extensions to support security solutions for the efficient mitigation of memory vulnerabilities. Among them, tagged memory extension (TME), such as ARM MTE and SPARC ADI, can prevent unauthorized memory access by utilizing tagged memory. However, our analysis found that TME has performance and security issues in practical use. To alleviate these, in this paper, we propose CoMeT, a new instruction set extension for tagged memory. The key idea behind CoMeT is not only to check whether the tag values in the address tag and memory tag are matched, but also to check the access permissions for each tag value. We implemented the prototype of CoMeT on the RISC-V platform. Our evaluation results confirm that CoMeT can be utilized to efficiently implement well-known security solutions, i.e., shadow stack and in-process isolation, without compromising security.
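
The two-step check described in this abstract (tag match first, then a per-tag permission lookup) can be illustrated with a small software model. This is only a sketch under stated assumptions, not the CoMeT instruction set itself: the 16-byte granule, the `Perm` flags, the `TaggedMemory` class, and treating untagged memory as tag 0 are all hypothetical choices made for illustration.

```python
from enum import Flag, auto


class Perm(Flag):
    """Hypothetical per-tag access permissions (not taken from the paper)."""
    NONE = 0
    READ = auto()
    WRITE = auto()


GRANULE = 16  # assumed tagging granularity in bytes


class TaggedMemory:
    """Toy model: a tag per memory granule plus a permission set per tag value."""

    def __init__(self):
        self.memory_tags = {}  # granule index -> tag value
        self.tag_perms = {}    # tag value -> Perm

    def set_tag(self, addr, size, tag):
        """Tag every granule overlapped by [addr, addr + size)."""
        first = addr // GRANULE
        last = (addr + size - 1) // GRANULE
        for granule in range(first, last + 1):
            self.memory_tags[granule] = tag

    def set_perm(self, tag, perm):
        """Attach a permission set to a tag value."""
        self.tag_perms[tag] = perm

    def check(self, ptr_tag, addr, access):
        """Return True only if the tags match AND the tag permits the access."""
        mem_tag = self.memory_tags.get(addr // GRANULE, 0)  # untagged -> 0 (assumed)
        if ptr_tag != mem_tag:
            return False  # plain tagged-memory check: address tag vs memory tag
        allowed = self.tag_perms.get(ptr_tag, Perm.NONE)
        return bool(allowed & access)  # additional per-tag permission check


mem = TaggedMemory()
mem.set_tag(0x1000, 32, tag=3)
mem.set_perm(3, Perm.READ)  # e.g. a region ordinary code may read but not write
print(mem.check(3, 0x1000, Perm.READ))   # tags match, read permitted
print(mem.check(3, 0x1000, Perm.WRITE))  # tags match, write not permitted
print(mem.check(2, 0x1000, Perm.READ))   # tag mismatch
```

In this model, a read-only tag is the kind of building block a shadow stack could use; how CoMeT actually encodes tags and permissions in hardware is not specified in the abstract.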


Author(s):  
Khushi Gupta ◽  
Tushar Sharma

Most microprocessors in use today are based on one of the two most common processor architectures, ARM or x86. ARM originally stood for ‘Acorn RISC Machine’ and later came to stand for ‘Advanced RISC Machines’. It began as an experiment, showed promising results, and is now omnipresent in modern devices. Whereas x86 is designed for high performance, ARM focuses on low power consumption while still delivering considerable performance. Because of advances in ARM technology, ARM processors are becoming more powerful than their x86 counterparts. In this analysis we briefly compare the two architectures and conclude which one will dominate the microprocessor industry. The processor that performs better across the different tests will be the more suitable choice for the reader's application. The industry's shift towards ARM processors may change how we write software, which in turn will affect the whole software development environment.


2021 ◽  
Vol 18 (3) ◽  
pp. 1-23
Author(s):  
Wim Heirman ◽  
Stijn Eyerman ◽  
Kristof Du Bois ◽  
Ibrahim Hur

Sparse memory accesses, which are scattered accesses to single elements of a large data structure, are a challenge for current processor architectures. Their lack of spatial and temporal locality and their irregularity make caches and traditional stream prefetchers useless. Furthermore, performing standard caching and prefetching on sparse accesses wastes precious memory bandwidth and thrashes caches, deteriorating performance for regular accesses. Bypassing prefetchers and caches for sparse accesses, and fetching only a single element (e.g., 8 B) from main memory (subline access), can solve these issues. Deciding which accesses to handle as sparse accesses and which as regular cached accesses is a challenging task with a large potential impact on performance. Treating sparse accesses as regular accesses reduces performance, but so does not caching accesses that do have locality, because it significantly increases their latency and bandwidth consumption. Furthermore, this decision depends on the dynamic environment, such as input set characteristics and system load, making a static decision by the programmer or compiler suboptimal. We propose the Instruction Spatial Locality Estimator (ISLE), a hardware detector that finds instructions that access isolated words in a sea of unused data. These sparse accesses are dynamically converted into uncached subline accesses, while regular accesses remain cached. ISLE does not require modifying source code or binaries, and adapts automatically to a changing environment (input data, available bandwidth, etc.). We apply ISLE to a graph analytics processor running sparse graph workloads, and show that ISLE outperforms a baseline with no subline accesses, manual sublining, and prior work on detecting sparse accesses.


Electronics ◽  
2021 ◽  
Vol 10 (4) ◽  
pp. 516
Author(s):  
Tram Thi Bao Nguyen ◽  
Tuy Nguyen Tan ◽  
Hanho Lee

This paper presents a pipelined layered quasi-cyclic low-density parity-check (QC-LDPC) decoder architecture targeting low complexity, high throughput, and efficient use of hardware resources, compliant with the specifications of the 5G new radio (NR) wireless communication standard. First, a combined min-sum (CMS) decoding algorithm, which is a combination of the offset min-sum and the original min-sum algorithm, is proposed. Then, a low-complexity and high-throughput pipelined layered QC-LDPC decoder architecture for enhanced mobile broadband specifications in 5G NR wireless standards, based on the CMS algorithm with pipelined layered scheduling, is presented. Enhanced versions of check node-based processor architectures are proposed to reduce the complexity of the LDPC decoders. An efficient minimum-finder for the check node unit architecture that reduces the hardware required for the computation of the first two minima is introduced. Moreover, a low-complexity a posteriori information update unit architecture, which requires only one adder array for its operations, is presented. The proposed architecture shows significant improvements in terms of area and throughput compared to other QC-LDPC decoder architectures available in the literature.
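
The role of the first two minima in a min-sum check node can be shown with a short software sketch. It implements the textbook min-sum check-node update with an optional offset; the abstract does not say how CMS combines the original and offset variants, so the `offset` parameter here is only an assumption that lets one function cover both (`offset=0.0` gives the original min-sum). The function names are hypothetical, and this is a software model, not the hardware minimum-finder from the paper.

```python
def two_minima(magnitudes):
    """Return (smallest, second smallest, index of smallest) in one pass."""
    min1 = min2 = float("inf")
    idx = -1
    for i, m in enumerate(magnitudes):
        if m < min1:
            min1, min2, idx = m, min1, i
        elif m < min2:
            min2 = m
    return min1, min2, idx


def check_node_update(msgs, offset=0.0):
    """Min-sum check-node update for one check node.

    msgs: incoming variable-to-check messages (LLRs), at least two of them.
    offset: 0.0 gives plain min-sum; a positive value gives offset min-sum.
    """
    min1, min2, idx = two_minima([abs(m) for m in msgs])

    # Product of the signs of all incoming messages.
    total_sign = 1
    for m in msgs:
        if m < 0:
            total_sign = -total_sign

    out = []
    for i, m in enumerate(msgs):
        # Excluding edge i from the minimum: only the edge that holds the
        # smallest magnitude sees the second smallest one.
        mag = min2 if i == idx else min1
        mag = max(mag - offset, 0.0)
        # Dividing out edge i's own sign equals multiplying by it.
        sign = total_sign * (-1 if m < 0 else 1)
        out.append(sign * mag)
    return out


print(check_node_update([2.0, -0.5, 1.5, -3.0]))
print(check_node_update([2.0, -0.5, 1.5, -3.0], offset=0.25))
```

Because every outgoing magnitude is either the first or the second minimum, the check node only needs those two values and one index, which is why a cheaper minimum-finder translates directly into hardware savings.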


2021 ◽  
Author(s):  
Bashar Romanous ◽  
Skyler Windh ◽  
Ildar Absalyamov ◽  
Prerna Budhkar ◽  
Robert Halstead ◽  
...  

The join and group-by aggregation are two memory-intensive operators that affect the performance of relational databases. Hashing is a common approach used to implement both operators. Recent paradigm shifts in multi-core processor architectures have reinvigorated research into how the join and group-by aggregation operators can leverage these advances. However, the poor spatial locality of the hashing approach has hindered performance on multi-core processor architectures, which rely on large cache hierarchies for latency mitigation. Multithreaded architectures can better cope with poor spatial locality by masking memory latency with many outstanding requests. Nevertheless, the number of parallel threads, even in the most advanced multithreaded processors, such as UltraSPARC, is not enough to fully cover the main memory access latency. In this paper, we explore the hardware re-configurability of FPGAs to enable deeper execution pipelines that maintain hundreds (instead of tens) of outstanding memory requests across four FPGAs, drastically increasing concurrency and throughput. We present two end-to-end in-memory accelerators for the join and group-by aggregation operators using FPGAs. Both accelerators use massive multithreading to mask long memory delays of traversing linked-list data structures, while concurrently managing hundreds of thread states across four FPGAs locally. We explore how content addressable memories (CAMs) can be intermixed within our multithreaded designs to act as a synchronizing cache, which enforces locks and merges jobs together before they are written to memory. Throughput results for our hash-join operator accelerator show a speedup between 2× and 3.4× over the best multi-core approaches with comparable memory bandwidths on uniform and skewed datasets. The accelerator for the hash-based group-by aggregation operator demonstrates that leveraging CAMs achieves an average speedup of 3.3×, with a best case of 9.4×, in terms of throughput over CPU implementations across five types of data distributions.
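
For readers unfamiliar with the two operators, the following is a minimal software sketch of a hash join (build and probe phases) and a hash-based group-by aggregation. It illustrates only the generic hashing approach the abstract refers to, not the FPGA accelerators or their CAM-based synchronization; the function names and the row-as-dictionary representation are assumptions made for illustration.

```python
from collections import defaultdict


def hash_join(build_rows, probe_rows, build_key, probe_key):
    """Equi-join two lists of dict rows using a hash table on the build side."""
    # Build phase: hash every build-side row by its join key.
    table = defaultdict(list)
    for row in build_rows:
        table[row[build_key]].append(row)

    # Probe phase: each probe row walks the bucket for its key. These
    # data-dependent lookups are the source of poor spatial locality.
    joined = []
    for row in probe_rows:
        for match in table.get(row[probe_key], ()):
            joined.append({**match, **row})
    return joined


def hash_group_by_sum(rows, key, value):
    """Group rows by `key` and sum `value` within each group."""
    sums = defaultdict(int)
    for row in rows:
        sums[row[key]] += row[value]
    return dict(sums)


customers = [{"cid": 1, "name": "Ada"}, {"cid": 2, "name": "Bo"}]
orders = [
    {"order_cid": 1, "amount": 10},
    {"order_cid": 1, "amount": 5},
    {"order_cid": 3, "amount": 7},
]
print(hash_join(customers, orders, "cid", "order_cid"))
print(hash_group_by_sum(orders, "order_cid", "amount"))
```

Both loops perform one effectively random memory access per input row, which is the access pattern that large cache hierarchies handle poorly and that many outstanding memory requests are meant to hide.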

