Efficient local locking for massively multithreaded in-memory hash-based operators

AbstractThe join and group-by aggregation are two memory intensive operators that are affecting the performance of relational databases. Hashing is a common approach used to implement both operators. Recent paradigm shifts in multi-core processor architectures have reinvigorated research into how the join and group-by aggregation operators can leverage these advances. However, the poor spatial locality of the hashing approach has hindered performance on multi-core processor architectures which rely on using large cache hierarchies for latency mitigation. Multithreaded architectures can better cope with poor spatial locality by masking memory latency with many outstanding requests. Nevertheless, the number of parallel threads, even in the most advanced multithreaded processors, such as UltraSPARC, is not enough to fully cover the main memory access latency. In this paper, we explore the hardware re-configurability of FPGAs to enable deeper execution pipelines that maintain hundreds (instead of tens) of outstanding memory requests across four FPGAs-drastically increasing concurrency and throughput. We present two end-to-end in-memory accelerators for the join and group-by aggregation operators using FPGAs. Both accelerators use massive multithreading to mask long memory delays of traversing linked-list data structures, while concurrently managing hundreds of thread states across four FPGAs locally. We explore how content addressable memories can be intermixed within our multithreaded designs to act as a synchronizing cache, which enforces locks and merges jobs together before they are written to memory. Throughput results for our hash-join operator accelerator show a speedup between 2$$\times $$ × and 3.4$$\times $$ × over the best multi-core approaches with comparable memory bandwidths on uniform and skewed datasets. The accelerator for the hash-based group-by aggregation operator demonstrates that leveraging CAMs achieves average speedup of 3.3$$\times $$ × with a best case of 9.4$$\times $$ × in terms of throughput over CPU implementations across five types of data distributions.

Download Full-text

Automatic Sublining for Efficient Sparse Memory Accesses

ACM Transactions on Architecture and Code Optimization ◽

10.1145/3452141 ◽

2021 ◽

Vol 18 (3) ◽

pp. 1-23

Author(s):

Wim Heirman ◽

Stijn Eyerman ◽

Kristof Du Bois ◽

Ibrahim Hur

Keyword(s):

Dynamic Environment ◽

Large Data ◽

Main Memory ◽

Single Element ◽

Graph Analytics ◽

Available Bandwidth ◽

Processor Architectures ◽

Spatial Locality ◽

Potential Impact ◽

Memory Accesses

Sparse memory accesses, which are scattered accesses to single elements of a large data structure, are a challenge for current processor architectures. Their lack of spatial and temporal locality and their irregularity makes caches and traditional stream prefetchers useless. Furthermore, performing standard caching and prefetching on sparse accesses wastes precious memory bandwidth and thrashes caches, deteriorating performance for regular accesses. Bypassing prefetchers and caches for sparse accesses, and fetching only a single element (e.g., 8 B) from main memory (subline access), can solve these issues. Deciding which accesses to handle as sparse accesses and which as regular cached accesses, is a challenging task, with a large potential impact on performance. Not only is performance reduced by treating sparse accesses as regular accesses, not caching accesses that do have locality also negatively impacts performance by significantly increasing their latency and bandwidth consumption. Furthermore, this decision depends on the dynamic environment, such as input set characteristics and system load, making a static decision by the programmer or compiler suboptimal. We propose the Instruction Spatial Locality Estimator ( ISLE ), a hardware detector that finds instructions that access isolated words in a sea of unused data. These sparse accesses are dynamically converted into uncached subline accesses, while keeping regular accesses cached. ISLE does not require modifying source code or binaries, and adapts automatically to a changing environment (input data, available bandwidth, etc.). We apply ISLE to a graph analytics processor running sparse graph workloads, and show that ISLE outperforms the performance of no subline accesses, manual sublining, and prior work on detecting sparse accesses.

Download Full-text

An Energy-Efficient DRAM Cache Architecture for Mobile Platforms With PCM-Based Main Memory

ACM Transactions on Embedded Computing Systems ◽

10.1145/3451995 ◽

2022 ◽

Vol 21 (1) ◽

pp. 1-22

Author(s):

Dongsuk Shin ◽

Hakbeom Jang ◽

Kiseok Oh ◽

Jae W. Lee

Keyword(s):

Energy Consumption ◽

Main Memory ◽

Battery Life ◽

Mobile Platforms ◽

Total Energy Consumption ◽

Efficient Manner ◽

Hybrid Memory ◽

Spatial Locality ◽

Cache Architecture ◽

Energy Delay Product

A long battery life is a first-class design objective for mobile devices, and main memory accounts for a major portion of total energy consumption. Moreover, the energy consumption from memory is expected to increase further with ever-growing demands for bandwidth and capacity. A hybrid memory system with both DRAM and PCM can be an attractive solution to provide additional capacity and reduce standby energy. Although providing much greater density than DRAM, PCM has longer access latency and limited write endurance to make it challenging to architect it for main memory. To address this challenge, this article introduces CAMP, a novel DRAM c ache a rchitecture for m obile platforms with P CM-based main memory. A DRAM cache in this environment is required to filter most of the writes to PCM to increase its lifetime, and deliver highest efficiency even for a relatively small-sized DRAM cache that mobile platforms can afford. To address this CAMP divides DRAM space into two regions: a page cache for exploiting spatial locality in a bandwidth-efficient manner and a dirty block buffer for maximally filtering writes. CAMP improves the performance and energy-delay-product by 29.2% and 45.2%, respectively, over the baseline PCM-oblivious DRAM cache, while increasing PCM lifetime by 2.7×. And CAMP also improves the performance and energy-delay-product by 29.3% and 41.5%, respectively, over the state-of-the-art design with dirty block buffer, while increasing PCM lifetime by 2.5×.

Download Full-text

NCOR: An FPGA-Friendly Nonblocking Data Cache for Soft Processors with Runahead Execution

International Journal of Reconfigurable Computing ◽

10.1155/2012/915178 ◽

2012 ◽

Vol 2012 ◽

pp. 1-12 ◽

Cited By ~ 2

Author(s):

Kaveh Aasaraai ◽

Andreas Moshovos

Keyword(s):

High Efficiency ◽

Main Memory ◽

Data Cache ◽

Improve Performance ◽

Data Caches ◽

Content Addressable Memories ◽

Processor Designs ◽

Level Parallelism ◽

Order Execution ◽

Runahead Execution

Soft processors often use data caches to reduce the gap between processor and main memory speeds. To achieve high efficiency, simple, blocking caches are used. Such caches are not appropriate for processor designs such as Runahead and out-of-order execution that require nonblocking caches to tolerate main memory latencies. Instead, these processors use non-blocking caches to extract memory level parallelism and improve performance. However, conventional non-blocking cache designs are expensive and slow on FPGAs as they use content-addressable memories (CAMs). This work proposes NCOR, an FPGA-friendly non-blocking cache that exploits the key properties of Runahead execution. NCOR does not require CAMs and utilizes smart cache controllers. A 4 KB NCOR operates at 329 MHz on Stratix III FPGAs while it uses only 270 logic elements. A 32 KB NCOR operates at 278 Mhz and uses 269 logic elements.

Download Full-text

EMPLOYING MULTI-CORE PROCESSOR ARCHITECTURES TO ACCELERATE JAVA CRYPTOGRAPHY EXTENSIONS

Proceedings of the 7th International Conference on Web Information Systems and Technologies ◽

10.5220/0003339000050012 ◽

2011 ◽

Keyword(s):

Processor Architectures ◽

Multi Core Processor

Download Full-text

A Study of Trace-driven Simulation for Multi-core Processor Architectures

The Journal of the Institute of Webcasting, Internet and Telecommunication ◽

10.7236/jiwit.2012.12.3.9 ◽

2012 ◽

Vol 12 (3) ◽

pp. 9-13 ◽

Cited By ~ 2

Author(s):

Jong-Bok Lee

Keyword(s):

Processor Architectures ◽

Multi Core Processor

Download Full-text

Main-Memory Hash Joins on Modern Processor Architectures

IEEE Transactions on Knowledge and Data Engineering ◽

10.1109/tkde.2014.2313874 ◽

2015 ◽

Vol 27 (7) ◽

pp. 1754-1766 ◽

Cited By ~ 24

Author(s):

Cagri Balkesen ◽

Jens Teubner ◽

Gustavo Alonso ◽

M. Tamer ozsu

Keyword(s):

Main Memory ◽

Processor Architectures

Download Full-text

A Study On Statistical Simulation for Asymmetric Multi-Core Processor Architectures

The Journal of The Institute of Internet Broadcasting and Communication ◽

10.7236/jiibc.2016.16.2.157 ◽

2016 ◽

Vol 16 (2) ◽

pp. 157-163

Author(s):

Jongbok Lee ◽

Keyword(s):

Statistical Simulation ◽

Processor Architectures ◽

Multi Core Processor

Download Full-text

Cache coherency controller for MESI protocol based on FPGA

International Journal of Electrical and Computer Engineering (IJECE) ◽

10.11591/ijece.v11i2.pp1043-1052 ◽

2021 ◽

Vol 11 (2) ◽

pp. 1043

Author(s):

Mays K. Faeq ◽

Safaa S. Omran

Keyword(s):

Integrated Circuit ◽

High Speed ◽

Main Memory ◽

Test Results ◽

Description Language ◽

Cache Consistency ◽

Hardware Description ◽

Cache Coherency ◽

Protocol Test ◽

Multi Core Processor

In modern techniques of building processors, manufactures using more than one processor in the integrated circuit (chip) and each processor called a core. The new chips of processors called a multi-core processor. This new design makes the processors to work simultanously for more than one job or all the cores working in parallel for the same job. All cores are similar in their design, and each core has its own cache memory, while all cores shares the same main memory. So if one core requestes a block of data from main memory to its cache, there should be a protocol to declare the situation of this block in the main memory and other cores.This is called the cache coherency or cache consistency of multi-core. In this paper a special circuit is designed using very high speed integrated circuit hardware description language (VHDL) coding and implemented using ISE Xilinx software. The protocol used in this design is the modified, exclusive, shared and invalid (MESI) protocol. Test results were taken by using test bench, and showed all the states of the protocol are working correctly.

Download Full-text