Predictable dynamic instruction scratchpad for simultaneous multithreaded processors

Author(s):  
Stefan Metzlaff ◽  
Sascha Uhrig ◽  
Jörg Mische ◽  
Theo Ungerer

2021 ◽
Author(s):  
Bashar Romanous ◽  
Skyler Windh ◽  
Ildar Absalyamov ◽  
Prerna Budhkar ◽  
Robert Halstead ◽  
...  

Abstract: The join and group-by aggregation are two memory-intensive operators that affect the performance of relational databases. Hashing is a common approach used to implement both operators. Recent paradigm shifts in multi-core processor architectures have reinvigorated research into how the join and group-by aggregation operators can leverage these advances. However, the poor spatial locality of the hashing approach has hindered performance on multi-core processor architectures, which rely on large cache hierarchies for latency mitigation. Multithreaded architectures can better cope with poor spatial locality by masking memory latency with many outstanding requests. Nevertheless, the number of parallel threads, even in the most advanced multithreaded processors such as the UltraSPARC, is not enough to fully cover the main memory access latency. In this paper, we explore the hardware re-configurability of FPGAs to enable deeper execution pipelines that maintain hundreds (instead of tens) of outstanding memory requests across four FPGAs, drastically increasing concurrency and throughput. We present two end-to-end in-memory accelerators for the join and group-by aggregation operators using FPGAs. Both accelerators use massive multithreading to mask the long memory delays of traversing linked-list data structures, while concurrently managing hundreds of thread states across four FPGAs locally. We explore how content-addressable memories (CAMs) can be intermixed within our multithreaded designs to act as a synchronizing cache, which enforces locks and merges jobs together before they are written to memory. Throughput results for our hash-join accelerator show a speedup between 2× and 3.4× over the best multi-core approaches with comparable memory bandwidths on uniform and skewed datasets. The accelerator for the hash-based group-by aggregation operator demonstrates that leveraging CAMs achieves an average speedup of 3.3×, with a best case of 9.4×, in terms of throughput over CPU implementations across five types of data distributions.
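A minimal Python sketch (not the authors' FPGA design) of the merge-before-memory idea the abstract describes: aggregation updates land in a small synchronizing cache that merges hits to the same key, and only evicted entries traverse the chained (linked-list) hash buckets. The bucket count, cache capacity, and LRU eviction policy are illustrative assumptions, not details from the paper.

# Hypothetical sketch: hash-based group-by aggregation with a small
# "synchronizing cache" that merges updates to hot keys before they
# are flushed into a chained hash table, loosely mirroring the
# CAM-based merging described in the abstract.
from collections import OrderedDict

NUM_BUCKETS = 8
CACHE_CAPACITY = 4  # assumed CAM size; the paper's sizing is not given here

# Hash table: each bucket is a chained list of [key, aggregate] nodes.
buckets = [[] for _ in range(NUM_BUCKETS)]

# Synchronizing cache: merges concurrent updates to the same key.
sync_cache = OrderedDict()

def flush(key, value):
    """Write a merged (key, sum) update into the chained hash table."""
    chain = buckets[hash(key) % NUM_BUCKETS]
    for node in chain:          # linked-list traversal: the long-latency
        if node[0] == key:      # step that multithreading masks in hardware
            node[1] += value
            return
    chain.append([key, value])  # key not found: extend the chain

def aggregate(key, value):
    """Merge an update in the cache; evict the LRU entry when full."""
    if key in sync_cache:
        sync_cache[key] += value      # merge-before-memory: saves a round trip
        sync_cache.move_to_end(key)
    else:
        if len(sync_cache) >= CACHE_CAPACITY:
            old_key, old_val = sync_cache.popitem(last=False)
            flush(old_key, old_val)
        sync_cache[key] = value

# Example: group-by SUM over a small, skewed key stream.
for k, v in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("a", 5)]:
    aggregate(k, v)
for k, v in sync_cache.items():       # drain remaining cached aggregates
    flush(k, v)
print(buckets)

In software this merging only saves dictionary lookups; the point of the hardware CAM is that the same merge happens without a main-memory round trip, which is where the reported speedups come from.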


2014 ◽  
Author(s):  
Jian Fu ◽  
Qiang Yang ◽  
Raphael Poss ◽  
Chris R. Jesshope ◽  
Chunyuan Zhang

2019 ◽  
Vol 29 (03) ◽  
pp. 1950013
Author(s):  
Shane Carroll ◽  
Wei-Ming Lin

In this paper, we propose a machine learning algorithm to control instruction fetch bandwidth in a simultaneous multithreaded CPU. In a simultaneous multithreaded CPU, multiple threads occupy pools of shared hardware resources in the same clock cycle. Under some conditions, one or more threads may undergo a period of inefficiency, e.g., a cache miss, thereby using shared resources inefficiently and degrading the performance of other threads. If these inefficiencies can be identified at runtime, the offending thread can be temporarily blocked from fetching new instructions into the pipeline, giving it time to recover and preventing shared system resources from being wasted on a stalled thread. We propose a machine learning approach to determine when a thread should be blocked from fetching new instructions. The model is trained offline and its parameters are embedded in the CPU, where they can be queried with runtime statistics to determine whether a thread is running inefficiently and should be temporarily blocked from fetching. We propose two models: a simple linear model and a higher-capacity neural network. We test each model in a simulation environment and show that average system performance can increase by up to 19% with a feasible implementation of the proposed algorithm.
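A minimal Python sketch of the linear-model variant of this idea, under assumed inputs: a logistic score computed from per-thread runtime counters gates fetch when it crosses a threshold. The feature names (l2_miss_rate, rob_occupancy, iq_stall_frac), weights, bias, and threshold are invented for illustration; the abstract does not specify the actual feature set or parameters.

# Illustrative sketch of offline-trained linear fetch gating: score a
# thread's runtime counters and block its instruction fetch when the
# score says it is likely stalled. All parameter values are hypothetical.
import math

# Offline-trained parameters (hypothetical values).
WEIGHTS = {"l2_miss_rate": 2.5, "rob_occupancy": 1.2, "iq_stall_frac": 1.8}
BIAS = -2.0
THRESHOLD = 0.5

def inefficiency_score(stats):
    """Logistic score in [0, 1]: likelihood the thread is running inefficiently."""
    z = BIAS + sum(WEIGHTS[f] * stats[f] for f in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))

def should_block_fetch(stats):
    """Gate a thread's instruction fetch when it looks inefficient."""
    return inefficiency_score(stats) > THRESHOLD

# Example: a thread with a high L2 miss rate and near-full back-end resources.
thread_stats = {"l2_miss_rate": 0.9, "rob_occupancy": 0.95, "iq_stall_frac": 0.8}
print(should_block_fetch(thread_stats))  # True: temporarily stop fetching

Because the model is evaluated per thread per cycle, a linear scorer like this is cheap enough to realize as a small dot-product-and-compare circuit, which is presumably why the paper pairs it with the higher-capacity neural network as the two design points.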

