Request, Coalesce, Serve, and Forget: Miss-Optimized Memory Systems for Bandwidth-Bound Cache-Unfriendly Applications on FPGAs

Applications such as large-scale sparse linear algebra and graph analytics are challenging to accelerate on FPGAs because their short, irregular memory accesses result in low cache hit rates. Nonblocking caches reduce the bandwidth required by misses by requesting each cache line only once, even when multiple misses correspond to it. However, such a reuse mechanism is traditionally implemented using an associative lookup, which limits the number of misses considered for reuse to a few tens at most. In this article, we present an efficient pipeline that can process and store thousands of outstanding misses in cuckoo hash tables in on-chip SRAM with minimal stalls. Because outstanding misses do not need a data array, this brings the same bandwidth advantage as a larger cache for a fraction of the area budget and can significantly speed up irregular, memory-bound, latency-insensitive applications. In addition, we extend nonblocking caches to generate variable-length bursts to memory, which increases the bandwidth delivered by DRAMs and their controllers. The resulting miss-optimized memory system provides up to 25% speedup with 24× area reduction on 15 large sparse matrix-vector multiplication benchmarks evaluated on an embedded and a datacenter FPGA system.
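
The miss-coalescing idea in the abstract above can be illustrated with a small software sketch. This is not the authors' hardware pipeline: a Python dict stands in for the cuckoo hash tables, and the function name `coalesce_misses` and the 64-byte line size are assumptions made only for illustration.

```python
LINE_BYTES = 64  # assumed cache-line size, not taken from the article


def coalesce_misses(addresses, line_bytes=LINE_BYTES):
    """Group outstanding miss addresses by cache line.

    Returns the list of line requests sent to memory (one per distinct
    line, in first-miss order) and a table mapping each line to the
    byte offsets of all misses waiting on it.
    """
    pending = {}          # line index -> offsets waiting on that line
    memory_requests = []  # lines actually requested from memory
    for addr in addresses:
        line = addr // line_bytes
        if line not in pending:
            # First miss on this line: allocate an entry and request it once.
            pending[line] = []
            memory_requests.append(line)
        # Every miss, first or later, is recorded so it can be served
        # when the single line response returns.
        pending[line].append(addr % line_bytes)
    return memory_requests, pending


misses = [0, 8, 64, 16, 72, 4096]
requests, pending = coalesce_misses(misses)
print(len(misses), "misses ->", len(requests), "line requests")
```

In this toy trace, six misses touch only three distinct lines, so only three requests would reach memory. The more outstanding misses the table can hold, the more of this reuse it can capture.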

Optimising Sparse Matrix Vector multiplication for large scale FEM problems on FPGA

2016 26th International Conference on Field Programmable Logic and Applications (FPL)

10.1109/fpl.2016.7577352

2016

Cited By ~ 8

Author(s): Paul Grigoras, Pavel Burovskiy, Wayne Luk, Spencer Sherwin

Keyword(s): Large Scale, Sparse Matrix, Matrix Vector Multiplication, Matrix Vector

Precorrected-FFT Accelerated Singular Boundary Method for Large-Scale Three-Dimensional Potential Problems

Communications in Computational Physics

10.4208/cicp.oa-2016-0075

2017, Vol 22 (2), pp. 460-472

Cited By ~ 13

Author(s): Weiwei Li, Wen Chen, Zhuojia Fu

Keyword(s): Computational Complexity, Large Scale, Three Dimensional, Iteration Step, Singular Boundary, Matrix Vector Multiplication, Boundary Method, Potential Problems, Speed Up, Singular Boundary Method

This study makes the first attempt to accelerate the singular boundary method (SBM) with the precorrected-FFT (PFFT) method for large-scale three-dimensional potential problems. The SBM with the GMRES solver requires O(N^2) computational complexity, where N is the number of unknowns. To speed up the SBM, the PFFT is employed to accelerate the SBM matrix-vector multiplication at each iteration step of the GMRES. Consequently, the computational complexity can be reduced to O(N log N). Several numerical examples are presented to validate the developed PFFT-accelerated SBM (PFFT-SBM) scheme, and the results are compared with those of the SBM without the PFFT and with the analytical solutions. The results show that the PFFT-SBM is efficient and well suited to large-scale 3D potential problems.
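
Why an FFT can cut the cost of a matrix-vector product is easiest to see on a circulant matrix, where the product is a circular convolution. The sketch below is only that textbook special case, not the PFFT-SBM scheme itself, which involves additional steps not described in the abstract. The function names are hypothetical.

```python
import cmath


def fft(values, inverse=False):
    """Recursive radix-2 FFT (unnormalized); length must be a power of two."""
    n = len(values)
    if n == 1:
        return list(values)
    even = fft(values[0::2], inverse)
    odd = fft(values[1::2], inverse)
    sign = 1 if inverse else -1
    out = [0j] * n
    for k in range(n // 2):
        twiddle = cmath.exp(sign * 2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + n // 2] = even[k] - twiddle
    return out


def circulant_matvec_direct(first_col, x):
    """Dense product with the circulant matrix C[i][j] = first_col[(i - j) % n]."""
    n = len(x)
    return [sum(first_col[(i - j) % n] * x[j] for j in range(n)) for i in range(n)]


def circulant_matvec_fft(first_col, x):
    """Same product via the convolution theorem: three FFTs and a pointwise multiply."""
    n = len(x)
    spectrum = [a * b for a, b in zip(fft(first_col), fft(x))]
    # The inverse transform above is unnormalized, so divide by n here.
    return [v.real / n for v in fft(spectrum, inverse=True)]


c = [4.0, 1.0, 0.0, 1.0]
x = [1.0, 2.0, 3.0, 4.0]
direct = circulant_matvec_direct(c, x)
fast = circulant_matvec_fft(c, x)
print(direct)
print([round(v, 9) for v in fast])
```

The direct product touches all N^2 matrix entries, while the FFT route needs only O(N log N) operations. Both give the same result up to floating-point rounding.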
