A Novel Procedure for Implementing a Turbo Decoder on a GPU with Coalesced Memory Access

Author(s): Heungseop Ahn, Seungwon Choi
2019, Vol 9 (5), pp. 947
Author(s): Thaha Muhammed, Rashid Mehmood, Aiiad Albeshri, Iyad Katib

Sparse matrix-vector (SpMV) multiplication is a vital building block for numerous scientific and engineering applications. This paper proposes SURAA (which translates to speed in Arabic), a novel method for SpMV computations on graphics processing units (GPUs). The novelty lies in the way we group matrix rows into segments and adaptively schedule those segments onto different types of kernels. The sparse matrix data structure is created by sorting the rows of the matrix by the number of nonzero elements per row (npr) and forming equal-sized segments, each containing rows with approximately equal npr, using the Freedman–Diaconis rule. The segments are assembled into three groups based on their mean npr. For each group, we use multiple kernels to execute the group's segments on different streams, so the number of threads used to execute each segment is chosen adaptively. Dynamic Parallelism, available on Nvidia GPUs, is used to execute the group containing the segments with the largest mean npr, providing improved load balancing and coalesced memory access, and hence more efficient SpMV computations. SURAA thus minimizes the adverse effects of npr variance by distributing the load uniformly across equal-sized segments. We implement the SURAA method as a tool and compare its performance with the de facto best commercial (cuSPARSE) and open-source (CUSP, MAGMA) tools on widely used benchmarks comprising 26 high-npr-variance matrices from 13 diverse domains. SURAA outperforms the other tools, delivering a 13.99x speedup on average. We believe that our approach provides a fundamental shift in addressing SpMV-related challenges on GPUs, including coalesced memory access, thread divergence, and load balancing, and opens new avenues for further improving SpMV performance.
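A minimal sketch of the preprocessing step this abstract describes, assuming a SciPy CSR matrix as input. The group thresholds and the way the Freedman–Diaconis bin width is turned into a segment count are illustrative guesses, since the abstract does not specify them:

```python
import numpy as np
from scipy.sparse import random as sparse_random

def suraa_like_segments(csr, group_thresholds=(32, 256)):
    """Sort rows by nonzeros-per-row (npr), cut the sorted order into
    equal-sized segments whose count follows the Freedman-Diaconis rule,
    and bucket segments into three groups by mean npr. The thresholds
    are hypothetical placeholders, not the paper's values."""
    npr = np.diff(csr.indptr)              # nonzeros per row, from CSR row pointers
    order = np.argsort(npr)                # row indices sorted by npr
    n = npr.size

    # Freedman-Diaconis bin width: 2*IQR / n^(1/3)
    q75, q25 = np.percentile(npr, [75, 25])
    width = 2.0 * (q75 - q25) / np.cbrt(n)
    span = float(npr.max() - npr.min())
    n_segments = int(np.ceil(span / width)) if width > 0 else 1
    n_segments = min(max(n_segments, 1), n)      # keep segments non-empty

    groups = {"small": [], "medium": [], "large": []}
    for seg in np.array_split(order, n_segments):  # equal-sized row segments
        mean_npr = npr[seg].mean()
        if mean_npr < group_thresholds[0]:
            groups["small"].append(seg)   # e.g. thread-per-row kernel
        elif mean_npr < group_thresholds[1]:
            groups["medium"].append(seg)  # e.g. warp-per-row kernel
        else:
            groups["large"].append(seg)   # e.g. launched via Dynamic Parallelism
    return groups

A = sparse_random(10_000, 10_000, density=1e-3, format="csr", random_state=0)
print({name: len(segs) for name, segs in suraa_like_segments(A).items()})
```

In SURAA proper, each group is then dispatched to differently shaped CUDA kernels on separate streams, with Dynamic Parallelism handling the largest-npr group; that launch logic is beyond this sketch.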


2013, Vol 41 (3), pp. 380-391
Author(s): Young Hoon Son, O. Seongil, Yuhwan Ro, Jae W. Lee, Jung Ho Ahn

Author(s): Aleix Roca Nonell, Balazs Gerofi, Leonardo Bautista-Gomez, Dominique Martinet, Vicenç Beltran Querol, ...

Electronics, 2021, Vol 10 (12), pp. 1454
Author(s): Yoshihiro Sugiura, Toru Tanzawa

This paper describes how one can reduce the memory access time with pre-emphasis (PE) pulses even in non-volatile random-access memory. Optimum PE pulse widths and the resultant minimum word-line (WL) delay times are investigated as a function of column address. The impact of process variation in the WL time constant, the cell current, and the resistance of the deciding path on the optimum PE pulses is discussed. Optimum PE pulse widths and the resultant minimum WL delay times are modeled with fitting curves as a function of the column address of the accessed memory cell, which lets designers set the optimum timing for WL and BL (bit-line) operations and thereby reduce the average memory access time.
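The delay-versus-pulse-width trade-off the abstract describes can be illustrated with a toy model. The sketch below assumes a single-pole (lumped) RC word line whose time constant grows with column address, a PE overdrive level of 1.5x VDD, and a ±10% settling band; all constants are illustrative, and the paper's distributed-RC and process-variation analysis is not reproduced:

```python
import numpy as np

def wl_settle_time(t_pe, col, v_pe=1.5, v_dd=1.0, tau0=1e-9, band=0.1):
    """Time for the far-end WL node to settle within +/-band of v_dd when
    driven at v_pe for t_pe seconds, then at v_dd (lumped-RC assumption)."""
    tau = tau0 * (1.0 + col / 64.0)        # illustrative growth with column address
    lower, upper = (1.0 - band) * v_dd, (1.0 + band) * v_dd
    v_end = v_pe * (1.0 - np.exp(-t_pe / tau))   # WL voltage when the PE pulse ends
    if v_end > upper:                      # overdriven: relax back down into the band
        return t_pe + tau * np.log((v_end - v_dd) / (upper - v_dd))
    if v_end >= lower:                     # band entered during the pulse, never left
        return -tau * np.log(1.0 - lower / v_pe)
    # under-driven: keep charging toward v_dd after the pulse ends
    return t_pe + tau * np.log((v_dd - v_end) / (v_dd - lower))

def optimum_pe_width(col, widths=np.linspace(0.0, 20e-9, 2001)):
    delays = [wl_settle_time(t, col) for t in widths]
    i = int(np.argmin(delays))
    return widths[i], delays[i]

# Fitting-curve step analogous to the paper's modeling: optimum width vs. column
cols = np.arange(0, 1024, 64)
opt_widths = [optimum_pe_width(c)[0] for c in cols]
coeffs = np.polyfit(cols, opt_widths, 2)   # quadratic form is an assumption
print(coeffs)
```

The closing polyfit step mirrors the paper's idea of modeling the optimum width as a fitting curve over column address, so a controller can look the timing up instead of searching at run time.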


Algorithms, 2021, Vol 14 (6), pp. 176
Author(s): Wei Zhu, Xiaoyang Zeng

Applications have different preferences for caches, sometimes even across their different running phases. Caches with fixed parameters may therefore compromise the performance of a system. To solve this problem, we propose a real-time adaptive reconfigurable cache based on the decision tree algorithm, which can optimize the average memory access time of the cache without modifying the cache coherence protocol. By monitoring the application's running state, the cache associativity is periodically tuned to the optimum value, as determined by the decision tree model. This paper implements the proposed decision tree-based adaptive reconfigurable cache in the gem5 simulator and designs the key modules using Verilog HDL. The simulation results show that the proposed cache reduces the average memory access time compared with other adaptive algorithms.
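A hedged sketch of the control loop this abstract implies: counters gathered over a fixed interval feed a decision tree whose leaves name an associativity, and the cache is reconfigured only when the answer changes. The features, thresholds, and the `ReconfigurableCache` interface below are all hypothetical stand-ins for the paper's trained model and gem5 modules:

```python
from dataclasses import dataclass

@dataclass
class PhaseStats:
    miss_rate: float            # misses / accesses over the last interval
    accesses_per_kcycle: float  # access intensity
    conflict_fraction: float    # share of misses that look like conflict misses

def choose_associativity(s: PhaseStats) -> int:
    """A tiny hand-written tree standing in for the trained decision tree."""
    if s.miss_rate < 0.02:                 # cache already effective
        return 2 if s.accesses_per_kcycle < 50 else 4
    if s.conflict_fraction > 0.5:          # conflict-dominated phase
        return 16
    return 8

class ReconfigurableCache:                 # hypothetical stand-in
    def __init__(self, ways=8): self.ways = ways
    def reconfigure(self, ways):           # remaps sets/ways only; no
        self.ways = ways                   # coherence messages involved

def tune(cache, phase_trace):
    """Periodically retune associativity from monitored per-interval stats."""
    for stats in phase_trace:              # one PhaseStats per interval
        ways = choose_associativity(stats)
        if ways != cache.ways:
            cache.reconfigure(ways)
        yield cache.ways

trace = [PhaseStats(0.01, 30, 0.1), PhaseStats(0.12, 120, 0.7), PhaseStats(0.05, 80, 0.2)]
print(list(tune(ReconfigurableCache(), trace)))   # -> [2, 16, 8]
```

Keeping reconfiguration local to the cache's indexing logic is one way to avoid touching the coherence protocol, consistent with the abstract's claim.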

