Subblock-Based BPE Scheme to Conquer Mismatch in Memory Access Pattern

Author(s):  
Bao-Feng Li ◽  
Yong Dou
Author(s):  
Yuto Nakano ◽  
Shinsaku Kiyomoto ◽  
Yutaka Miyake

Author(s):  
S. Arash Ostadzadeh ◽  
Roel J. Meeuws ◽  
Carlo Galuzzi ◽  
Koen Bertels

2021 ◽  
Vol 40 (2) ◽  
pp. 1-17
Author(s):  
Milan Jaroš ◽  
Lubomír Říha ◽  
Petr Strakoš ◽  
Matěj Špeťko

This article presents a solution to path tracing of massive scenes on multiple GPUs. Our approach analyzes the memory access pattern of a path tracer and defines how the scene data should be distributed across up to 16 GPUs with minimal effect on performance. The key concept is that the parts of the scene with the highest number of memory accesses are replicated on all GPUs. We propose two methods for maximizing the performance of path tracing when working with partially distributed scene data. Both methods work at the memory-management level, so path tracer data structures do not have to be redesigned, which makes our approach applicable to other path tracers with only minor changes to their code. As a proof of concept, we have enhanced the open-source Blender Cycles path tracer. The approach was validated on scenes of up to 169 GB. We show that only 1–5% of the scene data needs to be replicated to all machines for such large scenes. On smaller scenes, we have verified that the performance is very close to that of rendering a fully replicated scene. In terms of scalability, we have achieved a parallel efficiency of over 94% using up to 16 GPUs.
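
The replication idea in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `plan_distribution`, the chunk names, the "accesses per byte" ranking, and the round-robin placement of the remaining chunks are all assumptions made for illustration. The sketch replicates the hottest chunks on every GPU until a size budget (for example 5% of the scene) is used up, and spreads the rest across GPUs.

```python
def plan_distribution(chunks, num_gpus, replicate_fraction):
    """Split scene chunks into a replicated set and a distributed set.

    chunks: dict mapping chunk id -> (size_bytes, access_count), sizes > 0.
    Returns (replicated_ids, {chunk_id: gpu_index}).
    Ranking by accesses per byte is an assumption, not the paper's rule.
    """
    total_size = sum(size for size, _ in chunks.values())
    budget = total_size * replicate_fraction

    # Hottest chunks first; ties are broken by chunk id for determinism.
    order = sorted(chunks, key=lambda c: (-chunks[c][1] / chunks[c][0], c))

    replicated = []
    distributed = {}
    used = 0
    next_gpu = 0
    for chunk_id in order:
        size = chunks[chunk_id][0]
        if used + size <= budget:
            # Fits in the replication budget: every GPU gets a copy.
            replicated.append(chunk_id)
            used += size
        else:
            # Too large for the budget: store it on a single GPU (round-robin).
            distributed[chunk_id] = next_gpu
            next_gpu = (next_gpu + 1) % num_gpus
    return replicated, distributed


# Hypothetical scene: a small, heavily accessed structure and two large textures.
scene = {"bvh_top": (10, 1000), "tex_a": (500, 50), "tex_b": (490, 40)}
replicated, distributed = plan_distribution(scene, num_gpus=2, replicate_fraction=0.05)
```

With this hypothetical input, only the small, frequently accessed chunk fits in the 5% budget; the two large textures each end up on a single GPU.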


2010 ◽  
Vol 19 (07) ◽  
pp. 1435-1447
Author(s):  
Young-Su Kwon ◽  
Nak-Woong Eum

The programmability requirement in reconfigurable systems necessitates the integration of soft processors in FPGAs. The high memory bandwidth that media applications demand is a major performance bottleneck in soft processors. Although a parallel memory system is a viable way to handle the large number of memory transactions in media processors, memory access conflicts caused by multiple memory buses limit overall performance. We propose and evaluate a configurable memory address shuffler integrated into the memory access arbiter of the parallel memory system in a soft processor. The novel address shuffling algorithm profiles the memory access pattern of the application, produces an access conflict graph, relocates decomposed memory sub-pages based on that graph, and finally generates synthesizable code for the address shuffler. The address shuffler efficiently translates requested memory addresses into shuffled addresses so that fewer simultaneous accesses target the same physical memory block. The reconfigurability of the address shuffler enables adaptive address shuffling that depends on the memory access pattern of the application running on the soft processor. The configurable address shuffler removes 80% of access conflicts on average across the benchmarks, at a hardware overhead of 1592 LUTs, which is 14% of the LUT count of the processor core.
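
The profile, conflict graph, and relocation steps described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the published algorithm: the trace format (one tuple of sub-page ids per cycle), the function names, and the greedy placement heuristic are all hypothetical, and the final step of generating synthesizable shuffler code is omitted.

```python
from collections import defaultdict
from itertools import combinations


def build_conflict_graph(trace):
    """Count how often each pair of sub-pages is requested in the same cycle."""
    graph = defaultdict(int)
    for cycle in trace:
        for a, b in combinations(sorted(set(cycle)), 2):
            graph[(a, b)] += 1
    return graph


def assign_banks(trace, num_banks):
    """Greedily place each sub-page in the bank where it conflicts least.

    The greedy heuristic is an assumption chosen for brevity.
    """
    neighbors = defaultdict(dict)
    weight = defaultdict(int)
    for (a, b), w in build_conflict_graph(trace).items():
        neighbors[a][b] = w
        neighbors[b][a] = w
        weight[a] += w
        weight[b] += w

    # Place the most conflict-prone sub-pages first.
    pages = sorted({p for cycle in trace for p in cycle},
                   key=lambda p: (-weight[p], p))
    bank_of = {}
    for page in pages:
        cost = [0] * num_banks
        for other, w in neighbors[page].items():
            if other in bank_of:
                cost[bank_of[other]] += w
        bank_of[page] = cost.index(min(cost))
    return bank_of


def count_conflicts(trace, bank_of):
    """Count requests that collide with another request on the same bank."""
    conflicts = 0
    for cycle in trace:
        banks = [bank_of[p] for p in set(cycle)]
        conflicts += len(banks) - len(set(banks))
    return conflicts


# Hypothetical profile: each tuple lists the sub-pages requested in one cycle.
trace = [(0, 2), (0, 2), (1, 3), (0, 1)]
naive = {p: p % 2 for p in range(4)}      # simple interleaved mapping
shuffled = assign_banks(trace, num_banks=2)
```

On this toy trace the interleaved mapping places sub-pages 0 and 2 in the same bank even though they are requested together, whereas the shuffled mapping separates every pair that is accessed simultaneously.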

