Zero copy
Recently Published Documents

Total documents: 92 (last five years: 16)
H-index: 10 (last five years: 1)

2021, Vol. 6 (65), pp. 3419
Author(s): Dion Häfner, Filippo Vicentini

2021, Vol. 14 (11), pp. 2087-2100
Author(s): Seung Won Min, Kun Wu, Sitao Huang, Mert Hidayetoğlu, Jinjun Xiong, et al.

Graph Convolutional Networks (GCNs) are increasingly adopted in large-scale graph-based recommender systems. Training a GCN requires the minibatch generator to traverse the graph and sample the sparsely located neighboring nodes to obtain their features. Since real-world graphs often exceed the capacity of GPU memory, current GCN training systems keep the feature table in host memory and rely on the CPU to collect sparse features before sending them to the GPUs. This approach, however, puts tremendous pressure on host memory bandwidth and the CPU, because the CPU must (1) read sparse features from memory, (2) write them back to memory in a dense format, and (3) transfer them from memory to the GPUs. In this work, we propose a novel GPU-oriented data communication approach for GCN training in which GPU threads directly access sparse features in host memory through zero-copy accesses, with minimal CPU involvement. By removing the CPU gathering stage, our method significantly reduces host resource consumption and data access latency. We further present two techniques that let the GPU access host memory efficiently: (1) automatic data access address alignment to maximize PCIe packet efficiency, and (2) asynchronous zero-copy access and kernel execution to fully overlap data transfer with training. We incorporate our method into PyTorch and evaluate its effectiveness on several graphs with up to 111 million nodes and 1.6 billion edges. In a multi-GPU training setup, our method is 65-92% faster than the conventional data transfer method, and for graphs that fit in GPU memory it can even match the performance of all-in-GPU-memory training.
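As an illustration of the access pattern described above, the following CUDA sketch keeps the feature table in pinned host memory and lets a warp-per-node kernel read rows directly over PCIe through zero-copy accesses. This is not the authors' PyTorch integration; the kernel, constants, and buffer names (gather_rows, FEAT_DIM, and so on) are illustrative assumptions. Consecutive lanes read consecutive floats so the zero-copy loads stay aligned and coalesced.

// Minimal zero-copy gather sketch: the feature table never leaves host memory.
#include <cuda_runtime.h>
#include <cstdio>

#define FEAT_DIM  256        // feature width, assumed a multiple of 32
#define NUM_NODES (1 << 18)  // size of the host-resident feature table
#define BATCH     1024       // sampled nodes in one minibatch

// One warp per sampled node: consecutive lanes read consecutive floats, so
// each zero-copy read over PCIe is a full, aligned transaction.
__global__ void gather_rows(const float *host_feats, const int *node_ids,
                            float *out, int batch) {
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane = threadIdx.x % 32;
    if (warp >= batch) return;
    const float *src = host_feats + (size_t)node_ids[warp] * FEAT_DIM;
    float *dst = out + (size_t)warp * FEAT_DIM;
    for (int i = lane; i < FEAT_DIM; i += 32)
        dst[i] = src[i];     // direct read from host memory (zero-copy)
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float *h_feats;          // feature table stays in pinned host memory
    cudaHostAlloc((void **)&h_feats,
                  (size_t)NUM_NODES * FEAT_DIM * sizeof(float),
                  cudaHostAllocMapped);
    float *d_feats;          // device-visible alias of the same memory
    cudaHostGetDevicePointer((void **)&d_feats, h_feats, 0);

    int *d_ids; float *d_out;
    cudaMalloc((void **)&d_ids, BATCH * sizeof(int));
    cudaMalloc((void **)&d_out, (size_t)BATCH * FEAT_DIM * sizeof(float));
    cudaMemset(d_ids, 0, BATCH * sizeof(int));  // placeholder ids; a real
                                                // minibatch sampler fills these

    int threads = 256, warps_per_block = threads / 32;
    gather_rows<<<(BATCH + warps_per_block - 1) / warps_per_block, threads>>>(
        d_feats, d_ids, d_out, BATCH);
    cudaDeviceSynchronize();
    printf("gathered %d rows of width %d\n", BATCH, FEAT_DIM);

    cudaFree(d_out); cudaFree(d_ids); cudaFreeHost(h_feats);
    return 0;
}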


2021, Vol. 11 (2), pp. 24
Author(s): Mirco De Marchi, Francesco Lumpp, Enrico Martini, Michele Boldo, Stefano Aldegheri, et al.

Many modern programmable embedded devices contain CPUs and a GPU that share the same system memory on a single die. Such a unified memory architecture (UMA) allows programmers to implement different communication models between the CPU and the integrated GPU (iGPU). While the simpler model guarantees implicit synchronization at the cost of performance, the more advanced model uses the zero-copy paradigm to eliminate explicit data copies between CPU and iGPU, significantly improving performance and energy efficiency. The robot operating system (ROS), meanwhile, has become a de-facto reference standard for developing robotic applications: it enables application re-use and the easy integration of software blocks into complex cyber-physical systems. Although ROS compliance is strongly required for software portability and reuse, it can cause performance loss and forfeit the benefits of zero-copy communication. In this article we present efficient techniques to implement CPU-iGPU communication while guaranteeing compliance with the ROS standard. We show how the key features of each communication model are maintained and analyze the overhead introduced by ROS compliance.
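The zero-copy model on a UMA device can be sketched in CUDA as below, assuming a Jetson-class SoC where a mapped pinned allocation is the same physical DRAM for CPU and iGPU. The ROS publish/subscribe plumbing discussed in the article is omitted and the names are illustrative: the CPU produces data, the iGPU consumes it in place, and an explicit synchronization replaces the implicit copy of the simpler model.

// Minimal CPU-iGPU zero-copy sketch on a unified-memory (UMA) device.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *data, int n, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= k;           // iGPU works in place on the shared buffer
}

int main() {
    const int n = 1 << 20;
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float *buf;                        // one allocation visible to CPU and iGPU
    cudaHostAlloc((void **)&buf, n * sizeof(float), cudaHostAllocMapped);
    float *dev_view;                   // device-side view of the same memory
    cudaHostGetDevicePointer((void **)&dev_view, buf, 0);

    for (int i = 0; i < n; ++i) buf[i] = (float)i;       // producer (CPU)

    scale<<<(n + 255) / 256, 256>>>(dev_view, n, 2.0f);  // consumer (iGPU)
    cudaDeviceSynchronize();           // explicit sync replaces the implicit copy

    printf("buf[42] = %f\n", buf[42]); // CPU reads the result, still no copy
    cudaFreeHost(buf);
    return 0;
}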


Author(s): Johan Peltenburg, Jeroen van Straten, Matthijs Brobbel, Zaid Al-Ars, H. Peter Hofstee

As big data analytics systems squeeze out the last bits of performance from CPUs and GPUs, the next near-term, widely available alternative the industry is considering for higher performance in the data center and cloud is the FPGA accelerator. We discuss several challenges a developer faces when designing and integrating FPGA accelerators into big data analytics pipelines. On the software side, we observe complex run-time systems, hardware-unfriendly in-memory layouts of data sets, and (de)serialization overhead. On the hardware side, we observe a relative lack of platform-agnostic open-source tooling, a high design effort for data-structure-specific interfaces, and a high design effort for infrastructure. The open-source Fletcher framework addresses these challenges. It is built on top of Apache Arrow, which provides a common, hardware-friendly in-memory format that allows zero-copy communication of large tabular data and prevents (de)serialization overhead. Fletcher adds FPGA accelerators to the more than eleven software languages Arrow already supports. To deal with the hardware challenges, we present Arrow-specific components that provide easy-to-use, high-performance interfaces to accelerated kernels. The components are combined based on a generic architecture that is specialized for the application by the extensive infrastructure generation framework presented in this article. All generated hardware is vendor-agnostic, and software drivers add a platform-agnostic layer, allowing users to create portable implementations.
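The hardware-friendly nature of the Arrow format can be seen from its variable-length (utf8) column layout, which is just two contiguous buffers. The sketch below is plain host C++, not the Fletcher API; the struct and helper are illustrative. It shows why a consumer, whether a CPU, another process, or an FPGA DMA engine handed the buffer base addresses, can read rows in place with no (de)serialization step.

// Arrow-style variable-length column: offsets[i]..offsets[i+1] delimits row i
// inside one flat values buffer, so rows are read in place (zero-copy).
#include <cstdint>
#include <cstdio>
#include <vector>
#include <string>

struct Utf8Column {
    std::vector<int32_t> offsets;   // n + 1 entries
    std::vector<uint8_t> values;    // all characters back to back
};

Utf8Column build(const std::vector<std::string> &rows) {
    Utf8Column col;
    col.offsets.push_back(0);
    for (const auto &s : rows) {
        col.values.insert(col.values.end(), s.begin(), s.end());
        col.offsets.push_back((int32_t)col.values.size());
    }
    return col;
}

int main() {
    Utf8Column col = build({"zero", "copy", "tabular", "data"});
    // An accelerator would be handed exactly these two base pointers and
    // lengths; here the CPU reads row 2 in place, no copy, no decode.
    int i = 2;
    int32_t beg = col.offsets[i], end = col.offsets[i + 1];
    printf("row %d: %.*s\n", i, end - beg, (const char *)&col.values[beg]);
    return 0;
}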


Author(s): Hidetoshi Iwashita, Masahiro Nakao

Coarray features have been implemented in the Omni XcalableMP compiler using a source-to-source translator and layered runtime libraries. Three memory allocation methods for coarrays were implemented on top of the GASNet and MPI-3 communication libraries and the native Fujitsu interface. For coarray PUT/GET communication, algorithms using DMA (zero-copy) and buffering were introduced. The key techniques for achieving high performance were the non-blocking PUT communication implemented in the runtime library and the optimization of GET communication in the translator. The fundamental performance was evaluated and analyzed using the ping-pong benchmark and a modified version of it. The MPI version of the Himeno benchmark was ported to coarrays and modified to make full use of non-blocking PUT. In the evaluation, the non-blocking coarray version clearly outperformed both the original and the non-blocking MPI versions.
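As a rough sketch of how a contiguous coarray PUT might map onto the MPI-3 backend mentioned above (this is not the Omni XcalableMP runtime itself; the mapping and names are assumptions for illustration), the host-side code below hands the user buffer straight to MPI_Put inside a passive-target epoch, defers completion to a flush point to mimic the non-blocking PUT, and only then reads the received data. A strided PUT would instead go through a packing buffer, which is the buffering alternative the abstract contrasts with the zero-copy path.

/* Sketch of a contiguous coarray PUT ( buf(:)[right] = buf(:) ) over MPI-3 RMA. */
#include <mpi.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local[N], coarray[N];          /* coarray[] plays the remotely
                                             writable coarray image          */
    for (int i = 0; i < N; ++i) { local[i] = rank; coarray[i] = -1.0; }

    MPI_Win win;
    MPI_Win_create(coarray, N * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_lock_all(0, win);

    int right = (rank + 1) % size;
    /* Contiguous data: the user buffer goes straight to MPI_Put, no packing. */
    MPI_Put(local, N, MPI_DOUBLE, right, 0, N, MPI_DOUBLE, win);

    /* ... unrelated computation could overlap here (non-blocking PUT) ... */

    MPI_Win_flush(right, win);            /* complete the PUT at the target  */
    MPI_Barrier(MPI_COMM_WORLD);          /* everyone's PUT has landed       */
    MPI_Win_sync(win);                    /* make it visible to local loads  */

    printf("rank %d received %g from its left neighbor\n", rank, coarray[0]);

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}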


2020, Vol. 144, pp. 1-13
Author(s): Jahanzeb Maqbool Hashmi, Ching-Hsiang Chu, Sourav Chakraborty, Mohammadreza Bayatpour, Hari Subramoni, et al.

2020, Vol. 245, pp. 01011
Author(s): Rafał Dominik Krawczyk, Tommaso Colombo, Niko Neufeld, Flavio Pisani, Sébastien Valat

This paper evaluates the use of Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE) for the Run 3 LHCb event building at CERN. The data acquisition system of the detector will collect partial data from approximately 1000 separate detector streams, for a total estimated throughput of 32 terabits per second. Full events will be assembled for subsequent processing and data selection in the filtering farm of the online trigger. High-throughput transmission with up to 90% link utilization will be an essential feature of the system, and the data exchange mechanism must support zero-copy transmissions. In this work, RoCE, a high-throughput kernel-bypass Ethernet protocol, is benchmarked as a potential alternative to InfiniBand. A RoCE-based event-building network is presented and two implementations are considered: the first combined shallow-buffered and deep-buffered switches with flow control enabled, while the second used only deep-buffered devices and relied on their memory throughput and capacity. Feasibility tests were conducted with selected Ethernet switches, and memory bandwidth utilization was investigated and compared with InfiniBand. Relevant utilization and interoperability issues of RoCE flow control are detailed, along with lessons learned along the way.
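In practice, the zero-copy requirement means that event-fragment buffers are registered with the RoCE NIC so it can DMA them directly to the remote builder node, bypassing the kernel and intermediate socket copies. The minimal ibverbs sketch below is not the LHCb event-builder code: it shows only the registration step on the first available RDMA device, and queue-pair setup and the actual RDMA write are omitted.

/* Register an event-fragment buffer with a RoCE/InfiniBand NIC for zero-copy RDMA. */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA-capable device\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* An event-fragment buffer: ordinary user memory, pinned and mapped for
       the NIC by ibv_reg_mr so RDMA transfers bypass the kernel and CPU.     */
    size_t len = 1 << 20;
    void *frag = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, frag, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr) { perror("ibv_reg_mr"); return 1; }
    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);
    /* lkey/rkey would be exchanged with peers and used in ibv_post_send()
       work requests of type IBV_WR_RDMA_WRITE.                               */

    ibv_dereg_mr(mr);
    free(frag);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}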


IEEE Access, 2020, Vol. 8, pp. 59315-59325
Author(s): Hyunchan Park, Youngpil Kim, Seehwan Yoo
