Zero copy
Recently Published Documents

Total documents: 92 (last five years: 16)
H-index: 10 (last five years: 1)

2021, Vol. 6 (65), pp. 3419
Author(s): Dion Häfner, Filippo Vicentini

2021, Vol. 14 (11), pp. 2087-2100
Author(s): Seung Won Min, Kun Wu, Sitao Huang, Mert Hidayetoğlu, Jinjun Xiong, et al.

Graph Convolutional Networks (GCNs) are increasingly adopted in large-scale graph-based recommender systems. Training a GCN requires the minibatch generator to traverse the graph and sample the sparsely located neighboring nodes to obtain their features. Since real-world graphs often exceed the capacity of GPU memory, current GCN training systems keep the feature table in host memory and rely on the CPU to collect sparse features before sending them to the GPUs. This approach, however, puts tremendous pressure on host memory bandwidth and the CPU, because the CPU must (1) read sparse features from memory, (2) write them back to memory in a dense format, and (3) transfer them from memory to the GPUs. In this work, we propose a novel GPU-oriented data communication approach for GCN training in which GPU threads directly access sparse features in host memory through zero-copy accesses, with minimal CPU involvement. By removing the CPU gathering stage, our method significantly reduces host resource consumption and data access latency. We further present two techniques that let the GPU access host memory efficiently: (1) automatic data access address alignment to maximize PCIe packet efficiency, and (2) asynchronous zero-copy access and kernel execution to fully overlap data transfer with training. We incorporate our method into PyTorch and evaluate its effectiveness on several graphs with up to 111 million nodes and 1.6 billion edges. In a multi-GPU training setup, our method is 65-92% faster than the conventional data transfer method, and for graphs that fit in GPU memory it can even match the performance of all-in-GPU-memory training.
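As an illustration of the access pattern described above, the following CUDA sketch keeps the feature table in pinned host memory and lets a warp-per-node kernel read rows directly over PCIe through zero-copy accesses. This is not the authors' PyTorch integration; the kernel, constants, and buffer names (gather_rows, FEAT_DIM, and so on) are illustrative assumptions. Consecutive lanes read consecutive floats so the zero-copy loads stay aligned and coalesced.

// Minimal zero-copy gather sketch: the feature table never leaves host memory.
#include <cuda_runtime.h>
#include <cstdio>

#define FEAT_DIM  256        // feature width, assumed a multiple of 32
#define NUM_NODES (1 << 18)  // size of the host-resident feature table
#define BATCH     1024       // sampled nodes in one minibatch

// One warp per sampled node: consecutive lanes read consecutive floats, so
// each zero-copy read over PCIe is a full, aligned transaction.
__global__ void gather_rows(const float *host_feats, const int *node_ids,
                            float *out, int batch) {
    int warp = (blockIdx.x * blockDim.x + threadIdx.x) / 32;
    int lane = threadIdx.x % 32;
    if (warp >= batch) return;
    const float *src = host_feats + (size_t)node_ids[warp] * FEAT_DIM;
    float *dst = out + (size_t)warp * FEAT_DIM;
    for (int i = lane; i < FEAT_DIM; i += 32)
        dst[i] = src[i];     // direct read from host memory (zero-copy)
}

int main() {
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float *h_feats;          // feature table stays in pinned host memory
    cudaHostAlloc((void **)&h_feats,
                  (size_t)NUM_NODES * FEAT_DIM * sizeof(float),
                  cudaHostAllocMapped);
    float *d_feats;          // device-visible alias of the same memory
    cudaHostGetDevicePointer((void **)&d_feats, h_feats, 0);

    int *d_ids; float *d_out;
    cudaMalloc((void **)&d_ids, BATCH * sizeof(int));
    cudaMalloc((void **)&d_out, (size_t)BATCH * FEAT_DIM * sizeof(float));
    cudaMemset(d_ids, 0, BATCH * sizeof(int));  // placeholder ids; a real
                                                // minibatch sampler fills these

    int threads = 256, warps_per_block = threads / 32;
    gather_rows<<<(BATCH + warps_per_block - 1) / warps_per_block, threads>>>(
        d_feats, d_ids, d_out, BATCH);
    cudaDeviceSynchronize();
    printf("gathered %d rows of width %d\n", BATCH, FEAT_DIM);

    cudaFree(d_out); cudaFree(d_ids); cudaFreeHost(h_feats);
    return 0;
}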


2021, Vol. 11 (2), pp. 24
Author(s): Mirco De Marchi, Francesco Lumpp, Enrico Martini, Michele Boldo, Stefano Aldegheri, et al.

Many modern programmable embedded devices contain CPUs and a GPU that share the same system memory on a single die. Such a unified memory architecture (UMA) allows programmers to implement different communication models between the CPU and the integrated GPU (iGPU). While the simpler model guarantees implicit synchronization at the cost of performance, the more advanced model uses the zero-copy paradigm to eliminate explicit data copies between CPU and iGPU, significantly improving performance and energy efficiency. The robot operating system (ROS), meanwhile, has become a de-facto reference standard for developing robotic applications: it enables application re-use and the easy integration of software blocks into complex cyber-physical systems. Although ROS compliance is strongly required for software portability and reuse, it can cause performance loss and forfeit the benefits of zero-copy communication. In this article we present efficient techniques to implement CPU-iGPU communication while guaranteeing compliance with the ROS standard. We show how the key features of each communication model are maintained and analyze the overhead introduced by ROS compliance.
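The zero-copy model on a UMA device can be sketched in CUDA as below, assuming a Jetson-class SoC where a mapped pinned allocation is the same physical DRAM for CPU and iGPU. The ROS publish/subscribe plumbing discussed in the article is omitted and the names are illustrative: the CPU produces data, the iGPU consumes it in place, and an explicit synchronization replaces the implicit copy of the simpler model.

// Minimal CPU-iGPU zero-copy sketch on a unified-memory (UMA) device.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float *data, int n, float k) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= k;           // iGPU works in place on the shared buffer
}

int main() {
    const int n = 1 << 20;
    cudaSetDeviceFlags(cudaDeviceMapHost);

    float *buf;                        // one allocation visible to CPU and iGPU
    cudaHostAlloc((void **)&buf, n * sizeof(float), cudaHostAllocMapped);
    float *dev_view;                   // device-side view of the same memory
    cudaHostGetDevicePointer((void **)&dev_view, buf, 0);

    for (int i = 0; i < n; ++i) buf[i] = (float)i;       // producer (CPU)

    scale<<<(n + 255) / 256, 256>>>(dev_view, n, 2.0f);  // consumer (iGPU)
    cudaDeviceSynchronize();           // explicit sync replaces the implicit copy

    printf("buf[42] = %f\n", buf[42]); // CPU reads the result, still no copy
    cudaFreeHost(buf);
    return 0;
}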


Author(s): Johan Peltenburg, Jeroen van Straten, Matthijs Brobbel, Zaid Al-Ars, H. Peter Hofstee

As big data analytics systems squeeze out the last bits of performance from CPUs and GPUs, the next near-term, widely available alternative the industry is considering for higher performance in the data center and cloud is the FPGA accelerator. We discuss several challenges a developer faces when designing and integrating FPGA accelerators into big data analytics pipelines. On the software side, we observe complex run-time systems, hardware-unfriendly in-memory layouts of data sets, and (de)serialization overhead. On the hardware side, we observe a relative lack of platform-agnostic open-source tooling, a high design effort for data-structure-specific interfaces, and a high design effort for infrastructure. The open-source Fletcher framework addresses these challenges. It is built on top of Apache Arrow, which provides a common, hardware-friendly in-memory format that allows zero-copy communication of large tabular data and prevents (de)serialization overhead. Fletcher adds FPGA accelerators to the more than eleven software languages Arrow already supports. To deal with the hardware challenges, we present Arrow-specific components that provide easy-to-use, high-performance interfaces to accelerated kernels. The components are combined based on a generic architecture that is specialized for the application by the extensive infrastructure generation framework presented in this article. All generated hardware is vendor-agnostic, and software drivers add a platform-agnostic layer, allowing users to create portable implementations.
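The hardware-friendly nature of the Arrow format can be seen from its variable-length (utf8) column layout, which is just two contiguous buffers. The sketch below is plain host C++, not the Fletcher API; the struct and helper are illustrative. It shows why a consumer, whether a CPU, another process, or an FPGA DMA engine handed the buffer base addresses, can read rows in place with no (de)serialization step.

// Arrow-style variable-length column: offsets[i]..offsets[i+1] delimits row i
// inside one flat values buffer, so rows are read in place (zero-copy).
#include <cstdint>
#include <cstdio>
#include <vector>
#include <string>

struct Utf8Column {
    std::vector<int32_t> offsets;   // n + 1 entries
    std::vector<uint8_t> values;    // all characters back to back
};

Utf8Column build(const std::vector<std::string> &rows) {
    Utf8Column col;
    col.offsets.push_back(0);
    for (const auto &s : rows) {
        col.values.insert(col.values.end(), s.begin(), s.end());
        col.offsets.push_back((int32_t)col.values.size());
    }
    return col;
}

int main() {
    Utf8Column col = build({"zero", "copy", "tabular", "data"});
    // An accelerator would be handed exactly these two base pointers and
    // lengths; here the CPU reads row 2 in place, no copy, no decode.
    int i = 2;
    int32_t beg = col.offsets[i], end = col.offsets[i + 1];
    printf("row %d: %.*s\n", i, end - beg, (const char *)&col.values[beg]);
    return 0;
}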


Author(s): Hidetoshi Iwashita, Masahiro Nakao

Coarray features have been implemented in the Omni XcalableMP compiler using a source-to-source translator and layered runtime libraries. Three memory allocation methods for coarrays were implemented on top of the GASNet and MPI-3 communication libraries and the native Fujitsu interface. For coarray PUT/GET communication, algorithms using DMA (zero-copy) and buffering were introduced. The key techniques for achieving high performance were the non-blocking PUT communication implemented in the runtime library and the optimization of GET communication in the translator. The fundamental performance was evaluated and analyzed using the ping-pong benchmark and a modified version of it. The MPI version of the Himeno benchmark was ported to coarrays and modified to make full use of non-blocking PUT. In the evaluation, the non-blocking coarray version clearly outperformed both the original and the non-blocking MPI versions.
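As a rough sketch of how a contiguous coarray PUT might map onto the MPI-3 backend mentioned above (this is not the Omni XcalableMP runtime itself; the mapping and names are assumptions for illustration), the host-side code below hands the user buffer straight to MPI_Put inside a passive-target epoch, defers completion to a flush point to mimic the non-blocking PUT, and only then reads the received data. A strided PUT would instead go through a packing buffer, which is the buffering alternative the abstract contrasts with the zero-copy path.

/* Sketch of a contiguous coarray PUT ( buf(:)[right] = buf(:) ) over MPI-3 RMA. */
#include <mpi.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local[N], coarray[N];          /* coarray[] plays the remotely
                                             writable coarray image          */
    for (int i = 0; i < N; ++i) { local[i] = rank; coarray[i] = -1.0; }

    MPI_Win win;
    MPI_Win_create(coarray, N * sizeof(double), sizeof(double),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);
    MPI_Win_lock_all(0, win);

    int right = (rank + 1) % size;
    /* Contiguous data: the user buffer goes straight to MPI_Put, no packing. */
    MPI_Put(local, N, MPI_DOUBLE, right, 0, N, MPI_DOUBLE, win);

    /* ... unrelated computation could overlap here (non-blocking PUT) ... */

    MPI_Win_flush(right, win);            /* complete the PUT at the target  */
    MPI_Barrier(MPI_COMM_WORLD);          /* everyone's PUT has landed       */
    MPI_Win_sync(win);                    /* make it visible to local loads  */

    printf("rank %d received %g from its left neighbor\n", rank, coarray[0]);

    MPI_Win_unlock_all(win);
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}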


2020, Vol. 144, pp. 1-13
Author(s): Jahanzeb Maqbool Hashmi, Ching-Hsiang Chu, Sourav Chakraborty, Mohammadreza Bayatpour, Hari Subramoni, et al.

2020, Vol. 245, pp. 01011
Author(s): Rafał Dominik Krawczyk, Tommaso Colombo, Niko Neufeld, Flavio Pisani, Sébastien Valat

This paper evaluates the use of Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE) for the Run 3 LHCb event building at CERN. The data acquisition system of the detector will collect partial data from approximately 1000 separate detector streams, for a total estimated throughput of 32 terabits per second. Full events will be assembled for subsequent processing and data selection in the filtering farm of the online trigger. High-throughput transmission with up to 90% link utilization will be an essential feature of the system, and the data exchange mechanism must support zero-copy transmissions. In this work, RoCE, a high-throughput kernel-bypass Ethernet protocol, is benchmarked as a potential alternative to InfiniBand. A RoCE-based event-building network is presented and two implementations are considered: the first combined shallow-buffered and deep-buffered switches with flow control enabled, while the second used only deep-buffered devices and relied on their memory throughput and capacity. Feasibility tests were conducted with selected Ethernet switches, and memory bandwidth utilization was investigated and compared with InfiniBand. Relevant utilization and interoperability issues of RoCE flow control are detailed, along with lessons learned along the way.
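In practice, the zero-copy requirement means that event-fragment buffers are registered with the RoCE NIC so it can DMA them directly to the remote builder node, bypassing the kernel and intermediate socket copies. The minimal ibverbs sketch below is not the LHCb event-builder code: it shows only the registration step on the first available RDMA device, and queue-pair setup and the actual RDMA write are omitted.

/* Register an event-fragment buffer with a RoCE/InfiniBand NIC for zero-copy RDMA. */
#include <infiniband/verbs.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    int num;
    struct ibv_device **devs = ibv_get_device_list(&num);
    if (!devs || num == 0) { fprintf(stderr, "no RDMA-capable device\n"); return 1; }

    struct ibv_context *ctx = ibv_open_device(devs[0]);
    struct ibv_pd *pd = ibv_alloc_pd(ctx);

    /* An event-fragment buffer: ordinary user memory, pinned and mapped for
       the NIC by ibv_reg_mr so RDMA transfers bypass the kernel and CPU.     */
    size_t len = 1 << 20;
    void *frag = malloc(len);
    struct ibv_mr *mr = ibv_reg_mr(pd, frag, len,
                                   IBV_ACCESS_LOCAL_WRITE |
                                   IBV_ACCESS_REMOTE_WRITE |
                                   IBV_ACCESS_REMOTE_READ);
    if (!mr) { perror("ibv_reg_mr"); return 1; }
    printf("registered %zu bytes, lkey=0x%x rkey=0x%x\n", len, mr->lkey, mr->rkey);
    /* lkey/rkey would be exchanged with peers and used in ibv_post_send()
       work requests of type IBV_WR_RDMA_WRITE.                               */

    ibv_dereg_mr(mr);
    free(frag);
    ibv_dealloc_pd(pd);
    ibv_close_device(ctx);
    ibv_free_device_list(devs);
    return 0;
}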


IEEE Access, 2020, Vol. 8, pp. 59315-59325
Author(s): Hyunchan Park, Youngpil Kim, Seehwan Yoo
