INFerence-as-a-Service (INFaaS) has become a primary workload in the cloud. However, existing FPGA-based Deep Neural Network (DNN) accelerators are mainly optimized for the fastest speed of a single task, while the multi-tenancy of INFaaS has not been explored yet. As the demand for INFaaS keeps growing, simply increasing the number of FPGA-based DNN accelerators is not cost-effective, while merely sharing these single-task optimized DNN accelerators in a time-division multiplexing way could lead to poor isolation and high-performance loss for INFaaS. On the other hand, current cloud-based DNN accelerators have excessive compilation overhead, especially when scaling out to multi-FPGA systems for multi-tenant sharing, leading to unacceptable compilation costs for both offline deployment and online reconfiguration. Therefore, it is far from providing efficient and flexible FPGA virtualization for public and private cloud scenarios.
Aiming to solve these problems, we propose a unified virtualization framework for general-purpose deep neural networks in the cloud, enabling multi-tenant sharing for both the Convolution Neural Network (CNN), and the Recurrent Neural Network (RNN) accelerators on a single FPGA. The isolation is enabled by introducing a two-level instruction dispatch module and a multi-core based hardware resources pool. Such designs provide isolated and runtime-programmable hardware resources, which further leads to performance isolation for multi-tenant sharing. On the other hand, to overcome the heavy re-compilation overheads, a tiling-based instruction frame package design and a two-stage static-dynamic compilation, are proposed. Only the lightweight runtime information is re-compiled with ∼1 ms overhead, thus guaranteeing the private cloud’s performance. Finally, the extensive experimental results show that the proposed virtualized solutions achieve up to 3.12× and 6.18× higher throughput in the private cloud compared with the static CNN and RNN baseline designs, respectively.