Rethinking Embedded Blocks for Machine Learning Applications
The underlying goal of FPGA architecture research is to devise flexible substrates that implement a wide variety of circuits efficiently. Contemporary FPGA architectures have been optimized to support networking, signal processing, and image processing applications through high-precision digital signal processing (DSP) blocks. The recent emergence of machine learning has created a new set of demands characterized by: (1) higher computational density and (2) low precision arithmetic requirements. With the goal of exploring this new design space in a methodical manner, we first propose a problem formulation involving computing nested loops over multiply-accumulate (MAC) operations, which covers many basic linear algebra primitives and standard deep neural network (DNN) kernels. A quantitative methodology for deriving efficient coarse-grained compute block architectures from benchmarks is then proposed together with a family of new embedded blocks, called MLBlocks. An MLBlock instance includes several multiply-accumulate units connected via a flexible routing, where each configuration performs a few parallel dot-products in a systolic array fashion. This architecture is parameterized with support for different data movements, reuse, and precisions, utilizing a columnar arrangement that is compatible with existing FPGA architectures. On synthetic benchmarks, we demonstrate that for 8-bit arithmetic, MLBlocks offer 6× improved performance over the commercial Xilinx DSP48E2 architecture with smaller area and delay; and for time-multiplexed 16-bit arithmetic, achieves 2× higher performance per area with the same area and frequency. All source codes and data, along with documents to reproduce all the results in this article, are available at http://github.com/raminrasoulinezhad/MLBlocks .