High Performance and Energy Efficient Integer Matrix Multiplication for Deep Learning

Author(s): Pau San Juan, Pedro Alonso-Jorda, Enrique S. Quintana-Orti
2014, Vol. 29, pp. 599-613
Author(s): Li Tan, Longxiang Chen, Zizhong Chen, Ziliang Zong, Rong Ge, ...

2021, Vol. 21 (2), pp. e09
Author(s): Federico Favaro, Ernesto Dufrechou, Pablo Ezzatti, Juan Pablo Oliver

The dissemination of multi-core architectures and the subsequent irruption of massively parallel devices have led to a revolution in High-Performance Computing (HPC) platforms over the last decades. As a result, Field-Programmable Gate Arrays (FPGAs) are re-emerging as a versatile and more energy-efficient alternative to other platforms. Traditional FPGA design implies using low-level Hardware Description Languages (HDLs) such as VHDL or Verilog, which follow an entirely different programming model than standard software languages, and their use requires specialized knowledge of the underlying hardware. In recent years, manufacturers have invested considerable effort in High-Level Synthesis (HLS) tools in order to foster greater adoption of FPGAs in the HPC community. Our work studies the use of multi-core hardware and different FPGAs to address Numerical Linear Algebra (NLA) kernels such as the general matrix multiplication (GEMM) and the sparse matrix-vector multiplication (SpMV). Specifically, we compare the behavior of fine-tuned kernels on a multi-core CPU with HLS implementations on FPGAs. We evaluate our implementations experimentally on a low-end and on a cutting-edge FPGA platform, in terms of runtime and energy consumption, and compare the results against the Intel MKL library on the CPU.
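To make the contrast between the two programming models concrete, the following is a minimal sketch of an HLS-style GEMM kernel in C++. It is not the authors' implementation: the matrix size N, the row-major layout, and the Vitis/Vivado-style PIPELINE pragma are illustrative assumptions.

    // Minimal C = A * B kernel for square N x N matrices (illustrative only).
    // An HLS compiler maps the pragma to hardware pipelining; a regular C++
    // compiler simply ignores the unknown pragma.
    constexpr int N = 128; // problem size assumed for the sketch

    void gemm(const float A[N][N], const float B[N][N], float C[N][N]) {
        for (int i = 0; i < N; ++i) {
            for (int j = 0; j < N; ++j) {
                float acc = 0.0f;
                for (int k = 0; k < N; ++k) {
                #pragma HLS PIPELINE II=1 // pipeline the reduction loop
                    acc += A[i][k] * B[k][j];
                }
                C[i][j] = acc;
            }
        }
    }

On the CPU side, the same operation reduces to a single Intel MKL call such as cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, N, N, N, 1.0f, &A[0][0], N, &B[0][0], N, 0.0f, &C[0][0], N), which is the kind of fine-tuned baseline the abstract compares against.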


2015, Vol. 1 (4), pp. 1-12
Author(s): Chidadala Janardhan, Bhagath Pyda, J. Manohar, K. V. Ramanaiah, ...

2019, Vol. 15 (4), pp. 1-21
Author(s): Bing Li, Mengjie Mao, Xiaoxiao Liu, Tao Liu, Zihao Liu, ...

Entropy, 2021, Vol. 23 (2), pp. 223
Author(s): Yen-Ling Tai, Shin-Jhe Huang, Chien-Chang Chen, Henry Horng-Shing Lu

Nowadays, deep learning methods with high structural complexity and flexibility inevitably lean on the computational capability of the hardware. A platform with high-performance GPUs and large amounts of memory can support neural networks with large numbers of layers and kernels. However, naively pursuing high-cost hardware would likely hamper the technical development of deep learning methods. In this article, we therefore establish a new preprocessing method to reduce the computational complexity of the neural networks. Inspired by the band theory of solids in physics, we map the image space isomorphically onto a non-interacting physical system and treat image voxels as particle-like clusters. We then recast the Fermi–Dirac distribution as a correction function that normalizes the voxel intensity and filters out insignificant cluster components. The filtered clusters can thus delineate the morphological heterogeneity of the image voxels. We used the BraTS 2019 datasets and the dimensional fusion U-net for algorithmic validation, and the proposed Fermi–Dirac correction function exhibited performance comparable to the other preprocessing methods employed. Compared with the conventional z-score normalization function and the Gamma correction function, the proposed algorithm saves at least 38% of the computational time on a low-cost hardware architecture. Even though global histogram equalization has the lowest computational time among the correction functions employed, the proposed Fermi–Dirac correction function exhibits better image augmentation and segmentation capabilities.
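As an illustration of the idea (the paper's exact parameterization is not reproduced here), a voxel-wise Fermi–Dirac correction can be sketched in C++ as follows; mu and t, the analogues of the chemical potential and the temperature, are placeholder values chosen for the sketch.

    #include <cmath>
    #include <vector>

    // Fermi-Dirac-shaped correction f(x) = 1 / (exp((x - mu) / t) + 1),
    // applied voxel-wise to intensities (illustrative parameter values).
    std::vector<float> fermi_dirac_correct(const std::vector<float>& voxels,
                                           float mu = 0.5f, float t = 0.1f) {
        std::vector<float> out;
        out.reserve(voxels.size());
        for (float v : voxels) {
            // Intensities well below mu map near 1, well above mu near 0,
            // so the function acts as a soft threshold on voxel intensity.
            out.push_back(1.0f / (std::exp((v - mu) / t) + 1.0f));
        }
        return out;
    }

Because the output saturates smoothly at 0 and 1, a single pass both normalizes the intensity range and suppresses components above the mu threshold, consistent with the filtering role the abstract describes.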

