High Performance and Energy Efficient Integer Matrix Multiplication for Deep Learning

Author(s): Pau San Juan, Pedro Alonso-Jorda, Enrique S. Quintana-Orti
2014, Vol. 29, pp. 599-613
Author(s): Li Tan, Longxiang Chen, Zizhong Chen, Ziliang Zong, Rong Ge, ...

2021, Vol. 21 (2), pp. e09
Author(s): Federico Favaro, Ernesto Dufrechou, Pablo Ezzatti, Juan Pablo Oliver

The dissemination of multi-core architectures and the subsequent irruption of massively parallel devices have led to a revolution in High-Performance Computing (HPC) platforms over the last decades. As a result, Field-Programmable Gate Arrays (FPGAs) are re-emerging as a versatile and more energy-efficient alternative to other platforms. Traditional FPGA design implies using low-level Hardware Description Languages (HDLs) such as VHDL or Verilog, which follow an entirely different programming model than standard software languages, and their use requires specialized knowledge of the underlying hardware. In recent years, manufacturers have invested considerable effort in High-Level Synthesis (HLS) tools in order to foster greater adoption of FPGAs in the HPC community. Our work studies the use of multi-core hardware and different FPGAs to address Numerical Linear Algebra (NLA) kernels such as the general matrix multiplication (GEMM) and the sparse matrix-vector multiplication (SpMV). Specifically, we compare the behavior of fine-tuned kernels on a multi-core CPU with HLS implementations on FPGAs. We evaluate our implementations experimentally on a low-end and on a cutting-edge FPGA platform, in terms of runtime and energy consumption, and compare the results against the Intel MKL library on the CPU.
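To make the contrast between the two programming models concrete, the following is a minimal sketch of an HLS-style GEMM kernel in C++. It is not the authors' implementation: the matrix size N, the row-major layout, and the Vitis/Vivado-style PIPELINE pragma are illustrative assumptions.

    // Minimal C = A * B kernel for square N x N matrices (illustrative only).
    // An HLS compiler maps the pragma to hardware pipelining; a regular C++
    // compiler simply ignores the unknown pragma.
    constexpr int N = 128; // problem size assumed for the sketch

    void gemm(const float A[N][N], const float B[N][N], float C[N][N]) {
        for (int i = 0; i < N; ++i) {
            for (int j = 0; j < N; ++j) {
                float acc = 0.0f;
                for (int k = 0; k < N; ++k) {
                #pragma HLS PIPELINE II=1 // pipeline the reduction loop
                    acc += A[i][k] * B[k][j];
                }
                C[i][j] = acc;
            }
        }
    }

On the CPU side, the same operation reduces to a single Intel MKL call such as cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, N, N, N, 1.0f, &A[0][0], N, &B[0][0], N, 0.0f, &C[0][0], N), which is the kind of fine-tuned baseline the abstract compares against.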


2015, Vol. 1 (4), pp. 1-12
Author(s): Chidadala Janardhan, Bhagath Pyda, J. Manohar, K. V. Ramanaiah, ...

2019, Vol. 15 (4), pp. 1-21
Author(s): Bing Li, Mengjie Mao, Xiaoxiao Liu, Tao Liu, Zihao Liu, ...

Entropy, 2021, Vol. 23 (2), pp. 223
Author(s): Yen-Ling Tai, Shin-Jhe Huang, Chien-Chang Chen, Henry Horng-Shing Lu

Nowadays, deep learning methods with high structural complexity and flexibility inevitably lean on the computational capability of the hardware. A platform with high-performance GPUs and large amounts of memory can support neural networks with large numbers of layers and kernels. However, naively pursuing high-cost hardware would likely hamper the technical development of deep learning methods. In this article, we therefore establish a new preprocessing method to reduce the computational complexity of the neural networks. Inspired by the band theory of solids in physics, we map the image space isomorphically onto a non-interacting physical system and treat image voxels as particle-like clusters. We then recast the Fermi–Dirac distribution as a correction function that normalizes the voxel intensity and filters out insignificant cluster components. The filtered clusters can thus delineate the morphological heterogeneity of the image voxels. We used the BraTS 2019 datasets and the dimensional fusion U-net for algorithmic validation, and the proposed Fermi–Dirac correction function exhibited performance comparable to the other preprocessing methods employed. Compared with the conventional z-score normalization function and the Gamma correction function, the proposed algorithm saves at least 38% of the computational time on a low-cost hardware architecture. Even though global histogram equalization has the lowest computational time among the correction functions employed, the proposed Fermi–Dirac correction function exhibits better image augmentation and segmentation capabilities.
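As an illustration of the idea (the paper's exact parameterization is not reproduced here), a voxel-wise Fermi–Dirac correction can be sketched in C++ as follows; mu and t, the analogues of the chemical potential and the temperature, are placeholder values chosen for the sketch.

    #include <cmath>
    #include <vector>

    // Fermi-Dirac-shaped correction f(x) = 1 / (exp((x - mu) / t) + 1),
    // applied voxel-wise to intensities (illustrative parameter values).
    std::vector<float> fermi_dirac_correct(const std::vector<float>& voxels,
                                           float mu = 0.5f, float t = 0.1f) {
        std::vector<float> out;
        out.reserve(voxels.size());
        for (float v : voxels) {
            // Intensities well below mu map near 1, well above mu near 0,
            // so the function acts as a soft threshold on voxel intensity.
            out.push_back(1.0f / (std::exp((v - mu) / t) + 1.0f));
        }
        return out;
    }

Because the output saturates smoothly at 0 and 1, a single pass both normalizes the intensity range and suppresses components above the mu threshold, consistent with the filtering role the abstract describes.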

