GPU Domain Specialization via Composable On-Package Architecture

Yaosheng Fu; Evgeny Bolotin; Niladrish Chatterjee; David Nellans; Stephen W. Keckler

doi:10.1145/3484505

ACM Transactions on Architecture and Code Optimization ◽

10.1145/3484505 ◽

2022 ◽

Vol 19 (1) ◽

pp. 1-23

Author(s):

Yaosheng Fu ◽

Evgeny Bolotin ◽

Niladrish Chatterjee ◽

David Nellans ◽

Stephen W. Keckler

Keyword(s):

Deep Learning ◽

Memory System ◽

Design Reuse ◽

Application Domain ◽

Precision Matrix ◽

Practical Solution ◽

Optimal Configurations ◽

GPU Architecture ◽

With Memory ◽

Cache Capacity

As GPUs scale their low-precision matrix math throughput to boost deep learning (DL) performance, they upset the balance between math throughput and memory system capabilities. We demonstrate that a converged GPU design trying to address diverging architectural requirements between FP32 (or larger)-based HPC and FP16 (or smaller)-based DL workloads results in sub-optimal configurations for either of the application domains. We argue that a Composable On-PAckage GPU (COPA-GPU) architecture to provide domain-specialized GPU products is the most practical solution to these diverging requirements. A COPA-GPU leverages multi-chip-module disaggregation to support maximal design reuse, along with memory system specialization per application domain. We show how a COPA-GPU enables DL-specialized products by modular augmentation of the baseline GPU architecture with up to 4× higher off-die bandwidth, 32× larger on-package cache, and 2.3× higher DRAM bandwidth and capacity, while conveniently supporting scaled-down HPC-oriented designs. This work explores the microarchitectural design necessary to enable composable GPUs and evaluates the benefits composability can provide to HPC, DL training, and DL inference. We show that when compared to a converged GPU design, a DL-optimized COPA-GPU featuring a combination of 16× larger cache capacity and 1.6× higher DRAM bandwidth scales per-GPU training and inference performance by 31% and 35%, respectively, and reduces the number of GPU instances by 50% in scale-out training scenarios.
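The imbalance between math throughput and memory bandwidth described in this abstract can be illustrated with a simple roofline-style bound. This is a minimal sketch and not the evaluation model used in the paper: the function names and all hardware numbers (peak TFLOP/s, TB/s, and the kernel's FLOP/byte) are hypothetical placeholders. Only the 1.6× DRAM bandwidth factor is taken from the abstract.

```python
# Illustrative roofline-style bound; NOT the model used in the paper.
# All hardware numbers below are hypothetical placeholders.

def attainable_tflops(peak_tflops, bandwidth_tb_per_s, flops_per_byte):
    """Upper bound on throughput: capped by math peak or by memory traffic."""
    return min(peak_tflops, bandwidth_tb_per_s * flops_per_byte)


def balance_point(peak_tflops, bandwidth_tb_per_s):
    """Arithmetic intensity (FLOP/byte) above which a kernel is math-bound."""
    return peak_tflops / bandwidth_tb_per_s


if __name__ == "__main__":
    intensity = 25.0  # hypothetical kernel: 25 FLOPs per byte of DRAM traffic
    configs = [
        ("baseline", 100.0, 2.0),
        ("4x math, same DRAM", 400.0, 2.0),
        ("4x math, 1.6x DRAM", 400.0, 2.0 * 1.6),
    ]
    for label, peak, bw in configs:
        bound = attainable_tflops(peak, bw, intensity)
        print(f"{label}: balance point {balance_point(peak, bw):.1f} FLOP/byte, "
              f"bound {bound:.1f} TFLOP/s")
```

Under these assumed numbers, raising only the math peak leaves the bound for this memory-bound kernel unchanged, while extra DRAM bandwidth raises it. The bound ignores caches entirely, so it says nothing about the on-package cache benefits the abstract reports.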

An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness

Proceedings of the 36th annual international symposium on Computer architecture - ISCA '09 ◽

10.1145/1555754.1555775 ◽

2009 ◽

Cited By ~ 256

Author(s):

Sunpyo Hong ◽

Hyesoon Kim

Keyword(s):

Analytical Model ◽

Thread Level Parallelism ◽

Level Parallelism ◽

GPU Architecture ◽

With Memory

DeLTA: GPU Performance Model for Deep Learning Applications with In-Depth Memory System Traffic Analysis

2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) ◽

10.1109/ispass.2019.00041 ◽

2019 ◽

Cited By ~ 4

Author(s):

Sangkug Lym ◽

Donghyuk Lee ◽

Mike O'Connor ◽

Niladrish Chatterjee ◽

Mattan Erez

Keyword(s):

Deep Learning ◽

Traffic Analysis ◽

Memory System ◽

Performance Model

A Practical Solution for Non-Intrusive Type II Load Monitoring Based on Deep Learning and Post-Processing

IEEE Transactions on Smart Grid ◽

10.1109/tsg.2019.2918330 ◽

2020 ◽

Vol 11 (1) ◽

pp. 148-160 ◽

Cited By ~ 10

Author(s):

Weicong Kong ◽

Zhao Yang Dong ◽

Bo Wang ◽

Junhua Zhao ◽

Jie Huang

Keyword(s):

Deep Learning ◽

Type II ◽

Post Processing ◽

Practical Solution ◽

Load Monitoring

Investigation of optimal configurations of a convolutional neural network for the identification of objects in real-time

Information Technology and Nanotechnology ◽

10.18287/1613-0073-2019-2416-417-423 ◽

2019 ◽

pp. 417-423

Author(s):

M A Isayev ◽

D A Savelyev

Keyword(s):

Neural Network ◽

Neural Networks ◽

Deep Learning ◽

Convolutional Neural Network ◽

Real Time ◽

State Of The Art ◽

Average Precision ◽

The Core ◽

Particular Solution ◽

Optimal Configurations

This paper compares different convolutional neural networks that form the core of the most current solutions in the computer vision area. The study benchmarks these state-of-the-art solutions against criteria such as mAP (mean average precision) and FPS (frames per second) to assess their suitability for real-time use. Conclusions are drawn about the best convolutional neural network model and the deep learning methods used in a particular solution.

A Comparative Analysis of Machine/Deep Learning Models for Parking Space Availability Prediction

Sensors ◽

10.3390/s20010322 ◽

2020 ◽

Vol 20 (1) ◽

pp. 322 ◽

Cited By ~ 9

Author(s):

Faraz Malik Awan ◽

Yasir Saleem ◽

Roberto Minerva ◽

Noel Crespi

Keyword(s):

Deep Learning ◽

Comparative Analysis ◽

Random Forest ◽

Decision Tree ◽

Multilayer Perceptron ◽

Large Data ◽

Data Sets ◽

Application Domain ◽

Parking Space ◽

Data Set

Machine/Deep Learning (ML/DL) techniques have been applied to large data sets to extract relevant information and make predictions. The performance and outcomes of different ML/DL algorithms may vary depending on the data sets used, as well as on the suitability of the algorithms to the data and the application domain under consideration. Hence, determining which ML/DL algorithm is most suitable for a specific application domain and its related data sets would be a key advantage. To respond to this need, a comparative analysis of well-known ML/DL techniques, including Multilayer Perceptron, K-Nearest Neighbors (KNN), Decision Tree, Random Forest, and Voting Classifier (or the Ensemble Learning Approach), has been conducted for the prediction of parking space availability. This comparison utilized Santander’s parking data set, initiated while working on the H2020 WISE-IoT project. The data set was used to evaluate the considered algorithms and to determine the one offering the best prediction. The results of this analysis show that, regardless of the data set size, less complex algorithms such as Decision Tree, Random Forest, and KNN outperform complex algorithms such as Multilayer Perceptron in prediction accuracy, while providing comparable information for the prediction of parking space availability. In addition, this paper provides Top-K parking space recommendations based on the distance between the current position of vehicles and free parking spots.
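The Top-K recommendation step mentioned at the end of this abstract can be sketched as a nearest-free-spot selection. This is a minimal illustration under stated assumptions, not the authors' method: the function name `top_k_free_spots`, the record layout (`id`, `pos`, `free`), and the use of straight-line Euclidean distance are all hypothetical, and the abstract does not specify which distance measure the paper uses.

```python
import heapq
import math

# Hypothetical sketch of distance-based Top-K recommendation.
# The data layout and the Euclidean metric are assumptions, not from the paper.

def top_k_free_spots(vehicle_pos, spots, k):
    """Return the k free spots closest to vehicle_pos, nearest first."""
    free = [s for s in spots if s["free"]]
    return heapq.nsmallest(k, free, key=lambda s: math.dist(vehicle_pos, s["pos"]))


if __name__ == "__main__":
    spots = [
        {"id": "A", "pos": (1.0, 1.0), "free": True},
        {"id": "B", "pos": (0.5, 0.0), "free": False},  # occupied, so skipped
        {"id": "C", "pos": (3.0, 4.0), "free": True},
        {"id": "D", "pos": (0.0, 2.0), "free": True},
    ]
    for spot in top_k_free_spots((0.0, 0.0), spots, k=2):
        print(spot["id"], round(math.dist((0.0, 0.0), spot["pos"]), 3))
```

In a complete system the `free` flag would presumably come from the availability predictor rather than being known in advance. Here it is hard-coded to keep the sketch self-contained.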

Memory system characterization of deep learning workloads

Proceedings of the International Symposium on Memory Systems - MEMSYS '19 ◽

10.1145/3357526.3357569 ◽

2019 ◽

Author(s):

Zeshan Chishti ◽

Berkin Akin

Keyword(s):

Deep Learning ◽

Memory System ◽

System Characterization

Low precision matrix multiplication for efficient deep learning in NVIDIA Carmel processors

The Journal of Supercomputing ◽

10.1007/s11227-021-03636-4 ◽

2021 ◽

Author(s):

Pablo San Juan ◽

Rafael Rodríguez-Sánchez ◽

Francisco D. Igual ◽

Pedro Alonso-Jordá ◽

Enrique S. Quintana-Ortí

Keyword(s):

Deep Learning ◽

Matrix Multiplication ◽

Precision Matrix

An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness

ACM SIGARCH Computer Architecture News ◽

10.1145/1555815.1555775 ◽

2009 ◽

Vol 37 (3) ◽

pp. 152-163 ◽

Cited By ~ 105

Author(s):

Sunpyo Hong ◽

Hyesoon Kim

Keyword(s):

Analytical Model ◽

Thread Level Parallelism ◽

Level Parallelism ◽

GPU Architecture ◽

With Memory

Optimizing datacenter power with memory system levers for guaranteed quality-of-service

Proceedings of the 21st international conference on Parallel architectures and compilation techniques - PACT '12 ◽

10.1145/2370816.2370834 ◽

2012 ◽

Cited By ~ 5

Author(s):

Kshitij Sudan ◽

Sadagopan Srinivasan ◽

Rajeev Balasubramonian ◽

Ravi Iyer

Keyword(s):

Quality Of Service ◽

Memory System ◽

Guaranteed Quality ◽

With Memory

Machine learning in medicine: what clinicians should know

Singapore Medical Journal ◽

10.11622/smedj.2021054 ◽

2021 ◽

Author(s):

JZT Sim ◽

QW Fong ◽

WM Huang ◽

CH Tan

Keyword(s):

Artificial Intelligence ◽

Machine Learning ◽

Neural Networks ◽

Decision Making ◽

Deep Learning ◽

Language Processing ◽

Decision Making Process ◽

Application Domain ◽

Basic Concepts ◽

Potential Impact

With the advent of artificial intelligence (AI), machines are increasingly being used to complete complicated tasks, yielding remarkable results. Machine learning (ML) is the most relevant subset of AI in medicine, which will soon become an integral part of our everyday practice. Therefore, physicians should acquaint themselves with ML and AI, and their role as an enabler rather than a competitor. Herein, we introduce basic concepts and terms used in AI and ML, and aim to demystify commonly used AI/ML algorithms such as learning methods including neural networks/deep learning, decision tree and application domain in computer vision and natural language processing through specific examples. We discuss how machines are already being used to augment the physician’s decision-making process, and postulate the potential impact of ML on medical practice and medical research based on its current capabilities and known limitations. Moreover, we discuss the feasibility of full machine autonomy in medicine.