precision matrix
Recently Published Documents

TOTAL DOCUMENTS: 138 (FIVE YEARS: 34)
H-INDEX: 16 (FIVE YEARS: 2)

2022 ◽  
Vol 19 (1) ◽  
pp. 1-23
Author(s):  
Yaosheng Fu ◽  
Evgeny Bolotin ◽  
Niladrish Chatterjee ◽  
David Nellans ◽  
Stephen W. Keckler

As GPUs scale their low-precision matrix math throughput to boost deep learning (DL) performance, they upset the balance between math throughput and memory system capabilities. We demonstrate that a converged GPU design trying to address diverging architectural requirements between FP32 (or larger)-based HPC and FP16 (or smaller)-based DL workloads results in sub-optimal configurations for either of the application domains. We argue that a Composable On-Package GPU (COPA-GPU) architecture to provide domain-specialized GPU products is the most practical solution to these diverging requirements. A COPA-GPU leverages multi-chip-module disaggregation to support maximal design reuse, along with memory system specialization per application domain. We show how a COPA-GPU enables DL-specialized products by modular augmentation of the baseline GPU architecture with up to 4× higher off-die bandwidth, 32× larger on-package cache, and 2.3× higher DRAM bandwidth and capacity, while conveniently supporting scaled-down HPC-oriented designs. This work explores the microarchitectural design necessary to enable composable GPUs and evaluates the benefits composability can provide to HPC, DL training, and DL inference. We show that when compared to a converged GPU design, a DL-optimized COPA-GPU featuring a combination of 16× larger cache capacity and 1.6× higher DRAM bandwidth scales per-GPU training and inference performance by 31% and 35%, respectively, and reduces the number of GPU instances by 50% in scale-out training scenarios.
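A rough sense of the math-versus-memory balance argument can be gained from a roofline-style calculation. The sketch below is a minimal illustration only: the peak-throughput and bandwidth figures are hypothetical placeholders, not numbers from the paper.

```python
# Minimal roofline-style sketch of the math-vs-memory balance argument.
# All figures are hypothetical, chosen only to illustrate the trend.

def attainable_tflops(peak_tflops, dram_bw_tbs, flops_per_byte):
    """Roofline model: delivered throughput is capped by math or by memory traffic."""
    return min(peak_tflops, dram_bw_tbs * flops_per_byte)

peak_fp32 = 20.0    # hypothetical FP32 peak, TFLOP/s
peak_fp16 = 160.0   # hypothetical FP16 matrix-math peak, TFLOP/s
dram_bw   = 2.0     # hypothetical DRAM bandwidth, TB/s

# Machine balance: the arithmetic intensity (FLOP/byte) needed to stay compute-bound.
print("FP32 balance point:", peak_fp32 / dram_bw, "FLOP/byte")
print("FP16 balance point:", peak_fp16 / dram_bw, "FLOP/byte")

# A workload whose arithmetic intensity sits between the two balance points
# is compute-bound at FP32 rates but memory-bound at FP16 rates.
ai = 40.0
print("FP32 attainable:", attainable_tflops(peak_fp32, dram_bw, ai), "TFLOP/s")
print("FP16 attainable:", attainable_tflops(peak_fp16, dram_bw, ai), "TFLOP/s")
```

Under these made-up numbers the FP16 configuration leaves half of its math throughput idle, which is the kind of imbalance the DL-specialized COPA-GPU addresses with extra off-die bandwidth and on-package cache.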


2021 ◽  
pp. 101389
Author(s):  
Aryan Eftekhari ◽  
Dimosthenis Pasadakis ◽  
Matthias Bollhöfer ◽  
Simon Scheidegger ◽  
Olaf Schenk

Author(s):  
H. Chatrabgoun ◽  
A. R. Soltanian ◽  
H. Mahjub ◽  
F. Bahreini

A large amount of research effort has been devoted to learning gene regulatory networks (GRNs) from gene expression data in order to understand the functional basis of a living organism. Under the assumption that the joint distribution of the gene expressions of interest is multivariate normal, such networks can be constructed by assessing the nonzero elements of the inverse covariance matrix, the so-called precision matrix or concentration matrix. However, considering only pairwise linear correlations may not reflect the true connectivity between genes. To relax this limiting constraint, we employ the Gaussian process (GP) model, a well-known, computationally efficient, non-parametric Bayesian machine learning technique. GPs belong to the class of methods known as kernel machines, which can approximate complex problems by tuning their hyperparameters. In effect, a GP makes it possible to exploit the capacity of different kernels when constructing the precision matrix and the GRN. In this paper, we first choose a GP with an appropriate kernel to learn the considered GRNs from the observed genetic data, and then estimate the kernel hyperparameters using a rule-of-thumb technique. These hyperparameters also control the degree of sparseness in the precision matrix. We then obtain a kernel-based precision matrix, analogous to GLASSO, from which a kernel-based GRN is constructed. We use these findings to construct high-performance GRNs for different species of the Drosophila fly rather than simply relying on the multivariate normality assumption, and we show that the GPs, by exploiting the capacity of the kernels, perform much better than the multivariate Gaussian distribution assumption.
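A minimal sketch of the kernel-based precision-matrix idea follows; it is not the authors' exact procedure. It builds an RBF Gram matrix over gene expression profiles, uses a median-heuristic ("rule-of-thumb") bandwidth, inverts the regularized kernel matrix, and thresholds small entries to obtain a sparse, GLASSO-like network estimate. The data, bandwidth, and threshold are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))          # hypothetical data: 50 samples x 10 genes

# Squared-Euclidean distances between gene expression profiles (columns of X).
D = squareform(pdist(X.T, metric="sqeuclidean"))

# Rule-of-thumb (median heuristic) bandwidth for the RBF kernel.
sigma2 = np.median(D[D > 0])
K = np.exp(-D / (2.0 * sigma2))

# Kernel-based "covariance"; a small jitter keeps the inverse well conditioned.
K += 1e-3 * np.eye(K.shape[0])
precision = np.linalg.inv(K)

# Threshold small off-diagonal entries to get a sparse, GLASSO-like adjacency.
adjacency = (np.abs(precision) > 0.05).astype(int)
np.fill_diagonal(adjacency, 0)
print("Estimated edges:", int(adjacency.sum() // 2))
```

Swapping in a different kernel (for example Matérn or polynomial) changes the notion of similarity between genes without changing the rest of the pipeline, which is the flexibility the abstract attributes to kernel machines.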


2021 ◽  
Vol 47 (2) ◽  
pp. 1-26
Author(s):  
Field G. Van Zee ◽  
Devangi N. Parikh ◽  
Robert A. Van De Geijn

We approach the problem of implementing mixed-datatype support within the general matrix multiplication (gemm) operation of the BLAS-like Library Instantiation Software framework, whereby each of the matrix operands A, B, and C may be stored as single- or double-precision real or complex values. Another factor of complexity, whereby the matrix product and accumulation are allowed to take place in a precision different from the storage precisions of either A or B, is also discussed. We first break the problem into orthogonal dimensions, considering the mixing of domains separately from the mixing of precisions. Support for all combinations of matrix operands stored in either the real or complex domain is mapped out by enumerating the cases and describing an implementation approach for each. Supporting all combinations of storage and computation precisions is handled by typecasting the matrices at key stages of the computation: during packing and/or accumulation, as needed. Several optional optimizations are also documented. Performance results gathered on a 56-core Marvell ThunderX2 and a 52-core Intel Xeon Platinum demonstrate that high performance is mostly preserved, with modest slowdowns incurred from unavoidable typecast instructions. The mixed-datatype implementation confirms that combinatorial intractability is avoided, with the framework relying on only two assembly microkernels to implement 128 datatype combinations.
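The typecast-at-key-stages approach can be illustrated with a short sketch. This is not the BLAS-like Library Instantiation Software API, just a hypothetical helper showing operands stored in different precisions being cast to a computation precision on the way in (standing in for packing), accumulated there, and cast back to C's storage precision on output.

```python
import numpy as np

def mixed_gemm(A, B, C, alpha=1.0, beta=0.0, comp_dtype=np.float64):
    """Illustrative mixed-datatype gemm: C := beta*C + alpha*A@B.
    A, B, and C may be stored in different precisions; the product and
    accumulation happen in comp_dtype, mirroring typecasting during
    packing and/or accumulation."""
    A_p = A.astype(comp_dtype)            # typecast while "packing" A
    B_p = B.astype(comp_dtype)            # typecast while "packing" B
    acc = beta * C.astype(comp_dtype) + alpha * (A_p @ B_p)
    return acc.astype(C.dtype)            # cast back to C's storage precision

# Storage precisions differ: A is single real, B is double real, C is single real.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3)).astype(np.float32)
B = rng.standard_normal((3, 2)).astype(np.float64)
C = np.zeros((4, 2), dtype=np.float32)

C = mixed_gemm(A, B, C, comp_dtype=np.float64)
print(C.dtype, C.shape)
```

A real implementation folds the casts into the packing routines rather than making full-matrix copies; the abstract notes that the remaining typecast instructions cost only modest slowdowns.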

