data layout
Recently Published Documents

Total documents: 252 (last five years: 47)
H-index: 17 (last five years: 1)

2022 ◽  
Vol 19 (1) ◽  
pp. 1-26
Author(s):  
Dennis Rieber ◽  
Axel Acosta ◽  
Holger Fröning

The success of Deep Artificial Neural Networks (DNNs) in many domains created a rich body of research concerned with hardware accelerators for compute-intensive DNN operators. However, implementing such operators efficiently with complex hardware intrinsics such as matrix multiply is a task that has not yet been gracefully automated. Solving this task often requires joint program and data layout transformations. First solutions to this problem have been proposed, such as TVM, UNIT, or ISAMIR, which work on a loop-level representation of operators and specify the data layout and possible program transformations before the embedding into the operator is performed. This top-down approach creates a tension between exploration range and search space complexity, especially when also exploring data layout transformations such as im2col, channel packing, or padding. In this work, we propose a new approach to this problem: a bottom-up method that allows the joint transformation of both computation and data layout based on the found embedding. By formulating the embedding as a constraint satisfaction problem over the scalar dataflow, every possible embedding solution is contained in the search space. Adding further constraints and optimization targets to the solver yields the subset of preferable solutions. An evaluation using the VTA hardware accelerator with the Baidu DeepBench inference benchmark shows that our approach can automatically generate code competitive with reference implementations. Further, we show that dynamically determining the data layout based on the intrinsic and the workload is beneficial for hardware utilization and performance. In cases where the reference implementation has low hardware utilization due to its fixed deployment strategy, we achieve a geomean speedup of up to 2.813×, while individual operators can improve by as much as 170×.
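The embedding-as-constraint-satisfaction idea can be illustrated with a toy search: assign each axis of a workload to an axis of a hardware intrinsic so that a divisibility constraint holds, and keep every satisfying assignment. All shapes, axis names, and the `embeddings` helper below are hypothetical; the paper's solver constrains the scalar dataflow, not this simplified axis matching.

```python
from itertools import permutations

# Hypothetical shapes: a GEMM-style intrinsic with axes (m, n, k) and a
# convolution lowered to a matmul-like workload. Neither the names nor
# the sizes come from the paper.
INTRINSIC = {"m": 16, "n": 16, "k": 4}
WORKLOAD = {"batch": 64, "out_channels": 128, "reduction": 36}

def embeddings(workload, intrinsic):
    """Enumerate every axis assignment satisfying the divisibility
    constraint -- a toy stand-in for solving the embedding as a
    constraint satisfaction problem that contains *all* solutions."""
    w_axes = list(workload)
    for perm in permutations(intrinsic):
        if all(workload[w] % intrinsic[i] == 0 for w, i in zip(w_axes, perm)):
            yield dict(zip(w_axes, perm))

solutions = list(embeddings(WORKLOAD, INTRINSIC))
# Additional constraints and optimization targets (e.g. preferred data
# layouts) would now prune this set, instead of narrowing the search
# up front as in the top-down approach.
```

With these toy shapes, only assignments mapping the size-36 `reduction` axis to the narrow `k` axis survive, mimicking how the solver's constraints carve out the subset of preferable embeddings.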


2021 ◽  
Vol 14 (12) ◽  
pp. 7477-7495
Author(s):  
Rafael Lago ◽  
Thomas Gastine ◽  
Tilman Dannert ◽  
Markus Rampp ◽  
Johannes Wicht

Abstract. We discuss two parallelization schemes for MagIC, an open-source, high-performance, pseudo-spectral code for the numerical solution of the magnetohydrodynamics equations in a rotating spherical shell. MagIC calculates the non-linear terms on a numerical grid in spherical coordinates, while the time step updates are performed on radial grid points with a spherical harmonic representation of the lateral directions. Several transforms are required to switch between the different representations. The established hybrid parallelization of MagIC uses message-passing interface (MPI) distribution in radius and relies on existing fast spherical transforms using OpenMP. Our new two-dimensional MPI decomposition implementation also distributes the latitudes or the azimuthal wavenumbers across the available MPI tasks and compute cores. We discuss several non-trivial algorithmic optimizations and the different data distribution layouts employed by our scheme. In particular, the two-dimensional distribution data layout yields a code that strongly scales well beyond the limit of the current one-dimensional distribution. We also show that the two-dimensional distribution implementation, although not yet fully optimized, can already be faster than the existing finely optimized hybrid parallelization when using many thousands of CPU cores. Our analysis indicates that the two-dimensional distribution variant can be further optimized to also surpass the performance of the one-dimensional distribution for a few thousand cores.
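The core of a two-dimensional decomposition can be sketched in a few lines: split both the radial and the latitudinal index ranges into contiguous blocks and assign each MPI rank one (radius-block, latitude-block) tile. The block distribution and row-major rank ordering below are illustrative assumptions, not MagIC's actual data layout.

```python
def block_range(n, parts, rank):
    """Contiguous block of `n` indices owned by `rank` out of `parts`
    blocks, with the remainder spread over the first ranks."""
    base, extra = divmod(n, parts)
    start = rank * base + min(rank, extra)
    return range(start, start + base + (1 if rank < extra else 0))

def local_tile(n_r, n_theta, p_r, p_theta, rank):
    """2-D decomposition: map a flat MPI rank to its radial and
    latitudinal index ranges (row-major rank ordering assumed)."""
    r_rank, theta_rank = divmod(rank, p_theta)
    return (block_range(n_r, p_r, r_rank),
            block_range(n_theta, p_theta, theta_rank))

# 10 radial and 7 latitudinal grid points on a 2 x 3 process grid:
tiles = [local_tile(10, 7, 2, 3, rank) for rank in range(6)]
```

Because each rank then owns only a slab in both directions, switching between the grid and spherical-harmonic representations requires redistributing data across ranks, which is where the non-trivial algorithmic optimizations come in.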


2021 ◽  
Vol 15 (1) ◽  
pp. 127-140
Author(s):  
Muhammad Adnan ◽  
Yassaman Ebrahimzadeh Maboud ◽  
Divya Mahajan ◽  
Prashant J. Nair

Recommender models are commonly used to suggest relevant items to users in e-commerce and online advertisement applications. These models use massive embedding tables to store numerical representations of items' and users' categorical variables (memory intensive) and employ neural networks (compute intensive) to generate final recommendations. Training these large-scale recommendation models is evolving to require increasing data and compute resources. The highly parallel neural network portion of these models can benefit from GPU acceleration; however, large embedding tables often cannot fit in the limited-capacity GPU device memory. Hence, this paper dives deep into the semantics of training data and obtains insights about the feature access, transfer, and usage patterns of these models. We observe that, due to the popularity of certain inputs, the accesses to the embeddings are highly skewed, with a few embedding entries being accessed up to 10,000× more often. This paper leverages this asymmetrical access pattern to offer a framework, called FAE, and proposes a hot-embedding-aware data layout for training recommender models. This layout utilizes the scarce GPU memory for storing the highly accessed embeddings, thus reducing the data transfers from CPU to GPU. At the same time, FAE engages the GPU to accelerate the execution of these hot embedding entries. Experiments on production-scale recommendation models with real datasets show that FAE reduces the overall training time by 2.3× and 1.52× in comparison to XDL CPU-only and XDL CPU-GPU execution, respectively, while maintaining baseline accuracy.
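The skew-aware placement can be sketched as a two-tier split: count accesses per embedding row, pin the hottest rows into a fixed GPU budget, and leave the long tail in CPU memory. The budget, row ids, and `Counter`-based profiling are illustrative; FAE's actual profiling and threshold selection are more involved.

```python
from collections import Counter

def split_hot_cold(access_trace, gpu_rows):
    """Return (hot, cold) row-id sets: the `gpu_rows` most frequently
    accessed embedding rows go to GPU memory, the rest stay on the CPU."""
    freq = Counter(access_trace)
    hot = {row for row, _ in freq.most_common(gpu_rows)}
    cold = set(freq) - hot
    return hot, cold

# A skewed trace: rows 7 and 3 correspond to popular inputs and
# dominate the accesses, as observed in real recommendation workloads.
trace = [7] * 1000 + [3] * 50 + list(range(100))
hot, cold = split_hot_cold(trace, gpu_rows=2)
```

With such a split, lookups for hot rows are served directly from GPU memory, and only accesses to the cold tail incur a CPU-to-GPU transfer.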


2021 ◽  
Author(s):  
Minxuan Zhou ◽  
Guoyang Chen ◽  
Mohsen Imani ◽  
Saransh Gupta ◽  
Weifeng Zhang ◽  
...  


Quantum ◽  
2021 ◽  
Vol 5 ◽  
pp. 497
Author(s):  
Craig Gidney

This paper presents "Stim", a fast simulator for quantum stabilizer circuits. The paper explains how Stim works and compares it to existing tools. With no foreknowledge, Stim can analyze a distance-100 surface code circuit (20 thousand qubits, 8 million gates, 1 million measurements) in 15 seconds and then begin sampling full circuit shots at a rate of 1 kHz. Stim uses a stabilizer tableau representation, similar to Aaronson and Gottesman's CHP simulator, but with three main improvements. First, Stim improves the asymptotic complexity of deterministic measurement from quadratic to linear by tracking the inverse of the circuit's stabilizer tableau. Second, Stim improves the constant factors of the algorithm by using a cache-friendly data layout and 256-bit-wide SIMD instructions. Third, Stim uses expensive stabilizer tableau simulation only to create an initial reference sample. Further samples are collected in bulk by using that sample as a reference for batches of Pauli frames propagating through the circuit.
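The Pauli-frame trick can be sketched for X errors alone: batch many shots into the bits of one machine word, propagate the frame through the circuit with XORs, and flip the reference measurement wherever a shot's frame carries an X. The 64-bit word below stands in for Stim's 256-bit SIMD words, and the gate set and function names are simplified assumptions.

```python
# One integer per qubit; bit i of frames[q] records whether shot i
# currently carries an X error on qubit q. XOR implements the algebra.
MASK = (1 << 64) - 1  # 64 shots per word

def apply_cnot(frames, control, target):
    """Under CNOT, an X on the control propagates to the target."""
    frames[target] ^= frames[control]

def measure(frames, qubit, reference_bit):
    """Flip the reference outcome in every shot whose frame has an X
    on the measured qubit; returns one word of per-shot outcomes."""
    return frames[qubit] ^ (-reference_bit & MASK)

frames = {0: 0b1010, 1: 0b0011}   # only 4 of the 64 shots are nonzero
apply_cnot(frames, 0, 1)          # X errors on qubit 0 spread to qubit 1
outcomes = measure(frames, 1, reference_bit=0)
```

Since a whole batch of shots advances with one XOR per gate, sampling additional shots is far cheaper than re-running the stabilizer tableau simulation that produced the reference sample.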


Energies ◽  
2021 ◽  
Vol 14 (13) ◽  
pp. 4059
Author(s):  
Marek Kęsek ◽  
Romuald Ogrodnik

Mining machinery and equipment used in modern mining are equipped with sensors and measurement systems at the stage of their production. Measuring devices are most often components of a control system or a machine performance monitoring system. In the case of headers, the primary task of these systems is to ensure safe operation and to monitor its correctness. It is customary to collect information in very large databases and analyze it when a failure occurs. Data mining methods allow analysis to be performed while mining machinery and equipment are operating, making it possible to determine not only their technical condition but also the causes of any changes that have occurred. The purpose of this work is to present a method for discovering missing information based on other available parameters, which facilitates the subsequent analysis of machine performance. The primary data used in this paper are the currents flowing through the windings of four header motors. In the method, the original reconstruction of the data layout was performed using an R language function, and the operating states of the header were then analyzed based on these data. Based on the rules determined and applied in the analysis, the percentage structure of machine operation states was obtained, which allows for additional reporting and verification of parts of the process.
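A rule-based mapping from the four motor currents to operating states, and the percentage structure derived from it, might look like the sketch below. The thresholds, state names, and rules are invented for illustration only; the paper derives its rules from real measurement data using R, whereas this sketch is in Python.

```python
# Hypothetical current thresholds in amperes -- NOT the paper's values.
OFF_TOTAL_MAX = 5.0       # below this total, the machine is off
CUTTING_MOTOR_MIN = 40.0  # any motor above this suggests cutting

def classify(currents):
    """Map one reading of the four motor currents to an operating state."""
    if sum(currents) < OFF_TOTAL_MAX:
        return "off"
    if max(currents) < CUTTING_MOTOR_MIN:
        return "idle"
    return "cutting"

def state_structure(readings):
    """Percentage share of each operating state over a recorded run."""
    states = [classify(r) for r in readings]
    return {s: 100 * states.count(s) / len(states) for s in set(states)}

shares = state_structure([
    (0, 0, 0, 0), (10, 5, 3, 2), (50, 45, 30, 20), (60, 50, 40, 30),
])
```

The resulting percentage structure is exactly the kind of summary that supports the additional reporting and process verification described above.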


2021 ◽  
Vol 14 (11) ◽  
pp. 2216-2229
Author(s):  
Subhadeep Sarkar ◽  
Dimitris Staratzis ◽  
Zichen Zhu ◽  
Manos Athanassoulis

Log-structured merge (LSM) trees offer efficient ingestion by appending incoming data and, thus, are widely used as the storage layer of production NoSQL data stores. To enable competitive read performance, LSM-trees periodically re-organize data to form a tree with levels of exponentially increasing capacity, through iterative compactions. Compactions fundamentally influence the performance of an LSM-engine in terms of write amplification, write throughput, point and range lookup performance, space amplification, and delete performance. Hence, choosing the appropriate compaction strategy is crucial and, at the same time, hard, as the LSM-compaction design space is vast, largely unexplored, and has not been formally defined in the literature. As a result, most LSM-based engines use a fixed compaction strategy that decides how and when to compact data, typically hand-picked by an engineer. In this paper, we present the design space of LSM-compactions and evaluate state-of-the-art compaction strategies with respect to key performance metrics. Toward this goal, our first contribution is to introduce a set of four design primitives that can formally define any compaction strategy: (i) the compaction trigger, (ii) the data layout, (iii) the compaction granularity, and (iv) the data movement policy. Together, these primitives can synthesize both existing and completely new compaction strategies. Our second contribution is to experimentally analyze 10 compaction strategies. We present 12 observations and 7 high-level takeaway messages, which show how LSM systems can navigate the compaction design space.
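The four primitives can be captured in a small record type, which makes concrete how existing strategies become points in one design space. The field values below are illustrative labels, not the paper's exact taxonomy.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CompactionStrategy:
    trigger: str        # when to compact, e.g. "level saturation"
    data_layout: str    # e.g. "leveling", "tiering", "hybrid"
    granularity: str    # how much data moves at once, e.g. "level", "file"
    data_movement: str  # e.g. "merge and rewrite", "move file pointer"

# Two classic strategies expressed as points in the space; they differ
# only in the data-layout primitive (labels are illustrative):
leveling = CompactionStrategy(
    "level saturation", "leveling", "level", "merge and rewrite")
tiering = CompactionStrategy(
    "level saturation", "tiering", "level", "merge and rewrite")
```

Enumerating combinations of primitive values then synthesizes strategies no existing engine implements, which is what makes the design space formally navigable.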


Author(s):  
A. A. Prokurovsky ◽  
R. A. Gematudinov

The development of self-driving cars in the modern world is accelerating every year, and it is obvious that the safe movement of such cars is impossible without special fault-tolerant software tools. As technology improves, such software tools increasingly include elements of artificial intelligence. Currently, on the basis of the flexible mechanisms of the 1C:Enterprise 8.3 platform, an application solution is being developed that makes it possible to analyze the behavior of a self-driving car on a public road. Such software can be used in the educational activities of students in areas related to mathematical statistics, as well as to study mathematical methods that optimize the operation of the on-board computer of a moving self-driving car. Considering the growth of educational programs that include the study of applied solutions and the development of such solutions on the 1C:Enterprise 8.3 platform, the software in question is accessible to students in the educational process and is both useful and interesting for them. The presence of a large number of reports using the data layout system will make it possible to analyze the movement of self-driving cars in precisely those sections that a student or researcher needs for research activities.


2021 ◽  
pp. 102183
Author(s):  
Cheng Ji ◽  
Fan Wu ◽  
Zongwei Zhu ◽  
Li-pin Chang ◽  
Huanghe Liu ◽  
...  
